linux-mm.kvack.org archive mirror
* [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator
@ 2017-07-16  2:23 Dennis Zhou
  2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
                   ` (10 more replies)
  0 siblings, 11 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou, Dennis Zhou

Hi everyone,

The Linux kernel percpu memory allocator is responsible for managing
percpu memory. It allocates memory from chunks of percpu areas and uses a
simple first-fit area allocator to manage allocations inside each chunk.
There now exist use cases that allocate and deallocate a million or more
objects, making the current implementation inadequate.

The two primary problems with the current area map allocator are:
  1. The backing data structure is an array of the areas. Managing this
     array can require memmoving a large portion of it.
  2. On allocation, chunks are considered based on the contig_hint. The
     contig_hint may be large enough while the alignment cannot satisfy
     the request. This causes scanning over every free fragment, which
     can spill over into scanning other chunks.

The primary considerations for the new allocator were the following:
 - Remove the memmove operation from the critical path
 - Be conservative with additional use of memory
 - Provide consistency in performance and memory footprint
 - Focus on small allocations < 64 bytes

This patchset introduces a simple bitmap allocator backed by metadata
blocks as a replacement for the area map allocator for percpu memory. Each
chunk has an allocation bitmap, a boundary bitmap, and a set of metadata
blocks. The allocation map serves as the ground truth for allocations
while the boundary map serves as a way to distinguish between consecutive
allocations. The minimum allocation size has been increased to 4 bytes.
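
As a rough sketch of how the two bitmaps work together (an illustration
only, not the code in patch 0009; bitmap_set() and set_bit() are the
generic helpers from <linux/bitmap.h> and <linux/bitops.h>), recording an
allocation of @bits minimum-size units at @bit_off could look like:

#include <linux/bitmap.h>

/*
 * Sketch only: mark @bits minimum-allocation units starting at @bit_off.
 * The allocation map records which units are in use; the boundary map
 * marks where each allocation begins so that consecutive allocations
 * can be told apart when freeing.
 */
static void sketch_mark_alloc(unsigned long *alloc_map,
			      unsigned long *bound_map,
			      unsigned int bit_off, unsigned int bits)
{
	bitmap_set(alloc_map, bit_off, bits);
	set_bit(bit_off, bound_map);		/* this allocation starts here */
	set_bit(bit_off + bits, bound_map);	/* the next area starts here */
}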

The key property behind the bitmap allocator is its static metadata. The
main problem it solves is that a memmove is no longer part of the critical
path for freeing, which was the primary source of latency. This also helps
bound the metadata overhead. The area map allocator previously required
an integer per allocation. That approach may be cheaper for larger
allocations, but, as mentioned, allocating a significant number of small
objects is becoming more common, which produces worst-case metadata
overhead.
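
As a back-of-envelope illustration of that worst case (my own user-space
arithmetic, not part of the patches; it assumes roughly one int of area
map space per live allocation and one allocation bit plus one boundary
bit per 4-byte unit):

#include <stdio.h>

int main(void)
{
	unsigned long nr_allocs = 1000 * 1000;	/* 1M live 4-byte objects */

	/* area map: roughly one int of map space per live allocation */
	unsigned long area_map_bytes = nr_allocs * sizeof(int);

	/* bitmaps: 1 alloc bit + 1 boundary bit per 4-byte unit, fixed by
	 * chunk size rather than by the number of live allocations */
	unsigned long bitmap_bytes = 2 * nr_allocs / 8;

	printf("area map ~%lu KB, bitmaps ~%lu KB\n",
	       area_map_bytes >> 10, bitmap_bytes >> 10);	/* ~3906 vs ~244 */
	return 0;
}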

There is one caveat with this implementation. In an effort to make freeing
fast, the only time metadata is updated on the free path is when a whole
block becomes free or the freed area spans metadata blocks. This means the
chunk's contig_hint can be up to a block smaller than what it could
actually serve. If the chunk's contig_hint is smaller than a block, a
check occurs and the hint is kept accurate. Metadata is always kept
accurate on allocation, so a chunk will never advertise a larger
contig_hint than it has available.
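
To make the caveat concrete, here is a hedged sketch of the free-path
decision with a made-up block size (not the patch's internals): metadata
is refreshed only when the free spans blocks or empties the block it
lives in.

#include <linux/bitmap.h>

#define SKETCH_BLOCK_BITS	1024	/* illustrative metadata block size */

/*
 * Sketch only: decide whether a free of @bits at @bit_off (already
 * cleared in @alloc_map by the caller) needs its block metadata refreshed.
 */
static bool sketch_refresh_on_free(unsigned long *alloc_map,
				   unsigned int bit_off, unsigned int bits)
{
	unsigned int first_blk = bit_off / SKETCH_BLOCK_BITS;
	unsigned int last_blk = (bit_off + bits - 1) / SKETCH_BLOCK_BITS;

	if (first_blk != last_blk)
		return true;	/* the freed area spans metadata blocks */

	/* refresh if the whole block the free lives in is now empty */
	return bitmap_empty(alloc_map + first_blk * SKETCH_BLOCK_BITS / BITS_PER_LONG,
			    SKETCH_BLOCK_BITS);
}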

I have primarily done testing against a simple workload of allocation of
1 million objects of varying size. Deallocation was done in order,
alternating, and in reverse order. These numbers were collected after
rebasing on top of a80099a152. I present the worst-case numbers here:

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        335      |     4960
             16B    |        485      |     1150
             64B    |        445      |      280
            128B    |        505      |      177
           1024B    |       3385      |      140

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        725      |       70
             16B    |        760      |       70
             64B    |        855      |       80
            128B    |        910      |       90
           1024B    |       3770      |      260

This data demonstrates the inability of the area map allocator to
handle less-than-ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identically regardless of the freeing
pattern.

While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The second problem of
additional scanning can result in the area map allocator completing in
52 minutes. The same workload takes only 14 seconds to complete for the
bitmap allocator. This was produced under a more contrived scenario of
allocating 1 million 4-byte objects with 8-byte alignment.
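
A simplified sketch of this kind of workload (illustrative only, not the
actual test module; error paths trimmed):

#include <linux/errno.h>
#include <linux/percpu.h>
#include <linux/vmalloc.h>

#define NR_OBJS	(1000 * 1000)

/* Sketch: allocate NR_OBJS small percpu objects, then free in reverse. */
static void __percpu **objs;

static int sketch_alloc_all(size_t size, size_t align)
{
	int i;

	objs = vmalloc(NR_OBJS * sizeof(*objs));
	if (!objs)
		return -ENOMEM;

	for (i = 0; i < NR_OBJS; i++) {
		objs[i] = __alloc_percpu(size, align);	/* e.g. 4, 8 */
		if (!objs[i])
			return -ENOMEM;
	}
	return 0;
}

static void sketch_free_reverse(void)
{
	int i;

	for (i = NR_OBJS - 1; i >= 0; i--)
		free_percpu(objs[i]);
	vfree(objs);
}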

Alternative implementations were evaluated including: linked lists, trees,
and buddy systems. These all suffer from either high metadata overhead for
small allocations or from the amplified costs of fragmentation with percpu
memory.

This patchset contains the following ten patches:

  0001-percpu-pcpu-stats-change-void-buffer-to-int-buffer.patch
  0002-percpu-change-the-format-for-percpu_stats-output.patch
  0003-percpu-expose-pcpu_nr_empty_pop_pages-in-pcpu_stats.patch
  0004-percpu-update-the-header-comment-and-pcpu_build_allo.patch
  0005-percpu-change-reserved_size-to-end-page-aligned.patch
  0006-percpu-modify-base_addr-to-be-region-specific.patch
  0007-percpu-fix-misnomer-in-schunk-dchunk-variable-names.patch
  0008-percpu-change-the-number-of-pages-marked-in-the-firs.patch
  0009-percpu-replace-area-map-allocator-with-bitmap-alloca.patch
  0010-percpu-add-optimizations-on-allocation-path-for-the-.patch

0001-0002 are minor fixes to percpu_stats. 0003 exposes a new field via
percpu_stats. 0004 updates comments in the percpu allocator. 0005-0006 are
preparatory patches that modify the first_chunk's base_addr management and
the reserved region. 0007 does some variable renaming for clarity. 0008
modifies the population map and the variables surrounding population. 0009
is the implementation of the bitmap allocator backed by metadata blocks.
0010 adds two optimizations on top of the allocator.

This patchset is on top of linus#master a80099a152.

diffstats below:

Dennis Zhou (Facebook) (10):
  percpu: pcpu-stats change void buffer to int buffer
  percpu: change the format for percpu_stats output
  percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats
  percpu: update the header comment and pcpu_build_alloc_info comments
  percpu: change reserved_size to end page aligned
  percpu: modify base_addr to be region specific
  percpu: fix misnomer in schunk/dchunk variable names
  percpu: change the number of pages marked in the first_chunk bitmaps
  percpu: replace area map allocator with bitmap allocator
  percpu: add optimizations on allocation path for the bitmap allocator

 arch/ia64/mm/contig.c    |    3 +-
 arch/ia64/mm/discontig.c |    3 +-
 include/linux/percpu.h   |   43 +-
 init/main.c              |    1 -
 mm/percpu-internal.h     |   84 ++-
 mm/percpu-stats.c        |  111 ++--
 mm/percpu.c              | 1461 +++++++++++++++++++++++++++++-----------------
 7 files changed, 1107 insertions(+), 599 deletions(-)

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 14:44   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

Changes the use of a void buffer to an int buffer for clarity.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu-stats.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 03524a5..0d81044 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -49,7 +49,7 @@ static int find_max_map_used(void)
  * the beginning of the chunk to the last allocation.
  */
 static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
-			    void *buffer)
+			    int *buffer)
 {
 	int i, s_index, last_alloc, alloc_sign, as_len;
 	int *alloc_sizes, *p;
@@ -113,7 +113,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 {
 	struct pcpu_chunk *chunk;
 	int slot, max_map_used;
-	void *buffer;
+	int *buffer;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 02/10] percpu: change the format for percpu_stats output
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
  2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 14:46   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

This makes the debugfs output for percpu_stats a little easier
to read by changing the spacing of the output to be consistent.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu-stats.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 0d81044..fa0f5de 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -18,7 +18,7 @@
 #include "percpu-internal.h"
 
 #define P(X, Y) \
-	seq_printf(m, "  %-24s: %8lld\n", X, (long long int)Y)
+	seq_printf(m, "  %-20s: %12lld\n", X, (long long int)Y)
 
 struct percpu_stats pcpu_stats;
 struct pcpu_alloc_info pcpu_stats_ai;
@@ -134,7 +134,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	}
 
 #define PL(X) \
-	seq_printf(m, "  %-24s: %8lld\n", #X, (long long int)pcpu_stats_ai.X)
+	seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
 
 	seq_printf(m,
 			"Percpu Memory Statistics\n"
@@ -151,7 +151,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 #undef PL
 
 #define PU(X) \
-	seq_printf(m, "  %-18s: %14llu\n", #X, (unsigned long long)pcpu_stats.X)
+	seq_printf(m, "  %-20s: %12llu\n", #X, (unsigned long long)pcpu_stats.X)
 
 	seq_printf(m,
 			"Global Stats:\n"
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
  2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
  2017-07-16  2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 14:47   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

Percpu memory holds a minimum threshold of pages that are populated
in order to serve atomic percpu memory requests. This change makes it
easier to verify that there are a minimum number of populated pages
lying around.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu-internal.h | 1 +
 mm/percpu-stats.c    | 1 +
 mm/percpu.c          | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index cd2442e..c9158a4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -36,6 +36,7 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_slot;
 extern int pcpu_nr_slots;
+extern int pcpu_nr_empty_pop_pages;
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index fa0f5de..44e561d 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -164,6 +164,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	PU(nr_max_chunks);
 	PU(min_alloc_size);
 	PU(max_alloc_size);
+	P("empty_pop_pages", pcpu_nr_empty_pop_pages);
 	seq_putc(m, '\n');
 
 #undef PU
diff --git a/mm/percpu.c b/mm/percpu.c
index bd4130a..9ec5fd4 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -160,7 +160,7 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  * The number of empty populated pages, protected by pcpu_lock.  The
  * reserved chunk doesn't contribute to the count.
  */
-static int pcpu_nr_empty_pop_pages;
+int pcpu_nr_empty_pop_pages;
 
 /*
  * Balance work is used to populate or destroy chunks asynchronously.  We
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (2 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 14:53   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

The header comment for percpu memory is a little hard to parse and is
not super clear about how the first chunk is managed. This adds a
little more clarity to the situation.

There is also quite a bit of tricky logic in pcpu_build_alloc_info().
This restructures a comment to add a little more information.
Unfortunately, you will still have to piece together a handful of other
comments too, but this should help direct you to the meaningful ones.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu.c | 58 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 9ec5fd4..5bb90d8 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -4,36 +4,35 @@
  * Copyright (C) 2009		SUSE Linux Products GmbH
  * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
  *
- * This file is released under the GPLv2.
+ * This file is released under the GPLv2 license.
  *
- * This is percpu allocator which can handle both static and dynamic
- * areas.  Percpu areas are allocated in chunks.  Each chunk is
- * consisted of boot-time determined number of units and the first
- * chunk is used for static percpu variables in the kernel image
- * (special boot time alloc/init handling necessary as these areas
- * need to be brought up before allocation services are running).
- * Unit grows as necessary and all units grow or shrink in unison.
- * When a chunk is filled up, another chunk is allocated.
+ * The percpu allocator handles both static and dynamic areas.  Percpu
+ * areas are allocated in chunks which are divided into units.  There is
+ * a 1-to-1 mapping for units to possible cpus.  These units are grouped
+ * based on NUMA properties of the machine.
  *
  *  c0                           c1                         c2
  *  -------------------          -------------------        ------------
  * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u
  *  -------------------  ......  -------------------  ....  ------------
+
+ * Allocation is done by offsets into a unit's address space.  Ie., an
+ * area of 512 bytes at 6k in c1 occupies 512 bytes at 6k in c1:u0,
+ * c1:u1, c1:u2, etc.  On NUMA machines, the mapping may be non-linear
+ * and even sparse.  Access is handled by configuring percpu base
+ * registers according to the cpu to unit mappings and offsetting the
+ * base address using pcpu_unit_size.
+ *
+ * There is special consideration for the first chunk which must handle
+ * the static percpu variables in the kernel image as allocation services
+ * are not online yet.  In short, the first chunk is structured like so:
  *
- * Allocation is done in offset-size areas of single unit space.  Ie,
- * an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0,
- * c1:u1, c1:u2 and c1:u3.  On UMA, units corresponds directly to
- * cpus.  On NUMA, the mapping can be non-linear and even sparse.
- * Percpu access can be done by configuring percpu base registers
- * according to cpu to unit mapping and pcpu_unit_size.
- *
- * There are usually many small percpu allocations many of them being
- * as small as 4 bytes.  The allocator organizes chunks into lists
- * according to free size and tries to allocate from the fullest one.
- * Each chunk keeps the maximum contiguous area size hint which is
- * guaranteed to be equal to or larger than the maximum contiguous
- * area in the chunk.  This helps the allocator not to iterate the
- * chunk maps unnecessarily.
+ *                  <Static | [Reserved] | Dynamic>
+ *
+ * The static data is copied from the original section managed by the
+ * linker.  The reserved section, if non-zero, primarily manages static
+ * percpu variables from kernel modules.  Finally, the dynamic section
+ * takes care of normal allocations.
  *
  * Allocation state in each chunk is kept using an array of integers
  * on chunk->map.  A positive value in the map represents a free
@@ -43,6 +42,12 @@
  * Chunks can be determined from the address using the index field
  * in the page struct. The index field contains a pointer to the chunk.
  *
+ * These chunks are organized into lists according to free_size and
+ * tries to allocate from the fullest chunk first. Each chunk maintains
+ * a maximum contiguous area size hint which is guaranteed to be equal
+ * to or larger than the maximum contiguous area in the chunk. This
+ * helps prevent the allocator from iterating over chunks unnecessarily.
+ *
  * To use this allocator, arch code should do the following:
  *
  * - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
@@ -1842,6 +1847,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 	 */
 	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
 
+	/* determine the maximum # of units that can fit in an allocation */
 	alloc_size = roundup(min_unit_size, atom_size);
 	upa = alloc_size / min_unit_size;
 	while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
@@ -1868,9 +1874,9 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 	}
 
 	/*
-	 * Expand unit size until address space usage goes over 75%
-	 * and then as much as possible without using more address
-	 * space.
+	 * Wasted space is caused by a ratio imbalance of upa to group_cnt.
+	 * Expand the unit_size until we use >= 75% of the units allocated.
+	 * Related to atom_size, which could be much larger than the unit_size.
 	 */
 	last_allocs = INT_MAX;
 	for (upa = max_upa; upa; upa--) {
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 05/10] percpu: change reserved_size to end page aligned
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (3 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-16  4:01   ` kbuild test robot
                     ` (2 more replies)
  2017-07-16  2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou
                   ` (5 subsequent siblings)
  10 siblings, 3 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

Preparatory patch to make the first chunk's static_size + reserved_size
end page aligned. The first chunk has a unique allocation scheme
overlaying the static, reserved, and dynamic regions. The other regions
of each chunk are reserved or hidden. The bitmap allocator would
otherwise have to allocate the static region in its bitmap to replicate
this. By making the reserved region end page aligned, that metadata
overhead can be saved. The consequence is that up to an additional page
of memory will be allocated to the reserved region, which primarily
serves static percpu variables.
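
A quick user-space check of the alignment math with illustrative numbers
(PFN_ALIGN is open-coded here for 4K pages; the sizes are made up):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PFN_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

static unsigned long align_reserved(unsigned long static_size,
				    unsigned long reserved_size)
{
	if (!reserved_size)
		return 0;
	return PFN_ALIGN(static_size + reserved_size) - static_size;
}

int main(void)
{
	/* e.g. 6720 bytes of static percpu data, 8K module reserve */
	printf("%lu\n", align_reserved(6720, 8192));	/* prints 9664 */
	return 0;
}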

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 arch/ia64/mm/contig.c    |  3 ++-
 arch/ia64/mm/discontig.c |  3 ++-
 include/linux/percpu.h   | 29 +++++++++++++++++++++++++++++
 mm/percpu.c              |  6 ++++++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 52715a7..20ee2b2 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -164,7 +164,8 @@ setup_per_cpu_areas(void)
 
 	/* set parameters */
 	static_size = __per_cpu_end - __per_cpu_start;
-	reserved_size = PERCPU_MODULE_RESERVE;
+	reserved_size = pcpu_align_reserved_region(static_size,
+						   PERCPU_MODULE_RESERVE);
 	dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
 	if (dyn_size < 0)
 		panic("percpu area overflow static=%zd reserved=%zd\n",
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 8786268..f898b24 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -214,7 +214,8 @@ void __init setup_per_cpu_areas(void)
 
 	/* set basic parameters */
 	static_size = __per_cpu_end - __per_cpu_start;
-	reserved_size = PERCPU_MODULE_RESERVE;
+	reserved_size = pcpu_align_reserved_region(static_size,
+						   PERCPU_MODULE_RESERVE);
 	dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
 	if (dyn_size < 0)
 		panic("percpu area overflow static=%zd reserved=%zd\n",
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 491b3f5..98a371c 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -130,4 +130,33 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),		\
 						__alignof__(type))
 
+/*
+ * pcpu_align_reserved_region - page align the end of the reserved region
+ * @static_size: the static region size
+ * @reserved_size: the minimum reserved region size
+ *
+ * This function calculates the size of the reserved region required to
+ * make the reserved region end page aligned.
+ *
+ * Percpu memory offers a maximum alignment of PAGE_SIZE.  Aligning this
+ * minimizes the metadata overhead of overlapping the static, reserved,
+ * and dynamic regions by allowing the metadata for the static region to
+ * not be allocated.  This lets the base_addr be moved up to a page
+ * aligned address and disregard the static region as offsets are allocated.
+ * The beginning of the reserved region will overlap with the static
+ * region if the end of the static region is not page aligned.
+ *
+ * RETURNS:
+ * Size of reserved region required to make static_size + reserved_size
+ * page aligned.
+ */
+static inline ssize_t pcpu_align_reserved_region(ssize_t static_size,
+						 ssize_t reserved_size)
+{
+	if (!reserved_size)
+		return 0;
+
+	return PFN_ALIGN(static_size + reserved_size) - static_size;
+}
+
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index 5bb90d8..7704db9 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1597,6 +1597,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
 	PCPU_SETUP_BUG_ON(offset_in_page(ai->unit_size));
 	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
+	PCPU_SETUP_BUG_ON(ai->reserved_size &&
+			  !PAGE_ALIGNED(ai->static_size + ai->reserved_size));
 	PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
 	PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);
 
@@ -1800,6 +1802,9 @@ early_param("percpu_alloc", percpu_alloc_setup);
  * @atom_size: allocation atom size
  * @cpu_distance_fn: callback to determine distance between cpus, optional
  *
+ * If there is a @reserved_size, it is expanded to ensure the end of the
+ * reserved region is page aligned.
+ *
  * This function determines grouping of units, their mappings to cpus
  * and other parameters considering needed percpu size, allocation
  * atom size and distances between CPUs.
@@ -1835,6 +1840,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 	memset(group_cnt, 0, sizeof(group_cnt));
 
 	/* calculate size_sum and ensure dyn_size is enough for early alloc */
+	reserved_size = pcpu_align_reserved_region(static_size, reserved_size);
 	size_sum = PFN_ALIGN(static_size + reserved_size +
 			    max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
 	dyn_size = size_sum - static_size - reserved_size;
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 06/10] percpu: modify base_addr to be region specific
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (4 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 18:57   ` Tejun Heo
  2017-07-18 19:26   ` Josef Bacik
  2017-07-16  2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

Originally, the first chunk is served by up to three chunks, each given
a region it is responsible for. Despite this, the arithmetic was based
off of the base_addr, requiring offsets or being overly inclusive.
This patch changes the percpu first-chunk checks to consider only the
dynamic region and the reserved check to cover only the reserved
region. There is no impact here besides making these checks a little
more accurate.

This patch also adds the groundwork for increasing the minimum
allocation size to 4 bytes. The new field nr_pages in pcpu_chunk will be
used to keep track of the number of pages the bitmap serves. The
arithmetic for identifying the first chunk and reserved chunk reflects
this change.
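
To illustrate the new arithmetic with made-up numbers (4K pages), the
region adjacent to the static area gets a page-aligned base_addr plus an
offset recording the overlap, mirroring the pcpu_setup_first_chunk()
change in the diff below:

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long base_addr   = 0x100000000UL;	/* hypothetical */
	unsigned long static_size = 6720;		/* not page aligned */

	unsigned long tmp_addr    = base_addr + static_size;
	unsigned long aligned     = tmp_addr & PAGE_MASK;
	unsigned long rsvd_offset = tmp_addr - aligned;

	/* the adjacent region's base_addr is pushed back into the static
	 * region by rsvd_offset so that it starts page aligned */
	printf("base %#lx offset %lu\n", aligned, rsvd_offset);
	/* base 0x100001000 offset 2624 */
	return 0;
}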

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 include/linux/percpu.h |   4 ++
 mm/percpu-internal.h   |  12 +++--
 mm/percpu.c            | 127 ++++++++++++++++++++++++++++++++++---------------
 3 files changed, 100 insertions(+), 43 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 98a371c..a5cedcd 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -21,6 +21,10 @@
 /* minimum unit size, also is the maximum supported allocation size */
 #define PCPU_MIN_UNIT_SIZE		PFN_ALIGN(32 << 10)
 
+/* minimum allocation size and shift in bytes */
+#define PCPU_MIN_ALLOC_SIZE		(1 << PCPU_MIN_ALLOC_SHIFT)
+#define PCPU_MIN_ALLOC_SHIFT		2
+
 /*
  * Percpu allocator can serve percpu allocations before slab is
  * initialized which allows slab to depend on the percpu allocator.
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index c9158a4..56e1aba 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -23,11 +23,12 @@ struct pcpu_chunk {
 	void			*data;		/* chunk data */
 	int			first_free;	/* no free below this */
 	bool			immutable;	/* no [de]population allowed */
-	bool			has_reserved;	/* Indicates if chunk has reserved space
-						   at the beginning. Reserved chunk will
-						   contain reservation for static chunk.
-						   Dynamic chunk will contain reservation
-						   for static and reserved chunks. */
+	bool			has_reserved;	/* indicates if the region this chunk
+						   is responsible for overlaps with
+						   the prior adjacent region */
+
+	int                     nr_pages;       /* # of PAGE_SIZE pages served
+						   by this chunk */
 	int			nr_populated;	/* # of populated pages */
 	unsigned long		populated[];	/* populated bitmap */
 };
@@ -40,6 +41,7 @@ extern int pcpu_nr_empty_pop_pages;
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
+extern unsigned long pcpu_reserved_offset;
 
 #ifdef CONFIG_PERCPU_STATS
 
diff --git a/mm/percpu.c b/mm/percpu.c
index 7704db9..c74ad68 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -144,14 +144,14 @@ static const size_t *pcpu_group_sizes __ro_after_init;
 struct pcpu_chunk *pcpu_first_chunk __ro_after_init;
 
 /*
- * Optional reserved chunk.  This chunk reserves part of the first
- * chunk and serves it for reserved allocations.  The amount of
- * reserved offset is in pcpu_reserved_chunk_limit.  When reserved
- * area doesn't exist, the following variables contain NULL and 0
- * respectively.
+ * Optional reserved chunk.  This is the part of the first chunk that
+ * serves reserved allocations.  The pcpu_reserved_offset is the amount
+ * the pcpu_reserved_chunk->base_addr is pushed back into the static
+ * region for the base_addr to be page aligned.  When the reserved area
+ * doesn't exist, the following variables contain NULL and 0 respectively.
  */
 struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
-static int pcpu_reserved_chunk_limit __ro_after_init;
+unsigned long pcpu_reserved_offset __ro_after_init;
 
 DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
 static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
@@ -184,19 +184,32 @@ static void pcpu_schedule_balance_work(void)
 		schedule_work(&pcpu_balance_work);
 }
 
+/*
+ * Static addresses should never be passed into the allocator.  They
+ * are accessed using the group_offsets and therefore do not rely on
+ * chunk->base_addr.
+ */
 static bool pcpu_addr_in_first_chunk(void *addr)
 {
 	void *first_start = pcpu_first_chunk->base_addr;
 
-	return addr >= first_start && addr < first_start + pcpu_unit_size;
+	return addr >= first_start &&
+	       addr < first_start +
+	       pcpu_first_chunk->nr_pages * PAGE_SIZE;
 }
 
 static bool pcpu_addr_in_reserved_chunk(void *addr)
 {
-	void *first_start = pcpu_first_chunk->base_addr;
+	void *first_start;
 
-	return addr >= first_start &&
-		addr < first_start + pcpu_reserved_chunk_limit;
+	if (!pcpu_reserved_chunk)
+		return false;
+
+	first_start = pcpu_reserved_chunk->base_addr;
+
+	return addr >= first_start + pcpu_reserved_offset &&
+	       addr < first_start +
+	       pcpu_reserved_chunk->nr_pages * PAGE_SIZE;
 }
 
 static int __pcpu_size_to_slot(int size)
@@ -237,11 +250,16 @@ static int __maybe_unused pcpu_page_idx(unsigned int cpu, int page_idx)
 	return pcpu_unit_map[cpu] * pcpu_unit_pages + page_idx;
 }
 
+static unsigned long pcpu_unit_page_offset(unsigned int cpu, int page_idx)
+{
+	return pcpu_unit_offsets[cpu] + (page_idx << PAGE_SHIFT);
+}
+
 static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
 				     unsigned int cpu, int page_idx)
 {
-	return (unsigned long)chunk->base_addr + pcpu_unit_offsets[cpu] +
-		(page_idx << PAGE_SHIFT);
+	return (unsigned long)chunk->base_addr +
+		pcpu_unit_page_offset(cpu, page_idx);
 }
 
 static void __maybe_unused pcpu_next_unpop(struct pcpu_chunk *chunk,
@@ -737,6 +755,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void)
 	chunk->free_size = pcpu_unit_size;
 	chunk->contig_hint = pcpu_unit_size;
 
+	chunk->nr_pages = pcpu_unit_pages;
+
 	return chunk;
 }
 
@@ -824,18 +844,20 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
  * pcpu_chunk_addr_search - determine chunk containing specified address
  * @addr: address for which the chunk needs to be determined.
  *
+ * This is an internal function that handles all but static allocations.
+ * Static percpu address values should never be passed into the allocator.
+ *
  * RETURNS:
  * The address of the found chunk.
  */
 static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 {
 	/* is it in the first chunk? */
-	if (pcpu_addr_in_first_chunk(addr)) {
-		/* is it in the reserved area? */
-		if (pcpu_addr_in_reserved_chunk(addr))
-			return pcpu_reserved_chunk;
+	if (pcpu_addr_in_first_chunk(addr))
 		return pcpu_first_chunk;
-	}
+	/* is it in the reserved chunk? */
+	if (pcpu_addr_in_reserved_chunk(addr))
+		return pcpu_reserved_chunk;
 
 	/*
 	 * The address is relative to unit0 which might be unused and
@@ -1366,10 +1388,17 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)
 	 * The following test on unit_low/high isn't strictly
 	 * necessary but will speed up lookups of addresses which
 	 * aren't in the first chunk.
+	 *
+	 * The address check is of high granularity checking against full
+	 * chunk sizes.  pcpu_base_addr points to the beginning of the first
+	 * chunk including the static region.  This allows us to examine all
+	 * regions of the first chunk. Assumes good intent as the first
+	 * chunk may not be full (ie. < pcpu_unit_pages in size).
 	 */
-	first_low = pcpu_chunk_addr(pcpu_first_chunk, pcpu_low_unit_cpu, 0);
-	first_high = pcpu_chunk_addr(pcpu_first_chunk, pcpu_high_unit_cpu,
-				     pcpu_unit_pages);
+	first_low = (unsigned long) pcpu_base_addr +
+		    pcpu_unit_page_offset(pcpu_low_unit_cpu, 0);
+	first_high = (unsigned long) pcpu_base_addr +
+		     pcpu_unit_page_offset(pcpu_high_unit_cpu, pcpu_unit_pages);
 	if ((unsigned long)addr >= first_low &&
 	    (unsigned long)addr < first_high) {
 		for_each_possible_cpu(cpu) {
@@ -1575,6 +1604,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	unsigned int cpu;
 	int *unit_map;
 	int group, unit, i;
+	unsigned long tmp_addr, aligned_addr;
+	unsigned long map_size_bytes;
 
 #define PCPU_SETUP_BUG_ON(cond)	do {					\
 	if (unlikely(cond)) {						\
@@ -1678,46 +1709,66 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		INIT_LIST_HEAD(&pcpu_slot[i]);
 
 	/*
-	 * Initialize static chunk.  If reserved_size is zero, the
-	 * static chunk covers static area + dynamic allocation area
-	 * in the first chunk.  If reserved_size is not zero, it
-	 * covers static area + reserved area (mostly used for module
-	 * static percpu allocation).
+	 * Initialize static chunk.
+	 * The static region is dropped as those addresses are already
+	 * allocated and do not rely on chunk->base_addr.
+	 * reserved_size == 0:
+	 *      the static chunk covers the dynamic area
+	 * reserved_size > 0:
+	 *      the static chunk covers the reserved area
+	 *
+	 * If the static area is not page aligned, the region adjacent
+	 * to the static area must have its base_addr be offset into
+	 * the static area to have it be page aligned.  The overlap is
+	 * then allocated preserving the alignment in the metadata for
+	 * the actual region.
 	 */
+	tmp_addr = (unsigned long)base_addr + ai->static_size;
+	aligned_addr = tmp_addr & PAGE_MASK;
+	pcpu_reserved_offset = tmp_addr - aligned_addr;
+
+	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
+			 pcpu_reserved_offset;
+
+	/* schunk allocation */
 	schunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
 	INIT_LIST_HEAD(&schunk->list);
 	INIT_LIST_HEAD(&schunk->map_extend_list);
-	schunk->base_addr = base_addr;
+	schunk->base_addr = (void *)aligned_addr;
 	schunk->map = smap;
 	schunk->map_alloc = ARRAY_SIZE(smap);
 	schunk->immutable = true;
 	bitmap_fill(schunk->populated, pcpu_unit_pages);
 	schunk->nr_populated = pcpu_unit_pages;
 
+	schunk->nr_pages = map_size_bytes >> PAGE_SHIFT;
+
 	if (ai->reserved_size) {
 		schunk->free_size = ai->reserved_size;
 		pcpu_reserved_chunk = schunk;
-		pcpu_reserved_chunk_limit = ai->static_size + ai->reserved_size;
 	} else {
 		schunk->free_size = dyn_size;
 		dyn_size = 0;			/* dynamic area covered */
 	}
 	schunk->contig_hint = schunk->free_size;
 
-	schunk->map[0] = 1;
-	schunk->map[1] = ai->static_size;
-	schunk->map_used = 1;
+	if (pcpu_reserved_offset) {
+		schunk->has_reserved = true;
+		schunk->map[0] = 1;
+		schunk->map[1] = pcpu_reserved_offset;
+		schunk->map_used = 1;
+	}
 	if (schunk->free_size)
-		schunk->map[++schunk->map_used] = ai->static_size + schunk->free_size;
+		schunk->map[++schunk->map_used] = map_size_bytes;
 	schunk->map[schunk->map_used] |= 1;
-	schunk->has_reserved = true;
 
 	/* init dynamic chunk if necessary */
 	if (dyn_size) {
 		dchunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
 		INIT_LIST_HEAD(&dchunk->list);
 		INIT_LIST_HEAD(&dchunk->map_extend_list);
-		dchunk->base_addr = base_addr;
+		dchunk->base_addr = base_addr + ai->static_size +
+				    ai->reserved_size;
 		dchunk->map = dmap;
 		dchunk->map_alloc = ARRAY_SIZE(dmap);
 		dchunk->immutable = true;
@@ -1725,11 +1776,11 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		dchunk->nr_populated = pcpu_unit_pages;
 
 		dchunk->contig_hint = dchunk->free_size = dyn_size;
-		dchunk->map[0] = 1;
-		dchunk->map[1] = pcpu_reserved_chunk_limit;
-		dchunk->map[2] = (pcpu_reserved_chunk_limit + dchunk->free_size) | 1;
-		dchunk->map_used = 2;
-		dchunk->has_reserved = true;
+		dchunk->map[0] = 0;
+		dchunk->map[1] = dchunk->free_size | 1;
+		dchunk->map_used = 1;
+
+		dchunk->nr_pages = dyn_size >> PAGE_SHIFT;
 	}
 
 	/* link the first chunk in */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (5 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 19:10   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

With the base_addr of the chunks responsible for serving the first
chunk moved up, the use of schunk/dchunk in pcpu_setup_first_chunk no
longer makes sense. This makes the linking in the first-chunk code not
rely on a ternary and renames the variables to a single shared
variable, chunk, because the allocation path is sequential.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu.c | 96 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 48 insertions(+), 48 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index c74ad68..9dd28a2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1597,7 +1597,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
 	size_t dyn_size = ai->dyn_size;
 	size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
-	struct pcpu_chunk *schunk, *dchunk = NULL;
+	struct pcpu_chunk *chunk;
 	unsigned long *group_offsets;
 	size_t *group_sizes;
 	unsigned long *unit_off;
@@ -1709,13 +1709,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		INIT_LIST_HEAD(&pcpu_slot[i]);
 
 	/*
-	 * Initialize static chunk.
-	 * The static region is dropped as those addresses are already
-	 * allocated and do not rely on chunk->base_addr.
-	 * reserved_size == 0:
-	 *      the static chunk covers the dynamic area
-	 * reserved_size > 0:
-	 *      the static chunk covers the reserved area
+	 * Initialize first chunk.
+	 * pcpu_first_chunk will always manage the dynamic region of the
+	 * first chunk.  The static region is dropped as those addresses
+	 * are already allocated and do not rely on chunk->base_addr.
+	 *
+	 * ai->reserved == 0:
+	 *	reserved_chunk == NULL;
 	 *
 	 * If the static area is not page aligned, the region adjacent
 	 * to the static area must have its base_addr be offset into
@@ -1730,61 +1730,61 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
 			 pcpu_reserved_offset;
 
-	/* schunk allocation */
-	schunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
-	INIT_LIST_HEAD(&schunk->list);
-	INIT_LIST_HEAD(&schunk->map_extend_list);
-	schunk->base_addr = (void *)aligned_addr;
-	schunk->map = smap;
-	schunk->map_alloc = ARRAY_SIZE(smap);
-	schunk->immutable = true;
-	bitmap_fill(schunk->populated, pcpu_unit_pages);
-	schunk->nr_populated = pcpu_unit_pages;
+	/* chunk adjacent to static region allocation */
+	chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
+	INIT_LIST_HEAD(&chunk->list);
+	INIT_LIST_HEAD(&chunk->map_extend_list);
+	chunk->base_addr = (void *)aligned_addr;
+	chunk->map = smap;
+	chunk->map_alloc = ARRAY_SIZE(smap);
+	chunk->immutable = true;
+	bitmap_fill(chunk->populated, pcpu_unit_pages);
+	chunk->nr_populated = pcpu_unit_pages;
 
-	schunk->nr_pages = map_size_bytes >> PAGE_SHIFT;
+	chunk->nr_pages = map_size_bytes >> PAGE_SHIFT;
 
 	if (ai->reserved_size) {
-		schunk->free_size = ai->reserved_size;
-		pcpu_reserved_chunk = schunk;
+		chunk->free_size = ai->reserved_size;
+		pcpu_reserved_chunk = chunk;
 	} else {
-		schunk->free_size = dyn_size;
+		chunk->free_size = dyn_size;
 		dyn_size = 0;			/* dynamic area covered */
 	}
-	schunk->contig_hint = schunk->free_size;
+	chunk->contig_hint = chunk->free_size;
 
 	if (pcpu_reserved_offset) {
-		schunk->has_reserved = true;
-		schunk->map[0] = 1;
-		schunk->map[1] = pcpu_reserved_offset;
-		schunk->map_used = 1;
+		chunk->has_reserved = true;
+		chunk->map[0] = 1;
+		chunk->map[1] = pcpu_reserved_offset;
+		chunk->map_used = 1;
 	}
-	if (schunk->free_size)
-		schunk->map[++schunk->map_used] = map_size_bytes;
-	schunk->map[schunk->map_used] |= 1;
+	if (chunk->free_size)
+		chunk->map[++chunk->map_used] = map_size_bytes;
+	chunk->map[chunk->map_used] |= 1;
 
-	/* init dynamic chunk if necessary */
+	/* init dynamic region of first chunk if necessary */
 	if (dyn_size) {
-		dchunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
-		INIT_LIST_HEAD(&dchunk->list);
-		INIT_LIST_HEAD(&dchunk->map_extend_list);
-		dchunk->base_addr = base_addr + ai->static_size +
+		chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
+		INIT_LIST_HEAD(&chunk->list);
+		INIT_LIST_HEAD(&chunk->map_extend_list);
+		chunk->base_addr = base_addr + ai->static_size +
 				    ai->reserved_size;
-		dchunk->map = dmap;
-		dchunk->map_alloc = ARRAY_SIZE(dmap);
-		dchunk->immutable = true;
-		bitmap_fill(dchunk->populated, pcpu_unit_pages);
-		dchunk->nr_populated = pcpu_unit_pages;
-
-		dchunk->contig_hint = dchunk->free_size = dyn_size;
-		dchunk->map[0] = 0;
-		dchunk->map[1] = dchunk->free_size | 1;
-		dchunk->map_used = 1;
-
-		dchunk->nr_pages = dyn_size >> PAGE_SHIFT;
+		chunk->map = dmap;
+		chunk->map_alloc = ARRAY_SIZE(dmap);
+		chunk->immutable = true;
+		bitmap_fill(chunk->populated, pcpu_unit_pages);
+		chunk->nr_populated = pcpu_unit_pages;
+
+		chunk->contig_hint = chunk->free_size = dyn_size;
+		chunk->map[0] = 0;
+		chunk->map[1] = chunk->free_size | 1;
+		chunk->map_used = 1;
+
+		chunk->nr_pages = dyn_size >> PAGE_SHIFT;
 	}
 
 	/* link the first chunk in */
-	pcpu_first_chunk = dchunk ?: schunk;
+	pcpu_first_chunk = chunk;
 	pcpu_nr_empty_pop_pages +=
 		pcpu_count_occupied_pages(pcpu_first_chunk, 1);
 	pcpu_chunk_relocate(pcpu_first_chunk, -1);
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (6 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 19:26   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

This patch changes the allocator to only mark allocated pages for the
region the population bitmap is used for. Previously, the bitmap was
marked completely used as the first chunk was allocated and immutable.
This is misleading because the first chunk may not be completely filled.
Additionally, with the base_addr moved up in the previous patch, the
population map no longer corresponds to what is being checked.

pcpu_nr_empty_pop_pages is used to ensure there are a handful of free
pages around to serve atomic allocations. A new field, nr_empty_pop_pages,
is added to the pcpu_chunk struct to keep track of the number of empty
pages. This field is needed as the number of empty populated pages is
globally kept track of and deltas are used to update it. This new field
is exposed in percpu_stats.

Now that chunk->nr_pages is the number of pages the chunk is serving, it
is nice to use this in the work function for population and freeing of
chunks rather than use the global variable pcpu_unit_pages.
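
A minimal sketch of the delta-based accounting described above (the
struct and global below are stand-ins, not the real pcpu_chunk and
pcpu_nr_empty_pop_pages):

struct sketch_chunk {
	int nr_empty_pop_pages;		/* per-chunk empty populated pages */
};

static int sketch_nr_empty_pop_pages;	/* global count, pcpu_lock held */

/* apply the same delta to the chunk and to the global count */
static void sketch_update_empty_pop(struct sketch_chunk *chunk, int nr)
{
	chunk->nr_empty_pop_pages += nr;
	sketch_nr_empty_pop_pages += nr;
}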

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu-internal.h |  1 +
 mm/percpu-stats.c    |  1 +
 mm/percpu.c          | 34 +++++++++++++++++++++-------------
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 56e1aba..f0776f6 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -30,6 +30,7 @@ struct pcpu_chunk {
 	int                     nr_pages;       /* # of PAGE_SIZE pages served
 						   by this chunk */
 	int			nr_populated;	/* # of populated pages */
+	int			nr_empty_pop_pages; /* # of empty populated pages */
 	unsigned long		populated[];	/* populated bitmap */
 };
 
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 44e561d..6fc04b1 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -99,6 +99,7 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 
 	P("nr_alloc", chunk->nr_alloc);
 	P("max_alloc_size", chunk->max_alloc_size);
+	P("empty_pop_pages", chunk->nr_empty_pop_pages);
 	P("free_size", chunk->free_size);
 	P("contig_hint", chunk->contig_hint);
 	P("sum_frag", sum_frag);
diff --git a/mm/percpu.c b/mm/percpu.c
index 9dd28a2..fb01841 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1164,7 +1164,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 	list_for_each_entry_safe(chunk, next, &to_free, list) {
 		int rs, re;
 
-		pcpu_for_each_pop_region(chunk, rs, re, 0, pcpu_unit_pages) {
+		pcpu_for_each_pop_region(chunk, rs, re, 0, chunk->nr_pages) {
 			pcpu_depopulate_chunk(chunk, rs, re);
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_depopulated(chunk, rs, re);
@@ -1221,7 +1221,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 
 		spin_lock_irq(&pcpu_lock);
 		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
-			nr_unpop = pcpu_unit_pages - chunk->nr_populated;
+			nr_unpop = chunk->nr_pages - chunk->nr_populated;
 			if (nr_unpop)
 				break;
 		}
@@ -1231,7 +1231,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
 			continue;
 
 		/* @chunk can't go away while pcpu_alloc_mutex is held */
-		pcpu_for_each_unpop_region(chunk, rs, re, 0, pcpu_unit_pages) {
+		pcpu_for_each_unpop_region(chunk, rs, re, 0, chunk->nr_pages) {
 			int nr = min(re - rs, nr_to_pop);
 
 			ret = pcpu_populate_chunk(chunk, rs, rs + nr);
@@ -1604,6 +1604,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	unsigned int cpu;
 	int *unit_map;
 	int group, unit, i;
+	int chunk_pages;
 	unsigned long tmp_addr, aligned_addr;
 	unsigned long map_size_bytes;
 
@@ -1729,19 +1730,21 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 
 	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
 			 pcpu_reserved_offset;
+	chunk_pages = map_size_bytes >> PAGE_SHIFT;
 
 	/* chunk adjacent to static region allocation */
-	chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
+	chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) +
+				     BITS_TO_LONGS(chunk_pages), 0);
 	INIT_LIST_HEAD(&chunk->list);
 	INIT_LIST_HEAD(&chunk->map_extend_list);
 	chunk->base_addr = (void *)aligned_addr;
 	chunk->map = smap;
 	chunk->map_alloc = ARRAY_SIZE(smap);
 	chunk->immutable = true;
-	bitmap_fill(chunk->populated, pcpu_unit_pages);
-	chunk->nr_populated = pcpu_unit_pages;
+	bitmap_fill(chunk->populated, chunk_pages);
+	chunk->nr_populated = chunk->nr_empty_pop_pages = chunk_pages;
 
-	chunk->nr_pages = map_size_bytes >> PAGE_SHIFT;
+	chunk->nr_pages = chunk_pages;
 
 	if (ai->reserved_size) {
 		chunk->free_size = ai->reserved_size;
@@ -1754,6 +1757,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 
 	if (pcpu_reserved_offset) {
 		chunk->has_reserved = true;
+		chunk->nr_empty_pop_pages--;
 		chunk->map[0] = 1;
 		chunk->map[1] = pcpu_reserved_offset;
 		chunk->map_used = 1;
@@ -1764,7 +1768,11 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 
 	/* init dynamic region of first chunk if necessary */
 	if (dyn_size) {
-		chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0);
+		chunk_pages = dyn_size >> PAGE_SHIFT;
+
+		/* chunk allocation */
+		chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) +
+					     BITS_TO_LONGS(chunk_pages), 0);
 		INIT_LIST_HEAD(&chunk->list);
 		INIT_LIST_HEAD(&chunk->map_extend_list);
 		chunk->base_addr = base_addr + ai->static_size +
@@ -1772,21 +1780,21 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 		chunk->map = dmap;
 		chunk->map_alloc = ARRAY_SIZE(dmap);
 		chunk->immutable = true;
-		bitmap_fill(chunk->populated, pcpu_unit_pages);
-		chunk->nr_populated = pcpu_unit_pages;
+		bitmap_fill(chunk->populated, chunk_pages);
+		chunk->nr_populated = chunk_pages;
+		chunk->nr_empty_pop_pages = chunk_pages;
 
 		chunk->contig_hint = chunk->free_size = dyn_size;
 		chunk->map[0] = 0;
 		chunk->map[1] = chunk->free_size | 1;
 		chunk->map_used = 1;
 
-		chunk->nr_pages = dyn_size >> PAGE_SHIFT;
+		chunk->nr_pages = chunk_pages;
 	}
 
 	/* link the first chunk in */
 	pcpu_first_chunk = chunk;
-	pcpu_nr_empty_pop_pages +=
-		pcpu_count_occupied_pages(pcpu_first_chunk, 1);
+	pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
 	pcpu_chunk_relocate(pcpu_first_chunk, -1);
 
 	pcpu_stats_chunk_alloc();
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (7 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 23:27   ` Tejun Heo
                     ` (2 more replies)
  2017-07-16  2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou
  2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik
  10 siblings, 3 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration over all chunks is
triggered if the contig_hint is the right size but the wrong alignment.

Implementation:
This patch removes the area map allocator in favor of a bitmap allocator
backed by metadata blocks. The primary goal is to provide consistency
in performance and memory footprint with a focus on small allocations
(< 64 bytes). The bitmap removes the heavy memmove from the freeing
critical path and provides a consistent memory footprint. The metadata
blocks provide a bound on the amount of scanning required by maintaining
a set of hints.

The chunks previously were managed by free_size, a value maintained in
bytes. Now, the chunks are managed in terms of bits, which is just a
scaled value of free_size down by PCPU_MIN_ALLOC_SIZE.
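
In code form, the scaling is simply the following (the constant matches
PCPU_MIN_ALLOC_SHIFT added earlier in the series; the helper name is
illustrative):

#include <linux/types.h>

#define PCPU_MIN_ALLOC_SHIFT	2	/* 4-byte minimum allocation */

/* illustrative: convert a byte size into a number of bitmap bits */
static inline int sketch_size_to_bits(size_t size)
{
	return size >> PCPU_MIN_ALLOC_SHIFT;
}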

There is one caveat with this implementation. In an effort to make
freeing fast, the only time metadata is updated on the free path is when
a whole block becomes free or the freed area spans metadata blocks.
This means the chunk's contig_hint can be up to a block smaller than
what it could actually serve. If the chunk's contig_hint is smaller
than a block, a check occurs and the hint is kept accurate. Metadata is
always kept accurate on allocation, so a chunk will never advertise a
larger contig_hint than it has available.

Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects of varying size. Deallocation was done in order,
alternating, and in reverse order. These numbers were collected after
rebasing on top of a80099a152. I present the worst-case numbers here:

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        335      |     4960
             16B    |        485      |     1150
             64B    |        445      |      280
            128B    |        505      |      177
           1024B    |       3385      |      140

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        725      |       70
             16B    |        760      |       70
             64B    |        855      |       80
            128B    |        910      |       90
           1024B    |       3770      |      260

This data demonstrates the inability of the area map allocator to
handle less-than-ideal situations. In the best case of reverse
deallocation, the area map allocator performs within range of the
bitmap allocator. In the worst case, freeing took nearly 5 seconds for
1 million 4-byte objects. The bitmap allocator dramatically improves
the consistency of the free path; the small allocations performed
nearly identically regardless of the freeing pattern.

While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The second problem of
additional scanning can result in the area map allocator completing in
52 minutes. The same workload takes only 14 seconds to complete for the
bitmap allocator. This was produced under a more contrived scenario of
allocating 1 million 4-byte objects with 8-byte alignment.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 include/linux/percpu.h |   10 +-
 init/main.c            |    1 -
 mm/percpu-internal.h   |   70 ++-
 mm/percpu-stats.c      |   99 ++--
 mm/percpu.c            | 1280 ++++++++++++++++++++++++++++++------------------
 5 files changed, 923 insertions(+), 537 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index a5cedcd..8f62b10 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -26,6 +26,15 @@
 #define PCPU_MIN_ALLOC_SHIFT		2
 
 /*
+ * This determines the size of each metadata block.  There are several subtle
+ * constraints around this variable.  The reserved_region and dynamic_region
+ * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE.  This is
+ * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks
+ * that are backing multiple pages, this needs to be accounted for.
+ */
+#define PCPU_BITMAP_BLOCK_SIZE		(PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)
+
+/*
  * Percpu allocator can serve percpu allocations before slab is
  * initialized which allows slab to depend on the percpu allocator.
  * The following two parameters decide how much resource to
@@ -120,7 +129,6 @@ extern bool is_kernel_percpu_address(unsigned long addr);
 #if !defined(CONFIG_SMP) || !defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
 extern void __init setup_per_cpu_areas(void);
 #endif
-extern void __init percpu_init_late(void);
 
 extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);
 extern void __percpu *__alloc_percpu(size_t size, size_t align);
diff --git a/init/main.c b/init/main.c
index 052481f..c9a9fff 100644
--- a/init/main.c
+++ b/init/main.c
@@ -500,7 +500,6 @@ static void __init mm_init(void)
 	page_ext_init_flatmem();
 	mem_init();
 	kmem_cache_init();
-	percpu_init_late();
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index f0776f6..2dac644 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -4,6 +4,21 @@
 #include <linux/types.h>
 #include <linux/percpu.h>
 
+/*
+ * pcpu_bitmap_md is the metadata block struct.
+ * All units are in terms of bits.
+ */
+struct pcpu_bitmap_md {
+	int			contig_hint;	/* contig hint for block */
+	int			contig_hint_start; /* block relative starting
+						      position of the contig hint */
+	int			left_free;	/* size of free space along
+						   the left side of the block */
+	int			right_free;	/* size of free space along
+						   the right side of the block */
+	int			first_free;	/* block position of first free */
+};
+
 struct pcpu_chunk {
 #ifdef CONFIG_PERCPU_STATS
 	int			nr_alloc;	/* # of allocations */
@@ -11,17 +26,20 @@ struct pcpu_chunk {
 #endif
 
 	struct list_head	list;		/* linked to pcpu_slot lists */
-	int			free_size;	/* free bytes in the chunk */
-	int			contig_hint;	/* max contiguous size hint */
+	int			free_bits;	/* free bits in the chunk */
+	int			contig_hint;	/* max contiguous size hint
+						   in bits */
+	int			contig_hint_start; /* contig_hint starting
+						      bit offset */
 	void			*base_addr;	/* base address of this chunk */
 
-	int			map_used;	/* # of map entries used before the sentry */
-	int			map_alloc;	/* # of map entries allocated */
-	int			*map;		/* allocation map */
-	struct list_head	map_extend_list;/* on pcpu_map_extend_chunks */
+	unsigned long		*alloc_map;	/* allocation map */
+	unsigned long		*bound_map;	/* boundary map */
+	struct pcpu_bitmap_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
-	int			first_free;	/* no free below this */
+	int			first_free_block; /* block that contains the first
+						     free bit */
 	bool			immutable;	/* no [de]population allowed */
 	bool			has_reserved;	/* indicates if the region this chunk
 						   is responsible for overlaps with
@@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
 extern unsigned long pcpu_reserved_offset;
 
+/*
+ * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks
+ * @chunk: chunk of interest
+ *
+ * This conversion is from the number of physical pages that the chunk
+ * serves to the number of bitmap blocks required.  It converts to bytes
+ * served to bits required and then blocks used.
+ */
+static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk)
+{
+	return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE /
+	       PCPU_BITMAP_BLOCK_SIZE;
+}
+
+/*
+ * pcpu_pages_to_bits - converts the pages to size of bitmap
+ * @pages: number of physical pages
+ *
+ * This conversion is from physical pages to the number of bits
+ * required in the bitmap.
+ */
+static inline int pcpu_pages_to_bits(int pages)
+{
+	return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
+}
+
+/*
+ * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap
+ * @chunk: chunk of interest
+ *
+ * This conversion is from the number of physical pages that the chunk
+ * serves to the number of bits in the bitmap.
+ */
+static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk)
+{
+	return pcpu_pages_to_bits(chunk->nr_pages);
+}
+
 #ifdef CONFIG_PERCPU_STATS
 
 #include <linux/spinlock.h>
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 6fc04b1..8dbef0c 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -29,64 +29,85 @@ static int cmpint(const void *a, const void *b)
 }
 
 /*
- * Iterates over all chunks to find the max # of map entries used.
+ * Iterates over all chunks to find the max nr_alloc entries.
  */
-static int find_max_map_used(void)
+static int find_max_nr_alloc(void)
 {
 	struct pcpu_chunk *chunk;
-	int slot, max_map_used;
+	int slot, max_nr_alloc;
 
-	max_map_used = 0;
+	max_nr_alloc = 0;
 	for (slot = 0; slot < pcpu_nr_slots; slot++)
 		list_for_each_entry(chunk, &pcpu_slot[slot], list)
-			max_map_used = max(max_map_used, chunk->map_used);
+			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
 
-	return max_map_used;
+	return max_nr_alloc;
 }
 
 /*
  * Prints out chunk state. Fragmentation is considered between
  * the beginning of the chunk to the last allocation.
+ *
+ * All statistics are in bytes unless stated otherwise.
  */
 static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 			    int *buffer)
 {
-	int i, s_index, last_alloc, alloc_sign, as_len;
+	int i, last_alloc, as_len, start, end;
 	int *alloc_sizes, *p;
 	/* statistics */
 	int sum_frag = 0, max_frag = 0;
 	int cur_min_alloc = 0, cur_med_alloc = 0, cur_max_alloc = 0;
 
 	alloc_sizes = buffer;
-	s_index = chunk->has_reserved ? 1 : 0;
-
-	/* find last allocation */
-	last_alloc = -1;
-	for (i = chunk->map_used - 1; i >= s_index; i--) {
-		if (chunk->map[i] & 1) {
-			last_alloc = i;
-			break;
-		}
-	}
 
-	/* if the chunk is not empty - ignoring reserve */
-	if (last_alloc >= s_index) {
-		as_len = last_alloc + 1 - s_index;
-
-		/*
-		 * Iterate through chunk map computing size info.
-		 * The first bit is overloaded to be a used flag.
-		 * negative = free space, positive = allocated
-		 */
-		for (i = 0, p = chunk->map + s_index; i < as_len; i++, p++) {
-			alloc_sign = (*p & 1) ? 1 : -1;
-			alloc_sizes[i] = alloc_sign *
-				((p[1] & ~1) - (p[0] & ~1));
+	/*
+	 * find_last_bit returns the start value if nothing found.
+	 * Therefore, we must determine if it is a failure of find_last_bit
+	 * and set the appropriate value.
+	 */
+	last_alloc = find_last_bit(chunk->alloc_map,
+				   pcpu_nr_pages_to_bits(chunk) - 1);
+	last_alloc = test_bit(last_alloc, chunk->alloc_map) ?
+		     last_alloc + 1 : 0;
+
+	start = as_len = 0;
+	if (chunk->has_reserved)
+		start = pcpu_reserved_offset;
+
+	/*
+	 * If a bit is set in the allocation map, the bound_map identifies
+	 * where the allocation ends.  If the allocation is not set, the
+	 * bound_map does not identify free areas as it is only kept accurate
+	 * on allocation, not free.
+	 *
+	 * Positive values are allocations and negative values are free
+	 * fragments.
+	 */
+	while (start < last_alloc) {
+		if (test_bit(start, chunk->alloc_map)) {
+			end = find_next_bit(chunk->bound_map, last_alloc,
+					    start + 1);
+			alloc_sizes[as_len] = 1;
+		} else {
+			end = find_next_bit(chunk->alloc_map, last_alloc,
+					    start + 1);
+			alloc_sizes[as_len] = -1;
 		}
 
-		sort(alloc_sizes, as_len, sizeof(chunk->map[0]), cmpint, NULL);
+		alloc_sizes[as_len++] *= (end - start) * PCPU_MIN_ALLOC_SIZE;
+
+		start = end;
+	}
+
+	/*
+	 * The negative values are free fragments and thus sorting gives the
+	 * free fragments at the beginning in largest first order.
+	 */
+	if (as_len > 0) {
+		sort(alloc_sizes, as_len, sizeof(int), cmpint, NULL);
 
-		/* Iterate through the unallocated fragements. */
+		/* iterate through the unallocated fragments */
 		for (i = 0, p = alloc_sizes; *p < 0 && i < as_len; i++, p++) {
 			sum_frag -= *p;
 			max_frag = max(max_frag, -1 * (*p));
@@ -100,8 +121,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 	P("nr_alloc", chunk->nr_alloc);
 	P("max_alloc_size", chunk->max_alloc_size);
 	P("empty_pop_pages", chunk->nr_empty_pop_pages);
-	P("free_size", chunk->free_size);
-	P("contig_hint", chunk->contig_hint);
+	P("first_free_block", chunk->first_free_block);
+	P("free_size", chunk->free_bits * PCPU_MIN_ALLOC_SIZE);
+	P("contig_hint", chunk->contig_hint * PCPU_MIN_ALLOC_SIZE);
 	P("sum_frag", sum_frag);
 	P("max_frag", max_frag);
 	P("cur_min_alloc", cur_min_alloc);
@@ -113,22 +135,23 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
 static int percpu_stats_show(struct seq_file *m, void *v)
 {
 	struct pcpu_chunk *chunk;
-	int slot, max_map_used;
+	int slot, max_nr_alloc;
 	int *buffer;
 
 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
-	max_map_used = find_max_map_used();
+	max_nr_alloc = find_max_nr_alloc();
 	spin_unlock_irq(&pcpu_lock);
 
-	buffer = vmalloc(max_map_used * sizeof(pcpu_first_chunk->map[0]));
+	/* there can be at most this many free and allocated fragments */
+	buffer = vmalloc((2 * max_nr_alloc + 1) * sizeof(int));
 	if (!buffer)
 		return -ENOMEM;
 
 	spin_lock_irq(&pcpu_lock);
 
 	/* if the buffer allocated earlier is too small */
-	if (max_map_used < find_max_map_used()) {
+	if (max_nr_alloc < find_max_nr_alloc()) {
 		spin_unlock_irq(&pcpu_lock);
 		vfree(buffer);
 		goto alloc_buffer;
diff --git a/mm/percpu.c b/mm/percpu.c
index fb01841..569df63 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -4,6 +4,9 @@
  * Copyright (C) 2009		SUSE Linux Products GmbH
  * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
  *
+ * Copyright (C) 2017		Facebook Inc.
+ * Copyright (C) 2017		Dennis Zhou <dennisszhou@gmail.com>
+ *
  * This file is released under the GPLv2 license.
  *
  * The percpu allocator handles both static and dynamic areas.  Percpu
@@ -34,19 +37,20 @@
  * percpu variables from kernel modules.  Finally, the dynamic section
  * takes care of normal allocations.
  *
- * Allocation state in each chunk is kept using an array of integers
- * on chunk->map.  A positive value in the map represents a free
- * region and negative allocated.  Allocation inside a chunk is done
- * by scanning this map sequentially and serving the first matching
- * entry.  This is mostly copied from the percpu_modalloc() allocator.
- * Chunks can be determined from the address using the index field
- * in the page struct. The index field contains a pointer to the chunk.
- *
- * These chunks are organized into lists according to free_size and
- * tries to allocate from the fullest chunk first. Each chunk maintains
- * a maximum contiguous area size hint which is guaranteed to be equal
- * to or larger than the maximum contiguous area in the chunk. This
- * helps prevent the allocator from iterating over chunks unnecessarily.
+ * The allocator organizes chunks into lists according to free size and
+ * tries to allocate from the fullest chunk first.  Each chunk is managed
+ * by a bitmap with metadata blocks.  The allocation map is updated on
+ * every allocation to reflect the current state while the boundary map
+ * is only updated on allocation.  Each metadata block contains
+ * information to help mitigate the need to iterate over large portions
+ * of the bitmap.  The reverse mapping from page to chunk is stored in
+ * the page's index.  Lastly, units are lazily backed and grow in unison.
+ *
+ * There is a unique conversion that goes on here between bytes and bits.
+ * The chunk tracks the number of pages it is responsible for in nr_pages.
+ * From there, helper functions are used to convert from physical pages
+ * to bitmap bits and bitmap blocks.  All hints are managed in bits
+ * unless explicitly stated.
  *
  * To use this allocator, arch code should do the following:
  *
@@ -86,10 +90,13 @@
 
 #include "percpu-internal.h"
 
-#define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
-#define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
-#define PCPU_ATOMIC_MAP_MARGIN_LOW	32
-#define PCPU_ATOMIC_MAP_MARGIN_HIGH	64
+/*
+ * The metadata is managed in terms of bits with each bit mapping to
+ * a fragment of size PCPU_MIN_ALLOC_SIZE.  Thus, the slots are calculated
+ * with respect to the number of bits available.
+ */
+#define PCPU_SLOT_BASE_SHIFT		3
+
 #define PCPU_EMPTY_POP_PAGES_LOW	2
 #define PCPU_EMPTY_POP_PAGES_HIGH	4
 
@@ -156,10 +163,11 @@ unsigned long pcpu_reserved_offset __ro_after_init;
 DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
 static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
 
-struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
-
-/* chunks which need their map areas extended, protected by pcpu_lock */
-static LIST_HEAD(pcpu_map_extend_chunks);
+/*
+ * Chunk list slots.  These slots order the chunks by the number of
+ * free bits available in the bitmap.
+ */
+struct list_head *pcpu_slot __ro_after_init;
 
 /*
  * The number of empty populated pages, protected by pcpu_lock.  The
@@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr)
 	       pcpu_reserved_chunk->nr_pages * PAGE_SIZE;
 }
 
-static int __pcpu_size_to_slot(int size)
+static int __pcpu_size_to_slot(int bit_size)
 {
-	int highbit = fls(size);	/* size is in bytes */
+	int highbit = fls(bit_size);	/* size is in bits */
 	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
 }
 
-static int pcpu_size_to_slot(int size)
+static int pcpu_size_to_slot(int bit_size)
 {
-	if (size == pcpu_unit_size)
+	if (bit_size == pcpu_pages_to_bits(pcpu_unit_pages))
 		return pcpu_nr_slots - 1;
-	return __pcpu_size_to_slot(size);
+	return __pcpu_size_to_slot(bit_size);
 }
 
 static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
 {
-	if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
+	if (chunk->free_bits == 0 || chunk->contig_hint == 0)
 		return 0;
 
-	return pcpu_size_to_slot(chunk->free_size);
+	return pcpu_size_to_slot(chunk->free_bits);
 }
 
 /* set the pointer to a chunk in a page struct */
@@ -277,6 +285,37 @@ static void __maybe_unused pcpu_next_pop(struct pcpu_chunk *chunk,
 }
 
 /*
+ * The following are helper functions to help access bitmaps and convert
+ * between bitmap offsets to actual address offsets.
+ */
+static unsigned long *pcpu_index_alloc_map(struct pcpu_chunk *chunk, int index)
+{
+	return chunk->alloc_map +
+		(index * PCPU_BITMAP_BLOCK_SIZE / BITS_PER_LONG);
+}
+
+static unsigned long pcpu_off_to_block_index(int off)
+{
+	return off / PCPU_BITMAP_BLOCK_SIZE;
+}
+
+static unsigned long pcpu_off_to_block_off(int off)
+{
+	return off & (PCPU_BITMAP_BLOCK_SIZE - 1);
+}
+
+static unsigned long pcpu_block_off_to_off(int index, int off)
+{
+	return index * PCPU_BITMAP_BLOCK_SIZE + off;
+}
+
+static unsigned long pcpu_block_get_first_page(int index)
+{
+	return PFN_DOWN(index * PCPU_BITMAP_BLOCK_SIZE *
+			PCPU_MIN_ALLOC_SIZE);
+}
+
+/*
  * (Un)populated page region iterators.  Iterate over (un)populated
  * page regions between @start and @end in @chunk.  @rs and @re should
  * be integer variables and will be set to start and end page index of
@@ -329,38 +368,6 @@ static void pcpu_mem_free(void *ptr)
 }
 
 /**
- * pcpu_count_occupied_pages - count the number of pages an area occupies
- * @chunk: chunk of interest
- * @i: index of the area in question
- *
- * Count the number of pages chunk's @i'th area occupies.  When the area's
- * start and/or end address isn't aligned to page boundary, the straddled
- * page is included in the count iff the rest of the page is free.
- */
-static int pcpu_count_occupied_pages(struct pcpu_chunk *chunk, int i)
-{
-	int off = chunk->map[i] & ~1;
-	int end = chunk->map[i + 1] & ~1;
-
-	if (!PAGE_ALIGNED(off) && i > 0) {
-		int prev = chunk->map[i - 1];
-
-		if (!(prev & 1) && prev <= round_down(off, PAGE_SIZE))
-			off = round_down(off, PAGE_SIZE);
-	}
-
-	if (!PAGE_ALIGNED(end) && i + 1 < chunk->map_used) {
-		int next = chunk->map[i + 1];
-		int nend = chunk->map[i + 2] & ~1;
-
-		if (!(next & 1) && nend >= round_up(end, PAGE_SIZE))
-			end = round_up(end, PAGE_SIZE);
-	}
-
-	return max_t(int, PFN_DOWN(end) - PFN_UP(off), 0);
-}
-
-/**
  * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
  * @chunk: chunk of interest
  * @oslot: the previous slot it was on
@@ -386,385 +393,770 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
 }
 
 /**
- * pcpu_need_to_extend - determine whether chunk area map needs to be extended
+ * pcpu_cnt_pop_pages- counts populated backing pages in range
  * @chunk: chunk of interest
- * @is_atomic: the allocation context
+ * @start: start index
+ * @end: end index
  *
- * Determine whether area map of @chunk needs to be extended.  If
- * @is_atomic, only the amount necessary for a new allocation is
- * considered; however, async extension is scheduled if the left amount is
- * low.  If !@is_atomic, it aims for more empty space.  Combined, this
- * ensures that the map is likely to have enough available space to
- * accomodate atomic allocations which can't extend maps directly.
- *
- * CONTEXT:
- * pcpu_lock.
+ * Calculates the number of populated pages in the region [start, end).
+ * This lets us keep track of how many empty populated pages are available
+ * and decide if we should schedule async work.
  *
  * RETURNS:
- * New target map allocation length if extension is necessary, 0
- * otherwise.
+ * The nr of populated pages.
  */
-static int pcpu_need_to_extend(struct pcpu_chunk *chunk, bool is_atomic)
+static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk,
+				      int start, int end)
 {
-	int margin, new_alloc;
-
-	lockdep_assert_held(&pcpu_lock);
+	return bitmap_weight(chunk->populated, end) -
+	       bitmap_weight(chunk->populated, start);
+}
 
-	if (is_atomic) {
-		margin = 3;
+/**
+ * pcpu_chunk_update_hint - updates metadata about a chunk
+ * @chunk: chunk of interest
+ *
+ * Responsible for iterating over metadata blocks to aggregate the
+ * overall statistics of the chunk.
+ *
+ * Updates:
+ *      chunk->contig_hint
+ *      chunk->contig_hint_start
+ *      nr_empty_pop_pages
+ */
+static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk)
+{
+	bool is_page_empty = true;
+	int i, off, cur_contig, nr_empty_pop_pages, l_pop_off;
+	struct pcpu_bitmap_md *block;
+
+	chunk->contig_hint = cur_contig = 0;
+	off = nr_empty_pop_pages = 0;
+	l_pop_off = pcpu_block_get_first_page(chunk->first_free_block);
+
+	for (i = chunk->first_free_block, block = chunk->md_blocks + i;
+	     i < pcpu_nr_pages_to_blocks(chunk); i++, block++) {
+		/* Manage nr_empty_pop_pages.
+		 *
+		 * This is tricky.  The background work function is
+		 * triggered when there are not enough free populated pages.
+		 * This is necessary to make sure atomic allocations can
+		 * succeed.
+		 *
+		 * The first page of each block is kept track of here allowing
+		 * this to scale in both situations where there are > 1 page
+		 * per block and where a block may be a portion of a page.
+		 */
+		int pop_off = pcpu_block_get_first_page(i);
+
+		if (pop_off > l_pop_off) {
+			if (is_page_empty)
+				nr_empty_pop_pages +=
+					pcpu_cnt_pop_pages(chunk, l_pop_off,
+							   pop_off);
+			l_pop_off = pop_off;
+			is_page_empty = true;
+		}
+		if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE)
+			is_page_empty = false;
 
-		if (chunk->map_alloc <
-		    chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) {
-			if (list_empty(&chunk->map_extend_list)) {
-				list_add_tail(&chunk->map_extend_list,
-					      &pcpu_map_extend_chunks);
-				pcpu_schedule_balance_work();
+		/* continue from prev block adding to the cur_contig hint */
+		if (cur_contig) {
+			cur_contig += block->left_free;
+			if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
+				continue;
+			} else if (cur_contig > chunk->contig_hint) {
+				chunk->contig_hint = cur_contig;
+				chunk->contig_hint_start = off;
 			}
+			cur_contig = 0;
 		}
-	} else {
-		margin = PCPU_ATOMIC_MAP_MARGIN_HIGH;
+		/* check if the block->contig_hint is larger */
+		if (block->contig_hint > chunk->contig_hint) {
+			chunk->contig_hint = block->contig_hint;
+			chunk->contig_hint_start =
+				pcpu_block_off_to_off(i,
+						      block->contig_hint_start);
+		}
+		/* let the next iteration catch the right_free */
+		cur_contig = block->right_free;
+		off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free;
 	}
 
-	if (chunk->map_alloc >= chunk->map_used + margin)
-		return 0;
-
-	new_alloc = PCPU_DFL_MAP_ALLOC;
-	while (new_alloc < chunk->map_used + margin)
-		new_alloc *= 2;
+	/* catch last iteration if the last block ends with free space */
+	if (cur_contig > chunk->contig_hint) {
+		chunk->contig_hint = cur_contig;
+		chunk->contig_hint_start = off;
+	}
 
-	return new_alloc;
+	/*
+	 * Keep track of nr_empty_pop_pages.
+	 *
+	 * The chunk maintains the previous number of free pages it held,
+	 * so the delta is used to update the global counter.  The reserved
+	 * chunk is not part of the free page count as its pages are populated
+	 * at init and are special to serving reserved allocations.
+	 */
+	if (is_page_empty) {
+		nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off,
+							 chunk->nr_pages);
+	}
+	if (chunk != pcpu_reserved_chunk)
+		pcpu_nr_empty_pop_pages +=
+			(nr_empty_pop_pages - chunk->nr_empty_pop_pages);
+	chunk->nr_empty_pop_pages = nr_empty_pop_pages;
 }
 
 /**
- * pcpu_extend_area_map - extend area map of a chunk
+ * pcpu_block_refresh_hint - recalculate block-level metadata
  * @chunk: chunk of interest
- * @new_alloc: new target allocation length of the area map
+ * @index: block index of the metadata block
  *
- * Extend area map of @chunk to have @new_alloc entries.
+ * Full scan over the entire block to recalculate block-level metadata.
+ */
+static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
+{
+	unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
+	struct pcpu_bitmap_md *block = chunk->md_blocks + index;
+	bool is_left_free = false, is_right_free = false;
+	int contig;
+	unsigned long start, end;
+
+	block->contig_hint = 0;
+	start = end = block->first_free;
+	while (start < PCPU_BITMAP_BLOCK_SIZE) {
+		/*
+		 * Scans the allocation map corresponding to this block
+		 * to find free fragments and update metadata accordingly.
+		 */
+		start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE,
+					   start);
+		if (start >= PCPU_BITMAP_BLOCK_SIZE)
+			break;
+		/* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */
+		end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start);
+		/* update left_free */
+		contig = end - start;
+		if (start == 0) {
+			block->left_free = contig;
+			is_left_free = true;
+		}
+		/* update right_free */
+		if (end == PCPU_BITMAP_BLOCK_SIZE) {
+			block->right_free = contig;
+			is_right_free = true;
+		}
+		/* update block contig_hints */
+		if (block->contig_hint < contig) {
+			block->contig_hint = contig;
+			block->contig_hint_start = start;
+		}
+		start = end;
+	}
+
+	/* clear left/right free hints */
+	if (!is_left_free)
+		block->left_free = 0;
+	if (!is_right_free)
+		block->right_free = 0;
+}
+
+/**
+ * pcpu_block_update_hint_alloc - update hint on allocation path
+ * @chunk: chunk of interest
+ * @bit_off: bitmap offset
+ * @bit_size: size of request in allocation units
  *
- * CONTEXT:
- * Does GFP_KERNEL allocation.  Grabs and releases pcpu_lock.
+ * Updates metadata for the allocation path.  The metadata only has to be
+ * refreshed by a full scan iff we break the largest contig region.
  *
  * RETURNS:
- * 0 on success, -errno on failure.
+ * Bool if we need to update the chunk's metadata. This occurs only if we
+ * break the chunk's contig hint.
  */
-static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc)
+static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
+					 int bit_size)
 {
-	int *old = NULL, *new = NULL;
-	size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
-	unsigned long flags;
-
-	lockdep_assert_held(&pcpu_alloc_mutex);
+	bool update_chunk = false;
+	int i;
+	int s_index, e_index, s_off, e_off;
+	struct pcpu_bitmap_md *s_block, *e_block, *block;
 
-	new = pcpu_mem_zalloc(new_size);
-	if (!new)
-		return -ENOMEM;
+	/* calculate per block offsets */
+	s_index = pcpu_off_to_block_index(bit_off);
+	e_index = pcpu_off_to_block_index(bit_off + bit_size);
+	s_off = pcpu_off_to_block_off(bit_off);
+	e_off = pcpu_off_to_block_off(bit_off + bit_size);
 
-	/* acquire pcpu_lock and switch to new area map */
-	spin_lock_irqsave(&pcpu_lock, flags);
+	/*
+	 * If the offset is the beginning of the next block, set it to the
+	 * end of the previous block as the last bit is exclusive.
+	 */
+	if (e_off == 0) {
+		e_off = PCPU_BITMAP_BLOCK_SIZE;
+		e_index--;
+	}
 
-	if (new_alloc <= chunk->map_alloc)
-		goto out_unlock;
+	s_block = chunk->md_blocks + s_index;
+	e_block = chunk->md_blocks + e_index;
 
-	old_size = chunk->map_alloc * sizeof(chunk->map[0]);
-	old = chunk->map;
+	/*
+	 * Update s_block.
+	 *
+	 * block->first_free must be updated if the allocation takes its place.
+	 * If the allocation breaks the contig_hint, a scan is required to
+	 * restore this hint.
+	 */
+	if (s_off == s_block->first_free)
+		s_block->first_free = find_next_zero_bit(
+					pcpu_index_alloc_map(chunk, s_index),
+					PCPU_BITMAP_BLOCK_SIZE,
+					s_off + bit_size);
+
+	if (s_off >= s_block->contig_hint_start &&
+	    s_off < s_block->contig_hint_start + s_block->contig_hint) {
+		pcpu_block_refresh_hint(chunk, s_index);
+	} else {
+		/* update left and right contig manually */
+		s_block->left_free = min(s_block->left_free, s_off);
+		if (s_index == e_index)
+			s_block->right_free = min_t(int, s_block->right_free,
+					PCPU_BITMAP_BLOCK_SIZE - e_off);
+		else
+			s_block->right_free = 0;
+	}
 
-	memcpy(new, old, old_size);
+	/*
+	 * Update e_block.
+	 * If s_index != e_index, then e_block's first_free is guaranteed to
+	 * be at or beyond e_off.  first_free must be updated and a scan
+	 * over e_block is issued.
+	 */
+	if (s_index != e_index) {
+		e_block->first_free = find_next_zero_bit(
+				pcpu_index_alloc_map(chunk, e_index),
+				PCPU_BITMAP_BLOCK_SIZE, e_off);
 
-	chunk->map_alloc = new_alloc;
-	chunk->map = new;
-	new = NULL;
+		pcpu_block_refresh_hint(chunk, e_index);
+	}
 
-out_unlock:
-	spin_unlock_irqrestore(&pcpu_lock, flags);
+	/* update in-between md_blocks */
+	for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index;
+	     i++, block++) {
+		block->contig_hint = 0;
+		block->left_free = 0;
+		block->right_free = 0;
+	}
 
 	/*
-	 * pcpu_mem_free() might end up calling vfree() which uses
-	 * IRQ-unsafe lock and thus can't be called under pcpu_lock.
+	 * The only time a full chunk scan is required is if the global
+	 * contig_hint is broken.  Otherwise, it means a smaller space
+	 * was used and therefore the global contig_hint is still correct.
 	 */
-	pcpu_mem_free(old);
-	pcpu_mem_free(new);
+	if (bit_off >= chunk->contig_hint_start &&
+	    bit_off < chunk->contig_hint_start + chunk->contig_hint)
+		update_chunk = true;
 
-	return 0;
+	return update_chunk;
 }
 
 /**
- * pcpu_fit_in_area - try to fit the requested allocation in a candidate area
- * @chunk: chunk the candidate area belongs to
- * @off: the offset to the start of the candidate area
- * @this_size: the size of the candidate area
- * @size: the size of the target allocation
- * @align: the alignment of the target allocation
- * @pop_only: only allocate from already populated region
- *
- * We're trying to allocate @size bytes aligned at @align.  @chunk's area
- * at @off sized @this_size is a candidate.  This function determines
- * whether the target allocation fits in the candidate area and returns the
- * number of bytes to pad after @off.  If the target area doesn't fit, -1
- * is returned.
- *
- * If @pop_only is %true, this function only considers the already
- * populated part of the candidate area.
+ * pcpu_block_update_hint_free - updates the block hints on the free path
+ * @chunk: chunk of interest
+ * @bit_off: bitmap offset
+ * @bit_size: size of request in allocation units
+ *
+ * Updates the hint along the free path by taking advantage of current metadata
+ * to minimize scanning of the bitmap.  Triggers a global update if an entire
+ * block becomes free or the free spans across blocks.  This tradeoff is to
+ * minimize global scanning to update the chunk->contig_hint.  The
+ * chunk->contig_hint may be off by up to a block, but a chunk->contig_hint
+ * will never be more than the available space.  If the chunk->contig_hint is
+ * in this block, it will be accurate.
+ *
+ * RETURNS:
+ * Bool if we need to update the chunk's metadata.  This occurs if a larger
+ * contig region is created along the edges or we free across blocks.
  */
-static int pcpu_fit_in_area(struct pcpu_chunk *chunk, int off, int this_size,
-			    int size, int align, bool pop_only)
+static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
+					int bit_size)
 {
-	int cand_off = off;
+	bool update_chunk = false;
+	int i;
+	int s_index, e_index, s_off, e_off;
+	int start, end, contig;
+	struct pcpu_bitmap_md *s_block, *e_block, *block;
 
-	while (true) {
-		int head = ALIGN(cand_off, align) - off;
-		int page_start, page_end, rs, re;
+	/* calculate per block offsets */
+	s_index = pcpu_off_to_block_index(bit_off);
+	e_index = pcpu_off_to_block_index(bit_off + bit_size);
+	s_off = pcpu_off_to_block_off(bit_off);
+	e_off = pcpu_off_to_block_off(bit_off + bit_size);
+
+	/*
+	 * If the offset is the beginning of the next block, set it to the
+	 * end of the previous block as the last bit is exclusive.
+	 */
+	if (e_off == 0) {
+		e_off = PCPU_BITMAP_BLOCK_SIZE;
+		e_index--;
+	}
+
+	s_block = chunk->md_blocks + s_index;
+	e_block = chunk->md_blocks + e_index;
+
+	/*
+	 * Check if the freed area aligns with the block->contig_hint.
+	 * If it does, then the scan to find the beginning/end of the
+	 * larger free area can be avoided.
+	 *
+	 * start and end refer to beginning and end of the free region
+	 * within each their respective blocks.  This is not necessarily
+	 * the entire free region as it may span blocks past the beginning
+	 * or end of the block.
+	 */
+	start = s_off;
+	if (s_off == s_block->contig_hint + s_block->contig_hint_start) {
+		start = s_block->contig_hint_start;
+	} else {
+		int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index),
+					  start);
+		start = (start == l_bit) ? 0 : l_bit + 1;
+	}
+
+	end = e_off;
+	if (e_off == e_block->contig_hint_start)
+		end = e_block->contig_hint_start + e_block->contig_hint;
+	else
+		end = find_next_bit(pcpu_index_alloc_map(chunk, e_index),
+				    PCPU_BITMAP_BLOCK_SIZE, end);
 
-		if (this_size < head + size)
-			return -1;
+	/* freeing in the same block */
+	if (s_index == e_index) {
+		contig = end - start;
 
-		if (!pop_only)
-			return head;
+		if (start == 0)
+			s_block->left_free = contig;
 
+		if (end == PCPU_BITMAP_BLOCK_SIZE)
+			s_block->right_free = contig;
+
+		s_block->first_free = min(s_block->first_free, start);
+		if (contig > s_block->contig_hint) {
+			s_block->contig_hint = contig;
+			s_block->contig_hint_start = start;
+		}
+
+	} else {
 		/*
-		 * If the first unpopulated page is beyond the end of the
-		 * allocation, the whole allocation is populated;
-		 * otherwise, retry from the end of the unpopulated area.
+		 * Freeing across md_blocks.
+		 *
+		 * If the start is at the beginning of the block, just
+		 * reset the block instead.
 		 */
-		page_start = PFN_DOWN(head + off);
-		page_end = PFN_UP(head + off + size);
-
-		rs = page_start;
-		pcpu_next_unpop(chunk, &rs, &re, PFN_UP(off + this_size));
-		if (rs >= page_end)
-			return head;
-		cand_off = re * PAGE_SIZE;
+		if (start == 0) {
+			s_index--;
+		} else {
+			/*
+			 * Knowing that the free is across blocks, this means
+			 * the hint can be updated on the right side and the
+			 * left side does not need to be touched.
+			 */
+			s_block->first_free = min(s_block->first_free, start);
+			contig = PCPU_BITMAP_BLOCK_SIZE - start;
+			s_block->right_free = contig;
+			if (contig > s_block->contig_hint) {
+				s_block->contig_hint = contig;
+				s_block->contig_hint_start = start;
+			}
+		}
+		/*
+		 * If end is the entire e_block, just reset the block
+		 * as well.
+		 */
+		if (end == PCPU_BITMAP_BLOCK_SIZE) {
+			e_index++;
+		} else {
+			/*
+			 * The hint must only be on the left side, so
+			 * update accordingly.
+			 */
+			e_block->first_free = 0;
+			e_block->left_free = end;
+			if (end > e_block->contig_hint) {
+				e_block->contig_hint = end;
+				e_block->contig_hint_start = 0;
+			}
+		}
+
+		/* reset md_blocks in the middle */
+		for (i = s_index + 1, block = chunk->md_blocks + i;
+		     i < e_index; i++, block++) {
+			block->first_free = 0;
+			block->contig_hint_start = 0;
+			block->contig_hint = PCPU_BITMAP_BLOCK_SIZE;
+			block->left_free = PCPU_BITMAP_BLOCK_SIZE;
+			block->right_free = PCPU_BITMAP_BLOCK_SIZE;
+		}
 	}
+
+	/*
+	 * The hint is only checked in the s_block and e_block when
+	 * freeing and particularly only when it is self contained within
+	 * its own block.  A scan is required if the free space spans
+	 * blocks or makes a block whole as the scan will take into
+	 * account free space across blocks.
+	 */
+	if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) ||
+	    s_index != e_index) {
+		update_chunk = true;
+	} else if (s_block->contig_hint > chunk->contig_hint) {
+		/* check if block contig_hint is bigger */
+		chunk->contig_hint = s_block->contig_hint;
+		chunk->contig_hint_start =
+			pcpu_block_off_to_off(s_index,
+					      s_block->contig_hint_start);
+	}
+
+	return update_chunk;
 }
 
 /**
- * pcpu_alloc_area - allocate area from a pcpu_chunk
+ * pcpu_is_populated - determines if the region is populated
  * @chunk: chunk of interest
- * @size: wanted size in bytes
- * @align: wanted align
- * @pop_only: allocate only from the populated area
- * @occ_pages_p: out param for the number of pages the area occupies
+ * @index: block index
+ * @block_off: offset within the bitmap
+ * @bit_size: size of request in allocation units
+ * @next_index: return value for next block index that is populated
  *
- * Try to allocate @size bytes area aligned at @align from @chunk.
- * Note that this function only allocates the offset.  It doesn't
- * populate or map the area.
+ * For atomic allocations, we must check if the backing pages are populated.
  *
- * @chunk->map must have at least two free slots.
+ * RETURNS:
+ * Bool if the backing pages are populated.  next_index is to skip over
+ * unpopulated blocks in pcpu_find_block_fit.
+ */
+static bool pcpu_is_populated(struct pcpu_chunk *chunk, int index,
+			      int block_off, int bit_size, int *next_index)
+{
+	int page_start, page_end, rs, re;
+	int off = pcpu_block_off_to_off(index, block_off);
+	int e_off = off + bit_size * PCPU_MIN_ALLOC_SIZE;
+
+	page_start = PFN_DOWN(off);
+	page_end = PFN_UP(e_off);
+
+	rs = page_start;
+	pcpu_next_unpop(chunk, &rs, &re, PFN_UP(e_off));
+	if (rs >= page_end)
+		return true;
+	*next_index = re * PAGE_SIZE / PCPU_BITMAP_BLOCK_SIZE;
+	return false;
+}
+
+/**
+ * pcpu_find_block_fit - finds the block index to start searching
+ * @chunk: chunk of interest
+ * @bit_size: size of request in allocation units
+ * @align: alignment of area (max PAGE_SIZE)
+ * @pop_only: use populated regions only
  *
- * CONTEXT:
- * pcpu_lock.
+ * Given a chunk and an allocation spec, find the offset to begin searching
+ * for a free region.  This is done by iterating over the bitmap metadata
+ * blocks and then only returning regions that will be guaranteed to fit
+ * alignment by comparing against the block->contig_hint_start or a correctly
+ * aligned offset.  Iteration is used within a block as an allocation may be
+ * able to be served prior to the contig_hint.
+ *
+ * Note: This errs on the side of caution by only selecting blocks guaranteed
+ *	 to have a fit in the chunk's contig_hint.  Poor alignment can cause
+ *	 us to skip over chunks that have valid vacancies.
  *
  * RETURNS:
- * Allocated offset in @chunk on success, -1 if no matching area is
- * found.
+ * The offset in the bitmap to begin searching.
+ * -1 if no offset is found.
  */
-static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align,
-			   bool pop_only, int *occ_pages_p)
+static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
+			       size_t align, bool pop_only)
 {
-	int oslot = pcpu_chunk_slot(chunk);
-	int max_contig = 0;
-	int i, off;
-	bool seen_free = false;
-	int *p;
-
-	for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) {
-		int head, tail;
-		int this_size;
-
-		off = *p;
-		if (off & 1)
-			continue;
+	int i, cur_free;
+	int s_index, block_off, next_index, end_off; /* interior alloc index */
+	struct pcpu_bitmap_md *block;
+	unsigned long *alloc_map;
 
-		this_size = (p[1] & ~1) - off;
+	lockdep_assert_held(&pcpu_lock);
 
-		head = pcpu_fit_in_area(chunk, off, this_size, size, align,
-					pop_only);
-		if (head < 0) {
-			if (!seen_free) {
-				chunk->first_free = i;
-				seen_free = true;
-			}
-			max_contig = max(this_size, max_contig);
+	cur_free = block_off = 0;
+	s_index = chunk->first_free_block;
+	for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk);
+	     i++) {
+		alloc_map = pcpu_index_alloc_map(chunk, i);
+		block = chunk->md_blocks + i;
+
+		/* continue from prev block */
+		cur_free += block->left_free;
+		if (cur_free >= bit_size) {
+			end_off = bit_size;
+			goto check_populated;
+		} else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
 			continue;
 		}
 
 		/*
-		 * If head is small or the previous block is free,
-		 * merge'em.  Note that 'small' is defined as smaller
-		 * than sizeof(int), which is very small but isn't too
-		 * uncommon for percpu allocations.
+		 * Can this block hold this alloc?
+		 *
+		 * Here the block->contig_hint is used to guarantee a fit,
+		 * but the block->first_free is returned as we may be able
+		 * to serve the allocation earlier.  The population check
+		 * must take into account the area beginning at first_free
+		 * through the end of the contig_hint.
 		 */
-		if (head && (head < sizeof(int) || !(p[-1] & 1))) {
-			*p = off += head;
-			if (p[-1] & 1)
-				chunk->free_size -= head;
-			else
-				max_contig = max(*p - p[-1], max_contig);
-			this_size -= head;
-			head = 0;
+		cur_free = 0;
+		s_index = i;
+		block_off = ALIGN(block->contig_hint_start, align);
+		block_off -= block->contig_hint_start;
+		if (block->contig_hint >= block_off + bit_size) {
+			block_off = block->first_free;
+			end_off = block->contig_hint_start - block_off +
+				  bit_size;
+			goto check_populated;
 		}
 
-		/* if tail is small, just keep it around */
-		tail = this_size - head - size;
-		if (tail < sizeof(int)) {
-			tail = 0;
-			size = this_size - head;
+		/* check right */
+		block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free,
+				  align);
+		/* reset to start looking in the next block */
+		if (block_off >= PCPU_BITMAP_BLOCK_SIZE) {
+			s_index++;
+			cur_free = block_off = 0;
+			continue;
 		}
+		cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off;
+		if (cur_free >= bit_size) {
+			end_off = bit_size;
+check_populated:
+			if (!pop_only ||
+			    pcpu_is_populated(chunk, s_index, block_off,
+					      end_off, &next_index))
+				break;
 
-		/* split if warranted */
-		if (head || tail) {
-			int nr_extra = !!head + !!tail;
-
-			/* insert new subblocks */
-			memmove(p + nr_extra + 1, p + 1,
-				sizeof(chunk->map[0]) * (chunk->map_used - i));
-			chunk->map_used += nr_extra;
-
-			if (head) {
-				if (!seen_free) {
-					chunk->first_free = i;
-					seen_free = true;
-				}
-				*++p = off += head;
-				++i;
-				max_contig = max(head, max_contig);
-			}
-			if (tail) {
-				p[1] = off + size;
-				max_contig = max(tail, max_contig);
-			}
+			i = next_index - 1;
+			s_index = next_index;
+			cur_free = block_off = 0;
 		}
+	}
 
-		if (!seen_free)
-			chunk->first_free = i + 1;
+	/* nothing found */
+	if (i == pcpu_nr_pages_to_blocks(chunk))
+		return -1;
 
-		/* update hint and mark allocated */
-		if (i + 1 == chunk->map_used)
-			chunk->contig_hint = max_contig; /* fully scanned */
-		else
-			chunk->contig_hint = max(chunk->contig_hint,
-						 max_contig);
+	return s_index * PCPU_BITMAP_BLOCK_SIZE + block_off;
+}
 
-		chunk->free_size -= size;
-		*p |= 1;
 
-		*occ_pages_p = pcpu_count_occupied_pages(chunk, i);
-		pcpu_chunk_relocate(chunk, oslot);
-		return off;
-	}
+/**
+ * pcpu_alloc_area - allocates area from a pcpu_chunk
+ * @chunk: chunk of interest
+ * @bit_size: size of request in allocation units
+ * @align: alignment of area (max PAGE_SIZE)
+ * @start: bit_off to start searching
+ *
+ * This function takes in a start bit_offset to begin searching.  It
+ * searches the allocation bitmap to verify that the offset is available
+ * as block->first_free is provided when allocation within a block is
+ * available.
+ *
+ * RETURNS:
+ * Allocated addr offset in @chunk on success,
+ * -1 if no matching area is found
+ */
+static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size,
+			   size_t align, int start)
+{
+	size_t align_mask = (align) ? (align - 1) : 0;
+	int i, bit_off, oslot;
+	struct pcpu_bitmap_md *block;
+
+	lockdep_assert_held(&pcpu_lock);
+
+	oslot = pcpu_chunk_slot(chunk);
+
+	/* search to find fit */
+	bit_off = bitmap_find_next_zero_area(chunk->alloc_map,
+					     pcpu_nr_pages_to_bits(chunk),
+					     start, bit_size, align_mask);
+
+	if (bit_off >= pcpu_nr_pages_to_bits(chunk))
+		return -1;
+
+	/* update alloc map */
+	bitmap_set(chunk->alloc_map, bit_off, bit_size);
+	/* update boundary map */
+	set_bit(bit_off, chunk->bound_map);
+	bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1);
+	set_bit(bit_off + bit_size, chunk->bound_map);
+
+	chunk->free_bits -= bit_size;
+
+	if (pcpu_block_update_hint_alloc(chunk, bit_off, bit_size))
+		pcpu_chunk_update_hint(chunk);
+
+	/* update chunk first_free */
+	for (i = chunk->first_free_block, block = chunk->md_blocks + i;
+	     i < pcpu_nr_pages_to_blocks(chunk); i++, block++)
+		if (block->contig_hint != 0)
+			break;
+
+	chunk->first_free_block = i;
 
-	chunk->contig_hint = max_contig;	/* fully scanned */
 	pcpu_chunk_relocate(chunk, oslot);
 
-	/* tell the upper layer that this chunk has no matching area */
-	return -1;
+	return bit_off * PCPU_MIN_ALLOC_SIZE;
 }
 
 /**
- * pcpu_free_area - free area to a pcpu_chunk
+ * pcpu_free_area - frees the corresponding offset
  * @chunk: chunk of interest
- * @freeme: offset of area to free
- * @occ_pages_p: out param for the number of pages the area occupies
+ * @off: addr offset into chunk
  *
- * Free area starting from @freeme to @chunk.  Note that this function
- * only modifies the allocation map.  It doesn't depopulate or unmap
- * the area.
- *
- * CONTEXT:
- * pcpu_lock.
+ * This function determines the size of an allocation to free using
+ * the boundary bitmap and clears the allocation map.  A block metadata
+ * update is triggered and potentially a chunk update occurs.
  */
-static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme,
-			   int *occ_pages_p)
+static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
 {
-	int oslot = pcpu_chunk_slot(chunk);
-	int off = 0;
-	unsigned i, j;
-	int to_free = 0;
-	int *p;
+	int bit_off, bit_size, index, end, oslot;
+	struct pcpu_bitmap_md *block;
 
 	lockdep_assert_held(&pcpu_lock);
 	pcpu_stats_area_dealloc(chunk);
 
-	freeme |= 1;	/* we are searching for <given offset, in use> pair */
-
-	i = 0;
-	j = chunk->map_used;
-	while (i != j) {
-		unsigned k = (i + j) / 2;
-		off = chunk->map[k];
-		if (off < freeme)
-			i = k + 1;
-		else if (off > freeme)
-			j = k;
-		else
-			i = j = k;
-	}
-	BUG_ON(off != freeme);
+	oslot = pcpu_chunk_slot(chunk);
 
-	if (i < chunk->first_free)
-		chunk->first_free = i;
+	bit_off = off / PCPU_MIN_ALLOC_SIZE;
 
-	p = chunk->map + i;
-	*p = off &= ~1;
-	chunk->free_size += (p[1] & ~1) - off;
+	/* find end index */
+	end = find_next_bit(chunk->bound_map, pcpu_nr_pages_to_bits(chunk),
+			    bit_off + 1);
+	bit_size = end - bit_off;
 
-	*occ_pages_p = pcpu_count_occupied_pages(chunk, i);
+	bitmap_clear(chunk->alloc_map, bit_off, bit_size);
 
-	/* merge with next? */
-	if (!(p[1] & 1))
-		to_free++;
-	/* merge with previous? */
-	if (i > 0 && !(p[-1] & 1)) {
-		to_free++;
-		i--;
-		p--;
-	}
-	if (to_free) {
-		chunk->map_used -= to_free;
-		memmove(p + 1, p + 1 + to_free,
-			(chunk->map_used - i) * sizeof(chunk->map[0]));
-	}
+	chunk->free_bits += bit_size;
+
+	/* update first_free */
+	index = pcpu_off_to_block_index(bit_off);
+	block = chunk->md_blocks + index;
+	block->first_free = min_t(int, block->first_free,
+				  bit_off % PCPU_BITMAP_BLOCK_SIZE);
+
+	chunk->first_free_block = min(chunk->first_free_block, index);
+
+	if (pcpu_block_update_hint_free(chunk, bit_off, bit_size))
+		pcpu_chunk_update_hint(chunk);
 
-	chunk->contig_hint = max(chunk->map[i + 1] - chunk->map[i] - 1, chunk->contig_hint);
 	pcpu_chunk_relocate(chunk, oslot);
 }
 
+static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
+{
+	struct pcpu_bitmap_md *md_block;
+
+	for (md_block = chunk->md_blocks;
+	     md_block != chunk->md_blocks + pcpu_nr_pages_to_blocks(chunk);
+	     md_block++) {
+		md_block->contig_hint = PCPU_BITMAP_BLOCK_SIZE;
+		md_block->left_free = PCPU_BITMAP_BLOCK_SIZE;
+		md_block->right_free = PCPU_BITMAP_BLOCK_SIZE;
+	}
+}
+
+static struct pcpu_chunk * __init pcpu_alloc_first_chunk(int chunk_pages)
+{
+	struct pcpu_chunk *chunk;
+	int map_size_bits;
+
+	chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) +
+				     BITS_TO_LONGS(chunk_pages) * sizeof(unsigned long), 0);
+
+	INIT_LIST_HEAD(&chunk->list);
+	chunk->has_reserved = false;
+	chunk->immutable = true;
+
+	chunk->nr_pages = chunk_pages;
+	map_size_bits = pcpu_nr_pages_to_bits(chunk);
+
+	chunk->alloc_map = memblock_virt_alloc(
+				BITS_TO_LONGS(map_size_bits) *
+				sizeof(chunk->alloc_map[0]), 0);
+	chunk->bound_map = memblock_virt_alloc(
+				BITS_TO_LONGS(map_size_bits + 1) *
+				sizeof(chunk->bound_map[0]), 0);
+	chunk->md_blocks = memblock_virt_alloc(
+				pcpu_nr_pages_to_blocks(chunk) *
+				sizeof(struct pcpu_bitmap_md), 0);
+	pcpu_init_md_blocks(chunk);
+
+	/* fill page populated map - the first chunk is fully populated */
+	bitmap_fill(chunk->populated, chunk_pages);
+	chunk->nr_populated = chunk->nr_empty_pop_pages = chunk_pages;
+
+	return chunk;
+}
+
 static struct pcpu_chunk *pcpu_alloc_chunk(void)
 {
 	struct pcpu_chunk *chunk;
+	int map_size_bits;
 
 	chunk = pcpu_mem_zalloc(pcpu_chunk_struct_size);
 	if (!chunk)
 		return NULL;
 
-	chunk->map = pcpu_mem_zalloc(PCPU_DFL_MAP_ALLOC *
-						sizeof(chunk->map[0]));
-	if (!chunk->map) {
-		pcpu_mem_free(chunk);
-		return NULL;
-	}
-
-	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
-	chunk->map[0] = 0;
-	chunk->map[1] = pcpu_unit_size | 1;
-	chunk->map_used = 1;
-	chunk->has_reserved = false;
-
 	INIT_LIST_HEAD(&chunk->list);
-	INIT_LIST_HEAD(&chunk->map_extend_list);
-	chunk->free_size = pcpu_unit_size;
-	chunk->contig_hint = pcpu_unit_size;
+	chunk->has_reserved = false;
 
 	chunk->nr_pages = pcpu_unit_pages;
+	map_size_bits = pcpu_nr_pages_to_bits(chunk);
+
+	chunk->alloc_map = pcpu_mem_zalloc(BITS_TO_LONGS(map_size_bits) *
+					   sizeof(chunk->alloc_map[0]));
+	if (!chunk->alloc_map)
+		goto alloc_map_fail;
+
+	chunk->bound_map = pcpu_mem_zalloc(BITS_TO_LONGS(map_size_bits + 1) *
+					   sizeof(chunk->bound_map[0]));
+	if (!chunk->bound_map)
+		goto bound_map_fail;
+
+	chunk->md_blocks = pcpu_mem_zalloc(pcpu_nr_pages_to_blocks(chunk) *
+					   sizeof(chunk->md_blocks[0]));
+	if (!chunk->md_blocks)
+		goto md_blocks_fail;
+
+	pcpu_init_md_blocks(chunk);
+
+	/* init metadata */
+	chunk->contig_hint = chunk->free_bits = map_size_bits;
 
 	return chunk;
+
+md_blocks_fail:
+	pcpu_mem_free(chunk->bound_map);
+bound_map_fail:
+	pcpu_mem_free(chunk->alloc_map);
+alloc_map_fail:
+	pcpu_mem_free(chunk);
+
+	return NULL;
 }
 
 static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
-	pcpu_mem_free(chunk->map);
+	pcpu_mem_free(chunk->md_blocks);
+	pcpu_mem_free(chunk->bound_map);
+	pcpu_mem_free(chunk->alloc_map);
 	pcpu_mem_free(chunk);
 }
 
@@ -787,6 +1179,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk,
 
 	bitmap_set(chunk->populated, page_start, nr);
 	chunk->nr_populated += nr;
+	chunk->nr_empty_pop_pages += nr;
 	pcpu_nr_empty_pop_pages += nr;
 }
 
@@ -809,6 +1202,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk,
 
 	bitmap_clear(chunk->populated, page_start, nr);
 	chunk->nr_populated -= nr;
+	chunk->nr_empty_pop_pages -= nr;
 	pcpu_nr_empty_pop_pages -= nr;
 }
 
@@ -890,19 +1284,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	struct pcpu_chunk *chunk;
 	const char *err;
 	bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
-	int occ_pages = 0;
-	int slot, off, new_alloc, cpu, ret;
+	int slot, off, cpu, ret;
 	unsigned long flags;
 	void __percpu *ptr;
+	size_t bit_size, bit_align;
 
 	/*
-	 * We want the lowest bit of offset available for in-use/free
-	 * indicator, so force >= 16bit alignment and make size even.
+	 * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE.
+	 * Therefore alignment must be a minimum of that many bytes as well
+	 * as the allocation will have internal fragmentation from
+	 * rounding up by up to PCPU_MIN_ALLOC_SIZE - 1 bytes.
 	 */
-	if (unlikely(align < 2))
-		align = 2;
+	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
+		align = PCPU_MIN_ALLOC_SIZE;
 
-	size = ALIGN(size, 2);
+	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
+	bit_size = size >> PCPU_MIN_ALLOC_SHIFT;
+	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
 
 	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
 		     !is_power_of_2(align))) {
@@ -920,23 +1318,14 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 	if (reserved && pcpu_reserved_chunk) {
 		chunk = pcpu_reserved_chunk;
 
-		if (size > chunk->contig_hint) {
+		off = pcpu_find_block_fit(chunk, bit_size, bit_align,
+					  is_atomic);
+		if (off < 0) {
 			err = "alloc from reserved chunk failed";
 			goto fail_unlock;
 		}
 
-		while ((new_alloc = pcpu_need_to_extend(chunk, is_atomic))) {
-			spin_unlock_irqrestore(&pcpu_lock, flags);
-			if (is_atomic ||
-			    pcpu_extend_area_map(chunk, new_alloc) < 0) {
-				err = "failed to extend area map of reserved chunk";
-				goto fail;
-			}
-			spin_lock_irqsave(&pcpu_lock, flags);
-		}
-
-		off = pcpu_alloc_area(chunk, size, align, is_atomic,
-				      &occ_pages);
+		off = pcpu_alloc_area(chunk, bit_size, bit_align, off);
 		if (off >= 0)
 			goto area_found;
 
@@ -946,31 +1335,17 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 restart:
 	/* search through normal chunks */
-	for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
+	for (slot = pcpu_size_to_slot(bit_size); slot < pcpu_nr_slots; slot++) {
 		list_for_each_entry(chunk, &pcpu_slot[slot], list) {
-			if (size > chunk->contig_hint)
+			if (bit_size > chunk->contig_hint)
 				continue;
 
-			new_alloc = pcpu_need_to_extend(chunk, is_atomic);
-			if (new_alloc) {
-				if (is_atomic)
-					continue;
-				spin_unlock_irqrestore(&pcpu_lock, flags);
-				if (pcpu_extend_area_map(chunk,
-							 new_alloc) < 0) {
-					err = "failed to extend area map";
-					goto fail;
-				}
-				spin_lock_irqsave(&pcpu_lock, flags);
-				/*
-				 * pcpu_lock has been dropped, need to
-				 * restart cpu_slot list walking.
-				 */
-				goto restart;
-			}
+			off = pcpu_find_block_fit(chunk, bit_size, bit_align,
+						  is_atomic);
+			if (off < 0)
+				continue;
 
-			off = pcpu_alloc_area(chunk, size, align, is_atomic,
-					      &occ_pages);
+			off = pcpu_alloc_area(chunk, bit_size, bit_align, off);
 			if (off >= 0)
 				goto area_found;
 		}
@@ -1021,7 +1396,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 
 			spin_lock_irqsave(&pcpu_lock, flags);
 			if (ret) {
-				pcpu_free_area(chunk, off, &occ_pages);
+				pcpu_free_area(chunk, off);
 				err = "failed to populate";
 				goto fail_unlock;
 			}
@@ -1032,12 +1407,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
 		mutex_unlock(&pcpu_alloc_mutex);
 	}
 
-	if (chunk != pcpu_reserved_chunk) {
-		spin_lock_irqsave(&pcpu_lock, flags);
-		pcpu_nr_empty_pop_pages -= occ_pages;
-		spin_unlock_irqrestore(&pcpu_lock, flags);
-	}
-
 	if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
 		pcpu_schedule_balance_work();
 
@@ -1155,7 +1524,6 @@ static void pcpu_balance_workfn(struct work_struct *work)
 		if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
 			continue;
 
-		list_del_init(&chunk->map_extend_list);
 		list_move(&chunk->list, &to_free);
 	}
 
@@ -1173,25 +1541,6 @@ static void pcpu_balance_workfn(struct work_struct *work)
 		pcpu_destroy_chunk(chunk);
 	}
 
-	/* service chunks which requested async area map extension */
-	do {
-		int new_alloc = 0;
-
-		spin_lock_irq(&pcpu_lock);
-
-		chunk = list_first_entry_or_null(&pcpu_map_extend_chunks,
-					struct pcpu_chunk, map_extend_list);
-		if (chunk) {
-			list_del_init(&chunk->map_extend_list);
-			new_alloc = pcpu_need_to_extend(chunk, false);
-		}
-
-		spin_unlock_irq(&pcpu_lock);
-
-		if (new_alloc)
-			pcpu_extend_area_map(chunk, new_alloc);
-	} while (chunk);
-
 	/*
 	 * Ensure there are certain number of free populated pages for
 	 * atomic allocs.  Fill up from the most packed so that atomic
@@ -1213,7 +1562,8 @@ static void pcpu_balance_workfn(struct work_struct *work)
 				  0, PCPU_EMPTY_POP_PAGES_HIGH);
 	}
 
-	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
+	for (slot = pcpu_size_to_slot(PAGE_SIZE / PCPU_MIN_ALLOC_SIZE);
+	     slot < pcpu_nr_slots; slot++) {
 		int nr_unpop = 0, rs, re;
 
 		if (!nr_to_pop)
@@ -1277,7 +1627,7 @@ void free_percpu(void __percpu *ptr)
 	void *addr;
 	struct pcpu_chunk *chunk;
 	unsigned long flags;
-	int off, occ_pages;
+	int off;
 
 	if (!ptr)
 		return;
@@ -1291,13 +1641,10 @@ void free_percpu(void __percpu *ptr)
 	chunk = pcpu_chunk_addr_search(addr);
 	off = addr - chunk->base_addr;
 
-	pcpu_free_area(chunk, off, &occ_pages);
-
-	if (chunk != pcpu_reserved_chunk)
-		pcpu_nr_empty_pop_pages += occ_pages;
+	pcpu_free_area(chunk, off);
 
 	/* if there are more than one fully free chunks, wake up grim reaper */
-	if (chunk->free_size == pcpu_unit_size) {
+	if (chunk->free_bits == pcpu_pages_to_bits(pcpu_unit_pages)) {
 		struct pcpu_chunk *pos;
 
 		list_for_each_entry(pos, &pcpu_slot[pcpu_nr_slots - 1], list)
@@ -1363,15 +1710,15 @@ bool is_kernel_percpu_address(unsigned long addr)
  * address.  The caller is responsible for ensuring @addr stays valid
  * until this function finishes.
  *
- * percpu allocator has special setup for the first chunk, which currently
+ * Percpu allocator has special setup for the first chunk, which currently
  * supports either embedding in linear address space or vmalloc mapping,
  * and, from the second one, the backing allocator (currently either vm or
  * km) provides translation.
  *
  * The addr can be translated simply without checking if it falls into the
- * first chunk. But the current code reflects better how percpu allocator
+ * first chunk.  But the current code reflects better how percpu allocator
  * actually works, and the verification can discover both bugs in percpu
- * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current
+ * allocator itself and per_cpu_ptr_to_phys() callers.  So we keep current
  * code.
  *
  * RETURNS:
@@ -1417,9 +1764,10 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)
 		else
 			return page_to_phys(vmalloc_to_page(addr)) +
 			       offset_in_page(addr);
-	} else
+	} else {
 		return page_to_phys(pcpu_addr_to_page(addr)) +
 		       offset_in_page(addr);
+	}
 }
 
 /**
@@ -1555,10 +1903,12 @@ static void pcpu_dump_alloc_info(const char *lvl,
  * static areas on architectures where the addressing model has
  * limited offset range for symbol relocations to guarantee module
  * percpu symbols fall inside the relocatable range.
+ * @ai->static_size + @ai->reserved_size is expected to be page aligned.
  *
  * @ai->dyn_size determines the number of bytes available for dynamic
- * allocation in the first chunk.  The area between @ai->static_size +
- * @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused.
+ * allocation in the first chunk. Both the start and the end are expected
+ * to be page aligned. The area between @ai->static_size + @ai->reserved_size
+ * + @ai->dyn_size and @ai->unit_size is unused.
  *
  * @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE
  * and equal to or larger than @ai->static_size + @ai->reserved_size +
@@ -1581,11 +1931,11 @@ static void pcpu_dump_alloc_info(const char *lvl,
  * copied static data to each unit.
  *
  * If the first chunk ends up with both reserved and dynamic areas, it
- * is served by two chunks - one to serve the core static and reserved
- * areas and the other for the dynamic area.  They share the same vm
- * and page map but uses different area allocation map to stay away
- * from each other.  The latter chunk is circulated in the chunk slots
- * and available for dynamic allocation like any other chunks.
+ * is served by two chunks - one to serve the reserved area and the other
+ * for the dynamic area.  They share the same vm and page map but use
+ * different area allocation map to stay away from each other.  The latter
+ * chunk is circulated in the chunk slots and available for dynamic allocation
+ * like any other chunks.
  *
  * RETURNS:
  * 0 on success, -errno on failure.
@@ -1593,8 +1943,6 @@ static void pcpu_dump_alloc_info(const char *lvl,
 int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 				  void *base_addr)
 {
-	static int smap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
-	static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
 	size_t dyn_size = ai->dyn_size;
 	size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
 	struct pcpu_chunk *chunk;
@@ -1606,7 +1954,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	int group, unit, i;
 	int chunk_pages;
 	unsigned long tmp_addr, aligned_addr;
-	unsigned long map_size_bytes;
+	unsigned long map_size_bytes, begin_fill_bits;
 
 #define PCPU_SETUP_BUG_ON(cond)	do {					\
 	if (unlikely(cond)) {						\
@@ -1703,7 +2051,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	 * Allocate chunk slots.  The additional last slot is for
 	 * empty chunks.
 	 */
-	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
+	pcpu_nr_slots = __pcpu_size_to_slot(
+				pcpu_pages_to_bits(pcpu_unit_pages)) + 2;
 	pcpu_slot = memblock_virt_alloc(
 			pcpu_nr_slots * sizeof(pcpu_slot[0]), 0);
 	for (i = 0; i < pcpu_nr_slots; i++)
@@ -1727,69 +2076,50 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	tmp_addr = (unsigned long)base_addr + ai->static_size;
 	aligned_addr = tmp_addr & PAGE_MASK;
 	pcpu_reserved_offset = tmp_addr - aligned_addr;
+	begin_fill_bits = pcpu_reserved_offset / PCPU_MIN_ALLOC_SIZE;
 
 	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
 			 pcpu_reserved_offset;
+
 	chunk_pages = map_size_bytes >> PAGE_SHIFT;
 
 	/* chunk adjacent to static region allocation */
-	chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) +
-				     BITS_TO_LONGS(chunk_pages), 0);
-	INIT_LIST_HEAD(&chunk->list);
-	INIT_LIST_HEAD(&chunk->map_extend_list);
+	chunk = pcpu_alloc_first_chunk(chunk_pages);
 	chunk->base_addr = (void *)aligned_addr;
-	chunk->map = smap;
-	chunk->map_alloc = ARRAY_SIZE(smap);
 	chunk->immutable = true;
-	bitmap_fill(chunk->populated, chunk_pages);
-	chunk->nr_populated = chunk->nr_empty_pop_pages = chunk_pages;
-
-	chunk->nr_pages = chunk_pages;
 
-	if (ai->reserved_size) {
-		chunk->free_size = ai->reserved_size;
-		pcpu_reserved_chunk = chunk;
-	} else {
-		chunk->free_size = dyn_size;
-		dyn_size = 0;			/* dynamic area covered */
-	}
-	chunk->contig_hint = chunk->free_size;
+	/* set metadata */
+	chunk->contig_hint = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits;
+	chunk->free_bits = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits;
 
-	if (pcpu_reserved_offset) {
+	/*
+	 * If the beginning of the reserved region overlaps the end of the
+	 * static region, hide that portion in the metadata.
+	 */
+	if (begin_fill_bits) {
 		chunk->has_reserved = true;
-		chunk->nr_empty_pop_pages--;
-		chunk->map[0] = 1;
-		chunk->map[1] = pcpu_reserved_offset;
-		chunk->map_used = 1;
+		bitmap_fill(chunk->alloc_map, begin_fill_bits);
+		set_bit(0, chunk->bound_map);
+		set_bit(begin_fill_bits, chunk->bound_map);
+
+		if (pcpu_block_update_hint_alloc(chunk, 0, begin_fill_bits))
+			pcpu_chunk_update_hint(chunk);
 	}
-	if (chunk->free_size)
-		chunk->map[++chunk->map_used] = map_size_bytes;
-	chunk->map[chunk->map_used] |= 1;
 
-	/* init dynamic region of first chunk if necessary */
-	if (dyn_size) {
+	/* init dynamic chunk if necessary */
+	if (ai->reserved_size) {
+		pcpu_reserved_chunk = chunk;
+
 		chunk_pages = dyn_size >> PAGE_SHIFT;
 
 		/* chunk allocation */
-		chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) +
-					     BITS_TO_LONGS(chunk_pages), 0);
-		INIT_LIST_HEAD(&chunk->list);
-		INIT_LIST_HEAD(&chunk->map_extend_list);
+		chunk = pcpu_alloc_first_chunk(chunk_pages);
 		chunk->base_addr = base_addr + ai->static_size +
 				    ai->reserved_size;
-		chunk->map = dmap;
-		chunk->map_alloc = ARRAY_SIZE(dmap);
-		chunk->immutable = true;
-		bitmap_fill(chunk->populated, chunk_pages);
-		chunk->nr_populated = chunk_pages;
-		chunk->nr_empty_pop_pages = chunk_pages;
-
-		chunk->contig_hint = chunk->free_size = dyn_size;
-		chunk->map[0] = 0;
-		chunk->map[1] = chunk->free_size | 1;
-		chunk->map_used = 1;
-
-		chunk->nr_pages = chunk_pages;
+
+		/* set metadata */
+		chunk->contig_hint = pcpu_nr_pages_to_bits(chunk);
+		chunk->free_bits = pcpu_nr_pages_to_bits(chunk);
 	}
 
 	/* link the first chunk in */
@@ -2370,36 +2700,6 @@ void __init setup_per_cpu_areas(void)
 #endif	/* CONFIG_SMP */
 
 /*
- * First and reserved chunks are initialized with temporary allocation
- * map in initdata so that they can be used before slab is online.
- * This function is called after slab is brought up and replaces those
- * with properly allocated maps.
- */
-void __init percpu_init_late(void)
-{
-	struct pcpu_chunk *target_chunks[] =
-		{ pcpu_first_chunk, pcpu_reserved_chunk, NULL };
-	struct pcpu_chunk *chunk;
-	unsigned long flags;
-	int i;
-
-	for (i = 0; (chunk = target_chunks[i]); i++) {
-		int *map;
-		const size_t size = PERCPU_DYNAMIC_EARLY_SLOTS * sizeof(map[0]);
-
-		BUILD_BUG_ON(size > PAGE_SIZE);
-
-		map = pcpu_mem_zalloc(size);
-		BUG_ON(!map);
-
-		spin_lock_irqsave(&pcpu_lock, flags);
-		memcpy(map, chunk->map, size);
-		chunk->map = map;
-		spin_unlock_irqrestore(&pcpu_lock, flags);
-	}
-}
-
-/*
  * Percpu allocator is initialized early during boot when neither slab or
  * workqueue is available.  Plug async management until everything is up
  * and running.
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 10/10] percpu: add optimizations on allocation path for the bitmap allocator
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (8 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 23:32   ` Tejun Heo
  2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik
  10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

This patch adds two optimizations to the allocation path. The first is
to not consider a chunk if the requested allocation cannot fit in the
chunk's contig_hint. The benefit is that this avoids unnecessary scanning
over a chunk as the assumption is that memory pressure is high and
creating a new chunk has minimal consequences. This may fail when the
contig_hint has poor alignment, but again we fall back on the high
memory pressure argument.

The second is just a fail-fast mechanism. When allocating, an offset is
identified within a block and then scanning is used to see if it will
fit. An offset should never be returned unless it is known to fit, so
here we just bound the scanning to the size of a block.
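
As a worked illustration of the first check (illustrative numbers, not
taken from the patch): suppose chunk->contig_hint = 10 and
chunk->contig_hint_start = 3, and a request comes in for bit_size = 6
with an alignment of 8 (all values in PCPU_MIN_ALLOC_SIZE units).  Then

        block_off = ALIGN(3, 8) - 3 = 5
        block_off + bit_size = 5 + 6 = 11 > 10 = contig_hint

so the chunk is skipped without scanning it, even though a 10-bit free
area exists, because nothing in that area starts at a suitably aligned
offset.  The second check simply caps the window passed to
bitmap_find_next_zero_area() at start + bit_size + PCPU_BITMAP_BLOCK_SIZE.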

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 569df63..7496571 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -885,6 +885,12 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
 
 	lockdep_assert_held(&pcpu_lock);
 
+	/* check chunk->contig_hint to see if alloc can fit - see note above */
+	block_off = ALIGN(chunk->contig_hint_start, align) -
+		    chunk->contig_hint_start;
+	if (block_off + bit_size > chunk->contig_hint)
+		return -1;
+
 	cur_free = block_off = 0;
 	s_index = chunk->first_free_block;
 	for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk);
@@ -973,19 +979,23 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size,
 			   size_t align, int start)
 {
 	size_t align_mask = (align) ? (align - 1) : 0;
-	int i, bit_off, oslot;
+	int i, bit_off, end, oslot;
 	struct pcpu_bitmap_md *block;
 
 	lockdep_assert_held(&pcpu_lock);
 
 	oslot = pcpu_chunk_slot(chunk);
 
-	/* search to find fit */
-	bit_off = bitmap_find_next_zero_area(chunk->alloc_map,
-					     pcpu_nr_pages_to_bits(chunk),
-					     start, bit_size, align_mask);
+	/*
+	 * Search to find fit.  The search for the start is limited to
+	 * be within a block_size, but should in reality never be hit
+	 * as the contig_hint should be a valid placement.
+	 */
+	end = start + bit_size + PCPU_BITMAP_BLOCK_SIZE;
+	bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start,
+					     bit_size, align_mask);
 
-	if (bit_off >= pcpu_nr_pages_to_bits(chunk))
+	if (bit_off >= end)
 		return -1;
 
 	/* update alloc map */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned
  2017-07-16  2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou
@ 2017-07-16  4:01   ` kbuild test robot
  2017-07-16  5:11   ` kbuild test robot
  2017-07-17 16:46   ` Tejun Heo
  2 siblings, 0 replies; 37+ messages in thread
From: kbuild test robot @ 2017-07-16  4:01 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: kbuild-all, Tejun Heo, Christoph Lameter, kernel-team,
	linux-kernel, linux-mm, Dennis Zhou

[-- Attachment #1: Type: text/plain, Size: 8175 bytes --]

Hi Dennis,

[auto build test ERROR on percpu/for-next]
[also build test ERROR on v4.13-rc1 next-20170714]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dennis-Zhou/percpu-replace-percpu-area-map-allocator-with-bitmap-allocator/20170716-103337
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/percpu.h:9:0,
                    from include/linux/percpu-rwsem.h:6,
                    from include/linux/fs.h:30,
                    from fs/affs/affs.h:8,
                    from fs/affs/namei.c:11:
   include/linux/percpu.h: In function 'pcpu_align_reserved_region':
>> include/linux/pfn.h:17:46: error: 'PAGE_SIZE' undeclared (first use in this function)
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                 ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
   include/linux/pfn.h:17:46: note: each undeclared identifier is reported only once for each function it appears in
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                 ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
>> include/linux/pfn.h:17:64: error: 'PAGE_MASK' undeclared (first use in this function)
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                                   ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
--
   In file included from include/linux/percpu.h:9:0,
                    from include/linux/percpu-rwsem.h:6,
                    from include/linux/fs.h:30,
                    from fs/ocfs2/file.c:27:
   include/linux/percpu.h: In function 'pcpu_align_reserved_region':
>> include/linux/pfn.h:17:46: error: 'PAGE_SIZE' undeclared (first use in this function)
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                 ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
   include/linux/pfn.h:17:46: note: each undeclared identifier is reported only once for each function it appears in
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                 ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
>> include/linux/pfn.h:17:64: error: 'PAGE_MASK' undeclared (first use in this function)
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                                   ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
   In file included from arch/xtensa/include/asm/atomic.h:21:0,
                    from include/linux/atomic.h:4,
                    from include/linux/debug_locks.h:5,
                    from include/linux/lockdep.h:25,
                    from include/linux/spinlock_types.h:18,
                    from include/linux/spinlock.h:81,
                    from include/linux/wait.h:8,
                    from include/linux/fs.h:5,
                    from fs/ocfs2/file.c:27:
   fs/ocfs2/file.c: In function 'ocfs2_file_write_iter':
   arch/xtensa/include/asm/cmpxchg.h:139:3: warning: value computed is not used [-Wunused-value]
     ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
      ^
   fs/ocfs2/file.c:2341:3: note: in expansion of macro 'xchg'
      xchg(&iocb->ki_complete, saved_ki_complete);
      ^
--
   In file included from include/linux/percpu.h:9:0,
                    from include/linux/context_tracking_state.h:4,
                    from include/linux/vtime.h:4,
                    from include/linux/hardirq.h:7,
                    from include/linux/interrupt.h:12,
                    from drivers/scsi/sym53c8xx_2/sym_glue.h:45,
                    from drivers/scsi/sym53c8xx_2/sym_fw.c:40:
   include/linux/percpu.h: In function 'pcpu_align_reserved_region':
>> include/linux/pfn.h:17:46: error: 'PAGE_SIZE' undeclared (first use in this function)
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                 ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
   include/linux/pfn.h:17:46: note: each undeclared identifier is reported only once for each function it appears in
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                 ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
>> include/linux/pfn.h:17:64: error: 'PAGE_MASK' undeclared (first use in this function)
    #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
                                                                   ^
>> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN'
     return PFN_ALIGN(static_size + reserved_size) - static_size;
            ^
   In file included from drivers/scsi/sym53c8xx_2/sym_glue.h:64:0,
                    from drivers/scsi/sym53c8xx_2/sym_fw.c:40:
   drivers/scsi/sym53c8xx_2/sym_defs.h: At top level:
   drivers/scsi/sym53c8xx_2/sym_defs.h:109:0: warning: "WSR" redefined
     #define   WSR     0x01  /* sta: wide scsi received       [W]*/
    ^
   In file included from arch/xtensa/include/asm/bitops.h:22:0,
                    from include/linux/bitops.h:36,
                    from include/linux/kernel.h:10,
                    from include/linux/list.h:8,
                    from include/linux/wait.h:6,
                    from include/linux/completion.h:11,
                    from drivers/scsi/sym53c8xx_2/sym_glue.h:43,
                    from drivers/scsi/sym53c8xx_2/sym_fw.c:40:
   arch/xtensa/include/asm/processor.h:227:0: note: this is the location of the previous definition
    #define WSR(v,sr) __asm__ __volatile__ ("wsr %0,"__stringify(sr) :: "a"(v));
    ^

vim +/PAGE_SIZE +17 include/linux/pfn.h

947d0496 Jeremy Fitzhardinge 2008-09-11  16  
22a9835c Dave Hansen         2006-03-27 @17  #define PFN_ALIGN(x)	(((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
22a9835c Dave Hansen         2006-03-27  18  #define PFN_UP(x)	(((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
22a9835c Dave Hansen         2006-03-27  19  #define PFN_DOWN(x)	((x) >> PAGE_SHIFT)
947d0496 Jeremy Fitzhardinge 2008-09-11  20  #define PFN_PHYS(x)	((phys_addr_t)(x) << PAGE_SHIFT)
8f235d1a Chen Gang           2016-01-14  21  #define PHYS_PFN(x)	((unsigned long)((x) >> PAGE_SHIFT))
22a9835c Dave Hansen         2006-03-27  22  

:::::: The code at line 17 was first introduced by commit
:::::: 22a9835c350782a5c3257343713932af3ac92ee0 [PATCH] unify PFN_* macros

:::::: TO: Dave Hansen <haveblue@us.ibm.com>
:::::: CC: Linus Torvalds <torvalds@g5.osdl.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50216 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned
  2017-07-16  2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou
  2017-07-16  4:01   ` kbuild test robot
@ 2017-07-16  5:11   ` kbuild test robot
  2017-07-17 16:46   ` Tejun Heo
  2 siblings, 0 replies; 37+ messages in thread
From: kbuild test robot @ 2017-07-16  5:11 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: kbuild-all, Tejun Heo, Christoph Lameter, kernel-team,
	linux-kernel, linux-mm, Dennis Zhou

[-- Attachment #1: Type: text/plain, Size: 4298 bytes --]

Hi Dennis,

[auto build test ERROR on percpu/for-next]
[also build test ERROR on v4.13-rc1 next-20170714]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Dennis-Zhou/percpu-replace-percpu-area-map-allocator-with-bitmap-allocator/20170716-103337
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=ia64 

All errors (new ones prefixed by >>):

   arch/ia64/mm/discontig.c: In function 'setup_per_cpu_areas':
>> arch/ia64/mm/discontig.c:218:10: error: 'PERCPU_MODULE_RSERVE' undeclared (first use in this function)
             PERCPU_MODULE_RSERVE);
             ^~~~~~~~~~~~~~~~~~~~
   arch/ia64/mm/discontig.c:218:10: note: each undeclared identifier is reported only once for each function it appears in

vim +/PERCPU_MODULE_RSERVE +218 arch/ia64/mm/discontig.c

   174	
   175	#ifdef CONFIG_SMP
   176	/**
   177	 * setup_per_cpu_areas - setup percpu areas
   178	 *
   179	 * Arch code has already allocated and initialized percpu areas.  All
   180	 * this function has to do is to teach the determined layout to the
   181	 * dynamic percpu allocator, which happens to be more complex than
   182	 * creating whole new ones using helpers.
   183	 */
   184	void __init setup_per_cpu_areas(void)
   185	{
   186		struct pcpu_alloc_info *ai;
   187		struct pcpu_group_info *uninitialized_var(gi);
   188		unsigned int *cpu_map;
   189		void *base;
   190		unsigned long base_offset;
   191		unsigned int cpu;
   192		ssize_t static_size, reserved_size, dyn_size;
   193		int node, prev_node, unit, nr_units, rc;
   194	
   195		ai = pcpu_alloc_alloc_info(MAX_NUMNODES, nr_cpu_ids);
   196		if (!ai)
   197			panic("failed to allocate pcpu_alloc_info");
   198		cpu_map = ai->groups[0].cpu_map;
   199	
   200		/* determine base */
   201		base = (void *)ULONG_MAX;
   202		for_each_possible_cpu(cpu)
   203			base = min(base,
   204				   (void *)(__per_cpu_offset[cpu] + __per_cpu_start));
   205		base_offset = (void *)__per_cpu_start - base;
   206	
   207		/* build cpu_map, units are grouped by node */
   208		unit = 0;
   209		for_each_node(node)
   210			for_each_possible_cpu(cpu)
   211				if (node == node_cpuid[cpu].nid)
   212					cpu_map[unit++] = cpu;
   213		nr_units = unit;
   214	
   215		/* set basic parameters */
   216		static_size = __per_cpu_end - __per_cpu_start;
   217		reserved_size = pcpu_align_reserved_region(static_size,
 > 218							   PERCPU_MODULE_RSERVE);
   219		dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
   220		if (dyn_size < 0)
   221			panic("percpu area overflow static=%zd reserved=%zd\n",
   222			      static_size, reserved_size);
   223	
   224		ai->static_size		= static_size;
   225		ai->reserved_size	= reserved_size;
   226		ai->dyn_size		= dyn_size;
   227		ai->unit_size		= PERCPU_PAGE_SIZE;
   228		ai->atom_size		= PAGE_SIZE;
   229		ai->alloc_size		= PERCPU_PAGE_SIZE;
   230	
   231		/*
   232		 * CPUs are put into groups according to node.  Walk cpu_map
   233		 * and create new groups at node boundaries.
   234		 */
   235		prev_node = -1;
   236		ai->nr_groups = 0;
   237		for (unit = 0; unit < nr_units; unit++) {
   238			cpu = cpu_map[unit];
   239			node = node_cpuid[cpu].nid;
   240	
   241			if (node == prev_node) {
   242				gi->nr_units++;
   243				continue;
   244			}
   245			prev_node = node;
   246	
   247			gi = &ai->groups[ai->nr_groups++];
   248			gi->nr_units		= 1;
   249			gi->base_offset		= __per_cpu_offset[cpu] + base_offset;
   250			gi->cpu_map		= &cpu_map[unit];
   251		}
   252	
   253		rc = pcpu_setup_first_chunk(ai, base);
   254		if (rc)
   255			panic("failed to setup percpu area (err=%d)", rc);
   256	
   257		pcpu_free_alloc_info(ai);
   258	}
   259	#endif
   260	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 47682 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer
  2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
@ 2017-07-17 14:44   ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:44 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:06PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> Changes the use of a void buffer to an int buffer for clarity.
> 
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Applied to percpu/for-4.14.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 02/10] percpu: change the format for percpu_stats output
  2017-07-16  2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou
@ 2017-07-17 14:46   ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:46 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:07PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> This makes the debugfs output for percpu_stats a little easier
> to read by changing the spacing of the output to be consistent.
> 
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Applied to percpu/for-4.14.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats
  2017-07-16  2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou
@ 2017-07-17 14:47   ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:47 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:08PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> Percpu memory holds a minimum threshold of pages that are populated
> in order to serve atomic percpu memory requests. This change makes it
> easier to verify that there are a minimum number of populated pages
> lying around.
> 
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Applied to percpu/for-4.14.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments
  2017-07-16  2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou
@ 2017-07-17 14:53   ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:53 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:09PM -0400, Dennis Zhou wrote:
>   *  c0                           c1                         c2
>   *  -------------------          -------------------        ------------
>   * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u
>   *  -------------------  ......  -------------------  ....  ------------
> +

Missing '*'.  Added and applied to percpu/for-4.14.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned
  2017-07-16  2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou
  2017-07-16  4:01   ` kbuild test robot
  2017-07-16  5:11   ` kbuild test robot
@ 2017-07-17 16:46   ` Tejun Heo
  2017-07-17 19:10     ` Dennis Zhou
  2017-07-24 20:04     ` Dennis Zhou
  2 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 16:46 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hello,

On Sat, Jul 15, 2017 at 10:23:10PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> Preparatory patch to modify the first chunk's static_size +
> reserved_size to end page aligned. The first chunk has a unique
> allocation scheme overlaying the static, reserved, and dynamic regions.
> The other regions of each chunk are reserved or hidden. The bitmap
> allocator would have to allocate in the bitmap the static region to
> replicate this. By having the reserved region to end page aligned, the
> metadata overhead can be saved. The consequence is that up to an
> additional page of memory will be allocated to the reserved region that
> primarily serves static percpu variables.
> 
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Sans the build warnings, generally looks good to me.  Some nits and
one question below.

> +/*

Should be /**

> + * pcpu_align_reserved_region - page align the end of the reserved region
> + * @static_size: the static region size
> + * @reserved_size: the minimum reserved region size
> + *
> + * This function calculates the size of the reserved region required to
> + * make the reserved region end page aligned.
> + *
> + * Percpu memory offers a maximum alignment of PAGE_SIZE.  Aligning this
> + * minimizes the metadata overhead of overlapping the static, reserved,
> + * and dynamic regions by allowing the metadata for the static region to
> + * not be allocated.  This lets the base_addr be moved up to a page
> + * aligned address and disregard the static region as offsets are allocated.
> + * The beginning of the reserved region will overlap with the static
> + * region if the end of the static region is not page aligned.

Heh, that was pretty difficult to parse, but here's my question.  So,
we're expanding the reserved area so that its end aligns to a page
boundary, which is completely fine.  We may end up with a reserved area
which is a bit larger than specified but no big deal.  However, we can't
do the same thing with the boundary between the static and reserved
chunks, so instead we pull down the start of the reserved area and mark
off the overlapping area, which is fine too.

My question is why we're doing one thing for the end of reserved area
while we need to do a different thing for the beginning of it.  Can't
we do the same thing in both cases?  ie. for the both boundaries
between static and reserved, and reserved and dynamic, pull down the
start to the page boundary and mark the overlapping areas used?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/10] percpu: modify base_addr to be region specific
  2017-07-16  2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou
@ 2017-07-17 18:57   ` Tejun Heo
  2017-07-18 19:26   ` Josef Bacik
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 18:57 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hello,

On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> Originally, the first chunk is served by up to three chunks, each given
> a region they are responsible for. Despite this, the arithmetic was based
> off of the base_addr making it require offsets or be overly inclusive.
> This patch changes percpu checks for first chunk to consider the only
> the dynamic region and the reserved check to be only the reserved
> region. There is no impact here besides making these checks a little
> more accurate.
> 
> This patch also adds the ground work increasing the minimum allocation
> size to 4 bytes. The new field nr_pages in pcpu_chunk will be used to
> keep track of the number of pages the bitmap serves. The arithmetic for
> identifying first chunk and reserved chunk reflect this change.

However small the patch might end up being, I'd much prefer changing
the minimum alloc size to be a separate patch with rationale.

> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index 98a371c..a5cedcd 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -21,6 +21,10 @@
>  /* minimum unit size, also is the maximum supported allocation size */
>  #define PCPU_MIN_UNIT_SIZE		PFN_ALIGN(32 << 10)
>  
> +/* minimum allocation size and shift in bytes */
> +#define PCPU_MIN_ALLOC_SIZE		(1 << PCPU_MIN_ALLOC_SHIFT)
> +#define PCPU_MIN_ALLOC_SHIFT		2

nitpick: Put SHIFT def above SIZE def?

> +/*
> + * Static addresses should never be passed into the allocator.  They
> + * are accessed using the group_offsets and therefore do not rely on
> + * chunk->base_addr.
> + */
>  static bool pcpu_addr_in_first_chunk(void *addr)
>  {
>  	void *first_start = pcpu_first_chunk->base_addr;
>  
> -	return addr >= first_start && addr < first_start + pcpu_unit_size;
> +	return addr >= first_start &&
> +	       addr < first_start +
> +	       pcpu_first_chunk->nr_pages * PAGE_SIZE;

Does the above line need a line break?  If so, it'd probably be easier
to read if the broken line is indented (preferably to align with the
start of the sub expression).  e.g.

	return addr < first_start +
		      pcpu_first_chunk->nr_pages * PAGE_SIZE;

>  static bool pcpu_addr_in_reserved_chunk(void *addr)
>  {
> -	void *first_start = pcpu_first_chunk->base_addr;
> +	void *first_start;
>  
> -	return addr >= first_start &&
> -		addr < first_start + pcpu_reserved_chunk_limit;
> +	if (!pcpu_reserved_chunk)
> +		return false;
> +
> +	first_start = pcpu_reserved_chunk->base_addr;
> +
> +	return addr >= first_start + pcpu_reserved_offset &&
> +	       addr < first_start +
> +	       pcpu_reserved_chunk->nr_pages * PAGE_SIZE;

Ditto on indentation.

> @@ -1366,10 +1388,17 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)
>  	 * The following test on unit_low/high isn't strictly
>  	 * necessary but will speed up lookups of addresses which
>  	 * aren't in the first chunk.
> +	 *
> +	 * The address check is of high granularity checking against full
> +	 * chunk sizes.  pcpu_base_addr points to the beginning of the first
> +	 * chunk including the static region.  This allows us to examine all
> +	 * regions of the first chunk. Assumes good intent as the first
> +	 * chunk may not be full (ie. < pcpu_unit_pages in size).
>  	 */
> -	first_low = pcpu_chunk_addr(pcpu_first_chunk, pcpu_low_unit_cpu, 0);
> -	first_high = pcpu_chunk_addr(pcpu_first_chunk, pcpu_high_unit_cpu,
> -				     pcpu_unit_pages);
> +	first_low = (unsigned long) pcpu_base_addr +
                                   ^
			 no space for type casts

> @@ -1575,6 +1604,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	unsigned int cpu;
>  	int *unit_map;
>  	int group, unit, i;
> +	unsigned long tmp_addr, aligned_addr;
> +	unsigned long map_size_bytes;

How about just map_size or map_bytes?

> @@ -1678,46 +1709,66 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  		INIT_LIST_HEAD(&pcpu_slot[i]);
>  
>  	/*
> -	 * Initialize static chunk.  If reserved_size is zero, the
> -	 * static chunk covers static area + dynamic allocation area
> -	 * in the first chunk.  If reserved_size is not zero, it
> -	 * covers static area + reserved area (mostly used for module
> -	 * static percpu allocation).
> +	 * Initialize static chunk.
> +	 * The static region is dropped as those addresses are already
> +	 * allocated and do not rely on chunk->base_addr.
> +	 * reserved_size == 0:
> +	 *      the static chunk covers the dynamic area
> +	 * reserved_size > 0:
> +	 *      the static chunk covers the reserved area
> +	 *
> +	 * If the static area is not page aligned, the region adjacent
> +	 * to the static area must have its base_addr be offset into
> +	 * the static area to have it be page aligned.  The overlap is
> +	 * then allocated preserving the alignment in the metadata for
> +	 * the actual region.

We can address this later but the static chunk not covering the static
area is kinda confusing.  The original complication came from trying to
make the static chunk service either the reserved or the first dynamic
chunk.  We
don't need that anymore and might as well use separate rchunk and
dchunk.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names
  2017-07-16  2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou
@ 2017-07-17 19:10   ` Tejun Heo
  2017-07-24 20:07     ` Dennis Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 19:10 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:12PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> With moving the base_addr in the chunks responsible for serving the
> first chunk up, the use of schunk/dchunk in pcpu_setup_first_chunk no
> longer makes sense. This makes the linking in the first chunk code not
> rely on a ternary and renames the variables to a shared variable, chunk,
> because the allocation path is sequential.

Ah cool, please disregard my previous comment on the misnomer.  You
can explain in the previous patch's description that a follow-up patch
will resolve the situation tho.

> @@ -1709,13 +1709,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  		INIT_LIST_HEAD(&pcpu_slot[i]);
>  
>  	/*
> +	 * Initialize first chunk.
> +	 * pcpu_first_chunk will always manage the dynamic region of the
> +	 * first chunk.  The static region is dropped as those addresses

Would "not covered by any chunk" be clearer than "dropped"?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned
  2017-07-17 16:46   ` Tejun Heo
@ 2017-07-17 19:10     ` Dennis Zhou
  2017-07-24 20:04     ` Dennis Zhou
  1 sibling, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-17 19:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hi Tejun,

On Mon, Jul 17, 2017 at 12:46:50PM -0400, Tejun Heo wrote:

> > +/*
> 
> Should be /**
> 

I have fixed this in v2.

> > + * pcpu_align_reserved_region - page align the end of the reserved region
> > + * @static_size: the static region size
> > + * @reserved_size: the minimum reserved region size
> > + *
> > + * This function calculates the size of the reserved region required to
> > + * make the reserved region end page aligned.
> > + *
> > + * Percpu memory offers a maximum alignment of PAGE_SIZE.  Aligning this
> > + * minimizes the metadata overhead of overlapping the static, reserved,
> > + * and dynamic regions by allowing the metadata for the static region to
> > + * not be allocated.  This lets the base_addr be moved up to a page
> > + * aligned address and disregard the static region as offsets are allocated.
> > + * The beginning of the reserved region will overlap with the static
> > + * region if the end of the static region is not page aligned.
> 
> Heh, that was pretty difficult to parse, but here's my question.  So,
> we're expanding reserved area so that its end aligns to page boundary
> which is completely fine.  We may end up with reserved area which is a
> bit larger than specified but no big deal.  However, we can't do the
> same thing with the boundary between the static and reserved chunks,
> so instead we pull down the start of the reserved area and mark off
> the overwrapping area, which is fine too.
> 
> My question is why we're doing one thing for the end of reserved area
> while we need to do a different thing for the beginning of it.  Can't
> we do the same thing in both cases?  ie. for the both boundaries
> between static and reserved, and reserved and dynamic, pull down the
> start to the page boundary and mark the overlapping areas used?

I don't have a very good answer to why I chose to do it differently for
the beginning and the end. I think it came down to wanting to maximize
metadata usage at the time.

A benefit to doing it this way is that it clarifies the number of full
pages that will be allocated to the reserved region. For example, if
the reserved region is set to 8KB and the region is offset due to the
static region, the reserved region would only be given one full page.
The first and last page are shared with the static region and dynamic
region respectively. Expanding the reserved region would allocate two
4KB pages to it + the partial at the beginning if the static region is
not aligned. It's not perfect, but it makes alignment slightly easier to
understand for the reserved region.
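
To make that concrete (a sketch with illustrative numbers, assuming 4KB
pages): say static_size = 5000 and reserved_size = 8192.  Without the
alignment, the reserved region spans [5000, 13192) and only
[8192, 12288) is a full page.  With the end pulled up to a page
boundary,

        reserved_size = PFN_ALIGN(5000 + 8192) - 5000
                      = 16384 - 5000 = 11384

so the region spans [5000, 16384): the partial page shared with the
static region plus two full pages.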

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps
  2017-07-16  2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou
@ 2017-07-17 19:26   ` Tejun Heo
  2017-07-24 20:13     ` Dennis Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 19:26 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hello,

On Sat, Jul 15, 2017 at 10:23:13PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> This patch changes the allocator to only mark allocated pages for the
> region the population bitmap is used for. Prior, the bitmap was marked
> completely used as the first chunk was allocated and immutable. This is
> misleading because the first chunk may not be completely filled.
> Additionally, with moving the base_addr up in the previous patch, the
> population map no longer corresponds to what is being checked.

This in isolation makes sense although the rationale isn't clear from
the description.  Is it a mere cleanup or is this needed to enable
further changes?

> pcpu_nr_empty_pop_pages is used to ensure there are a handful of free
> pages around to serve atomic allocations. A new field, nr_empty_pop_pages,
> is added to the pcpu_chunk struct to keep track of the number of empty
> pages. This field is needed as the number of empty populated pages is
> globally kept track of and deltas are used to update it. This new field
> is exposed in percpu_stats.

But I can't see why this is being added or why it is in the same
patch as the previous change.

> Now that chunk->nr_pages is the number of pages the chunk is serving, it
> is nice to use this in the work function for population and freeing of
> chunks rather than use the global variable pcpu_unit_pages.

The same goes for the above part.  It's fine to collect misc changes
into a patch when they're trivial and related in some ways but the
content of this patch seems a bit random.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
@ 2017-07-17 23:27   ` Tejun Heo
  2017-07-24 21:37     ` Dennis Zhou
  2017-07-19 19:11   ` Josef Bacik
  2017-07-19 19:16   ` Josef Bacik
  2 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 23:27 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote:
...
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 milion 4-byte objects with 8-byte alignment.

I think it'd be nice to have a similar table for allocation patterns
which aren't ideal for the original allocator.  The biggest goal is
avoiding the cases where the allocator collapses, and just glancing at
the current table isn't very compelling on that front.

>  /*
> + * This determines the size of each metadata block.  There are several subtle
> + * constraints around this variable.  The reserved_region and dynamic_region
                              ^
			      constant
			      
> + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE.  This is
> + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks
> + * that are backing multiple pages, this needs to be accounted for.
> + */
> +#define PCPU_BITMAP_BLOCK_SIZE		(PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)

Given that the percpu allocator can align only up to a page, the
restriction makes sense to me.  I'm kinda curious whether PAGE_SIZE
blocks are optimal tho.  Why did you pick PAGE_SIZE?
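
For reference, that definition works out to exactly one page worth of
bits per block (illustrative arithmetic, assuming 4KB pages and the
4-byte minimum allocation):

        PCPU_BITMAP_BLOCK_SIZE = PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT
                               = 4096 >> 2 = 1024 bits
        1024 bits * 4 bytes per bit = 4096 bytes = PAGE_SIZE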

> @@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk;
>  extern struct pcpu_chunk *pcpu_reserved_chunk;
>  extern unsigned long pcpu_reserved_offset;
>  
> +/*
   ^^
   /**

Ditto for other comments.

> + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks
> + * @chunk: chunk of interest
> + *
> + * This conversion is from the number of physical pages that the chunk
> + * serves to the number of bitmap blocks required.  It converts to bytes
> + * served to bits required and then blocks used.
> + */
> +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk)

Maybe just pcpu_chunk_nr_blocks()?

> +{
> +	return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE /
> +	       PCPU_BITMAP_BLOCK_SIZE;
> +}
> +
> +/*
> + * pcpu_pages_to_bits - converts the pages to size of bitmap
> + * @pages: number of physical pages
> + *
> + * This conversion is from physical pages to the number of bits
> + * required in the bitmap.
> + */
> +static inline int pcpu_pages_to_bits(int pages)

pcpu_nr_pages_to_map_bits()?

> +{
> +	return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
> +}
> +
> +/*
> + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap
> + * @chunk: chunk of interest
> + *
> + * This conversion is from the number of physical pages that the chunk
> + * serves to the number of bits in the bitmap.
> + */
> +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk)

pcpu_chunk_map_bits()?

> @@ -86,10 +90,13 @@
>  
>  #include "percpu-internal.h"
>  
> -#define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
> -#define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
> -#define PCPU_ATOMIC_MAP_MARGIN_LOW	32
> -#define PCPU_ATOMIC_MAP_MARGIN_HIGH	64
> +/*
> + * The metadata is managed in terms of bits with each bit mapping to
> + * a fragment of size PCPU_MIN_ALLOC_SIZE.  Thus, the slots are calculated
> + * with respect to the number of bits available.
> + */
> +#define PCPU_SLOT_BASE_SHIFT		3

Ah, so this is actually the same as before 3 + 2, order 5.  Can you
please note the explicit number in the comment?

>  #define PCPU_EMPTY_POP_PAGES_LOW	2
>  #define PCPU_EMPTY_POP_PAGES_HIGH	4

and these numbers too.  I can't tell how these numbers would map.
Also, any chance we can have these numbers in a more intuitive unit?
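
For what it's worth, the slot mapping does look preserved by
construction (a sketch using the values quoted above): a size of N
bytes becomes N >> PCPU_MIN_ALLOC_SHIFT = N >> 2 bits, so fls() of the
bit count is two lower than fls() of the byte count, and dropping
PCPU_SLOT_BASE_SHIFT from 5 to 3 yields the same slot, e.g. for a
64-byte (16-bit) free area:

        old: max(fls(64) - 5 + 2, 1) = max(4, 1) = 4
        new: max(fls(16) - 3 + 2, 1) = max(4, 1) = 4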

> @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr)
>  	       pcpu_reserved_chunk->nr_pages * PAGE_SIZE;
>  }
>  
> -static int __pcpu_size_to_slot(int size)
> +static int __pcpu_size_to_slot(int bit_size)

Wouldn't sth like @map_bits more intuitive than @bit_size?  We can
just use @bits too.

>  {
> -	int highbit = fls(size);	/* size is in bytes */
> +	int highbit = fls(bit_size);	/* size is in bits */
>  	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
>  }
>  
> -static int pcpu_size_to_slot(int size)
> +static int pcpu_size_to_slot(int bit_size)

Ditto.

> +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk)
> +{
> +	bool is_page_empty = true;
> +	int i, off, cur_contig, nr_empty_pop_pages, l_pop_off;
> +	struct pcpu_bitmap_md *block;
> +
> +	chunk->contig_hint = cur_contig = 0;
> +	off = nr_empty_pop_pages = 0;
> +	l_pop_off = pcpu_block_get_first_page(chunk->first_free_block);
> +
> +	for (i = chunk->first_free_block, block = chunk->md_blocks + i;
> +	     i < pcpu_nr_pages_to_blocks(chunk); i++, block++) {
> +		/* Manage nr_empty_pop_pages.

		The first line of a winged comment should be blank, so...

		/*
		 * Manage nr_empty_pop_pages.

> +		 *
> +		 * This is tricky.  So the the background work function is
                                       ^^^^^^^

> +		 * triggered when there are not enough free populated pages.
> +		 * This is necessary to make sure atomic allocations can
> +		 * succeed.
> +		 *
> +		 * The first page of each block is kept track of here allowing
> +		 * this to scale in both situations where there are > 1 page
> +		 * per block and where a block may be a portion of a page.
> +		 */
> +		int pop_off = pcpu_block_get_first_page(i);
> +
> +		if (pop_off > l_pop_off) {
> +			if (is_page_empty)
> +				nr_empty_pop_pages +=
> +					pcpu_cnt_pop_pages(chunk, l_pop_off,
> +							   pop_off);

IIUC, this is trying to handle multi-page block size, right?

> +			l_pop_off = pop_off;
> +			is_page_empty = true;
> +		}
> +		if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE)

But isn't this assuming that each block is page sized?

> +			is_page_empty = false;
>  
> +		/* continue from prev block adding to the cur_contig hint */
> +		if (cur_contig) {
> +			cur_contig += block->left_free;
> +			if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
> +				continue;
> +			} else if (cur_contig > chunk->contig_hint) {

The "else" here is superflous, right?  The if block always continues.

> +				chunk->contig_hint = cur_contig;
> +				chunk->contig_hint_start = off;
>  			}
> +			cur_contig = 0;
>  		}
> +		/* check if the block->contig_hint is larger */
> +		if (block->contig_hint > chunk->contig_hint) {
> +			chunk->contig_hint = block->contig_hint;
> +			chunk->contig_hint_start =
> +				pcpu_block_off_to_off(i,
> +						      block->contig_hint_start);
> +		}
> +		/* let the next iteration catch the right_free */
> +		cur_contig = block->right_free;
> +		off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free;
>  	}
>  
> +	/* catch last iteration if the last block ends with free space */
> +	if (cur_contig > chunk->contig_hint) {
> +		chunk->contig_hint = cur_contig;
> +		chunk->contig_hint_start = off;
> +	}
>  
> +	/*
> +	 * Keep track of nr_empty_pop_pages.
> +	 *
> +	 * The chunk is maintains the previous number of free pages it held,
> +	 * so the delta is used to update the global counter.  The reserved
> +	 * chunk is not part of the free page count as they are populated
> +	 * at init and are special to serving reserved allocations.
> +	 */
> +	if (is_page_empty) {
> +		nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off,
> +							 chunk->nr_pages);
> +	}

Unnecessary {}.

> +	if (chunk != pcpu_reserved_chunk)
> +		pcpu_nr_empty_pop_pages +=
> +			(nr_empty_pop_pages - chunk->nr_empty_pop_pages);
> +	chunk->nr_empty_pop_pages = nr_empty_pop_pages;
>  }

I am really not a big fan of the above implementation.  All it wants
to do is calculate the biggest free span and count unpopulated pages
in the chunk.  There's gotta be a more readable way to implement this.
For example, would it be possible to implement a span iterator over a
chunk which walks the free spans of the chunk so that the above function
can do something similar to the following?

	for_each_free_span(blah blah) {
		nr_pop_free += count populated whole pages in the span;
		update contig hint;
	}

It's a span iteration problem.  When abstracted properly, it shouldn't
be too difficult to follow.
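
Something along these lines might work, modeled on the existing
pcpu_for_each_unpop_region() helpers in mm/percpu.c (a rough sketch
only; pcpu_next_free() and pcpu_for_each_free_span() are made-up names,
and it assumes pcpu_cnt_pop_pages() takes (chunk, page_start, page_end)
as in the patch):

static void pcpu_next_free(struct pcpu_chunk *chunk, int *rs, int *re,
                           int end)
{
        *rs = find_next_zero_bit(chunk->alloc_map, end, *rs);
        *re = find_next_bit(chunk->alloc_map, end, *rs + 1);
}

#define pcpu_for_each_free_span(chunk, rs, re, start, end)                 \
        for ((rs) = (start), pcpu_next_free((chunk), &(rs), &(re), (end)); \
             (rs) < (re);                                                  \
             (rs) = (re) + 1, pcpu_next_free((chunk), &(rs), &(re), (end)))

Then the hint update could look roughly like:

        int rs, re, nr_empty_pop_pages = 0;

        chunk->contig_hint = 0;
        pcpu_for_each_free_span(chunk, rs, re, 0,
                                pcpu_nr_pages_to_bits(chunk)) {
                /* whole pages fully contained in the free span [rs, re) */
                int s_page = PFN_UP(rs * PCPU_MIN_ALLOC_SIZE);
                int e_page = PFN_DOWN(re * PCPU_MIN_ALLOC_SIZE);

                if (s_page < e_page)
                        nr_empty_pop_pages +=
                                pcpu_cnt_pop_pages(chunk, s_page, e_page);

                if (re - rs > chunk->contig_hint) {
                        chunk->contig_hint = re - rs;
                        chunk->contig_hint_start = rs;
                }
        }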

>  /**
> + * pcpu_block_update_hint
>   * @chunk: chunk of interest
> + * @index: block index of the metadata block
>   *
> + * Full scan over the entire block to recalculate block-level metadata.
> + */
> +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
> +{
> +	unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
> +	struct pcpu_bitmap_md *block = chunk->md_blocks + index;
> +	bool is_left_free = false, is_right_free = false;
> +	int contig;
> +	unsigned long start, end;
> +
> +	block->contig_hint = 0;
> +	start = end = block->first_free;
> +	while (start < PCPU_BITMAP_BLOCK_SIZE) {
> +		/*
> +		 * Scans the allocation map corresponding to this block
> +		 * to find free fragments and update metadata accordingly.
> +		 */
> +		start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE,
> +					   start);
> +		if (start >= PCPU_BITMAP_BLOCK_SIZE)
> +			break;

It's a lot simpler here, but this too might read better with an
appropriate iteration abstraction.

> +		/* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */
> +		end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start);

This is by no means a hard rule, but it's often easier on the eyes
to have a blank line when code and comments are packed like this.

> +		/* update left_free */
> +		contig = end - start;
> +		if (start == 0) {
> +			block->left_free = contig;
> +			is_left_free = true;
> +		}
> +		/* update right_free */
> +		if (end == PCPU_BITMAP_BLOCK_SIZE) {
> +			block->right_free = contig;
> +			is_right_free = true;
> +		}
> +		/* update block contig_hints */
> +		if (block->contig_hint < contig) {
> +			block->contig_hint = contig;
> +			block->contig_hint_start = start;
> +		}
> +		start = end;
> +	}
> +
> +	/* clear left/right free hints */
> +	if (!is_left_free)
> +		block->left_free = 0;
> +	if (!is_right_free)
> +		block->right_free = 0;

Hmm... why do we need is_left/right_free?  Can't we reset them to zero
at the top and update directly during the loop?
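
Something like this, i.e. the same scan with the flags dropped (rough
sketch, untested):

	static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
	{
		unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
		struct pcpu_bitmap_md *block = chunk->md_blocks + index;
		unsigned long start, end;
		int contig;

		/* reset all hints up front; the scan below rebuilds them */
		block->contig_hint = 0;
		block->left_free = 0;
		block->right_free = 0;

		start = end = block->first_free;
		while (start < PCPU_BITMAP_BLOCK_SIZE) {
			start = find_next_zero_bit(alloc_map,
						   PCPU_BITMAP_BLOCK_SIZE,
						   start);
			if (start >= PCPU_BITMAP_BLOCK_SIZE)
				break;

			end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE,
					    start);

			contig = end - start;
			if (start == 0)
				block->left_free = contig;
			if (end == PCPU_BITMAP_BLOCK_SIZE)
				block->right_free = contig;
			if (contig > block->contig_hint) {
				block->contig_hint = contig;
				block->contig_hint_start = start;
			}

			start = end;
		}
	}

The flags only existed to defer the zeroing to the end of the scan, so
resetting up front gives the same result.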

> +static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
> +					 int bit_size)
>  {
> +	bool update_chunk = false;
> +	int i;
> +	int s_index, e_index, s_off, e_off;
> +	struct pcpu_bitmap_md *s_block, *e_block, *block;
>  
> +	/* calculate per block offsets */
> +	s_index = pcpu_off_to_block_index(bit_off);
> +	e_index = pcpu_off_to_block_index(bit_off + bit_size);
> +	s_off = pcpu_off_to_block_off(bit_off);
> +	e_off = pcpu_off_to_block_off(bit_off + bit_size);
>  
> +	/*
> +	 * If the offset is the beginning of the next block, set it to the
> +	 * end of the previous block as the last bit is exclusive.
> +	 */
> +	if (e_off == 0) {
> +		e_off = PCPU_BITMAP_BLOCK_SIZE;
> +		e_index--;
> +	}
>  
> +	s_block = chunk->md_blocks + s_index;
> +	e_block = chunk->md_blocks + e_index;
>  
> +	/*
> +	 * Update s_block.
> +	 *
> +	 * block->first_free must be updated if the allocation takes its place.
> +	 * If the allocation breaks the contig_hint, a scan is required to
> +	 * restore this hint.
> +	 */
> +	if (s_off == s_block->first_free)
> +		s_block->first_free = find_next_zero_bit(
> +					pcpu_index_alloc_map(chunk, s_index),
> +					PCPU_BITMAP_BLOCK_SIZE,
> +					s_off + bit_size);
> +
> +	if (s_off >= s_block->contig_hint_start &&
> +	    s_off < s_block->contig_hint_start + s_block->contig_hint) {
> +		pcpu_block_refresh_hint(chunk, s_index);
> +	} else {
> +		/* update left and right contig manually */
> +		s_block->left_free = min(s_block->left_free, s_off);
> +		if (s_index == e_index)
> +			s_block->right_free = min_t(int, s_block->right_free,
> +					PCPU_BITMAP_BLOCK_SIZE - e_off);
> +		else
> +			s_block->right_free = 0;
> +	}
>  
> +	/*
> +	 * Update e_block.
> +	 * If they are different, then e_block's first_free is guaranteed to
> +	 * be the extend of e_off.  first_free must be updated and a scan
> +	 * over e_block is issued.
> +	 */
> +	if (s_index != e_index) {
> +		e_block->first_free = find_next_zero_bit(
> +				pcpu_index_alloc_map(chunk, e_index),
> +				PCPU_BITMAP_BLOCK_SIZE, e_off);
>  
> +		pcpu_block_refresh_hint(chunk, e_index);
> +	}
>  
> +	/* update in-between md_blocks */
> +	for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index;
> +	     i++, block++) {
> +		block->contig_hint = 0;
> +		block->left_free = 0;
> +		block->right_free = 0;
> +	}
>  
>  	/*
> +	 * The only time a full chunk scan is required is if the global
> +	 * contig_hint is broken.  Otherwise, it means a smaller space
> +	 * was used and therefore the global contig_hint is still correct.
>  	 */
> +	if (bit_off >= chunk->contig_hint_start &&
> +	    bit_off < chunk->contig_hint_start + chunk->contig_hint)
> +		update_chunk = true;
>  
> +	return update_chunk;

@update_chunk seems unnecessary.
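
i.e. the tail could just be (sketch):

	/*
	 * A full chunk rescan is only needed if this allocation landed
	 * inside the current chunk-level contig_hint.
	 */
	return bit_off >= chunk->contig_hint_start &&
	       bit_off < chunk->contig_hint_start + chunk->contig_hint;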

> +static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
> +					int bit_size)
>  {
> +	bool update_chunk = false;
> +	int i;
> +	int s_index, e_index, s_off, e_off;
> +	int start, end, contig;
> +	struct pcpu_bitmap_md *s_block, *e_block, *block;
>  
> +	/* calculate per block offsets */
> +	s_index = pcpu_off_to_block_index(bit_off);
> +	e_index = pcpu_off_to_block_index(bit_off + bit_size);
> +	s_off = pcpu_off_to_block_off(bit_off);
> +	e_off = pcpu_off_to_block_off(bit_off + bit_size);
> +
> +	/*
> +	 * If the offset is the beginning of the next block, set it to the
> +	 * end of the previous block as the last bit is exclusive.
> +	 */
> +	if (e_off == 0) {
> +		e_off = PCPU_BITMAP_BLOCK_SIZE;
> +		e_index--;
> +	}

So, if you do the above with inclusive range, it becomes

	s_index = pcpu_off_to_block_index(start_bit);
	e_index = pcpu_off_to_block_index(end_bit - 1);
	s_off = pcpu_off_to_block_off(start_bit);
	e_off = pcpu_off_to_block_off(end_bit - 1) + 1;

and you can just comment that you're using an inclusive range so that the
e_index always points to the last block in the range.  Wouldn't that
be easier?  People do use inclusive ranges for these sorts of
calculations.

> +	s_block = chunk->md_blocks + s_index;
> +	e_block = chunk->md_blocks + e_index;
> +
> +	/*
> +	 * Check if the freed area aligns with the block->contig_hint.
> +	 * If it does, then the scan to find the beginning/end of the
> +	 * larger free area can be avoided.
> +	 *
> +	 * start and end refer to beginning and end of the free region
> +	 * within each their respective blocks.  This is not necessarily
> +	 * the entire free region as it may span blocks past the beginning
> +	 * or end of the block.
> +	 */
> +	start = s_off;
> +	if (s_off == s_block->contig_hint + s_block->contig_hint_start) {
> +		start = s_block->contig_hint_start;
> +	} else {
> +		int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index),
> +					  start);
> +		start = (start == l_bit) ? 0 : l_bit + 1;
> +	}
> +
> +	end = e_off;
> +	if (e_off == e_block->contig_hint_start)
> +		end = e_block->contig_hint_start + e_block->contig_hint;
> +	else
> +		end = find_next_bit(pcpu_index_alloc_map(chunk, e_index),
> +				    PCPU_BITMAP_BLOCK_SIZE, end);
>  
> +	/* freeing in the same block */
> +	if (s_index == e_index) {
> +		contig = end - start;
>  
> +		if (start == 0)
> +			s_block->left_free = contig;
>  
> +		if (end == PCPU_BITMAP_BLOCK_SIZE)
> +			s_block->right_free = contig;
> +
> +		s_block->first_free = min(s_block->first_free, start);
> +		if (contig > s_block->contig_hint) {
> +			s_block->contig_hint = contig;
> +			s_block->contig_hint_start = start;
> +		}
> +
> +	} else {
>  		/*
> +		 * Freeing across md_blocks.
> +		 *
> +		 * If the start is at the beginning of the block, just
> +		 * reset the block instead.
>  		 */
> +		if (start == 0) {

The above comment can be moved here and can lose the "if" in the sentence,
i.e. "As the start is ..., just .."

> +			s_index--;
> +		} else {
> +			/*
> +			 * Knowing that the free is across blocks, this means
> +			 * the hint can be updated on the right side and the
> +			 * left side does not need to be touched.
> +			 */
> +			s_block->first_free = min(s_block->first_free, start);
> +			contig = PCPU_BITMAP_BLOCK_SIZE - start;
> +			s_block->right_free = contig;
> +			if (contig > s_block->contig_hint) {
> +				s_block->contig_hint = contig;
> +				s_block->contig_hint_start = start;
> +			}
> +		}

Blank line, please.

> +		/*
> +		 * If end is the entire e_block, just reset the block
> +		 * as well.
> +		 */
> +		if (end == PCPU_BITMAP_BLOCK_SIZE) {

ditto

> +			e_index++;
> +		} else {
> +			/*
> +			 * The hint must only be on the left side, so
> +			 * update accordingly.
> +			 */
> +			e_block->first_free = 0;
> +			e_block->left_free = end;
> +			if (end > e_block->contig_hint) {
> +				e_block->contig_hint = end;
> +				e_block->contig_hint_start = 0;
> +			}
> +		}
> +
> +		/* reset md_blocks in the middle */
> +		for (i = s_index + 1, block = chunk->md_blocks + i;
> +		     i < e_index; i++, block++) {

How about something like the following?  It's kinda weird to have an
extra loop var which isn't really used for anything.  The same goes
for other places too.

		for (block = chunk->md_blocks + s_index + 1;
		     block < chunk->md_blocks + e_index; block++)

> +			block->first_free = 0;
> +			block->contig_hint_start = 0;
> +			block->contig_hint = PCPU_BITMAP_BLOCK_SIZE;
> +			block->left_free = PCPU_BITMAP_BLOCK_SIZE;
> +			block->right_free = PCPU_BITMAP_BLOCK_SIZE;
> +		}
>  	}
> +
> +	/*
> +	 * The hint is only checked in the s_block and e_block when
> +	 * freeing and particularly only when it is self contained within
> +	 * its own block.  A scan is required if the free space spans
> +	 * blocks or makes a block whole as the scan will take into
> +	 * account free space across blocks.
> +	 */
> +	if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) ||
> +	    s_index != e_index) {
> +		update_chunk = true;
> +	} else if (s_block->contig_hint > chunk->contig_hint) {
> +		/* check if block contig_hint is bigger */
> +		chunk->contig_hint = s_block->contig_hint;
> +		chunk->contig_hint_start =
> +			pcpu_block_off_to_off(s_index,
> +					      s_block->contig_hint_start);
> +	}
> +
> +	return update_chunk;

Ditto with @update_chunk.

> +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
> +			       size_t align, bool pop_only)
>  {
> +	int i, cur_free;
> +	int s_index, block_off, next_index, end_off; /* interior alloc index */
> +	struct pcpu_bitmap_md *block;
> +	unsigned long *alloc_map;
>  
> +	lockdep_assert_held(&pcpu_lock);
>  
> +	cur_free = block_off = 0;
> +	s_index = chunk->first_free_block;
> +	for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk);
> +	     i++) {
> +		alloc_map = pcpu_index_alloc_map(chunk, i);
> +		block = chunk->md_blocks + i;
> +
> +		/* continue from prev block */
> +		cur_free += block->left_free;
> +		if (cur_free >= bit_size) {
> +			end_off = bit_size;
> +			goto check_populated;
> +		} else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
>  			continue;
>  		}
>  
>  		/*
> +		 * Can this block hold this alloc?
> +		 *
> +		 * Here the block->contig_hint is used to guarantee a fit,
> +		 * but the block->first_free is returned as we may be able
> +		 * to serve the allocation earlier.  The population check
> +		 * must take into account the area beginning at first_free
> +		 * through the end of the contig_hint.
>  		 */
> +		cur_free = 0;
> +		s_index = i;
> +		block_off = ALIGN(block->contig_hint_start, align);
> +		block_off -= block->contig_hint_start;
> +		if (block->contig_hint >= block_off + bit_size) {
> +			block_off = block->first_free;
> +			end_off = block->contig_hint_start - block_off +
> +				  bit_size;
> +			goto check_populated;
>  		}
>  
> +		/* check right */
> +		block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free,
> +				  align);
> +		/* reset to start looking in the next block */
> +		if (block_off >= PCPU_BITMAP_BLOCK_SIZE) {
> +			s_index++;
> +			cur_free = block_off = 0;
> +			continue;
>  		}
> +		cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off;
> +		if (cur_free >= bit_size) {
> +			end_off = bit_size;
> +check_populated:
> +			if (!pop_only ||
> +			    pcpu_is_populated(chunk, s_index, block_off,
> +					      end_off, &next_index))
> +				break;
>  
> +			i = next_index - 1;
> +			s_index = next_index;
> +			cur_free = block_off = 0;
>  		}
> +	}
>  
> +	/* nothing found */
> +	if (i == pcpu_nr_pages_to_blocks(chunk))
> +		return -1;
>  
> +	return s_index * PCPU_BITMAP_BLOCK_SIZE + block_off;
> +}

Wouldn't this function be a lot simpler too if there were free span
iterator?

> +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size,
> +			   size_t align, int start)
> +{
> +	size_t align_mask = (align) ? (align - 1) : 0;
> +	int i, bit_off, oslot;
> +	struct pcpu_bitmap_md *block;
> +
> +	lockdep_assert_held(&pcpu_lock);
> +
> +	oslot = pcpu_chunk_slot(chunk);
> +
> +	/* search to find fit */
> +	bit_off = bitmap_find_next_zero_area(chunk->alloc_map,
> +					     pcpu_nr_pages_to_bits(chunk),
> +					     start, bit_size, align_mask);
> +
> +	if (bit_off >= pcpu_nr_pages_to_bits(chunk))
> +		return -1;
> +
> +	/* update alloc map */
> +	bitmap_set(chunk->alloc_map, bit_off, bit_size);

blank line

> +	/* update boundary map */
> +	set_bit(bit_off, chunk->bound_map);
> +	bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1);
> +	set_bit(bit_off + bit_size, chunk->bound_map);
> +
> +	chunk->free_bits -= bit_size;
> +
> +	if (pcpu_block_update_hint_alloc(chunk, bit_off, bit_size))
> +		pcpu_chunk_update_hint(chunk);
> +
> +	/* update chunk first_free */
> +	for (i = chunk->first_free_block, block = chunk->md_blocks + i;
> +	     i < pcpu_nr_pages_to_blocks(chunk); i++, block++)
> +		if (block->contig_hint != 0)
> +			break;
> +
> +	chunk->first_free_block = i;
>  
>  	pcpu_chunk_relocate(chunk, oslot);
>  
> +	return bit_off * PCPU_MIN_ALLOC_SIZE;
>  }
>  
>  /**
> + * pcpu_free_area - frees the corresponding offset
>   * @chunk: chunk of interest
> + * @off: addr offset into chunk
>   *
> + * This function determines the size of an allocation to free using
> + * the boundary bitmap and clears the allocation map.  A block metadata
> + * update is triggered and potentially a chunk update occurs.
>   */
> +static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
>  {
> +	int bit_off, bit_size, index, end, oslot;
> +	struct pcpu_bitmap_md *block;
>  
>  	lockdep_assert_held(&pcpu_lock);
>  	pcpu_stats_area_dealloc(chunk);
>  
> +	oslot = pcpu_chunk_slot(chunk);
>  
> +	bit_off = off / PCPU_MIN_ALLOC_SIZE;
>  
> +	/* find end index */
> +	end = find_next_bit(chunk->bound_map, pcpu_nr_pages_to_bits(chunk),
> +			    bit_off + 1);
> +	bit_size = end - bit_off;
>  
> +	bitmap_clear(chunk->alloc_map, bit_off, bit_size);
>  
> +	chunk->free_bits += bit_size;
> +
> +	/* update first_free */
> +	index = pcpu_off_to_block_index(bit_off);
> +	block = chunk->md_blocks + index;
> +	block->first_free = min_t(int, block->first_free,
> +				  bit_off % PCPU_BITMAP_BLOCK_SIZE);
> +
> +	chunk->first_free_block = min(chunk->first_free_block, index);
> +
> +	if (pcpu_block_update_hint_free(chunk, bit_off, bit_size))
> +		pcpu_chunk_update_hint(chunk);

Do we ever not update chunk hint when block hint indicates that it's
necessary?  If not, maybe just call it from the previous function?
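
If the answer is no, the tail of pcpu_block_update_hint_free() could do
the chunk-level update itself and the function could become void; very
roughly (only the tail shown, untested):

	/*
	 * A free that spans blocks or empties a whole block needs a full
	 * rescan of the chunk hint; otherwise comparing against s_block
	 * is enough.
	 */
	if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) ||
	    s_index != e_index) {
		pcpu_chunk_update_hint(chunk);
	} else if (s_block->contig_hint > chunk->contig_hint) {
		chunk->contig_hint = s_block->contig_hint;
		chunk->contig_hint_start =
			pcpu_block_off_to_off(s_index,
					      s_block->contig_hint_start);
	}

pcpu_free_area() (and similarly pcpu_alloc_area()) would then just call
the hint update without having to check a return value.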

>  static void pcpu_free_chunk(struct pcpu_chunk *chunk)
>  {
>  	if (!chunk)
>  		return;
> +	pcpu_mem_free(chunk->md_blocks);
> +	pcpu_mem_free(chunk->bound_map);
> +	pcpu_mem_free(chunk->alloc_map);
>  	pcpu_mem_free(chunk);
>  }
>  
> @@ -787,6 +1179,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk,
>  
>  	bitmap_set(chunk->populated, page_start, nr);
>  	chunk->nr_populated += nr;
> +	chunk->nr_empty_pop_pages += nr;
>  	pcpu_nr_empty_pop_pages += nr;
>  }
>  
> @@ -809,6 +1202,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk,
>  
>  	bitmap_clear(chunk->populated, page_start, nr);
>  	chunk->nr_populated -= nr;
> +	chunk->nr_empty_pop_pages -= nr;
>  	pcpu_nr_empty_pop_pages -= nr;
>  }

Didn't we add this field in an earlier patch?  Do the above changes
belong in this patch?

> @@ -890,19 +1284,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
>  	struct pcpu_chunk *chunk;
>  	const char *err;
>  	bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
> +	int slot, off, cpu, ret;
>  	unsigned long flags;
>  	void __percpu *ptr;
> +	size_t bit_size, bit_align;
>  
>  	/*
> +	 * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE.
> +	 * Therefore alignment must be a minimum of that many bytes as well
> +	 * as the allocation will have internal fragmentation from
> +	 * rounding up by up to PCPU_MIN_ALLOC_SIZE - 1 bytes.
>  	 */
> +	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
> +		align = PCPU_MIN_ALLOC_SIZE;
> +	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
> +	bit_size = size >> PCPU_MIN_ALLOC_SHIFT;
> +	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;

Shouldn't the above have happened earlier when MIN_ALLOC_SIZE was
introduced?

> @@ -1363,15 +1710,15 @@ bool is_kernel_percpu_address(unsigned long addr)
>   * address.  The caller is responsible for ensuring @addr stays valid
>   * until this function finishes.
>   *
> - * percpu allocator has special setup for the first chunk, which currently
> + * Percpu allocator has special setup for the first chunk, which currently
>   * supports either embedding in linear address space or vmalloc mapping,
>   * and, from the second one, the backing allocator (currently either vm or
>   * km) provides translation.
>   *
>   * The addr can be translated simply without checking if it falls into the
> - * first chunk. But the current code reflects better how percpu allocator
> + * first chunk.  But the current code reflects better how percpu allocator
>   * actually works, and the verification can discover both bugs in percpu
> - * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current
> + * allocator itself and per_cpu_ptr_to_phys() callers.  So we keep current

Let's please move out what can be to other patches.  This patch is big
enough as it is.

> @@ -1417,9 +1764,10 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)
>  		else
>  			return page_to_phys(vmalloc_to_page(addr)) +
>  			       offset_in_page(addr);
> -	} else
> +	} else {
>  		return page_to_phys(pcpu_addr_to_page(addr)) +
>  		       offset_in_page(addr);
> +	}

Ditto.

> @@ -1555,10 +1903,12 @@ static void pcpu_dump_alloc_info(const char *lvl,
>   * static areas on architectures where the addressing model has
>   * limited offset range for symbol relocations to guarantee module
>   * percpu symbols fall inside the relocatable range.
> + * @ai->static_size + @ai->reserved_size is expected to be page aligned.
>   *
>   * @ai->dyn_size determines the number of bytes available for dynamic
> - * allocation in the first chunk.  The area between @ai->static_size +
> - * @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused.
> + * allocation in the first chunk. Both the start and the end are expected
> + * to be page aligned. The area between @ai->static_size + @ai->reserved_size
> + * + @ai->dyn_size and @ai->unit_size is unused.
     ^^^
     contam

>   *
>   * @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE
>   * and equal to or larger than @ai->static_size + @ai->reserved_size +
> @@ -1581,11 +1931,11 @@ static void pcpu_dump_alloc_info(const char *lvl,
>   * copied static data to each unit.
>   *
>   * If the first chunk ends up with both reserved and dynamic areas, it
> - * is served by two chunks - one to serve the core static and reserved
> - * areas and the other for the dynamic area.  They share the same vm
> - * and page map but uses different area allocation map to stay away
> - * from each other.  The latter chunk is circulated in the chunk slots
> - * and available for dynamic allocation like any other chunks.
> + * is served by two chunks - one to serve the reserved area and the other
> + * for the dynamic area.  They share the same vm and page map but use
> + * different area allocation map to stay away from each other.  The latter
> + * chunk is circulated in the chunk slots and available for dynamic allocation
> + * like any other chunks.

ditto

> @@ -1703,7 +2051,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	 * Allocate chunk slots.  The additional last slot is for
>  	 * empty chunks.
>  	 */
> -	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
> +	pcpu_nr_slots = __pcpu_size_to_slot(
> +				pcpu_pages_to_bits(pcpu_unit_pages)) + 2;

I get that we wanna be using bits inside the area allocator proper but
can we keep things outside in bytes?  These things don't really have
anything to do with what granularity the area allocator is operating
at.
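
For example, the slot sizing could stay byte-based as it was before this
patch (sketch, assuming pcpu_unit_size is still maintained):

	/*
	 * Slot classification is about free bytes in a chunk; only the
	 * bitmap internals need to think in bits.
	 */
	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;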

> @@ -1727,69 +2076,50 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	tmp_addr = (unsigned long)base_addr + ai->static_size;
>  	aligned_addr = tmp_addr & PAGE_MASK;
>  	pcpu_reserved_offset = tmp_addr - aligned_addr;
> +	begin_fill_bits = pcpu_reserved_offset / PCPU_MIN_ALLOC_SIZE;
>  
>  	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
>  			 pcpu_reserved_offset;
> +
>  	chunk_pages = map_size_bytes >> PAGE_SHIFT;
>  
>  	/* chunk adjacent to static region allocation */
> +	chunk = pcpu_alloc_first_chunk(chunk_pages);
>  	chunk->base_addr = (void *)aligned_addr;
>  	chunk->immutable = true;
>  
> +	/* set metadata */
> +	chunk->contig_hint = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits;
> +	chunk->free_bits = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits;
>  
> +	/*
> +	 * If the beginning of the reserved region overlaps the end of the
> +	 * static region, hide that portion in the metadata.
> +	 */
> +	if (begin_fill_bits) {
>  		chunk->has_reserved = true;
> +		bitmap_fill(chunk->alloc_map, begin_fill_bits);
> +		set_bit(0, chunk->bound_map);
> +		set_bit(begin_fill_bits, chunk->bound_map);
> +
> +		if (pcpu_block_update_hint_alloc(chunk, 0, begin_fill_bits))
> +			pcpu_chunk_update_hint(chunk);
>  	}
>  
> +	/* init dynamic chunk if necessary */
> +	if (ai->reserved_size) {
> +		pcpu_reserved_chunk = chunk;
> +
>  		chunk_pages = dyn_size >> PAGE_SHIFT;
>  
>  		/* chunk allocation */
> +		chunk = pcpu_alloc_first_chunk(chunk_pages);
>  		chunk->base_addr = base_addr + ai->static_size +
>  				    ai->reserved_size;
> +
> +		/* set metadata */
> +		chunk->contig_hint = pcpu_nr_pages_to_bits(chunk);
> +		chunk->free_bits = pcpu_nr_pages_to_bits(chunk);
>  	}
>  
>  	/* link the first chunk in */

I *think* that quite a bit of the above can be moved into a separate
patch.

Thanks a lot!

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 10/10] percpu: add optimizations on allocation path for the bitmap allocator
  2017-07-16  2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou
@ 2017-07-17 23:32   ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 23:32 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:15PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> This patch adds two optimizations to the allocation path. The first is
> to not consider a chunk if the requested allocation cannot fit in the
> chunk's contig_hint. The benefit is that this avoids unnecessary scanning
> over a chunk as the assumption is memory pressure is high and creating a
> new chunk has minimal consequences. This may fail when the contig_hint
> has poor alignment, but again we fall back on the high memory pressure
> argument.
> 
> The second is just a fail-fast mechanism. When allocating, an offset is
> identified within a block and then scanning is used to see if it will
> fit. An offset should never be returned unless it is known to fit, so
> here we just bind the scanning to the size of a block.
> 
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Looks good to me, and there's nothing wrong with these two
optimizations being in a separate patch, but they might be too small
to help reviewing / debugging in any noticeable way.  It'd be great if
more significant parts could be separated out.  If not, this is fine
too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
                   ` (9 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou
@ 2017-07-18 19:15 ` Josef Bacik
  2017-07-24 21:14   ` Dennis Zhou
  10 siblings, 1 reply; 37+ messages in thread
From: Josef Bacik @ 2017-07-18 19:15 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel,
	linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:05PM -0400, Dennis Zhou wrote:
> Hi everyone,
> 
> The Linux kernel percpu memory allocator is responsible for managing
> percpu memory. It allocates memory from chunks of percpu areas and uses a
> simple first-fit area allocator to manage allocations inside each chunk.
> There now exist use cases where allocating and deallocating a million or
> more objects occurs making the current implementation inadequate.
> 
> The two primary problems with the current area map allocator are:
>   1. The backing data structure is an array of the areas. To manage this
>      array, it is possible to need to memmove a large portion of it.
>   2. On allocation, chunks are considered based on the contig_hint. It is
>      possible that the contig_hint may be large enough while the alignment 
>      could not meet the request. This causes scanning over every free
>      fragment that could spill over into scanning chunks.
> 
> The primary considerations for the new allocator were the following:
>  - Remove the memmove operation from the critical path
>  - Be conservative with additional use of memory
>  - Provide consistency in performance and memory footprint
>  - Focus on small allocations < 64 bytes
> 
> This patchset introduces a simple bitmap allocator backed by metadata
> blocks as a replacement for the area map allocator for percpu memory. Each
> chunk has an allocation bitmap, a boundary bitmap, and a set of metadata
> blocks. The allocation map serves as the ground truth for allocations
> while the boundary map serves as a way to distinguish between consecutive
> allocations. The minimum allocation size has been increased to 4-bytes.
> 
> The key property behind the bitmap allocator is its static metadata. The
> main problem it solves is that a memmove is no longer part of the critical
> path for freeing, which was the primary source of latency. This also helps
> bound the metadata overhead. The area map allocator prior required an
> integer per allocation. This may be beneficial with larger allocations,
> but as mentioned, allocating a significant number of small objects is
> becoming more common. This causes worst-case scenarios for metadata
> overhead.
> 
> There is one caveat with this implementation. In an effort to make freeing
> fast, the only time metadata is updated on the free path is if a whole
> block becomes free or the freed area spans across metadata blocks. This
> causes the chunk's contig_hint to be potentially smaller than what it
> could allocate by up to a block. If the chunk's contig_hint is smaller
> than a block, a check occurs and the hint is kept accurate. Metadata is
> always kept accurate on allocation and therefore the situation where a
> chunk has a larger contig_hint than available will never occur.
> 
> I have primarily done testing against a simple workload of allocation of
> 1 million objects of varying size. Deallocation was done by in order,
> alternating, and in reverse. These numbers were collected after rebasing
> ontop of a80099a152. I present the worst-case numbers here:
> 
>   Area Map Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        335      |     4960
>              16B    |        485      |     1150
>              64B    |        445      |      280
>             128B    |        505      |      177
>            1024B    |       3385      |      140
> 
>   Bitmap Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        725      |       70
>              16B    |        760      |       70
>              64B    |        855      |       80
>             128B    |        910      |       90
>            1024B    |       3770      |      260
> 
> This data demonstrates the inability for the area map allocator to
> handle less than ideal situations. In the best case of reverse
> deallocation, the area map allocator was able to perform within range
> of the bitmap allocator. In the worst case situation, freeing took
> nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
> dramatically improves the consistency of the free path. The small
> allocations performed nearly identical regardless of the freeing
> pattern.
> 
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 million 4-byte objects with 8-byte alignment.
> 

OK, so you say that this test is better for the area map allocator, so presumably
this is the worst case for the bitmap allocator?  What does the average case look
like?  Trading 2x allocation latency for a pretty significant free latency
reduction seems OK, but are we allocating or freeing more?  Are both allocations
and frees done in performance-critical areas, or do allocations only happen in
performance-critical areas, making the increased allocation latency hurt more?
And lastly, why are we paying a 2x latency cost?  What is it about the bitmap
allocator that makes it much worse than the area map allocator?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/10] percpu: modify base_addr to be region specific
  2017-07-16  2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou
  2017-07-17 18:57   ` Tejun Heo
@ 2017-07-18 19:26   ` Josef Bacik
  2017-07-18 19:36     ` Matthew Wilcox
  1 sibling, 1 reply; 37+ messages in thread
From: Josef Bacik @ 2017-07-18 19:26 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel,
	linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> Originally, the first chunk is served by up to three chunks, each given
> a region they are responsible for. Despite this, the arithmetic was based
> off of the base_addr making it require offsets or be overly inclusive.
> This patch changes percpu checks for the first chunk to consider only
> the dynamic region and the reserved check to be only the reserved
> region. There is no impact here besides making these checks a little
> more accurate.
> 
> This patch also adds the ground work increasing the minimum allocation
> size to 4 bytes. The new field nr_pages in pcpu_chunk will be used to
> keep track of the number of pages the bitmap serves. The arithmetic for
> identifying first chunk and reserved chunk reflect this change.
> 
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
> ---
>  include/linux/percpu.h |   4 ++
>  mm/percpu-internal.h   |  12 +++--
>  mm/percpu.c            | 127 ++++++++++++++++++++++++++++++++++---------------
>  3 files changed, 100 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index 98a371c..a5cedcd 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -21,6 +21,10 @@
>  /* minimum unit size, also is the maximum supported allocation size */
>  #define PCPU_MIN_UNIT_SIZE		PFN_ALIGN(32 << 10)
>  
> +/* minimum allocation size and shift in bytes */
> +#define PCPU_MIN_ALLOC_SIZE		(1 << PCPU_MIN_ALLOC_SHIFT)
> +#define PCPU_MIN_ALLOC_SHIFT		2
> +
>  /*
>   * Percpu allocator can serve percpu allocations before slab is
>   * initialized which allows slab to depend on the percpu allocator.
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index c9158a4..56e1aba 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -23,11 +23,12 @@ struct pcpu_chunk {
>  	void			*data;		/* chunk data */
>  	int			first_free;	/* no free below this */
>  	bool			immutable;	/* no [de]population allowed */
> -	bool			has_reserved;	/* Indicates if chunk has reserved space
> -						   at the beginning. Reserved chunk will
> -						   contain reservation for static chunk.
> -						   Dynamic chunk will contain reservation
> -						   for static and reserved chunks. */
> +	bool			has_reserved;	/* indicates if the region this chunk
> +						   is responsible for overlaps with
> +						   the prior adjacent region */
> +
> +	int                     nr_pages;       /* # of PAGE_SIZE pages served
> +						   by this chunk */
>  	int			nr_populated;	/* # of populated pages */
>  	unsigned long		populated[];	/* populated bitmap */
>  };
> @@ -40,6 +41,7 @@ extern int pcpu_nr_empty_pop_pages;
>  
>  extern struct pcpu_chunk *pcpu_first_chunk;
>  extern struct pcpu_chunk *pcpu_reserved_chunk;
> +extern unsigned long pcpu_reserved_offset;
>  
>  #ifdef CONFIG_PERCPU_STATS
>  
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 7704db9..c74ad68 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -144,14 +144,14 @@ static const size_t *pcpu_group_sizes __ro_after_init;
>  struct pcpu_chunk *pcpu_first_chunk __ro_after_init;
>  
>  /*
> - * Optional reserved chunk.  This chunk reserves part of the first
> - * chunk and serves it for reserved allocations.  The amount of
> - * reserved offset is in pcpu_reserved_chunk_limit.  When reserved
> - * area doesn't exist, the following variables contain NULL and 0
> - * respectively.
> + * Optional reserved chunk.  This is the part of the first chunk that
> + * serves reserved allocations.  The pcpu_reserved_offset is the amount
> + * the pcpu_reserved_chunk->base_addr is push back into the static
> + * region for the base_addr to be page aligned.  When the reserved area
> + * doesn't exist, the following variables contain NULL and 0 respectively.
>   */
>  struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
> -static int pcpu_reserved_chunk_limit __ro_after_init;
> +unsigned long pcpu_reserved_offset __ro_after_init;
>  
>  DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
>  static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
> @@ -184,19 +184,32 @@ static void pcpu_schedule_balance_work(void)
>  		schedule_work(&pcpu_balance_work);
>  }
>  
> +/*
> + * Static addresses should never be passed into the allocator.  They
> + * are accessed using the group_offsets and therefore do not rely on
> + * chunk->base_addr.
> + */
>  static bool pcpu_addr_in_first_chunk(void *addr)
>  {
>  	void *first_start = pcpu_first_chunk->base_addr;
>  
> -	return addr >= first_start && addr < first_start + pcpu_unit_size;
> +	return addr >= first_start &&
> +	       addr < first_start +
> +	       pcpu_first_chunk->nr_pages * PAGE_SIZE;
>  }
>  
>  static bool pcpu_addr_in_reserved_chunk(void *addr)
>  {
> -	void *first_start = pcpu_first_chunk->base_addr;
> +	void *first_start;
>  
> -	return addr >= first_start &&
> -		addr < first_start + pcpu_reserved_chunk_limit;
> +	if (!pcpu_reserved_chunk)
> +		return false;
> +
> +	first_start = pcpu_reserved_chunk->base_addr;
> +
> +	return addr >= first_start + pcpu_reserved_offset &&
> +	       addr < first_start +
> +	       pcpu_reserved_chunk->nr_pages * PAGE_SIZE;
>  }
>  
>  static int __pcpu_size_to_slot(int size)
> @@ -237,11 +250,16 @@ static int __maybe_unused pcpu_page_idx(unsigned int cpu, int page_idx)
>  	return pcpu_unit_map[cpu] * pcpu_unit_pages + page_idx;
>  }
>  
> +static unsigned long pcpu_unit_page_offset(unsigned int cpu, int page_idx)
> +{
> +	return pcpu_unit_offsets[cpu] + (page_idx << PAGE_SHIFT);
> +}
> +
>  static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
>  				     unsigned int cpu, int page_idx)
>  {
> -	return (unsigned long)chunk->base_addr + pcpu_unit_offsets[cpu] +
> -		(page_idx << PAGE_SHIFT);
> +	return (unsigned long)chunk->base_addr +
> +		pcpu_unit_page_offset(cpu, page_idx);
>  }
>  
>  static void __maybe_unused pcpu_next_unpop(struct pcpu_chunk *chunk,
> @@ -737,6 +755,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void)
>  	chunk->free_size = pcpu_unit_size;
>  	chunk->contig_hint = pcpu_unit_size;
>  
> +	chunk->nr_pages = pcpu_unit_pages;
> +
>  	return chunk;
>  }
>  
> @@ -824,18 +844,20 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
>   * pcpu_chunk_addr_search - determine chunk containing specified address
>   * @addr: address for which the chunk needs to be determined.
>   *
> + * This is an internal function that handles all but static allocations.
> + * Static percpu address values should never be passed into the allocator.
> + *
>   * RETURNS:
>   * The address of the found chunk.
>   */
>  static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
>  {
>  	/* is it in the first chunk? */
> -	if (pcpu_addr_in_first_chunk(addr)) {
> -		/* is it in the reserved area? */
> -		if (pcpu_addr_in_reserved_chunk(addr))
> -			return pcpu_reserved_chunk;
> +	if (pcpu_addr_in_first_chunk(addr))
>  		return pcpu_first_chunk;
> -	}
> +	/* is it in the reserved chunk? */
> +	if (pcpu_addr_in_reserved_chunk(addr))
> +		return pcpu_reserved_chunk;
>  
>  	/*
>  	 * The address is relative to unit0 which might be unused and
> @@ -1366,10 +1388,17 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)
>  	 * The following test on unit_low/high isn't strictly
>  	 * necessary but will speed up lookups of addresses which
>  	 * aren't in the first chunk.
> +	 *
> +	 * The address check is of high granularity checking against full
> +	 * chunk sizes.  pcpu_base_addr points to the beginning of the first
> +	 * chunk including the static region.  This allows us to examine all
> +	 * regions of the first chunk. Assumes good intent as the first
> +	 * chunk may not be full (ie. < pcpu_unit_pages in size).
>  	 */
> -	first_low = pcpu_chunk_addr(pcpu_first_chunk, pcpu_low_unit_cpu, 0);
> -	first_high = pcpu_chunk_addr(pcpu_first_chunk, pcpu_high_unit_cpu,
> -				     pcpu_unit_pages);
> +	first_low = (unsigned long) pcpu_base_addr +
> +		    pcpu_unit_page_offset(pcpu_low_unit_cpu, 0);
> +	first_high = (unsigned long) pcpu_base_addr +
> +		     pcpu_unit_page_offset(pcpu_high_unit_cpu, pcpu_unit_pages);
>  	if ((unsigned long)addr >= first_low &&
>  	    (unsigned long)addr < first_high) {
>  		for_each_possible_cpu(cpu) {
> @@ -1575,6 +1604,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  	unsigned int cpu;
>  	int *unit_map;
>  	int group, unit, i;
> +	unsigned long tmp_addr, aligned_addr;
> +	unsigned long map_size_bytes;
>  
>  #define PCPU_SETUP_BUG_ON(cond)	do {					\
>  	if (unlikely(cond)) {						\
> @@ -1678,46 +1709,66 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
>  		INIT_LIST_HEAD(&pcpu_slot[i]);
>  
>  	/*
> -	 * Initialize static chunk.  If reserved_size is zero, the
> -	 * static chunk covers static area + dynamic allocation area
> -	 * in the first chunk.  If reserved_size is not zero, it
> -	 * covers static area + reserved area (mostly used for module
> -	 * static percpu allocation).
> +	 * Initialize static chunk.
> +	 * The static region is dropped as those addresses are already
> +	 * allocated and do not rely on chunk->base_addr.
> +	 * reserved_size == 0:
> +	 *      the static chunk covers the dynamic area
> +	 * reserved_size > 0:
> +	 *      the static chunk covers the reserved area
> +	 *
> +	 * If the static area is not page aligned, the region adjacent
> +	 * to the static area must have its base_addr be offset into
> +	 * the static area to have it be page aligned.  The overlap is
> +	 * then allocated preserving the alignment in the metadata for
> +	 * the actual region.
>  	 */
> +	tmp_addr = (unsigned long)base_addr + ai->static_size;
> +	aligned_addr = tmp_addr & PAGE_MASK;
> +	pcpu_reserved_offset = tmp_addr - aligned_addr;
> +
> +	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
> +			 pcpu_reserved_offset;

This confused me for a second, better to be explicit with

(ai->reserved_size ? 0 : ai->dyn_size) + pcpu_reserved_offset;

Thanks,

Josef

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/10] percpu: modify base_addr to be region specific
  2017-07-18 19:26   ` Josef Bacik
@ 2017-07-18 19:36     ` Matthew Wilcox
  2017-07-19 14:20       ` Josef Bacik
  0 siblings, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2017-07-18 19:36 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, kernel-team,
	linux-kernel, linux-mm, Dennis Zhou

On Tue, Jul 18, 2017 at 03:26:02PM -0400, Josef Bacik wrote:
> On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote:
> > +	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
> > +			 pcpu_reserved_offset;
> 
> This confused me for a second, better to be explicit with
> 
> (ai->reserved_size ? 0 : ai->dyn_size) + pcpu_reserved_offset;

You're still confused ;-)  What Dennis wrote is equivalent to:

(ai->reserved_size ? ai->reserved_size : ai->dyn_size) + pcpu_reserved_offset;
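
For anyone else skimming, that's GCC's "omitted middle operand" form of
the conditional operator.  A toy illustration (not from the patch):

	int reserved_size = 0, dyn_size = 4096;

	/* "a ?: b" yields a if a is non-zero, otherwise b (a evaluated once) */
	int v1 = reserved_size ?: dyn_size;			/* 4096 */
	int v2 = reserved_size ? reserved_size : dyn_size;	/* 4096, equivalent */
	int v3 = reserved_size ? 0 : dyn_size;			/* 4096 here, but 0
								   once reserved_size
								   is non-zero */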

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/10] percpu: modify base_addr to be region specific
  2017-07-18 19:36     ` Matthew Wilcox
@ 2017-07-19 14:20       ` Josef Bacik
  0 siblings, 0 replies; 37+ messages in thread
From: Josef Bacik @ 2017-07-19 14:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Josef Bacik, Dennis Zhou, Tejun Heo, Christoph Lameter,
	kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Tue, Jul 18, 2017 at 12:36:27PM -0700, Matthew Wilcox wrote:
> On Tue, Jul 18, 2017 at 03:26:02PM -0400, Josef Bacik wrote:
> > On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote:
> > > +	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
> > > +			 pcpu_reserved_offset;
> > 
> > This confused me for a second, better to be explicit with
> > 
> > (ai->reserved_size ? 0 : ai->dyn_size) + pcpu_reserved_offset;
> 
> You're still confused ;-)  What Dennis wrote is equivalent to:
> 
> (ai->reserved_size ? ai->reserved_size : ai->dyn_size) + pcpu_reserved_offset;

Lol jesus, made my point even harder with me being an idiot.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
  2017-07-17 23:27   ` Tejun Heo
@ 2017-07-19 19:11   ` Josef Bacik
  2017-07-19 22:19     ` Dennis Zhou
  2017-07-19 19:16   ` Josef Bacik
  2 siblings, 1 reply; 37+ messages in thread
From: Josef Bacik @ 2017-07-19 19:11 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel,
	linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> The percpu memory allocator is experiencing scalability issues when
> allocating and freeing large numbers of counters as in BPF.
> Additionally, there is a corner case where iteration is triggered over
> all chunks if the contig_hint is the right size, but wrong alignment.
> 
> Implementation:
> This patch removes the area map allocator in favor of a bitmap allocator
> backed by metadata blocks. The primary goal is to provide consistency
> in performance and memory footprint with a focus on small allocations
> (< 64 bytes). The bitmap removes the heavy memmove from the freeing
> critical path and provides a consistent memory footprint. The metadata
> blocks provide a bound on the amount of scanning required by maintaining
> a set of hints.
> 
> The chunks previously were managed by free_size, a value maintained in
> bytes. Now, the chunks are managed in terms of bits, which is just a
> scaled value of free_size down by PCPU_MIN_ALLOC_SIZE.
> 
> There is one caveat with this implementation. In an effort to make
> freeing fast, the only time metadata is updated on the free path is if a
> whole block becomes free or the freed area spans across metadata blocks.
> This causes the chunk's contig_hint to be potentially smaller than what
> it could allocate by up to a block. If the chunk's contig_hint is
> smaller than a block, a check occurs and the hint is kept accurate.
> Metadata is always kept accurate on allocation and therefore the
> situation where a chunk has a larger contig_hint than available will
> never occur.
> 
> Evaluation:
> I have primarily done testing against a simple workload of allocation of
> 1 million objects of varying size. Deallocation was done by in order,
> alternating, and in reverse. These numbers were collected after rebasing
> ontop of a80099a152. I present the worst-case numbers here:
> 
>   Area Map Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        335      |     4960
>              16B    |        485      |     1150
>              64B    |        445      |      280
>             128B    |        505      |      177
>            1024B    |       3385      |      140
> 
>   Bitmap Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        725      |       70
>              16B    |        760      |       70
>              64B    |        855      |       80
>             128B    |        910      |       90
>            1024B    |       3770      |      260
> 
> This data demonstrates the inability for the area map allocator to
> handle less than ideal situations. In the best case of reverse
> deallocation, the area map allocator was able to perform within range
> of the bitmap allocator. In the worst case situation, freeing took
> nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
> dramatically improves the consistency of the free path. The small
> allocations performed nearly identical regardless of the freeing
> pattern.
> 
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 million 4-byte objects with 8-byte alignment.
> 

This was a bear to review; I feel like it could be split into a few smaller
pieces.  You are changing hinting, allocating/freeing, and how you find chunks.
Those seem like good logical divisions to me.  Overall the design seems sound
and I didn't spot any major problems.  Once you've split them up I'll do another
thorough comb through and then add my Reviewed-by.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
  2017-07-17 23:27   ` Tejun Heo
  2017-07-19 19:11   ` Josef Bacik
@ 2017-07-19 19:16   ` Josef Bacik
  2017-07-19 22:13     ` Dennis Zhou
  2 siblings, 1 reply; 37+ messages in thread
From: Josef Bacik @ 2017-07-19 19:16 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel,
	linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> The percpu memory allocator is experiencing scalability issues when
> allocating and freeing large numbers of counters as in BPF.
> Additionally, there is a corner case where iteration is triggered over
> all chunks if the contig_hint is the right size, but wrong alignment.
> 
> Implementation:
> This patch removes the area map allocator in favor of a bitmap allocator
> backed by metadata blocks. The primary goal is to provide consistency
> in performance and memory footprint with a focus on small allocations
> (< 64 bytes). The bitmap removes the heavy memmove from the freeing
> critical path and provides a consistent memory footprint. The metadata
> blocks provide a bound on the amount of scanning required by maintaining
> a set of hints.
> 
> The chunks previously were managed by free_size, a value maintained in
> bytes. Now, the chunks are managed in terms of bits, which is just a
> scaled value of free_size down by PCPU_MIN_ALLOC_SIZE.
> 
> There is one caveat with this implementation. In an effort to make
> freeing fast, the only time metadata is updated on the free path is if a
> whole block becomes free or the freed area spans across metadata blocks.
> This causes the chunk's contig_hint to be potentially smaller than what
> it could allocate by up to a block. If the chunk's contig_hint is
> smaller than a block, a check occurs and the hint is kept accurate.
> Metadata is always kept accurate on allocation and therefore the
> situation where a chunk has a larger contig_hint than available will
> never occur.
> 
> Evaluation:
> I have primarily done testing against a simple workload of allocation of
> 1 million objects of varying size. Deallocation was done by in order,
> alternating, and in reverse. These numbers were collected after rebasing
> ontop of a80099a152. I present the worst-case numbers here:
> 
>   Area Map Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        335      |     4960
>              16B    |        485      |     1150
>              64B    |        445      |      280
>             128B    |        505      |      177
>            1024B    |       3385      |      140
> 
>   Bitmap Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        725      |       70
>              16B    |        760      |       70
>              64B    |        855      |       80
>             128B    |        910      |       90
>            1024B    |       3770      |      260
> 
> This data demonstrates the inability for the area map allocator to
> handle less than ideal situations. In the best case of reverse
> deallocation, the area map allocator was able to perform within range
> of the bitmap allocator. In the worst case situation, freeing took
> nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
> dramatically improves the consistency of the free path. The small
> allocations performed nearly identical regardless of the freeing
> pattern.
> 
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 million 4-byte objects with 8-byte alignment.
> 
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
> ---
>  include/linux/percpu.h |   10 +-
>  init/main.c            |    1 -
>  mm/percpu-internal.h   |   70 ++-
>  mm/percpu-stats.c      |   99 ++--
>  mm/percpu.c            | 1280 ++++++++++++++++++++++++++++++------------------
>  5 files changed, 923 insertions(+), 537 deletions(-)
> 
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index a5cedcd..8f62b10 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -26,6 +26,15 @@
>  #define PCPU_MIN_ALLOC_SHIFT		2
>  
>  /*
> + * This determines the size of each metadata block.  There are several subtle
> + * constraints around this variable.  The reserved_region and dynamic_region
> + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE.  This is
> + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks
> + * that are backing multiple pages, this needs to be accounted for.
> + */
> +#define PCPU_BITMAP_BLOCK_SIZE		(PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)
> +
> +/*
>   * Percpu allocator can serve percpu allocations before slab is
>   * initialized which allows slab to depend on the percpu allocator.
>   * The following two parameters decide how much resource to
> @@ -120,7 +129,6 @@ extern bool is_kernel_percpu_address(unsigned long addr);
>  #if !defined(CONFIG_SMP) || !defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
>  extern void __init setup_per_cpu_areas(void);
>  #endif
> -extern void __init percpu_init_late(void);
>  
>  extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);
>  extern void __percpu *__alloc_percpu(size_t size, size_t align);
> diff --git a/init/main.c b/init/main.c
> index 052481f..c9a9fff 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -500,7 +500,6 @@ static void __init mm_init(void)
>  	page_ext_init_flatmem();
>  	mem_init();
>  	kmem_cache_init();
> -	percpu_init_late();
>  	pgtable_init();
>  	vmalloc_init();
>  	ioremap_huge_init();
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index f0776f6..2dac644 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -4,6 +4,21 @@
>  #include <linux/types.h>
>  #include <linux/percpu.h>
>  
> +/*
> + * pcpu_bitmap_md is the metadata block struct.
> + * All units are in terms of bits.
> + */
> +struct pcpu_bitmap_md {
> +	int			contig_hint;	/* contig hint for block */
> +	int			contig_hint_start; /* block relative starting
> +						      position of the contig hint */
> +	int			left_free;	/* size of free space along
> +						   the left side of the block */
> +	int			right_free;	/* size of free space along
> +						   the right side of the block */
> +	int			first_free;	/* block position of first free */
> +};
> +
>  struct pcpu_chunk {
>  #ifdef CONFIG_PERCPU_STATS
>  	int			nr_alloc;	/* # of allocations */
> @@ -11,17 +26,20 @@ struct pcpu_chunk {
>  #endif
>  
>  	struct list_head	list;		/* linked to pcpu_slot lists */
> -	int			free_size;	/* free bytes in the chunk */
> -	int			contig_hint;	/* max contiguous size hint */
> +	int			free_bits;	/* free bits in the chunk */
> +	int			contig_hint;	/* max contiguous size hint
> +						   in bits */
> +	int			contig_hint_start; /* contig_hint starting
> +						      bit offset */
>  	void			*base_addr;	/* base address of this chunk */
>  
> -	int			map_used;	/* # of map entries used before the sentry */
> -	int			map_alloc;	/* # of map entries allocated */
> -	int			*map;		/* allocation map */
> -	struct list_head	map_extend_list;/* on pcpu_map_extend_chunks */
> +	unsigned long		*alloc_map;	/* allocation map */
> +	unsigned long		*bound_map;	/* boundary map */
> +	struct pcpu_bitmap_md	*md_blocks;	/* metadata blocks */
>  
>  	void			*data;		/* chunk data */
> -	int			first_free;	/* no free below this */
> +	int			first_free_block; /* block that contains the first
> +						     free bit */
>  	bool			immutable;	/* no [de]population allowed */
>  	bool			has_reserved;	/* indicates if the region this chunk
>  						   is responsible for overlaps with
> @@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk;
>  extern struct pcpu_chunk *pcpu_reserved_chunk;
>  extern unsigned long pcpu_reserved_offset;
>  
> +/*
> + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks
> + * @chunk: chunk of interest
> + *
> + * This conversion is from the number of physical pages that the chunk
> + * serves to the number of bitmap blocks required.  It converts from bytes
> + * served to bits required and then to blocks used.
> + */
> +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk)
> +{
> +	return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE /
> +	       PCPU_BITMAP_BLOCK_SIZE;
> +}
> +
> +/*
> + * pcpu_pages_to_bits - converts the pages to size of bitmap
> + * @pages: number of physical pages
> + *
> + * This conversion is from physical pages to the number of bits
> + * required in the bitmap.
> + */
> +static inline int pcpu_pages_to_bits(int pages)
> +{
> +	return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
> +}
> +
> +/*
> + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap
> + * @chunk: chunk of interest
> + *
> + * This conversion is from the number of physical pages that the chunk
> + * serves to the number of bits in the bitmap.
> + */
> +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk)
> +{
> +	return pcpu_pages_to_bits(chunk->nr_pages);
> +}
> +
>  #ifdef CONFIG_PERCPU_STATS
>  
>  #include <linux/spinlock.h>
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index 6fc04b1..8dbef0c 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -29,64 +29,85 @@ static int cmpint(const void *a, const void *b)
>  }
>  
>  /*
> - * Iterates over all chunks to find the max # of map entries used.
> + * Iterates over all chunks to find the max nr_alloc entries.
>   */
> -static int find_max_map_used(void)
> +static int find_max_nr_alloc(void)
>  {
>  	struct pcpu_chunk *chunk;
> -	int slot, max_map_used;
> +	int slot, max_nr_alloc;
>  
> -	max_map_used = 0;
> +	max_nr_alloc = 0;
>  	for (slot = 0; slot < pcpu_nr_slots; slot++)
>  		list_for_each_entry(chunk, &pcpu_slot[slot], list)
> -			max_map_used = max(max_map_used, chunk->map_used);
> +			max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
>  
> -	return max_map_used;
> +	return max_nr_alloc;
>  }
>  
>  /*
>   * Prints out chunk state. Fragmentation is considered between
>   * the beginning of the chunk to the last allocation.
> + *
> + * All statistics are in bytes unless stated otherwise.
>   */
>  static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
>  			    int *buffer)
>  {
> -	int i, s_index, last_alloc, alloc_sign, as_len;
> +	int i, last_alloc, as_len, start, end;
>  	int *alloc_sizes, *p;
>  	/* statistics */
>  	int sum_frag = 0, max_frag = 0;
>  	int cur_min_alloc = 0, cur_med_alloc = 0, cur_max_alloc = 0;
>  
>  	alloc_sizes = buffer;
> -	s_index = chunk->has_reserved ? 1 : 0;
> -
> -	/* find last allocation */
> -	last_alloc = -1;
> -	for (i = chunk->map_used - 1; i >= s_index; i--) {
> -		if (chunk->map[i] & 1) {
> -			last_alloc = i;
> -			break;
> -		}
> -	}
>  
> -	/* if the chunk is not empty - ignoring reserve */
> -	if (last_alloc >= s_index) {
> -		as_len = last_alloc + 1 - s_index;
> -
> -		/*
> -		 * Iterate through chunk map computing size info.
> -		 * The first bit is overloaded to be a used flag.
> -		 * negative = free space, positive = allocated
> -		 */
> -		for (i = 0, p = chunk->map + s_index; i < as_len; i++, p++) {
> -			alloc_sign = (*p & 1) ? 1 : -1;
> -			alloc_sizes[i] = alloc_sign *
> -				((p[1] & ~1) - (p[0] & ~1));
> +	/*
> +	 * find_last_bit returns the passed in size if nothing is found.
> +	 * Therefore, we must check whether a set bit was actually found
> +	 * and set the appropriate value.
> +	 */
> +	last_alloc = find_last_bit(chunk->alloc_map,
> +				   pcpu_nr_pages_to_bits(chunk) - 1);
> +	last_alloc = test_bit(last_alloc, chunk->alloc_map) ?
> +		     last_alloc + 1 : 0;
> +
> +	start = as_len = 0;
> +	if (chunk->has_reserved)
> +		start = pcpu_reserved_offset;
> +
> +	/*
> +	 * If a bit is set in the allocation map, the bound_map identifies
> +	 * where the allocation ends.  If the bit is not set, the
> +	 * bound_map does not identify free areas as it is only kept accurate
> +	 * on allocation, not free.
> +	 *
> +	 * Positive values are allocations and negative values are free
> +	 * fragments.
> +	 */
> +	while (start < last_alloc) {
> +		if (test_bit(start, chunk->alloc_map)) {
> +			end = find_next_bit(chunk->bound_map, last_alloc,
> +					    start + 1);
> +			alloc_sizes[as_len] = 1;
> +		} else {
> +			end = find_next_bit(chunk->alloc_map, last_alloc,
> +					    start + 1);
> +			alloc_sizes[as_len] = -1;
>  		}
>  
> -		sort(alloc_sizes, as_len, sizeof(chunk->map[0]), cmpint, NULL);
> +		alloc_sizes[as_len++] *= (end - start) * PCPU_MIN_ALLOC_SIZE;
> +
> +		start = end;
> +	}
> +
> +	/*
> +	 * The negative values are free fragments and thus sorting gives the
> +	 * free fragments at the beginning in largest first order.
> +	 */
> +	if (as_len > 0) {
> +		sort(alloc_sizes, as_len, sizeof(int), cmpint, NULL);
>  
> -		/* Iterate through the unallocated fragements. */
> +		/* iterate through the unallocated fragments */
>  		for (i = 0, p = alloc_sizes; *p < 0 && i < as_len; i++, p++) {
>  			sum_frag -= *p;
>  			max_frag = max(max_frag, -1 * (*p));
> @@ -100,8 +121,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
>  	P("nr_alloc", chunk->nr_alloc);
>  	P("max_alloc_size", chunk->max_alloc_size);
>  	P("empty_pop_pages", chunk->nr_empty_pop_pages);
> -	P("free_size", chunk->free_size);
> -	P("contig_hint", chunk->contig_hint);
> +	P("first_free_block", chunk->first_free_block);
> +	P("free_size", chunk->free_bits * PCPU_MIN_ALLOC_SIZE);
> +	P("contig_hint", chunk->contig_hint * PCPU_MIN_ALLOC_SIZE);
>  	P("sum_frag", sum_frag);
>  	P("max_frag", max_frag);
>  	P("cur_min_alloc", cur_min_alloc);
> @@ -113,22 +135,23 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
>  static int percpu_stats_show(struct seq_file *m, void *v)
>  {
>  	struct pcpu_chunk *chunk;
> -	int slot, max_map_used;
> +	int slot, max_nr_alloc;
>  	int *buffer;
>  
>  alloc_buffer:
>  	spin_lock_irq(&pcpu_lock);
> -	max_map_used = find_max_map_used();
> +	max_nr_alloc = find_max_nr_alloc();
>  	spin_unlock_irq(&pcpu_lock);
>  
> -	buffer = vmalloc(max_map_used * sizeof(pcpu_first_chunk->map[0]));
> +	/* there can be at most this many free and allocated fragments */
> +	buffer = vmalloc((2 * max_nr_alloc + 1) * sizeof(int));
>  	if (!buffer)
>  		return -ENOMEM;
>  
>  	spin_lock_irq(&pcpu_lock);
>  
>  	/* if the buffer allocated earlier is too small */
> -	if (max_map_used < find_max_map_used()) {
> +	if (max_nr_alloc < find_max_nr_alloc()) {
>  		spin_unlock_irq(&pcpu_lock);
>  		vfree(buffer);
>  		goto alloc_buffer;
> diff --git a/mm/percpu.c b/mm/percpu.c
> index fb01841..569df63 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -4,6 +4,9 @@
>   * Copyright (C) 2009		SUSE Linux Products GmbH
>   * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
>   *
> + * Copyright (C) 2017		Facebook Inc.
> + * Copyright (C) 2017		Dennis Zhou <dennisszhou@gmail.com>
> + *
>   * This file is released under the GPLv2 license.
>   *
>   * The percpu allocator handles both static and dynamic areas.  Percpu
> @@ -34,19 +37,20 @@
>   * percpu variables from kernel modules.  Finally, the dynamic section
>   * takes care of normal allocations.
>   *
> - * Allocation state in each chunk is kept using an array of integers
> - * on chunk->map.  A positive value in the map represents a free
> - * region and negative allocated.  Allocation inside a chunk is done
> - * by scanning this map sequentially and serving the first matching
> - * entry.  This is mostly copied from the percpu_modalloc() allocator.
> - * Chunks can be determined from the address using the index field
> - * in the page struct. The index field contains a pointer to the chunk.
> - *
> - * These chunks are organized into lists according to free_size and
> - * tries to allocate from the fullest chunk first. Each chunk maintains
> - * a maximum contiguous area size hint which is guaranteed to be equal
> - * to or larger than the maximum contiguous area in the chunk. This
> - * helps prevent the allocator from iterating over chunks unnecessarily.
> + * The allocator organizes chunks into lists according to free size and
> + * tries to allocate from the fullest chunk first.  Each chunk is managed
> + * by a bitmap with metadata blocks.  The allocation map is updated on
> + * every allocation to reflect the current state while the boundary map
> + * is only updated on allocation.  Each metadata block contains
> + * information to help mitigate the need to iterate over large portions
> + * of the bitmap.  The reverse mapping from page to chunk is stored in
> + * the page's index.  Lastly, units are lazily backed and grow in unison.
> + *
> + * There is a unique conversion that goes on here between bytes and bits.
> + * The chunk tracks the number of pages it is responsible for in nr_pages.
> + * From there, helper functions are used to convert from physical pages
> + * to bitmap bits and bitmap blocks.  All hints are managed in bits
> + * unless explicitly stated.
>   *
>   * To use this allocator, arch code should do the following:
>   *
> @@ -86,10 +90,13 @@
>  
>  #include "percpu-internal.h"
>  
> -#define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
> -#define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
> -#define PCPU_ATOMIC_MAP_MARGIN_LOW	32
> -#define PCPU_ATOMIC_MAP_MARGIN_HIGH	64
> +/*
> + * The metadata is managed in terms of bits with each bit mapping to
> + * a fragment of size PCPU_MIN_ALLOC_SIZE.  Thus, the slots are calculated
> + * with respect to the number of bits available.
> + */
> +#define PCPU_SLOT_BASE_SHIFT		3
> +
>  #define PCPU_EMPTY_POP_PAGES_LOW	2
>  #define PCPU_EMPTY_POP_PAGES_HIGH	4
>  
> @@ -156,10 +163,11 @@ unsigned long pcpu_reserved_offset __ro_after_init;
>  DEFINE_SPINLOCK(pcpu_lock);	/* all internal data structures */
>  static DEFINE_MUTEX(pcpu_alloc_mutex);	/* chunk create/destroy, [de]pop, map ext */
>  
> -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
> -
> -/* chunks which need their map areas extended, protected by pcpu_lock */
> -static LIST_HEAD(pcpu_map_extend_chunks);
> +/*
> + * Chunk list slots.  These slots order the chunks by the number of
> + * free bits available in the bitmap.
> + */
> +struct list_head *pcpu_slot __ro_after_init;
>  
>  /*
>   * The number of empty populated pages, protected by pcpu_lock.  The
> @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr)
>  	       pcpu_reserved_chunk->nr_pages * PAGE_SIZE;
>  }
>  
> -static int __pcpu_size_to_slot(int size)
> +static int __pcpu_size_to_slot(int bit_size)
>  {
> -	int highbit = fls(size);	/* size is in bytes */
> +	int highbit = fls(bit_size);	/* size is in bits */
>  	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
>  }
>  
> -static int pcpu_size_to_slot(int size)
> +static int pcpu_size_to_slot(int bit_size)
>  {
> -	if (size == pcpu_unit_size)
> +	if (bit_size == pcpu_pages_to_bits(pcpu_unit_pages))
>  		return pcpu_nr_slots - 1;
> -	return __pcpu_size_to_slot(size);
> +	return __pcpu_size_to_slot(bit_size);
>  }
>  
>  static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
>  {
> -	if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
> +	if (chunk->free_bits == 0 || chunk->contig_hint == 0)
>  		return 0;
>  
> -	return pcpu_size_to_slot(chunk->free_size);
> +	return pcpu_size_to_slot(chunk->free_bits);
>  }
>  
>  /* set the pointer to a chunk in a page struct */
> @@ -277,6 +285,37 @@ static void __maybe_unused pcpu_next_pop(struct pcpu_chunk *chunk,
>  }
>  
>  /*
> + * The following are helper functions for accessing the bitmaps and
> + * converting between bitmap offsets and actual address offsets.
> + */
> +static unsigned long *pcpu_index_alloc_map(struct pcpu_chunk *chunk, int index)
> +{
> +	return chunk->alloc_map +
> +		(index * PCPU_BITMAP_BLOCK_SIZE / BITS_PER_LONG);
> +}
> +
> +static unsigned long pcpu_off_to_block_index(int off)
> +{
> +	return off / PCPU_BITMAP_BLOCK_SIZE;
> +}
> +
> +static unsigned long pcpu_off_to_block_off(int off)
> +{
> +	return off & (PCPU_BITMAP_BLOCK_SIZE - 1);
> +}
> +
> +static unsigned long pcpu_block_off_to_off(int index, int off)
> +{
> +	return index * PCPU_BITMAP_BLOCK_SIZE + off;
> +}
> +
> +static unsigned long pcpu_block_get_first_page(int index)
> +{
> +	return PFN_DOWN(index * PCPU_BITMAP_BLOCK_SIZE *
> +			PCPU_MIN_ALLOC_SIZE);
> +}
> +
> +/*
>   * (Un)populated page region iterators.  Iterate over (un)populated
>   * page regions between @start and @end in @chunk.  @rs and @re should
>   * be integer variables and will be set to start and end page index of
> @@ -329,38 +368,6 @@ static void pcpu_mem_free(void *ptr)
>  }
>  
>  /**
> - * pcpu_count_occupied_pages - count the number of pages an area occupies
> - * @chunk: chunk of interest
> - * @i: index of the area in question
> - *
> - * Count the number of pages chunk's @i'th area occupies.  When the area's
> - * start and/or end address isn't aligned to page boundary, the straddled
> - * page is included in the count iff the rest of the page is free.
> - */
> -static int pcpu_count_occupied_pages(struct pcpu_chunk *chunk, int i)
> -{
> -	int off = chunk->map[i] & ~1;
> -	int end = chunk->map[i + 1] & ~1;
> -
> -	if (!PAGE_ALIGNED(off) && i > 0) {
> -		int prev = chunk->map[i - 1];
> -
> -		if (!(prev & 1) && prev <= round_down(off, PAGE_SIZE))
> -			off = round_down(off, PAGE_SIZE);
> -	}
> -
> -	if (!PAGE_ALIGNED(end) && i + 1 < chunk->map_used) {
> -		int next = chunk->map[i + 1];
> -		int nend = chunk->map[i + 2] & ~1;
> -
> -		if (!(next & 1) && nend >= round_up(end, PAGE_SIZE))
> -			end = round_up(end, PAGE_SIZE);
> -	}
> -
> -	return max_t(int, PFN_DOWN(end) - PFN_UP(off), 0);
> -}
> -
> -/**
>   * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
>   * @chunk: chunk of interest
>   * @oslot: the previous slot it was on
> @@ -386,385 +393,770 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
>  }
>  
>  /**
> - * pcpu_need_to_extend - determine whether chunk area map needs to be extended
> + * pcpu_cnt_pop_pages - counts populated backing pages in range
>   * @chunk: chunk of interest
> - * @is_atomic: the allocation context
> + * @start: start index
> + * @end: end index
>   *
> - * Determine whether area map of @chunk needs to be extended.  If
> - * @is_atomic, only the amount necessary for a new allocation is
> - * considered; however, async extension is scheduled if the left amount is
> - * low.  If !@is_atomic, it aims for more empty space.  Combined, this
> - * ensures that the map is likely to have enough available space to
> - * accomodate atomic allocations which can't extend maps directly.
> - *
> - * CONTEXT:
> - * pcpu_lock.
> + * Calculates the number of populated pages in the region [start, end).
> + * This lets us keep track of how many empty populated pages are available
> + * and decide if we should schedule async work.
>   *
>   * RETURNS:
> - * New target map allocation length if extension is necessary, 0
> - * otherwise.
> + * The nr of populated pages.
>   */
> -static int pcpu_need_to_extend(struct pcpu_chunk *chunk, bool is_atomic)
> +static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk,
> +				      int start, int end)
>  {
> -	int margin, new_alloc;
> -
> -	lockdep_assert_held(&pcpu_lock);
> +	return bitmap_weight(chunk->populated, end) -
> +	       bitmap_weight(chunk->populated, start);
> +}
>  
> -	if (is_atomic) {
> -		margin = 3;
> +/**
> + * pcpu_chunk_update_hint - updates metadata about a chunk
> + * @chunk: chunk of interest
> + *
> + * Responsible for iterating over metadata blocks to aggregate the
> + * overall statistics of the chunk.
> + *
> + * Updates:
> + *      chunk->contig_hint
> + *      chunk->contig_hint_start
> + *      nr_empty_pop_pages
> + */
> +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk)
> +{
> +	bool is_page_empty = true;
> +	int i, off, cur_contig, nr_empty_pop_pages, l_pop_off;
> +	struct pcpu_bitmap_md *block;
> +
> +	chunk->contig_hint = cur_contig = 0;
> +	off = nr_empty_pop_pages = 0;
> +	l_pop_off = pcpu_block_get_first_page(chunk->first_free_block);
> +
> +	for (i = chunk->first_free_block, block = chunk->md_blocks + i;
> +	     i < pcpu_nr_pages_to_blocks(chunk); i++, block++) {
> +		/* Manage nr_empty_pop_pages.
> +		 *
> +		 * This is tricky.  The background work function is
> +		 * triggered when there are not enough free populated pages.
> +		 * This is necessary to make sure atomic allocations can
> +		 * succeed.
> +		 *
> +		 * The first page of each block is tracked here, allowing
> +		 * this to scale both when there is more than one page per
> +		 * block and when a block covers only a portion of a page.
> +		 */
> +		int pop_off = pcpu_block_get_first_page(i);
> +
> +		if (pop_off > l_pop_off) {
> +			if (is_page_empty)
> +				nr_empty_pop_pages +=
> +					pcpu_cnt_pop_pages(chunk, l_pop_off,
> +							   pop_off);
> +			l_pop_off = pop_off;
> +			is_page_empty = true;
> +		}
> +		if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE)
> +			is_page_empty = false;
>  
> -		if (chunk->map_alloc <
> -		    chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) {
> -			if (list_empty(&chunk->map_extend_list)) {
> -				list_add_tail(&chunk->map_extend_list,
> -					      &pcpu_map_extend_chunks);
> -				pcpu_schedule_balance_work();
> +		/* continue from prev block adding to the cur_contig hint */
> +		if (cur_contig) {
> +			cur_contig += block->left_free;
> +			if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
> +				continue;
> +			} else if (cur_contig > chunk->contig_hint) {
> +				chunk->contig_hint = cur_contig;
> +				chunk->contig_hint_start = off;
>  			}
> +			cur_contig = 0;
>  		}
> -	} else {
> -		margin = PCPU_ATOMIC_MAP_MARGIN_HIGH;
> +		/* check if the block->contig_hint is larger */
> +		if (block->contig_hint > chunk->contig_hint) {
> +			chunk->contig_hint = block->contig_hint;
> +			chunk->contig_hint_start =
> +				pcpu_block_off_to_off(i,
> +						      block->contig_hint_start);
> +		}
> +		/* let the next iteration catch the right_free */
> +		cur_contig = block->right_free;
> +		off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free;
>  	}
>  
> -	if (chunk->map_alloc >= chunk->map_used + margin)
> -		return 0;
> -
> -	new_alloc = PCPU_DFL_MAP_ALLOC;
> -	while (new_alloc < chunk->map_used + margin)
> -		new_alloc *= 2;
> +	/* catch last iteration if the last block ends with free space */
> +	if (cur_contig > chunk->contig_hint) {
> +		chunk->contig_hint = cur_contig;
> +		chunk->contig_hint_start = off;
> +	}
>  
> -	return new_alloc;
> +	/*
> +	 * Keep track of nr_empty_pop_pages.
> +	 *
> +	 * The chunk maintains the previous number of free pages it held,
> +	 * so the delta is used to update the global counter.  The reserved
> +	 * chunk is not part of the free page count as its pages are
> +	 * populated at init and are dedicated to serving reserved
> +	 * allocations.
> +	 */
> +	if (is_page_empty) {
> +		nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off,
> +							 chunk->nr_pages);
> +	}
> +	if (chunk != pcpu_reserved_chunk)
> +		pcpu_nr_empty_pop_pages +=
> +			(nr_empty_pop_pages - chunk->nr_empty_pop_pages);
> +	chunk->nr_empty_pop_pages = nr_empty_pop_pages;
>  }
>  
>  /**
> - * pcpu_extend_area_map - extend area map of a chunk
> + * pcpu_block_refresh_hint - recalculates a block's metadata
>   * @chunk: chunk of interest
> - * @new_alloc: new target allocation length of the area map
> + * @index: block index of the metadata block
>   *
> - * Extend area map of @chunk to have @new_alloc entries.
> + * Full scan over the entire block to recalculate block-level metadata.
> + */
> +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
> +{
> +	unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
> +	struct pcpu_bitmap_md *block = chunk->md_blocks + index;
> +	bool is_left_free = false, is_right_free = false;
> +	int contig;
> +	unsigned long start, end;
> +
> +	block->contig_hint = 0;
> +	start = end = block->first_free;
> +	while (start < PCPU_BITMAP_BLOCK_SIZE) {
> +		/*
> +		 * Scans the allocation map corresponding to this block
> +		 * to find free fragments and update metadata accordingly.
> +		 */
> +		start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE,
> +					   start);
> +		if (start >= PCPU_BITMAP_BLOCK_SIZE)
> +			break;
> +		/* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */
> +		end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start);
> +		/* update left_free */
> +		contig = end - start;
> +		if (start == 0) {
> +			block->left_free = contig;
> +			is_left_free = true;
> +		}
> +		/* update right_free */
> +		if (end == PCPU_BITMAP_BLOCK_SIZE) {
> +			block->right_free = contig;
> +			is_right_free = true;
> +		}
> +		/* update block contig_hints */
> +		if (block->contig_hint < contig) {
> +			block->contig_hint = contig;
> +			block->contig_hint_start = start;
> +		}
> +		start = end;
> +	}
> +
> +	/* clear left/right free hints */
> +	if (!is_left_free)
> +		block->left_free = 0;
> +	if (!is_right_free)
> +		block->right_free = 0;
> +}
> +
> +/**
> + * pcpu_block_update_hint_alloc - update hint on allocation path
> + * @chunk: chunk of interest
> + * @bit_off: bitmap offset
> + * @bit_size: size of request in allocation units
>   *
> - * CONTEXT:
> - * Does GFP_KERNEL allocation.  Grabs and releases pcpu_lock.
> + * Updates metadata for the allocation path.  The metadata only has to be
> + * refreshed by a full scan iff we break the largest contig region.
>   *
>   * RETURNS:
> - * 0 on success, -errno on failure.
> + * Bool if we need to update the chunk's metadata. This occurs only if we
> + * break the chunk's contig hint.
>   */
> -static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc)
> +static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
> +					 int bit_size)
>  {
> -	int *old = NULL, *new = NULL;
> -	size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
> -	unsigned long flags;
> -
> -	lockdep_assert_held(&pcpu_alloc_mutex);
> +	bool update_chunk = false;
> +	int i;
> +	int s_index, e_index, s_off, e_off;
> +	struct pcpu_bitmap_md *s_block, *e_block, *block;
>  
> -	new = pcpu_mem_zalloc(new_size);
> -	if (!new)
> -		return -ENOMEM;
> +	/* calculate per block offsets */
> +	s_index = pcpu_off_to_block_index(bit_off);
> +	e_index = pcpu_off_to_block_index(bit_off + bit_size);
> +	s_off = pcpu_off_to_block_off(bit_off);
> +	e_off = pcpu_off_to_block_off(bit_off + bit_size);
>  
> -	/* acquire pcpu_lock and switch to new area map */
> -	spin_lock_irqsave(&pcpu_lock, flags);
> +	/*
> +	 * If the offset is the beginning of the next block, set it to the
> +	 * end of the previous block as the end offset is exclusive.
> +	 */
> +	if (e_off == 0) {
> +		e_off = PCPU_BITMAP_BLOCK_SIZE;
> +		e_index--;
> +	}
>  
> -	if (new_alloc <= chunk->map_alloc)
> -		goto out_unlock;
> +	s_block = chunk->md_blocks + s_index;
> +	e_block = chunk->md_blocks + e_index;
>  
> -	old_size = chunk->map_alloc * sizeof(chunk->map[0]);
> -	old = chunk->map;
> +	/*
> +	 * Update s_block.
> +	 *
> +	 * block->first_free must be updated if the allocation takes its place.
> +	 * If the allocation breaks the contig_hint, a scan is required to
> +	 * restore this hint.
> +	 */
> +	if (s_off == s_block->first_free)
> +		s_block->first_free = find_next_zero_bit(
> +					pcpu_index_alloc_map(chunk, s_index),
> +					PCPU_BITMAP_BLOCK_SIZE,
> +					s_off + bit_size);
> +
> +	if (s_off >= s_block->contig_hint_start &&
> +	    s_off < s_block->contig_hint_start + s_block->contig_hint) {
> +		pcpu_block_refresh_hint(chunk, s_index);
> +	} else {
> +		/* update left and right contig manually */
> +		s_block->left_free = min(s_block->left_free, s_off);
> +		if (s_index == e_index)
> +			s_block->right_free = min_t(int, s_block->right_free,
> +					PCPU_BITMAP_BLOCK_SIZE - e_off);
> +		else
> +			s_block->right_free = 0;
> +	}
>  
> -	memcpy(new, old, old_size);
> +	/*
> +	 * Update e_block.
> +	 * If s_index and e_index differ, then e_block's first_free is
> +	 * guaranteed to be at or beyond e_off.  first_free must be updated
> +	 * and a scan over e_block is issued.
> +	 */
> +	if (s_index != e_index) {
> +		e_block->first_free = find_next_zero_bit(
> +				pcpu_index_alloc_map(chunk, e_index),
> +				PCPU_BITMAP_BLOCK_SIZE, e_off);
>  
> -	chunk->map_alloc = new_alloc;
> -	chunk->map = new;
> -	new = NULL;
> +		pcpu_block_refresh_hint(chunk, e_index);
> +	}
>  
> -out_unlock:
> -	spin_unlock_irqrestore(&pcpu_lock, flags);
> +	/* update in-between md_blocks */
> +	for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index;
> +	     i++, block++) {
> +		block->contig_hint = 0;
> +		block->left_free = 0;
> +		block->right_free = 0;
> +	}
>  
>  	/*
> -	 * pcpu_mem_free() might end up calling vfree() which uses
> -	 * IRQ-unsafe lock and thus can't be called under pcpu_lock.
> +	 * The only time a full chunk scan is required is if the global
> +	 * contig_hint is broken.  Otherwise, it means a smaller space
> +	 * was used and therefore the global contig_hint is still correct.
>  	 */
> -	pcpu_mem_free(old);
> -	pcpu_mem_free(new);
> +	if (bit_off >= chunk->contig_hint_start &&
> +	    bit_off < chunk->contig_hint_start + chunk->contig_hint)
> +		update_chunk = true;
>  
> -	return 0;
> +	return update_chunk;
>  }
>  
>  /**
> - * pcpu_fit_in_area - try to fit the requested allocation in a candidate area
> - * @chunk: chunk the candidate area belongs to
> - * @off: the offset to the start of the candidate area
> - * @this_size: the size of the candidate area
> - * @size: the size of the target allocation
> - * @align: the alignment of the target allocation
> - * @pop_only: only allocate from already populated region
> - *
> - * We're trying to allocate @size bytes aligned at @align.  @chunk's area
> - * at @off sized @this_size is a candidate.  This function determines
> - * whether the target allocation fits in the candidate area and returns the
> - * number of bytes to pad after @off.  If the target area doesn't fit, -1
> - * is returned.
> - *
> - * If @pop_only is %true, this function only considers the already
> - * populated part of the candidate area.
> + * pcpu_block_update_hint_free - updates the block hints on the free path
> + * @chunk: chunk of interest
> + * @bit_off: bitmap offset
> + * @bit_size: size of request in allocation units
> + *
> + * Updates the hint along the free path by taking advantage of current metadata
> + * to minimize scanning of the bitmap.  Triggers a global update if an entire
> + * block becomes free or the free spans across blocks.  This tradeoff is to
> + * minimize global scanning to update the chunk->contig_hint.  The
> + * chunk->contig_hint may be off by up to a block, but a chunk->contig_hint
> + * will never be more than the available space.  If the chunk->contig_hint is
> + * in this block, it will be accurate.
> + *
> + * RETURNS:
> + * Bool if we need to update the chunk's metadata.  This occurs if a larger
> + * contig region is created along the edges or we free across blocks.
>   */
> -static int pcpu_fit_in_area(struct pcpu_chunk *chunk, int off, int this_size,
> -			    int size, int align, bool pop_only)
> +static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
> +					int bit_size)
>  {
> -	int cand_off = off;
> +	bool update_chunk = false;
> +	int i;
> +	int s_index, e_index, s_off, e_off;
> +	int start, end, contig;
> +	struct pcpu_bitmap_md *s_block, *e_block, *block;
>  
> -	while (true) {
> -		int head = ALIGN(cand_off, align) - off;
> -		int page_start, page_end, rs, re;
> +	/* calculate per block offsets */
> +	s_index = pcpu_off_to_block_index(bit_off);
> +	e_index = pcpu_off_to_block_index(bit_off + bit_size);
> +	s_off = pcpu_off_to_block_off(bit_off);
> +	e_off = pcpu_off_to_block_off(bit_off + bit_size);
> +
> +	/*
> +	 * If the offset is the beginning of the next block, set it to the
> +	 * end of the previous block as the end offset is exclusive.
> +	 */
> +	if (e_off == 0) {
> +		e_off = PCPU_BITMAP_BLOCK_SIZE;
> +		e_index--;
> +	}
> +
> +	s_block = chunk->md_blocks + s_index;
> +	e_block = chunk->md_blocks + e_index;
> +
> +	/*
> +	 * Check if the freed area aligns with the block->contig_hint.
> +	 * If it does, then the scan to find the beginning/end of the
> +	 * larger free area can be avoided.
> +	 * start and end refer to the beginning and end of the free region
> +	 * within their respective blocks.  This is not necessarily
> +	 * the entire free region as it may extend past the beginning
> +	 * or end of these blocks.
> +	 * or end of the block.
> +	 */
> +	start = s_off;
> +	if (s_off == s_block->contig_hint + s_block->contig_hint_start) {
> +		start = s_block->contig_hint_start;
> +	} else {
> +		int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index),
> +					  start);
> +		start = (start == l_bit) ? 0 : l_bit + 1;
> +	}
> +
> +	end = e_off;
> +	if (e_off == e_block->contig_hint_start)
> +		end = e_block->contig_hint_start + e_block->contig_hint;
> +	else
> +		end = find_next_bit(pcpu_index_alloc_map(chunk, e_index),
> +				    PCPU_BITMAP_BLOCK_SIZE, end);
>  
> -		if (this_size < head + size)
> -			return -1;
> +	/* freeing in the same block */
> +	if (s_index == e_index) {
> +		contig = end - start;
>  
> -		if (!pop_only)
> -			return head;
> +		if (start == 0)
> +			s_block->left_free = contig;
>  
> +		if (end == PCPU_BITMAP_BLOCK_SIZE)
> +			s_block->right_free = contig;
> +
> +		s_block->first_free = min(s_block->first_free, start);
> +		if (contig > s_block->contig_hint) {
> +			s_block->contig_hint = contig;
> +			s_block->contig_hint_start = start;
> +		}
> +
> +	} else {
>  		/*
> -		 * If the first unpopulated page is beyond the end of the
> -		 * allocation, the whole allocation is populated;
> -		 * otherwise, retry from the end of the unpopulated area.
> +		 * Freeing across md_blocks.
> +		 *
> +		 * If the start is at the beginning of the block, just
> +		 * reset the block instead.
>  		 */
> -		page_start = PFN_DOWN(head + off);
> -		page_end = PFN_UP(head + off + size);
> -
> -		rs = page_start;
> -		pcpu_next_unpop(chunk, &rs, &re, PFN_UP(off + this_size));
> -		if (rs >= page_end)
> -			return head;
> -		cand_off = re * PAGE_SIZE;
> +		if (start == 0) {
> +			s_index--;
> +		} else {
> +			/*
> +			 * Knowing that the free is across blocks, this means
> +			 * the hint can be updated on the right side and the
> +			 * left side does not need to be touched.
> +			 */
> +			s_block->first_free = min(s_block->first_free, start);
> +			contig = PCPU_BITMAP_BLOCK_SIZE - start;
> +			s_block->right_free = contig;
> +			if (contig > s_block->contig_hint) {
> +				s_block->contig_hint = contig;
> +				s_block->contig_hint_start = start;
> +			}
> +		}
> +		/*
> +		 * If end is the entire e_block, just reset the block
> +		 * as well.
> +		 */
> +		if (end == PCPU_BITMAP_BLOCK_SIZE) {
> +			e_index++;
> +		} else {
> +			/*
> +			 * The hint must only be on the left side, so
> +			 * update accordingly.
> +			 */
> +			e_block->first_free = 0;
> +			e_block->left_free = end;
> +			if (end > e_block->contig_hint) {
> +				e_block->contig_hint = end;
> +				e_block->contig_hint_start = 0;
> +			}
> +		}
> +
> +		/* reset md_blocks in the middle */
> +		for (i = s_index + 1, block = chunk->md_blocks + i;
> +		     i < e_index; i++, block++) {
> +			block->first_free = 0;
> +			block->contig_hint_start = 0;
> +			block->contig_hint = PCPU_BITMAP_BLOCK_SIZE;
> +			block->left_free = PCPU_BITMAP_BLOCK_SIZE;
> +			block->right_free = PCPU_BITMAP_BLOCK_SIZE;
> +		}
>  	}
> +
> +	/*
> +	 * The hint is only checked in the s_block and e_block when
> +	 * freeing and only when the freed area is self-contained within
> +	 * its own block.  A scan is required if the free space spans
> +	 * blocks or frees an entire block as the scan will take into
> +	 * account free space across blocks.
> +	 */
> +	if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) ||
> +	    s_index != e_index) {
> +		update_chunk = true;
> +	} else if (s_block->contig_hint > chunk->contig_hint) {
> +		/* check if block contig_hint is bigger */
> +		chunk->contig_hint = s_block->contig_hint;
> +		chunk->contig_hint_start =
> +			pcpu_block_off_to_off(s_index,
> +					      s_block->contig_hint_start);
> +	}
> +
> +	return update_chunk;
>  }
>  
>  /**
> - * pcpu_alloc_area - allocate area from a pcpu_chunk
> + * pcpu_is_populated - determines if the region is populated
>   * @chunk: chunk of interest
> - * @size: wanted size in bytes
> - * @align: wanted align
> - * @pop_only: allocate only from the populated area
> - * @occ_pages_p: out param for the number of pages the area occupies
> + * @index: block index
> + * @block_off: offset within the bitmap
> + * @bit_size: size of request in allocation units
> + * @next_index: return value for next block index that is populated
>   *
> - * Try to allocate @size bytes area aligned at @align from @chunk.
> - * Note that this function only allocates the offset.  It doesn't
> - * populate or map the area.
> + * For atomic allocations, we must check if the backing pages are populated.
>   *
> - * @chunk->map must have at least two free slots.
> + * RETURNS:
> + * Bool if the backing pages are populated.  next_index is to skip over
> + * unpopulated blocks in pcpu_find_block_fit.
> + */
> +static bool pcpu_is_populated(struct pcpu_chunk *chunk, int index,
> +			      int block_off, int bit_size, int *next_index)
> +{
> +	int page_start, page_end, rs, re;
> +	int off = pcpu_block_off_to_off(index, block_off);
> +	int e_off = off + bit_size * PCPU_MIN_ALLOC_SIZE;
> +
> +	page_start = PFN_DOWN(off);
> +	page_end = PFN_UP(e_off);
> +
> +	rs = page_start;
> +	pcpu_next_unpop(chunk, &rs, &re, PFN_UP(e_off));
> +	if (rs >= page_end)
> +		return true;
> +	*next_index = re * PAGE_SIZE / PCPU_BITMAP_BLOCK_SIZE;
> +	return false;
> +}
> +
> +/**
> + * pcpu_find_block_fit - finds the block index to start searching
> + * @chunk: chunk of interest
> + * @bit_size: size of request in allocation units
> + * @align: alignment of area (max PAGE_SIZE)
> + * @pop_only: use populated regions only
>   *
> - * CONTEXT:
> - * pcpu_lock.
> + * Given a chunk and an allocation spec, find the offset to begin searching
> + * for a free region.  This is done by iterating over the bitmap metadata
> + * blocks and then only returning regions that will be guaranteed to fit
> + * alignment by comparing against the block->contig_hint_start or a correctly
> + * aligned offset.  Iteration is used within a block as an allocation may be
> + * able to be served prior to the contig_hint.
> + *
> + * Note: This errs on the side of caution by only selecting blocks guaranteed
> + *	 to have a fit in the chunk's contig_hint.  Poor alignment can cause
> + *	 us to skip over blocks that have valid vacancies.
>   *
>   * RETURNS:
> - * Allocated offset in @chunk on success, -1 if no matching area is
> - * found.
> + * The offset in the bitmap to begin searching.
> + * -1 if no offset is found.
>   */
> -static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align,
> -			   bool pop_only, int *occ_pages_p)
> +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
> +			       size_t align, bool pop_only)
>  {
> -	int oslot = pcpu_chunk_slot(chunk);
> -	int max_contig = 0;
> -	int i, off;
> -	bool seen_free = false;
> -	int *p;
> -
> -	for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) {
> -		int head, tail;
> -		int this_size;
> -
> -		off = *p;
> -		if (off & 1)
> -			continue;
> +	int i, cur_free;
> +	int s_index, block_off, next_index, end_off; /* interior alloc index */
> +	struct pcpu_bitmap_md *block;
> +	unsigned long *alloc_map;
>  
> -		this_size = (p[1] & ~1) - off;
> +	lockdep_assert_held(&pcpu_lock);
>  
> -		head = pcpu_fit_in_area(chunk, off, this_size, size, align,
> -					pop_only);
> -		if (head < 0) {
> -			if (!seen_free) {
> -				chunk->first_free = i;
> -				seen_free = true;
> -			}
> -			max_contig = max(this_size, max_contig);
> +	cur_free = block_off = 0;
> +	s_index = chunk->first_free_block;
> +	for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk);
> +	     i++) {
> +		alloc_map = pcpu_index_alloc_map(chunk, i);
> +		block = chunk->md_blocks + i;
> +
> +		/* continue from prev block */
> +		cur_free += block->left_free;
> +		if (cur_free >= bit_size) {
> +			end_off = bit_size;
> +			goto check_populated;
> +		} else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
>  			continue;
>  		}
>  
>  		/*
> -		 * If head is small or the previous block is free,
> -		 * merge'em.  Note that 'small' is defined as smaller
> -		 * than sizeof(int), which is very small but isn't too
> -		 * uncommon for percpu allocations.
> +		 * Can this block hold this alloc?
> +		 *
> +		 * Here the block->contig_hint is used to guarantee a fit,
> +		 * but the block->first_free is returned as we may be able
> +		 * to serve the allocation earlier.  The population check
> +		 * must take into account the area beginning at first_free
> +		 * through the end of the contig_hint.
>  		 */
> -		if (head && (head < sizeof(int) || !(p[-1] & 1))) {
> -			*p = off += head;
> -			if (p[-1] & 1)
> -				chunk->free_size -= head;
> -			else
> -				max_contig = max(*p - p[-1], max_contig);
> -			this_size -= head;
> -			head = 0;
> +		cur_free = 0;
> +		s_index = i;
> +		block_off = ALIGN(block->contig_hint_start, align);
> +		block_off -= block->contig_hint_start;
> +		if (block->contig_hint >= block_off + bit_size) {
> +			block_off = block->first_free;
> +			end_off = block->contig_hint_start - block_off +
> +				  bit_size;
> +			goto check_populated;
>  		}
>  
> -		/* if tail is small, just keep it around */
> -		tail = this_size - head - size;
> -		if (tail < sizeof(int)) {
> -			tail = 0;
> -			size = this_size - head;
> +		/* check right */
> +		block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free,
> +				  align);
> +		/* reset to start looking in the next block */
> +		if (block_off >= PCPU_BITMAP_BLOCK_SIZE) {
> +			s_index++;
> +			cur_free = block_off = 0;
> +			continue;
>  		}
> +		cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off;
> +		if (cur_free >= bit_size) {
> +			end_off = bit_size;
> +check_populated:
> +			if (!pop_only ||
> +			    pcpu_is_populated(chunk, s_index, block_off,
> +					      end_off, &next_index))
> +				break;
>  
> -		/* split if warranted */
> -		if (head || tail) {
> -			int nr_extra = !!head + !!tail;
> -
> -			/* insert new subblocks */
> -			memmove(p + nr_extra + 1, p + 1,
> -				sizeof(chunk->map[0]) * (chunk->map_used - i));
> -			chunk->map_used += nr_extra;
> -
> -			if (head) {
> -				if (!seen_free) {
> -					chunk->first_free = i;
> -					seen_free = true;
> -				}
> -				*++p = off += head;
> -				++i;
> -				max_contig = max(head, max_contig);
> -			}
> -			if (tail) {
> -				p[1] = off + size;
> -				max_contig = max(tail, max_contig);
> -			}
> +			i = next_index - 1;
> +			s_index = next_index;
> +			cur_free = block_off = 0;
>  		}
> +	}
>  
> -		if (!seen_free)
> -			chunk->first_free = i + 1;
> +	/* nothing found */
> +	if (i == pcpu_nr_pages_to_blocks(chunk))
> +		return -1;
>  
> -		/* update hint and mark allocated */
> -		if (i + 1 == chunk->map_used)
> -			chunk->contig_hint = max_contig; /* fully scanned */
> -		else
> -			chunk->contig_hint = max(chunk->contig_hint,
> -						 max_contig);
> +	return s_index * PCPU_BITMAP_BLOCK_SIZE + block_off;
> +}
>  
> -		chunk->free_size -= size;
> -		*p |= 1;
>  
> -		*occ_pages_p = pcpu_count_occupied_pages(chunk, i);
> -		pcpu_chunk_relocate(chunk, oslot);
> -		return off;
> -	}
> +/**
> + * pcpu_alloc_area - allocates area from a pcpu_chunk
> + * @chunk: chunk of interest
> + * @bit_size: size of request in allocation units
> + * @align: alignment of area (max PAGE_SIZE)
> + * @start: bit_off to start searching
> + *
> + * This function takes in a start bit_offset to begin searching.  It
> + * searches the allocation bitmap from there to verify that the offset is
> + * actually available, as the start may only be block->first_free rather
> + * than a guaranteed fit.
> + *
> + * RETURNS:
> + * Allocated addr offset in @chunk on success,
> + * -1 if no matching area is found
> + */
> +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size,
> +			   size_t align, int start)
> +{
> +	size_t align_mask = (align) ? (align - 1) : 0;
> +	int i, bit_off, oslot;
> +	struct pcpu_bitmap_md *block;
> +
> +	lockdep_assert_held(&pcpu_lock);
> +
> +	oslot = pcpu_chunk_slot(chunk);
> +
> +	/* search to find fit */
> +	bit_off = bitmap_find_next_zero_area(chunk->alloc_map,
> +					     pcpu_nr_pages_to_bits(chunk),
> +					     start, bit_size, align_mask);
> +
> +	if (bit_off >= pcpu_nr_pages_to_bits(chunk))
> +		return -1;
> +
> +	/* update alloc map */
> +	bitmap_set(chunk->alloc_map, bit_off, bit_size);
> +	/* update boundary map */
> +	set_bit(bit_off, chunk->bound_map);
> +	bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1);
> +	set_bit(bit_off + bit_size, chunk->bound_map);
> +

Actually I decided I do want to complain about this.  Have you considered making
chunks statically sized, like slab does?  We could avoid this whole bound_map
thing completely and save quite a few cycles trying to figure out how big our
allocation was.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-19 19:16   ` Josef Bacik
@ 2017-07-19 22:13     ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-19 22:13 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel,
	linux-mm, Dennis Zhou

Hi Josef,

Thanks for taking a look at my code.

On Wed, Jul 19, 2017 at 07:16:35PM +0000, Josef Bacik wrote:
> 
> Actually I decided I do want to complain about this.  Have you considered making
> chunks statically sized, like slab does?  We could avoid this whole bound_map
> thing completely and save quite a few cycles trying to figure out how big our
> allocation was.  Thanks,

I did consider something along the lines of a slab allocator, but
ultimately utilization and fragmentation were why I decided against it.

Percpu memory is handled by giving each cpu its own copy of the object
to use. This means cpus can avoid cache-coherence traffic when accessing and
manipulating the object. To do this, the percpu allocator creates chunks
to serve each allocation out of. Because each cpu has its own copy, there
is a high cost for having each chunk lying around (and this memory in
general).
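
For reference, a typical consumer looks something like the sketch below
(a minimal example, the names are illustrative and not taken from any
particular caller):

	#include <linux/cpumask.h>
	#include <linux/errno.h>
	#include <linux/percpu.h>
	#include <linux/types.h>

	struct counter {
		u64 hits;
	};

	static struct counter __percpu *ctrs;

	static int counter_init(void)
	{
		/* one copy of struct counter for every possible cpu */
		ctrs = alloc_percpu(struct counter);
		return ctrs ? 0 : -ENOMEM;
	}

	static void counter_hit(void)
	{
		/* fast path: only this cpu's copy is touched */
		this_cpu_inc(ctrs->hits);
	}

	static u64 counter_total(void)
	{
		u64 sum = 0;
		int cpu;

		/* slow path: walk every cpu's copy to report a total */
		for_each_possible_cpu(cpu)
			sum += per_cpu_ptr(ctrs, cpu)->hits;
		return sum;
	}

	static void counter_exit(void)
	{
		free_percpu(ctrs);
	}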

With slab allocation, the allocator takes the liberty of caching commonly
used sizes and accepting internal fragmentation for performance.
Unfortunately, the percpu memory allocator does not necessarily know what
is going to get allocated. It would need to keep many slabs around to
serve each allocation, which can be quite expensive. In the worst case,
long-lived percpu allocations can keep entire slabs alive as there is no
way to perform consolidation once addresses are given out. Additionally,
any internal fragmentation caused by ill-fitting objects is amplified by
the number of possible cpus.

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-19 19:11   ` Josef Bacik
@ 2017-07-19 22:19     ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-19 22:19 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel,
	linux-mm, Dennis Zhou

Hi Josef,

On Wed, Jul 19, 2017 at 07:11:06PM +0000, Josef Bacik wrote:
> 
> This was a bear to review, I feel like it could be split into a few smaller
> pieces.  You are changing hinting, allocating/freeing, and how you find chunks.
> Those seem like good logical divisions to me.  Overall the design seems sound
> and I didn't spot any major problems.  Once you've split them up I'll do another
> thorough comb through and then add my reviewed by.  Thanks,

Yeah.. Thanks for taking the time. I'm currently working on responding
to Tejun's feedback and will do my best to split it down further. I've
done a bit of refactoring which should help readability and hopefully
make it easier in the next pass.

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned
  2017-07-17 16:46   ` Tejun Heo
  2017-07-17 19:10     ` Dennis Zhou
@ 2017-07-24 20:04     ` Dennis Zhou
  1 sibling, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-24 20:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hi Tejun,

On Mon, Jul 17, 2017 at 12:46:50PM -0400, Tejun Heo wrote:
 
> Heh, that was pretty difficult to parse, but here's my question.  So,
> we're expanding reserved area so that its end aligns to page boundary
> which is completely fine.  We may end up with reserved area which is a
> bit larger than specified but no big deal.  However, we can't do the
> same thing with the boundary between the static and reserved chunks,
> so instead we pull down the start of the reserved area and mark off
> the overwrapping area, which is fine too.
> 
> My question is why we're doing one thing for the end of reserved area
> while we need to do a different thing for the beginning of it.  Can't
> we do the same thing in both cases?  ie. for the both boundaries
> between static and reserved, and reserved and dynamic, pull down the
> start to the page boundary and mark the overlapping areas used?

I've refactored the code to maintain start and end offsets. This removes
the need to expand the reserved region. There are a few more constraints
though. The reserved region must be a multiple of the minimum allocation
size. The static region and dynamic region are expanded and shrunk
respectively to maintain alignment with the minimum allocation size.
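
Roughly, the bookkeeping now looks something like the sketch below (the
names are hypothetical and this is not the actual v2 diff):

	#include <linux/kernel.h>
	#include <linux/percpu.h>

	/*
	 * Illustrative only: the static region is rounded up to
	 * PCPU_MIN_ALLOC_SIZE and the dynamic region gives up the
	 * difference, so the region the first chunk manages stays aligned
	 * to the minimum allocation size.
	 */
	static void compute_region_offsets(size_t static_size,
					   size_t reserved_size,
					   size_t dyn_size,
					   unsigned long *start_off,
					   unsigned long *end_off)
	{
		size_t aligned_static = ALIGN(static_size, PCPU_MIN_ALLOC_SIZE);

		/* the dynamic region shrinks by what the static region gained */
		dyn_size -= aligned_static - static_size;

		*start_off = aligned_static;	/* reserved region begins here */
		*end_off = aligned_static + reserved_size + dyn_size;
	}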

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names
  2017-07-17 19:10   ` Tejun Heo
@ 2017-07-24 20:07     ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-24 20:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hi Tejun,

On Mon, Jul 17, 2017 at 03:10:09PM -0400, Tejun Heo wrote:

> >  	/*
> > +	 * Initialize first chunk.
> > +	 * pcpu_first_chunk will always manage the dynamic region of the
> > +	 * first chunk.  The static region is dropped as those addresses
> 
> Would "not covered by any chunk" be clearer than "dropped"?

I've updated the comments in the new revision. This is explained in the
function comment now.

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps
  2017-07-17 19:26   ` Tejun Heo
@ 2017-07-24 20:13     ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-24 20:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Mon, Jul 17, 2017 at 03:26:02PM -0400, Tejun Heo wrote:
> > This patch changes the allocator to only mark allocated pages for the
> > region the population bitmap is used for. Prior, the bitmap was marked
> > completely used as the first chunk was allocated and immutable. This is
> > misleading because the first chunk may not be completely filled.
> > Additionally, with moving the base_addr up in the previous patch, the
> > population map no longer corresponds to what is being checked.
> 
> This in isolation makes sense although the rationale isn't clear from
> the description.  Is it a mere cleanup or is this needed to enable
> further changes?

This change is cleanup to make sure there is no misunderstanding
between what part of the bitmap is actually meaningful and the actual
size of the bitmap.


> > pcpu_nr_empty_pop_pages is used to ensure there are a handful of free
> > pages around to serve atomic allocations. A new field, nr_empty_pop_pages,
> > is added to the pcpu_chunk struct to keep track of the number of empty
> > pages. This field is needed as the number of empty populated pages is
> > globally kept track of and deltas are used to update it. This new field
> > is exposed in percpu_stats.
> 
> But I can't see why this is being added or why this is in the same
> patch with the previous change.
> 

I've split this out into another patch.

> > Now that chunk->nr_pages is the number of pages the chunk is serving, it
> > is nice to use this in the work function for population and freeing of
> > chunks rather than use the global variable pcpu_unit_pages.
> 
> The same goes for the above part.  It's fine to collect misc changes
> into a patch when they're trivial and related in some ways but the
> content of this patch seems a bit random.

This change is needed in the same patch because chunk->nr_populated is no
longer set to pcpu_unit_pages. The checks would check the dynamic
chunk and then try to populate. Those checks should be checking against
the size of the region being served, which is nr_pages.

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator
  2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik
@ 2017-07-24 21:14   ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-24 21:14 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel,
	linux-mm, Dennis Zhou

Hi Josef,

On Tue, Jul 18, 2017 at 03:15:28PM -0400, Josef Bacik wrote:
 
> Ok so you say that this test is better for the area map allocator, so presumably
> this is worst case for the bitmap allocator?  What does the average case look
> like?

The bitmap allocator should perform relatively consistently in most
allocation schemes. The worst case would be where constant scanning is
required, such as the scenario of allocating 4 bytes with 8-byte alignment.
As for the average case, I would expect similar performance. These are
updated values with v2.

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        310      |     4770
             16B    |        557      |     1325
             64B    |        436      |      273
            256B    |        776      |      131
           1024B    |       3280      |      122

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        490      |       70
             16B    |        515      |       75
             64B    |        610      |       80
            256B    |        950      |      100
           1024B    |       3520      |      200

> Trading 2x allocation latency for a pretty significant free latency
> reduction seems ok, but are we allocating or freeing more?  Are both allocations
> and free's done in performance critical areas, or do allocations only happen in
> performance critical areas, making the increased allocation latency hurt more?

I unfortunately do not have any data to support whether we are
allocating or freeing more in critical areas. I would defer use-case
questions to those consuming percpu memory. There is one issue that
stood out, though, with BPF causing the cpu to hang due to the percpu
allocator. This leads me to believe that freeing is of more concern. If
allocation latency is a major problem, I believe it is easier for the
user to preallocate than it is to manage delayed freeing.

> And lastly, why are we paying a 2x latency cost?  What is it about the bitmap
> allocator that makes it much worse than the area map allocator?

So it turns out v1 was making poor use of control logic and the
compiler wasn't doing me any favors. I've since done a rather large
refactor with v2 which shows better performance overall. The bitmap
allocator is paying more in that it has to manage metadata that the area
map allocator does not. In the optimal case, the area map allocator is
just an append without any heavy operations.

A worse-performing scenario for the area map allocator is when the
second half of a chunk is occupied by many small allocations. In
that case, each new allocation has to memmove the rest of the area map. This
scenario is nontrivial to reproduce, so I provide data below for a
similar scenario where the second half of each page is filled.

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |       4118      |     4892
             16B    |       1651      |     1163
             64B    |        598      |      285
            256B    |        771      |      158
           1024B    |       3034      |      160

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        481      |       67
             16B    |        506      |       69
             64B    |        636      |       75
            256B    |        892      |       90
           1024B    |       3262      |      147

This shows that in less ideal chunk states, the allocation path can degrade
just as the freeing path does with non-ideal deallocation patterns.
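
For reference, the cost I'm referring to is essentially the insertion
pattern below (an illustrative userspace sketch, not the allocator code
itself):

	#include <stdio.h>
	#include <string.h>

	/*
	 * Illustrative only: the area map keeps one int per area, sorted by
	 * offset, so inserting an allocation in the middle shifts everything
	 * after the insertion point.  That memmove dominates the worst-case
	 * numbers above.
	 */
	static int map[1 << 16];
	static int map_used;

	static void insert_area(int idx, int off)
	{
		/* O(map_used - idx) copies for a single allocation */
		memmove(&map[idx + 1], &map[idx],
			(map_used - idx) * sizeof(map[0]));
		map[idx] = off;
		map_used++;
	}

	int main(void)
	{
		int i;

		/* fill the "second half", then keep inserting near the front */
		for (i = 0; i < 1 << 15; i++)
			insert_area(map_used, i);
		for (i = 0; i < 1 << 10; i++)
			insert_area(1, i);	/* each call moves ~32k ints */

		printf("%d entries\n", map_used);
		return 0;
	}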

Thanks,
Dennis


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-17 23:27   ` Tejun Heo
@ 2017-07-24 21:37     ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-24 21:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Mon, Jul 17, 2017 at 07:27:56PM -0400, Tejun Heo wrote:
> I think it'd be nice to have a similar table for allocation patterns
> which aren't ideal for the original allocator.  The biggest goal is
> avoiding cases where the allocator collapses and just glancing at the
> table doesn't seem very compelling.

I've added new data to the cover letter and in the commit log to
demonstrate poor allocation performance.

> >  /*
> > + * This determines the size of each metadata block.  There are several subtle
> > + * constraints around this variable.  The reserved_region and dynamic_region
>                               ^
> 			      constant
> 			      

Fixed.

> > + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE.  This is
> > + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks
> > + * that are backing multiple pages, this needs to be accounted for.
> > + */
> > +#define PCPU_BITMAP_BLOCK_SIZE		(PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)
> 
> Given that percpu allocator can align only upto a page, the
> restriction makes sense to me.  I'm kinda curious whether PAGE_SIZE
> blocks is optimal tho.  Why did you pick PAGE_SIZE?
> 

I've refactored v2 to be able to handle block sizes with certain
constraints. The PAGE_SIZE blocks were really just a balance between
the amount that has to be scanned per block and the number of metadata
blocks. I tested 2KB blocks and they performed marginally worse.
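
To spell out the geometry with concrete numbers (assuming the common
PAGE_SIZE of 4096 and the 4-byte minimum allocation; the EX_ names are
only for this example):

/* Example constants only; real values come from the architecture. */
#define EX_PAGE_SIZE            4096
#define EX_MIN_ALLOC_SHIFT      2       /* 4-byte minimum allocation */
#define EX_BLOCK_SIZE           (EX_PAGE_SIZE >> EX_MIN_ALLOC_SHIFT)    /* 1024 bits */

/*
 * One bit per 4-byte fragment means 1024 bits cover 4096 bytes, so each
 * metadata block lines up with exactly one page.  A 2KB block would halve
 * the scan length per block but double the number of blocks per page.
 */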

> > +/*
>    ^^
>    /**
> 
> Ditto for other comments.
> 

Fixed.

> > + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks
> > + * @chunk: chunk of interest
> > + *
> > + * This conversion is from the number of physical pages that the chunk
> > + * serves to the number of bitmap blocks required.  It converts to bytes
> > + * served to bits required and then blocks used.
> > + */
> > +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk)
> 
> Maybe just pcpu_chunk_nr_blocks()?
> 

Renamed.

> > +{
> > +	return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE /
> > +	       PCPU_BITMAP_BLOCK_SIZE;
> > +}
> > +
> > +/*
> > + * pcpu_pages_to_bits - converts the pages to size of bitmap
> > + * @pages: number of physical pages
> > + *
> > + * This conversion is from physical pages to the number of bits
> > + * required in the bitmap.
> > + */
> > +static inline int pcpu_pages_to_bits(int pages)
> 
> pcpu_nr_pages_to_map_bits()?
> 

Renamed.

> > +{
> > +	return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
> > +}
> > +
> > +/*
> > + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap
> > + * @chunk: chunk of interest
> > + *
> > + * This conversion is from the number of physical pages that the chunk
> > + * serves to the number of bits in the bitmap.
> > + */
> > +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk)
> 
> pcpu_chunk_map_bits()?
> 

Renamed.
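
Carrying the same example constants forward, the two conversions
compose like this for a hypothetical 8-page chunk (illustration only,
using the renamed helpers' semantics):

/* Hypothetical 8-page chunk under the example constants above. */
enum {
        EX_NR_PAGES  = 8,
        EX_MAP_BITS  = EX_NR_PAGES * 4096 / 4,  /* pcpu_chunk_map_bits(): 8192 bits */
        EX_MD_BLOCKS = EX_MAP_BITS / 1024,      /* pcpu_chunk_nr_blocks(): 8 blocks,
                                                   one metadata block per page */
};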

> > @@ -86,10 +90,13 @@
> >  
> >  #include "percpu-internal.h"
> >  
> > -#define PCPU_SLOT_BASE_SHIFT		5	/* 1-31 shares the same slot */
> > -#define PCPU_DFL_MAP_ALLOC		16	/* start a map with 16 ents */
> > -#define PCPU_ATOMIC_MAP_MARGIN_LOW	32
> > -#define PCPU_ATOMIC_MAP_MARGIN_HIGH	64
> > +/*
> > + * The metadata is managed in terms of bits with each bit mapping to
> > + * a fragment of size PCPU_MIN_ALLOC_SIZE.  Thus, the slots are calculated
> > + * with respect to the number of bits available.
> > + */
> > +#define PCPU_SLOT_BASE_SHIFT		3
> 
> Ah, so this is actually the same as before 3 + 2, order 5.  Can you
> please note the explicit number in the comment?
> 

With the refactor in v2, the slots are back to being managed in bytes.

> >  #define PCPU_EMPTY_POP_PAGES_LOW	2
> >  #define PCPU_EMPTY_POP_PAGES_HIGH	4
> 
> and these numbers too.  I can't tell how these numbers would map.
> Also, any chance we can have these numbers in a more intuitive unit?
> 

Those are carried over from the original implementation; they decide
when to schedule work to repopulate the pool of empty free pages.


> > @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr)
> >  	       pcpu_reserved_chunk->nr_pages * PAGE_SIZE;
> >  }
> >  
> > -static int __pcpu_size_to_slot(int size)
> > +static int __pcpu_size_to_slot(int bit_size)
> 
> Wouldn't sth like @map_bits more intuitive than @bit_size?  We can
> just use @bits too.
> 

Back to bytes.

> >  {
> > -	int highbit = fls(size);	/* size is in bytes */
> > +	int highbit = fls(bit_size);	/* size is in bits */
> >  	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
> >  }
> >  
> > -static int pcpu_size_to_slot(int size)
> > +static int pcpu_size_to_slot(int bit_size)
> 
> Ditto.
> 

Back to bytes.
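
For reference, the byte-based math works out like this (standalone
sketch using the original PCPU_SLOT_BASE_SHIFT of 5; fls() is
open-coded so the snippet builds in userspace):

#include <stdio.h>

#define SLOT_BASE_SHIFT 5

/* Userspace stand-in for the kernel's fls(): position of the highest set bit. */
static int fls_sketch(unsigned int x)
{
        return x ? 32 - __builtin_clz(x) : 0;
}

static int size_to_slot(int size)
{
        int slot = fls_sketch(size) - SLOT_BASE_SHIFT + 2;      /* size is in bytes */

        return slot > 1 ? slot : 1;
}

int main(void)
{
        /* 8B -> slot 1, 64B -> slot 4, 1024B -> slot 8 */
        printf("%d %d %d\n", size_to_slot(8), size_to_slot(64),
               size_to_slot(1024));
        return 0;
}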

> > +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk)
> 
> It's a span iteration problem.  When abstracted properly, it shouldn't
> be too difficult to follow.
> 

I agree, I've refactored to use an iterator.

> > +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
 
> It's a lot simpler here but this too might look simpler with an
> appropriate iteration abstraction.

Generalized the populated bitmap iterators so it can be used here.

> 
> Hmm... why do we need is_left/right_free?  Can't we reset them to zero
> at the top and update directly during the loop?
> 

Yes. Done.

> 
> @update_chunk seems unnecessary.
> 

I've moved the chunk refresh call to be here.

> 
> So, if you do the above with inclusive range, it becomes
> 
> 	s_index = pcpu_off_to_block_index(start_bit);
> 	e_index = pcpu_off_to_block_index(end_bit - 1);
> 	s_off = pcpu_off_to_block_off(start_bit);
> 	e_off = pcpu_off_to_block_off(end_bit - 1) + 1;
> 
> and you can just comment that you're using inclusive range so that the
> e_index always points to the last block in the range.  Wouldn't that
> be easier?  People do use inclusive ranges for these sorts of
> calculations.

Ah I see, thanks. Done.
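
A quick worked example of why the inclusive form is less error-prone
(1024-bit blocks as in the example above; plain divisions standing in
for the helpers):

/*
 * An allocation covering bits [1000, 1024) ends exactly on a block
 * boundary (1024-bit blocks assumed).
 */
enum {
        EX_BLOCK_BITS  = 1024,
        EX_S_INDEX     = 1000 / EX_BLOCK_BITS,          /* 0 */
        EX_E_INDEX_EXC = 1024 / EX_BLOCK_BITS,          /* 1: one past the last block touched */
        EX_E_INDEX_INC = (1024 - 1) / EX_BLOCK_BITS,    /* 0: the last block actually touched */
};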

> 
> How about something like the following?  It's kinda weird to have an
> extra loop var which isn't really used for anything.  The same goes
> for other places too.
> 
> 		for (block = chunk->md_blocks + s_index + 1;
> 		     block < chunk->md_blocks + e_index; block++)
> 

I've rewritten most for loops to do this.

> > +			block->first_free = 0;
 
> > +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
> > +			       size_t align, bool pop_only)
> 
> Wouldn't this function be a lot simpler too if there were free span
> iterator?
> 

Yes added an iterator for this.
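
One possible shape for such an iterator, sketched in userspace with a
bool array standing in for the real bitmap (the kernel side would sit
on top of find_next_bit()/find_next_zero_bit(); the names here are made
up for the sketch, not the v2 code):

#include <stdbool.h>

/* Advance *start to the next free bit and *end past the end of that free run. */
void next_free_span(const bool *map, int nbits, int *start, int *end)
{
        int i = *start;

        while (i < nbits && map[i])     /* skip allocated bits */
                i++;
        *start = i;
        while (i < nbits && !map[i])    /* extend over free bits */
                i++;
        *end = i;
}

/* Iterate every free span [s, e) in the map. */
#define for_each_free_span(s, e, map, nbits)                            \
        for ((s) = 0, next_free_span((map), (nbits), &(s), &(e));       \
             (s) < (nbits);                                             \
             (s) = (e), next_free_span((map), (nbits), &(s), &(e)))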

> > +
> > +	/* update alloc map */
> > +	bitmap_set(chunk->alloc_map, bit_off, bit_size);
> 
> blank line
> 

Added.

> 
> Do we ever not update chunk hint when block hint indicates that it's
> necessary?  If not, maybe just call it from the previous function?
> 

No. I've added the call to the previous function.

> > @@ -787,6 +1179,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk,
> >  
> >  	bitmap_set(chunk->populated, page_start, nr);
> >  	chunk->nr_populated += nr;
> > +	chunk->nr_empty_pop_pages += nr;
> >  	pcpu_nr_empty_pop_pages += nr;
> >  }
> >  
> > @@ -809,6 +1202,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk,
> >  
> >  	bitmap_clear(chunk->populated, page_start, nr);
> >  	chunk->nr_populated -= nr;
> > +	chunk->nr_empty_pop_pages -= nr;
> >  	pcpu_nr_empty_pop_pages -= nr;
> >  }
> 
> Didn't we add this field in an earlier patch?  Do the above changes
> belong in this patch?
> 

Yes, moved to previous patch.

> > +	size_t bit_size, bit_align;
> >  
> >  	/*
> > +	 * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE.
> > +	 * Therefore alignment must be a minimum of that many bytes as well
> > +	 * as the allocation will have internal fragmentation from
> > +	 * rounding up by up to PCPU_MIN_ALLOC_SIZE - 1 bytes.
> >  	 */
> > +	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
> > +		align = PCPU_MIN_ALLOC_SIZE;
> > +	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
> > +	bit_size = size >> PCPU_MIN_ALLOC_SHIFT;
> > +	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
> 
> Shouldn't the above have happened earlier when MIN_ALLOC_SIZE was
> introduced?
> 

Yes, moved to previous patch.
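
A worked example of the rounding (standalone, with the 4-byte minimum
allocation from the patch; the EX_ names are only for the example):

#define EX_MIN_ALLOC_SHIFT      2
#define EX_MIN_ALLOC_SIZE       (1 << EX_MIN_ALLOC_SHIFT)
#define EX_ALIGN_UP(x, a)       (((x) + (a) - 1) & ~((a) - 1))

/* A 13-byte request with 1-byte alignment: */
enum {
        EX_SIZE      = EX_ALIGN_UP(13, EX_MIN_ALLOC_SIZE),      /* rounded up to 16 bytes */
        EX_BIT_SIZE  = EX_SIZE >> EX_MIN_ALLOC_SHIFT,           /* 4 bits */
        EX_ALIGN     = EX_MIN_ALLOC_SIZE,                       /* alignment bumped to 4 */
        EX_BIT_ALIGN = EX_ALIGN >> EX_MIN_ALLOC_SHIFT,          /* 1 bit */
        /* internal fragmentation: 16 - 13 = 3 bytes, at most EX_MIN_ALLOC_SIZE - 1 */
};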

> > @@ -1363,15 +1710,15 @@ bool is_kernel_percpu_address(unsigned long addr)
> >   * address.  The caller is responsible for ensuring @addr stays valid
> >   * until this function finishes.
> >   *
> > - * percpu allocator has special setup for the first chunk, which currently
> > + * Percpu allocator has special setup for the first chunk, which currently
> >   * supports either embedding in linear address space or vmalloc mapping,
> >   * and, from the second one, the backing allocator (currently either vm or
> >   * km) provides translation.
> >   *
> >   * The addr can be translated simply without checking if it falls into the
> > - * first chunk. But the current code reflects better how percpu allocator
> > + * first chunk.  But the current code reflects better how percpu allocator
> >   * actually works, and the verification can discover both bugs in percpu
> > - * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current
> > + * allocator itself and per_cpu_ptr_to_phys() callers.  So we keep current
> 
> Let's please move out what can be to other patches.  This patch is big
> enough as it is.
> 

Removed.

> > @@ -1417,9 +1764,10 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)
> >  		else
> >  			return page_to_phys(vmalloc_to_page(addr)) +
> >  			       offset_in_page(addr);
> > -	} else
> > +	} else {
> >  		return page_to_phys(pcpu_addr_to_page(addr)) +
> >  		       offset_in_page(addr);
> > +	}
> 
> Ditto.
> 

Removed.

> > @@ -1555,10 +1903,12 @@ static void pcpu_dump_alloc_info(const char *lvl,
> >   * static areas on architectures where the addressing model has
> >   * limited offset range for symbol relocations to guarantee module
> >   * percpu symbols fall inside the relocatable range.
> > + * @ai->static_size + @ai->reserved_size is expected to be page aligned.
> >   *
> >   * @ai->dyn_size determines the number of bytes available for dynamic
> > - * allocation in the first chunk.  The area between @ai->static_size +
> > - * @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused.
> > + * allocation in the first chunk. Both the start and the end are expected
> > + * to be page aligned. The area between @ai->static_size + @ai->reserved_size
> > + * + @ai->dyn_size and @ai->unit_size is unused.
>      ^^^
>      contam
> 

This is for the math continuing from the previous line.

> >   *
> >   * @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE
> >   * and equal to or larger than @ai->static_size + @ai->reserved_size +
> > @@ -1581,11 +1931,11 @@ static void pcpu_dump_alloc_info(const char *lvl,
> >   * copied static data to each unit.
> >   *
> >   * If the first chunk ends up with both reserved and dynamic areas, it
> > - * is served by two chunks - one to serve the core static and reserved
> > - * areas and the other for the dynamic area.  They share the same vm
> > - * and page map but uses different area allocation map to stay away
> > - * from each other.  The latter chunk is circulated in the chunk slots
> > - * and available for dynamic allocation like any other chunks.
> > + * is served by two chunks - one to serve the reserved area and the other
> > + * for the dynamic area.  They share the same vm and page map but use
> > + * different area allocation map to stay away from each other.  The latter
> > + * chunk is circulated in the chunk slots and available for dynamic allocation
> > + * like any other chunks.
> 
> ditto
> 

Split into another patch.

> > @@ -1703,7 +2051,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
> >  	 * Allocate chunk slots.  The additional last slot is for
> >  	 * empty chunks.
> >  	 */
> > -	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
> > +	pcpu_nr_slots = __pcpu_size_to_slot(
> > +				pcpu_pages_to_bits(pcpu_unit_pages)) + 2;
> 
> I get that we wanna be using bits inside the area allocator proper but
> can we keep things outside in bytes?  These things don't really have
> anything to do with what granularity the area allocator is operating
> at.
> 

The refactor keeps this in bytes.


> > @@ -1727,69 +2076,50 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
> >  	tmp_addr = (unsigned long)base_addr + ai->static_size;
> >  	aligned_addr = tmp_addr & PAGE_MASK;
> >  	pcpu_reserved_offset = tmp_addr - aligned_addr;
> > +	begin_fill_bits = pcpu_reserved_offset / PCPU_MIN_ALLOC_SIZE;
> >  
> >  	map_size_bytes = (ai->reserved_size ?: ai->dyn_size) +
> >  			 pcpu_reserved_offset;
> > +
> >  	chunk_pages = map_size_bytes >> PAGE_SHIFT;
> >  
> >  	/* chunk adjacent to static region allocation */
> > +	chunk = pcpu_alloc_first_chunk(chunk_pages);
> >  	chunk->base_addr = (void *)aligned_addr;
> >  	chunk->immutable = true;
> >  
> > +	/* set metadata */
> > +	chunk->contig_hint = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits;
> > +	chunk->free_bits = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits;
> >  
> > +	/*
> > +	 * If the beginning of the reserved region overlaps the end of the
> > +	 * static region, hide that portion in the metadata.
> > +	 */
> > +	if (begin_fill_bits) {
> >  		chunk->has_reserved = true;
> > +		bitmap_fill(chunk->alloc_map, begin_fill_bits);
> > +		set_bit(0, chunk->bound_map);
> > +		set_bit(begin_fill_bits, chunk->bound_map);
> > +
> > +		if (pcpu_block_update_hint_alloc(chunk, 0, begin_fill_bits))
> > +			pcpu_chunk_update_hint(chunk);
> >  	}
> >  
> > +	/* init dynamic chunk if necessary */
> > +	if (ai->reserved_size) {
> > +		pcpu_reserved_chunk = chunk;
> > +
> >  		chunk_pages = dyn_size >> PAGE_SHIFT;
> >  
> >  		/* chunk allocation */
> > +		chunk = pcpu_alloc_first_chunk(chunk_pages);
> >  		chunk->base_addr = base_addr + ai->static_size +
> >  				    ai->reserved_size;
> > +
> > +		/* set metadata */
> > +		chunk->contig_hint = pcpu_nr_pages_to_bits(chunk);
> > +		chunk->free_bits = pcpu_nr_pages_to_bits(chunk);
> >  	}
> >  
> >  	/* link the first chunk in */
> 
> I *think* that quite a bit of the above can be moved into a separate
> patch.
> 

The last one was quite a bit of work; it became the first handful of
patches in v2. The first-chunk creation logic has been consolidated and
the reserved region is no longer expanded. The reserved region must be
a multiple of PCPU_MIN_ALLOC_SIZE, so the static region is aligned up
and the dynamic region is shrunk by the amount of that round-up. This
is fine because the dynamic region is expanded to be page aligned: if
the static region needed aligning up, it means the dynamic region had
initially expanded to use that space.
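
With made-up numbers (only the 4-byte PCPU_MIN_ALLOC_SIZE is real;
every other value below is hypothetical), that rule plays out like
this:

/* Hypothetical first-chunk sizes, only to illustrate the rule above. */
enum {
        EX_STATIC_SIZE    = 8190,       /* not a multiple of 4 */
        EX_STATIC_ALIGNED = 8192,       /* static region aligned up by 2 bytes */
        EX_RESERVED_SIZE  = 8192,       /* already a multiple of 4 */
        EX_DYN_SIZE       = 20480 - 2,  /* dynamic region shrunk by the 2-byte
                                           round-up; safe because it had already
                                           been expanded to page-align the unit */
};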

Thanks,
Dennis


Thread overview: 37+ messages
2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
2017-07-17 14:44   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou
2017-07-17 14:46   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou
2017-07-17 14:47   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou
2017-07-17 14:53   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou
2017-07-16  4:01   ` kbuild test robot
2017-07-16  5:11   ` kbuild test robot
2017-07-17 16:46   ` Tejun Heo
2017-07-17 19:10     ` Dennis Zhou
2017-07-24 20:04     ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou
2017-07-17 18:57   ` Tejun Heo
2017-07-18 19:26   ` Josef Bacik
2017-07-18 19:36     ` Matthew Wilcox
2017-07-19 14:20       ` Josef Bacik
2017-07-16  2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou
2017-07-17 19:10   ` Tejun Heo
2017-07-24 20:07     ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou
2017-07-17 19:26   ` Tejun Heo
2017-07-24 20:13     ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
2017-07-17 23:27   ` Tejun Heo
2017-07-24 21:37     ` Dennis Zhou
2017-07-19 19:11   ` Josef Bacik
2017-07-19 22:19     ` Dennis Zhou
2017-07-19 19:16   ` Josef Bacik
2017-07-19 22:13     ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou
2017-07-17 23:32   ` Tejun Heo
2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik
2017-07-24 21:14   ` Dennis Zhou
