* [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator
@ 2017-07-16  2:23 Dennis Zhou
  2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
                   ` (10 more replies)

0 siblings, 11 replies; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter
Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou, Dennis Zhou

Hi everyone,

The Linux kernel percpu memory allocator is responsible for managing
percpu memory. It allocates memory from chunks of percpu areas and uses
a simple first-fit area allocator to manage allocations inside each
chunk. There now exist use cases where a million or more objects are
allocated and deallocated, making the current implementation inadequate.

The two primary problems with the current area map allocator are:
  1. The backing data structure is an array of the areas. Managing this
     array can require a memmove over a large portion of it.
  2. On allocation, chunks are considered based on the contig_hint. The
     contig_hint may be large enough while the alignment cannot meet the
     request. This causes scanning over every free fragment, which can
     spill over into scanning other chunks.

The primary considerations for the new allocator were the following:
  - Remove the memmove operation from the critical path
  - Be conservative with additional use of memory
  - Provide consistency in performance and memory footprint
  - Focus on small allocations < 64 bytes

This patchset introduces a simple bitmap allocator backed by metadata
blocks as a replacement for the area map allocator for percpu memory.
Each chunk has an allocation bitmap, a boundary bitmap, and a set of
metadata blocks. The allocation map serves as the ground truth for
allocations, while the boundary map serves as a way to distinguish
between consecutive allocations. The minimum allocation size has been
increased to 4 bytes.
The key property behind the bitmap allocator is its static metadata. The
main problem it solves is that a memmove is no longer part of the
critical path for freeing, which was the primary source of latency. This
also helps bound the metadata overhead. The area map allocator required
an integer per allocation. This may be beneficial with larger
allocations, but as mentioned, allocating a significant number of small
objects is becoming more common. This causes worst-case scenarios for
metadata overhead.

There is one caveat with this implementation. In an effort to make
freeing fast, the only time metadata is updated on the free path is if a
whole block becomes free or the freed area spans across metadata blocks.
This causes the chunk's contig_hint to be potentially smaller than what
it could allocate by up to a block. If the chunk's contig_hint is
smaller than a block, a check occurs and the hint is kept accurate.
Metadata is always kept accurate on allocation, so the situation where a
chunk has a larger contig_hint than available will never occur.

I have primarily done testing against a simple workload of allocating
1 million objects of varying size. Deallocation was done in order,
alternating, and in reverse. These numbers were collected after rebasing
on top of a80099a152. I present the worst-case numbers here:

  Area Map Allocator:

	Object Size | Alloc Time (ms) | Free Time (ms)
	----------------------------------------------
		 4B |             335 |           4960
		16B |             485 |           1150
		64B |             445 |            280
	       128B |             505 |            177
	      1024B |            3385 |            140

  Bitmap Allocator:

	Object Size | Alloc Time (ms) | Free Time (ms)
	----------------------------------------------
		 4B |             725 |             70
		16B |             760 |             70
		64B |             855 |             80
	       128B |             910 |             90
	      1024B |            3770 |            260

This data demonstrates the inability of the area map allocator to handle
less-than-ideal situations. In the best case of reverse deallocation,
the area map allocator was able to perform within range of the bitmap
allocator.
In the worst-case situation, freeing took nearly 5 seconds for 1 million
4-byte objects. The bitmap allocator dramatically improves the
consistency of the free path: the small allocations performed nearly
identically regardless of the freeing pattern. While it does add to the
allocation latency, the allocation scenario here is optimal for the area
map allocator.

The second problem, additional scanning, can result in the area map
allocator taking 52 minutes to complete. The same workload takes only
14 seconds to complete for the bitmap allocator. This was produced under
a more contrived scenario of allocating 1 million 4-byte objects with
8-byte alignment.

Alternative implementations were evaluated, including linked lists,
trees, and buddy systems. These all suffer from either high metadata
overhead for small allocations or from the amplified costs of
fragmentation with percpu memory.

This patchset contains the following ten patches:

  0001-percpu-pcpu-stats-change-void-buffer-to-int-buffer.patch
  0002-percpu-change-the-format-for-percpu_stats-output.patch
  0003-percpu-expose-pcpu_nr_empty_pop_pages-in-pcpu_stats.patch
  0004-percpu-update-the-header-comment-and-pcpu_build_allo.patch
  0005-percpu-change-reserved_size-to-end-page-aligned.patch
  0006-percpu-modify-base_addr-to-be-region-specific.patch
  0007-percpu-fix-misnomer-in-schunk-dchunk-variable-names.patch
  0008-percpu-change-the-number-of-pages-marked-in-the-firs.patch
  0009-percpu-replace-area-map-allocator-with-bitmap-alloca.patch
  0010-percpu-add-optimizations-on-allocation-path-for-the-.patch

0001-0002 are minor fixes to percpu_stats. 0003 exposes a new field via
percpu_stats. 0004 updates comments in the percpu allocator. 0005-0006
are preparatory patches that modify the first_chunk's base_addr
management and the reserved region. 0007 does some variable renaming for
clarity. 0008 modifies the population map and the variables surrounding
population. 0009 is the bitmap allocator backed by metadata blocks
implementation.
0010 adds two optimizations on top of the allocator.

This patchset is on top of linus#master a80099a152.

diffstats below:

Dennis Zhou (Facebook) (10):
  percpu: pcpu-stats change void buffer to int buffer
  percpu: change the format for percpu_stats output
  percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats
  percpu: update the header comment and pcpu_build_alloc_info comments
  percpu: change reserved_size to end page aligned
  percpu: modify base_addr to be region specific
  percpu: fix misnomer in schunk/dchunk variable names
  percpu: change the number of pages marked in the first_chunk bitmaps
  percpu: replace area map allocator with bitmap allocator
  percpu: add optimizations on allocation path for the bitmap allocator

 arch/ia64/mm/contig.c    |    3 +-
 arch/ia64/mm/discontig.c |    3 +-
 include/linux/percpu.h   |   43 +-
 init/main.c              |    1 -
 mm/percpu-internal.h     |   84 ++-
 mm/percpu-stats.c        |  111 ++--
 mm/percpu.c              | 1461 +++++++++++++++++++++++++++++-----------------
 7 files changed, 1107 insertions(+), 599 deletions(-)

Thanks,
Dennis

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer
2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 14:44   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou
                     ` (9 subsequent siblings)

10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter
Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

Changes the use of a void buffer to an int buffer for clarity.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu-stats.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 03524a5..0d81044 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -49,7 +49,7 @@ static int find_max_map_used(void)
  * the beginning of the chunk to the last allocation.
  */
 static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
-			    void *buffer)
+			    int *buffer)
 {
 	int i, s_index, last_alloc, alloc_sign, as_len;
 	int *alloc_sizes, *p;
@@ -113,7 +113,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 {
 	struct pcpu_chunk *chunk;
 	int slot, max_map_used;
-	void *buffer;
+	int *buffer;

 alloc_buffer:
 	spin_lock_irq(&pcpu_lock);
--
2.9.3
* Re: [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer
2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
@ 2017-07-17 14:44   ` Tejun Heo
0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:44 UTC (permalink / raw)
To: Dennis Zhou
Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:06PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
>
> Changes the use of a void buffer to an int buffer for clarity.
>
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Applied to percpu/for-4.14.

Thanks.

--
tejun
* [PATCH 02/10] percpu: change the format for percpu_stats output
2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 14:46   ` Tejun Heo
  2017-07-16  2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou
                     ` (8 subsequent siblings)

10 siblings, 1 reply; 37+ messages in thread
From: Dennis Zhou @ 2017-07-16  2:23 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter
Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

This makes the debugfs output for percpu_stats a little easier to read
by changing the spacing of the output to be consistent.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu-stats.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index 0d81044..fa0f5de 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -18,7 +18,7 @@
 #include "percpu-internal.h"

 #define P(X, Y) \
-	seq_printf(m, "  %-24s: %8lld\n", X, (long long int)Y)
+	seq_printf(m, "  %-20s: %12lld\n", X, (long long int)Y)

 struct percpu_stats pcpu_stats;
 struct pcpu_alloc_info pcpu_stats_ai;
@@ -134,7 +134,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 }

 #define PL(X) \
-	seq_printf(m, "  %-24s: %8lld\n", #X, (long long int)pcpu_stats_ai.X)
+	seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)

 	seq_printf(m,
 		   "Percpu Memory Statistics\n"
@@ -151,7 +151,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 #undef PL

 #define PU(X) \
-	seq_printf(m, "  %-18s: %14llu\n", #X, (unsigned long long)pcpu_stats.X)
+	seq_printf(m, "  %-20s: %12llu\n", #X, (unsigned long long)pcpu_stats.X)

 	seq_printf(m,
 		   "Global Stats:\n"
--
2.9.3
* Re: [PATCH 02/10] percpu: change the format for percpu_stats output
2017-07-16  2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou
@ 2017-07-17 14:46   ` Tejun Heo
0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:46 UTC (permalink / raw)
To: Dennis Zhou
Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:07PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
>
> This makes the debugfs output for percpu_stats a little easier
> to read by changing the spacing of the output to be consistent.
>
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Applied to percpu/for-4.14.

Thanks.

--
tejun
* [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats 2017-07-16 2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou 2017-07-16 2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou 2017-07-16 2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou @ 2017-07-16 2:23 ` Dennis Zhou 2017-07-17 14:47 ` Tejun Heo 2017-07-16 2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou ` (7 subsequent siblings) 10 siblings, 1 reply; 37+ messages in thread From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw) To: Tejun Heo, Christoph Lameter Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> Percpu memory holds a minimum threshold of pages that are populated in order to serve atomic percpu memory requests. This change makes it easier to verify that there are a minimum number of populated pages lying around. 
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu-internal.h | 1 +
 mm/percpu-stats.c    | 1 +
 mm/percpu.c          | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index cd2442e..c9158a4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -36,6 +36,7 @@ extern spinlock_t pcpu_lock;

 extern struct list_head *pcpu_slot;
 extern int pcpu_nr_slots;
+extern int pcpu_nr_empty_pop_pages;

 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index fa0f5de..44e561d 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -164,6 +164,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
 	PU(nr_max_chunks);
 	PU(min_alloc_size);
 	PU(max_alloc_size);
+	P("empty_pop_pages", pcpu_nr_empty_pop_pages);
 	seq_putc(m, '\n');
 #undef PU

diff --git a/mm/percpu.c b/mm/percpu.c
index bd4130a..9ec5fd4 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -160,7 +160,7 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  * The number of empty populated pages, protected by pcpu_lock.  The
  * reserved chunk doesn't contribute to the count.
  */
-static int pcpu_nr_empty_pop_pages;
+int pcpu_nr_empty_pop_pages;

 /*
  * Balance work is used to populate or destroy chunks asynchronously.  We
--
2.9.3
* Re: [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats
2017-07-16  2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou
@ 2017-07-17 14:47   ` Tejun Heo
0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:47 UTC (permalink / raw)
To: Dennis Zhou
Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:08PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
>
> Percpu memory holds a minimum threshold of pages that are populated
> in order to serve atomic percpu memory requests. This change makes it
> easier to verify that there are a minimum number of populated pages
> lying around.
>
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Applied to percpu/for-4.14.

Thanks.

--
tejun
* [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments 2017-07-16 2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou ` (2 preceding siblings ...) 2017-07-16 2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou @ 2017-07-16 2:23 ` Dennis Zhou 2017-07-17 14:53 ` Tejun Heo 2017-07-16 2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou ` (6 subsequent siblings) 10 siblings, 1 reply; 37+ messages in thread From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw) To: Tejun Heo, Christoph Lameter Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> The header comment for percpu memory is a little hard to parse and is not super clear about how the first chunk is managed. This adds a little more clarity to the situation. There is also quite a bit of tricky logic in the pcpu_build_alloc_info. This adds a restructure of a comment to add a little more information. Unfortunately, you will still have to piece together a handful of other comments too, but should help direct you to the meaningful comments. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> --- mm/percpu.c | 58 ++++++++++++++++++++++++++++++++-------------------------- 1 file changed, 32 insertions(+), 26 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index 9ec5fd4..5bb90d8 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -4,36 +4,35 @@ * Copyright (C) 2009 SUSE Linux Products GmbH * Copyright (C) 2009 Tejun Heo <tj@kernel.org> * - * This file is released under the GPLv2. + * This file is released under the GPLv2 license. * - * This is percpu allocator which can handle both static and dynamic - * areas. Percpu areas are allocated in chunks. 
Each chunk is - * consisted of boot-time determined number of units and the first - * chunk is used for static percpu variables in the kernel image - * (special boot time alloc/init handling necessary as these areas - * need to be brought up before allocation services are running). - * Unit grows as necessary and all units grow or shrink in unison. - * When a chunk is filled up, another chunk is allocated. + * The percpu allocator handles both static and dynamic areas. Percpu + * areas are allocated in chunks which are divided into units. There is + * a 1-to-1 mapping for units to possible cpus. These units are grouped + * based on NUMA properties of the machine. * * c0 c1 c2 * ------------------- ------------------- ------------ * | u0 | u1 | u2 | u3 | | u0 | u1 | u2 | u3 | | u0 | u1 | u * ------------------- ...... ------------------- .... ------------ + + * Allocation is done by offsets into a unit's address space. Ie., an + * area of 512 bytes at 6k in c1 occupies 512 bytes at 6k in c1:u0, + * c1:u1, c1:u2, etc. On NUMA machines, the mapping may be non-linear + * and even sparse. Access is handled by configuring percpu base + * registers according to the cpu to unit mappings and offsetting the + * base address using pcpu_unit_size. + * + * There is special consideration for the first chunk which must handle + * the static percpu variables in the kernel image as allocation services + * are not online yet. In short, the first chunk is structure like so: * - * Allocation is done in offset-size areas of single unit space. Ie, - * an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0, - * c1:u1, c1:u2 and c1:u3. On UMA, units corresponds directly to - * cpus. On NUMA, the mapping can be non-linear and even sparse. - * Percpu access can be done by configuring percpu base registers - * according to cpu to unit mapping and pcpu_unit_size. - * - * There are usually many small percpu allocations many of them being - * as small as 4 bytes. 
The allocator organizes chunks into lists - * according to free size and tries to allocate from the fullest one. - * Each chunk keeps the maximum contiguous area size hint which is - * guaranteed to be equal to or larger than the maximum contiguous - * area in the chunk. This helps the allocator not to iterate the - * chunk maps unnecessarily. + * <Static | [Reserved] | Dynamic> + * + * The static data is copied from the original section managed by the + * linker. The reserved section, if non-zero, primarily manages static + * percpu variables from kernel modules. Finally, the dynamic section + * takes care of normal allocations. * * Allocation state in each chunk is kept using an array of integers * on chunk->map. A positive value in the map represents a free @@ -43,6 +42,12 @@ * Chunks can be determined from the address using the index field * in the page struct. The index field contains a pointer to the chunk. * + * These chunks are organized into lists according to free_size and + * tries to allocate from the fullest chunk first. Each chunk maintains + * a maximum contiguous area size hint which is guaranteed to be equal + * to or larger than the maximum contiguous area in the chunk. This + * helps prevent the allocator from iterating over chunks unnecessarily. 
+ *
 * To use this allocator, arch code should do the following:
 *
 * - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
@@ -1842,6 +1847,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 	 */
 	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);

+	/* determine the maximum # of units that can fit in an allocation */
 	alloc_size = roundup(min_unit_size, atom_size);
 	upa = alloc_size / min_unit_size;
 	while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
@@ -1868,9 +1874,9 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 	}

 	/*
-	 * Expand unit size until address space usage goes over 75%
-	 * and then as much as possible without using more address
-	 * space.
+	 * Wasted space is caused by a ratio imbalance of upa to group_cnt.
+	 * Expand the unit_size until we use >= 75% of the units allocated.
+	 * Related to atom_size, which could be much larger than the unit_size.
 	 */
 	last_allocs = INT_MAX;
 	for (upa = max_upa; upa; upa--) {
--
2.9.3
* Re: [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments
2017-07-16  2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou
@ 2017-07-17 14:53   ` Tejun Heo
0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2017-07-17 14:53 UTC (permalink / raw)
To: Dennis Zhou
Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:09PM -0400, Dennis Zhou wrote:
>  *  c0                           c1                         c2
>  *  -------------------         -------------------        ------------
>  * | u0 | u1 | u2 | u3 |       | u0 | u1 | u2 | u3 |      | u0 | u1 | u
>  *  ------------------- ......  ------------------- ....  ------------
> +

Missing '*'.  Added and applied to percpu/for-4.14.

Thanks.

--
tejun
* [PATCH 05/10] percpu: change reserved_size to end page aligned 2017-07-16 2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou ` (3 preceding siblings ...) 2017-07-16 2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou @ 2017-07-16 2:23 ` Dennis Zhou 2017-07-16 4:01 ` kbuild test robot ` (2 more replies) 2017-07-16 2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou ` (5 subsequent siblings) 10 siblings, 3 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw) To: Tejun Heo, Christoph Lameter Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> Preparatory patch to modify the first chunk's static_size + reserved_size to end page aligned. The first chunk has a unique allocation scheme overlaying the static, reserved, and dynamic regions. The other regions of each chunk are reserved or hidden. The bitmap allocator would have to allocate in the bitmap the static region to replicate this. By having the reserved region to end page aligned, the metadata overhead can be saved. The consequence is that up to an additional page of memory will be allocated to the reserved region that primarily serves static percpu variables. 
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 arch/ia64/mm/contig.c    |  3 ++-
 arch/ia64/mm/discontig.c |  3 ++-
 include/linux/percpu.h   | 29 +++++++++++++++++++++++++++++
 mm/percpu.c              |  6 ++++++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 52715a7..20ee2b2 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -164,7 +164,8 @@ setup_per_cpu_areas(void)

 	/* set parameters */
 	static_size = __per_cpu_end - __per_cpu_start;
-	reserved_size = PERCPU_MODULE_RESERVE;
+	reserved_size = pcpu_align_reserved_region(static_size,
+						   PERCPU_MODULE_RESERVE);
 	dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
 	if (dyn_size < 0)
 		panic("percpu area overflow static=%zd reserved=%zd\n",
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 8786268..f898b24 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -214,7 +214,8 @@ void __init setup_per_cpu_areas(void)

 	/* set basic parameters */
 	static_size = __per_cpu_end - __per_cpu_start;
-	reserved_size = PERCPU_MODULE_RESERVE;
+	reserved_size = pcpu_align_reserved_region(static_size,
+						   PERCPU_MODULE_RESERVE);
 	dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
 	if (dyn_size < 0)
 		panic("percpu area overflow static=%zd reserved=%zd\n",
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 491b3f5..98a371c 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -130,4 +130,33 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),		\
 						__alignof__(type))

+/*
+ * pcpu_align_reserved_region - page align the end of the reserved region
+ * @static_size: the static region size
+ * @reserved_size: the minimum reserved region size
+ *
+ * This function calculates the size of the reserved region required to
+ * make the reserved region end page aligned.
+ *
+ * Percpu memory offers a maximum alignment of PAGE_SIZE.  Aligning this
+ * minimizes the metadata overhead of overlapping the static, reserved,
+ * and dynamic regions by allowing the metadata for the static region to
+ * not be allocated.  This lets the base_addr be moved up to a page
+ * aligned address and disregard the static region as offsets are allocated.
+ * The beginning of the reserved region will overlap with the static
+ * region if the end of the static region is not page aligned.
+ *
+ * RETURNS:
+ * Size of reserved region required to make static_size + reserved_size
+ * page aligned.
+ */
+static inline ssize_t pcpu_align_reserved_region(ssize_t static_size,
+						 ssize_t reserved_size)
+{
+	if (!reserved_size)
+		return 0;
+
+	return PFN_ALIGN(static_size + reserved_size) - static_size;
+}
+
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index 5bb90d8..7704db9 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1597,6 +1597,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
 	PCPU_SETUP_BUG_ON(offset_in_page(ai->unit_size));
 	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
+	PCPU_SETUP_BUG_ON(ai->reserved_size &&
+			  !PAGE_ALIGNED(ai->static_size + ai->reserved_size));
 	PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
 	PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);

@@ -1800,6 +1802,9 @@ early_param("percpu_alloc", percpu_alloc_setup);
 * @atom_size: allocation atom size
 * @cpu_distance_fn: callback to determine distance between cpus, optional
 *
+ * If there is a @reserved_size, it is expanded to ensure the end of the
+ * reserved region is page aligned.
+ *
 * This function determines grouping of units, their mappings to cpus
 * and other parameters considering needed percpu size, allocation
 * atom size and distances between CPUs.
@@ -1835,6 +1840,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
 	memset(group_cnt, 0, sizeof(group_cnt));

 	/* calculate size_sum and ensure dyn_size is enough for early alloc */
+	reserved_size = pcpu_align_reserved_region(static_size, reserved_size);
 	size_sum = PFN_ALIGN(static_size + reserved_size +
 			     max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
 	dyn_size = size_sum - static_size - reserved_size;
--
2.9.3
* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned 2017-07-16 2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou @ 2017-07-16 4:01 ` kbuild test robot 2017-07-16 5:11 ` kbuild test robot 2017-07-17 16:46 ` Tejun Heo 2 siblings, 0 replies; 37+ messages in thread From: kbuild test robot @ 2017-07-16 4:01 UTC (permalink / raw) To: Dennis Zhou Cc: kbuild-all, Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou [-- Attachment #1: Type: text/plain, Size: 8175 bytes --] Hi Dennis, [auto build test ERROR on percpu/for-next] [also build test ERROR on v4.13-rc1 next-20170714] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Dennis-Zhou/percpu-replace-percpu-area-map-allocator-with-bitmap-allocator/20170716-103337 base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next config: xtensa-allyesconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 4.9.0 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=xtensa All error/warnings (new ones prefixed by >>): In file included from include/linux/percpu.h:9:0, from include/linux/percpu-rwsem.h:6, from include/linux/fs.h:30, from fs/affs/affs.h:8, from fs/affs/namei.c:11: include/linux/percpu.h: In function 'pcpu_align_reserved_region': >> include/linux/pfn.h:17:46: error: 'PAGE_SIZE' undeclared (first use in this function) #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ include/linux/pfn.h:17:46: note: each undeclared identifier is reported only once for each function it appears in #define PFN_ALIGN(x) (((unsigned long)(x) + 
(PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ >> include/linux/pfn.h:17:64: error: 'PAGE_MASK' undeclared (first use in this function) #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ -- In file included from include/linux/percpu.h:9:0, from include/linux/percpu-rwsem.h:6, from include/linux/fs.h:30, from fs/ocfs2/file.c:27: include/linux/percpu.h: In function 'pcpu_align_reserved_region': >> include/linux/pfn.h:17:46: error: 'PAGE_SIZE' undeclared (first use in this function) #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ include/linux/pfn.h:17:46: note: each undeclared identifier is reported only once for each function it appears in #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ >> include/linux/pfn.h:17:64: error: 'PAGE_MASK' undeclared (first use in this function) #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ In file included from arch/xtensa/include/asm/atomic.h:21:0, from include/linux/atomic.h:4, from include/linux/debug_locks.h:5, from include/linux/lockdep.h:25, from include/linux/spinlock_types.h:18, from include/linux/spinlock.h:81, from include/linux/wait.h:8, from include/linux/fs.h:5, from fs/ocfs2/file.c:27: fs/ocfs2/file.c: In function 'ocfs2_file_write_iter': 
arch/xtensa/include/asm/cmpxchg.h:139:3: warning: value computed is not used [-Wunused-value] ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) ^ fs/ocfs2/file.c:2341:3: note: in expansion of macro 'xchg' xchg(&iocb->ki_complete, saved_ki_complete); ^ -- In file included from include/linux/percpu.h:9:0, from include/linux/context_tracking_state.h:4, from include/linux/vtime.h:4, from include/linux/hardirq.h:7, from include/linux/interrupt.h:12, from drivers/scsi/sym53c8xx_2/sym_glue.h:45, from drivers/scsi/sym53c8xx_2/sym_fw.c:40: include/linux/percpu.h: In function 'pcpu_align_reserved_region': >> include/linux/pfn.h:17:46: error: 'PAGE_SIZE' undeclared (first use in this function) #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ include/linux/pfn.h:17:46: note: each undeclared identifier is reported only once for each function it appears in #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ >> include/linux/pfn.h:17:64: error: 'PAGE_MASK' undeclared (first use in this function) #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) ^ >> include/linux/percpu.h:159:9: note: in expansion of macro 'PFN_ALIGN' return PFN_ALIGN(static_size + reserved_size) - static_size; ^ In file included from drivers/scsi/sym53c8xx_2/sym_glue.h:64:0, from drivers/scsi/sym53c8xx_2/sym_fw.c:40: drivers/scsi/sym53c8xx_2/sym_defs.h: At top level: drivers/scsi/sym53c8xx_2/sym_defs.h:109:0: warning: "WSR" redefined #define WSR 0x01 /* sta: wide scsi received [W]*/ ^ In file included from arch/xtensa/include/asm/bitops.h:22:0, from include/linux/bitops.h:36, from include/linux/kernel.h:10, from include/linux/list.h:8, from 
include/linux/wait.h:6, from include/linux/completion.h:11, from drivers/scsi/sym53c8xx_2/sym_glue.h:43, from drivers/scsi/sym53c8xx_2/sym_fw.c:40: arch/xtensa/include/asm/processor.h:227:0: note: this is the location of the previous definition #define WSR(v,sr) __asm__ __volatile__ ("wsr %0,"__stringify(sr) :: "a"(v)); ^ vim +/PAGE_SIZE +17 include/linux/pfn.h 947d0496 Jeremy Fitzhardinge 2008-09-11 16 22a9835c Dave Hansen 2006-03-27 @17 #define PFN_ALIGN(x) (((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK) 22a9835c Dave Hansen 2006-03-27 18 #define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT) 22a9835c Dave Hansen 2006-03-27 19 #define PFN_DOWN(x) ((x) >> PAGE_SHIFT) 947d0496 Jeremy Fitzhardinge 2008-09-11 20 #define PFN_PHYS(x) ((phys_addr_t)(x) << PAGE_SHIFT) 8f235d1a Chen Gang 2016-01-14 21 #define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT)) 22a9835c Dave Hansen 2006-03-27 22 :::::: The code at line 17 was first introduced by commit :::::: 22a9835c350782a5c3257343713932af3ac92ee0 [PATCH] unify PFN_* macros :::::: TO: Dave Hansen <haveblue@us.ibm.com> :::::: CC: Linus Torvalds <torvalds@g5.osdl.org> --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 50216 bytes --]
* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned 2017-07-16 2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou 2017-07-16 4:01 ` kbuild test robot @ 2017-07-16 5:11 ` kbuild test robot 2017-07-17 16:46 ` Tejun Heo 2 siblings, 0 replies; 37+ messages in thread From: kbuild test robot @ 2017-07-16 5:11 UTC (permalink / raw) To: Dennis Zhou Cc: kbuild-all, Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou [-- Attachment #1: Type: text/plain, Size: 4298 bytes --] Hi Dennis, [auto build test ERROR on percpu/for-next] [also build test ERROR on v4.13-rc1 next-20170714] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Dennis-Zhou/percpu-replace-percpu-area-map-allocator-with-bitmap-allocator/20170716-103337 base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next config: ia64-allmodconfig (attached as .config) compiler: ia64-linux-gcc (GCC) 6.2.0 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=ia64 All errors (new ones prefixed by >>): arch/ia64/mm/discontig.c: In function 'setup_per_cpu_areas': >> arch/ia64/mm/discontig.c:218:10: error: 'PERCPU_MODULE_RSERVE' undeclared (first use in this function) PERCPU_MODULE_RSERVE); ^~~~~~~~~~~~~~~~~~~~ arch/ia64/mm/discontig.c:218:10: note: each undeclared identifier is reported only once for each function it appears in vim +/PERCPU_MODULE_RSERVE +218 arch/ia64/mm/discontig.c 174 175 #ifdef CONFIG_SMP 176 /** 177 * setup_per_cpu_areas - setup percpu areas 178 * 179 * Arch code has already allocated and initialized percpu areas. 
All 180 * this function has to do is to teach the determined layout to the 181 * dynamic percpu allocator, which happens to be more complex than 182 * creating whole new ones using helpers. 183 */ 184 void __init setup_per_cpu_areas(void) 185 { 186 struct pcpu_alloc_info *ai; 187 struct pcpu_group_info *uninitialized_var(gi); 188 unsigned int *cpu_map; 189 void *base; 190 unsigned long base_offset; 191 unsigned int cpu; 192 ssize_t static_size, reserved_size, dyn_size; 193 int node, prev_node, unit, nr_units, rc; 194 195 ai = pcpu_alloc_alloc_info(MAX_NUMNODES, nr_cpu_ids); 196 if (!ai) 197 panic("failed to allocate pcpu_alloc_info"); 198 cpu_map = ai->groups[0].cpu_map; 199 200 /* determine base */ 201 base = (void *)ULONG_MAX; 202 for_each_possible_cpu(cpu) 203 base = min(base, 204 (void *)(__per_cpu_offset[cpu] + __per_cpu_start)); 205 base_offset = (void *)__per_cpu_start - base; 206 207 /* build cpu_map, units are grouped by node */ 208 unit = 0; 209 for_each_node(node) 210 for_each_possible_cpu(cpu) 211 if (node == node_cpuid[cpu].nid) 212 cpu_map[unit++] = cpu; 213 nr_units = unit; 214 215 /* set basic parameters */ 216 static_size = __per_cpu_end - __per_cpu_start; 217 reserved_size = pcpu_align_reserved_region(static_size, > 218 PERCPU_MODULE_RSERVE); 219 dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size; 220 if (dyn_size < 0) 221 panic("percpu area overflow static=%zd reserved=%zd\n", 222 static_size, reserved_size); 223 224 ai->static_size = static_size; 225 ai->reserved_size = reserved_size; 226 ai->dyn_size = dyn_size; 227 ai->unit_size = PERCPU_PAGE_SIZE; 228 ai->atom_size = PAGE_SIZE; 229 ai->alloc_size = PERCPU_PAGE_SIZE; 230 231 /* 232 * CPUs are put into groups according to node. Walk cpu_map 233 * and create new groups at node boundaries. 
234 */ 235 prev_node = -1; 236 ai->nr_groups = 0; 237 for (unit = 0; unit < nr_units; unit++) { 238 cpu = cpu_map[unit]; 239 node = node_cpuid[cpu].nid; 240 241 if (node == prev_node) { 242 gi->nr_units++; 243 continue; 244 } 245 prev_node = node; 246 247 gi = &ai->groups[ai->nr_groups++]; 248 gi->nr_units = 1; 249 gi->base_offset = __per_cpu_offset[cpu] + base_offset; 250 gi->cpu_map = &cpu_map[unit]; 251 } 252 253 rc = pcpu_setup_first_chunk(ai, base); 254 if (rc) 255 panic("failed to setup percpu area (err=%d)", rc); 256 257 pcpu_free_alloc_info(ai); 258 } 259 #endif 260 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 47682 bytes --] ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned 2017-07-16 2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou 2017-07-16 4:01 ` kbuild test robot 2017-07-16 5:11 ` kbuild test robot @ 2017-07-17 16:46 ` Tejun Heo 2017-07-17 19:10 ` Dennis Zhou 2017-07-24 20:04 ` Dennis Zhou 2 siblings, 2 replies; 37+ messages in thread From: Tejun Heo @ 2017-07-17 16:46 UTC (permalink / raw) To: Dennis Zhou Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou Hello, On Sat, Jul 15, 2017 at 10:23:10PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> > > Preparatory patch to modify the first chunk's static_size + > reserved_size to end page aligned. The first chunk has a unique > allocation scheme overlaying the static, reserved, and dynamic regions. > The other regions of each chunk are reserved or hidden. The bitmap > allocator would have to allocate in the bitmap the static region to > replicate this. By having the reserved region to end page aligned, the > metadata overhead can be saved. The consequence is that up to an > additional page of memory will be allocated to the reserved region that > primarily serves static percpu variables. > > Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Sans the build warnings, generally looks good to me. Some nits and one question below. > +/*
Should be /** > + * pcpu_align_reserved_region - page align the end of the reserved region > + * @static_size: the static region size > + * @reserved_size: the minimum reserved region size > + * > + * This function calculates the size of the reserved region required to > + * make the reserved region end page aligned. > + * > + * Percpu memory offers a maximum alignment of PAGE_SIZE. Aligning this > + * minimizes the metadata overhead of overlapping the static, reserved, > + * and dynamic regions by allowing the metadata for the static region to > + * not be allocated. This lets the base_addr be moved up to a page > + * aligned address and disregard the static region as offsets are allocated. > + * The beginning of the reserved region will overlap with the static > + * region if the end of the static region is not page aligned. Heh, that was pretty difficult to parse, but here's my question. So, we're expanding the reserved area so that its end aligns to a page boundary, which is completely fine. We may end up with a reserved area which is a bit larger than specified but no big deal. However, we can't do the same thing with the boundary between the static and reserved chunks, so instead we pull down the start of the reserved area and mark off the overlapping area, which is fine too. My question is why we're doing one thing for the end of the reserved area while we need to do a different thing for the beginning of it. Can't we do the same thing in both cases? ie. for the both boundaries between static and reserved, and reserved and dynamic, pull down the start to the page boundary and mark the overlapping areas used? Thanks. -- tejun
* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned 2017-07-17 16:46 ` Tejun Heo @ 2017-07-17 19:10 ` Dennis Zhou 2017-07-24 20:04 ` Dennis Zhou 1 sibling, 0 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-17 19:10 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou Hi Tejun, On Mon, Jul 17, 2017 at 12:46:50PM -0400, Tejun Heo wrote: > > +/* > > Should be /** > I have fixed this in v2. > > + * pcpu_align_reserved_region - page align the end of the reserved region > > + * @static_size: the static region size > > + * @reserved_size: the minimum reserved region size > > + * > > + * This function calculates the size of the reserved region required to > > + * make the reserved region end page aligned. > > + * > > + * Percpu memory offers a maximum alignment of PAGE_SIZE. Aligning this > > + * minimizes the metadata overhead of overlapping the static, reserved, > > + * and dynamic regions by allowing the metadata for the static region to > > + * not be allocated. This lets the base_addr be moved up to a page > > + * aligned address and disregard the static region as offsets are allocated. > > + * The beginning of the reserved region will overlap with the static > > + * region if the end of the static region is not page aligned. > > Heh, that was pretty difficult to parse, but here's my question. So, > we're expanding reserved area so that its end aligns to page boundary > which is completely fine. We may end up with reserved area which is a > bit larger than specified but no big deal. However, we can't do the > same thing with the boundary between the static and reserved chunks, > so instead we pull down the start of the reserved area and mark off > the overwrapping area, which is fine too. > > My question is why we're doing one thing for the end of reserved area > while we need to do a different thing for the beginning of it. Can't > we do the same thing in both cases? ie. 
for the both boundaries > between static and reserved, and reserved and dynamic, pull down the > start to the page boundary and mark the overlapping areas used? I don't have a very good answer to why I chose to do it differently for the beginning and the end. I think it came down to wanting to maximize metadata usage at the time. A benefit to doing it this way is that it clarifies the number of full pages that will be allocated to the reserved region. For example, if the reserved region is set to 8KB and the region is offset due to the static region, the reserved region would only be given one full page. The first and last page are shared with the static region and dynamic region respectively. Expanding the reserved region would allocate two 4KB pages to it + the partial at the beginning if the static region is not aligned. It's not perfect, but it makes alignment slightly easier to understand for the reserved region. Thanks, Dennis
* Re: [PATCH 05/10] percpu: change reserved_size to end page aligned 2017-07-17 16:46 ` Tejun Heo 2017-07-17 19:10 ` Dennis Zhou @ 2017-07-24 20:04 ` Dennis Zhou 1 sibling, 0 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-24 20:04 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou Hi Tejun, On Mon, Jul 17, 2017 at 12:46:50PM -0400, Tejun Heo wrote: > Heh, that was pretty difficult to parse, but here's my question. So, > we're expanding reserved area so that its end aligns to page boundary > which is completely fine. We may end up with reserved area which is a > bit larger than specified but no big deal. However, we can't do the > same thing with the boundary between the static and reserved chunks, > so instead we pull down the start of the reserved area and mark off > the overwrapping area, which is fine too. > > My question is why we're doing one thing for the end of reserved area > while we need to do a different thing for the beginning of it. Can't > we do the same thing in both cases? ie. for the both boundaries > between static and reserved, and reserved and dynamic, pull down the > start to the page boundary and mark the overlapping areas used? I've refactored the code to maintain start and end offsets. This removes the need to expand the reserved region. There are a few more constraints though. The reserved region must be a multiple of the minimum allocation size. The static region and dynamic region are expanded and shrunk respectively to maintain alignment with the minimum allocation size. Thanks, Dennis
* [PATCH 06/10] percpu: modify base_addr to be region specific 2017-07-16 2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou ` (4 preceding siblings ...) 2017-07-16 2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou @ 2017-07-16 2:23 ` Dennis Zhou 2017-07-17 18:57 ` Tejun Heo 2017-07-18 19:26 ` Josef Bacik 2017-07-16 2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou ` (4 subsequent siblings) 10 siblings, 2 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw) To: Tejun Heo, Christoph Lameter Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> Originally, the first chunk is served by up to three chunks, each given a region they are responsible for. Despite this, the arithmetic was based off of the base_addr, making it require offsets or be overly inclusive. This patch changes percpu checks for first chunk to consider only the dynamic region and the reserved check to be only the reserved region. There is no impact here besides making these checks a little more accurate. This patch also adds the groundwork for increasing the minimum allocation size to 4 bytes. The new field nr_pages in pcpu_chunk will be used to keep track of the number of pages the bitmap serves. The arithmetic for identifying first chunk and reserved chunk reflects this change.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> --- include/linux/percpu.h | 4 ++ mm/percpu-internal.h | 12 +++-- mm/percpu.c | 127 ++++++++++++++++++++++++++++++++++--------------- 3 files changed, 100 insertions(+), 43 deletions(-) diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 98a371c..a5cedcd 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -21,6 +21,10 @@ /* minimum unit size, also is the maximum supported allocation size */ #define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10) +/* minimum allocation size and shift in bytes */ +#define PCPU_MIN_ALLOC_SIZE (1 << PCPU_MIN_ALLOC_SHIFT) +#define PCPU_MIN_ALLOC_SHIFT 2 + /* * Percpu allocator can serve percpu allocations before slab is * initialized which allows slab to depend on the percpu allocator. diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index c9158a4..56e1aba 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -23,11 +23,12 @@ struct pcpu_chunk { void *data; /* chunk data */ int first_free; /* no free below this */ bool immutable; /* no [de]population allowed */ - bool has_reserved; /* Indicates if chunk has reserved space - at the beginning. Reserved chunk will - contain reservation for static chunk. - Dynamic chunk will contain reservation - for static and reserved chunks. 
*/ + bool has_reserved; /* indicates if the region this chunk + is responsible for overlaps with + the prior adjacent region */ + + int nr_pages; /* # of PAGE_SIZE pages served + by this chunk */ int nr_populated; /* # of populated pages */ unsigned long populated[]; /* populated bitmap */ }; @@ -40,6 +41,7 @@ extern int pcpu_nr_empty_pop_pages; extern struct pcpu_chunk *pcpu_first_chunk; extern struct pcpu_chunk *pcpu_reserved_chunk; +extern unsigned long pcpu_reserved_offset; #ifdef CONFIG_PERCPU_STATS diff --git a/mm/percpu.c b/mm/percpu.c index 7704db9..c74ad68 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -144,14 +144,14 @@ static const size_t *pcpu_group_sizes __ro_after_init; struct pcpu_chunk *pcpu_first_chunk __ro_after_init; /* - * Optional reserved chunk. This chunk reserves part of the first - * chunk and serves it for reserved allocations. The amount of - * reserved offset is in pcpu_reserved_chunk_limit. When reserved - * area doesn't exist, the following variables contain NULL and 0 - * respectively. + * Optional reserved chunk. This is the part of the first chunk that + * serves reserved allocations. The pcpu_reserved_offset is the amount + * the pcpu_reserved_chunk->base_addr is pushed back into the static + * region for the base_addr to be page aligned. When the reserved area + * doesn't exist, the following variables contain NULL and 0 respectively. */ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init; -static int pcpu_reserved_chunk_limit __ro_after_init; +unsigned long pcpu_reserved_offset __ro_after_init; DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ @@ -184,19 +184,32 @@ static void pcpu_schedule_balance_work(void) schedule_work(&pcpu_balance_work); } +/* + * Static addresses should never be passed into the allocator. They + * are accessed using the group_offsets and therefore do not rely on + * chunk->base_addr. 
+ */ static bool pcpu_addr_in_first_chunk(void *addr) { void *first_start = pcpu_first_chunk->base_addr; - return addr >= first_start && addr < first_start + pcpu_unit_size; + return addr >= first_start && + addr < first_start + + pcpu_first_chunk->nr_pages * PAGE_SIZE; } static bool pcpu_addr_in_reserved_chunk(void *addr) { - void *first_start = pcpu_first_chunk->base_addr; + void *first_start; - return addr >= first_start && - addr < first_start + pcpu_reserved_chunk_limit; + if (!pcpu_reserved_chunk) + return false; + + first_start = pcpu_reserved_chunk->base_addr; + + return addr >= first_start + pcpu_reserved_offset && + addr < first_start + + pcpu_reserved_chunk->nr_pages * PAGE_SIZE; } static int __pcpu_size_to_slot(int size) @@ -237,11 +250,16 @@ static int __maybe_unused pcpu_page_idx(unsigned int cpu, int page_idx) return pcpu_unit_map[cpu] * pcpu_unit_pages + page_idx; } +static unsigned long pcpu_unit_page_offset(unsigned int cpu, int page_idx) +{ + return pcpu_unit_offsets[cpu] + (page_idx << PAGE_SHIFT); +} + static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk, unsigned int cpu, int page_idx) { - return (unsigned long)chunk->base_addr + pcpu_unit_offsets[cpu] + - (page_idx << PAGE_SHIFT); + return (unsigned long)chunk->base_addr + + pcpu_unit_page_offset(cpu, page_idx); } static void __maybe_unused pcpu_next_unpop(struct pcpu_chunk *chunk, @@ -737,6 +755,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void) chunk->free_size = pcpu_unit_size; chunk->contig_hint = pcpu_unit_size; + chunk->nr_pages = pcpu_unit_pages; + return chunk; } @@ -824,18 +844,20 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai); * pcpu_chunk_addr_search - determine chunk containing specified address * @addr: address for which the chunk needs to be determined. * + * This is an internal function that handles all but static allocations. + * Static percpu address values should never be passed into the allocator. 
+ * * RETURNS: * The address of the found chunk. */ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr) { /* is it in the first chunk? */ - if (pcpu_addr_in_first_chunk(addr)) { - /* is it in the reserved area? */ - if (pcpu_addr_in_reserved_chunk(addr)) - return pcpu_reserved_chunk; + if (pcpu_addr_in_first_chunk(addr)) return pcpu_first_chunk; - } + /* is it in the reserved chunk? */ + if (pcpu_addr_in_reserved_chunk(addr)) + return pcpu_reserved_chunk; /* * The address is relative to unit0 which might be unused and @@ -1366,10 +1388,17 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr) * The following test on unit_low/high isn't strictly * necessary but will speed up lookups of addresses which * aren't in the first chunk. + * + * The address check is of high granularity checking against full + * chunk sizes. pcpu_base_addr points to the beginning of the first + * chunk including the static region. This allows us to examine all + * regions of the first chunk. Assumes good intent as the first + * chunk may not be full (ie. < pcpu_unit_pages in size). */ - first_low = pcpu_chunk_addr(pcpu_first_chunk, pcpu_low_unit_cpu, 0); - first_high = pcpu_chunk_addr(pcpu_first_chunk, pcpu_high_unit_cpu, - pcpu_unit_pages); + first_low = (unsigned long) pcpu_base_addr + + pcpu_unit_page_offset(pcpu_low_unit_cpu, 0); + first_high = (unsigned long) pcpu_base_addr + + pcpu_unit_page_offset(pcpu_high_unit_cpu, pcpu_unit_pages); if ((unsigned long)addr >= first_low && (unsigned long)addr < first_high) { for_each_possible_cpu(cpu) { @@ -1575,6 +1604,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, unsigned int cpu; int *unit_map; int group, unit, i; + unsigned long tmp_addr, aligned_addr; + unsigned long map_size_bytes; #define PCPU_SETUP_BUG_ON(cond) do { \ if (unlikely(cond)) { \ @@ -1678,46 +1709,66 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, INIT_LIST_HEAD(&pcpu_slot[i]); /* - * Initialize static chunk. 
If reserved_size is zero, the - * static chunk covers static area + dynamic allocation area - * in the first chunk. If reserved_size is not zero, it - * covers static area + reserved area (mostly used for module - * static percpu allocation). + * Initialize static chunk. + * The static region is dropped as those addresses are already + * allocated and do not rely on chunk->base_addr. + * reserved_size == 0: + * the static chunk covers the dynamic area + * reserved_size > 0: + * the static chunk covers the reserved area + * + * If the static area is not page aligned, the region adjacent + * to the static area must have its base_addr be offset into + * the static area to have it be page aligned. The overlap is + * then allocated preserving the alignment in the metadata for + * the actual region. */ + tmp_addr = (unsigned long)base_addr + ai->static_size; + aligned_addr = tmp_addr & PAGE_MASK; + pcpu_reserved_offset = tmp_addr - aligned_addr; + + map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + + pcpu_reserved_offset; + + /* schunk allocation */ schunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); INIT_LIST_HEAD(&schunk->list); INIT_LIST_HEAD(&schunk->map_extend_list); - schunk->base_addr = base_addr; + schunk->base_addr = (void *)aligned_addr; schunk->map = smap; schunk->map_alloc = ARRAY_SIZE(smap); schunk->immutable = true; bitmap_fill(schunk->populated, pcpu_unit_pages); schunk->nr_populated = pcpu_unit_pages; + schunk->nr_pages = map_size_bytes >> PAGE_SHIFT; + if (ai->reserved_size) { schunk->free_size = ai->reserved_size; pcpu_reserved_chunk = schunk; - pcpu_reserved_chunk_limit = ai->static_size + ai->reserved_size; } else { schunk->free_size = dyn_size; dyn_size = 0; /* dynamic area covered */ } schunk->contig_hint = schunk->free_size; - schunk->map[0] = 1; - schunk->map[1] = ai->static_size; - schunk->map_used = 1; + if (pcpu_reserved_offset) { + schunk->has_reserved = true; + schunk->map[0] = 1; + schunk->map[1] = pcpu_reserved_offset; + 
schunk->map_used = 1; + } if (schunk->free_size) - schunk->map[++schunk->map_used] = ai->static_size + schunk->free_size; + schunk->map[++schunk->map_used] = map_size_bytes; schunk->map[schunk->map_used] |= 1; - schunk->has_reserved = true; /* init dynamic chunk if necessary */ if (dyn_size) { dchunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); INIT_LIST_HEAD(&dchunk->list); INIT_LIST_HEAD(&dchunk->map_extend_list); - dchunk->base_addr = base_addr; + dchunk->base_addr = base_addr + ai->static_size + + ai->reserved_size; dchunk->map = dmap; dchunk->map_alloc = ARRAY_SIZE(dmap); dchunk->immutable = true; @@ -1725,11 +1776,11 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, dchunk->nr_populated = pcpu_unit_pages; dchunk->contig_hint = dchunk->free_size = dyn_size; - dchunk->map[0] = 1; - dchunk->map[1] = pcpu_reserved_chunk_limit; - dchunk->map[2] = (pcpu_reserved_chunk_limit + dchunk->free_size) | 1; - dchunk->map_used = 2; - dchunk->has_reserved = true; + dchunk->map[0] = 0; + dchunk->map[1] = dchunk->free_size | 1; + dchunk->map_used = 1; + + dchunk->nr_pages = dyn_size >> PAGE_SHIFT; } /* link the first chunk in */ -- 2.9.3
* Re: [PATCH 06/10] percpu: modify base_addr to be region specific 2017-07-16 2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou @ 2017-07-17 18:57 ` Tejun Heo 2017-07-18 19:26 ` Josef Bacik 1 sibling, 0 replies; 37+ messages in thread From: Tejun Heo @ 2017-07-17 18:57 UTC (permalink / raw) To: Dennis Zhou Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou Hello, On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> > > Originally, the first chunk is served by up to three chunks, each given > a region they are responsible for. Despite this, the arithmetic was based > off of the base_addr making it require offsets or be overly inclusive. > This patch changes percpu checks for first chunk to consider the only > the dynamic region and the reserved check to be only the reserved > region. There is no impact here besides making these checks a little > more accurate. > > This patch also adds the ground work increasing the minimum allocation > size to 4 bytes. The new field nr_pages in pcpu_chunk will be used to > keep track of the number of pages the bitmap serves. The arithmetic for > identifying first chunk and reserved chunk reflect this change. However small the patch might end up being, I'd much prefer changing the minimum alloc size to be a separate patch with rationale. > diff --git a/include/linux/percpu.h b/include/linux/percpu.h > index 98a371c..a5cedcd 100644 > --- a/include/linux/percpu.h > +++ b/include/linux/percpu.h > @@ -21,6 +21,10 @@ > /* minimum unit size, also is the maximum supported allocation size */ > #define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10) > > +/* minimum allocation size and shift in bytes */ > +#define PCPU_MIN_ALLOC_SIZE (1 << PCPU_MIN_ALLOC_SHIFT) > +#define PCPU_MIN_ALLOC_SHIFT 2 nitpick: Put SHIFT def above SIZE def? > +/* > + * Static addresses should never be passed into the allocator. 
They > + * are accessed using the group_offsets and therefore do not rely on > + * chunk->base_addr. > + */ > static bool pcpu_addr_in_first_chunk(void *addr) > { > void *first_start = pcpu_first_chunk->base_addr; > > - return addr >= first_start && addr < first_start + pcpu_unit_size; > + return addr >= first_start && > + addr < first_start + > + pcpu_first_chunk->nr_pages * PAGE_SIZE; Does the above line need line break? If so, it'd probably be easier to read if the broken line is indented (preferably to align with the start of the sub expression). e.g. return addr < first_start + pcpu_first_chunk->nr_pages * PAGE_SIZE; > static bool pcpu_addr_in_reserved_chunk(void *addr) > { > - void *first_start = pcpu_first_chunk->base_addr; > + void *first_start; > > - return addr >= first_start && > - addr < first_start + pcpu_reserved_chunk_limit; > + if (!pcpu_reserved_chunk) > + return false; > + > + first_start = pcpu_reserved_chunk->base_addr; > + > + return addr >= first_start + pcpu_reserved_offset && > + addr < first_start + > + pcpu_reserved_chunk->nr_pages * PAGE_SIZE; Ditto on indentation. > @@ -1366,10 +1388,17 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr) > * The following test on unit_low/high isn't strictly > * necessary but will speed up lookups of addresses which > * aren't in the first chunk. > + * > + * The address check is of high granularity checking against full > + * chunk sizes. pcpu_base_addr points to the beginning of the first > + * chunk including the static region. This allows us to examine all > + * regions of the first chunk. Assumes good intent as the first > + * chunk may not be full (ie. < pcpu_unit_pages in size). 
> */ > - first_low = pcpu_chunk_addr(pcpu_first_chunk, pcpu_low_unit_cpu, 0); > - first_high = pcpu_chunk_addr(pcpu_first_chunk, pcpu_high_unit_cpu, > - pcpu_unit_pages); > + first_low = (unsigned long) pcpu_base_addr + ^ no space for type casts > @@ -1575,6 +1604,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > unsigned int cpu; > int *unit_map; > int group, unit, i; > + unsigned long tmp_addr, aligned_addr; > + unsigned long map_size_bytes; How about just map_size or map_bytes? > @@ -1678,46 +1709,66 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > INIT_LIST_HEAD(&pcpu_slot[i]); > > /* > - * Initialize static chunk. If reserved_size is zero, the > - * static chunk covers static area + dynamic allocation area > - * in the first chunk. If reserved_size is not zero, it > - * covers static area + reserved area (mostly used for module > - * static percpu allocation). > + * Initialize static chunk. > + * The static region is dropped as those addresses are already > + * allocated and do not rely on chunk->base_addr. > + * reserved_size == 0: > + * the static chunk covers the dynamic area > + * reserved_size > 0: > + * the static chunk covers the reserved area > + * > + * If the static area is not page aligned, the region adjacent > + * to the static area must have its base_addr be offset into > + * the static area to have it be page aligned. The overlap is > + * then allocated preserving the alignment in the metadata for > + * the actual region. We can address this later but static chunk not covering static area is kinda confusing. The original complication came from trying to make the static chunk service either reserved or first dynamic chunk. We don't need that anymore and might as well use separate rchunk and dchunk. Thanks. -- tejun ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 06/10] percpu: modify base_addr to be region specific 2017-07-16 2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou 2017-07-17 18:57 ` Tejun Heo @ 2017-07-18 19:26 ` Josef Bacik 2017-07-18 19:36 ` Matthew Wilcox 1 sibling, 1 reply; 37+ messages in thread From: Josef Bacik @ 2017-07-18 19:26 UTC (permalink / raw) To: Dennis Zhou Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> > > Originally, the first chunk is served by up to three chunks, each given > a region they are responsible for. Despite this, the arithmetic was based > off of the base_addr making it require offsets or be overly inclusive. > This patch changes percpu checks for first chunk to consider only > the dynamic region and the reserved check to be only the reserved > region. There is no impact here besides making these checks a little > more accurate. > > This patch also adds the groundwork for increasing the minimum allocation > size to 4 bytes. The new field nr_pages in pcpu_chunk will be used to > keep track of the number of pages the bitmap serves. The arithmetic for > identifying first chunk and reserved chunk reflects this change.
> > Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> > --- > include/linux/percpu.h | 4 ++ > mm/percpu-internal.h | 12 +++-- > mm/percpu.c | 127 ++++++++++++++++++++++++++++++++++--------------- > 3 files changed, 100 insertions(+), 43 deletions(-) > > diff --git a/include/linux/percpu.h b/include/linux/percpu.h > index 98a371c..a5cedcd 100644 > --- a/include/linux/percpu.h > +++ b/include/linux/percpu.h > @@ -21,6 +21,10 @@ > /* minimum unit size, also is the maximum supported allocation size */ > #define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10) > > +/* minimum allocation size and shift in bytes */ > +#define PCPU_MIN_ALLOC_SIZE (1 << PCPU_MIN_ALLOC_SHIFT) > +#define PCPU_MIN_ALLOC_SHIFT 2 > + > /* > * Percpu allocator can serve percpu allocations before slab is > * initialized which allows slab to depend on the percpu allocator. > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > index c9158a4..56e1aba 100644 > --- a/mm/percpu-internal.h > +++ b/mm/percpu-internal.h > @@ -23,11 +23,12 @@ struct pcpu_chunk { > void *data; /* chunk data */ > int first_free; /* no free below this */ > bool immutable; /* no [de]population allowed */ > - bool has_reserved; /* Indicates if chunk has reserved space > - at the beginning. Reserved chunk will > - contain reservation for static chunk. > - Dynamic chunk will contain reservation > - for static and reserved chunks. 
*/ > + bool has_reserved; /* indicates if the region this chunk > + is responsible for overlaps with > + the prior adjacent region */ > + > + int nr_pages; /* # of PAGE_SIZE pages served > + by this chunk */ > int nr_populated; /* # of populated pages */ > unsigned long populated[]; /* populated bitmap */ > }; > @@ -40,6 +41,7 @@ extern int pcpu_nr_empty_pop_pages; > > extern struct pcpu_chunk *pcpu_first_chunk; > extern struct pcpu_chunk *pcpu_reserved_chunk; > +extern unsigned long pcpu_reserved_offset; > > #ifdef CONFIG_PERCPU_STATS > > diff --git a/mm/percpu.c b/mm/percpu.c > index 7704db9..c74ad68 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -144,14 +144,14 @@ static const size_t *pcpu_group_sizes __ro_after_init; > struct pcpu_chunk *pcpu_first_chunk __ro_after_init; > > /* > - * Optional reserved chunk. This chunk reserves part of the first > - * chunk and serves it for reserved allocations. The amount of > - * reserved offset is in pcpu_reserved_chunk_limit. When reserved > - * area doesn't exist, the following variables contain NULL and 0 > - * respectively. > + * Optional reserved chunk. This is the part of the first chunk that > + * serves reserved allocations. The pcpu_reserved_offset is the amount > + * the pcpu_reserved_chunk->base_addr is push back into the static > + * region for the base_addr to be page aligned. When the reserved area > + * doesn't exist, the following variables contain NULL and 0 respectively. > */ > struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init; > -static int pcpu_reserved_chunk_limit __ro_after_init; > +unsigned long pcpu_reserved_offset __ro_after_init; > > DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ > static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ > @@ -184,19 +184,32 @@ static void pcpu_schedule_balance_work(void) > schedule_work(&pcpu_balance_work); > } > > +/* > + * Static addresses should never be passed into the allocator. 
They > + * are accessed using the group_offsets and therefore do not rely on > + * chunk->base_addr. > + */ > static bool pcpu_addr_in_first_chunk(void *addr) > { > void *first_start = pcpu_first_chunk->base_addr; > > - return addr >= first_start && addr < first_start + pcpu_unit_size; > + return addr >= first_start && > + addr < first_start + > + pcpu_first_chunk->nr_pages * PAGE_SIZE; > } > > static bool pcpu_addr_in_reserved_chunk(void *addr) > { > - void *first_start = pcpu_first_chunk->base_addr; > + void *first_start; > > - return addr >= first_start && > - addr < first_start + pcpu_reserved_chunk_limit; > + if (!pcpu_reserved_chunk) > + return false; > + > + first_start = pcpu_reserved_chunk->base_addr; > + > + return addr >= first_start + pcpu_reserved_offset && > + addr < first_start + > + pcpu_reserved_chunk->nr_pages * PAGE_SIZE; > } > > static int __pcpu_size_to_slot(int size) > @@ -237,11 +250,16 @@ static int __maybe_unused pcpu_page_idx(unsigned int cpu, int page_idx) > return pcpu_unit_map[cpu] * pcpu_unit_pages + page_idx; > } > > +static unsigned long pcpu_unit_page_offset(unsigned int cpu, int page_idx) > +{ > + return pcpu_unit_offsets[cpu] + (page_idx << PAGE_SHIFT); > +} > + > static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk, > unsigned int cpu, int page_idx) > { > - return (unsigned long)chunk->base_addr + pcpu_unit_offsets[cpu] + > - (page_idx << PAGE_SHIFT); > + return (unsigned long)chunk->base_addr + > + pcpu_unit_page_offset(cpu, page_idx); > } > > static void __maybe_unused pcpu_next_unpop(struct pcpu_chunk *chunk, > @@ -737,6 +755,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(void) > chunk->free_size = pcpu_unit_size; > chunk->contig_hint = pcpu_unit_size; > > + chunk->nr_pages = pcpu_unit_pages; > + > return chunk; > } > > @@ -824,18 +844,20 @@ static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai); > * pcpu_chunk_addr_search - determine chunk containing specified address > * @addr: address for which 
the chunk needs to be determined. > * > + * This is an internal function that handles all but static allocations. > + * Static percpu address values should never be passed into the allocator. > + * > * RETURNS: > * The address of the found chunk. > */ > static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr) > { > /* is it in the first chunk? */ > - if (pcpu_addr_in_first_chunk(addr)) { > - /* is it in the reserved area? */ > - if (pcpu_addr_in_reserved_chunk(addr)) > - return pcpu_reserved_chunk; > + if (pcpu_addr_in_first_chunk(addr)) > return pcpu_first_chunk; > - } > + /* is it in the reserved chunk? */ > + if (pcpu_addr_in_reserved_chunk(addr)) > + return pcpu_reserved_chunk; > > /* > * The address is relative to unit0 which might be unused and > @@ -1366,10 +1388,17 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr) > * The following test on unit_low/high isn't strictly > * necessary but will speed up lookups of addresses which > * aren't in the first chunk. > + * > + * The address check is of high granularity checking against full > + * chunk sizes. pcpu_base_addr points to the beginning of the first > + * chunk including the static region. This allows us to examine all > + * regions of the first chunk. Assumes good intent as the first > + * chunk may not be full (ie. < pcpu_unit_pages in size). 
> */ > - first_low = pcpu_chunk_addr(pcpu_first_chunk, pcpu_low_unit_cpu, 0); > - first_high = pcpu_chunk_addr(pcpu_first_chunk, pcpu_high_unit_cpu, > - pcpu_unit_pages); > + first_low = (unsigned long) pcpu_base_addr + > + pcpu_unit_page_offset(pcpu_low_unit_cpu, 0); > + first_high = (unsigned long) pcpu_base_addr + > + pcpu_unit_page_offset(pcpu_high_unit_cpu, pcpu_unit_pages); > if ((unsigned long)addr >= first_low && > (unsigned long)addr < first_high) { > for_each_possible_cpu(cpu) { > @@ -1575,6 +1604,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > unsigned int cpu; > int *unit_map; > int group, unit, i; > + unsigned long tmp_addr, aligned_addr; > + unsigned long map_size_bytes; > > #define PCPU_SETUP_BUG_ON(cond) do { \ > if (unlikely(cond)) { \ > @@ -1678,46 +1709,66 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > INIT_LIST_HEAD(&pcpu_slot[i]); > > /* > - * Initialize static chunk. If reserved_size is zero, the > - * static chunk covers static area + dynamic allocation area > - * in the first chunk. If reserved_size is not zero, it > - * covers static area + reserved area (mostly used for module > - * static percpu allocation). > + * Initialize static chunk. > + * The static region is dropped as those addresses are already > + * allocated and do not rely on chunk->base_addr. > + * reserved_size == 0: > + * the static chunk covers the dynamic area > + * reserved_size > 0: > + * the static chunk covers the reserved area > + * > + * If the static area is not page aligned, the region adjacent > + * to the static area must have its base_addr be offset into > + * the static area to have it be page aligned. The overlap is > + * then allocated preserving the alignment in the metadata for > + * the actual region. 
> */ > + tmp_addr = (unsigned long)base_addr + ai->static_size; > + aligned_addr = tmp_addr & PAGE_MASK; > + pcpu_reserved_offset = tmp_addr - aligned_addr; > + > + map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + > + pcpu_reserved_offset; This confused me for a second, better to be explicit with (ai->reserved_size ? 0 : ai->dyn_size) + pcpu_reserved_offset; Thanks, Josef ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 06/10] percpu: modify base_addr to be region specific 2017-07-18 19:26 ` Josef Bacik @ 2017-07-18 19:36 ` Matthew Wilcox 2017-07-19 14:20 ` Josef Bacik 0 siblings, 1 reply; 37+ messages in thread From: Matthew Wilcox @ 2017-07-18 19:36 UTC (permalink / raw) To: Josef Bacik Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Tue, Jul 18, 2017 at 03:26:02PM -0400, Josef Bacik wrote: > On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote: > > + map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + > > + pcpu_reserved_offset; > > This confused me for a second, better to be explicit with > > (ai->reserved_size ? 0 : ai->dyn_size) + pcpu_reserved_offset; You're still confused ;-) What Dennis wrote is equivalent to: (ai->reserved_size ? ai->reserved_size : ai->dyn_size) + pcpu_reserved_offset; ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 06/10] percpu: modify base_addr to be region specific 2017-07-18 19:36 ` Matthew Wilcox @ 2017-07-19 14:20 ` Josef Bacik 0 siblings, 0 replies; 37+ messages in thread From: Josef Bacik @ 2017-07-19 14:20 UTC (permalink / raw) To: Matthew Wilcox Cc: Josef Bacik, Dennis Zhou, Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Tue, Jul 18, 2017 at 12:36:27PM -0700, Matthew Wilcox wrote: > On Tue, Jul 18, 2017 at 03:26:02PM -0400, Josef Bacik wrote: > > On Sat, Jul 15, 2017 at 10:23:11PM -0400, Dennis Zhou wrote: > > > + map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + > > > + pcpu_reserved_offset; > > > > This confused me for a second, better to be explicit with > > > > (ai->reserved_size ? 0 : ai->dyn_size) + pcpu_reserved_offset; > > You're still confused ;-) What Dennis wrote is equivalent to: > > (ai->reserved_size ? ai->reserved_size : ai->dyn_size) + pcpu_reserved_offset; Lol jesus, made my point even harder with me being an idiot. Thanks, Josef ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names 2017-07-16 2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou ` (5 preceding siblings ...) 2017-07-16 2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou @ 2017-07-16 2:23 ` Dennis Zhou 2017-07-17 19:10 ` Tejun Heo 2017-07-16 2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou ` (3 subsequent siblings) 10 siblings, 1 reply; 37+ messages in thread From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw) To: Tejun Heo, Christoph Lameter Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> With moving the base_addr in the chunks responsible for serving the first chunk up, the use of schunk/dchunk in pcpu_setup_first_chunk no longer makes sense. This makes the linking in the first chunk code not rely on a ternary and renames the variables to a shared variable, chunk, because the allocation path is sequential. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> --- mm/percpu.c | 96 ++++++++++++++++++++++++++++++------------------------------- 1 file changed, 48 insertions(+), 48 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index c74ad68..9dd28a2 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1597,7 +1597,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata; size_t dyn_size = ai->dyn_size; size_t size_sum = ai->static_size + ai->reserved_size + dyn_size; - struct pcpu_chunk *schunk, *dchunk = NULL; + struct pcpu_chunk *chunk; unsigned long *group_offsets; size_t *group_sizes; unsigned long *unit_off; @@ -1709,13 +1709,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, INIT_LIST_HEAD(&pcpu_slot[i]); /* - * Initialize static chunk. 
- * The static region is dropped as those addresses are already - * allocated and do not rely on chunk->base_addr. - * reserved_size == 0: - * the static chunk covers the dynamic area - * reserved_size > 0: - * the static chunk covers the reserved area + * Initialize first chunk. + * pcpu_first_chunk will always manage the dynamic region of the + * first chunk. The static region is dropped as those addresses + * are already allocated and do not rely on chunk->base_addr. + * + * ai->reserved == 0: + * reserved_chunk == NULL; * * If the static area is not page aligned, the region adjacent * to the static area must have its base_addr be offset into @@ -1730,61 +1730,61 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + pcpu_reserved_offset; - /* schunk allocation */ - schunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); - INIT_LIST_HEAD(&schunk->list); - INIT_LIST_HEAD(&schunk->map_extend_list); - schunk->base_addr = (void *)aligned_addr; - schunk->map = smap; - schunk->map_alloc = ARRAY_SIZE(smap); - schunk->immutable = true; - bitmap_fill(schunk->populated, pcpu_unit_pages); - schunk->nr_populated = pcpu_unit_pages; + /* chunk adjacent to static region allocation */ + chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); + INIT_LIST_HEAD(&chunk->list); + INIT_LIST_HEAD(&chunk->map_extend_list); + chunk->base_addr = (void *)aligned_addr; + chunk->map = smap; + chunk->map_alloc = ARRAY_SIZE(smap); + chunk->immutable = true; + bitmap_fill(chunk->populated, pcpu_unit_pages); + chunk->nr_populated = pcpu_unit_pages; - schunk->nr_pages = map_size_bytes >> PAGE_SHIFT; + chunk->nr_pages = map_size_bytes >> PAGE_SHIFT; if (ai->reserved_size) { - schunk->free_size = ai->reserved_size; - pcpu_reserved_chunk = schunk; + chunk->free_size = ai->reserved_size; + pcpu_reserved_chunk = chunk; } else { - schunk->free_size = dyn_size; + chunk->free_size = dyn_size; dyn_size = 0; /* dynamic area covered */ 
} - schunk->contig_hint = schunk->free_size; + chunk->contig_hint = chunk->free_size; if (pcpu_reserved_offset) { - schunk->has_reserved = true; - schunk->map[0] = 1; - schunk->map[1] = pcpu_reserved_offset; - schunk->map_used = 1; + chunk->has_reserved = true; + chunk->map[0] = 1; + chunk->map[1] = pcpu_reserved_offset; + chunk->map_used = 1; } - if (schunk->free_size) - schunk->map[++schunk->map_used] = map_size_bytes; - schunk->map[schunk->map_used] |= 1; + if (chunk->free_size) + chunk->map[++chunk->map_used] = map_size_bytes; + chunk->map[chunk->map_used] |= 1; - /* init dynamic chunk if necessary */ + /* init dynamic region of first chunk if necessary */ if (dyn_size) { - dchunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); - INIT_LIST_HEAD(&dchunk->list); - INIT_LIST_HEAD(&dchunk->map_extend_list); - dchunk->base_addr = base_addr + ai->static_size + + chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); + INIT_LIST_HEAD(&chunk->list); + INIT_LIST_HEAD(&chunk->map_extend_list); + chunk->base_addr = base_addr + ai->static_size + ai->reserved_size; - dchunk->map = dmap; - dchunk->map_alloc = ARRAY_SIZE(dmap); - dchunk->immutable = true; - bitmap_fill(dchunk->populated, pcpu_unit_pages); - dchunk->nr_populated = pcpu_unit_pages; - - dchunk->contig_hint = dchunk->free_size = dyn_size; - dchunk->map[0] = 0; - dchunk->map[1] = dchunk->free_size | 1; - dchunk->map_used = 1; - - dchunk->nr_pages = dyn_size >> PAGE_SHIFT; + chunk->map = dmap; + chunk->map_alloc = ARRAY_SIZE(dmap); + chunk->immutable = true; + bitmap_fill(chunk->populated, pcpu_unit_pages); + chunk->nr_populated = pcpu_unit_pages; + + chunk->contig_hint = chunk->free_size = dyn_size; + chunk->map[0] = 0; + chunk->map[1] = chunk->free_size | 1; + chunk->map_used = 1; + + chunk->nr_pages = dyn_size >> PAGE_SHIFT; } /* link the first chunk in */ - pcpu_first_chunk = dchunk ?: schunk; + pcpu_first_chunk = chunk; pcpu_nr_empty_pop_pages += pcpu_count_occupied_pages(pcpu_first_chunk, 1); 
pcpu_chunk_relocate(pcpu_first_chunk, -1); -- 2.9.3 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names 2017-07-16 2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou @ 2017-07-17 19:10 ` Tejun Heo 2017-07-24 20:07 ` Dennis Zhou 0 siblings, 1 reply; 37+ messages in thread From: Tejun Heo @ 2017-07-17 19:10 UTC (permalink / raw) To: Dennis Zhou Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Sat, Jul 15, 2017 at 10:23:12PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> > > With moving the base_addr in the chunks responsible for serving the > first chunk up, the use of schunk/dchunk in pcpu_setup_first_chunk no > longer makes sense. This makes the linking in the first chunk code not > rely on a ternary and renames the variables to a shared variable, chunk, > because the allocation path is sequential. Ah cool, please disregard my previous comment on the misnomer. You can explain in the previous patch's description that a follow-up patch will resolve the situation tho. > @@ -1709,13 +1709,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > INIT_LIST_HEAD(&pcpu_slot[i]); > > /* > + * Initialize first chunk. > + * pcpu_first_chunk will always manage the dynamic region of the > + * first chunk. The static region is dropped as those addresses Would "not covered by any chunk" be clearer than "dropped"? Thanks. -- tejun ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names 2017-07-17 19:10 ` Tejun Heo @ 2017-07-24 20:07 ` Dennis Zhou 0 siblings, 0 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-24 20:07 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou Hi Tejun, On Mon, Jul 17, 2017 at 03:10:09PM -0400, Tejun Heo wrote: > > /* > > + * Initialize first chunk. > > + * pcpu_first_chunk will always manage the dynamic region of the > > + * first chunk. The static region is dropped as those addresses > > Would "not covered by any chunk" be clearer than "dropped"? I've updated the comments in the new revision. This is explained in the function comment now. Thanks, Dennis ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps 2017-07-16 2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou ` (6 preceding siblings ...) 2017-07-16 2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou @ 2017-07-16 2:23 ` Dennis Zhou 2017-07-17 19:26 ` Tejun Heo 2017-07-16 2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou ` (2 subsequent siblings) 10 siblings, 1 reply; 37+ messages in thread From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw) To: Tejun Heo, Christoph Lameter Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> This patch changes the allocator to only mark allocated pages for the region the population bitmap is used for. Prior, the bitmap was marked completely used as the first chunk was allocated and immutable. This is misleading because the first chunk may not be completely filled. Additionally, with moving the base_addr up in the previous patch, the population map no longer corresponds to what is being checked. pcpu_nr_empty_pop_pages is used to ensure there are a handful of free pages around to serve atomic allocations. A new field, nr_empty_pop_pages, is added to the pcpu_chunk struct to keep track of the number of empty pages. This field is needed as the number of empty populated pages is globally kept track of and deltas are used to update it. This new field is exposed in percpu_stats. Now that chunk->nr_pages is the number of pages the chunk is serving, it is nice to use this in the work function for population and freeing of chunks rather than use the global variable pcpu_unit_pages. 
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> --- mm/percpu-internal.h | 1 + mm/percpu-stats.c | 1 + mm/percpu.c | 34 +++++++++++++++++++++------------- 3 files changed, 23 insertions(+), 13 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index 56e1aba..f0776f6 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -30,6 +30,7 @@ struct pcpu_chunk { int nr_pages; /* # of PAGE_SIZE pages served by this chunk */ int nr_populated; /* # of populated pages */ + int nr_empty_pop_pages; /* # of empty populated pages */ unsigned long populated[]; /* populated bitmap */ }; diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index 44e561d..6fc04b1 100644 --- a/mm/percpu-stats.c +++ b/mm/percpu-stats.c @@ -99,6 +99,7 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, P("nr_alloc", chunk->nr_alloc); P("max_alloc_size", chunk->max_alloc_size); + P("empty_pop_pages", chunk->nr_empty_pop_pages); P("free_size", chunk->free_size); P("contig_hint", chunk->contig_hint); P("sum_frag", sum_frag); diff --git a/mm/percpu.c b/mm/percpu.c index 9dd28a2..fb01841 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1164,7 +1164,7 @@ static void pcpu_balance_workfn(struct work_struct *work) list_for_each_entry_safe(chunk, next, &to_free, list) { int rs, re; - pcpu_for_each_pop_region(chunk, rs, re, 0, pcpu_unit_pages) { + pcpu_for_each_pop_region(chunk, rs, re, 0, chunk->nr_pages) { pcpu_depopulate_chunk(chunk, rs, re); spin_lock_irq(&pcpu_lock); pcpu_chunk_depopulated(chunk, rs, re); @@ -1221,7 +1221,7 @@ static void pcpu_balance_workfn(struct work_struct *work) spin_lock_irq(&pcpu_lock); list_for_each_entry(chunk, &pcpu_slot[slot], list) { - nr_unpop = pcpu_unit_pages - chunk->nr_populated; + nr_unpop = chunk->nr_pages - chunk->nr_populated; if (nr_unpop) break; } @@ -1231,7 +1231,7 @@ static void pcpu_balance_workfn(struct work_struct *work) continue; /* @chunk can't go away while pcpu_alloc_mutex is held */ - 
pcpu_for_each_unpop_region(chunk, rs, re, 0, pcpu_unit_pages) { + pcpu_for_each_unpop_region(chunk, rs, re, 0, chunk->nr_pages) { int nr = min(re - rs, nr_to_pop); ret = pcpu_populate_chunk(chunk, rs, rs + nr); @@ -1604,6 +1604,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, unsigned int cpu; int *unit_map; int group, unit, i; + int chunk_pages; unsigned long tmp_addr, aligned_addr; unsigned long map_size_bytes; @@ -1729,19 +1730,21 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + pcpu_reserved_offset; + chunk_pages = map_size_bytes >> PAGE_SHIFT; /* chunk adjacent to static region allocation */ - chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); + chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) + + BITS_TO_LONGS(chunk_pages), 0); INIT_LIST_HEAD(&chunk->list); INIT_LIST_HEAD(&chunk->map_extend_list); chunk->base_addr = (void *)aligned_addr; chunk->map = smap; chunk->map_alloc = ARRAY_SIZE(smap); chunk->immutable = true; - bitmap_fill(chunk->populated, pcpu_unit_pages); - chunk->nr_populated = pcpu_unit_pages; + bitmap_fill(chunk->populated, chunk_pages); + chunk->nr_populated = chunk->nr_empty_pop_pages = chunk_pages; - chunk->nr_pages = map_size_bytes >> PAGE_SHIFT; + chunk->nr_pages = chunk_pages; if (ai->reserved_size) { chunk->free_size = ai->reserved_size; @@ -1754,6 +1757,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, if (pcpu_reserved_offset) { chunk->has_reserved = true; + chunk->nr_empty_pop_pages--; chunk->map[0] = 1; chunk->map[1] = pcpu_reserved_offset; chunk->map_used = 1; @@ -1764,7 +1768,11 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, /* init dynamic region of first chunk if necessary */ if (dyn_size) { - chunk = memblock_virt_alloc(pcpu_chunk_struct_size, 0); + chunk_pages = dyn_size >> PAGE_SHIFT; + + /* chunk allocation */ + chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) + + 
BITS_TO_LONGS(chunk_pages), 0); INIT_LIST_HEAD(&chunk->list); INIT_LIST_HEAD(&chunk->map_extend_list); chunk->base_addr = base_addr + ai->static_size + @@ -1772,21 +1780,21 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, chunk->map = dmap; chunk->map_alloc = ARRAY_SIZE(dmap); chunk->immutable = true; - bitmap_fill(chunk->populated, pcpu_unit_pages); - chunk->nr_populated = pcpu_unit_pages; + bitmap_fill(chunk->populated, chunk_pages); + chunk->nr_populated = chunk_pages; + chunk->nr_empty_pop_pages = chunk_pages; chunk->contig_hint = chunk->free_size = dyn_size; chunk->map[0] = 0; chunk->map[1] = chunk->free_size | 1; chunk->map_used = 1; - chunk->nr_pages = dyn_size >> PAGE_SHIFT; + chunk->nr_pages = chunk_pages; } /* link the first chunk in */ pcpu_first_chunk = chunk; - pcpu_nr_empty_pop_pages += - pcpu_count_occupied_pages(pcpu_first_chunk, 1); + pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages; pcpu_chunk_relocate(pcpu_first_chunk, -1); pcpu_stats_chunk_alloc(); -- 2.9.3 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps 2017-07-16 2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou @ 2017-07-17 19:26 ` Tejun Heo 2017-07-24 20:13 ` Dennis Zhou 0 siblings, 1 reply; 37+ messages in thread From: Tejun Heo @ 2017-07-17 19:26 UTC (permalink / raw) To: Dennis Zhou Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou Hello, On Sat, Jul 15, 2017 at 10:23:13PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> > > This patch changes the allocator to only mark allocated pages for the > region the population bitmap is used for. Prior, the bitmap was marked > completely used as the first chunk was allocated and immutable. This is > misleading because the first chunk may not be completely filled. > Additionally, with moving the base_addr up in the previous patch, the > population map no longer corresponds to what is being checked. This in isolation makes sense although the rationale isn't clear from the description. Is it a mere cleanup or is this needed to enable further changes? > pcpu_nr_empty_pop_pages is used to ensure there are a handful of free > pages around to serve atomic allocations. A new field, nr_empty_pop_pages, > is added to the pcpu_chunk struct to keep track of the number of empty > pages. This field is needed as the number of empty populated pages is > globally kept track of and deltas are used to update it. This new field > is exposed in percpu_stats. But I can't see why this is being added or why this is in the same patch with the previous change. > Now that chunk->nr_pages is the number of pages the chunk is serving, it > is nice to use this in the work function for population and freeing of > chunks rather than use the global variable pcpu_unit_pages. The same goes for the above part. 
It's fine to collect misc changes into a patch when they're trivial and related in some ways but the content of this patch seems a bit random. Thanks. -- tejun
* Re: [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps 2017-07-17 19:26 ` Tejun Heo @ 2017-07-24 20:13 ` Dennis Zhou 0 siblings, 0 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-24 20:13 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Mon, Jul 17, 2017 at 03:26:02PM -0400, Tejun Heo wrote: > > This patch changes the allocator to only mark allocated pages for the > > region the population bitmap is used for. Prior, the bitmap was marked > > completely used as the first chunk was allocated and immutable. This is > > misleading because the first chunk may not be completely filled. > > Additionally, with moving the base_addr up in the previous patch, the > > population map no longer corresponds to what is being checked. > > This in isolation makes sense although the rationale isn't clear from > the description. Is it a mere cleanup or is this needed to enable > further changes? This change is clean up to make sure there is no misunderstanding between what part of the bitmap actually is meaningful and the actual size of the bitmap. > > pcpu_nr_empty_pop_pages is used to ensure there are a handful of free > > pages around to serve atomic allocations. A new field, nr_empty_pop_pages, > > is added to the pcpu_chunk struct to keep track of the number of empty > > pages. This field is needed as the number of empty populated pages is > > globally kept track of and deltas are used to update it. This new field > > is exposed in percpu_stats. > > But I can't see why this is being added or why this is in the same > patch with the previous change. > I've split this out into another patch. > > Now that chunk->nr_pages is the number of pages the chunk is serving, it > > is nice to use this in the work function for population and freeing of > > chunks rather than use the global variable pcpu_unit_pages. > > The same goes for the above part. 
It's fine to collect misc changes > into a patch when they're trivial and related in some ways but the > content of this patch seems a bit random. This change is needed in the same patch because chunk->nr_populated no longer is set to pcpu_unit_pages. The checks would check the dynamic chunk and then try to populate. Those checks should be checking against the size of the region being served which is nr_pages. Thanks, Dennis
* [PATCH 09/10] percpu: replace area map allocator with bitmap allocator 2017-07-16 2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou ` (7 preceding siblings ...) 2017-07-16 2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou @ 2017-07-16 2:23 ` Dennis Zhou 2017-07-17 23:27 ` Tejun Heo ` (2 more replies) 2017-07-16 2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou 2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik 10 siblings, 3 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw) To: Tejun Heo, Christoph Lameter Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> The percpu memory allocator is experiencing scalability issues when allocating and freeing large numbers of counters as in BPF. Additionally, there is a corner case where iteration is triggered over all chunks if the contig_hint is the right size, but wrong alignment. Implementation: This patch removes the area map allocator in favor of a bitmap allocator backed by metadata blocks. The primary goal is to provide consistency in performance and memory footprint with a focus on small allocations (< 64 bytes). The bitmap removes the heavy memmove from the freeing critical path and provides a consistent memory footprint. The metadata blocks provide a bound on the amount of scanning required by maintaining a set of hints. The chunks previously were managed by free_size, a value maintained in bytes. Now, the chunks are managed in terms of bits, which is just a scaled value of free_size down by PCPU_MIN_ALLOC_SIZE. There is one caveat with this implementation. In an effort to make freeing fast, the only time metadata is updated on the free path is if a whole block becomes free or the freed area spans across metadata blocks. 
This causes the chunk's contig_hint to be potentially smaller than what it could allocate by up to a block. If the chunk's contig_hint is smaller than a block, a check occurs and the hint is kept accurate. Metadata is always kept accurate on allocation and therefore the situation where a chunk has a larger contig_hint than available will never occur.

Evaluation:

I have primarily done testing against a simple workload of allocating 1 million objects of varying size. Deallocation was done in order, alternating, and in reverse. These numbers were collected after rebasing on top of a80099a152. I present the worst-case numbers here:

Area Map Allocator:

Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
         4B |             335 |           4960
        16B |             485 |           1150
        64B |             445 |            280
       128B |             505 |            177
      1024B |            3385 |            140

Bitmap Allocator:

Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
         4B |             725 |             70
        16B |             760 |             70
        64B |             855 |             80
       128B |             910 |             90
      1024B |            3770 |            260

This data demonstrates the inability of the area map allocator to handle less than ideal situations. In the best case of reverse deallocation, the area map allocator was able to perform within range of the bitmap allocator. In the worst-case situation, freeing took nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator dramatically improves the consistency of the free path. The small allocations performed nearly identically regardless of the freeing pattern. While it does add to the allocation latency, the allocation scenario here is optimal for the area map allocator. The second problem of additional scanning can result in the area map allocator completing in 52 minutes. The same workload takes only 14 seconds to complete for the bitmap allocator. This was produced under a more contrived scenario of allocating 1 million 4-byte objects with 8-byte alignment.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> --- include/linux/percpu.h | 10 +- init/main.c | 1 - mm/percpu-internal.h | 70 ++- mm/percpu-stats.c | 99 ++-- mm/percpu.c | 1280 ++++++++++++++++++++++++++++++------------------ 5 files changed, 923 insertions(+), 537 deletions(-) diff --git a/include/linux/percpu.h b/include/linux/percpu.h index a5cedcd..8f62b10 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -26,6 +26,15 @@ #define PCPU_MIN_ALLOC_SHIFT 2 /* + * This determines the size of each metadata block. There are several subtle + * constraints around this variable. The reserved_region and dynamic_region + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE. This is + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks + * that are backing multiple pages, this needs to be accounted for. + */ +#define PCPU_BITMAP_BLOCK_SIZE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT) + +/* * Percpu allocator can serve percpu allocations before slab is * initialized which allows slab to depend on the percpu allocator. 
* The following two parameters decide how much resource to @@ -120,7 +129,6 @@ extern bool is_kernel_percpu_address(unsigned long addr); #if !defined(CONFIG_SMP) || !defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) extern void __init setup_per_cpu_areas(void); #endif -extern void __init percpu_init_late(void); extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp); extern void __percpu *__alloc_percpu(size_t size, size_t align); diff --git a/init/main.c b/init/main.c index 052481f..c9a9fff 100644 --- a/init/main.c +++ b/init/main.c @@ -500,7 +500,6 @@ static void __init mm_init(void) page_ext_init_flatmem(); mem_init(); kmem_cache_init(); - percpu_init_late(); pgtable_init(); vmalloc_init(); ioremap_huge_init(); diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index f0776f6..2dac644 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -4,6 +4,21 @@ #include <linux/types.h> #include <linux/percpu.h> +/* + * pcpu_bitmap_md is the metadata block struct. + * All units are in terms of bits. 
+ */ +struct pcpu_bitmap_md { + int contig_hint; /* contig hint for block */ + int contig_hint_start; /* block relative starting + position of the contig hint */ + int left_free; /* size of free space along + the left side of the block */ + int right_free; /* size of free space along + the right side of the block */ + int first_free; /* block position of first free */ +}; + struct pcpu_chunk { #ifdef CONFIG_PERCPU_STATS int nr_alloc; /* # of allocations */ @@ -11,17 +26,20 @@ struct pcpu_chunk { #endif struct list_head list; /* linked to pcpu_slot lists */ - int free_size; /* free bytes in the chunk */ - int contig_hint; /* max contiguous size hint */ + int free_bits; /* free bits in the chunk */ + int contig_hint; /* max contiguous size hint + in bits */ + int contig_hint_start; /* contig_hint starting + bit offset */ void *base_addr; /* base address of this chunk */ - int map_used; /* # of map entries used before the sentry */ - int map_alloc; /* # of map entries allocated */ - int *map; /* allocation map */ - struct list_head map_extend_list;/* on pcpu_map_extend_chunks */ + unsigned long *alloc_map; /* allocation map */ + unsigned long *bound_map; /* boundary map */ + struct pcpu_bitmap_md *md_blocks; /* metadata blocks */ void *data; /* chunk data */ - int first_free; /* no free below this */ + int first_free_block; /* block that contains the first + free bit */ bool immutable; /* no [de]population allowed */ bool has_reserved; /* indicates if the region this chunk is responsible for overlaps with @@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk; extern struct pcpu_chunk *pcpu_reserved_chunk; extern unsigned long pcpu_reserved_offset; +/* + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks + * @chunk: chunk of interest + * + * This conversion is from the number of physical pages that the chunk + * serves to the number of bitmap blocks required. It converts to bytes + * served to bits required and then blocks used. 
+ */ +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk) +{ + return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE / + PCPU_BITMAP_BLOCK_SIZE; +} + +/* + * pcpu_pages_to_bits - converts the pages to size of bitmap + * @pages: number of physical pages + * + * This conversion is from physical pages to the number of bits + * required in the bitmap. + */ +static inline int pcpu_pages_to_bits(int pages) +{ + return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE; +} + +/* + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap + * @chunk: chunk of interest + * + * This conversion is from the number of physical pages that the chunk + * serves to the number of bits in the bitmap. + */ +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk) +{ + return pcpu_pages_to_bits(chunk->nr_pages); +} + #ifdef CONFIG_PERCPU_STATS #include <linux/spinlock.h> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index 6fc04b1..8dbef0c 100644 --- a/mm/percpu-stats.c +++ b/mm/percpu-stats.c @@ -29,64 +29,85 @@ static int cmpint(const void *a, const void *b) } /* - * Iterates over all chunks to find the max # of map entries used. + * Iterates over all chunks to find the max nr_alloc entries. */ -static int find_max_map_used(void) +static int find_max_nr_alloc(void) { struct pcpu_chunk *chunk; - int slot, max_map_used; + int slot, max_nr_alloc; - max_map_used = 0; + max_nr_alloc = 0; for (slot = 0; slot < pcpu_nr_slots; slot++) list_for_each_entry(chunk, &pcpu_slot[slot], list) - max_map_used = max(max_map_used, chunk->map_used); + max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc); - return max_map_used; + return max_nr_alloc; } /* * Prints out chunk state. Fragmentation is considered between * the beginning of the chunk to the last allocation. + * + * All statistics are in bytes unless stated otherwise. 
*/ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, int *buffer) { - int i, s_index, last_alloc, alloc_sign, as_len; + int i, last_alloc, as_len, start, end; int *alloc_sizes, *p; /* statistics */ int sum_frag = 0, max_frag = 0; int cur_min_alloc = 0, cur_med_alloc = 0, cur_max_alloc = 0; alloc_sizes = buffer; - s_index = chunk->has_reserved ? 1 : 0; - - /* find last allocation */ - last_alloc = -1; - for (i = chunk->map_used - 1; i >= s_index; i--) { - if (chunk->map[i] & 1) { - last_alloc = i; - break; - } - } - /* if the chunk is not empty - ignoring reserve */ - if (last_alloc >= s_index) { - as_len = last_alloc + 1 - s_index; - - /* - * Iterate through chunk map computing size info. - * The first bit is overloaded to be a used flag. - * negative = free space, positive = allocated - */ - for (i = 0, p = chunk->map + s_index; i < as_len; i++, p++) { - alloc_sign = (*p & 1) ? 1 : -1; - alloc_sizes[i] = alloc_sign * - ((p[1] & ~1) - (p[0] & ~1)); + /* + * find_last_bit returns the start value if nothing found. + * Therefore, we must determine if it is a failure of find_last_bit + * and set the appropriate value. + */ + last_alloc = find_last_bit(chunk->alloc_map, + pcpu_nr_pages_to_bits(chunk) - 1); + last_alloc = test_bit(last_alloc, chunk->alloc_map) ? + last_alloc + 1 : 0; + + start = as_len = 0; + if (chunk->has_reserved) + start = pcpu_reserved_offset; + + /* + * If a bit is set in the allocation map, the bound_map identifies + * where the allocation ends. If the allocation is not set, the + * bound_map does not identify free areas as it is only kept accurate + * on allocation, not free. + * + * Positive values are allocations and negative values are free + * fragments. 
+ */ + while (start < last_alloc) { + if (test_bit(start, chunk->alloc_map)) { + end = find_next_bit(chunk->bound_map, last_alloc, + start + 1); + alloc_sizes[as_len] = 1; + } else { + end = find_next_bit(chunk->alloc_map, last_alloc, + start + 1); + alloc_sizes[as_len] = -1; } - sort(alloc_sizes, as_len, sizeof(chunk->map[0]), cmpint, NULL); + alloc_sizes[as_len++] *= (end - start) * PCPU_MIN_ALLOC_SIZE; + + start = end; + } + + /* + * The negative values are free fragments and thus sorting gives the + * free fragments at the beginning in largest first order. + */ + if (as_len > 0) { + sort(alloc_sizes, as_len, sizeof(int), cmpint, NULL); - /* Iterate through the unallocated fragements. */ + /* iterate through the unallocated fragments */ for (i = 0, p = alloc_sizes; *p < 0 && i < as_len; i++, p++) { sum_frag -= *p; max_frag = max(max_frag, -1 * (*p)); @@ -100,8 +121,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, P("nr_alloc", chunk->nr_alloc); P("max_alloc_size", chunk->max_alloc_size); P("empty_pop_pages", chunk->nr_empty_pop_pages); - P("free_size", chunk->free_size); - P("contig_hint", chunk->contig_hint); + P("first_free_block", chunk->first_free_block); + P("free_size", chunk->free_bits * PCPU_MIN_ALLOC_SIZE); + P("contig_hint", chunk->contig_hint * PCPU_MIN_ALLOC_SIZE); P("sum_frag", sum_frag); P("max_frag", max_frag); P("cur_min_alloc", cur_min_alloc); @@ -113,22 +135,23 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, static int percpu_stats_show(struct seq_file *m, void *v) { struct pcpu_chunk *chunk; - int slot, max_map_used; + int slot, max_nr_alloc; int *buffer; alloc_buffer: spin_lock_irq(&pcpu_lock); - max_map_used = find_max_map_used(); + max_nr_alloc = find_max_nr_alloc(); spin_unlock_irq(&pcpu_lock); - buffer = vmalloc(max_map_used * sizeof(pcpu_first_chunk->map[0])); + /* there can be at most this many free and allocated fragments */ + buffer = vmalloc((2 * max_nr_alloc + 1) * 
sizeof(int)); if (!buffer) return -ENOMEM; spin_lock_irq(&pcpu_lock); /* if the buffer allocated earlier is too small */ - if (max_map_used < find_max_map_used()) { + if (max_nr_alloc < find_max_nr_alloc()) { spin_unlock_irq(&pcpu_lock); vfree(buffer); goto alloc_buffer; diff --git a/mm/percpu.c b/mm/percpu.c index fb01841..569df63 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -4,6 +4,9 @@ * Copyright (C) 2009 SUSE Linux Products GmbH * Copyright (C) 2009 Tejun Heo <tj@kernel.org> * + * Copyright (C) 2017 Facebook Inc. + * Copyright (C) 2017 Dennis Zhou <dennisszhou@gmail.com> + * * This file is released under the GPLv2 license. * * The percpu allocator handles both static and dynamic areas. Percpu @@ -34,19 +37,20 @@ * percpu variables from kernel modules. Finally, the dynamic section * takes care of normal allocations. * - * Allocation state in each chunk is kept using an array of integers - * on chunk->map. A positive value in the map represents a free - * region and negative allocated. Allocation inside a chunk is done - * by scanning this map sequentially and serving the first matching - * entry. This is mostly copied from the percpu_modalloc() allocator. - * Chunks can be determined from the address using the index field - * in the page struct. The index field contains a pointer to the chunk. - * - * These chunks are organized into lists according to free_size and - * tries to allocate from the fullest chunk first. Each chunk maintains - * a maximum contiguous area size hint which is guaranteed to be equal - * to or larger than the maximum contiguous area in the chunk. This - * helps prevent the allocator from iterating over chunks unnecessarily. + * The allocator organizes chunks into lists according to free size and + * tries to allocate from the fullest chunk first. Each chunk is managed + * by a bitmap with metadata blocks. 
The allocation map is updated on + * every allocation to reflect the current state while the boundary map + * is only updated on allocation. Each metadata block contains + * information to help mitigate the need to iterate over large portions + * of the bitmap. The reverse mapping from page to chunk is stored in + * the page's index. Lastly, units are lazily backed and grow in unison. + * + * There is a unique conversion that goes on here between bytes and bits. + * The chunk tracks the number of pages it is responsible for in nr_pages. + * From there, helper functions are used to convert from physical pages + * to bitmap bits and bitmap blocks. All hints are managed in bits + * unless explicitly stated. * * To use this allocator, arch code should do the following: * @@ -86,10 +90,13 @@ #include "percpu-internal.h" -#define PCPU_SLOT_BASE_SHIFT 5 /* 1-31 shares the same slot */ -#define PCPU_DFL_MAP_ALLOC 16 /* start a map with 16 ents */ -#define PCPU_ATOMIC_MAP_MARGIN_LOW 32 -#define PCPU_ATOMIC_MAP_MARGIN_HIGH 64 +/* + * The metadata is managed in terms of bits with each bit mapping to + * a fragment of size PCPU_MIN_ALLOC_SIZE. Thus, the slots are calculated + * with respect to the number of bits available. + */ +#define PCPU_SLOT_BASE_SHIFT 3 + #define PCPU_EMPTY_POP_PAGES_LOW 2 #define PCPU_EMPTY_POP_PAGES_HIGH 4 @@ -156,10 +163,11 @@ unsigned long pcpu_reserved_offset __ro_after_init; DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */ - -/* chunks which need their map areas extended, protected by pcpu_lock */ -static LIST_HEAD(pcpu_map_extend_chunks); +/* + * Chunk list slots. These slots order the chunks by the number of + * free bits available in the bitmap. + */ +struct list_head *pcpu_slot __ro_after_init; /* * The number of empty populated pages, protected by pcpu_lock. 
The @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr) pcpu_reserved_chunk->nr_pages * PAGE_SIZE; } -static int __pcpu_size_to_slot(int size) +static int __pcpu_size_to_slot(int bit_size) { - int highbit = fls(size); /* size is in bytes */ + int highbit = fls(bit_size); /* size is in bits */ return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1); } -static int pcpu_size_to_slot(int size) +static int pcpu_size_to_slot(int bit_size) { - if (size == pcpu_unit_size) + if (bit_size == pcpu_pages_to_bits(pcpu_unit_pages)) return pcpu_nr_slots - 1; - return __pcpu_size_to_slot(size); + return __pcpu_size_to_slot(bit_size); } static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) { - if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int)) + if (chunk->free_bits == 0 || chunk->contig_hint == 0) return 0; - return pcpu_size_to_slot(chunk->free_size); + return pcpu_size_to_slot(chunk->free_bits); } /* set the pointer to a chunk in a page struct */ @@ -277,6 +285,37 @@ static void __maybe_unused pcpu_next_pop(struct pcpu_chunk *chunk, } /* + * The following are helper functions to help access bitmaps and convert + * between bitmap offsets to actual address offsets. + */ +static unsigned long *pcpu_index_alloc_map(struct pcpu_chunk *chunk, int index) +{ + return chunk->alloc_map + + (index * PCPU_BITMAP_BLOCK_SIZE / BITS_PER_LONG); +} + +static unsigned long pcpu_off_to_block_index(int off) +{ + return off / PCPU_BITMAP_BLOCK_SIZE; +} + +static unsigned long pcpu_off_to_block_off(int off) +{ + return off & (PCPU_BITMAP_BLOCK_SIZE - 1); +} + +static unsigned long pcpu_block_off_to_off(int index, int off) +{ + return index * PCPU_BITMAP_BLOCK_SIZE + off; +} + +static unsigned long pcpu_block_get_first_page(int index) +{ + return PFN_DOWN(index * PCPU_BITMAP_BLOCK_SIZE * + PCPU_MIN_ALLOC_SIZE); +} + +/* * (Un)populated page region iterators. Iterate over (un)populated * page regions between @start and @end in @chunk. 
@rs and @re should * be integer variables and will be set to start and end page index of @@ -329,38 +368,6 @@ static void pcpu_mem_free(void *ptr) } /** - * pcpu_count_occupied_pages - count the number of pages an area occupies - * @chunk: chunk of interest - * @i: index of the area in question - * - * Count the number of pages chunk's @i'th area occupies. When the area's - * start and/or end address isn't aligned to page boundary, the straddled - * page is included in the count iff the rest of the page is free. - */ -static int pcpu_count_occupied_pages(struct pcpu_chunk *chunk, int i) -{ - int off = chunk->map[i] & ~1; - int end = chunk->map[i + 1] & ~1; - - if (!PAGE_ALIGNED(off) && i > 0) { - int prev = chunk->map[i - 1]; - - if (!(prev & 1) && prev <= round_down(off, PAGE_SIZE)) - off = round_down(off, PAGE_SIZE); - } - - if (!PAGE_ALIGNED(end) && i + 1 < chunk->map_used) { - int next = chunk->map[i + 1]; - int nend = chunk->map[i + 2] & ~1; - - if (!(next & 1) && nend >= round_up(end, PAGE_SIZE)) - end = round_up(end, PAGE_SIZE); - } - - return max_t(int, PFN_DOWN(end) - PFN_UP(off), 0); -} - -/** * pcpu_chunk_relocate - put chunk in the appropriate chunk slot * @chunk: chunk of interest * @oslot: the previous slot it was on @@ -386,385 +393,770 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) } /** - * pcpu_need_to_extend - determine whether chunk area map needs to be extended + * pcpu_cnt_pop_pages- counts populated backing pages in range * @chunk: chunk of interest - * @is_atomic: the allocation context + * @start: start index + * @end: end index * - * Determine whether area map of @chunk needs to be extended. If - * @is_atomic, only the amount necessary for a new allocation is - * considered; however, async extension is scheduled if the left amount is - * low. If !@is_atomic, it aims for more empty space. 
Combined, this - * ensures that the map is likely to have enough available space to - * accomodate atomic allocations which can't extend maps directly. - * - * CONTEXT: - * pcpu_lock. + * Calculates the number of populated pages in the region [start, end). + * This lets us keep track of how many empty populated pages are available + * and decide if we should schedule async work. * * RETURNS: - * New target map allocation length if extension is necessary, 0 - * otherwise. + * The nr of populated pages. */ -static int pcpu_need_to_extend(struct pcpu_chunk *chunk, bool is_atomic) +static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, + int start, int end) { - int margin, new_alloc; - - lockdep_assert_held(&pcpu_lock); + return bitmap_weight(chunk->populated, end) - + bitmap_weight(chunk->populated, start); +} - if (is_atomic) { - margin = 3; +/** + * pcpu_chunk_update_hint - updates metadata about a chunk + * @chunk: chunk of interest + * + * Responsible for iterating over metadata blocks to aggregate the + * overall statistics of the chunk. + * + * Updates: + * chunk->contig_hint + * chunk->contig_hint_start + * nr_empty_pop_pages + */ +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk) +{ + bool is_page_empty = true; + int i, off, cur_contig, nr_empty_pop_pages, l_pop_off; + struct pcpu_bitmap_md *block; + + chunk->contig_hint = cur_contig = 0; + off = nr_empty_pop_pages = 0; + l_pop_off = pcpu_block_get_first_page(chunk->first_free_block); + + for (i = chunk->first_free_block, block = chunk->md_blocks + i; + i < pcpu_nr_pages_to_blocks(chunk); i++, block++) { + /* Manage nr_empty_pop_pages. + * + * This is tricky. So the the background work function is + * triggered when there are not enough free populated pages. + * This is necessary to make sure atomic allocations can + * succeed. 
+ * + * The first page of each block is kept track of here allowing + * this to scale in both situations where there are > 1 page + * per block and where a block may be a portion of a page. + */ + int pop_off = pcpu_block_get_first_page(i); + + if (pop_off > l_pop_off) { + if (is_page_empty) + nr_empty_pop_pages += + pcpu_cnt_pop_pages(chunk, l_pop_off, + pop_off); + l_pop_off = pop_off; + is_page_empty = true; + } + if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE) + is_page_empty = false; - if (chunk->map_alloc < - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) { - if (list_empty(&chunk->map_extend_list)) { - list_add_tail(&chunk->map_extend_list, - &pcpu_map_extend_chunks); - pcpu_schedule_balance_work(); + /* continue from prev block adding to the cur_contig hint */ + if (cur_contig) { + cur_contig += block->left_free; + if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) { + continue; + } else if (cur_contig > chunk->contig_hint) { + chunk->contig_hint = cur_contig; + chunk->contig_hint_start = off; } + cur_contig = 0; } - } else { - margin = PCPU_ATOMIC_MAP_MARGIN_HIGH; + /* check if the block->contig_hint is larger */ + if (block->contig_hint > chunk->contig_hint) { + chunk->contig_hint = block->contig_hint; + chunk->contig_hint_start = + pcpu_block_off_to_off(i, + block->contig_hint_start); + } + /* let the next iteration catch the right_free */ + cur_contig = block->right_free; + off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free; } - if (chunk->map_alloc >= chunk->map_used + margin) - return 0; - - new_alloc = PCPU_DFL_MAP_ALLOC; - while (new_alloc < chunk->map_used + margin) - new_alloc *= 2; + /* catch last iteration if the last block ends with free space */ + if (cur_contig > chunk->contig_hint) { + chunk->contig_hint = cur_contig; + chunk->contig_hint_start = off; + } - return new_alloc; + /* + * Keep track of nr_empty_pop_pages. 
+ * + * The chunk is maintains the previous number of free pages it held, + * so the delta is used to update the global counter. The reserved + * chunk is not part of the free page count as they are populated + * at init and are special to serving reserved allocations. + */ + if (is_page_empty) { + nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off, + chunk->nr_pages); + } + if (chunk != pcpu_reserved_chunk) + pcpu_nr_empty_pop_pages += + (nr_empty_pop_pages - chunk->nr_empty_pop_pages); + chunk->nr_empty_pop_pages = nr_empty_pop_pages; } /** - * pcpu_extend_area_map - extend area map of a chunk + * pcpu_block_update_hint * @chunk: chunk of interest - * @new_alloc: new target allocation length of the area map + * @index: block index of the metadata block * - * Extend area map of @chunk to have @new_alloc entries. + * Full scan over the entire block to recalculate block-level metadata. + */ +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index) +{ + unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index); + struct pcpu_bitmap_md *block = chunk->md_blocks + index; + bool is_left_free = false, is_right_free = false; + int contig; + unsigned long start, end; + + block->contig_hint = 0; + start = end = block->first_free; + while (start < PCPU_BITMAP_BLOCK_SIZE) { + /* + * Scans the allocation map corresponding to this block + * to find free fragments and update metadata accordingly. 
+ */ + start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, + start); + if (start >= PCPU_BITMAP_BLOCK_SIZE) + break; + /* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */ + end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start); + /* update left_free */ + contig = end - start; + if (start == 0) { + block->left_free = contig; + is_left_free = true; + } + /* update right_free */ + if (end == PCPU_BITMAP_BLOCK_SIZE) { + block->right_free = contig; + is_right_free = true; + } + /* update block contig_hints */ + if (block->contig_hint < contig) { + block->contig_hint = contig; + block->contig_hint_start = start; + } + start = end; + } + + /* clear left/right free hints */ + if (!is_left_free) + block->left_free = 0; + if (!is_right_free) + block->right_free = 0; +} + +/** + * pcpu_block_update_hint_alloc - update hint on allocation path + * @chunk: chunk of interest + * @bit_off: bitmap offset + * @bit_size: size of request in allocation units * - * CONTEXT: - * Does GFP_KERNEL allocation. Grabs and releases pcpu_lock. + * Updates metadata for the allocation path. The metadata only has to be + * refreshed by a full scan iff we break the largest contig region. * * RETURNS: - * 0 on success, -errno on failure. + * Bool if we need to update the chunk's metadata. This occurs only if we + * break the chunk's contig hint. 
 */
-static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc)
+static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
+					 int bit_size)
 {
-	int *old = NULL, *new = NULL;
-	size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
-	unsigned long flags;
-
-	lockdep_assert_held(&pcpu_alloc_mutex);
+	bool update_chunk = false;
+	int i;
+	int s_index, e_index, s_off, e_off;
+	struct pcpu_bitmap_md *s_block, *e_block, *block;
 
-	new = pcpu_mem_zalloc(new_size);
-	if (!new)
-		return -ENOMEM;
+	/* calculate per block offsets */
+	s_index = pcpu_off_to_block_index(bit_off);
+	e_index = pcpu_off_to_block_index(bit_off + bit_size);
+	s_off = pcpu_off_to_block_off(bit_off);
+	e_off = pcpu_off_to_block_off(bit_off + bit_size);
 
-	/* acquire pcpu_lock and switch to new area map */
-	spin_lock_irqsave(&pcpu_lock, flags);
+	/*
+	 * If the offset is the beginning of the next block, set it to the
+	 * end of the previous block as the last bit is exclusive.
+	 */
+	if (e_off == 0) {
+		e_off = PCPU_BITMAP_BLOCK_SIZE;
+		e_index--;
+	}
 
-	if (new_alloc <= chunk->map_alloc)
-		goto out_unlock;
+	s_block = chunk->md_blocks + s_index;
+	e_block = chunk->md_blocks + e_index;
 
-	old_size = chunk->map_alloc * sizeof(chunk->map[0]);
-	old = chunk->map;
+	/*
+	 * Update s_block.
+	 *
+	 * block->first_free must be updated if the allocation takes its place.
+	 * If the allocation breaks the contig_hint, a scan is required to
+	 * restore this hint.
+	 */
+	if (s_off == s_block->first_free)
+		s_block->first_free = find_next_zero_bit(
+					pcpu_index_alloc_map(chunk, s_index),
+					PCPU_BITMAP_BLOCK_SIZE,
+					s_off + bit_size);
+
+	if (s_off >= s_block->contig_hint_start &&
+	    s_off < s_block->contig_hint_start + s_block->contig_hint) {
+		pcpu_block_refresh_hint(chunk, s_index);
+	} else {
+		/* update left and right contig manually */
+		s_block->left_free = min(s_block->left_free, s_off);
+		if (s_index == e_index)
+			s_block->right_free = min_t(int, s_block->right_free,
+					PCPU_BITMAP_BLOCK_SIZE - e_off);
+		else
+			s_block->right_free = 0;
+	}
 
-	memcpy(new, old, old_size);
+	/*
+	 * Update e_block.
+	 * If s_index != e_index, then e_block's first_free is guaranteed to
+	 * be at or after e_off. first_free must be updated and a scan
+	 * over e_block is issued.
+	 */
+	if (s_index != e_index) {
+		e_block->first_free = find_next_zero_bit(
+				pcpu_index_alloc_map(chunk, e_index),
+				PCPU_BITMAP_BLOCK_SIZE, e_off);
 
-	chunk->map_alloc = new_alloc;
-	chunk->map = new;
-	new = NULL;
+		pcpu_block_refresh_hint(chunk, e_index);
+	}
 
-out_unlock:
-	spin_unlock_irqrestore(&pcpu_lock, flags);
+	/* update in-between md_blocks */
+	for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index;
+	     i++, block++) {
+		block->contig_hint = 0;
+		block->left_free = 0;
+		block->right_free = 0;
+	}
 
 	/*
-	 * pcpu_mem_free() might end up calling vfree() which uses
-	 * IRQ-unsafe lock and thus can't be called under pcpu_lock.
+	 * The only time a full chunk scan is required is if the global
+	 * contig_hint is broken. Otherwise, it means a smaller space
+	 * was used and therefore the global contig_hint is still correct.
*/ - pcpu_mem_free(old); - pcpu_mem_free(new); + if (bit_off >= chunk->contig_hint_start && + bit_off < chunk->contig_hint_start + chunk->contig_hint) + update_chunk = true; - return 0; + return update_chunk; } /** - * pcpu_fit_in_area - try to fit the requested allocation in a candidate area - * @chunk: chunk the candidate area belongs to - * @off: the offset to the start of the candidate area - * @this_size: the size of the candidate area - * @size: the size of the target allocation - * @align: the alignment of the target allocation - * @pop_only: only allocate from already populated region - * - * We're trying to allocate @size bytes aligned at @align. @chunk's area - * at @off sized @this_size is a candidate. This function determines - * whether the target allocation fits in the candidate area and returns the - * number of bytes to pad after @off. If the target area doesn't fit, -1 - * is returned. - * - * If @pop_only is %true, this function only considers the already - * populated part of the candidate area. + * pcpu_block_update_hint_free - updates the block hints on the free path + * @chunk: chunk of interest + * @bit_off: bitmap offset + * @bit_size: size of request in allocation units + * + * Updates the hint along the free path by taking advantage of current metadata + * to minimize scanning of the bitmap. Triggers a global update if an entire + * block becomes free or the free spans across blocks. This tradeoff is to + * minimize global scanning to update the chunk->contig_hint. The + * chunk->contig_hint may be off by up to a block, but a chunk->contig_hint + * will never be more than the available space. If the chunk->contig_hint is + * in this block, it will be accurate. + * + * RETURNS: + * Bool if we need to update the chunk's metadata. This occurs if a larger + * contig region is created along the edges or we free across blocks. 
 */
-static int pcpu_fit_in_area(struct pcpu_chunk *chunk, int off, int this_size,
-			    int size, int align, bool pop_only)
+static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
+					int bit_size)
 {
-	int cand_off = off;
+	bool update_chunk = false;
+	int i;
+	int s_index, e_index, s_off, e_off;
+	int start, end, contig;
+	struct pcpu_bitmap_md *s_block, *e_block, *block;
 
-	while (true) {
-		int head = ALIGN(cand_off, align) - off;
-		int page_start, page_end, rs, re;
+	/* calculate per block offsets */
+	s_index = pcpu_off_to_block_index(bit_off);
+	e_index = pcpu_off_to_block_index(bit_off + bit_size);
+	s_off = pcpu_off_to_block_off(bit_off);
+	e_off = pcpu_off_to_block_off(bit_off + bit_size);
+
+	/*
+	 * If the offset is the beginning of the next block, set it to the
+	 * end of the previous block as the last bit is exclusive.
+	 */
+	if (e_off == 0) {
+		e_off = PCPU_BITMAP_BLOCK_SIZE;
+		e_index--;
+	}
+
+	s_block = chunk->md_blocks + s_index;
+	e_block = chunk->md_blocks + e_index;
+
+	/*
+	 * Check if the freed area aligns with the block->contig_hint.
+	 * If it does, then the scan to find the beginning/end of the
+	 * larger free area can be avoided.
+	 *
+	 * start and end refer to the beginning and end of the free region
+	 * within their respective blocks. This is not necessarily
+	 * the entire free region as it may span blocks past the beginning
+	 * or end of the block.
+	 */
+	start = s_off;
+	if (s_off == s_block->contig_hint + s_block->contig_hint_start) {
+		start = s_block->contig_hint_start;
+	} else {
+		int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index),
+					  start);
+		start = (start == l_bit) ?
0 : l_bit + 1; + } + + end = e_off; + if (e_off == e_block->contig_hint_start) + end = e_block->contig_hint_start + e_block->contig_hint; + else + end = find_next_bit(pcpu_index_alloc_map(chunk, e_index), + PCPU_BITMAP_BLOCK_SIZE, end); - if (this_size < head + size) - return -1; + /* freeing in the same block */ + if (s_index == e_index) { + contig = end - start; - if (!pop_only) - return head; + if (start == 0) + s_block->left_free = contig; + if (end == PCPU_BITMAP_BLOCK_SIZE) + s_block->right_free = contig; + + s_block->first_free = min(s_block->first_free, start); + if (contig > s_block->contig_hint) { + s_block->contig_hint = contig; + s_block->contig_hint_start = start; + } + + } else { /* - * If the first unpopulated page is beyond the end of the - * allocation, the whole allocation is populated; - * otherwise, retry from the end of the unpopulated area. + * Freeing across md_blocks. + * + * If the start is at the beginning of the block, just + * reset the block instead. */ - page_start = PFN_DOWN(head + off); - page_end = PFN_UP(head + off + size); - - rs = page_start; - pcpu_next_unpop(chunk, &rs, &re, PFN_UP(off + this_size)); - if (rs >= page_end) - return head; - cand_off = re * PAGE_SIZE; + if (start == 0) { + s_index--; + } else { + /* + * Knowing that the free is across blocks, this means + * the hint can be updated on the right side and the + * left side does not need to be touched. + */ + s_block->first_free = min(s_block->first_free, start); + contig = PCPU_BITMAP_BLOCK_SIZE - start; + s_block->right_free = contig; + if (contig > s_block->contig_hint) { + s_block->contig_hint = contig; + s_block->contig_hint_start = start; + } + } + /* + * If end is the entire e_block, just reset the block + * as well. + */ + if (end == PCPU_BITMAP_BLOCK_SIZE) { + e_index++; + } else { + /* + * The hint must only be on the left side, so + * update accordingly. 
+ */ + e_block->first_free = 0; + e_block->left_free = end; + if (end > e_block->contig_hint) { + e_block->contig_hint = end; + e_block->contig_hint_start = 0; + } + } + + /* reset md_blocks in the middle */ + for (i = s_index + 1, block = chunk->md_blocks + i; + i < e_index; i++, block++) { + block->first_free = 0; + block->contig_hint_start = 0; + block->contig_hint = PCPU_BITMAP_BLOCK_SIZE; + block->left_free = PCPU_BITMAP_BLOCK_SIZE; + block->right_free = PCPU_BITMAP_BLOCK_SIZE; + } } + + /* + * The hint is only checked in the s_block and e_block when + * freeing and particularly only when it is self contained within + * its own block. A scan is required if the free space spans + * blocks or makes a block whole as the scan will take into + * account free space across blocks. + */ + if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) || + s_index != e_index) { + update_chunk = true; + } else if (s_block->contig_hint > chunk->contig_hint) { + /* check if block contig_hint is bigger */ + chunk->contig_hint = s_block->contig_hint; + chunk->contig_hint_start = + pcpu_block_off_to_off(s_index, + s_block->contig_hint_start); + } + + return update_chunk; } /** - * pcpu_alloc_area - allocate area from a pcpu_chunk + * pcpu_is_populated - determines if the region is populated * @chunk: chunk of interest - * @size: wanted size in bytes - * @align: wanted align - * @pop_only: allocate only from the populated area - * @occ_pages_p: out param for the number of pages the area occupies + * @index: block index + * @block_off: offset within the bitmap + * @bit_size: size of request in allocation units + * @next_index: return value for next block index that is populated * - * Try to allocate @size bytes area aligned at @align from @chunk. - * Note that this function only allocates the offset. It doesn't - * populate or map the area. + * For atomic allocations, we must check if the backing pages are populated. * - * @chunk->map must have at least two free slots. 
+ * RETURNS:
+ * Bool if the backing pages are populated. next_index is to skip over
+ * unpopulated blocks in pcpu_find_block_fit.
+ */
+static bool pcpu_is_populated(struct pcpu_chunk *chunk, int index,
+			      int block_off, int bit_size, int *next_index)
+{
+	int page_start, page_end, rs, re;
+	int off = pcpu_block_off_to_off(index, block_off);
+	int e_off = off + bit_size * PCPU_MIN_ALLOC_SIZE;
+
+	page_start = PFN_DOWN(off);
+	page_end = PFN_UP(e_off);
+
+	rs = page_start;
+	pcpu_next_unpop(chunk, &rs, &re, PFN_UP(e_off));
+	if (rs >= page_end)
+		return true;
+	*next_index = re * PAGE_SIZE / PCPU_BITMAP_BLOCK_SIZE;
+	return false;
+}
+
+/**
+ * pcpu_find_block_fit - finds the block index to start searching
+ * @chunk: chunk of interest
+ * @bit_size: size of request in allocation units
+ * @align: alignment of area (max PAGE_SIZE)
+ * @pop_only: use populated regions only
  *
- * CONTEXT:
- * pcpu_lock.
+ * Given a chunk and an allocation spec, find the offset to begin searching
+ * for a free region. This is done by iterating over the bitmap metadata
+ * blocks and then only returning regions guaranteed to fit the requested
+ * alignment by comparing against the block->contig_hint_start or a correctly
+ * aligned offset. Iteration is used within a block as an allocation may be
+ * able to be served prior to the contig_hint.
+ *
+ * Note: This errs on the side of caution by only selecting blocks guaranteed
+ * to have a fit in the chunk's contig_hint. Poor alignment can cause
+ * us to skip over chunks that have valid vacancies.
  *
  * RETURNS:
- * Allocated offset in @chunk on success, -1 if no matching area is
- * found.
+ * The offset in the bitmap to begin searching.
+ * -1 if no offset is found.
*/ -static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align, - bool pop_only, int *occ_pages_p) +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size, + size_t align, bool pop_only) { - int oslot = pcpu_chunk_slot(chunk); - int max_contig = 0; - int i, off; - bool seen_free = false; - int *p; - - for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) { - int head, tail; - int this_size; - - off = *p; - if (off & 1) - continue; + int i, cur_free; + int s_index, block_off, next_index, end_off; /* interior alloc index */ + struct pcpu_bitmap_md *block; + unsigned long *alloc_map; - this_size = (p[1] & ~1) - off; + lockdep_assert_held(&pcpu_lock); - head = pcpu_fit_in_area(chunk, off, this_size, size, align, - pop_only); - if (head < 0) { - if (!seen_free) { - chunk->first_free = i; - seen_free = true; - } - max_contig = max(this_size, max_contig); + cur_free = block_off = 0; + s_index = chunk->first_free_block; + for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk); + i++) { + alloc_map = pcpu_index_alloc_map(chunk, i); + block = chunk->md_blocks + i; + + /* continue from prev block */ + cur_free += block->left_free; + if (cur_free >= bit_size) { + end_off = bit_size; + goto check_populated; + } else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) { continue; } /* - * If head is small or the previous block is free, - * merge'em. Note that 'small' is defined as smaller - * than sizeof(int), which is very small but isn't too - * uncommon for percpu allocations. + * Can this block hold this alloc? + * + * Here the block->contig_hint is used to guarantee a fit, + * but the block->first_free is returned as we may be able + * to serve the allocation earlier. The population check + * must take into account the area beginning at first_free + * through the end of the contig_hint. 
*/ - if (head && (head < sizeof(int) || !(p[-1] & 1))) { - *p = off += head; - if (p[-1] & 1) - chunk->free_size -= head; - else - max_contig = max(*p - p[-1], max_contig); - this_size -= head; - head = 0; + cur_free = 0; + s_index = i; + block_off = ALIGN(block->contig_hint_start, align); + block_off -= block->contig_hint_start; + if (block->contig_hint >= block_off + bit_size) { + block_off = block->first_free; + end_off = block->contig_hint_start - block_off + + bit_size; + goto check_populated; } - /* if tail is small, just keep it around */ - tail = this_size - head - size; - if (tail < sizeof(int)) { - tail = 0; - size = this_size - head; + /* check right */ + block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free, + align); + /* reset to start looking in the next block */ + if (block_off >= PCPU_BITMAP_BLOCK_SIZE) { + s_index++; + cur_free = block_off = 0; + continue; } + cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off; + if (cur_free >= bit_size) { + end_off = bit_size; +check_populated: + if (!pop_only || + pcpu_is_populated(chunk, s_index, block_off, + end_off, &next_index)) + break; - /* split if warranted */ - if (head || tail) { - int nr_extra = !!head + !!tail; - - /* insert new subblocks */ - memmove(p + nr_extra + 1, p + 1, - sizeof(chunk->map[0]) * (chunk->map_used - i)); - chunk->map_used += nr_extra; - - if (head) { - if (!seen_free) { - chunk->first_free = i; - seen_free = true; - } - *++p = off += head; - ++i; - max_contig = max(head, max_contig); - } - if (tail) { - p[1] = off + size; - max_contig = max(tail, max_contig); - } + i = next_index - 1; + s_index = next_index; + cur_free = block_off = 0; } + } - if (!seen_free) - chunk->first_free = i + 1; + /* nothing found */ + if (i == pcpu_nr_pages_to_blocks(chunk)) + return -1; - /* update hint and mark allocated */ - if (i + 1 == chunk->map_used) - chunk->contig_hint = max_contig; /* fully scanned */ - else - chunk->contig_hint = max(chunk->contig_hint, - max_contig); + return s_index * 
PCPU_BITMAP_BLOCK_SIZE + block_off; +} - chunk->free_size -= size; - *p |= 1; - *occ_pages_p = pcpu_count_occupied_pages(chunk, i); - pcpu_chunk_relocate(chunk, oslot); - return off; - } +/** + * pcpu_alloc_area - allocates area from a pcpu_chunk + * @chunk: chunk of interest + * @bit_size: size of request in allocation units + * @align: alignment of area (max PAGE_SIZE) + * @start: bit_off to start searching + * + * This function takes in a start bit_offset to begin searching. It + * searches the allocation bitmap to verify that the offset is available + * as block->first_free is provided when allocation within a block is + * available. + * + * RETURNS: + * Allocated addr offset in @chunk on success, + * -1 if no matching area is found + */ +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size, + size_t align, int start) +{ + size_t align_mask = (align) ? (align - 1) : 0; + int i, bit_off, oslot; + struct pcpu_bitmap_md *block; + + lockdep_assert_held(&pcpu_lock); + + oslot = pcpu_chunk_slot(chunk); + + /* search to find fit */ + bit_off = bitmap_find_next_zero_area(chunk->alloc_map, + pcpu_nr_pages_to_bits(chunk), + start, bit_size, align_mask); + + if (bit_off >= pcpu_nr_pages_to_bits(chunk)) + return -1; + + /* update alloc map */ + bitmap_set(chunk->alloc_map, bit_off, bit_size); + /* update boundary map */ + set_bit(bit_off, chunk->bound_map); + bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1); + set_bit(bit_off + bit_size, chunk->bound_map); + + chunk->free_bits -= bit_size; + + if (pcpu_block_update_hint_alloc(chunk, bit_off, bit_size)) + pcpu_chunk_update_hint(chunk); + + /* update chunk first_free */ + for (i = chunk->first_free_block, block = chunk->md_blocks + i; + i < pcpu_nr_pages_to_blocks(chunk); i++, block++) + if (block->contig_hint != 0) + break; + + chunk->first_free_block = i; - chunk->contig_hint = max_contig; /* fully scanned */ pcpu_chunk_relocate(chunk, oslot); - /* tell the upper layer that this chunk has no matching 
area */ - return -1; + return bit_off * PCPU_MIN_ALLOC_SIZE; } /** - * pcpu_free_area - free area to a pcpu_chunk + * pcpu_free_area - frees the corresponding offset * @chunk: chunk of interest - * @freeme: offset of area to free - * @occ_pages_p: out param for the number of pages the area occupies + * @off: addr offset into chunk * - * Free area starting from @freeme to @chunk. Note that this function - * only modifies the allocation map. It doesn't depopulate or unmap - * the area. - * - * CONTEXT: - * pcpu_lock. + * This function determines the size of an allocation to free using + * the boundary bitmap and clears the allocation map. A block metadata + * update is triggered and potentially a chunk update occurs. */ -static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme, - int *occ_pages_p) +static void pcpu_free_area(struct pcpu_chunk *chunk, int off) { - int oslot = pcpu_chunk_slot(chunk); - int off = 0; - unsigned i, j; - int to_free = 0; - int *p; + int bit_off, bit_size, index, end, oslot; + struct pcpu_bitmap_md *block; lockdep_assert_held(&pcpu_lock); pcpu_stats_area_dealloc(chunk); - freeme |= 1; /* we are searching for <given offset, in use> pair */ - - i = 0; - j = chunk->map_used; - while (i != j) { - unsigned k = (i + j) / 2; - off = chunk->map[k]; - if (off < freeme) - i = k + 1; - else if (off > freeme) - j = k; - else - i = j = k; - } - BUG_ON(off != freeme); + oslot = pcpu_chunk_slot(chunk); - if (i < chunk->first_free) - chunk->first_free = i; + bit_off = off / PCPU_MIN_ALLOC_SIZE; - p = chunk->map + i; - *p = off &= ~1; - chunk->free_size += (p[1] & ~1) - off; + /* find end index */ + end = find_next_bit(chunk->bound_map, pcpu_nr_pages_to_bits(chunk), + bit_off + 1); + bit_size = end - bit_off; - *occ_pages_p = pcpu_count_occupied_pages(chunk, i); + bitmap_clear(chunk->alloc_map, bit_off, bit_size); - /* merge with next? */ - if (!(p[1] & 1)) - to_free++; - /* merge with previous? 
 */
-	if (i > 0 && !(p[-1] & 1)) {
-		to_free++;
-		i--;
-		p--;
-	}
-	if (to_free) {
-		chunk->map_used -= to_free;
-		memmove(p + 1, p + 1 + to_free,
-			(chunk->map_used - i) * sizeof(chunk->map[0]));
-	}
+	chunk->free_bits += bit_size;
+
+	/* update first_free */
+	index = pcpu_off_to_block_index(bit_off);
+	block = chunk->md_blocks + index;
+	block->first_free = min_t(int, block->first_free,
+				  bit_off % PCPU_BITMAP_BLOCK_SIZE);
+
+	chunk->first_free_block = min(chunk->first_free_block, index);
+
+	if (pcpu_block_update_hint_free(chunk, bit_off, bit_size))
+		pcpu_chunk_update_hint(chunk);
 
-	chunk->contig_hint = max(chunk->map[i + 1] - chunk->map[i] - 1,
-				 chunk->contig_hint);
 	pcpu_chunk_relocate(chunk, oslot);
 }
 
+static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
+{
+	struct pcpu_bitmap_md *md_block;
+
+	for (md_block = chunk->md_blocks;
+	     md_block != chunk->md_blocks + pcpu_nr_pages_to_blocks(chunk);
+	     md_block++) {
+		md_block->contig_hint = PCPU_BITMAP_BLOCK_SIZE;
+		md_block->left_free = PCPU_BITMAP_BLOCK_SIZE;
+		md_block->right_free = PCPU_BITMAP_BLOCK_SIZE;
+	}
+}
+
+static struct pcpu_chunk * __init pcpu_alloc_first_chunk(int chunk_pages)
+{
+	struct pcpu_chunk *chunk;
+	int map_size_bits;
+
+	chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) +
+				    BITS_TO_LONGS(chunk_pages) *
+				    sizeof(unsigned long), 0);
+
+	INIT_LIST_HEAD(&chunk->list);
+	chunk->has_reserved = false;
+	chunk->immutable = true;
+
+	chunk->nr_pages = chunk_pages;
+	map_size_bits = pcpu_nr_pages_to_bits(chunk);
+
+	chunk->alloc_map = memblock_virt_alloc(
+				BITS_TO_LONGS(map_size_bits) *
+				sizeof(chunk->alloc_map[0]), 0);
+	chunk->bound_map = memblock_virt_alloc(
+				BITS_TO_LONGS(map_size_bits + 1) *
+				sizeof(chunk->bound_map[0]), 0);
+	chunk->md_blocks = memblock_virt_alloc(
+				pcpu_nr_pages_to_blocks(chunk) *
+				sizeof(struct pcpu_bitmap_md), 0);
+	pcpu_init_md_blocks(chunk);
+
+	/* fill page populated map - the first chunk is fully populated */
+	bitmap_fill(chunk->populated, chunk_pages);
+	chunk->nr_populated =
		chunk->nr_empty_pop_pages = chunk_pages;
+
+	return chunk;
+}
+
 static struct pcpu_chunk *pcpu_alloc_chunk(void)
 {
 	struct pcpu_chunk *chunk;
+	int map_size_bits;
 
 	chunk = pcpu_mem_zalloc(pcpu_chunk_struct_size);
 	if (!chunk)
 		return NULL;
 
-	chunk->map = pcpu_mem_zalloc(PCPU_DFL_MAP_ALLOC *
-						sizeof(chunk->map[0]));
-	if (!chunk->map) {
-		pcpu_mem_free(chunk);
-		return NULL;
-	}
-
-	chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
-	chunk->map[0] = 0;
-	chunk->map[1] = pcpu_unit_size | 1;
-	chunk->map_used = 1;
-	chunk->has_reserved = false;
-	INIT_LIST_HEAD(&chunk->list);
-	INIT_LIST_HEAD(&chunk->map_extend_list);
-	chunk->free_size = pcpu_unit_size;
-	chunk->contig_hint = pcpu_unit_size;
+	chunk->has_reserved = false;
 	chunk->nr_pages = pcpu_unit_pages;
+	map_size_bits = pcpu_nr_pages_to_bits(chunk);
+
+	chunk->alloc_map = pcpu_mem_zalloc(BITS_TO_LONGS(map_size_bits) *
+					   sizeof(chunk->alloc_map[0]));
+	if (!chunk->alloc_map)
+		goto alloc_map_fail;
+
+	chunk->bound_map = pcpu_mem_zalloc(BITS_TO_LONGS(map_size_bits + 1) *
+					   sizeof(chunk->bound_map[0]));
+	if (!chunk->bound_map)
+		goto bound_map_fail;
+
+	chunk->md_blocks = pcpu_mem_zalloc(pcpu_nr_pages_to_blocks(chunk) *
+					   sizeof(chunk->md_blocks[0]));
+	if (!chunk->md_blocks)
+		goto md_blocks_fail;
+
+	pcpu_init_md_blocks(chunk);
+
+	/* init metadata */
+	chunk->contig_hint = chunk->free_bits = map_size_bits;
 
 	return chunk;
+
+md_blocks_fail:
+	pcpu_mem_free(chunk->bound_map);
+bound_map_fail:
+	pcpu_mem_free(chunk->alloc_map);
+alloc_map_fail:
+	pcpu_mem_free(chunk);
+
+	return NULL;
 }
 
 static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
-	pcpu_mem_free(chunk->map);
+	pcpu_mem_free(chunk->md_blocks);
+	pcpu_mem_free(chunk->bound_map);
+	pcpu_mem_free(chunk->alloc_map);
 	pcpu_mem_free(chunk);
 }
 
@@ -787,6 +1179,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk,
 	bitmap_set(chunk->populated, page_start, nr);
 	chunk->nr_populated += nr;
+	chunk->nr_empty_pop_pages += nr;
 	pcpu_nr_empty_pop_pages +=
nr; } @@ -809,6 +1202,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk, bitmap_clear(chunk->populated, page_start, nr); chunk->nr_populated -= nr; + chunk->nr_empty_pop_pages -= nr; pcpu_nr_empty_pop_pages -= nr; } @@ -890,19 +1284,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, struct pcpu_chunk *chunk; const char *err; bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; - int occ_pages = 0; - int slot, off, new_alloc, cpu, ret; + int slot, off, cpu, ret; unsigned long flags; void __percpu *ptr; + size_t bit_size, bit_align; /* - * We want the lowest bit of offset available for in-use/free - * indicator, so force >= 16bit alignment and make size even. + * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE. + * Therefore alignment must be a minimum of that many bytes as well + * as the allocation will have internal fragmentation from + * rounding up by up to PCPU_MIN_ALLOC_SIZE - 1 bytes. */ - if (unlikely(align < 2)) - align = 2; + if (unlikely(align < PCPU_MIN_ALLOC_SIZE)) + align = PCPU_MIN_ALLOC_SIZE; - size = ALIGN(size, 2); + size = ALIGN(size, PCPU_MIN_ALLOC_SIZE); + bit_size = size >> PCPU_MIN_ALLOC_SHIFT; + bit_align = align >> PCPU_MIN_ALLOC_SHIFT; if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE || !is_power_of_2(align))) { @@ -920,23 +1318,14 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, if (reserved && pcpu_reserved_chunk) { chunk = pcpu_reserved_chunk; - if (size > chunk->contig_hint) { + off = pcpu_find_block_fit(chunk, bit_size, bit_align, + is_atomic); + if (off < 0) { err = "alloc from reserved chunk failed"; goto fail_unlock; } - while ((new_alloc = pcpu_need_to_extend(chunk, is_atomic))) { - spin_unlock_irqrestore(&pcpu_lock, flags); - if (is_atomic || - pcpu_extend_area_map(chunk, new_alloc) < 0) { - err = "failed to extend area map of reserved chunk"; - goto fail; - } - spin_lock_irqsave(&pcpu_lock, flags); - } - - off = 
pcpu_alloc_area(chunk, size, align, is_atomic, - &occ_pages); + off = pcpu_alloc_area(chunk, bit_size, bit_align, off); if (off >= 0) goto area_found; @@ -946,31 +1335,17 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, restart: /* search through normal chunks */ - for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) { + for (slot = pcpu_size_to_slot(bit_size); slot < pcpu_nr_slots; slot++) { list_for_each_entry(chunk, &pcpu_slot[slot], list) { - if (size > chunk->contig_hint) + if (bit_size > chunk->contig_hint) continue; - new_alloc = pcpu_need_to_extend(chunk, is_atomic); - if (new_alloc) { - if (is_atomic) - continue; - spin_unlock_irqrestore(&pcpu_lock, flags); - if (pcpu_extend_area_map(chunk, - new_alloc) < 0) { - err = "failed to extend area map"; - goto fail; - } - spin_lock_irqsave(&pcpu_lock, flags); - /* - * pcpu_lock has been dropped, need to - * restart cpu_slot list walking. - */ - goto restart; - } + off = pcpu_find_block_fit(chunk, bit_size, bit_align, + is_atomic); + if (off < 0) + continue; - off = pcpu_alloc_area(chunk, size, align, is_atomic, - &occ_pages); + off = pcpu_alloc_area(chunk, bit_size, bit_align, off); if (off >= 0) goto area_found; } @@ -1021,7 +1396,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, spin_lock_irqsave(&pcpu_lock, flags); if (ret) { - pcpu_free_area(chunk, off, &occ_pages); + pcpu_free_area(chunk, off); err = "failed to populate"; goto fail_unlock; } @@ -1032,12 +1407,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, mutex_unlock(&pcpu_alloc_mutex); } - if (chunk != pcpu_reserved_chunk) { - spin_lock_irqsave(&pcpu_lock, flags); - pcpu_nr_empty_pop_pages -= occ_pages; - spin_unlock_irqrestore(&pcpu_lock, flags); - } - if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW) pcpu_schedule_balance_work(); @@ -1155,7 +1524,6 @@ static void pcpu_balance_workfn(struct work_struct *work) if (chunk == 
list_first_entry(free_head, struct pcpu_chunk, list)) continue; - list_del_init(&chunk->map_extend_list); list_move(&chunk->list, &to_free); } @@ -1173,25 +1541,6 @@ static void pcpu_balance_workfn(struct work_struct *work) pcpu_destroy_chunk(chunk); } - /* service chunks which requested async area map extension */ - do { - int new_alloc = 0; - - spin_lock_irq(&pcpu_lock); - - chunk = list_first_entry_or_null(&pcpu_map_extend_chunks, - struct pcpu_chunk, map_extend_list); - if (chunk) { - list_del_init(&chunk->map_extend_list); - new_alloc = pcpu_need_to_extend(chunk, false); - } - - spin_unlock_irq(&pcpu_lock); - - if (new_alloc) - pcpu_extend_area_map(chunk, new_alloc); - } while (chunk); - /* * Ensure there are certain number of free populated pages for * atomic allocs. Fill up from the most packed so that atomic @@ -1213,7 +1562,8 @@ static void pcpu_balance_workfn(struct work_struct *work) 0, PCPU_EMPTY_POP_PAGES_HIGH); } - for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) { + for (slot = pcpu_size_to_slot(PAGE_SIZE / PCPU_MIN_ALLOC_SIZE); + slot < pcpu_nr_slots; slot++) { int nr_unpop = 0, rs, re; if (!nr_to_pop) @@ -1277,7 +1627,7 @@ void free_percpu(void __percpu *ptr) void *addr; struct pcpu_chunk *chunk; unsigned long flags; - int off, occ_pages; + int off; if (!ptr) return; @@ -1291,13 +1641,10 @@ void free_percpu(void __percpu *ptr) chunk = pcpu_chunk_addr_search(addr); off = addr - chunk->base_addr; - pcpu_free_area(chunk, off, &occ_pages); - - if (chunk != pcpu_reserved_chunk) - pcpu_nr_empty_pop_pages += occ_pages; + pcpu_free_area(chunk, off); /* if there are more than one fully free chunks, wake up grim reaper */ - if (chunk->free_size == pcpu_unit_size) { + if (chunk->free_bits == pcpu_pages_to_bits(pcpu_unit_pages)) { struct pcpu_chunk *pos; list_for_each_entry(pos, &pcpu_slot[pcpu_nr_slots - 1], list) @@ -1363,15 +1710,15 @@ bool is_kernel_percpu_address(unsigned long addr) * address. 
The caller is responsible for ensuring @addr stays valid * until this function finishes. * - * percpu allocator has special setup for the first chunk, which currently + * Percpu allocator has special setup for the first chunk, which currently * supports either embedding in linear address space or vmalloc mapping, * and, from the second one, the backing allocator (currently either vm or * km) provides translation. * * The addr can be translated simply without checking if it falls into the - * first chunk. But the current code reflects better how percpu allocator + * first chunk. But the current code reflects better how percpu allocator * actually works, and the verification can discover both bugs in percpu - * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current + * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current * code. * * RETURNS: @@ -1417,9 +1764,10 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr) else return page_to_phys(vmalloc_to_page(addr)) + offset_in_page(addr); - } else + } else { return page_to_phys(pcpu_addr_to_page(addr)) + offset_in_page(addr); + } } /** @@ -1555,10 +1903,12 @@ static void pcpu_dump_alloc_info(const char *lvl, * static areas on architectures where the addressing model has * limited offset range for symbol relocations to guarantee module * percpu symbols fall inside the relocatable range. + * @ai->static_size + @ai->reserved_size is expected to be page aligned. * * @ai->dyn_size determines the number of bytes available for dynamic - * allocation in the first chunk. The area between @ai->static_size + - * @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused. + * allocation in the first chunk. Both the start and the end are expected + * to be page aligned. The area between @ai->static_size + @ai->reserved_size + * + @ai->dyn_size and @ai->unit_size is unused. 
* * @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE * and equal to or larger than @ai->static_size + @ai->reserved_size + @@ -1581,11 +1931,11 @@ static void pcpu_dump_alloc_info(const char *lvl, * copied static data to each unit. * * If the first chunk ends up with both reserved and dynamic areas, it - * is served by two chunks - one to serve the core static and reserved - * areas and the other for the dynamic area. They share the same vm - * and page map but uses different area allocation map to stay away - * from each other. The latter chunk is circulated in the chunk slots - * and available for dynamic allocation like any other chunks. + * is served by two chunks - one to serve the reserved area and the other + * for the dynamic area. They share the same vm and page map but use + * different area allocation map to stay away from each other. The latter + * chunk is circulated in the chunk slots and available for dynamic allocation + * like any other chunks. * * RETURNS: * 0 on success, -errno on failure. @@ -1593,8 +1943,6 @@ static void pcpu_dump_alloc_info(const char *lvl, int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, void *base_addr) { - static int smap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata; - static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata; size_t dyn_size = ai->dyn_size; size_t size_sum = ai->static_size + ai->reserved_size + dyn_size; struct pcpu_chunk *chunk; @@ -1606,7 +1954,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, int group, unit, i; int chunk_pages; unsigned long tmp_addr, aligned_addr; - unsigned long map_size_bytes; + unsigned long map_size_bytes, begin_fill_bits; #define PCPU_SETUP_BUG_ON(cond) do { \ if (unlikely(cond)) { \ @@ -1703,7 +2051,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, * Allocate chunk slots. The additional last slot is for * empty chunks. 
*/ - pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2; + pcpu_nr_slots = __pcpu_size_to_slot( + pcpu_pages_to_bits(pcpu_unit_pages)) + 2; pcpu_slot = memblock_virt_alloc( pcpu_nr_slots * sizeof(pcpu_slot[0]), 0); for (i = 0; i < pcpu_nr_slots; i++) @@ -1727,69 +2076,50 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, tmp_addr = (unsigned long)base_addr + ai->static_size; aligned_addr = tmp_addr & PAGE_MASK; pcpu_reserved_offset = tmp_addr - aligned_addr; + begin_fill_bits = pcpu_reserved_offset / PCPU_MIN_ALLOC_SIZE; map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + pcpu_reserved_offset; + chunk_pages = map_size_bytes >> PAGE_SHIFT; /* chunk adjacent to static region allocation */ - chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) + - BITS_TO_LONGS(chunk_pages), 0); - INIT_LIST_HEAD(&chunk->list); - INIT_LIST_HEAD(&chunk->map_extend_list); + chunk = pcpu_alloc_first_chunk(chunk_pages); chunk->base_addr = (void *)aligned_addr; - chunk->map = smap; - chunk->map_alloc = ARRAY_SIZE(smap); chunk->immutable = true; - bitmap_fill(chunk->populated, chunk_pages); - chunk->nr_populated = chunk->nr_empty_pop_pages = chunk_pages; - - chunk->nr_pages = chunk_pages; - if (ai->reserved_size) { - chunk->free_size = ai->reserved_size; - pcpu_reserved_chunk = chunk; - } else { - chunk->free_size = dyn_size; - dyn_size = 0; /* dynamic area covered */ - } - chunk->contig_hint = chunk->free_size; + /* set metadata */ + chunk->contig_hint = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits; + chunk->free_bits = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits; - if (pcpu_reserved_offset) { + /* + * If the beginning of the reserved region overlaps the end of the + * static region, hide that portion in the metadata. 
+ */ + if (begin_fill_bits) { chunk->has_reserved = true; - chunk->nr_empty_pop_pages--; - chunk->map[0] = 1; - chunk->map[1] = pcpu_reserved_offset; - chunk->map_used = 1; + bitmap_fill(chunk->alloc_map, begin_fill_bits); + set_bit(0, chunk->bound_map); + set_bit(begin_fill_bits, chunk->bound_map); + + if (pcpu_block_update_hint_alloc(chunk, 0, begin_fill_bits)) + pcpu_chunk_update_hint(chunk); } - if (chunk->free_size) - chunk->map[++chunk->map_used] = map_size_bytes; - chunk->map[chunk->map_used] |= 1; - /* init dynamic region of first chunk if necessary */ - if (dyn_size) { + /* init dynamic chunk if necessary */ + if (ai->reserved_size) { + pcpu_reserved_chunk = chunk; + chunk_pages = dyn_size >> PAGE_SHIFT; /* chunk allocation */ - chunk = memblock_virt_alloc(sizeof(struct pcpu_chunk) + - BITS_TO_LONGS(chunk_pages), 0); - INIT_LIST_HEAD(&chunk->list); - INIT_LIST_HEAD(&chunk->map_extend_list); + chunk = pcpu_alloc_first_chunk(chunk_pages); chunk->base_addr = base_addr + ai->static_size + ai->reserved_size; - chunk->map = dmap; - chunk->map_alloc = ARRAY_SIZE(dmap); - chunk->immutable = true; - bitmap_fill(chunk->populated, chunk_pages); - chunk->nr_populated = chunk_pages; - chunk->nr_empty_pop_pages = chunk_pages; - - chunk->contig_hint = chunk->free_size = dyn_size; - chunk->map[0] = 0; - chunk->map[1] = chunk->free_size | 1; - chunk->map_used = 1; - - chunk->nr_pages = chunk_pages; + + /* set metadata */ + chunk->contig_hint = pcpu_nr_pages_to_bits(chunk); + chunk->free_bits = pcpu_nr_pages_to_bits(chunk); } /* link the first chunk in */ @@ -2370,36 +2700,6 @@ void __init setup_per_cpu_areas(void) #endif /* CONFIG_SMP */ /* - * First and reserved chunks are initialized with temporary allocation - * map in initdata so that they can be used before slab is online. - * This function is called after slab is brought up and replaces those - * with properly allocated maps. 
- */ -void __init percpu_init_late(void) -{ - struct pcpu_chunk *target_chunks[] = - { pcpu_first_chunk, pcpu_reserved_chunk, NULL }; - struct pcpu_chunk *chunk; - unsigned long flags; - int i; - - for (i = 0; (chunk = target_chunks[i]); i++) { - int *map; - const size_t size = PERCPU_DYNAMIC_EARLY_SLOTS * sizeof(map[0]); - - BUILD_BUG_ON(size > PAGE_SIZE); - - map = pcpu_mem_zalloc(size); - BUG_ON(!map); - - spin_lock_irqsave(&pcpu_lock, flags); - memcpy(map, chunk->map, size); - chunk->map = map; - spin_unlock_irqrestore(&pcpu_lock, flags); - } -} - -/* * Percpu allocator is initialized early during boot when neither slab or * workqueue is available. Plug async management until everything is up * and running. -- 2.9.3
* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator 2017-07-16 2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou @ 2017-07-17 23:27 ` Tejun Heo 2017-07-24 21:37 ` Dennis Zhou 2017-07-19 19:11 ` Josef Bacik 2017-07-19 19:16 ` Josef Bacik 2 siblings, 1 reply; 37+ messages in thread From: Tejun Heo @ 2017-07-17 23:27 UTC (permalink / raw) To: Dennis Zhou Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote: ... > While it does add to the allocation latency, the allocation scenario > here is optimal for the area map allocator. The second problem of > additional scanning can result in the area map allocator completing in > 52 minutes. The same workload takes only 14 seconds to complete for the > bitmap allocator. This was produced under a more contrived scenario of > allocating 1 million 4-byte objects with 8-byte alignment. I think it'd be nice to have a similar table for allocation patterns which aren't ideal for the original allocator. The biggest goal is avoiding cases where the allocator collapses, and just glancing at the table doesn't make that case very compelling. > /* > + * This determines the size of each metadata block. There are several subtle > + * constraints around this variable. The reserved_region and dynamic_region ^ constant > + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE. This is > + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks > + * that are backing multiple pages, this needs to be accounted for. > + */ > +#define PCPU_BITMAP_BLOCK_SIZE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT) Given that the percpu allocator can align only up to a page, the restriction makes sense to me. I'm kinda curious whether PAGE_SIZE blocks is optimal tho. Why did you pick PAGE_SIZE? 
> @@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk; > extern struct pcpu_chunk *pcpu_reserved_chunk; > extern unsigned long pcpu_reserved_offset; > > +/* ^^ /** Ditto for other comments. > + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks > + * @chunk: chunk of interest > + * > + * This conversion is from the number of physical pages that the chunk > + * serves to the number of bitmap blocks required. It converts to bytes > + * served to bits required and then blocks used. > + */ > +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk) Maybe just pcpu_chunk_nr_blocks()? > +{ > + return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE / > + PCPU_BITMAP_BLOCK_SIZE; > +} > + > +/* > + * pcpu_pages_to_bits - converts the pages to size of bitmap > + * @pages: number of physical pages > + * > + * This conversion is from physical pages to the number of bits > + * required in the bitmap. > + */ > +static inline int pcpu_pages_to_bits(int pages) pcpu_nr_pages_to_map_bits()? > +{ > + return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE; > +} > + > +/* > + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap > + * @chunk: chunk of interest > + * > + * This conversion is from the number of physical pages that the chunk > + * serves to the number of bits in the bitmap. > + */ > +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk) pcpu_chunk_map_bits()? > @@ -86,10 +90,13 @@ > > #include "percpu-internal.h" > > -#define PCPU_SLOT_BASE_SHIFT 5 /* 1-31 shares the same slot */ > -#define PCPU_DFL_MAP_ALLOC 16 /* start a map with 16 ents */ > -#define PCPU_ATOMIC_MAP_MARGIN_LOW 32 > -#define PCPU_ATOMIC_MAP_MARGIN_HIGH 64 > +/* > + * The metadata is managed in terms of bits with each bit mapping to > + * a fragment of size PCPU_MIN_ALLOC_SIZE. Thus, the slots are calculated > + * with respect to the number of bits available. 
> + */ > +#define PCPU_SLOT_BASE_SHIFT 3 Ah, so this is actually the same as before: 3 + 2, order 5. Can you please note the explicit number in the comment? > #define PCPU_EMPTY_POP_PAGES_LOW 2 > #define PCPU_EMPTY_POP_PAGES_HIGH 4 and these numbers too. I can't tell how these numbers would map. Also, any chance we can have these numbers in a more intuitive unit? > @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr) > pcpu_reserved_chunk->nr_pages * PAGE_SIZE; > } > > -static int __pcpu_size_to_slot(int size) > +static int __pcpu_size_to_slot(int bit_size) Wouldn't sth like @map_bits be more intuitive than @bit_size? We can just use @bits too. > { > - int highbit = fls(size); /* size is in bytes */ > + int highbit = fls(bit_size); /* size is in bits */ > return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1); > } > > -static int pcpu_size_to_slot(int size) > +static int pcpu_size_to_slot(int bit_size) Ditto. > +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk) > +{ > + bool is_page_empty = true; > + int i, off, cur_contig, nr_empty_pop_pages, l_pop_off; > + struct pcpu_bitmap_md *block; > + > + chunk->contig_hint = cur_contig = 0; > + off = nr_empty_pop_pages = 0; > + l_pop_off = pcpu_block_get_first_page(chunk->first_free_block); > + > + for (i = chunk->first_free_block, block = chunk->md_blocks + i; > + i < pcpu_nr_pages_to_blocks(chunk); i++, block++) { > + /* Manage nr_empty_pop_pages. The first line of a winged comment should be blank, so... /* * Manage nr_empty_pop_pages. > + * > + * This is tricky. So the the background work function is ^^^^^^^ > + * triggered when there are not enough free populated pages. > + * This is necessary to make sure atomic allocations can > + * succeed. > + * > + * The first page of each block is kept track of here allowing > + * this to scale in both situations where there are > 1 page > + * per block and where a block may be a portion of a page. 
> + */ > + int pop_off = pcpu_block_get_first_page(i); > + > + if (pop_off > l_pop_off) { > + if (is_page_empty) > + nr_empty_pop_pages += > + pcpu_cnt_pop_pages(chunk, l_pop_off, > + pop_off); IIUC, this is trying to handle multi-page block size, right? > + l_pop_off = pop_off; > + is_page_empty = true; > + } > + if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE) But isn't this assuming that each block is page sized? > + is_page_empty = false; > > + /* continue from prev block adding to the cur_contig hint */ > + if (cur_contig) { > + cur_contig += block->left_free; > + if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) { > + continue; > + } else if (cur_contig > chunk->contig_hint) { The "else" here is superfluous, right? The if block always continues. > + chunk->contig_hint = cur_contig; > + chunk->contig_hint_start = off; > } > + cur_contig = 0; > } > + /* check if the block->contig_hint is larger */ > + if (block->contig_hint > chunk->contig_hint) { > + chunk->contig_hint = block->contig_hint; > + chunk->contig_hint_start = > + pcpu_block_off_to_off(i, > + block->contig_hint_start); > + } > + /* let the next iteration catch the right_free */ > + cur_contig = block->right_free; > + off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free; > } > > + /* catch last iteration if the last block ends with free space */ > + if (cur_contig > chunk->contig_hint) { > + chunk->contig_hint = cur_contig; > + chunk->contig_hint_start = off; > + } > > + /* > + * Keep track of nr_empty_pop_pages. > + * > + * The chunk is maintains the previous number of free pages it held, > + * so the delta is used to update the global counter. The reserved > + * chunk is not part of the free page count as they are populated > + * at init and are special to serving reserved allocations. > + */ > + if (is_page_empty) { > + nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off, > + chunk->nr_pages); > + } Unnecessary {}. 
> + if (chunk != pcpu_reserved_chunk) > + pcpu_nr_empty_pop_pages += > + (nr_empty_pop_pages - chunk->nr_empty_pop_pages); > + chunk->nr_empty_pop_pages = nr_empty_pop_pages; > } I am really not a big fan of the above implementation. All it wants to do is calculate the biggest free span and count unpopulated pages in the chunk. There's gotta be a more readable way to implement this. For example, would it be possible to implement a span iterator over a chunk which walks free spans of the chunk so that the above function can do something similar to the following? for_each_free_span(blah blah) { nr_pop_free += count populated whole pages in the span; update contig hint; } It's a span iteration problem. When abstracted properly, it shouldn't be too difficult to follow. > /** > + * pcpu_block_update_hint > * @chunk: chunk of interest > + * @index: block index of the metadata block > * > + * Full scan over the entire block to recalculate block-level metadata. > + */ > +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index) > +{ > + unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index); > + struct pcpu_bitmap_md *block = chunk->md_blocks + index; > + bool is_left_free = false, is_right_free = false; > + int contig; > + unsigned long start, end; > + > + block->contig_hint = 0; > + start = end = block->first_free; > + while (start < PCPU_BITMAP_BLOCK_SIZE) { > + /* > + * Scans the allocation map corresponding to this block > + * to find free fragments and update metadata accordingly. > + */ > + start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, > + start); > + if (start >= PCPU_BITMAP_BLOCK_SIZE) > + break; It's a lot simpler here but this too might look simpler with an appropriate iteration abstraction. 
> + /* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */ > + end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start); This is by no means a hard rule but it's often easier on the eyes to have a blank line when code and comment are packed like this. > + /* update left_free */ > + contig = end - start; > + if (start == 0) { > + block->left_free = contig; > + is_left_free = true; > + } > + /* update right_free */ > + if (end == PCPU_BITMAP_BLOCK_SIZE) { > + block->right_free = contig; > + is_right_free = true; > + } > + /* update block contig_hints */ > + if (block->contig_hint < contig) { > + block->contig_hint = contig; > + block->contig_hint_start = start; > + } > + start = end; > + } > + > + /* clear left/right free hints */ > + if (!is_left_free) > + block->left_free = 0; > + if (!is_right_free) > + block->right_free = 0; Hmm... why do we need is_left/right_free? Can't we reset them to zero at the top and update directly during the loop? > +static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, > + int bit_size) > { > + bool update_chunk = false; > + int i; > + int s_index, e_index, s_off, e_off; > + struct pcpu_bitmap_md *s_block, *e_block, *block; > > + /* calculate per block offsets */ > + s_index = pcpu_off_to_block_index(bit_off); > + e_index = pcpu_off_to_block_index(bit_off + bit_size); > + s_off = pcpu_off_to_block_off(bit_off); > + e_off = pcpu_off_to_block_off(bit_off + bit_size); > > + /* > + * If the offset is the beginning of the next block, set it to the > + * end of the previous block as the last bit is the exclusive. > + */ > + if (e_off == 0) { > + e_off = PCPU_BITMAP_BLOCK_SIZE; > + e_index--; > + } > > + s_block = chunk->md_blocks + s_index; > + e_block = chunk->md_blocks + e_index; > > + /* > + * Update s_block. > + * > + * block->first_free must be updated if the allocation takes its place. > + * If the allocation breaks the contig_hint, a scan is required to > + * restore this hint. 
> + */ > + if (s_off == s_block->first_free) > + s_block->first_free = find_next_zero_bit( > + pcpu_index_alloc_map(chunk, s_index), > + PCPU_BITMAP_BLOCK_SIZE, > + s_off + bit_size); > + > + if (s_off >= s_block->contig_hint_start && > + s_off < s_block->contig_hint_start + s_block->contig_hint) { > + pcpu_block_refresh_hint(chunk, s_index); > + } else { > + /* update left and right contig manually */ > + s_block->left_free = min(s_block->left_free, s_off); > + if (s_index == e_index) > + s_block->right_free = min_t(int, s_block->right_free, > + PCPU_BITMAP_BLOCK_SIZE - e_off); > + else > + s_block->right_free = 0; > + } > > + /* > + * Update e_block. > + * If they are different, then e_block's first_free is guaranteed to > + * be the extend of e_off. first_free must be updated and a scan > + * over e_block is issued. > + */ > + if (s_index != e_index) { > + e_block->first_free = find_next_zero_bit( > + pcpu_index_alloc_map(chunk, e_index), > + PCPU_BITMAP_BLOCK_SIZE, e_off); > > + pcpu_block_refresh_hint(chunk, e_index); > + } > > + /* update in-between md_blocks */ > + for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index; > + i++, block++) { > + block->contig_hint = 0; > + block->left_free = 0; > + block->right_free = 0; > + } > > /* > + * The only time a full chunk scan is required is if the global > + * contig_hint is broken. Otherwise, it means a smaller space > + * was used and therefore the global contig_hint is still correct. > */ > + if (bit_off >= chunk->contig_hint_start && > + bit_off < chunk->contig_hint_start + chunk->contig_hint) > + update_chunk = true; > > + return update_chunk; @update_chunk seems unnecessary. 
> +static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, > + int bit_size) > { > + bool update_chunk = false; > + int i; > + int s_index, e_index, s_off, e_off; > + int start, end, contig; > + struct pcpu_bitmap_md *s_block, *e_block, *block; > > + /* calculate per block offsets */ > + s_index = pcpu_off_to_block_index(bit_off); > + e_index = pcpu_off_to_block_index(bit_off + bit_size); > + s_off = pcpu_off_to_block_off(bit_off); > + e_off = pcpu_off_to_block_off(bit_off + bit_size); > + > + /* > + * If the offset is the beginning of the next block, set it to the > + * end of the previous block as the last bit is the exclusive. > + */ > + if (e_off == 0) { > + e_off = PCPU_BITMAP_BLOCK_SIZE; > + e_index--; > + } So, if you do the above with inclusive range, it becomes s_index = pcpu_off_to_block_index(start_bit); e_index = pcpu_off_to_block_index(end_bit - 1); s_off = pcpu_off_to_block_off(start_bit); e_off = pcpu_off_to_block_off(end_bit - 1) + 1; and you can just comment that you're using inclusive range so that the e_index always points to the last block in the range. Wouldn't that be easier? People do use inclusive ranges for these sorts of calculations. > + s_block = chunk->md_blocks + s_index; > + e_block = chunk->md_blocks + e_index; > + > + /* > + * Check if the freed area aligns with the block->contig_hint. > + * If it does, then the scan to find the beginning/end of the > + * larger free area can be avoided. > + * > + * start and end refer to beginning and end of the free region > + * within each their respective blocks. This is not necessarily > + * the entire free region as it may span blocks past the beginning > + * or end of the block. > + */ > + start = s_off; > + if (s_off == s_block->contig_hint + s_block->contig_hint_start) { > + start = s_block->contig_hint_start; > + } else { > + int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index), > + start); > + start = (start == l_bit) ? 
0 : l_bit + 1; > + } > + > + end = e_off; > + if (e_off == e_block->contig_hint_start) > + end = e_block->contig_hint_start + e_block->contig_hint; > + else > + end = find_next_bit(pcpu_index_alloc_map(chunk, e_index), > + PCPU_BITMAP_BLOCK_SIZE, end); > > + /* freeing in the same block */ > + if (s_index == e_index) { > + contig = end - start; > > + if (start == 0) > + s_block->left_free = contig; > > + if (end == PCPU_BITMAP_BLOCK_SIZE) > + s_block->right_free = contig; > + > + s_block->first_free = min(s_block->first_free, start); > + if (contig > s_block->contig_hint) { > + s_block->contig_hint = contig; > + s_block->contig_hint_start = start; > + } > + > + } else { > /* > + * Freeing across md_blocks. > + * > + * If the start is at the beginning of the block, just > + * reset the block instead. > */ > + if (start == 0) { The above comment can be moved here and lose the if in the sentence. ie. "As the start is ..., just .." > + s_index--; > + } else { > + /* > + * Knowing that the free is across blocks, this means > + * the hint can be updated on the right side and the > + * left side does not need to be touched. > + */ > + s_block->first_free = min(s_block->first_free, start); > + contig = PCPU_BITMAP_BLOCK_SIZE - start; > + s_block->right_free = contig; > + if (contig > s_block->contig_hint) { > + s_block->contig_hint = contig; > + s_block->contig_hint_start = start; > + } > + } Blank line, please. > + /* > + * If end is the entire e_block, just reset the block > + * as well. > + */ > + if (end == PCPU_BITMAP_BLOCK_SIZE) { ditto > + e_index++; > + } else { > + /* > + * The hint must only be on the left side, so > + * update accordingly. 
> + */ > + e_block->first_free = 0; > + e_block->left_free = end; > + if (end > e_block->contig_hint) { > + e_block->contig_hint = end; > + e_block->contig_hint_start = 0; > + } > + } > + > + /* reset md_blocks in the middle */ > + for (i = s_index + 1, block = chunk->md_blocks + i; > + i < e_index; i++, block++) { How about something like the following? It's kinda weird to have an extra loop var which isn't really used for anything. The same goes for other places too. for (block = chunk->md_blocks + s_index + 1; block < chunk->md_blocks + e_index; block++) > + block->first_free = 0; > + block->contig_hint_start = 0; > + block->contig_hint = PCPU_BITMAP_BLOCK_SIZE; > + block->left_free = PCPU_BITMAP_BLOCK_SIZE; > + block->right_free = PCPU_BITMAP_BLOCK_SIZE; > + } > } > + > + /* > + * The hint is only checked in the s_block and e_block when > + * freeing and particularly only when it is self contained within > + * its own block. A scan is required if the free space spans > + * blocks or makes a block whole as the scan will take into > + * account free space across blocks. > + */ > + if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) || > + s_index != e_index) { > + update_chunk = true; > + } else if (s_block->contig_hint > chunk->contig_hint) { > + /* check if block contig_hint is bigger */ > + chunk->contig_hint = s_block->contig_hint; > + chunk->contig_hint_start = > + pcpu_block_off_to_off(s_index, > + s_block->contig_hint_start); > + } > + > + return update_chunk; Ditto with @update_chunk. 
> +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size, > + size_t align, bool pop_only) > { > + int i, cur_free; > + int s_index, block_off, next_index, end_off; /* interior alloc index */ > + struct pcpu_bitmap_md *block; > + unsigned long *alloc_map; > > + lockdep_assert_held(&pcpu_lock); > > + cur_free = block_off = 0; > + s_index = chunk->first_free_block; > + for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk); > + i++) { > + alloc_map = pcpu_index_alloc_map(chunk, i); > + block = chunk->md_blocks + i; > + > + /* continue from prev block */ > + cur_free += block->left_free; > + if (cur_free >= bit_size) { > + end_off = bit_size; > + goto check_populated; > + } else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) { > continue; > } > > /* > + * Can this block hold this alloc? > + * > + * Here the block->contig_hint is used to guarantee a fit, > + * but the block->first_free is returned as we may be able > + * to serve the allocation earlier. The population check > + * must take into account the area beginning at first_free > + * through the end of the contig_hint. 
> */ > + cur_free = 0; > + s_index = i; > + block_off = ALIGN(block->contig_hint_start, align); > + block_off -= block->contig_hint_start; > + if (block->contig_hint >= block_off + bit_size) { > + block_off = block->first_free; > + end_off = block->contig_hint_start - block_off + > + bit_size; > + goto check_populated; > } > > + /* check right */ > + block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free, > + align); > + /* reset to start looking in the next block */ > + if (block_off >= PCPU_BITMAP_BLOCK_SIZE) { > + s_index++; > + cur_free = block_off = 0; > + continue; > } > + cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off; > + if (cur_free >= bit_size) { > + end_off = bit_size; > +check_populated: > + if (!pop_only || > + pcpu_is_populated(chunk, s_index, block_off, > + end_off, &next_index)) > + break; > > + i = next_index - 1; > + s_index = next_index; > + cur_free = block_off = 0; > } > + } > > + /* nothing found */ > + if (i == pcpu_nr_pages_to_blocks(chunk)) > + return -1; > > + return s_index * PCPU_BITMAP_BLOCK_SIZE + block_off; > +} Wouldn't this function be a lot simpler too if there were free span iterator? > +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size, > + size_t align, int start) > +{ > + size_t align_mask = (align) ? 
(align - 1) : 0; > + int i, bit_off, oslot; > + struct pcpu_bitmap_md *block; > + > + lockdep_assert_held(&pcpu_lock); > + > + oslot = pcpu_chunk_slot(chunk); > + > + /* search to find fit */ > + bit_off = bitmap_find_next_zero_area(chunk->alloc_map, > + pcpu_nr_pages_to_bits(chunk), > + start, bit_size, align_mask); > + > + if (bit_off >= pcpu_nr_pages_to_bits(chunk)) > + return -1; > + > + /* update alloc map */ > + bitmap_set(chunk->alloc_map, bit_off, bit_size); blank line > + /* update boundary map */ > + set_bit(bit_off, chunk->bound_map); > + bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1); > + set_bit(bit_off + bit_size, chunk->bound_map); > + > + chunk->free_bits -= bit_size; > + > + if (pcpu_block_update_hint_alloc(chunk, bit_off, bit_size)) > + pcpu_chunk_update_hint(chunk); > + > + /* update chunk first_free */ > + for (i = chunk->first_free_block, block = chunk->md_blocks + i; > + i < pcpu_nr_pages_to_blocks(chunk); i++, block++) > + if (block->contig_hint != 0) > + break; > + > + chunk->first_free_block = i; > > pcpu_chunk_relocate(chunk, oslot); > > + return bit_off * PCPU_MIN_ALLOC_SIZE; > } > > /** > + * pcpu_free_area - frees the corresponding offset > * @chunk: chunk of interest > + * @off: addr offset into chunk > * > + * This function determines the size of an allocation to free using > + * the boundary bitmap and clears the allocation map. A block metadata > + * update is triggered and potentially a chunk update occurs. 
> */ > +static void pcpu_free_area(struct pcpu_chunk *chunk, int off) > { > + int bit_off, bit_size, index, end, oslot; > + struct pcpu_bitmap_md *block; > > lockdep_assert_held(&pcpu_lock); > pcpu_stats_area_dealloc(chunk); > > + oslot = pcpu_chunk_slot(chunk); > > + bit_off = off / PCPU_MIN_ALLOC_SIZE; > > + /* find end index */ > + end = find_next_bit(chunk->bound_map, pcpu_nr_pages_to_bits(chunk), > + bit_off + 1); > + bit_size = end - bit_off; > > + bitmap_clear(chunk->alloc_map, bit_off, bit_size); > > + chunk->free_bits += bit_size; > + > + /* update first_free */ > + index = pcpu_off_to_block_index(bit_off); > + block = chunk->md_blocks + index; > + block->first_free = min_t(int, block->first_free, > + bit_off % PCPU_BITMAP_BLOCK_SIZE); > + > + chunk->first_free_block = min(chunk->first_free_block, index); > + > + if (pcpu_block_update_hint_free(chunk, bit_off, bit_size)) > + pcpu_chunk_update_hint(chunk); Do we ever not update chunk hint when block hint indicates that it's necessary? If not, maybe just call it from the previous function? > static void pcpu_free_chunk(struct pcpu_chunk *chunk) > { > if (!chunk) > return; > + pcpu_mem_free(chunk->md_blocks); > + pcpu_mem_free(chunk->bound_map); > + pcpu_mem_free(chunk->alloc_map); > pcpu_mem_free(chunk); > } > > @@ -787,6 +1179,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk, > > bitmap_set(chunk->populated, page_start, nr); > chunk->nr_populated += nr; > + chunk->nr_empty_pop_pages += nr; > pcpu_nr_empty_pop_pages += nr; > } > > @@ -809,6 +1202,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk, > > bitmap_clear(chunk->populated, page_start, nr); > chunk->nr_populated -= nr; > + chunk->nr_empty_pop_pages -= nr; > pcpu_nr_empty_pop_pages -= nr; > } Didn't we add this field in an earlier patch? Do the above changes belong in this patch? 
> @@ -890,19 +1284,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, > struct pcpu_chunk *chunk; > const char *err; > bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; > + int slot, off, cpu, ret; > unsigned long flags; > void __percpu *ptr; > + size_t bit_size, bit_align; > > /* > + * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE. > + * Therefore alignment must be a minimum of that many bytes as well > + * as the allocation will have internal fragmentation from > + * rounding up by up to PCPU_MIN_ALLOC_SIZE - 1 bytes. > */ > + if (unlikely(align < PCPU_MIN_ALLOC_SIZE)) > + align = PCPU_MIN_ALLOC_SIZE; > + size = ALIGN(size, PCPU_MIN_ALLOC_SIZE); > + bit_size = size >> PCPU_MIN_ALLOC_SHIFT; > + bit_align = align >> PCPU_MIN_ALLOC_SHIFT; Shouldn't the above have happened earlier when MIN_ALLOC_SIZE was introduced? > @@ -1363,15 +1710,15 @@ bool is_kernel_percpu_address(unsigned long addr) > * address. The caller is responsible for ensuring @addr stays valid > * until this function finishes. > * > - * percpu allocator has special setup for the first chunk, which currently > + * Percpu allocator has special setup for the first chunk, which currently > * supports either embedding in linear address space or vmalloc mapping, > * and, from the second one, the backing allocator (currently either vm or > * km) provides translation. > * > * The addr can be translated simply without checking if it falls into the > - * first chunk. But the current code reflects better how percpu allocator > + * first chunk. But the current code reflects better how percpu allocator > * actually works, and the verification can discover both bugs in percpu > - * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current > + * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current Let's please move out what can be to other patches. This patch is big enough as it is. 
> @@ -1417,9 +1764,10 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr) > else > return page_to_phys(vmalloc_to_page(addr)) + > offset_in_page(addr); > - } else > + } else { > return page_to_phys(pcpu_addr_to_page(addr)) + > offset_in_page(addr); > + } Ditto. > @@ -1555,10 +1903,12 @@ static void pcpu_dump_alloc_info(const char *lvl, > * static areas on architectures where the addressing model has > * limited offset range for symbol relocations to guarantee module > * percpu symbols fall inside the relocatable range. > + * @ai->static_size + @ai->reserved_size is expected to be page aligned. > * > * @ai->dyn_size determines the number of bytes available for dynamic > - * allocation in the first chunk. The area between @ai->static_size + > - * @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused. > + * allocation in the first chunk. Both the start and the end are expected > + * to be page aligned. The area between @ai->static_size + @ai->reserved_size > + * + @ai->dyn_size and @ai->unit_size is unused. ^^^ contam > * > * @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE > * and equal to or larger than @ai->static_size + @ai->reserved_size + > @@ -1581,11 +1931,11 @@ static void pcpu_dump_alloc_info(const char *lvl, > * copied static data to each unit. > * > * If the first chunk ends up with both reserved and dynamic areas, it > - * is served by two chunks - one to serve the core static and reserved > - * areas and the other for the dynamic area. They share the same vm > - * and page map but uses different area allocation map to stay away > - * from each other. The latter chunk is circulated in the chunk slots > - * and available for dynamic allocation like any other chunks. > + * is served by two chunks - one to serve the reserved area and the other > + * for the dynamic area. They share the same vm and page map but use > + * different area allocation map to stay away from each other. 
The latter > + * chunk is circulated in the chunk slots and available for dynamic allocation > + * like any other chunks. ditto > @@ -1703,7 +2051,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > * Allocate chunk slots. The additional last slot is for > * empty chunks. > */ > - pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2; > + pcpu_nr_slots = __pcpu_size_to_slot( > + pcpu_pages_to_bits(pcpu_unit_pages)) + 2; I get that we wanna be using bits inside the area allocator proper but can we keep things outside in bytes? These things don't really have anything to do with what granularity the area allocator is operating at. > @@ -1727,69 +2076,50 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > tmp_addr = (unsigned long)base_addr + ai->static_size; > aligned_addr = tmp_addr & PAGE_MASK; > pcpu_reserved_offset = tmp_addr - aligned_addr; > + begin_fill_bits = pcpu_reserved_offset / PCPU_MIN_ALLOC_SIZE; > > map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + > pcpu_reserved_offset; > + > chunk_pages = map_size_bytes >> PAGE_SHIFT; > > /* chunk adjacent to static region allocation */ > + chunk = pcpu_alloc_first_chunk(chunk_pages); > chunk->base_addr = (void *)aligned_addr; > chunk->immutable = true; > > + /* set metadata */ > + chunk->contig_hint = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits; > + chunk->free_bits = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits; > > + /* > + * If the beginning of the reserved region overlaps the end of the > + * static region, hide that portion in the metadata. 
> + */ > + if (begin_fill_bits) { > chunk->has_reserved = true; > + bitmap_fill(chunk->alloc_map, begin_fill_bits); > + set_bit(0, chunk->bound_map); > + set_bit(begin_fill_bits, chunk->bound_map); > + > + if (pcpu_block_update_hint_alloc(chunk, 0, begin_fill_bits)) > + pcpu_chunk_update_hint(chunk); > } > > + /* init dynamic chunk if necessary */ > + if (ai->reserved_size) { > + pcpu_reserved_chunk = chunk; > + > chunk_pages = dyn_size >> PAGE_SHIFT; > > /* chunk allocation */ > + chunk = pcpu_alloc_first_chunk(chunk_pages); > chunk->base_addr = base_addr + ai->static_size + > ai->reserved_size; > + > + /* set metadata */ > + chunk->contig_hint = pcpu_nr_pages_to_bits(chunk); > + chunk->free_bits = pcpu_nr_pages_to_bits(chunk); > } > > /* link the first chunk in */ I *think* that quite a bit of the above can be moved into a separate patch. Thanks a lot! -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator 2017-07-17 23:27 ` Tejun Heo @ 2017-07-24 21:37 ` Dennis Zhou 0 siblings, 0 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-24 21:37 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Mon, Jul 17, 2017 at 07:27:56PM -0400, Tejun Heo wrote: > I think it'd be nice to have a similar table for allocation patterns > which aren't ideal for the original allocator. The biggest goal is > avoiding cases where the allocator collapses and just glancing at the > table doesn't seem very compelling. I've added new data to the cover letter and in the commit log to demonstrate poor allocation performance. > > /* > > + * This determines the size of each metadata block. There are several subtle > > + * constraints around this variable. The reserved_region and dynamic_region > ^ > constant > Fixed. > > + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE. This is > > + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks > > + * that are backing multiple pages, this needs to be accounted for. > > + */ > > +#define PCPU_BITMAP_BLOCK_SIZE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT) > > Given that percpu allocator can align only upto a page, the > restriction makes sense to me. I'm kinda curious whether PAGE_SIZE > blocks is optimal tho. Why did you pick PAGE_SIZE? > I've refactored v2 to be able to handle block sizes with certain constraints. The PAGE_SIZE blocks really just were a balance between amount required to scan and the number of metadata blocks. I tested 2KB blocks and they seemed to perform marginally worse. > > +/* > ^^ > /** > > Ditto for other comments. > Fixed. 
> > + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks > > + * @chunk: chunk of interest > > + * > > + * This conversion is from the number of physical pages that the chunk > > + * serves to the number of bitmap blocks required. It converts to bytes > > + * served to bits required and then blocks used. > > + */ > > +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk) > > Maybe just pcpu_chunk_nr_blocks()? > Renamed. > > +{ > > + return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE / > > + PCPU_BITMAP_BLOCK_SIZE; > > +} > > + > > +/* > > + * pcpu_pages_to_bits - converts the pages to size of bitmap > > + * @pages: number of physical pages > > + * > > + * This conversion is from physical pages to the number of bits > > + * required in the bitmap. > > + */ > > +static inline int pcpu_pages_to_bits(int pages) > > pcpu_nr_pages_to_map_bits()? > Renamed. > > +{ > > + return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE; > > +} > > + > > +/* > > + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap > > + * @chunk: chunk of interest > > + * > > + * This conversion is from the number of physical pages that the chunk > > + * serves to the number of bits in the bitmap. > > + */ > > +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk) > > pcpu_chunk_map_bits()? > Renamed. > > @@ -86,10 +90,13 @@ > > > > #include "percpu-internal.h" > > > > -#define PCPU_SLOT_BASE_SHIFT 5 /* 1-31 shares the same slot */ > > -#define PCPU_DFL_MAP_ALLOC 16 /* start a map with 16 ents */ > > -#define PCPU_ATOMIC_MAP_MARGIN_LOW 32 > > -#define PCPU_ATOMIC_MAP_MARGIN_HIGH 64 > > +/* > > + * The metadata is managed in terms of bits with each bit mapping to > > + * a fragment of size PCPU_MIN_ALLOC_SIZE. Thus, the slots are calculated > > + * with respect to the number of bits available. > > + */ > > +#define PCPU_SLOT_BASE_SHIFT 3 > > Ah, so this is actually the same as before 3 + 2, order 5. 
Can you > please note the explicit number in the comment? > With the refactor in v2, the slots are back to being managed in bytes. > > #define PCPU_EMPTY_POP_PAGES_LOW 2 > > #define PCPU_EMPTY_POP_PAGES_HIGH 4 > > and these numbers too. I can't tell how these numbers would map. > Also, any chance we can have these numbers in a more intuitive unit? > Those are from the original implementation to decide when to schedule work to repopulate the empty free page pool. > > @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr) > > pcpu_reserved_chunk->nr_pages * PAGE_SIZE; > > } > > > > -static int __pcpu_size_to_slot(int size) > > +static int __pcpu_size_to_slot(int bit_size) > > Wouldn't sth like @map_bits more intuitive than @bit_size? We can > just use @bits too. > Back to bytes. > > { > > - int highbit = fls(size); /* size is in bytes */ > > + int highbit = fls(bit_size); /* size is in bits */ > > return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1); > > } > > > > -static int pcpu_size_to_slot(int size) > > +static int pcpu_size_to_slot(int bit_size) > > Ditto. > Back to bytes. > > +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk) > > It's a span iteration problem. When abstracted properly, it shouldn't > be too difficult to follow. > I agree, I've refactored to use an iterator. > > +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index) > It's a lot simpler here but this too might look simpler with an > appropriate interation abstraction. Generalized the populated bitmap iterators so it can be used here. > > Hmm... why do we need is_left/right_free? Can't we reset them to zero > at the top and update directly during the loop? > Yes. Done. > > @update_chunk seems unnecessary. > I've moved the chunk refresh call to be here. 
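[For reference, the pages-to-bits and pages-to-blocks conversions renamed in this exchange reduce to the arithmetic below — a userspace sketch assuming 4K pages and the 4-byte minimum allocation; the helper names follow the renames suggested in this reply and take a page count rather than a chunk:]

```c
#include <assert.h>

/* Assumed values: 4K pages, 4-byte minimum allocation. */
#define PAGE_SIZE              4096
#define PCPU_MIN_ALLOC_SIZE    4
#define PCPU_BITMAP_BLOCK_SIZE (PAGE_SIZE / PCPU_MIN_ALLOC_SIZE)

/* Physical pages served -> bits in the allocation bitmap. */
static int pcpu_nr_pages_to_map_bits(int pages)
{
	return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
}

/* Physical pages served -> number of metadata blocks backing them. */
static int pcpu_chunk_nr_blocks(int pages)
{
	return pcpu_nr_pages_to_map_bits(pages) / PCPU_BITMAP_BLOCK_SIZE;
}
```

[With page-sized blocks, each populated page maps to exactly one metadata block.]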
> > So, if you do the above with inclusive range, it becomes > > s_index = pcpu_off_to_block_index(start_bit); > e_index = pcpu_off_to_block_index(end_bit - 1); > s_off = pcpu_off_to_block_off(start_bit); > e_off = pcpu_off_to_block_off(end_bit - 1) + 1; > > and you can just comment that you're using inclusive range so that the > e_index always points to the last block in the range. Wouldn't that > be easier? People do use inclusive ranges for these sorts of > calculations. Ah I see, thanks. Done. > > How about something like the following? It's kinda weird to have an > extra loop var which isn't really used for anything. The same goes > for other places too. > > for (block = chunk->md_blocks + s_index + 1; > block < chunk->md_blocks + e_index; block++) > I've rewritten most for loops to do this. > > + block->first_free = 0; > > +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size, > > + size_t align, bool pop_only) > > Wouldn't this function be a lot simpler too if there were free span > iterator? > Yes added an iterator for this. > > + > > + /* update alloc map */ > > + bitmap_set(chunk->alloc_map, bit_off, bit_size); > > blank line > Added. > > Do we ever not update chunk hint when block hint indicates that it's > necessary? If not, maybe just call it from the previous function? > No. I've added the call to the previous function. > > @@ -787,6 +1179,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk, > > > > bitmap_set(chunk->populated, page_start, nr); > > chunk->nr_populated += nr; > > + chunk->nr_empty_pop_pages += nr; > > pcpu_nr_empty_pop_pages += nr; > > } > > > > @@ -809,6 +1202,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk, > > > > bitmap_clear(chunk->populated, page_start, nr); > > chunk->nr_populated -= nr; > > + chunk->nr_empty_pop_pages -= nr; > > pcpu_nr_empty_pop_pages -= nr; > > } > > Didn't we add this field in an earlier patch? Do the above changes > belong in this patch? 
> Yes, moved to previous patch. > > + size_t bit_size, bit_align; > > > > /* > > + * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE. > > + * Therefore alignment must be a minimum of that many bytes as well > > + * as the allocation will have internal fragmentation from > > + * rounding up by up to PCPU_MIN_ALLOC_SIZE - 1 bytes. > > */ > > + if (unlikely(align < PCPU_MIN_ALLOC_SIZE)) > > + align = PCPU_MIN_ALLOC_SIZE; > > + size = ALIGN(size, PCPU_MIN_ALLOC_SIZE); > > + bit_size = size >> PCPU_MIN_ALLOC_SHIFT; > > + bit_align = align >> PCPU_MIN_ALLOC_SHIFT; > > Shouldn't the above have happened earlier when MIN_ALLOC_SIZE was > introduced? > Yes, moved to previous patch. > > @@ -1363,15 +1710,15 @@ bool is_kernel_percpu_address(unsigned long addr) > > * address. The caller is responsible for ensuring @addr stays valid > > * until this function finishes. > > * > > - * percpu allocator has special setup for the first chunk, which currently > > + * Percpu allocator has special setup for the first chunk, which currently > > * supports either embedding in linear address space or vmalloc mapping, > > * and, from the second one, the backing allocator (currently either vm or > > * km) provides translation. > > * > > * The addr can be translated simply without checking if it falls into the > > - * first chunk. But the current code reflects better how percpu allocator > > + * first chunk. But the current code reflects better how percpu allocator > > * actually works, and the verification can discover both bugs in percpu > > - * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current > > + * allocator itself and per_cpu_ptr_to_phys() callers. So we keep current > > Let's please move out what can be to other patches. This patch is big > enough as it is. > Removed. 
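[The inclusive-range index math suggested earlier in this reply works out as below — a userspace sketch with the block size assumed to be 1024 bits (one 4K page of 4-byte units); struct block_span and pcpu_span() are illustrative names, not kernel API:]

```c
#include <assert.h>

#define PCPU_BITMAP_BLOCK_SIZE 1024	/* assumed: one 4K page of 4-byte units */

/* Offset in bits -> metadata block index / offset within that block. */
static int pcpu_off_to_block_index(int off)
{
	return off / PCPU_BITMAP_BLOCK_SIZE;
}

static int pcpu_off_to_block_off(int off)
{
	return off % PCPU_BITMAP_BLOCK_SIZE;
}

struct block_span { int s_index, e_index, s_off, e_off; };

/*
 * Inclusive-range bookkeeping: by indexing with the last bit of the
 * area [bit_off, bit_off + bits), e_index always names the last block
 * the area touches, even when the area ends exactly on a block
 * boundary.
 */
static struct block_span pcpu_span(int bit_off, int bits)
{
	struct block_span s;

	s.s_index = pcpu_off_to_block_index(bit_off);
	s.e_index = pcpu_off_to_block_index(bit_off + bits - 1);
	s.s_off = pcpu_off_to_block_off(bit_off);
	s.e_off = pcpu_off_to_block_off(bit_off + bits - 1) + 1;
	return s;
}
```

[Note the boundary case: a full-block area starting at bit 0 yields e_index 0 and e_off 1024, instead of spilling into block 1 as an exclusive end bit would.]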
> > @@ -1417,9 +1764,10 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr) > > else > > return page_to_phys(vmalloc_to_page(addr)) + > > offset_in_page(addr); > > - } else > > + } else { > > return page_to_phys(pcpu_addr_to_page(addr)) + > > offset_in_page(addr); > > + } > > Ditto. > Removed. > > @@ -1555,10 +1903,12 @@ static void pcpu_dump_alloc_info(const char *lvl, > > * static areas on architectures where the addressing model has > > * limited offset range for symbol relocations to guarantee module > > * percpu symbols fall inside the relocatable range. > > + * @ai->static_size + @ai->reserved_size is expected to be page aligned. > > * > > * @ai->dyn_size determines the number of bytes available for dynamic > > - * allocation in the first chunk. The area between @ai->static_size + > > - * @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused. > > + * allocation in the first chunk. Both the start and the end are expected > > + * to be page aligned. The area between @ai->static_size + @ai->reserved_size > > + * + @ai->dyn_size and @ai->unit_size is unused. > ^^^ > contam > This is for the math continuing from the previous line. > > * > > * @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE > > * and equal to or larger than @ai->static_size + @ai->reserved_size + > > @@ -1581,11 +1931,11 @@ static void pcpu_dump_alloc_info(const char *lvl, > > * copied static data to each unit. > > * > > * If the first chunk ends up with both reserved and dynamic areas, it > > - * is served by two chunks - one to serve the core static and reserved > > - * areas and the other for the dynamic area. They share the same vm > > - * and page map but uses different area allocation map to stay away > > - * from each other. The latter chunk is circulated in the chunk slots > > - * and available for dynamic allocation like any other chunks. > > + * is served by two chunks - one to serve the reserved area and the other > > + * for the dynamic area. 
They share the same vm and page map but use > > + * different area allocation map to stay away from each other. The latter > > + * chunk is circulated in the chunk slots and available for dynamic allocation > > + * like any other chunks. > > ditto > Split into another patch. > > @@ -1703,7 +2051,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > > * Allocate chunk slots. The additional last slot is for > > * empty chunks. > > */ > > - pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2; > > + pcpu_nr_slots = __pcpu_size_to_slot( > > + pcpu_pages_to_bits(pcpu_unit_pages)) + 2; > > I get that we wanna be using bits inside the area allocator proper but > can we keep things outside in bytes? These things don't really have > anything to do with what granularity the area allocator is operating > at. > The refactor keeps this in bytes. > > @@ -1727,69 +2076,50 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > > tmp_addr = (unsigned long)base_addr + ai->static_size; > > aligned_addr = tmp_addr & PAGE_MASK; > > pcpu_reserved_offset = tmp_addr - aligned_addr; > > + begin_fill_bits = pcpu_reserved_offset / PCPU_MIN_ALLOC_SIZE; > > > > map_size_bytes = (ai->reserved_size ?: ai->dyn_size) + > > pcpu_reserved_offset; > > + > > chunk_pages = map_size_bytes >> PAGE_SHIFT; > > > > /* chunk adjacent to static region allocation */ > > + chunk = pcpu_alloc_first_chunk(chunk_pages); > > chunk->base_addr = (void *)aligned_addr; > > chunk->immutable = true; > > > > + /* set metadata */ > > + chunk->contig_hint = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits; > > + chunk->free_bits = pcpu_nr_pages_to_bits(chunk) - begin_fill_bits; > > > > + /* > > + * If the beginning of the reserved region overlaps the end of the > > + * static region, hide that portion in the metadata. 
> > + */ > > + if (begin_fill_bits) { > > chunk->has_reserved = true; > > + bitmap_fill(chunk->alloc_map, begin_fill_bits); > > + set_bit(0, chunk->bound_map); > > + set_bit(begin_fill_bits, chunk->bound_map); > > + > > + if (pcpu_block_update_hint_alloc(chunk, 0, begin_fill_bits)) > > + pcpu_chunk_update_hint(chunk); > > } > > > > + /* init dynamic chunk if necessary */ > > + if (ai->reserved_size) { > > + pcpu_reserved_chunk = chunk; > > + > > chunk_pages = dyn_size >> PAGE_SHIFT; > > > > /* chunk allocation */ > > + chunk = pcpu_alloc_first_chunk(chunk_pages); > > chunk->base_addr = base_addr + ai->static_size + > > ai->reserved_size; > > + > > + /* set metadata */ > > + chunk->contig_hint = pcpu_nr_pages_to_bits(chunk); > > + chunk->free_bits = pcpu_nr_pages_to_bits(chunk); > > } > > > > /* link the first chunk in */ > > I *think* that quite a bit of the above can be moved into a separate > patch. > The last one was quite a bit of work. It is the first handful of patches in v2. The first chunk creation logic has been consolidated and the reserved region is no longer expanded. The reserved region needs to be a multiple of PCPU_MIN_ALLOC_SIZE and the static region is aligned up while the dynamic region is shrunk by the aligned up amount. This is fine as the dynamic region is expanded to be page aligned. If there was a need to align up the static region, that means the dynamic region initially expanded to use that space. Thanks, Dennis
* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator 2017-07-16 2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou 2017-07-17 23:27 ` Tejun Heo @ 2017-07-19 19:11 ` Josef Bacik 2017-07-19 22:19 ` Dennis Zhou 2017-07-19 19:16 ` Josef Bacik 2 siblings, 1 reply; 37+ messages in thread From: Josef Bacik @ 2017-07-19 19:11 UTC (permalink / raw) To: Dennis Zhou Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> > > The percpu memory allocator is experiencing scalability issues when > allocating and freeing large numbers of counters as in BPF. > Additionally, there is a corner case where iteration is triggered over > all chunks if the contig_hint is the right size, but wrong alignment. > > Implementation: > This patch removes the area map allocator in favor of a bitmap allocator > backed by metadata blocks. The primary goal is to provide consistency > in performance and memory footprint with a focus on small allocations > (< 64 bytes). The bitmap removes the heavy memmove from the freeing > critical path and provides a consistent memory footprint. The metadata > blocks provide a bound on the amount of scanning required by maintaining > a set of hints. > > The chunks previously were managed by free_size, a value maintained in > bytes. Now, the chunks are managed in terms of bits, which is just a > scaled value of free_size down by PCPU_MIN_ALLOC_SIZE. > > There is one caveat with this implementation. In an effort to make > freeing fast, the only time metadata is updated on the free path is if a > whole block becomes free or the freed area spans across metadata blocks. > This causes the chunk's contig_hint to be potentially smaller than what > it could allocate by up to a block.
If the chunk's contig_hint is > smaller than a block, a check occurs and the hint is kept accurate. > Metadata is always kept accurate on allocation and therefore the > situation where a chunk has a larger contig_hint than available will > never occur. > > Evaluation: > I have primarily done testing against a simple workload of allocation of > 1 million objects of varying size. Deallocation was done in order, > alternating, and in reverse. These numbers were collected after rebasing > on top of a80099a152. I present the worst-case numbers here: > > Area Map Allocator: > > Object Size | Alloc Time (ms) | Free Time (ms) > ---------------------------------------------- > 4B | 335 | 4960 > 16B | 485 | 1150 > 64B | 445 | 280 > 128B | 505 | 177 > 1024B | 3385 | 140 > > Bitmap Allocator: > > Object Size | Alloc Time (ms) | Free Time (ms) > ---------------------------------------------- > 4B | 725 | 70 > 16B | 760 | 70 > 64B | 855 | 80 > 128B | 910 | 90 > 1024B | 3770 | 260 > > This data demonstrates the inability of the area map allocator to > handle less than ideal situations. In the best case of reverse > deallocation, the area map allocator was able to perform within range > of the bitmap allocator. In the worst case situation, freeing took > nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator > dramatically improves the consistency of the free path. The small > allocations performed nearly identically regardless of the freeing > pattern. > > While it does add to the allocation latency, the allocation scenario > here is optimal for the area map allocator. The second problem of > additional scanning can result in the area map allocator completing in > 52 minutes. The same workload takes only 14 seconds to complete for the > bitmap allocator. This was produced under a more contrived scenario of > allocating 1 million 4-byte objects with 8-byte alignment. > This was a bear to review, I feel like it could be split into a few smaller pieces.
You are changing hinting, allocating/freeing, and how you find chunks. Those seem like good logical divisions to me. Overall the design seems sound and I didn't spot any major problems. Once you've split them up I'll do another thorough comb through and then add my reviewed by. Thanks, Josef
* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator 2017-07-19 19:11 ` Josef Bacik @ 2017-07-19 22:19 ` Dennis Zhou 0 siblings, 0 replies; 37+ messages in thread From: Dennis Zhou @ 2017-07-19 22:19 UTC (permalink / raw) To: Josef Bacik Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou Hi Josef, On Wed, Jul 19, 2017 at 07:11:06PM +0000, Josef Bacik wrote: > > This was a bear to review, I feel like it could be split into a few smaller > pieces. You are changing hinting, allocating/freeing, and how you find chunks. > Those seem like good logical divisions to me. Overall the design seems sound > and I didn't spot any major problems. Once you've split them up I'll do another > thorough comb through and then add my reviewed by. Thanks, Yeah.. Thanks for taking the time. I'm currently working on responding to Tejun's feedback and will do my best to split it down further. I've done a bit of refactoring which should help readability and hopefully make it easier in the next pass. Thanks, Dennis
* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator 2017-07-16 2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou 2017-07-17 23:27 ` Tejun Heo 2017-07-19 19:11 ` Josef Bacik @ 2017-07-19 19:16 ` Josef Bacik 2017-07-19 22:13 ` Dennis Zhou 2 siblings, 1 reply; 37+ messages in thread From: Josef Bacik @ 2017-07-19 19:16 UTC (permalink / raw) To: Dennis Zhou Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com> > > The percpu memory allocator is experiencing scalability issues when > allocating and freeing large numbers of counters as in BPF. > Additionally, there is a corner case where iteration is triggered over > all chunks if the contig_hint is the right size, but wrong alignment. > > Implementation: > This patch removes the area map allocator in favor of a bitmap allocator > backed by metadata blocks. The primary goal is to provide consistency > in performance and memory footprint with a focus on small allocations > (< 64 bytes). The bitmap removes the heavy memmove from the freeing > critical path and provides a consistent memory footprint. The metadata > blocks provide a bound on the amount of scanning required by maintaining > a set of hints. > > The chunks previously were managed by free_size, a value maintained in > bytes. Now, the chunks are managed in terms of bits, which is just a > scaled value of free_size down by PCPU_MIN_ALLOC_SIZE. > > There is one caveat with this implementation. In an effort to make > freeing fast, the only time metadata is updated on the free path is if a > whole block becomes free or the freed area spans across metadata blocks. > This causes the chunk's contig_hint to be potentially smaller than what > it could allocate by up to a block.
If the chunk's contig_hint is > smaller than a block, a check occurs and the hint is kept accurate. > Metadata is always kept accurate on allocation and therefore the > situation where a chunk has a larger contig_hint than available will > never occur. > > Evaluation: > I have primarily done testing against a simple workload of allocation of > 1 million objects of varying size. Deallocation was done in order, > alternating, and in reverse. These numbers were collected after rebasing > on top of a80099a152. I present the worst-case numbers here: > > Area Map Allocator: > > Object Size | Alloc Time (ms) | Free Time (ms) > ---------------------------------------------- > 4B | 335 | 4960 > 16B | 485 | 1150 > 64B | 445 | 280 > 128B | 505 | 177 > 1024B | 3385 | 140 > > Bitmap Allocator: > > Object Size | Alloc Time (ms) | Free Time (ms) > ---------------------------------------------- > 4B | 725 | 70 > 16B | 760 | 70 > 64B | 855 | 80 > 128B | 910 | 90 > 1024B | 3770 | 260 > > This data demonstrates the inability of the area map allocator to > handle less than ideal situations. In the best case of reverse > deallocation, the area map allocator was able to perform within range > of the bitmap allocator. In the worst case situation, freeing took > nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator > dramatically improves the consistency of the free path. The small > allocations performed nearly identically regardless of the freeing > pattern. > > While it does add to the allocation latency, the allocation scenario > here is optimal for the area map allocator. The second problem of > additional scanning can result in the area map allocator completing in > 52 minutes. The same workload takes only 14 seconds to complete for the > bitmap allocator. This was produced under a more contrived scenario of > allocating 1 million 4-byte objects with 8-byte alignment.
> > Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> > --- > include/linux/percpu.h | 10 +- > init/main.c | 1 - > mm/percpu-internal.h | 70 ++- > mm/percpu-stats.c | 99 ++-- > mm/percpu.c | 1280 ++++++++++++++++++++++++++++++------------------ > 5 files changed, 923 insertions(+), 537 deletions(-) > > diff --git a/include/linux/percpu.h b/include/linux/percpu.h > index a5cedcd..8f62b10 100644 > --- a/include/linux/percpu.h > +++ b/include/linux/percpu.h > @@ -26,6 +26,15 @@ > #define PCPU_MIN_ALLOC_SHIFT 2 > > /* > + * This determines the size of each metadata block. There are several subtle > + * constraints around this variable. The reserved_region and dynamic_region > + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE. This is > + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks > + * that are backing multiple pages, this needs to be accounted for. > + */ > +#define PCPU_BITMAP_BLOCK_SIZE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT) > + > +/* > * Percpu allocator can serve percpu allocations before slab is > * initialized which allows slab to depend on the percpu allocator. 
> * The following two parameters decide how much resource to > @@ -120,7 +129,6 @@ extern bool is_kernel_percpu_address(unsigned long addr); > #if !defined(CONFIG_SMP) || !defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) > extern void __init setup_per_cpu_areas(void); > #endif > -extern void __init percpu_init_late(void); > > extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp); > extern void __percpu *__alloc_percpu(size_t size, size_t align); > diff --git a/init/main.c b/init/main.c > index 052481f..c9a9fff 100644 > --- a/init/main.c > +++ b/init/main.c > @@ -500,7 +500,6 @@ static void __init mm_init(void) > page_ext_init_flatmem(); > mem_init(); > kmem_cache_init(); > - percpu_init_late(); > pgtable_init(); > vmalloc_init(); > ioremap_huge_init(); > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > index f0776f6..2dac644 100644 > --- a/mm/percpu-internal.h > +++ b/mm/percpu-internal.h > @@ -4,6 +4,21 @@ > #include <linux/types.h> > #include <linux/percpu.h> > > +/* > + * pcpu_bitmap_md is the metadata block struct. > + * All units are in terms of bits. 
> + */ > +struct pcpu_bitmap_md { > + int contig_hint; /* contig hint for block */ > + int contig_hint_start; /* block relative starting > + position of the contig hint */ > + int left_free; /* size of free space along > + the left side of the block */ > + int right_free; /* size of free space along > + the right side of the block */ > + int first_free; /* block position of first free */ > +}; > + > struct pcpu_chunk { > #ifdef CONFIG_PERCPU_STATS > int nr_alloc; /* # of allocations */ > @@ -11,17 +26,20 @@ struct pcpu_chunk { > #endif > > struct list_head list; /* linked to pcpu_slot lists */ > - int free_size; /* free bytes in the chunk */ > - int contig_hint; /* max contiguous size hint */ > + int free_bits; /* free bits in the chunk */ > + int contig_hint; /* max contiguous size hint > + in bits */ > + int contig_hint_start; /* contig_hint starting > + bit offset */ > void *base_addr; /* base address of this chunk */ > > - int map_used; /* # of map entries used before the sentry */ > - int map_alloc; /* # of map entries allocated */ > - int *map; /* allocation map */ > - struct list_head map_extend_list;/* on pcpu_map_extend_chunks */ > + unsigned long *alloc_map; /* allocation map */ > + unsigned long *bound_map; /* boundary map */ > + struct pcpu_bitmap_md *md_blocks; /* metadata blocks */ > > void *data; /* chunk data */ > - int first_free; /* no free below this */ > + int first_free_block; /* block that contains the first > + free bit */ > bool immutable; /* no [de]population allowed */ > bool has_reserved; /* indicates if the region this chunk > is responsible for overlaps with > @@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk; > extern struct pcpu_chunk *pcpu_reserved_chunk; > extern unsigned long pcpu_reserved_offset; > > +/* > + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks > + * @chunk: chunk of interest > + * > + * This conversion is from the number of physical pages that the chunk > + * serves to the number of bitmap 
blocks required. It converts to bytes > + * served to bits required and then blocks used. > + */ > +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk) > +{ > + return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE / > + PCPU_BITMAP_BLOCK_SIZE; > +} > + > +/* > + * pcpu_pages_to_bits - converts the pages to size of bitmap > + * @pages: number of physical pages > + * > + * This conversion is from physical pages to the number of bits > + * required in the bitmap. > + */ > +static inline int pcpu_pages_to_bits(int pages) > +{ > + return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE; > +} > + > +/* > + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap > + * @chunk: chunk of interest > + * > + * This conversion is from the number of physical pages that the chunk > + * serves to the number of bits in the bitmap. > + */ > +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk) > +{ > + return pcpu_pages_to_bits(chunk->nr_pages); > +} > + > #ifdef CONFIG_PERCPU_STATS > > #include <linux/spinlock.h> > diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c > index 6fc04b1..8dbef0c 100644 > --- a/mm/percpu-stats.c > +++ b/mm/percpu-stats.c > @@ -29,64 +29,85 @@ static int cmpint(const void *a, const void *b) > } > > /* > - * Iterates over all chunks to find the max # of map entries used. > + * Iterates over all chunks to find the max nr_alloc entries. > */ > -static int find_max_map_used(void) > +static int find_max_nr_alloc(void) > { > struct pcpu_chunk *chunk; > - int slot, max_map_used; > + int slot, max_nr_alloc; > > - max_map_used = 0; > + max_nr_alloc = 0; > for (slot = 0; slot < pcpu_nr_slots; slot++) > list_for_each_entry(chunk, &pcpu_slot[slot], list) > - max_map_used = max(max_map_used, chunk->map_used); > + max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc); > > - return max_map_used; > + return max_nr_alloc; > } > > /* > * Prints out chunk state. 
Fragmentation is considered between > * the beginning of the chunk to the last allocation. > + * > + * All statistics are in bytes unless stated otherwise. > */ > static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, > int *buffer) > { > - int i, s_index, last_alloc, alloc_sign, as_len; > + int i, last_alloc, as_len, start, end; > int *alloc_sizes, *p; > /* statistics */ > int sum_frag = 0, max_frag = 0; > int cur_min_alloc = 0, cur_med_alloc = 0, cur_max_alloc = 0; > > alloc_sizes = buffer; > - s_index = chunk->has_reserved ? 1 : 0; > - > - /* find last allocation */ > - last_alloc = -1; > - for (i = chunk->map_used - 1; i >= s_index; i--) { > - if (chunk->map[i] & 1) { > - last_alloc = i; > - break; > - } > - } > > - /* if the chunk is not empty - ignoring reserve */ > - if (last_alloc >= s_index) { > - as_len = last_alloc + 1 - s_index; > - > - /* > - * Iterate through chunk map computing size info. > - * The first bit is overloaded to be a used flag. > - * negative = free space, positive = allocated > - */ > - for (i = 0, p = chunk->map + s_index; i < as_len; i++, p++) { > - alloc_sign = (*p & 1) ? 1 : -1; > - alloc_sizes[i] = alloc_sign * > - ((p[1] & ~1) - (p[0] & ~1)); > + /* > + * find_last_bit returns the start value if nothing found. > + * Therefore, we must determine if it is a failure of find_last_bit > + * and set the appropriate value. > + */ > + last_alloc = find_last_bit(chunk->alloc_map, > + pcpu_nr_pages_to_bits(chunk) - 1); > + last_alloc = test_bit(last_alloc, chunk->alloc_map) ? > + last_alloc + 1 : 0; > + > + start = as_len = 0; > + if (chunk->has_reserved) > + start = pcpu_reserved_offset; > + > + /* > + * If a bit is set in the allocation map, the bound_map identifies > + * where the allocation ends. If the allocation is not set, the > + * bound_map does not identify free areas as it is only kept accurate > + * on allocation, not free. 
> + * > + * Positive values are allocations and negative values are free > + * fragments. > + */ > + while (start < last_alloc) { > + if (test_bit(start, chunk->alloc_map)) { > + end = find_next_bit(chunk->bound_map, last_alloc, > + start + 1); > + alloc_sizes[as_len] = 1; > + } else { > + end = find_next_bit(chunk->alloc_map, last_alloc, > + start + 1); > + alloc_sizes[as_len] = -1; > } > > - sort(alloc_sizes, as_len, sizeof(chunk->map[0]), cmpint, NULL); > + alloc_sizes[as_len++] *= (end - start) * PCPU_MIN_ALLOC_SIZE; > + > + start = end; > + } > + > + /* > + * The negative values are free fragments and thus sorting gives the > + * free fragments at the beginning in largest first order. > + */ > + if (as_len > 0) { > + sort(alloc_sizes, as_len, sizeof(int), cmpint, NULL); > > - /* Iterate through the unallocated fragements. */ > + /* iterate through the unallocated fragments */ > for (i = 0, p = alloc_sizes; *p < 0 && i < as_len; i++, p++) { > sum_frag -= *p; > max_frag = max(max_frag, -1 * (*p)); > @@ -100,8 +121,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, > P("nr_alloc", chunk->nr_alloc); > P("max_alloc_size", chunk->max_alloc_size); > P("empty_pop_pages", chunk->nr_empty_pop_pages); > - P("free_size", chunk->free_size); > - P("contig_hint", chunk->contig_hint); > + P("first_free_block", chunk->first_free_block); > + P("free_size", chunk->free_bits * PCPU_MIN_ALLOC_SIZE); > + P("contig_hint", chunk->contig_hint * PCPU_MIN_ALLOC_SIZE); > P("sum_frag", sum_frag); > P("max_frag", max_frag); > P("cur_min_alloc", cur_min_alloc); > @@ -113,22 +135,23 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, > static int percpu_stats_show(struct seq_file *m, void *v) > { > struct pcpu_chunk *chunk; > - int slot, max_map_used; > + int slot, max_nr_alloc; > int *buffer; > > alloc_buffer: > spin_lock_irq(&pcpu_lock); > - max_map_used = find_max_map_used(); > + max_nr_alloc = find_max_nr_alloc(); > 
spin_unlock_irq(&pcpu_lock); > > - buffer = vmalloc(max_map_used * sizeof(pcpu_first_chunk->map[0])); > + /* there can be at most this many free and allocated fragments */ > + buffer = vmalloc((2 * max_nr_alloc + 1) * sizeof(int)); > if (!buffer) > return -ENOMEM; > > spin_lock_irq(&pcpu_lock); > > /* if the buffer allocated earlier is too small */ > - if (max_map_used < find_max_map_used()) { > + if (max_nr_alloc < find_max_nr_alloc()) { > spin_unlock_irq(&pcpu_lock); > vfree(buffer); > goto alloc_buffer; > diff --git a/mm/percpu.c b/mm/percpu.c > index fb01841..569df63 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -4,6 +4,9 @@ > * Copyright (C) 2009 SUSE Linux Products GmbH > * Copyright (C) 2009 Tejun Heo <tj@kernel.org> > * > + * Copyright (C) 2017 Facebook Inc. > + * Copyright (C) 2017 Dennis Zhou <dennisszhou@gmail.com> > + * > * This file is released under the GPLv2 license. > * > * The percpu allocator handles both static and dynamic areas. Percpu > @@ -34,19 +37,20 @@ > * percpu variables from kernel modules. Finally, the dynamic section > * takes care of normal allocations. > * > - * Allocation state in each chunk is kept using an array of integers > - * on chunk->map. A positive value in the map represents a free > - * region and negative allocated. Allocation inside a chunk is done > - * by scanning this map sequentially and serving the first matching > - * entry. This is mostly copied from the percpu_modalloc() allocator. > - * Chunks can be determined from the address using the index field > - * in the page struct. The index field contains a pointer to the chunk. > - * > - * These chunks are organized into lists according to free_size and > - * tries to allocate from the fullest chunk first. Each chunk maintains > - * a maximum contiguous area size hint which is guaranteed to be equal > - * to or larger than the maximum contiguous area in the chunk. This > - * helps prevent the allocator from iterating over chunks unnecessarily. 
> + * The allocator organizes chunks into lists according to free size and > + * tries to allocate from the fullest chunk first. Each chunk is managed > + * by a bitmap with metadata blocks. The allocation map is updated on > + * every allocation to reflect the current state while the boundary map > + * is only updated on allocation. Each metadata block contains > + * information to help mitigate the need to iterate over large portions > + * of the bitmap. The reverse mapping from page to chunk is stored in > + * the page's index. Lastly, units are lazily backed and grow in unison. > + * > + * There is a unique conversion that goes on here between bytes and bits. > + * The chunk tracks the number of pages it is responsible for in nr_pages. > + * From there, helper functions are used to convert from physical pages > + * to bitmap bits and bitmap blocks. All hints are managed in bits > + * unless explicitly stated. > * > * To use this allocator, arch code should do the following: > * > @@ -86,10 +90,13 @@ > > #include "percpu-internal.h" > > -#define PCPU_SLOT_BASE_SHIFT 5 /* 1-31 shares the same slot */ > -#define PCPU_DFL_MAP_ALLOC 16 /* start a map with 16 ents */ > -#define PCPU_ATOMIC_MAP_MARGIN_LOW 32 > -#define PCPU_ATOMIC_MAP_MARGIN_HIGH 64 > +/* > + * The metadata is managed in terms of bits with each bit mapping to > + * a fragment of size PCPU_MIN_ALLOC_SIZE. Thus, the slots are calculated > + * with respect to the number of bits available. 
> + */ > +#define PCPU_SLOT_BASE_SHIFT 3 > + > #define PCPU_EMPTY_POP_PAGES_LOW 2 > #define PCPU_EMPTY_POP_PAGES_HIGH 4 > > @@ -156,10 +163,11 @@ unsigned long pcpu_reserved_offset __ro_after_init; > DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ > static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ > > -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */ > - > -/* chunks which need their map areas extended, protected by pcpu_lock */ > -static LIST_HEAD(pcpu_map_extend_chunks); > +/* > + * Chunk list slots. These slots order the chunks by the number of > + * free bits available in the bitmap. > + */ > +struct list_head *pcpu_slot __ro_after_init; > > /* > * The number of empty populated pages, protected by pcpu_lock. The > @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr) > pcpu_reserved_chunk->nr_pages * PAGE_SIZE; > } > > -static int __pcpu_size_to_slot(int size) > +static int __pcpu_size_to_slot(int bit_size) > { > - int highbit = fls(size); /* size is in bytes */ > + int highbit = fls(bit_size); /* size is in bits */ > return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1); > } > > -static int pcpu_size_to_slot(int size) > +static int pcpu_size_to_slot(int bit_size) > { > - if (size == pcpu_unit_size) > + if (bit_size == pcpu_pages_to_bits(pcpu_unit_pages)) > return pcpu_nr_slots - 1; > - return __pcpu_size_to_slot(size); > + return __pcpu_size_to_slot(bit_size); > } > > static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) > { > - if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int)) > + if (chunk->free_bits == 0 || chunk->contig_hint == 0) > return 0; > > - return pcpu_size_to_slot(chunk->free_size); > + return pcpu_size_to_slot(chunk->free_bits); > } > > /* set the pointer to a chunk in a page struct */ > @@ -277,6 +285,37 @@ static void __maybe_unused pcpu_next_pop(struct pcpu_chunk *chunk, > } > > /* > + * The following are helper functions to help 
access bitmaps and convert > + * between bitmap offsets to actual address offsets. > + */ > +static unsigned long *pcpu_index_alloc_map(struct pcpu_chunk *chunk, int index) > +{ > + return chunk->alloc_map + > + (index * PCPU_BITMAP_BLOCK_SIZE / BITS_PER_LONG); > +} > + > +static unsigned long pcpu_off_to_block_index(int off) > +{ > + return off / PCPU_BITMAP_BLOCK_SIZE; > +} > + > +static unsigned long pcpu_off_to_block_off(int off) > +{ > + return off & (PCPU_BITMAP_BLOCK_SIZE - 1); > +} > + > +static unsigned long pcpu_block_off_to_off(int index, int off) > +{ > + return index * PCPU_BITMAP_BLOCK_SIZE + off; > +} > + > +static unsigned long pcpu_block_get_first_page(int index) > +{ > + return PFN_DOWN(index * PCPU_BITMAP_BLOCK_SIZE * > + PCPU_MIN_ALLOC_SIZE); > +} > + > +/* > * (Un)populated page region iterators. Iterate over (un)populated > * page regions between @start and @end in @chunk. @rs and @re should > * be integer variables and will be set to start and end page index of > @@ -329,38 +368,6 @@ static void pcpu_mem_free(void *ptr) > } > > /** > - * pcpu_count_occupied_pages - count the number of pages an area occupies > - * @chunk: chunk of interest > - * @i: index of the area in question > - * > - * Count the number of pages chunk's @i'th area occupies. When the area's > - * start and/or end address isn't aligned to page boundary, the straddled > - * page is included in the count iff the rest of the page is free. 
> - */ > -static int pcpu_count_occupied_pages(struct pcpu_chunk *chunk, int i) > -{ > - int off = chunk->map[i] & ~1; > - int end = chunk->map[i + 1] & ~1; > - > - if (!PAGE_ALIGNED(off) && i > 0) { > - int prev = chunk->map[i - 1]; > - > - if (!(prev & 1) && prev <= round_down(off, PAGE_SIZE)) > - off = round_down(off, PAGE_SIZE); > - } > - > - if (!PAGE_ALIGNED(end) && i + 1 < chunk->map_used) { > - int next = chunk->map[i + 1]; > - int nend = chunk->map[i + 2] & ~1; > - > - if (!(next & 1) && nend >= round_up(end, PAGE_SIZE)) > - end = round_up(end, PAGE_SIZE); > - } > - > - return max_t(int, PFN_DOWN(end) - PFN_UP(off), 0); > -} > - > -/** > * pcpu_chunk_relocate - put chunk in the appropriate chunk slot > * @chunk: chunk of interest > * @oslot: the previous slot it was on > @@ -386,385 +393,770 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) > } > > /** > - * pcpu_need_to_extend - determine whether chunk area map needs to be extended > + * pcpu_cnt_pop_pages - counts populated backing pages in range > * @chunk: chunk of interest > - * @is_atomic: the allocation context > + * @start: start index > + * @end: end index > * > - * Determine whether area map of @chunk needs to be extended. If > - * @is_atomic, only the amount necessary for a new allocation is > - * considered; however, async extension is scheduled if the left amount is > - * low. If !@is_atomic, it aims for more empty space. Combined, this > - * ensures that the map is likely to have enough available space to > - * accomodate atomic allocations which can't extend maps directly. > - * > - * CONTEXT: > - * pcpu_lock. > + * Calculates the number of populated pages in the region [start, end). > + * This lets us keep track of how many empty populated pages are available > + * and decide if we should schedule async work. > * > * RETURNS: > - * New target map allocation length if extension is necessary, 0 > - * otherwise. 
*/ > -static int pcpu_need_to_extend(struct pcpu_chunk *chunk, bool is_atomic) > +static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, > + int start, int end) > { > - int margin, new_alloc; > - > - lockdep_assert_held(&pcpu_lock); > + return bitmap_weight(chunk->populated, end) - > + bitmap_weight(chunk->populated, start); > +} > > - if (is_atomic) { > - margin = 3; > +/** > + * pcpu_chunk_update_hint - updates metadata about a chunk > + * @chunk: chunk of interest > + * > + * Responsible for iterating over metadata blocks to aggregate the > + * overall statistics of the chunk. > + * > + * Updates: > + * chunk->contig_hint > + * chunk->contig_hint_start > + * nr_empty_pop_pages > + */ > +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk) > +{ > + bool is_page_empty = true; > + int i, off, cur_contig, nr_empty_pop_pages, l_pop_off; > + struct pcpu_bitmap_md *block; > + > + chunk->contig_hint = cur_contig = 0; > + off = nr_empty_pop_pages = 0; > + l_pop_off = pcpu_block_get_first_page(chunk->first_free_block); > + > + for (i = chunk->first_free_block, block = chunk->md_blocks + i; > + i < pcpu_nr_pages_to_blocks(chunk); i++, block++) { > + /* Manage nr_empty_pop_pages. > + * > + * This is tricky. The background work function is > + * triggered when there are not enough free populated pages. > + * This is necessary to make sure atomic allocations can > + * succeed. > + * > + * The first page of each block is tracked here, allowing > + * this to scale both when there is more than one page > + * per block and when a block is only a portion of a page. 
> + */ > + int pop_off = pcpu_block_get_first_page(i); > + > + if (pop_off > l_pop_off) { > + if (is_page_empty) > + nr_empty_pop_pages += > + pcpu_cnt_pop_pages(chunk, l_pop_off, > + pop_off); > + l_pop_off = pop_off; > + is_page_empty = true; > + } > + if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE) > + is_page_empty = false; > > - if (chunk->map_alloc < > - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) { > - if (list_empty(&chunk->map_extend_list)) { > - list_add_tail(&chunk->map_extend_list, > - &pcpu_map_extend_chunks); > - pcpu_schedule_balance_work(); > + /* continue from prev block adding to the cur_contig hint */ > + if (cur_contig) { > + cur_contig += block->left_free; > + if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) { > + continue; > + } else if (cur_contig > chunk->contig_hint) { > + chunk->contig_hint = cur_contig; > + chunk->contig_hint_start = off; > } > + cur_contig = 0; > } > - } else { > - margin = PCPU_ATOMIC_MAP_MARGIN_HIGH; > + /* check if the block->contig_hint is larger */ > + if (block->contig_hint > chunk->contig_hint) { > + chunk->contig_hint = block->contig_hint; > + chunk->contig_hint_start = > + pcpu_block_off_to_off(i, > + block->contig_hint_start); > + } > + /* let the next iteration catch the right_free */ > + cur_contig = block->right_free; > + off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free; > } > > - if (chunk->map_alloc >= chunk->map_used + margin) > - return 0; > - > - new_alloc = PCPU_DFL_MAP_ALLOC; > - while (new_alloc < chunk->map_used + margin) > - new_alloc *= 2; > + /* catch last iteration if the last block ends with free space */ > + if (cur_contig > chunk->contig_hint) { > + chunk->contig_hint = cur_contig; > + chunk->contig_hint_start = off; > + } > > - return new_alloc; > + /* > + * Keep track of nr_empty_pop_pages. > + * > + * The chunk maintains the previous number of free pages it held, > + * so the delta is used to update the global counter. 
The reserved > + * chunk is not part of the free page count as its pages are populated > + * at init and are dedicated to serving reserved allocations. > + */ > + if (is_page_empty) { > + nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off, > + chunk->nr_pages); > + } > + if (chunk != pcpu_reserved_chunk) > + pcpu_nr_empty_pop_pages += > + (nr_empty_pop_pages - chunk->nr_empty_pop_pages); > + chunk->nr_empty_pop_pages = nr_empty_pop_pages; > } > > /** > - * pcpu_extend_area_map - extend area map of a chunk > + * pcpu_block_refresh_hint > * @chunk: chunk of interest > - * @new_alloc: new target allocation length of the area map > + * @index: block index of the metadata block > * > - * Extend area map of @chunk to have @new_alloc entries. > + * Full scan over the entire block to recalculate block-level metadata. > + */ > +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index) > +{ > + unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index); > + struct pcpu_bitmap_md *block = chunk->md_blocks + index; > + bool is_left_free = false, is_right_free = false; > + int contig; > + unsigned long start, end; > + > + block->contig_hint = 0; > + start = end = block->first_free; > + while (start < PCPU_BITMAP_BLOCK_SIZE) { > + /* > + * Scans the allocation map corresponding to this block > + * to find free fragments and update metadata accordingly. 
> + */ > + start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, > + start); > + if (start >= PCPU_BITMAP_BLOCK_SIZE) > + break; > + /* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */ > + end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start); > + /* update left_free */ > + contig = end - start; > + if (start == 0) { > + block->left_free = contig; > + is_left_free = true; > + } > + /* update right_free */ > + if (end == PCPU_BITMAP_BLOCK_SIZE) { > + block->right_free = contig; > + is_right_free = true; > + } > + /* update block contig_hints */ > + if (block->contig_hint < contig) { > + block->contig_hint = contig; > + block->contig_hint_start = start; > + } > + start = end; > + } > + > + /* clear left/right free hints */ > + if (!is_left_free) > + block->left_free = 0; > + if (!is_right_free) > + block->right_free = 0; > +} > + > +/** > + * pcpu_block_update_hint_alloc - update hint on allocation path > + * @chunk: chunk of interest > + * @bit_off: bitmap offset > + * @bit_size: size of request in allocation units > * > - * CONTEXT: > - * Does GFP_KERNEL allocation. Grabs and releases pcpu_lock. > + * Updates metadata for the allocation path. The metadata only has to be > + * refreshed by a full scan iff we break the largest contig region. > * > * RETURNS: > - * 0 on success, -errno on failure. > + * Bool if we need to update the chunk's metadata. This occurs only if we > + * break the chunk's contig hint. 
> */ > -static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc) > +static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, > + int bit_size) > { > - int *old = NULL, *new = NULL; > - size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); > - unsigned long flags; > - > - lockdep_assert_held(&pcpu_alloc_mutex); > + bool update_chunk = false; > + int i; > + int s_index, e_index, s_off, e_off; > + struct pcpu_bitmap_md *s_block, *e_block, *block; > > - new = pcpu_mem_zalloc(new_size); > - if (!new) > - return -ENOMEM; > + /* calculate per block offsets */ > + s_index = pcpu_off_to_block_index(bit_off); > + e_index = pcpu_off_to_block_index(bit_off + bit_size); > + s_off = pcpu_off_to_block_off(bit_off); > + e_off = pcpu_off_to_block_off(bit_off + bit_size); > > - /* acquire pcpu_lock and switch to new area map */ > - spin_lock_irqsave(&pcpu_lock, flags); > + /* > + * If the offset is the beginning of the next block, set it to the > + * end of the previous block as the end bit is exclusive. > + */ > + if (e_off == 0) { > + e_off = PCPU_BITMAP_BLOCK_SIZE; > + e_index--; > + } > > - if (new_alloc <= chunk->map_alloc) > - goto out_unlock; > + s_block = chunk->md_blocks + s_index; > + e_block = chunk->md_blocks + e_index; > > - old_size = chunk->map_alloc * sizeof(chunk->map[0]); > - old = chunk->map; > + /* > + * Update s_block. > + * > + * block->first_free must be updated if the allocation takes its place. > + * If the allocation breaks the contig_hint, a scan is required to > + * restore this hint. 
> + */ > + if (s_off == s_block->first_free) > + s_block->first_free = find_next_zero_bit( > + pcpu_index_alloc_map(chunk, s_index), > + PCPU_BITMAP_BLOCK_SIZE, > + s_off + bit_size); > + > + if (s_off >= s_block->contig_hint_start && > + s_off < s_block->contig_hint_start + s_block->contig_hint) { > + pcpu_block_refresh_hint(chunk, s_index); > + } else { > + /* update left and right contig manually */ > + s_block->left_free = min(s_block->left_free, s_off); > + if (s_index == e_index) > + s_block->right_free = min_t(int, s_block->right_free, > + PCPU_BITMAP_BLOCK_SIZE - e_off); > + else > + s_block->right_free = 0; > + } > > - memcpy(new, old, old_size); > + /* > + * Update e_block. > + * If s_index and e_index differ, e_block's first_free is guaranteed > + * to be at or beyond e_off. first_free must be updated and a scan > + * over e_block is issued. > + */ > + if (s_index != e_index) { > + e_block->first_free = find_next_zero_bit( > + pcpu_index_alloc_map(chunk, e_index), > + PCPU_BITMAP_BLOCK_SIZE, e_off); > > - chunk->map_alloc = new_alloc; > - chunk->map = new; > - new = NULL; > + pcpu_block_refresh_hint(chunk, e_index); > + } > > -out_unlock: > - spin_unlock_irqrestore(&pcpu_lock, flags); > + /* update in-between md_blocks */ > + for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index; > + i++, block++) { > + block->contig_hint = 0; > + block->left_free = 0; > + block->right_free = 0; > + } > > /* > - * pcpu_mem_free() might end up calling vfree() which uses > - * IRQ-unsafe lock and thus can't be called under pcpu_lock. > + * The only time a full chunk scan is required is if the global > + * contig_hint is broken. Otherwise, it means a smaller space > + * was used and therefore the global contig_hint is still correct. 
> */ > - pcpu_mem_free(old); > - pcpu_mem_free(new); > + if (bit_off >= chunk->contig_hint_start && > + bit_off < chunk->contig_hint_start + chunk->contig_hint) > + update_chunk = true; > > - return 0; > + return update_chunk; > } > > /** > - * pcpu_fit_in_area - try to fit the requested allocation in a candidate area > - * @chunk: chunk the candidate area belongs to > - * @off: the offset to the start of the candidate area > - * @this_size: the size of the candidate area > - * @size: the size of the target allocation > - * @align: the alignment of the target allocation > - * @pop_only: only allocate from already populated region > - * > - * We're trying to allocate @size bytes aligned at @align. @chunk's area > - * at @off sized @this_size is a candidate. This function determines > - * whether the target allocation fits in the candidate area and returns the > - * number of bytes to pad after @off. If the target area doesn't fit, -1 > - * is returned. > - * > - * If @pop_only is %true, this function only considers the already > - * populated part of the candidate area. > + * pcpu_block_update_hint_free - updates the block hints on the free path > + * @chunk: chunk of interest > + * @bit_off: bitmap offset > + * @bit_size: size of request in allocation units > + * > + * Updates the hint along the free path by taking advantage of current metadata > + * to minimize scanning of the bitmap. Triggers a global update if an entire > + * block becomes free or the free spans across blocks. This tradeoff is to > + * minimize global scanning to update the chunk->contig_hint. The > + * chunk->contig_hint may be off by up to a block, but the chunk->contig_hint > + * will never be more than the available space. If the chunk->contig_hint is > + * in this block, it will be accurate. > + * > + * RETURNS: > + * Bool if we need to update the chunk's metadata. This occurs if a larger > + * contig region is created along the edges or we free across blocks. 
> */ > -static int pcpu_fit_in_area(struct pcpu_chunk *chunk, int off, int this_size, > - int size, int align, bool pop_only) > +static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, > + int bit_size) > { > - int cand_off = off; > + bool update_chunk = false; > + int i; > + int s_index, e_index, s_off, e_off; > + int start, end, contig; > + struct pcpu_bitmap_md *s_block, *e_block, *block; > > - while (true) { > - int head = ALIGN(cand_off, align) - off; > - int page_start, page_end, rs, re; > + /* calculate per block offsets */ > + s_index = pcpu_off_to_block_index(bit_off); > + e_index = pcpu_off_to_block_index(bit_off + bit_size); > + s_off = pcpu_off_to_block_off(bit_off); > + e_off = pcpu_off_to_block_off(bit_off + bit_size); > + > + /* > + * If the offset is the beginning of the next block, set it to the > + * end of the previous block as the end bit is exclusive. > + */ > + if (e_off == 0) { > + e_off = PCPU_BITMAP_BLOCK_SIZE; > + e_index--; > + } > + > + s_block = chunk->md_blocks + s_index; > + e_block = chunk->md_blocks + e_index; > + > + /* > + * Check if the freed area aligns with the block->contig_hint. > + * If it does, then the scan to find the beginning/end of the > + * larger free area can be avoided. > + * > + * start and end refer to the beginning and end of the free region > + * within their respective blocks. This is not necessarily > + * the entire free region as it may span blocks past the beginning > + * or end of the block. > + */ > + start = s_off; > + if (s_off == s_block->contig_hint + s_block->contig_hint_start) { > + start = s_block->contig_hint_start; > + } else { > + int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index), > + start); > + start = (start == l_bit) ? 
0 : l_bit + 1; > + } > + > + end = e_off; > + if (e_off == e_block->contig_hint_start) > + end = e_block->contig_hint_start + e_block->contig_hint; > + else > + end = find_next_bit(pcpu_index_alloc_map(chunk, e_index), > + PCPU_BITMAP_BLOCK_SIZE, end); > > - if (this_size < head + size) > - return -1; > + /* freeing in the same block */ > + if (s_index == e_index) { > + contig = end - start; > > - if (!pop_only) > - return head; > + if (start == 0) > + s_block->left_free = contig; > > + if (end == PCPU_BITMAP_BLOCK_SIZE) > + s_block->right_free = contig; > + > + s_block->first_free = min(s_block->first_free, start); > + if (contig > s_block->contig_hint) { > + s_block->contig_hint = contig; > + s_block->contig_hint_start = start; > + } > + > + } else { > /* > - * If the first unpopulated page is beyond the end of the > - * allocation, the whole allocation is populated; > - * otherwise, retry from the end of the unpopulated area. > + * Freeing across md_blocks. > + * > + * If the start is at the beginning of the block, just > + * reset the block instead. > */ > - page_start = PFN_DOWN(head + off); > - page_end = PFN_UP(head + off + size); > - > - rs = page_start; > - pcpu_next_unpop(chunk, &rs, &re, PFN_UP(off + this_size)); > - if (rs >= page_end) > - return head; > - cand_off = re * PAGE_SIZE; > + if (start == 0) { > + s_index--; > + } else { > + /* > + * Knowing that the free is across blocks, this means > + * the hint can be updated on the right side and the > + * left side does not need to be touched. > + */ > + s_block->first_free = min(s_block->first_free, start); > + contig = PCPU_BITMAP_BLOCK_SIZE - start; > + s_block->right_free = contig; > + if (contig > s_block->contig_hint) { > + s_block->contig_hint = contig; > + s_block->contig_hint_start = start; > + } > + } > + /* > + * If end is the entire e_block, just reset the block > + * as well. 
> + */ > + if (end == PCPU_BITMAP_BLOCK_SIZE) { > + e_index++; > + } else { > + /* > + * The hint must only be on the left side, so > + * update accordingly. > + */ > + e_block->first_free = 0; > + e_block->left_free = end; > + if (end > e_block->contig_hint) { > + e_block->contig_hint = end; > + e_block->contig_hint_start = 0; > + } > + } > + > + /* reset md_blocks in the middle */ > + for (i = s_index + 1, block = chunk->md_blocks + i; > + i < e_index; i++, block++) { > + block->first_free = 0; > + block->contig_hint_start = 0; > + block->contig_hint = PCPU_BITMAP_BLOCK_SIZE; > + block->left_free = PCPU_BITMAP_BLOCK_SIZE; > + block->right_free = PCPU_BITMAP_BLOCK_SIZE; > + } > } > + > + /* > + * The hint is only checked in the s_block and e_block when > + * freeing and particularly only when it is self contained within > + * its own block. A scan is required if the free space spans > + * blocks or makes a block whole as the scan will take into > + * account free space across blocks. > + */ > + if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) || > + s_index != e_index) { > + update_chunk = true; > + } else if (s_block->contig_hint > chunk->contig_hint) { > + /* check if block contig_hint is bigger */ > + chunk->contig_hint = s_block->contig_hint; > + chunk->contig_hint_start = > + pcpu_block_off_to_off(s_index, > + s_block->contig_hint_start); > + } > + > + return update_chunk; > } > > /** > - * pcpu_alloc_area - allocate area from a pcpu_chunk > + * pcpu_is_populated - determines if the region is populated > * @chunk: chunk of interest > - * @size: wanted size in bytes > - * @align: wanted align > - * @pop_only: allocate only from the populated area > - * @occ_pages_p: out param for the number of pages the area occupies > + * @index: block index > + * @block_off: offset within the bitmap > + * @bit_size: size of request in allocation units > + * @next_index: return value for next block index that is populated > * > - * Try to allocate @size bytes area aligned at 
@align from @chunk. > - * Note that this function only allocates the offset. It doesn't > - * populate or map the area. > + * For atomic allocations, we must check if the backing pages are populated. > * > - * @chunk->map must have at least two free slots. > + * RETURNS: > + * Bool if the backing pages are populated. next_index is to skip over > + * unpopulated blocks in pcpu_find_block_fit. > + */ > +static bool pcpu_is_populated(struct pcpu_chunk *chunk, int index, > + int block_off, int bit_size, int *next_index) > +{ > + int page_start, page_end, rs, re; > + int off = pcpu_block_off_to_off(index, block_off); > + int e_off = off + bit_size * PCPU_MIN_ALLOC_SIZE; > + > + page_start = PFN_DOWN(off); > + page_end = PFN_UP(e_off); > + > + rs = page_start; > + pcpu_next_unpop(chunk, &rs, &re, PFN_UP(e_off)); > + if (rs >= page_end) > + return true; > + *next_index = re * PAGE_SIZE / PCPU_BITMAP_BLOCK_SIZE; > + return false; > +} > + > +/** > + * pcpu_find_block_fit - finds the block index to start searching > + * @chunk: chunk of interest > + * @bit_size: size of request in allocation units > + * @align: alignment of area (max PAGE_SIZE) > + * @pop_only: use populated regions only > * > - * CONTEXT: > - * pcpu_lock. > + * Given a chunk and an allocation spec, find the offset to begin searching > + * for a free region. This is done by iterating over the bitmap metadata > + * blocks and then only returning regions that are guaranteed to fit the > + * alignment by comparing against the block->contig_hint_start or a correctly > + * aligned offset. Iteration is used within a block as an allocation may > + * be served prior to the contig_hint. > + * > + * Note: This errs on the side of caution by only selecting blocks guaranteed > + * to have a fit in the chunk's contig_hint. Poor alignment can cause > + * us to skip over chunks that have valid vacancies. > * > * RETURNS: > - * Allocated offset in @chunk on success, -1 if no matching area is > - * found. 
> + * The offset in the bitmap to begin searching.
> + * -1 if no offset is found.
>  */
> -static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align,
> -			   bool pop_only, int *occ_pages_p)
> +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
> +			       size_t align, bool pop_only)
>  {
> -	int oslot = pcpu_chunk_slot(chunk);
> -	int max_contig = 0;
> -	int i, off;
> -	bool seen_free = false;
> -	int *p;
> -
> -	for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) {
> -		int head, tail;
> -		int this_size;
> -
> -		off = *p;
> -		if (off & 1)
> -			continue;
> +	int i, cur_free;
> +	int s_index, block_off, next_index, end_off; /* interior alloc index */
> +	struct pcpu_bitmap_md *block;
> +	unsigned long *alloc_map;
>
> -		this_size = (p[1] & ~1) - off;
> +	lockdep_assert_held(&pcpu_lock);
>
> -		head = pcpu_fit_in_area(chunk, off, this_size, size, align,
> -					pop_only);
> -		if (head < 0) {
> -			if (!seen_free) {
> -				chunk->first_free = i;
> -				seen_free = true;
> -			}
> -			max_contig = max(this_size, max_contig);
> +	cur_free = block_off = 0;
> +	s_index = chunk->first_free_block;
> +	for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk);
> +	     i++) {
> +		alloc_map = pcpu_index_alloc_map(chunk, i);
> +		block = chunk->md_blocks + i;
> +
> +		/* continue from prev block */
> +		cur_free += block->left_free;
> +		if (cur_free >= bit_size) {
> +			end_off = bit_size;
> +			goto check_populated;
> +		} else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
>  			continue;
>  		}
>
>  		/*
> -		 * If head is small or the previous block is free,
> -		 * merge'em. Note that 'small' is defined as smaller
> -		 * than sizeof(int), which is very small but isn't too
> -		 * uncommon for percpu allocations.
> +		 * Can this block hold this alloc?
> +		 *
> +		 * Here the block->contig_hint is used to guarantee a fit,
> +		 * but the block->first_free is returned as we may be able
> +		 * to serve the allocation earlier. The population check
> +		 * must take into account the area beginning at first_free
> +		 * through the end of the contig_hint.
>  		 */
> -		if (head && (head < sizeof(int) || !(p[-1] & 1))) {
> -			*p = off += head;
> -			if (p[-1] & 1)
> -				chunk->free_size -= head;
> -			else
> -				max_contig = max(*p - p[-1], max_contig);
> -			this_size -= head;
> -			head = 0;
> +		cur_free = 0;
> +		s_index = i;
> +		block_off = ALIGN(block->contig_hint_start, align);
> +		block_off -= block->contig_hint_start;
> +		if (block->contig_hint >= block_off + bit_size) {
> +			block_off = block->first_free;
> +			end_off = block->contig_hint_start - block_off +
> +				  bit_size;
> +			goto check_populated;
>  		}
>
> -		/* if tail is small, just keep it around */
> -		tail = this_size - head - size;
> -		if (tail < sizeof(int)) {
> -			tail = 0;
> -			size = this_size - head;
> +		/* check right */
> +		block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free,
> +				  align);
> +		/* reset to start looking in the next block */
> +		if (block_off >= PCPU_BITMAP_BLOCK_SIZE) {
> +			s_index++;
> +			cur_free = block_off = 0;
> +			continue;
>  		}
> +		cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off;
> +		if (cur_free >= bit_size) {
> +			end_off = bit_size;
> +check_populated:
> +			if (!pop_only ||
> +			    pcpu_is_populated(chunk, s_index, block_off,
> +					      end_off, &next_index))
> +				break;
>
> -		/* split if warranted */
> -		if (head || tail) {
> -			int nr_extra = !!head + !!tail;
> -
> -			/* insert new subblocks */
> -			memmove(p + nr_extra + 1, p + 1,
> -				sizeof(chunk->map[0]) * (chunk->map_used - i));
> -			chunk->map_used += nr_extra;
> -
> -			if (head) {
> -				if (!seen_free) {
> -					chunk->first_free = i;
> -					seen_free = true;
> -				}
> -				*++p = off += head;
> -				++i;
> -				max_contig = max(head, max_contig);
> -			}
> -			if (tail) {
> -				p[1] = off + size;
> -				max_contig = max(tail, max_contig);
> -			}
> +			i = next_index - 1;
> +			s_index = next_index;
> +			cur_free = block_off = 0;
>  		}
> +	}
>
> -		if (!seen_free)
> -			chunk->first_free = i + 1;
> +	/* nothing found */
> +	if (i == pcpu_nr_pages_to_blocks(chunk))
> +		return -1;
>
> -	/* update hint and mark allocated */
> -	if (i + 1 == chunk->map_used)
> -		chunk->contig_hint = max_contig; /* fully scanned */
> -	else
> -		chunk->contig_hint = max(chunk->contig_hint,
> -					 max_contig);
> +	return s_index * PCPU_BITMAP_BLOCK_SIZE + block_off;
> +}
>
> -	chunk->free_size -= size;
> -	*p |= 1;
>
> -	*occ_pages_p = pcpu_count_occupied_pages(chunk, i);
> -	pcpu_chunk_relocate(chunk, oslot);
> -	return off;
> -	}
> +/**
> + * pcpu_alloc_area - allocates area from a pcpu_chunk
> + * @chunk: chunk of interest
> + * @bit_size: size of request in allocation units
> + * @align: alignment of area (max PAGE_SIZE)
> + * @start: bit_off to start searching
> + *
> + * This function takes in a start bit_offset to begin searching. It
> + * searches the allocation bitmap to verify that the offset is available
> + * as block->first_free is provided when allocation within a block is
> + * available.
> + *
> + * RETURNS:
> + * Allocated addr offset in @chunk on success,
> + * -1 if no matching area is found
> + */
> +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size,
> +			   size_t align, int start)
> +{
> +	size_t align_mask = (align) ? (align - 1) : 0;
> +	int i, bit_off, oslot;
> +	struct pcpu_bitmap_md *block;
> +
> +	lockdep_assert_held(&pcpu_lock);
> +
> +	oslot = pcpu_chunk_slot(chunk);
> +
> +	/* search to find fit */
> +	bit_off = bitmap_find_next_zero_area(chunk->alloc_map,
> +					     pcpu_nr_pages_to_bits(chunk),
> +					     start, bit_size, align_mask);
> +
> +	if (bit_off >= pcpu_nr_pages_to_bits(chunk))
> +		return -1;
> +
> +	/* update alloc map */
> +	bitmap_set(chunk->alloc_map, bit_off, bit_size);
> +	/* update boundary map */
> +	set_bit(bit_off, chunk->bound_map);
> +	bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1);
> +	set_bit(bit_off + bit_size, chunk->bound_map);
> +

Actually I decided I do want to complain about this. Have you considered
making chunks statically sized, like slab does? We could avoid this whole
bound_map thing completely and save quite a few cycles trying to figure out
how big our allocation was. Thanks,

Josef

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 37+ messages in thread
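[Editor's note] The bound_map manipulation Josef is questioning can be sketched standalone. This is an illustrative model, not kernel code: plain byte arrays stand in for the kernel's bitmap helpers, and `demo_alloc`/`demo_free` are hypothetical names. The point it demonstrates is the one under discussion: the boundary map marks the first bit of each allocation (plus the bit just past its end), so the free path can recover an allocation's size without storing a per-allocation integer.

```c
#include <assert.h>
#include <string.h>

#define NBITS 64

/* alloc_map is the ground truth; bound_map marks allocation boundaries */
static unsigned char alloc_map[NBITS];
static unsigned char bound_map[NBITS + 1];

/* mark bits [off, off + size) allocated and record the boundaries */
static void demo_alloc(int off, int size)
{
	memset(alloc_map + off, 1, size);
	bound_map[off] = 1;
	memset(bound_map + off + 1, 0, size - 1);
	bound_map[off + size] = 1;
}

/* recover the allocation size by scanning bound_map for the next set bit */
static int demo_free(int off)
{
	int end = off + 1;

	assert(bound_map[off]);		/* off must start an allocation */
	while (end <= NBITS && !bound_map[end])
		end++;
	memset(alloc_map + off, 0, end - off);
	return end - off;		/* size in allocation units */
}
```

Under this scheme `demo_free` costs a scan bounded by the allocation's own length; the static-chunk-size alternative Josef proposes would make the size implicit instead.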
* Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
  2017-07-19 19:16 ` Josef Bacik
@ 2017-07-19 22:13   ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread

From: Dennis Zhou @ 2017-07-19 22:13 UTC (permalink / raw)
To: Josef Bacik
Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hi Josef,

Thanks for taking a look at my code.

On Wed, Jul 19, 2017 at 07:16:35PM +0000, Josef Bacik wrote:
>
> Actually I decided I do want to complain about this. Have you considered making
> chunks statically sized, like slab does? We could avoid this whole bound_map
> thing completely and save quite a few cycles trying to figure out how big our
> allocation was. Thanks,

I did consider something along the lines of a slab allocator, but
ultimately utilization and fragmentation were why I decided against it.

Percpu memory is handled by giving each cpu its own copy of the object to
use. This means cpus can avoid cache coherence traffic when accessing and
manipulating the object. To do this, the percpu allocator creates chunks to
serve each allocation out of. Because each cpu has its own copy, there is a
high cost to having each chunk lying around (and to this memory in
general).

A slab allocator takes the liberty of caching commonly used sizes and
accepting internal fragmentation for performance. Unfortunately, the percpu
memory allocator does not necessarily know what is going to get allocated.
It would need to keep many slabs around to serve each allocation, which can
be quite expensive. In the worst case, long-lived percpu allocations can
keep entire slabs alive, as there is no way to perform consolidation once
addresses are given out. Additionally, any internal fragmentation caused by
ill-fitting objects is amplified by the number of possible cpus.

Thanks,
Dennis

^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH 10/10] percpu: add optimizations on allocation path for the bitmap allocator
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
  ` (8 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
@ 2017-07-16  2:23 ` Dennis Zhou
  2017-07-17 23:32   ` Tejun Heo
  2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik
  10 siblings, 1 reply; 37+ messages in thread

From: Dennis Zhou @ 2017-07-16 2:23 UTC (permalink / raw)
To: Tejun Heo, Christoph Lameter
Cc: kernel-team, linux-kernel, linux-mm, Dennis Zhou

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

This patch adds two optimizations to the allocation path. The first is to
not consider a chunk if the requested allocation cannot fit in the chunk's
contig_hint. The benefit is that this avoids unnecessary scanning over a
chunk, as the assumption is that memory pressure is high and creating a new
chunk has minimal consequences. This may fail when the contig_hint has poor
alignment, but again we fall back on the high memory pressure argument.

The second is just a fail-fast mechanism. When allocating, an offset is
identified within a block and then scanning is used to see if it will fit.
An offset should never be returned unless it is known to fit, so here we
just bound the scanning to the size of a block.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
---
 mm/percpu.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 569df63..7496571 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -885,6 +885,12 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
 	lockdep_assert_held(&pcpu_lock);

+	/* check chunk->contig_hint to see if alloc can fit - see note above */
+	block_off = ALIGN(chunk->contig_hint_start, align) -
+		    chunk->contig_hint_start;
+	if (block_off + bit_size > chunk->contig_hint)
+		return -1;
+
 	cur_free = block_off = 0;
 	s_index = chunk->first_free_block;
 	for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk);
@@ -973,19 +979,23 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size,
 			   size_t align, int start)
 {
 	size_t align_mask = (align) ? (align - 1) : 0;
-	int i, bit_off, oslot;
+	int i, bit_off, end, oslot;
 	struct pcpu_bitmap_md *block;

 	lockdep_assert_held(&pcpu_lock);

 	oslot = pcpu_chunk_slot(chunk);

-	/* search to find fit */
-	bit_off = bitmap_find_next_zero_area(chunk->alloc_map,
-					     pcpu_nr_pages_to_bits(chunk),
-					     start, bit_size, align_mask);
+	/*
+	 * Search to find fit. The search for the start is limited to
+	 * be within a block_size, but should in reality never be hit
+	 * as the contig_hint should be a valid placement.
+	 */
+	end = start + bit_size + PCPU_BITMAP_BLOCK_SIZE;
+	bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start,
+					     bit_size, align_mask);

-	if (bit_off >= pcpu_nr_pages_to_bits(chunk))
+	if (bit_off >= end)
 		return -1;

 	/* update alloc map */
--
2.9.3

^ permalink raw reply related [flat|nested] 37+ messages in thread
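[Editor's note] The first optimization in the hunk above can be modeled standalone. This is an illustrative sketch, not the kernel code: `hint_can_fit` is a hypothetical helper that mirrors the patch's arithmetic — advance to the first aligned offset inside the hint, then see whether the request still fits. It also shows the caveat the commit message concedes: a region can hold an unaligned fit yet be rejected once alignment forces a skip.

```c
/* round x up to the next multiple of a (a must be a power of two) */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/*
 * Illustrative model of the fail-fast check: an allocation of bit_size
 * units with the given alignment fits in a contiguous free region of
 * length hint starting at hint_start only if the aligned start offset
 * still leaves bit_size units inside the region.
 */
static int hint_can_fit(int hint_start, int hint, int bit_size, int align)
{
	int skip = ALIGN_UP(hint_start, align) - hint_start;

	return skip + bit_size <= hint;
}
```

For example, a 4-unit request fits a 4-unit hint starting at offset 3 when align is 1 (skip 0), but not when align is 4 (the skip of 1 leaves only 3 units) — precisely the poor-alignment case where the pre-check bails out early.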
* Re: [PATCH 10/10] percpu: add optimizations on allocation path for the bitmap allocator
  2017-07-16  2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou
@ 2017-07-17 23:32   ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread

From: Tejun Heo @ 2017-07-17 23:32 UTC (permalink / raw)
To: Dennis Zhou
Cc: Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:15PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
>
> This patch adds two optimizations to the allocation path. The first is
> to not consider a chunk if the requested allocation cannot fit in the
> chunk's contig_hint. The benefit is that this avoids unnecessary scanning
> over a chunk as the assumption is memory pressure is high and creating a
> new chunk has minimal consequences. This may fail when the contig_hint
> has poor alignment, but again we fall back on the high memory pressure
> argument.
>
> The second is just a fail-fast mechanism. When allocating, an offset is
> identified within a block and then scanning is used to see if it will
> fit. An offset should never be returned unless it is known to fit, so
> here we just bound the scanning to the size of a block.
>
> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>

Looks good to me, and there's nothing wrong with these two optimizations
being in a separate patch, but they might be too little to help reviewing /
debugging in any noticeable way. It'd be great if more significant parts
can be separated out. If not, this is fine too.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator
  2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
  ` (9 preceding siblings ...)
  2017-07-16  2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou
@ 2017-07-18 19:15 ` Josef Bacik
  2017-07-24 21:14   ` Dennis Zhou
  10 siblings, 1 reply; 37+ messages in thread

From: Josef Bacik @ 2017-07-18 19:15 UTC (permalink / raw)
To: Dennis Zhou
Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

On Sat, Jul 15, 2017 at 10:23:05PM -0400, Dennis Zhou wrote:
> Hi everyone,
>
> The Linux kernel percpu memory allocator is responsible for managing
> percpu memory. It allocates memory from chunks of percpu areas and uses a
> simple first-fit area allocator to manage allocations inside each chunk.
> There now exist use cases where allocating and deallocating a million or
> more objects occurs, making the current implementation inadequate.
>
> The two primary problems with the current area map allocator are:
> 1. The backing data structure is an array of the areas. To manage this
>    array, it is possible to need to memmove a large portion of it.
> 2. On allocation, chunks are considered based on the contig_hint. It is
>    possible that the contig_hint may be large enough while the alignment
>    could not meet the request. This causes scanning over every free
>    fragment that could spill over into scanning chunks.
>
> The primary considerations for the new allocator were the following:
>  - Remove the memmove operation from the critical path
>  - Be conservative with additional use of memory
>  - Provide consistency in performance and memory footprint
>  - Focus on small allocations < 64 bytes
>
> This patchset introduces a simple bitmap allocator backed by metadata
> blocks as a replacement for the area map allocator for percpu memory. Each
> chunk has an allocation bitmap, a boundary bitmap, and a set of metadata
> blocks. The allocation map serves as the ground truth for allocations
> while the boundary map serves as a way to distinguish between consecutive
> allocations. The minimum allocation size has been increased to 4 bytes.
>
> The key property behind the bitmap allocator is its static metadata. The
> main problem it solves is that a memmove is no longer part of the critical
> path for freeing, which was the primary source of latency. This also helps
> bound the metadata overhead. The area map allocator previously required an
> integer per allocation. This may be beneficial with larger allocations,
> but as mentioned, allocating a significant number of small objects is
> becoming more common. This causes worst-case scenarios for metadata
> overhead.
>
> There is one caveat with this implementation. In an effort to make freeing
> fast, the only time metadata is updated on the free path is if a whole
> block becomes free or the freed area spans across metadata blocks. This
> causes the chunk's contig_hint to be potentially smaller than what it
> could allocate by up to a block. If the chunk's contig_hint is smaller
> than a block, a check occurs and the hint is kept accurate. Metadata is
> always kept accurate on allocation, and therefore the situation where a
> chunk has a larger contig_hint than available will never occur.
>
> I have primarily done testing against a simple workload of allocation of
> 1 million objects of varying size. Deallocation was done in order,
> alternating, and in reverse. These numbers were collected after rebasing
> on top of a80099a152. I present the worst-case numbers here:
>
> Area Map Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
>          4B |             335 |           4960
>         16B |             485 |           1150
>         64B |             445 |            280
>        128B |             505 |            177
>       1024B |            3385 |            140
>
> Bitmap Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
>          4B |             725 |             70
>         16B |             760 |             70
>         64B |             855 |             80
>        128B |             910 |             90
>       1024B |            3770 |            260
>
> This data demonstrates the inability of the area map allocator to
> handle less-than-ideal situations. In the best case of reverse
> deallocation, the area map allocator was able to perform within range
> of the bitmap allocator. In the worst-case situation, freeing took
> nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
> dramatically improves the consistency of the free path. The small
> allocations performed nearly identically regardless of the freeing
> pattern.
>
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 million 4-byte objects with 8-byte alignment.
>

Ok so you say that this test is better for the area map allocator, so
presumably this is the worst case for the bitmap allocator? What does the
average case look like? Trading 2x allocation latency for a pretty
significant free latency reduction seems ok, but are we allocating or
freeing more? Are both allocations and frees done in performance-critical
areas, or do allocations only happen in performance-critical areas, making
the increased allocation latency hurt more? And lastly, why are we paying a
2x latency cost? What is it about the bitmap allocator that makes it much
worse than the area map allocator?
Thanks,

Josef

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator
  2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik
@ 2017-07-24 21:14   ` Dennis Zhou
  0 siblings, 0 replies; 37+ messages in thread

From: Dennis Zhou @ 2017-07-24 21:14 UTC (permalink / raw)
To: Josef Bacik
Cc: Tejun Heo, Christoph Lameter, kernel-team, linux-kernel, linux-mm, Dennis Zhou

Hi Josef,

On Tue, Jul 18, 2017 at 03:15:28PM -0400, Josef Bacik wrote:
> Ok so you say that this test is better for the area map allocator, so presumably
> this is worst case for the bitmap allocator? What does the average case look
> like?

The bitmap allocator should perform relatively consistently across most
allocation schemes. The worst case would be one where constant scanning is
required, like the allocation scenario of 4 bytes with 8-byte alignment. As
far as the average case, I would expect similar performance on average.
These are updated values with v2.

Area Map Allocator:

Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
         4B |             310 |           4770
        16B |             557 |           1325
        64B |             436 |            273
       256B |             776 |            131
      1024B |            3280 |            122

Bitmap Allocator:

Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
         4B |             490 |             70
        16B |             515 |             75
        64B |             610 |             80
       256B |             950 |            100
      1024B |            3520 |            200

> Trading 2x allocation latency for a pretty significant free latency
> reduction seems ok, but are we allocating or freeing more? Are both allocations
> and free's done in performance critical areas, or do allocations only happen in
> performance critical areas, making the increased allocation latency hurt more?

I unfortunately do not have any data to support whether we are allocating
or freeing more in critical areas. I would defer use-case questions to
those consuming percpu memory. There is one issue that stood out, though,
with BPF that is causing the cpu to hang due to the percpu allocator. This
leads me to believe that freeing is of more concern. I would believe that
if allocation latency is a major problem, it is easier for the user to
preallocate than it is for them to manage delayed freeing.

> And lastly, why are we paying a 2x latency cost? What is it about the bitmap
> allocator that makes it much worse than the area map allocator?

So it turns out v1 was making poor use of control logic and the compiler
wasn't doing me any favors. I've since done a rather large refactor with
v2, which shows better performance overall. The bitmap allocator is paying
more in that it has to manage metadata that the area map allocator does
not. In the optimal case, the area map allocator is just an append without
any heavy operations.

A worse-performing scenario for the area map allocator is when the second
half of a chunk is occupied by a lot of small allocations. In that case,
each allocation has to memmove the rest of the area map. This scenario is
nontrivial to reproduce, so I provide data below for a similar scenario
where the second half of each page is filled.

Area Map Allocator:

Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
         4B |            4118 |           4892
        16B |            1651 |           1163
        64B |             598 |            285
       256B |             771 |            158
      1024B |            3034 |            160

Bitmap Allocator:

Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
         4B |             481 |             67
        16B |             506 |             69
        64B |             636 |             75
       256B |             892 |             90
      1024B |            3262 |            147

This shows that in less-than-ideal chunk states, the allocation path can
fail just like the freeing path can with non-ideal deallocation patterns.

Thanks,
Dennis

^ permalink raw reply [flat|nested] 37+ messages in thread
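[Editor's note] The memmove cost Dennis describes can be seen in a toy model. This is illustrative only — `map_insert` is a made-up name, not the kernel's code path: the old allocator kept a sorted array of area offsets, so carving an allocation out of the middle of a half-full chunk shifts every later entry, which is the O(n) memmove the bitmap allocator removes from the critical path.

```c
#include <assert.h>
#include <string.h>

#define MAP_MAX 16

/* sorted array of area offsets, mimicking the old chunk->map */
static int map[MAP_MAX];
static int map_used;

/* insert an area offset at position idx, shifting the tail right */
static void map_insert(int idx, int off)
{
	assert(map_used < MAP_MAX);
	/* this memmove touches every entry after idx: O(map_used - idx) */
	memmove(&map[idx + 1], &map[idx],
		sizeof(map[0]) * (map_used - idx));
	map[idx] = off;
	map_used++;
}
```

With a million small areas in a chunk, each insert (or free that splits an area) near the front pays for shifting the entire tail, which matches the pathological alloc and free times in the area-map columns above.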
end of thread, other threads:[~2017-07-24 21:37 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-16  2:23 [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator Dennis Zhou
2017-07-16  2:23 ` [PATCH 01/10] percpu: pcpu-stats change void buffer to int buffer Dennis Zhou
2017-07-17 14:44   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 02/10] percpu: change the format for percpu_stats output Dennis Zhou
2017-07-17 14:46   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 03/10] percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats Dennis Zhou
2017-07-17 14:47   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 04/10] percpu: update the header comment and pcpu_build_alloc_info comments Dennis Zhou
2017-07-17 14:53   ` Tejun Heo
2017-07-16  2:23 ` [PATCH 05/10] percpu: change reserved_size to end page aligned Dennis Zhou
2017-07-16  4:01   ` kbuild test robot
2017-07-16  5:11   ` kbuild test robot
2017-07-17 16:46   ` Tejun Heo
2017-07-17 19:10     ` Dennis Zhou
2017-07-24 20:04   ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 06/10] percpu: modify base_addr to be region specific Dennis Zhou
2017-07-17 18:57   ` Tejun Heo
2017-07-18 19:26     ` Josef Bacik
2017-07-18 19:36       ` Matthew Wilcox
2017-07-19 14:20         ` Josef Bacik
2017-07-16  2:23 ` [PATCH 07/10] percpu: fix misnomer in schunk/dchunk variable names Dennis Zhou
2017-07-17 19:10   ` Tejun Heo
2017-07-24 20:07     ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 08/10] percpu: change the number of pages marked in the first_chunk bitmaps Dennis Zhou
2017-07-17 19:26   ` Tejun Heo
2017-07-24 20:13     ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Dennis Zhou
2017-07-17 23:27   ` Tejun Heo
2017-07-24 21:37     ` Dennis Zhou
2017-07-19 19:11   ` Josef Bacik
2017-07-19 22:19     ` Dennis Zhou
2017-07-19 19:16   ` Josef Bacik
2017-07-19 22:13     ` Dennis Zhou
2017-07-16  2:23 ` [PATCH 10/10] percpu: add optimizations on allocation path for the " Dennis Zhou
2017-07-17 23:32   ` Tejun Heo
2017-07-18 19:15 ` [PATCH 00/10] percpu: replace percpu area map allocator with " Josef Bacik
2017-07-24 21:14   ` Dennis Zhou