Re: [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator

From: Dennis Zhou <dennisz@fb.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@linux.com>,
	kernel-team@fb.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Dennis Zhou <dennisszhou@gmail.com>
Subject: Re: [PATCH 00/10] percpu: replace percpu area map allocator with bitmap allocator
Date: Mon, 24 Jul 2017 17:14:41 -0400	[thread overview]
Message-ID: <20170724211440.GD91613@dennisz-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <20170718191527.GA4009@destiny>

Hi Josef,

On Tue, Jul 18, 2017 at 03:15:28PM -0400, Josef Bacik wrote:

> Ok so you say that this test is better for the area map allocator, so presumably
> this is worst case for the bitmap allocator?  What does the average case look
> like?

The bitmap allocator should perform relatively consistent in most
allocation schemes. The worst case would be where constant scanning is
required like the allocation scenario of 4-bytes with 8-byte alignment.
As far as average case, I would expect similar performance on average.
These are updated values with v2.

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        310      |     4770
             16B    |        557      |     1325
             64B    |        436      |      273
            256B    |        776      |      131
           1024B    |       3280      |      122

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        490      |       70
             16B    |        515      |       75
             64B    |        610      |       80
            256B    |        950      |      100
           1024B    |       3520      |      200

> Trading 2x allocation latency for a pretty significant free latency
> reduction seems ok, but are we allocating or freeing more?  Are both allocations
> and free's done in performance critical areas, or do allocations only happen in
> performance critical areas, making the increased allocation latency hurt more?

I unfortunately do not have any data to support whether we are
allocating or freeing more in critical areas. I would defer use case
questions to those consuming percpu memory. There is one issue that
stood out though with BPF that is causing the cpu to hang due to the
percpu allocator. This leads me to believe that freeing is of more
concern. I would believe that if allocation latency is a major problem,
it is easier for the user to preallocate than it is for them to manage
delayed freeing.

> And lastly, why are we paying a 2x latency cost?  What is it about the bitmap
> allocator that makes it much worse than the area map allocator?

So it turns out v1 was making poor use of control logic and the
compiler wasn't doing me any favors. I've since done a rather large
refactor with v2 which shows better performance overall. The bitmap
allocator is paying more in that it has to manage metadata that the area
map allocator does not. In the optimal case, the area map allocator is
just an append without any heavy operations.

A worse performing scenario for the area map allocator is when the
second half of a chunk is occupied by a lot of small allocations. In
that case, each allocation has to memmove the rest of the area map. This
scenario is nontrivial to reproduce, so I provide data below for a
similar scenario where the second half of each page is filled.

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |       4118      |     4892
             16B    |       1651      |     1163
             64B    |        598      |      285
            256B    |        771      |      158
           1024B    |       3034      |      160

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        481      |       67
             16B    |        506      |       69
             64B    |        636      |       75
            256B    |        892      |       90
           1024B    |       3262      |      147

This shows that in less ideal chunk states, the allocation path can fail
just like the freeing path can with non-ideal deallocation patterns.

Thanks,
Dennis

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>