linux-mm.kvack.org archive mirror
* Re: [PATCH RFC} io_uring: io_kiocb alloc cache
       [not found] <492bb956-a670-8730-a35f-1d878c27175f@kernel.dk>
@ 2020-05-13 17:42 ` Jann Horn
  2020-05-13 18:34   ` Jens Axboe
  2020-05-13 19:20   ` Pekka Enberg
  0 siblings, 2 replies; 6+ messages in thread
From: Jann Horn @ 2020-05-13 17:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, Xiaoguang Wang, joseph qi, Jiufei Xue, Pavel Begunkov,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Linux-MM

+slab allocator people

On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@kernel.dk> wrote:
> I turned the quick'n dirty from the other day into something a bit
> more done. Would be great if someone else could run some performance
> testing with this, I get about a 10% boost on the pure NOP benchmark
> with this. But that's just on my laptop in qemu, so some real iron
> testing would be awesome.

10% boost compared to which allocator? Are you using CONFIG_SLUB?

> The idea here is to have a percpu alloc cache. There's two sets of
> state:
>
> 1) Requests that have IRQ completion. preempt disable is not enough
>    there, we need to disable local irqs. This is a lot slower in
>    certain setups, so we keep this separate.
>
> 2) No IRQ completion, we can get by with just disabling preempt.
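
So if I'm reading the RFC right, the fast path boils down to popping an
io_kiocb off a per-cpu list under preempt_disable() (or local_irq_save()
for the IRQ-completion side). Roughly, paraphrasing from the description
rather than from your actual code (the helper and per-cpu variable names
here are made up):

        static struct io_kiocb *io_alloc_req_cached(void)
        {
                struct io_kiocb_cache *ce;
                struct io_kiocb *req = NULL;

                preempt_disable();
                ce = this_cpu_ptr(&io_kiocb_cache);
                if (!list_empty(&ce->alloc_list)) {
                        req = list_first_entry(&ce->alloc_list,
                                               struct io_kiocb, list);
                        list_del(&req->list);
                }
                preempt_enable();
                return req;
        }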

The SLUB allocator has percpu caching, too, and as long as you don't
enable any SLUB debugging or ASAN or such, and you're not hitting any
slowpath processing, it doesn't even have to disable interrupts, it
gets away with cmpxchg_double.
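
For reference, the alloc fastpath in slab_alloc_node() is essentially
just this (trimmed down from memory, so details may be slightly off):

        object = c->freelist;
        page = c->page;
        if (unlikely(!object || !node_match(page, node))) {
                object = __slab_alloc(s, gfpflags, node, addr, c);
        } else {
                void *next = get_freepointer_safe(s, object);

                if (unlikely(!this_cpu_cmpxchg_double(
                                s->cpu_slab->freelist, s->cpu_slab->tid,
                                object, tid,
                                next, next_tid(tid))))
                        goto redo;      /* raced, retry */
        }

No irq disabling, no locking - the tid is what catches preemption or
migration between reading the freelist and doing the cmpxchg.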

Have you profiled what the actual problem is when using SLUB? Have you
tested with CONFIG_SLAB_FREELIST_HARDENED turned off,
CONFIG_SLUB_DEBUG turned off, CONFIG_TRACING turned off,
CONFIG_FAILSLAB turned off, and so on? As far as I know, if you
disable all hardening and debugging infrastructure, SLUB's
kmem_cache_alloc()/kmem_cache_free() on the fastpaths should be really
straightforward. And if you don't turn those off, the comparison is
kinda unfair, because your custom freelist won't respect those flags.

When you build custom allocators like this, it interferes with
infrastructure meant to catch memory safety issues and such (both pure
debugging code and safety checks meant for production use) - for
example, ASAN and memory tagging will no longer be able to detect
use-after-free issues in objects managed by your custom allocator
cache.
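
To make that concrete: with KASAN, kmem_cache_free() poisons the object
(and can put it in the quarantine), so a later use-after-free access on
it gets flagged immediately. An io_kiocb parked on ce->alloc_list is
still a live object as far as KASAN is concerned, so a stale pointer to
it just silently reads whatever request gets recycled into that slot.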

So please, don't implement custom one-off allocators in random
subsystems. And if you do see a way to actually improve the
performance of memory allocation, add that to the generic SLUB
infrastructure.

> Outside of that, any freed requests goes to the ce->alloc_list.
> Attempting to alloc a request will check there first. When freeing
> a request, if we're over some threshold, move requests to the
> ce->free_list. This list can be browsed by the shrinker to free
> up memory. If a CPU goes offline, all requests are reaped.
>
> That's about it. If we go further with this, it'll be split into
> a few separate patches. For now, just throwing this out there
> for testing. The patch is against my for-5.8/io_uring branch.

That branch doesn't seem to exist on
<https://git.kernel.dk/cgit/linux-block/>...



* Re: [PATCH RFC} io_uring: io_kiocb alloc cache
  2020-05-13 17:42 ` [PATCH RFC} io_uring: io_kiocb alloc cache Jann Horn
@ 2020-05-13 18:34   ` Jens Axboe
  2020-05-13 19:20   ` Pekka Enberg
  1 sibling, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2020-05-13 18:34 UTC (permalink / raw)
  To: Jann Horn
  Cc: io-uring, Xiaoguang Wang, joseph qi, Jiufei Xue, Pavel Begunkov,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Linux-MM

On 5/13/20 11:42 AM, Jann Horn wrote:
> +slab allocator people
> 
> On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@kernel.dk> wrote:
>> I turned the quick'n dirty from the other day into something a bit
>> more done. Would be great if someone else could run some performance
>> testing with this, I get about a 10% boost on the pure NOP benchmark
>> with this. But that's just on my laptop in qemu, so some real iron
>> testing would be awesome.
> 
> 10% boost compared to which allocator? Are you using CONFIG_SLUB?

SLUB, yes.

>> The idea here is to have a percpu alloc cache. There's two sets of
>> state:
>>
>> 1) Requests that have IRQ completion. preempt disable is not enough
>>    there, we need to disable local irqs. This is a lot slower in
>>    certain setups, so we keep this separate.
>>
>> 2) No IRQ completion, we can get by with just disabling preempt.
> 
> The SLUB allocator has percpu caching, too, and as long as you don't
> enable any SLUB debugging or ASAN or such, and you're not hitting any
> slowpath processing, it doesn't even have to disable interrupts, it
> gets away with cmpxchg_double.
> 
> Have you profiled what the actual problem is when using SLUB? Have you
> tested with CONFIG_SLAB_FREELIST_HARDENED turned off,
> CONFIG_SLUB_DEBUG turned off, CONFIG_TRACING turned off,
> CONFIG_FAILSLAB turned off, and so on? As far as I know, if you
> disable all hardening and debugging infrastructure, SLUB's
> kmem_cache_alloc()/kmem_cache_free() on the fastpaths should be really
> straightforward. And if you don't turn those off, the comparison is
> kinda unfair, because your custom freelist won't respect those flags.

But that's sort of the point. I don't have any nasty SLUB options
enabled, just the default. And that includes CONFIG_SLUB_DEBUG, which
all the distros have enabled, I believe.

So yes, I could compare to a bare-bones SLUB, and I'll definitely do
that because I'm curious. It could also be an artifact of qemu, which
sometimes behaves differently than a real host (locks/irqs are more
expensive, for example). Not sure how much that applies to SLUB in
particular; I haven't done targeted benchmarking of that.

The patch is just tossed out there for experimentation reasons, in case
it wasn't clear. It's not like I'm proposing this for inclusion. But if
the wins are big enough over a _normal_ configuration, then it's
definitely tempting.

> When you build custom allocators like this, it interferes with
> infrastructure meant to catch memory safety issues and such (both pure
> debugging code and safety checks meant for production use) - for
> example, ASAN and memory tagging will no longer be able to detect
> use-after-free issues in objects managed by your custom allocator
> cache.
> 
> So please, don't implement custom one-off allocators in random
> subsystems. And if you do see a way to actually improve the
> performance of memory allocation, add that to the generic SLUB
> infrastructure.

I hear you. This isn't unique, fwiw. Networking has a page pool
allocator for example, which I did consider tapping into.

Anyway, I/we will be a lot wiser once this experiment progresses!

>> Outside of that, any freed requests goes to the ce->alloc_list.
>> Attempting to alloc a request will check there first. When freeing
>> a request, if we're over some threshold, move requests to the
>> ce->free_list. This list can be browsed by the shrinker to free
>> up memory. If a CPU goes offline, all requests are reaped.
>>
>> That's about it. If we go further with this, it'll be split into
>> a few separate patches. For now, just throwing this out there
>> for testing. The patch is against my for-5.8/io_uring branch.
> 
> That branch doesn't seem to exist on
> <https://git.kernel.dk/cgit/linux-block/>...

Oh oops, guess I never pushed that out. Will do so.

-- 
Jens Axboe




* Re: [PATCH RFC} io_uring: io_kiocb alloc cache
  2020-05-13 17:42 ` [PATCH RFC} io_uring: io_kiocb alloc cache Jann Horn
  2020-05-13 18:34   ` Jens Axboe
@ 2020-05-13 19:20   ` Pekka Enberg
  2020-05-13 20:09     ` Jens Axboe
  1 sibling, 1 reply; 6+ messages in thread
From: Pekka Enberg @ 2020-05-13 19:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, io-uring, Xiaoguang Wang, joseph qi, Jiufei Xue,
	Pavel Begunkov, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Linux-MM


Hi,

On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@kernel.dk> wrote:
> > I turned the quick'n dirty from the other day into something a bit 
> > more done. Would be great if someone else could run some
> > performance testing with this, I get about a 10% boost on the pure
> > NOP benchmark with this. But that's just on my laptop in qemu, so
> > some real iron testing would be awesome.

On 5/13/20 8:42 PM, Jann Horn wrote:
> +slab allocator people
> 10% boost compared to which allocator? Are you using CONFIG_SLUB?
 
On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@kernel.dk> wrote:
> > The idea here is to have a percpu alloc cache. There's two sets of 
> > state:
> > 
> > 1) Requests that have IRQ completion. preempt disable is not
> > enough there, we need to disable local irqs. This is a lot slower
> > in certain setups, so we keep this separate.
> > 
> > 2) No IRQ completion, we can get by with just disabling preempt.

On 5/13/20 8:42 PM, Jann Horn wrote:
> +slab allocator people
> The SLUB allocator has percpu caching, too, and as long as you don't 
> enable any SLUB debugging or ASAN or such, and you're not hitting
> any slowpath processing, it doesn't even have to disable interrupts,
> it gets away with cmpxchg_double.

The struct io_kiocb is 240 bytes. I don't see a dedicated slab for it in
/proc/slabinfo on my machine, so it likely got merged to the kmalloc-256
cache. This means that there's 32 objects in the per-CPU cache. Jens, on
the other hand, made the cache much bigger:

+#define IO_KIOCB_CACHE_MAX 256

So I assume if someone does "perf record", they will see significant
reduction in page allocator activity with Jens' patch. One possible way
around that is forcing the page allocation order to be much higher. IOW,
something like the following completely untested patch:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 979d9f977409..c3bf7b72026d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8143,7 +8143,7 @@ static int __init io_uring_init(void)
 
 	BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
 	BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
-	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_LARGE_ORDER);
 	return 0;
 };
 __initcall(io_uring_init);
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6d454886bcaf..316fd821ec1f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -39,6 +39,8 @@
 #define SLAB_STORE_USER		((slab_flags_t __force)0x00010000U)
 /* Panic if kmem_cache_create() fails */
 #define SLAB_PANIC		((slab_flags_t __force)0x00040000U)
+/* Force slab page allocation order to be as large as possible */
+#define SLAB_LARGE_ORDER	((slab_flags_t __force)0x00080000U)
 /*
  * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
  *
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 23c7500eea7d..a18bbe9472e4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -51,7 +51,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
  */
 #define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
 		SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
-		SLAB_FAILSLAB | SLAB_KASAN)
+		SLAB_FAILSLAB | SLAB_KASAN | SLAB_LARGE_ORDER)
 
 #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
diff --git a/mm/slub.c b/mm/slub.c
index b762450fc9f0..d1d86b1279aa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3318,12 +3318,15 @@ static inline unsigned int slab_order(unsigned int size,
 	return order;
 }
 
-static inline int calculate_order(unsigned int size)
+static inline int calculate_order(unsigned int size, gfp_t flags)
 {
 	unsigned int order;
 	unsigned int min_objects;
 	unsigned int max_objects;
 
+	if (flags & SLAB_LARGE_ORDER)
+		return slub_max_order;
+
 	/*
 	 * Attempt to find best configuration for a slab. This
 	 * works by first attempting to generate a layout with
@@ -3651,7 +3654,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	if (forced_order >= 0)
 		order = forced_order;
 	else
-		order = calculate_order(size);
+		order = calculate_order(size, flags);
 
 	if ((int)order < 0)
 		return 0;
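
Back-of-the-envelope, assuming 4K pages and the default slub_max_order
of 3: kmalloc-256 currently sits on order-1 slabs, i.e. 8192 / 256 = 32
objects per slab, and forcing order 3 would give 32768 / 256 = 128
objects, so the per-CPU slab would absorb four times as many allocations
before falling back to the page allocator.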

- Pekka



* Re: [PATCH RFC} io_uring: io_kiocb alloc cache
  2020-05-13 19:20   ` Pekka Enberg
@ 2020-05-13 20:09     ` Jens Axboe
  2020-05-13 20:31       ` Pekka Enberg
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2020-05-13 20:09 UTC (permalink / raw)
  To: Pekka Enberg, Jann Horn
  Cc: io-uring, Xiaoguang Wang, joseph qi, Jiufei Xue, Pavel Begunkov,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Linux-MM

On 5/13/20 1:20 PM, Pekka Enberg wrote:
> 
> Hi,
> 
> On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> I turned the quick'n dirty from the other day into something a bit 
>>> more done. Would be great if someone else could run some
>>> performance testing with this, I get about a 10% boost on the pure
>>> NOP benchmark with this. But that's just on my laptop in qemu, so
>>> some real iron testing would be awesome.
> 
> On 5/13/20 8:42 PM, Jann Horn wrote:
>> +slab allocator people
>> 10% boost compared to which allocator? Are you using CONFIG_SLUB?
>  
> On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> The idea here is to have a percpu alloc cache. There's two sets of 
>>> state:
>>>
>>> 1) Requests that have IRQ completion. preempt disable is not
>>> enough there, we need to disable local irqs. This is a lot slower
>>> in certain setups, so we keep this separate.
>>>
>>> 2) No IRQ completion, we can get by with just disabling preempt.
> 
> On 5/13/20 8:42 PM, Jann Horn wrote:
>> +slab allocator people
>> The SLUB allocator has percpu caching, too, and as long as you don't 
>> enable any SLUB debugging or ASAN or such, and you're not hitting
>> any slowpath processing, it doesn't even have to disable interrupts,
>> it gets away with cmpxchg_double.
> 
> The struct io_kiocb is 240 bytes. I don't see a dedicated slab for it in
> /proc/slabinfo on my machine, so it likely got merged to the kmalloc-256
> cache. This means that there's 32 objects in the per-CPU cache. Jens, on
> the other hand, made the cache much bigger:

Right, it gets merged with kmalloc-256 (and 5 others) in my testing.

> +#define IO_KIOCB_CACHE_MAX 256
> 
> So I assume if someone does "perf record", they will see significant
> reduction in page allocator activity with Jens' patch. One possible way
> around that is forcing the page allocation order to be much higher. IOW,
> something like the following completely untested patch:

Now tested, I gave it a shot. This seems to bring performance to
basically what the io_uring patch does, so that's great! Again, just in
the microbenchmark test case, so freshly booted and just running the
case.

Will this patch introduce latencies or non-deterministic behavior for a
fragmented system?

-- 
Jens Axboe




* Re: [PATCH RFC} io_uring: io_kiocb alloc cache
  2020-05-13 20:09     ` Jens Axboe
@ 2020-05-13 20:31       ` Pekka Enberg
  2020-05-13 20:44         ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Pekka Enberg @ 2020-05-13 20:31 UTC (permalink / raw)
  To: Jens Axboe, Jann Horn
  Cc: io-uring, Xiaoguang Wang, joseph qi, Jiufei Xue, Pavel Begunkov,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Linux-MM

Hi Jens,

On 5/13/20 1:20 PM, Pekka Enberg wrote:
>> So I assume if someone does "perf record", they will see significant
>> reduction in page allocator activity with Jens' patch. One possible way
>> around that is forcing the page allocation order to be much higher. IOW,
>> something like the following completely untested patch:

On 5/13/20 11:09 PM, Jens Axboe wrote:
> Now tested, I gave it a shot. This seems to bring performance to
> basically what the io_uring patch does, so that's great! Again, just in
> the microbenchmark test case, so freshly booted and just running the
> case.

Great, thanks for testing!

On 5/13/20 11:09 PM, Jens Axboe wrote:
> Will this patch introduce latencies or non-deterministic behavior for a
> fragmented system?

You have to talk to someone who is more up-to-date with how the page 
allocator operates today. But yeah, I assume people still want to avoid 
higher-order allocations as much as possible, because they make 
allocation harder when memory is fragmented.

That said, perhaps it's not going to the page allocator as much as I 
thought, but the problem is that the per-CPU cache size is just too small 
for these allocations, forcing do_slab_free() to take the slow path 
often. Would be interesting to know if CONFIG_SLAB does better here 
because the per-CPU cache size is much larger IIRC.
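
The check in do_slab_free() is basically this (simplified, from memory):

        if (likely(page == c->page)) {
                /* object belongs to the current per-CPU slab: lockless push */
                set_freepointer(s, tail_obj, c->freelist);
                if (unlikely(!this_cpu_cmpxchg_double(
                                s->cpu_slab->freelist, s->cpu_slab->tid,
                                c->freelist, tid,
                                head, next_tid(tid))))
                        goto redo;
        } else {
                __slab_free(s, page, head, tail_obj, cnt, addr);
        }

With only 32 objects per slab, I suspect that page == c->page test fails 
a lot under this benchmark, and every miss goes through __slab_free(), 
which has to cmpxchg the page freelist and may take the node list_lock.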

- Pekka



* Re: [PATCH RFC} io_uring: io_kiocb alloc cache
  2020-05-13 20:31       ` Pekka Enberg
@ 2020-05-13 20:44         ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2020-05-13 20:44 UTC (permalink / raw)
  To: Pekka Enberg, Jann Horn
  Cc: io-uring, Xiaoguang Wang, joseph qi, Jiufei Xue, Pavel Begunkov,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, Linux-MM

On 5/13/20 2:31 PM, Pekka Enberg wrote:
> Hi Jens,
> 
> On 5/13/20 1:20 PM, Pekka Enberg wrote:
>>> So I assume if someone does "perf record", they will see significant
>>> reduction in page allocator activity with Jens' patch. One possible way
>>> around that is forcing the page allocation order to be much higher. IOW,
>>> something like the following completely untested patch:
> 
> On 5/13/20 11:09 PM, Jens Axboe wrote:
>> Now tested, I gave it a shot. This seems to bring performance to
>> basically what the io_uring patch does, so that's great! Again, just in
>> the microbenchmark test case, so freshly booted and just running the
>> case.
> 
> Great, thanks for testing!
> 
> On 5/13/20 11:09 PM, Jens Axboe wrote:
>> Will this patch introduce latencies or non-deterministic behavior for a
>> fragmented system?
> 
> You have to talk to someone who is more up-to-date with how the page 
> allocator operates today. But yeah, I assume people still want to avoid 
> higher-order allocations as much as possible, because they make 
> allocation harder when memory is fragmented.

That was my thinking... I don't want a random io_kiocb allocation to
take a long time because of high order allocations.

> That said, perhaps it's not going to the page allocator as much as I 
> thought, but the problem is that the per-CPU cache size is just too small 
> for these allocations, forcing do_slab_free() to take the slow path 
> often. Would be interesting to know if CONFIG_SLAB does better here 
> because the per-CPU cache size is much larger IIRC.

Just tried with SLAB, and it's roughly 4-5% down from the baseline
(non-modified) SLUB. So not faster, at least for this case.

-- 
Jens Axboe



