Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* Raw spinlocks and memory allocation
@ 2020-07-30 23:12 Paul E. McKenney
  2020-07-31 20:38 ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2020-07-30 23:12 UTC (permalink / raw)
  To: cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: hannes, willy, urezki, linux-kernel, linux-mm

Hello!

We have an interesting issue involving interactions between RCU,
memory allocation, and "raw atomic" contexts.  The most attractive
solution to this issue requires adding a new GFP_ flag.  Perhaps this
is a big ask, but on the other hand, the benefit is a large reduction
in linked-list-induced cache misses when invoking RCU callbacks.

For more details, please read on!

Examples of raw atomic contexts include disabled hardware interrupts
(that is, a hardware irq handler rather than a threaded irq handler),
code holding a raw_spinlock_t, and code with preemption disabled (but
only in cases where -rt cannot safely map it to disabled migration).

It turns out that call_rcu() is already invoked from raw atomic contexts,
and we therefore anticipate that kfree_rcu() will also be at some point.

This matters due to recent work to fix a weakness in both call_rcu()
and kfree_rcu() that was pointed out long ago by Christoph Lameter,
among others.  The weakness is that RCU traverses linked callback lists
when invoking those callbacks.  Because the just-ended grace period will
have rendered these lists cache-cold, this results in an expensive cache
miss on each and every callback invocation.  Uladzislau Rezki (CCed) has
recently produced patches for kfree_rcu() that instead store pointers
to callbacks in arrays, so that callback invocation can step through
the array using the kfree_bulk() interface.  This greatly reducing the
number of cache misses.  The benefits are not subtle:

https://lore.kernel.org/lkml/20191231122241.5702-1-urezki@gmail.com/

Of course, the arrays have to come from somewhere, and that somewhere
is the memory allocator.  Yes, memory allocation can fail, but in that
rare case, kfree_rcu() just falls back to the old approach, taking a
few extra cache misses, but making good (if expensive) forward progress.

This works well until someone invokes kfree_rcu() with a raw spinlock
held.  Even that works fine unless the memory allocator has exhausted
its caches, at which point it will acquire a normal spinlock.  In kernels
built with CONFIG_PROVE_RAW_LOCK_NESTING=y this will result in a lockdep
splat.  Worse yet, in -rt kernels, this can result in scheduling while
atomic.

So, may we add a GFP_ flag that will cause kmalloc() and friends to return
NULL when they would otherwise need to acquire their non-raw spinlock?
This avoids adding any overhead to the slab-allocator fastpaths, but
allows callback invocation to reduce cache misses without having to
restructure some existing callers of call_rcu() and potential future
callers of kfree_rcu().

Thoughts?

							Thanx, Paul

PS.  Other avenues investigated:

o	Just don't invoke kmalloc() when kfree_rcu() is invoked
	from raw atomic contexts.  The problem with this is that
	there is no way to detect raw atomic contexts in production
	kernels built with CONFIG_PREEMPT=n.  Adding means to detect
	this would increase overhead on numerous fastpaths.

o	Just say "no" to invoking call_rcu() and kfree_rcu() from
	raw atomic contexts.  This would require that the affected
	call_rcu() and kfree_rcu() invocations be deferred.  This is
	in theory simple, but can get quite messy, and often requires
	fallbacks such as timers that can degrade energy efficiency and
	realtime response.

o	Provide a different non-allocating API such as kfree_rcu_raw()
	and call_rcu_raw() that are used from raw atomic contexts and also
	on memory-allocation failure from kfree_rcu() and call_rcu().
	This results in unconditional callback-invocation cache misses
	for calls from raw contexts, including for common code that is
	only occasionally invoked from raw atomic contexts.  This approach
	also forces developers to worry about two more RCU API members.

o	Move the memory allocator's spinlocks to raw_spinlock_t.
	This would be bad for realtime response, and would likely require
	even more conversions when the allocator invokes other subsystems
	that also use non-raw spinlocks.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-30 23:12 Raw spinlocks and memory allocation Paul E. McKenney
@ 2020-07-31 20:38 ` Andrew Morton
  2020-07-31 20:48   ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2020-07-31 20:38 UTC (permalink / raw)
  To: paulmck
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, hannes, willy, urezki,
	linux-kernel, linux-mm

On Thu, 30 Jul 2020 16:12:05 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:

> So, may we add a GFP_ flag that will cause kmalloc() and friends to return
> NULL when they would otherwise need to acquire their non-raw spinlock?
> This avoids adding any overhead to the slab-allocator fastpaths, but
> allows callback invocation to reduce cache misses without having to
> restructure some existing callers of call_rcu() and potential future
> callers of kfree_rcu().

We have eight free gfp_t bits so that isn't a problem.

Adding a test-n-branch to the kmalloc() fastpath may well be a concern.

Which of mm/sl?b.c are affected?

A doesnt-need-to-really-work protopatch would help us understand the
potential cost?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-31 20:38 ` Andrew Morton
@ 2020-07-31 20:48   ` Paul E. McKenney
  2020-07-31 20:59     ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2020-07-31 20:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, hannes, willy, urezki,
	linux-kernel, linux-mm

On Fri, Jul 31, 2020 at 01:38:34PM -0700, Andrew Morton wrote:
> On Thu, 30 Jul 2020 16:12:05 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > So, may we add a GFP_ flag that will cause kmalloc() and friends to return
> > NULL when they would otherwise need to acquire their non-raw spinlock?
> > This avoids adding any overhead to the slab-allocator fastpaths, but
> > allows callback invocation to reduce cache misses without having to
> > restructure some existing callers of call_rcu() and potential future
> > callers of kfree_rcu().
> 
> We have eight free gfp_t bits so that isn't a problem.

Whew!!!  ;-)

> Adding a test-n-branch to the kmalloc() fastpath may well be a concern.
> 
> Which of mm/sl?b.c are affected?

None of them, it turns out.  The initial patch will instead directly
invoke __get_free_page().  So we could just leave sl?b.c alone.

> A doesnt-need-to-really-work protopatch would help us understand the
> potential cost?

Makes sense!  My guess is that Uladzislau Rezki (CCed) will be sending
one along by the middle of next week.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-31 20:48   ` Paul E. McKenney
@ 2020-07-31 20:59     ` Matthew Wilcox
  2020-07-31 21:24       ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2020-07-31 20:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, cl, penberg, rientjes, iamjoonsoo.kim, hannes,
	urezki, linux-kernel, linux-mm

On Fri, Jul 31, 2020 at 01:48:55PM -0700, Paul E. McKenney wrote:
> On Fri, Jul 31, 2020 at 01:38:34PM -0700, Andrew Morton wrote:
> > On Thu, 30 Jul 2020 16:12:05 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > 
> > > So, may we add a GFP_ flag that will cause kmalloc() and friends to return
> > > NULL when they would otherwise need to acquire their non-raw spinlock?
> > > This avoids adding any overhead to the slab-allocator fastpaths, but
> > > allows callback invocation to reduce cache misses without having to
> > > restructure some existing callers of call_rcu() and potential future
> > > callers of kfree_rcu().
> > 
> > We have eight free gfp_t bits so that isn't a problem.
> 
> Whew!!!  ;-)
> 
> > Adding a test-n-branch to the kmalloc() fastpath may well be a concern.
> > 
> > Which of mm/sl?b.c are affected?
> 
> None of them, it turns out.  The initial patch will instead directly
> invoke __get_free_page().  So we could just leave sl?b.c alone.

Isn't that spelled GFP_NOWAIT?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-31 20:59     ` Matthew Wilcox
@ 2020-07-31 21:24       ` Paul E. McKenney
  2020-07-31 21:29         ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2020-07-31 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, cl, penberg, rientjes, iamjoonsoo.kim, hannes,
	urezki, linux-kernel, linux-mm

On Fri, Jul 31, 2020 at 09:59:33PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 01:48:55PM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 31, 2020 at 01:38:34PM -0700, Andrew Morton wrote:
> > > On Thu, 30 Jul 2020 16:12:05 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > 
> > > > So, may we add a GFP_ flag that will cause kmalloc() and friends to return
> > > > NULL when they would otherwise need to acquire their non-raw spinlock?
> > > > This avoids adding any overhead to the slab-allocator fastpaths, but
> > > > allows callback invocation to reduce cache misses without having to
> > > > restructure some existing callers of call_rcu() and potential future
> > > > callers of kfree_rcu().
> > > 
> > > We have eight free gfp_t bits so that isn't a problem.
> > 
> > Whew!!!  ;-)
> > 
> > > Adding a test-n-branch to the kmalloc() fastpath may well be a concern.
> > > 
> > > Which of mm/sl?b.c are affected?
> > 
> > None of them, it turns out.  The initial patch will instead directly
> > invoke __get_free_page().  So we could just leave sl?b.c alone.
> 
> Isn't that spelled GFP_NOWAIT?

I don't think so in the current kernel, though I might be confused.

The problem we are having isn't waiting, but rather normal spinlock_t
acquisition.  This does not count as waiting in !CONFIG_PREEMPT_RT
kernels, and so there are code paths that acquire the non-raw zone_lock
in rmqueue_bulk() even in the GFP_NOWAIT case.  Because kfree_rcu()
and call_rcu() and their callers might hold raw spinlocks, acquiring a
non-raw spinlock is forbidden for them and for anything that they call,
directly or indirectly.

The reason for this restriction is that in -rt, the spin_lock(&zone->lock)
in rmqueue_bulk() can sleep.  This conversion of non-raw spinlocks
to sleeplocks is part of how -rt reduces scheduling latency.  Because
acquiring a raw spinlock disables preemption (even in -rt), acquiring
a non-raw spinlock while holding a raw spinlock gets you "scheduling
while atomic" in -rt.  And it will get you lockdep complaints in all
kernels, not just -rt, when CONFIG_PROVE_RAW_LOCK_NESTING is enabled.
And my guess is that CONFIG_PROVE_RAW_LOCK_NESTING=y will become the
default sooner rather than later.

But you are right that yet another approach might be modifying the
GFP_NOWAIT handling so that it avoided acquiring non-raw spinlocks.
However, evaluating that option requires quite a bit more knowledge of
MM than I have!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-31 21:24       ` Paul E. McKenney
@ 2020-07-31 21:29         ` Andrew Morton
  2020-07-31 22:30           ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2020-07-31 21:29 UTC (permalink / raw)
  To: paulmck
  Cc: Matthew Wilcox, cl, penberg, rientjes, iamjoonsoo.kim, hannes,
	urezki, linux-kernel, linux-mm

On Fri, 31 Jul 2020 14:24:57 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:

> The reason for this restriction is that in -rt, the spin_lock(&zone->lock)
> in rmqueue_bulk() can sleep.

So if there is runtime overhead, this overhead could be restricted to
-rt kernels with suitable ifdefs?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-31 21:29         ` Andrew Morton
@ 2020-07-31 22:30           ` Paul E. McKenney
  2020-07-31 22:47             ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2020-07-31 22:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, cl, penberg, rientjes, iamjoonsoo.kim, hannes,
	urezki, linux-kernel, linux-mm

On Fri, Jul 31, 2020 at 02:29:19PM -0700, Andrew Morton wrote:
> On Fri, 31 Jul 2020 14:24:57 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > The reason for this restriction is that in -rt, the spin_lock(&zone->lock)
> > in rmqueue_bulk() can sleep.
> 
> So if there is runtime overhead, this overhead could be restricted to
> -rt kernels with suitable ifdefs?

In theory, yes.  In practice, with CONFIG_PROVE_RAW_LOCK_NESTING=y,
lockdep will complain regardless of -rt or not.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-31 22:30           ` Paul E. McKenney
@ 2020-07-31 22:47             ` Matthew Wilcox
  2020-08-01  0:02               ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2020-07-31 22:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, cl, penberg, rientjes, iamjoonsoo.kim, hannes,
	urezki, linux-kernel, linux-mm

On Fri, Jul 31, 2020 at 03:30:16PM -0700, Paul E. McKenney wrote:
> On Fri, Jul 31, 2020 at 02:29:19PM -0700, Andrew Morton wrote:
> > On Fri, 31 Jul 2020 14:24:57 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > 
> > > The reason for this restriction is that in -rt, the spin_lock(&zone->lock)
> > > in rmqueue_bulk() can sleep.
> > 
> > So if there is runtime overhead, this overhead could be restricted to
> > -rt kernels with suitable ifdefs?
> 
> In theory, yes.  In practice, with CONFIG_PROVE_RAW_LOCK_NESTING=y,
> lockdep will complain regardless of -rt or not.

On non-RT, we could make that lock a raw spinlock.  On RT, we could
decline to take the lock.  We'd need to abstract the spin_lock() away
behind zone_lock(zone), but that should be OK.

But let's see if we need to do that.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Raw spinlocks and memory allocation
  2020-07-31 22:47             ` Matthew Wilcox
@ 2020-08-01  0:02               ` Paul E. McKenney
  0 siblings, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2020-08-01  0:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, cl, penberg, rientjes, iamjoonsoo.kim, hannes,
	urezki, linux-kernel, linux-mm

On Fri, Jul 31, 2020 at 11:47:38PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 03:30:16PM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 31, 2020 at 02:29:19PM -0700, Andrew Morton wrote:
> > > On Fri, 31 Jul 2020 14:24:57 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > 
> > > > The reason for this restriction is that in -rt, the spin_lock(&zone->lock)
> > > > in rmqueue_bulk() can sleep.
> > > 
> > > So if there is runtime overhead, this overhead could be restricted to
> > > -rt kernels with suitable ifdefs?
> > 
> > In theory, yes.  In practice, with CONFIG_PROVE_RAW_LOCK_NESTING=y,
> > lockdep will complain regardless of -rt or not.
> 
> On non-RT, we could make that lock a raw spinlock.  On RT, we could
> decline to take the lock.  We'd need to abstract the spin_lock() away
> behind zone_lock(zone), but that should be OK.
> 
> But let's see if we need to do that.

Fair enough!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-30 23:12 Raw spinlocks and memory allocation Paul E. McKenney
2020-07-31 20:38 ` Andrew Morton
2020-07-31 20:48   ` Paul E. McKenney
2020-07-31 20:59     ` Matthew Wilcox
2020-07-31 21:24       ` Paul E. McKenney
2020-07-31 21:29         ` Andrew Morton
2020-07-31 22:30           ` Paul E. McKenney
2020-07-31 22:47             ` Matthew Wilcox
2020-08-01  0:02               ` Paul E. McKenney

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git