[01/16] rcu/tree: Add a work to allocate pages from regular context

Message ID 20201029165019.14218-1-urezki@gmail.com
State New
Series
  • [01/16] rcu/tree: Add a work to allocate pages from regular context

Commit Message

Uladzislau Rezki Oct. 29, 2020, 4:50 p.m. UTC
The current memory-allocation interface presents the following
difficulties that this patch is designed to overcome:

a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, lockdep will
   complain about a violation ("BUG: Invalid wait context") of the
   nesting rules. It performs raw_spinlock vs. spinlock nesting
   checks, i.e. it is not legal to acquire a spinlock_t while
   holding a raw_spinlock_t.

   Internally kfree_rcu() uses raw_spinlock_t, whereas the
   "page allocator" internally deals with spinlock_t to access
   its zones. The code can also be broken from a higher-level
   point of view:
   <snip>
       raw_spin_lock(&some_lock);
       kfree_rcu(some_pointer, some_field_offset);
   <snip>

b) If built with CONFIG_PREEMPT_RT. Please note that in this case spinlock_t
   is converted into a sleepable variant. Invoking the page allocator from
   atomic contexts then leads to "BUG: scheduling while atomic".

c) call_rcu() is invoked from raw atomic context, and kfree_rcu()
   and kvfree_rcu() are expected to be callable from raw atomic context
   as well.

Move the page allocation out of the contexts that trigger the kvfree_rcu()
function and into a separate worker. When the k[v]free_rcu() per-cpu page
cache is empty, a fallback mechanism is used and a special job is
scheduled to refill the per-cpu cache.
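
In outline, the refill path added by this patch works as follows
(a condensed sketch of the diff below, not the complete code):
<snip>
    /*
     * kvfree_call_rcu()        - raw atomic context; consumes only pages
     *                            that are already in the per-cpu cache.
     * run_page_cache_worker()  - on a cache miss, arms a zero-delay
     *                            hrtimer; no allocation happens here.
     * schedule_page_work_fn()  - hrtimer callback; queues page_cache_work
     *                            on system_highpri_wq.
     * fill_page_cache_func()   - sleepable worker context; allocates up to
     *                            rcu_min_cached_objs pages with
     *                            GFP_KERNEL | __GFP_NOWARN and pushes them
     *                            into the cache under krcp->lock.
     */
<snip>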

Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------
 1 file changed, 66 insertions(+), 43 deletions(-)

Comments

Joel Fernandes Nov. 3, 2020, 3:47 p.m. UTC | #1
On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> The current memmory-allocation interface presents to following
> difficulties that this patch is designed to overcome:
> 
> a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
>    complain about violation("BUG: Invalid wait context") of the
>    nesting rules. It does the raw_spinlock vs. spinlock nesting
>    checks, i.e. it is not legal to acquire a spinlock_t while
>    holding a raw_spinlock_t.
> 
>    Internally the kfree_rcu() uses raw_spinlock_t whereas the
>    "page allocator" internally deals with spinlock_t to access
>    to its zones. The code also can be broken from higher level
>    of view:
>    <snip>
>        raw_spin_lock(&some_lock);
>        kfree_rcu(some_pointer, some_field_offset);
>    <snip>
> 
> b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
>    is converted into sleepable variant. Invoking the page allocator from
>    atomic contexts leads to "BUG: scheduling while atomic".
> 
> c) call_rcu() is invoked from raw atomic context and kfree_rcu()
>    and kvfree_rcu() are expected to be called from atomic raw context
>    as well.
> 
> Move out a page allocation from contexts which trigger kvfree_rcu()
> function to the separate worker. When a k[v]free_rcu() per-cpu page
> cache is empty a fallback mechanism is used and a special job is
> scheduled to refill the per-cpu cache.

Looks good, still reviewing here. BTW just for my education, I was wondering
about Thomas's email:
https://lkml.org/lkml/2020/8/11/939

If slab allocations in pure raw-atomic context on RT are not allowed or
recommended, should kfree_rcu() be allowed?
slab can have the same issue, right? If the per-cpu cache is drained, it has
to allocate a page from the buddy allocator, and there is no GFP flag to tell
it about the context the allocation is happening from.

Or are we saying that we want to support kfree on RT from raw atomic
context, even though kmalloc is not supported? I hate to bring up this
elephant in the room, but since I am a part of the people maintaining this
code, I would rather set some rules than support unsupported
usages. :-\ (Once I know what is supported and what isn't, that is). If indeed
raw atomic kfree_rcu() is a bogus use case because of -RT, then we ought to
put a giant warning there rather than support it :-(.

But I don't stand in the way of this patch if all agree on it. thanks,

 - Joel


> Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
> Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object")
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------
>  1 file changed, 66 insertions(+), 43 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 06895ef85d69..f2da2a1cc716 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444);
>   * per-CPU. Object size is equal to one page. This value
>   * can be changed at boot time.
>   */
> -static int rcu_min_cached_objs = 2;
> +static int rcu_min_cached_objs = 5;
>  module_param(rcu_min_cached_objs, int, 0444);
>  
>  /* Retrieve RCU kthreads priority for rcutorture */
> @@ -3084,6 +3084,9 @@ struct kfree_rcu_cpu_work {
>   *	In order to save some per-cpu space the list is singular.
>   *	Even though it is lockless an access has to be protected by the
>   *	per-cpu lock.
> + * @page_cache_work: A work to refill the cache when it is empty
> + * @work_in_progress: Indicates that page_cache_work is running
> + * @hrtimer: A hrtimer for scheduling a page_cache_work
>   * @nr_bkv_objs: number of allocated objects at @bkvcache.
>   *
>   * This is a per-CPU structure.  The reason that it is not included in
> @@ -3100,6 +3103,11 @@ struct kfree_rcu_cpu {
>  	bool monitor_todo;
>  	bool initialized;
>  	int count;
> +
> +	struct work_struct page_cache_work;
> +	atomic_t work_in_progress;
> +	struct hrtimer hrtimer;
> +
>  	struct llist_head bkvcache;
>  	int nr_bkv_objs;
>  };
> @@ -3217,10 +3225,10 @@ static void kfree_rcu_work(struct work_struct *work)
>  			}
>  			rcu_lock_release(&rcu_callback_map);
>  
> -			krcp = krc_this_cpu_lock(&flags);
> +			raw_spin_lock_irqsave(&krcp->lock, flags);
>  			if (put_cached_bnode(krcp, bkvhead[i]))
>  				bkvhead[i] = NULL;
> -			krc_this_cpu_unlock(krcp, flags);
> +			raw_spin_unlock_irqrestore(&krcp->lock, flags);
>  
>  			if (bkvhead[i])
>  				free_page((unsigned long) bkvhead[i]);
> @@ -3347,6 +3355,57 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  		raw_spin_unlock_irqrestore(&krcp->lock, flags);
>  }
>  
> +static enum hrtimer_restart
> +schedule_page_work_fn(struct hrtimer *t)
> +{
> +	struct kfree_rcu_cpu *krcp =
> +		container_of(t, struct kfree_rcu_cpu, hrtimer);
> +
> +	queue_work(system_highpri_wq, &krcp->page_cache_work);
> +	return HRTIMER_NORESTART;
> +}
> +
> +static void fill_page_cache_func(struct work_struct *work)
> +{
> +	struct kvfree_rcu_bulk_data *bnode;
> +	struct kfree_rcu_cpu *krcp =
> +		container_of(work, struct kfree_rcu_cpu,
> +			page_cache_work);
> +	unsigned long flags;
> +	bool pushed;
> +	int i;
> +
> +	for (i = 0; i < rcu_min_cached_objs; i++) {
> +		bnode = (struct kvfree_rcu_bulk_data *)
> +			__get_free_page(GFP_KERNEL | __GFP_NOWARN);
> +
> +		if (bnode) {
> +			raw_spin_lock_irqsave(&krcp->lock, flags);
> +			pushed = put_cached_bnode(krcp, bnode);
> +			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> +
> +			if (!pushed) {
> +				free_page((unsigned long) bnode);
> +				break;
> +			}
> +		}
> +	}
> +
> +	atomic_set(&krcp->work_in_progress, 0);
> +}
> +
> +static void
> +run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> +{
> +	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> +			!atomic_xchg(&krcp->work_in_progress, 1)) {
> +		hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> +			HRTIMER_MODE_REL);
> +		krcp->hrtimer.function = schedule_page_work_fn;
> +		hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> +	}
> +}
> +
>  static inline bool
>  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  {
> @@ -3363,32 +3422,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  	if (!krcp->bkvhead[idx] ||
>  			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>  		bnode = get_cached_bnode(krcp);
> -		if (!bnode) {
> -			/*
> -			 * To keep this path working on raw non-preemptible
> -			 * sections, prevent the optional entry into the
> -			 * allocator as it uses sleeping locks. In fact, even
> -			 * if the caller of kfree_rcu() is preemptible, this
> -			 * path still is not, as krcp->lock is a raw spinlock.
> -			 * With additional page pre-allocation in the works,
> -			 * hitting this return is going to be much less likely.
> -			 */
> -			if (IS_ENABLED(CONFIG_PREEMPT_RT))
> -				return false;
> -
> -			/*
> -			 * NOTE: For one argument of kvfree_rcu() we can
> -			 * drop the lock and get the page in sleepable
> -			 * context. That would allow to maintain an array
> -			 * for the CONFIG_PREEMPT_RT as well if no cached
> -			 * pages are available.
> -			 */
> -			bnode = (struct kvfree_rcu_bulk_data *)
> -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> -		}
> -
>  		/* Switch to emergency path. */
> -		if (unlikely(!bnode))
> +		if (!bnode)
>  			return false;
>  
>  		/* Initialize the new block. */
> @@ -3452,12 +3487,10 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		goto unlock_return;
>  	}
>  
> -	/*
> -	 * Under high memory pressure GFP_NOWAIT can fail,
> -	 * in that case the emergency path is maintained.
> -	 */
>  	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
>  	if (!success) {
> +		run_page_cache_worker(krcp);
> +
>  		if (head == NULL)
>  			// Inline if kvfree_rcu(one_arg) call.
>  			goto unlock_return;
> @@ -4449,24 +4482,14 @@ static void __init kfree_rcu_batch_init(void)
>  
>  	for_each_possible_cpu(cpu) {
>  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> -		struct kvfree_rcu_bulk_data *bnode;
>  
>  		for (i = 0; i < KFREE_N_BATCHES; i++) {
>  			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
>  			krcp->krw_arr[i].krcp = krcp;
>  		}
>  
> -		for (i = 0; i < rcu_min_cached_objs; i++) {
> -			bnode = (struct kvfree_rcu_bulk_data *)
> -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> -
> -			if (bnode)
> -				put_cached_bnode(krcp, bnode);
> -			else
> -				pr_err("Failed to preallocate for %d CPU!\n", cpu);
> -		}
> -
>  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> +		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
>  		krcp->initialized = true;
>  	}
>  	if (register_shrinker(&kfree_rcu_shrinker))
> -- 
> 2.20.1
>
Uladzislau Rezki Nov. 3, 2020, 4:33 p.m. UTC | #2
On Tue, Nov 03, 2020 at 10:47:23AM -0500, Joel Fernandes wrote:
> On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > The current memmory-allocation interface presents to following
> > difficulties that this patch is designed to overcome:
> > 
> > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> >    complain about violation("BUG: Invalid wait context") of the
> >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> >    checks, i.e. it is not legal to acquire a spinlock_t while
> >    holding a raw_spinlock_t.
> > 
> >    Internally the kfree_rcu() uses raw_spinlock_t whereas the
> >    "page allocator" internally deals with spinlock_t to access
> >    to its zones. The code also can be broken from higher level
> >    of view:
> >    <snip>
> >        raw_spin_lock(&some_lock);
> >        kfree_rcu(some_pointer, some_field_offset);
> >    <snip>
> > 
> > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> >    is converted into sleepable variant. Invoking the page allocator from
> >    atomic contexts leads to "BUG: scheduling while atomic".
> > 
> > c) call_rcu() is invoked from raw atomic context and kfree_rcu()
> >    and kvfree_rcu() are expected to be called from atomic raw context
> >    as well.
> > 
> > Move out a page allocation from contexts which trigger kvfree_rcu()
> > function to the separate worker. When a k[v]free_rcu() per-cpu page
> > cache is empty a fallback mechanism is used and a special job is
> > scheduled to refill the per-cpu cache.
> 
> Looks good, still reviewing here. BTW just for my education, I was wondering
> about Thomas's email:
> https://lkml.org/lkml/2020/8/11/939
> 
> If slab allocations in pure raw-atomic context on RT is not allowed or
> recommended, should kfree_rcu() be allowed?
>
Thanks for reviewing, Joel :)

The decision was made that we need to support kfree_rcu() from "real atomic contexts",
to align with how it used to behave before. We could just convert our local locks
to the spinlock_t variant, but that was not Paul's goal; it may be that some users need
kfree_rcu() from raw atomics.

>
> slab can have same issue right? If per-cpu cache is drained, it has to
> allocate page from buddy allocator and there's no GFP flag to tell it about
> context where alloc is happening from.
> 
Sounds like it. Apart from that, it might soon turn out that we or somebody
else will raise the question one more time about something like GFP_RAW or GFP_NOLOCKS.
So who knows..

>
> Or are we saying that we want to support kfree on RT from raw atomic atomic
> context, even though kmalloc is not supported? I hate to bring up this
> elephant in the room, but since I am a part of the people maintaining this
> code, I believe I would rather set some rules than supporting unsupported
> usages. :-\ (Once I know what is supported and what isn't that is). If indeed
> raw atomic kfree_rcu() is a bogus use case because of -RT, then we ought to
> put a giant warning than supporting it :-(.
> 
We discussed it several times; the conclusion was that we need to support
kfree_rcu() from raw contexts. At least that was a clear signal from Paul
to me. I think that if we obtain a versatile preemptible(), we can drop the
patch in question later on in the future.

--
Vlad Rezki
Joel Fernandes Nov. 3, 2020, 5:54 p.m. UTC | #3
On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> The current memmory-allocation interface presents to following
> difficulties that this patch is designed to overcome
[...]
> ---
>  kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------
>  1 file changed, 66 insertions(+), 43 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 06895ef85d69..f2da2a1cc716 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444);
>   * per-CPU. Object size is equal to one page. This value
>   * can be changed at boot time.
>   */
> -static int rcu_min_cached_objs = 2;
> +static int rcu_min_cached_objs = 5;
>  module_param(rcu_min_cached_objs, int, 0444);
>  
>  /* Retrieve RCU kthreads priority for rcutorture */
> @@ -3084,6 +3084,9 @@ struct kfree_rcu_cpu_work {
>   *	In order to save some per-cpu space the list is singular.
>   *	Even though it is lockless an access has to be protected by the
>   *	per-cpu lock.
> + * @page_cache_work: A work to refill the cache when it is empty
> + * @work_in_progress: Indicates that page_cache_work is running
> + * @hrtimer: A hrtimer for scheduling a page_cache_work
>   * @nr_bkv_objs: number of allocated objects at @bkvcache.
>   *
>   * This is a per-CPU structure.  The reason that it is not included in
> @@ -3100,6 +3103,11 @@ struct kfree_rcu_cpu {
>  	bool monitor_todo;
>  	bool initialized;
>  	int count;
> +
> +	struct work_struct page_cache_work;
> +	atomic_t work_in_progress;

Does it need to be atomic? run_page_cache_worker() is only called under a lock.
You can use xchg() there. And where you do the atomic_set(), you can use
WRITE_ONCE() as it is a data race.
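
I.e., something like this (just a sketch, with work_in_progress becoming
a plain int):

<snip>
	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
			!xchg(&krcp->work_in_progress, 1)) {
		/* Arm the hrtimer as the patch already does. */
		...
	}

	/* And at the end of fill_page_cache_func(): */
	WRITE_ONCE(krcp->work_in_progress, 0);
<snip>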

> @@ -4449,24 +4482,14 @@ static void __init kfree_rcu_batch_init(void)
>  
>  	for_each_possible_cpu(cpu) {
>  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> -		struct kvfree_rcu_bulk_data *bnode;
>  
>  		for (i = 0; i < KFREE_N_BATCHES; i++) {
>  			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
>  			krcp->krw_arr[i].krcp = krcp;
>  		}
>  
> -		for (i = 0; i < rcu_min_cached_objs; i++) {
> -			bnode = (struct kvfree_rcu_bulk_data *)
> -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> -
> -			if (bnode)
> -				put_cached_bnode(krcp, bnode);
> -			else
> -				pr_err("Failed to preallocate for %d CPU!\n", cpu);
> -		}
> -
>  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> +		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
>  		krcp->initialized = true;

During initialization, is it not better to still pre-allocate? That way you
don't have to wait until you get into a situation where you need the initial
allocation.

AFAICS after the above line deletions, the bulk pages are initially empty.

thanks,

 - Joel


>  	}
>  	if (register_shrinker(&kfree_rcu_shrinker))
> -- 
> 2.20.1
>
Paul E. McKenney Nov. 3, 2020, 7:18 p.m. UTC | #4
On Tue, Nov 03, 2020 at 05:33:50PM +0100, Uladzislau Rezki wrote:
> On Tue, Nov 03, 2020 at 10:47:23AM -0500, Joel Fernandes wrote:
> > On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > > The current memmory-allocation interface presents to following
> > > difficulties that this patch is designed to overcome:
> > > 
> > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> > >    complain about violation("BUG: Invalid wait context") of the
> > >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> > >    checks, i.e. it is not legal to acquire a spinlock_t while
> > >    holding a raw_spinlock_t.
> > > 
> > >    Internally the kfree_rcu() uses raw_spinlock_t whereas the
> > >    "page allocator" internally deals with spinlock_t to access
> > >    to its zones. The code also can be broken from higher level
> > >    of view:
> > >    <snip>
> > >        raw_spin_lock(&some_lock);
> > >        kfree_rcu(some_pointer, some_field_offset);
> > >    <snip>
> > > 
> > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> > >    is converted into sleepable variant. Invoking the page allocator from
> > >    atomic contexts leads to "BUG: scheduling while atomic".
> > > 
> > > c) call_rcu() is invoked from raw atomic context and kfree_rcu()
> > >    and kvfree_rcu() are expected to be called from atomic raw context
> > >    as well.
> > > 
> > > Move out a page allocation from contexts which trigger kvfree_rcu()
> > > function to the separate worker. When a k[v]free_rcu() per-cpu page
> > > cache is empty a fallback mechanism is used and a special job is
> > > scheduled to refill the per-cpu cache.
> > 
> > Looks good, still reviewing here. BTW just for my education, I was wondering
> > about Thomas's email:
> > https://lkml.org/lkml/2020/8/11/939
> > 
> > If slab allocations in pure raw-atomic context on RT is not allowed or
> > recommended, should kfree_rcu() be allowed?
> >
> Thanks for reviewing, Joel :)
> 
> The decision was made that we need to support kfree_rcu() from "real atomic contexts",
> to align with how it used to be before. We can go and just convert our local locks
> to the spinlock_t variant but that was not Paul goal, it can be that some users need
> kfree_rcu() for raw atomics.

People invoke call_rcu() from raw atomics, and so we should provide
the same for kfree_rcu().  Yes, people could work around a raw-atomic
prohibition, but such prohibitions incur constant costs over time in
terms of development effort, increased bug rate, and increased complexity.
Yes, this does increase all of those for RCU, but the relative increase
is negligible, RCU being what it is.

> > slab can have same issue right? If per-cpu cache is drained, it has to
> > allocate page from buddy allocator and there's no GFP flag to tell it about
> > context where alloc is happening from.
> > 
> Sounds like that. Apart of that, it might turn out soon that we or somebody
> else will rise a question one more time about something GFP_RAW or GFP_NOLOCKS.
> So who knows..

I would prefer that slab provide some way of dealing with raw atomic
context, but the maintainers are thus far unconvinced.

> > Or are we saying that we want to support kfree on RT from raw atomic atomic
> > context, even though kmalloc is not supported? I hate to bring up this
> > elephant in the room, but since I am a part of the people maintaining this
> > code, I believe I would rather set some rules than supporting unsupported
> > usages. :-\ (Once I know what is supported and what isn't that is). If indeed
> > raw atomic kfree_rcu() is a bogus use case because of -RT, then we ought to
> > put a giant warning than supporting it :-(.
> > 
> We discussed it several times, the conclusion was that we need to support 
> kfree_rcu() from raw contexts. At least that was a clear signal from Paul 
> to me. I think, if we obtain the preemtable(), so it becomes versatile, we
> can drop the patch that is in question later on in the future.

Given a universally meaningful preemptible(), we could directly call
the allocator in some cases.  It might (or might not) still make sense
to defer the allocation when preemptible() indicated that a direct call
to the allocator was unsafe.
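
For example (a purely hypothetical sketch, assuming such a universally
meaningful preemptible() existed):

<snip>
	if (preemptible()) {
		/* A direct call into the page allocator would be safe. */
		bnode = (struct kvfree_rcu_bulk_data *)
			__get_free_page(GFP_KERNEL | __GFP_NOWARN);
	} else {
		/* Otherwise keep deferring the allocation to the worker. */
		run_page_cache_worker(krcp);
	}
<snip>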

							Thanx, Paul
Uladzislau Rezki Nov. 4, 2020, 12:12 p.m. UTC | #5
On Tue, Nov 03, 2020 at 12:54:22PM -0500, Joel Fernandes wrote:
> On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > The current memmory-allocation interface presents to following
> > difficulties that this patch is designed to overcome
> [...]
> > ---
> >  kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------
> >  1 file changed, 66 insertions(+), 43 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 06895ef85d69..f2da2a1cc716 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444);
> >   * per-CPU. Object size is equal to one page. This value
> >   * can be changed at boot time.
> >   */
> > -static int rcu_min_cached_objs = 2;
> > +static int rcu_min_cached_objs = 5;
> >  module_param(rcu_min_cached_objs, int, 0444);
> >  
> >  /* Retrieve RCU kthreads priority for rcutorture */
> > @@ -3084,6 +3084,9 @@ struct kfree_rcu_cpu_work {
> >   *	In order to save some per-cpu space the list is singular.
> >   *	Even though it is lockless an access has to be protected by the
> >   *	per-cpu lock.
> > + * @page_cache_work: A work to refill the cache when it is empty
> > + * @work_in_progress: Indicates that page_cache_work is running
> > + * @hrtimer: A hrtimer for scheduling a page_cache_work
> >   * @nr_bkv_objs: number of allocated objects at @bkvcache.
> >   *
> >   * This is a per-CPU structure.  The reason that it is not included in
> > @@ -3100,6 +3103,11 @@ struct kfree_rcu_cpu {
> >  	bool monitor_todo;
> >  	bool initialized;
> >  	int count;
> > +
> > +	struct work_struct page_cache_work;
> > +	atomic_t work_in_progress;
> 
> Does it need to be atomic? run_page_cache_work() is only called under a lock.
> You can use xchg() there. And when you do the atomic_set, you can use
> WRITE_ONCE as it is a data-race.
> 
We can use xchg() together with the *_ONCE() macros. Could you please clarify
what your concern about using atomic_t is? Both xchg() and atomic_xchg() guarantee
atomicity, same as WRITE_ONCE() or atomic_set().

> > @@ -4449,24 +4482,14 @@ static void __init kfree_rcu_batch_init(void)
> >  
> >  	for_each_possible_cpu(cpu) {
> >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> > -		struct kvfree_rcu_bulk_data *bnode;
> >  
> >  		for (i = 0; i < KFREE_N_BATCHES; i++) {
> >  			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
> >  			krcp->krw_arr[i].krcp = krcp;
> >  		}
> >  
> > -		for (i = 0; i < rcu_min_cached_objs; i++) {
> > -			bnode = (struct kvfree_rcu_bulk_data *)
> > -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > -
> > -			if (bnode)
> > -				put_cached_bnode(krcp, bnode);
> > -			else
> > -				pr_err("Failed to preallocate for %d CPU!\n", cpu);
> > -		}
> > -
> >  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> > +		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
> >  		krcp->initialized = true;
> 
> During initialization, is it not better to still pre-allocate? That way you
> don't have to wait to get into a situation where you need to initially
> allocate.
> 
Since we have a worker that does it when the cache is empty, there is no
strong need to do it during the initialization phase. If we can reduce
the amount of code, it is always good :)

Thanks, Joel.

--
Vlad Rezki
Uladzislau Rezki Nov. 4, 2020, 12:35 p.m. UTC | #6
On Tue, Nov 03, 2020 at 11:18:22AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 03, 2020 at 05:33:50PM +0100, Uladzislau Rezki wrote:
> > On Tue, Nov 03, 2020 at 10:47:23AM -0500, Joel Fernandes wrote:
> > > On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > The current memmory-allocation interface presents to following
> > > > difficulties that this patch is designed to overcome:
> > > > 
> > > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> > > >    complain about violation("BUG: Invalid wait context") of the
> > > >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> > > >    checks, i.e. it is not legal to acquire a spinlock_t while
> > > >    holding a raw_spinlock_t.
> > > > 
> > > >    Internally the kfree_rcu() uses raw_spinlock_t whereas the
> > > >    "page allocator" internally deals with spinlock_t to access
> > > >    to its zones. The code also can be broken from higher level
> > > >    of view:
> > > >    <snip>
> > > >        raw_spin_lock(&some_lock);
> > > >        kfree_rcu(some_pointer, some_field_offset);
> > > >    <snip>
> > > > 
> > > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> > > >    is converted into sleepable variant. Invoking the page allocator from
> > > >    atomic contexts leads to "BUG: scheduling while atomic".
> > > > 
> > > > c) call_rcu() is invoked from raw atomic context and kfree_rcu()
> > > >    and kvfree_rcu() are expected to be called from atomic raw context
> > > >    as well.
> > > > 
> > > > Move out a page allocation from contexts which trigger kvfree_rcu()
> > > > function to the separate worker. When a k[v]free_rcu() per-cpu page
> > > > cache is empty a fallback mechanism is used and a special job is
> > > > scheduled to refill the per-cpu cache.
> > > 
> > > Looks good, still reviewing here. BTW just for my education, I was wondering
> > > about Thomas's email:
> > > https://lkml.org/lkml/2020/8/11/939
> > > 
> > > If slab allocations in pure raw-atomic context on RT is not allowed or
> > > recommended, should kfree_rcu() be allowed?
> > >
> > Thanks for reviewing, Joel :)
> > 
> > The decision was made that we need to support kfree_rcu() from "real atomic contexts",
> > to align with how it used to be before. We can go and just convert our local locks
> > to the spinlock_t variant but that was not Paul goal, it can be that some users need
> > kfree_rcu() for raw atomics.
> 
> People invoke call_rcu() from raw atomics, and so we should provide
> the same for kfree_rcu().  Yes, people could work around a raw-atomic
> prohibition, but such prohibitions incur constant costs over time in
> terms of development effort, increased bug rate, and increased complexity.
> Yes, this does increase all of those for RCU, but the relative increase
> is negligible, RCU being what it is.
> 
I see your point.

> > > slab can have same issue right? If per-cpu cache is drained, it has to
> > > allocate page from buddy allocator and there's no GFP flag to tell it about
> > > context where alloc is happening from.
> > > 
> > Sounds like that. Apart of that, it might turn out soon that we or somebody
> > else will rise a question one more time about something GFP_RAW or GFP_NOLOCKS.
> > So who knows..
> 
> I would prefer that slab provide some way of dealing with raw atomic
> context, but the maintainers are thus far unconvinced.
> 
I think that when PREEMPT_RT is fully integrated into the kernel, we might get
new users with such a demand. So it is not a closed topic so far, IMHO.

> > > Or are we saying that we want to support kfree on RT from raw atomic atomic
> > > context, even though kmalloc is not supported? I hate to bring up this
> > > elephant in the room, but since I am a part of the people maintaining this
> > > code, I believe I would rather set some rules than supporting unsupported
> > > usages. :-\ (Once I know what is supported and what isn't that is). If indeed
> > > raw atomic kfree_rcu() is a bogus use case because of -RT, then we ought to
> > > put a giant warning than supporting it :-(.
> > > 
> > We discussed it several times, the conclusion was that we need to support 
> > kfree_rcu() from raw contexts. At least that was a clear signal from Paul 
> > to me. I think, if we obtain the preemtable(), so it becomes versatile, we
> > can drop the patch that is in question later on in the future.
> 
> Given a universally meaningful preemptible(), we could directly call
> the allocator in some cases.  It might (or might not) still make sense
> to defer the allocation when preemptible() indicated that a direct call
> to the allocator was unsafe.
> 
I do not have a strong opinion here. Given that maintaining such
"deferring" is not considered a big effort, I think we can live
with it.

--
Vlad Rezki
Paul E. McKenney Nov. 4, 2020, 2:12 p.m. UTC | #7
On Wed, Nov 04, 2020 at 01:35:53PM +0100, Uladzislau Rezki wrote:
> On Tue, Nov 03, 2020 at 11:18:22AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 03, 2020 at 05:33:50PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Nov 03, 2020 at 10:47:23AM -0500, Joel Fernandes wrote:
> > > > On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > > The current memmory-allocation interface presents to following
> > > > > difficulties that this patch is designed to overcome:
> > > > > 
> > > > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> > > > >    complain about violation("BUG: Invalid wait context") of the
> > > > >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> > > > >    checks, i.e. it is not legal to acquire a spinlock_t while
> > > > >    holding a raw_spinlock_t.
> > > > > 
> > > > >    Internally the kfree_rcu() uses raw_spinlock_t whereas the
> > > > >    "page allocator" internally deals with spinlock_t to access
> > > > >    to its zones. The code also can be broken from higher level
> > > > >    of view:
> > > > >    <snip>
> > > > >        raw_spin_lock(&some_lock);
> > > > >        kfree_rcu(some_pointer, some_field_offset);
> > > > >    <snip>
> > > > > 
> > > > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> > > > >    is converted into sleepable variant. Invoking the page allocator from
> > > > >    atomic contexts leads to "BUG: scheduling while atomic".
> > > > > 
> > > > > c) call_rcu() is invoked from raw atomic context and kfree_rcu()
> > > > >    and kvfree_rcu() are expected to be called from atomic raw context
> > > > >    as well.
> > > > > 
> > > > > Move out a page allocation from contexts which trigger kvfree_rcu()
> > > > > function to the separate worker. When a k[v]free_rcu() per-cpu page
> > > > > cache is empty a fallback mechanism is used and a special job is
> > > > > scheduled to refill the per-cpu cache.
> > > > 
> > > > Looks good, still reviewing here. BTW just for my education, I was wondering
> > > > about Thomas's email:
> > > > https://lkml.org/lkml/2020/8/11/939
> > > > 
> > > > If slab allocations in pure raw-atomic context on RT is not allowed or
> > > > recommended, should kfree_rcu() be allowed?
> > > >
> > > Thanks for reviewing, Joel :)
> > > 
> > > The decision was made that we need to support kfree_rcu() from "real atomic contexts",
> > > to align with how it used to be before. We can go and just convert our local locks
> > > to the spinlock_t variant but that was not Paul goal, it can be that some users need
> > > kfree_rcu() for raw atomics.
> > 
> > People invoke call_rcu() from raw atomics, and so we should provide
> > the same for kfree_rcu().  Yes, people could work around a raw-atomic
> > prohibition, but such prohibitions incur constant costs over time in
> > terms of development effort, increased bug rate, and increased complexity.
> > Yes, this does increase all of those for RCU, but the relative increase
> > is negligible, RCU being what it is.
> > 
> I see your point.
> 
> > > > slab can have same issue right? If per-cpu cache is drained, it has to
> > > > allocate page from buddy allocator and there's no GFP flag to tell it about
> > > > context where alloc is happening from.
> > > > 
> > > Sounds like that. Apart of that, it might turn out soon that we or somebody
> > > else will rise a question one more time about something GFP_RAW or GFP_NOLOCKS.
> > > So who knows..
> > 
> > I would prefer that slab provide some way of dealing with raw atomic
> > context, but the maintainers are thus far unconvinced.
> > 
> I think, when preempt_rt is fully integrated to the kernel, we might get
> new users with such demand. So, it is not a closed topic so far, IMHO.

Agreed!  ;-)

> > > > Or are we saying that we want to support kfree on RT from raw atomic atomic
> > > > context, even though kmalloc is not supported? I hate to bring up this
> > > > elephant in the room, but since I am a part of the people maintaining this
> > > > code, I believe I would rather set some rules than supporting unsupported
> > > > usages. :-\ (Once I know what is supported and what isn't that is). If indeed
> > > > raw atomic kfree_rcu() is a bogus use case because of -RT, then we ought to
> > > > put a giant warning than supporting it :-(.
> > > > 
> > > We discussed it several times, the conclusion was that we need to support 
> > > kfree_rcu() from raw contexts. At least that was a clear signal from Paul 
> > > to me. I think, if we obtain the preemtable(), so it becomes versatile, we
> > > can drop the patch that is in question later on in the future.
> > 
> > Given a universally meaningful preemptible(), we could directly call
> > the allocator in some cases.  It might (or might not) still make sense
> > to defer the allocation when preemptible() indicated that a direct call
> > to the allocator was unsafe.
> > 
> I do not have a strong opinion here. Giving the fact that maintaining of
> such "deferring" is not considered as a big effort, i think, we can live
> with it.

And agreed here as well.  If this were instead a large body of complex
code, I might feel otherwise.  But as it is, why worry?

							Thanx, Paul
Uladzislau Rezki Nov. 4, 2020, 2:40 p.m. UTC | #8
On Wed, Nov 04, 2020 at 06:12:00AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 04, 2020 at 01:35:53PM +0100, Uladzislau Rezki wrote:
> > On Tue, Nov 03, 2020 at 11:18:22AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 03, 2020 at 05:33:50PM +0100, Uladzislau Rezki wrote:
> > > > On Tue, Nov 03, 2020 at 10:47:23AM -0500, Joel Fernandes wrote:
> > > > > On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > > > The current memmory-allocation interface presents to following
> > > > > > difficulties that this patch is designed to overcome:
> > > > > > 
> > > > > > a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
> > > > > >    complain about violation("BUG: Invalid wait context") of the
> > > > > >    nesting rules. It does the raw_spinlock vs. spinlock nesting
> > > > > >    checks, i.e. it is not legal to acquire a spinlock_t while
> > > > > >    holding a raw_spinlock_t.
> > > > > > 
> > > > > >    Internally the kfree_rcu() uses raw_spinlock_t whereas the
> > > > > >    "page allocator" internally deals with spinlock_t to access
> > > > > >    to its zones. The code also can be broken from higher level
> > > > > >    of view:
> > > > > >    <snip>
> > > > > >        raw_spin_lock(&some_lock);
> > > > > >        kfree_rcu(some_pointer, some_field_offset);
> > > > > >    <snip>
> > > > > > 
> > > > > > b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
> > > > > >    is converted into sleepable variant. Invoking the page allocator from
> > > > > >    atomic contexts leads to "BUG: scheduling while atomic".
> > > > > > 
> > > > > > c) call_rcu() is invoked from raw atomic context and kfree_rcu()
> > > > > >    and kvfree_rcu() are expected to be called from atomic raw context
> > > > > >    as well.
> > > > > > 
> > > > > > Move out a page allocation from contexts which trigger kvfree_rcu()
> > > > > > function to the separate worker. When a k[v]free_rcu() per-cpu page
> > > > > > cache is empty a fallback mechanism is used and a special job is
> > > > > > scheduled to refill the per-cpu cache.
> > > > > 
> > > > > Looks good, still reviewing here. BTW just for my education, I was wondering
> > > > > about Thomas's email:
> > > > > https://lkml.org/lkml/2020/8/11/939
> > > > > 
> > > > > If slab allocations in pure raw-atomic context on RT is not allowed or
> > > > > recommended, should kfree_rcu() be allowed?
> > > > >
> > > > Thanks for reviewing, Joel :)
> > > > 
> > > > The decision was made that we need to support kfree_rcu() from "real atomic contexts",
> > > > to align with how it used to be before. We can go and just convert our local locks
> > > > to the spinlock_t variant but that was not Paul goal, it can be that some users need
> > > > kfree_rcu() for raw atomics.
> > > 
> > > People invoke call_rcu() from raw atomics, and so we should provide
> > > the same for kfree_rcu().  Yes, people could work around a raw-atomic
> > > prohibition, but such prohibitions incur constant costs over time in
> > > terms of development effort, increased bug rate, and increased complexity.
> > > Yes, this does increase all of those for RCU, but the relative increase
> > > is negligible, RCU being what it is.
> > > 
> > I see your point.
> > 
> > > > > slab can have same issue right? If per-cpu cache is drained, it has to
> > > > > allocate page from buddy allocator and there's no GFP flag to tell it about
> > > > > context where alloc is happening from.
> > > > > 
> > > > Sounds like that. Apart of that, it might turn out soon that we or somebody
> > > > else will rise a question one more time about something GFP_RAW or GFP_NOLOCKS.
> > > > So who knows..
> > > 
> > > I would prefer that slab provide some way of dealing with raw atomic
> > > context, but the maintainers are thus far unconvinced.
> > > 
> > I think, when preempt_rt is fully integrated to the kernel, we might get
> > new users with such demand. So, it is not a closed topic so far, IMHO.
> 
> Agreed!  ;-)
> 
> > > > > Or are we saying that we want to support kfree on RT from raw atomic atomic
> > > > > context, even though kmalloc is not supported? I hate to bring up this
> > > > > elephant in the room, but since I am a part of the people maintaining this
> > > > > code, I believe I would rather set some rules than supporting unsupported
> > > > > usages. :-\ (Once I know what is supported and what isn't that is). If indeed
> > > > > raw atomic kfree_rcu() is a bogus use case because of -RT, then we ought to
> > > > > put a giant warning than supporting it :-(.
> > > > > 
> > > > We discussed it several times, the conclusion was that we need to support 
> > > > kfree_rcu() from raw contexts. At least that was a clear signal from Paul 
> > > > to me. I think, if we obtain the preemtable(), so it becomes versatile, we
> > > > can drop the patch that is in question later on in the future.
> > > 
> > > Given a universally meaningful preemptible(), we could directly call
> > > the allocator in some cases.  It might (or might not) still make sense
> > > to defer the allocation when preemptible() indicated that a direct call
> > > to the allocator was unsafe.
> > > 
> > I do not have a strong opinion here. Giving the fact that maintaining of
> > such "deferring" is not considered as a big effort, i think, we can live
> > with it.
> 
> And agreed here as well.  If this were instead a large body of complex
> code, I might feel otherwise.  But as it is, why worry?
> 
Agreed! I do not consider it extra complexity.

--
Vlad Rezki
Joel Fernandes Nov. 4, 2020, 3:01 p.m. UTC | #9
On Wed, Nov 04, 2020 at 01:12:03PM +0100, Uladzislau Rezki wrote:
> On Tue, Nov 03, 2020 at 12:54:22PM -0500, Joel Fernandes wrote:
> > On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> > > The current memmory-allocation interface presents to following
> > > difficulties that this patch is designed to overcome
> > [...]
> > > ---
> > >  kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------
> > >  1 file changed, 66 insertions(+), 43 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 06895ef85d69..f2da2a1cc716 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444);
> > >   * per-CPU. Object size is equal to one page. This value
> > >   * can be changed at boot time.
> > >   */
> > > -static int rcu_min_cached_objs = 2;
> > > +static int rcu_min_cached_objs = 5;
> > >  module_param(rcu_min_cached_objs, int, 0444);
> > >  
> > >  /* Retrieve RCU kthreads priority for rcutorture */
> > > @@ -3084,6 +3084,9 @@ struct kfree_rcu_cpu_work {
> > >   *	In order to save some per-cpu space the list is singular.
> > >   *	Even though it is lockless an access has to be protected by the
> > >   *	per-cpu lock.
> > > + * @page_cache_work: A work to refill the cache when it is empty
> > > + * @work_in_progress: Indicates that page_cache_work is running
> > > + * @hrtimer: A hrtimer for scheduling a page_cache_work
> > >   * @nr_bkv_objs: number of allocated objects at @bkvcache.
> > >   *
> > >   * This is a per-CPU structure.  The reason that it is not included in
> > > @@ -3100,6 +3103,11 @@ struct kfree_rcu_cpu {
> > >  	bool monitor_todo;
> > >  	bool initialized;
> > >  	int count;
> > > +
> > > +	struct work_struct page_cache_work;
> > > +	atomic_t work_in_progress;
> > 
> > Does it need to be atomic? run_page_cache_work() is only called under a lock.
> > You can use xchg() there. And when you do the atomic_set, you can use
> > WRITE_ONCE as it is a data-race.
> > 
> We can use xchg together with *_ONCE() macro. Could you please clarify what
> is your concern about using atomic_t? Both xchg() and atomic_xchg() guarantee
> atamarity. Same as WRITE_ONCE() or atomic_set().

Right, whether there's a lock or not does not matter, as xchg() is also an
atomic swap.

atomic_t is a more complex type, though; I would directly use an int, since
atomic_t is not needed here and there's no lost-update issue. It could
be a matter of style as well.

BTW, I thought atomic_xchg() adds additional memory barriers,
but I could not find that to be the case in the implementation. Is that not
the case? The docs say "atomic_xchg must provide explicit memory barriers around
the operation.".

> > > @@ -4449,24 +4482,14 @@ static void __init kfree_rcu_batch_init(void)
> > >  
> > >  	for_each_possible_cpu(cpu) {
> > >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> > > -		struct kvfree_rcu_bulk_data *bnode;
> > >  
> > >  		for (i = 0; i < KFREE_N_BATCHES; i++) {
> > >  			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
> > >  			krcp->krw_arr[i].krcp = krcp;
> > >  		}
> > >  
> > > -		for (i = 0; i < rcu_min_cached_objs; i++) {
> > > -			bnode = (struct kvfree_rcu_bulk_data *)
> > > -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > -
> > > -			if (bnode)
> > > -				put_cached_bnode(krcp, bnode);
> > > -			else
> > > -				pr_err("Failed to preallocate for %d CPU!\n", cpu);
> > > -		}
> > > -
> > >  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> > > +		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
> > >  		krcp->initialized = true;
> > 
> > During initialization, is it not better to still pre-allocate? That way you
> > don't have to wait to get into a situation where you need to initially
> > allocate.
> > 
> Since we have a worker that does it when a cache is empty there is no
> a high need in doing it during initialization phase. If we can reduce
> an amount of code it is always good :)

I am all for not having more code than needed. But you would hit the
synchronize_rcu() slow path immediately on the first headless kfree_rcu(), right?
That seems like a step back from the current code :)

thanks,

 - Joel
Uladzislau Rezki Nov. 4, 2020, 6:38 p.m. UTC | #10
> > > >   * This is a per-CPU structure.  The reason that it is not included in
> > > > @@ -3100,6 +3103,11 @@ struct kfree_rcu_cpu {
> > > >  	bool monitor_todo;
> > > >  	bool initialized;
> > > >  	int count;
> > > > +
> > > > +	struct work_struct page_cache_work;
> > > > +	atomic_t work_in_progress;
> > > 
> > > Does it need to be atomic? run_page_cache_work() is only called under a lock.
> > > You can use xchg() there. And when you do the atomic_set, you can use
> > > WRITE_ONCE as it is a data-race.
> > > 
> > We can use xchg together with *_ONCE() macro. Could you please clarify what
> > is your concern about using atomic_t? Both xchg() and atomic_xchg() guarantee
> > atamarity. Same as WRITE_ONCE() or atomic_set().
> 
> Right, whether there's lock or not does not matter as xchg() is also
> atomic-swap.
> 
> atomic_t is a more complex type though, I would directly use int since
> atomic_t is not needed here and there's no lost-update issue here. It could
> be matter of style as well.
> 
> BTW I did think atomic_xchg() adds additional memory barriers
> but I could not find that to be the case in the implementation. Is that not
> the case? Docs says "atomic_xchg must provide explicit memory barriers around
> the operation.".
> 
On most systems atomic_xchg() is the same as xchg() and atomic_set()
is the same as WRITE_ONCE(). But there are exceptions, for example "parisc":

*** arch/parisc/include/asm/atomic.h:
<snip>
...
#define _atomic_spin_lock_irqsave(l,f) do { \
    arch_spinlock_t *s = ATOMIC_HASH(l); \
    local_irq_save(f);   \
    arch_spin_lock(s);   \
} while(0)
...
static __inline__ void atomic_set(atomic_t *v, int i)
{
     unsigned long flags;
     _atomic_spin_lock_irqsave(v, flags);

     v->counter = i;

     _atomic_spin_unlock_irqrestore(v, flags);
}
<snip>

I will switch to xchg() and WRITE_ONCE(), because of such arch-specific cases.
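
Presumably something like this in the next version (sketch only, with
work_in_progress becoming a plain int):

<snip>
-	atomic_set(&krcp->work_in_progress, 0);
+	WRITE_ONCE(krcp->work_in_progress, 0);
...
-			!atomic_xchg(&krcp->work_in_progress, 1)) {
+			!xchg(&krcp->work_in_progress, 1)) {
<snip>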

> > > > @@ -4449,24 +4482,14 @@ static void __init kfree_rcu_batch_init(void)
> > > >  
> > > >  	for_each_possible_cpu(cpu) {
> > > >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> > > > -		struct kvfree_rcu_bulk_data *bnode;
> > > >  
> > > >  		for (i = 0; i < KFREE_N_BATCHES; i++) {
> > > >  			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
> > > >  			krcp->krw_arr[i].krcp = krcp;
> > > >  		}
> > > >  
> > > > -		for (i = 0; i < rcu_min_cached_objs; i++) {
> > > > -			bnode = (struct kvfree_rcu_bulk_data *)
> > > > -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > > -
> > > > -			if (bnode)
> > > > -				put_cached_bnode(krcp, bnode);
> > > > -			else
> > > > -				pr_err("Failed to preallocate for %d CPU!\n", cpu);
> > > > -		}
> > > > -
> > > >  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> > > > +		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
> > > >  		krcp->initialized = true;
> > > 
> > > During initialization, is it not better to still pre-allocate? That way you
> > > don't have to wait to get into a situation where you need to initially
> > > allocate.
> > > 
> > Since we have a worker that does it when a cache is empty there is no
> > a high need in doing it during initialization phase. If we can reduce
> > an amount of code it is always good :)
> 
> I am all for not having more code than needed. But you would hit
> synchronize_rcu() slow path immediately on first headless kfree_rcu() right?
> That seems like a step back from the current code :)
> 
As for the slow path and hitting synchronize_rcu() immediately: yes, the
slow-path hit "counter" will be increased by 1, so the difference between the
two variants is N vs. N + 1 times. I do not consider N + 1 a big difference or
a real impact on performance.

Should we guarantee that the first user does not hit the fallback path that
invokes synchronize_rcu()? If not, I would rather remove the redundant code.

Any thoughts here?

Thanks!

--
Vlad Rezki

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 06895ef85d69..f2da2a1cc716 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -177,7 +177,7 @@  module_param(rcu_unlock_delay, int, 0444);
  * per-CPU. Object size is equal to one page. This value
  * can be changed at boot time.
  */
-static int rcu_min_cached_objs = 2;
+static int rcu_min_cached_objs = 5;
 module_param(rcu_min_cached_objs, int, 0444);
 
 /* Retrieve RCU kthreads priority for rcutorture */
@@ -3084,6 +3084,9 @@  struct kfree_rcu_cpu_work {
  *	In order to save some per-cpu space the list is singular.
  *	Even though it is lockless an access has to be protected by the
  *	per-cpu lock.
+ * @page_cache_work: A work to refill the cache when it is empty
+ * @work_in_progress: Indicates that page_cache_work is running
+ * @hrtimer: A hrtimer for scheduling a page_cache_work
  * @nr_bkv_objs: number of allocated objects at @bkvcache.
  *
  * This is a per-CPU structure.  The reason that it is not included in
@@ -3100,6 +3103,11 @@  struct kfree_rcu_cpu {
 	bool monitor_todo;
 	bool initialized;
 	int count;
+
+	struct work_struct page_cache_work;
+	atomic_t work_in_progress;
+	struct hrtimer hrtimer;
+
 	struct llist_head bkvcache;
 	int nr_bkv_objs;
 };
@@ -3217,10 +3225,10 @@  static void kfree_rcu_work(struct work_struct *work)
 			}
 			rcu_lock_release(&rcu_callback_map);
 
-			krcp = krc_this_cpu_lock(&flags);
+			raw_spin_lock_irqsave(&krcp->lock, flags);
 			if (put_cached_bnode(krcp, bkvhead[i]))
 				bkvhead[i] = NULL;
-			krc_this_cpu_unlock(krcp, flags);
+			raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
 			if (bkvhead[i])
 				free_page((unsigned long) bkvhead[i]);
@@ -3347,6 +3355,57 @@  static void kfree_rcu_monitor(struct work_struct *work)
 		raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
 
+static enum hrtimer_restart
+schedule_page_work_fn(struct hrtimer *t)
+{
+	struct kfree_rcu_cpu *krcp =
+		container_of(t, struct kfree_rcu_cpu, hrtimer);
+
+	queue_work(system_highpri_wq, &krcp->page_cache_work);
+	return HRTIMER_NORESTART;
+}
+
+static void fill_page_cache_func(struct work_struct *work)
+{
+	struct kvfree_rcu_bulk_data *bnode;
+	struct kfree_rcu_cpu *krcp =
+		container_of(work, struct kfree_rcu_cpu,
+			page_cache_work);
+	unsigned long flags;
+	bool pushed;
+	int i;
+
+	for (i = 0; i < rcu_min_cached_objs; i++) {
+		bnode = (struct kvfree_rcu_bulk_data *)
+			__get_free_page(GFP_KERNEL | __GFP_NOWARN);
+
+		if (bnode) {
+			raw_spin_lock_irqsave(&krcp->lock, flags);
+			pushed = put_cached_bnode(krcp, bnode);
+			raw_spin_unlock_irqrestore(&krcp->lock, flags);
+
+			if (!pushed) {
+				free_page((unsigned long) bnode);
+				break;
+			}
+		}
+	}
+
+	atomic_set(&krcp->work_in_progress, 0);
+}
+
+static void
+run_page_cache_worker(struct kfree_rcu_cpu *krcp)
+{
+	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
+			!atomic_xchg(&krcp->work_in_progress, 1)) {
+		hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL);
+		krcp->hrtimer.function = schedule_page_work_fn;
+		hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+	}
+}
+
 static inline bool
 kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 {
@@ -3363,32 +3422,8 @@  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 	if (!krcp->bkvhead[idx] ||
 			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
 		bnode = get_cached_bnode(krcp);
-		if (!bnode) {
-			/*
-			 * To keep this path working on raw non-preemptible
-			 * sections, prevent the optional entry into the
-			 * allocator as it uses sleeping locks. In fact, even
-			 * if the caller of kfree_rcu() is preemptible, this
-			 * path still is not, as krcp->lock is a raw spinlock.
-			 * With additional page pre-allocation in the works,
-			 * hitting this return is going to be much less likely.
-			 */
-			if (IS_ENABLED(CONFIG_PREEMPT_RT))
-				return false;
-
-			/*
-			 * NOTE: For one argument of kvfree_rcu() we can
-			 * drop the lock and get the page in sleepable
-			 * context. That would allow to maintain an array
-			 * for the CONFIG_PREEMPT_RT as well if no cached
-			 * pages are available.
-			 */
-			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
-		}
-
 		/* Switch to emergency path. */
-		if (unlikely(!bnode))
+		if (!bnode)
 			return false;
 
 		/* Initialize the new block. */
@@ -3452,12 +3487,10 @@  void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		goto unlock_return;
 	}
 
-	/*
-	 * Under high memory pressure GFP_NOWAIT can fail,
-	 * in that case the emergency path is maintained.
-	 */
 	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
 	if (!success) {
+		run_page_cache_worker(krcp);
+
 		if (head == NULL)
 			// Inline if kvfree_rcu(one_arg) call.
 			goto unlock_return;
@@ -4449,24 +4482,14 @@  static void __init kfree_rcu_batch_init(void)
 
 	for_each_possible_cpu(cpu) {
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
-		struct kvfree_rcu_bulk_data *bnode;
 
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
 			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
 			krcp->krw_arr[i].krcp = krcp;
 		}
 
-		for (i = 0; i < rcu_min_cached_objs; i++) {
-			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
-
-			if (bnode)
-				put_cached_bnode(krcp, bnode);
-			else
-				pr_err("Failed to preallocate for %d CPU!\n", cpu);
-		}
-
 		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
+		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
 		krcp->initialized = true;
 	}
 	if (register_shrinker(&kfree_rcu_shrinker))