Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

From: Joel Fernandes <joel@joelfernandes.org>
To: Uladzislau Rezki <urezki@gmail.com>
Cc: linux-kernel@vger.kernel.org, kernel-team@android.com,
	kernel-team@lge.com, Byungchul Park <byungchul.park@lge.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Josh Triplett <josh@joshtriplett.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	max.byungchul.park@gmail.com,
	"Paul E. McKenney" <paulmck@linux.ibm.com>,
	Rao Shoaib <rao.shoaib@oracle.com>,
	rcu@vger.kernel.org, Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching
Date: Mon, 30 Sep 2019 16:16:23 -0400	[thread overview]
Message-ID: <20190930201623.GA134859@google.com> (raw)
In-Reply-To: <20190918095811.GA25821@pc636>

On Wed, Sep 18, 2019 at 11:58:11AM +0200, Uladzislau Rezki wrote:
> > Recently a discussion about stability and performance of a system
> > involving a high rate of kfree_rcu() calls surfaced on the list [1]
> > which led to another discussion how to prepare for this situation.
> > 
> > This patch adds basic batching support for kfree_rcu(). It is "basic"
> > because we do none of the slab management, dynamic allocation, code
> > moving or any of the other things, some of which previous attempts did
> > [2]. These fancier improvements can be follow-up patches and there are
> > different ideas being discussed in those regards. This is an effort to
> > start simple, and build up from there. In the future, an extension to
> > use kfree_bulk and possibly per-slab batching could be done to further
> > improve performance due to cache-locality and slab-specific bulk free
> > optimizations. By using an array of pointers, the worker thread
> > processing the work would need to read lesser data since it does not
> > need to deal with large rcu_head(s) any longer.
> > 
> > Torture tests follow in the next patch and show improvements of around
> > 5x reduction in number of  grace periods on a 16 CPU system. More
> > details and test data are in that patch.
> > 
> > There is an implication with rcu_barrier() with this patch. Since the
> > kfree_rcu() calls can be batched, and may not be handed yet to the RCU
> > machinery in fact, the monitor may not have even run yet to do the
> > queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
> > to wait for those kfree_rcu()s that are already made. So this means a
> > kfree_rcu() followed by an rcu_barrier() does not imply that memory will
> > be freed once rcu_barrier() returns.
> > 
> > Another implication is higher active memory usage (although not
> > run-away..) until the kfree_rcu() flooding ends, in comparison to
> > without batching. More details about this are in the second patch which
> > adds an rcuperf test.
> > 
> > Finally, in the near future we will get rid of kfree_rcu() special casing
> > within RCU such as in rcu_do_batch and switch everything to just
> > batching. Currently we don't do that since timer subsystem is not yet up
> > and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
> > lock are not initialized. That would also mean getting rid of
> > kfree_call_rcu_nobatch() entirely.
> > 
> Hello, Joel.
> 
> First of all thank you for improving it. I also noticed a high pressure
> on RCU-machinery during performing some vmalloc tests when kfree_rcu()
> flood occurred. Therefore i got rid of using kfree_rcu() there.

Replying a bit late due to overseas conference travel and vacation.

When you say 'high pressure', do you mean memory pressure or just system
load?

Memory pressure slightly increases with the kfree_rcu() rework with the
benefit of much fewer grace periods.

> I have just a small question related to workloads and performance evaluation.
> Are you aware of any specific workloads which benefit from it for example
> mobile area, etc? I am asking because i think about backporting of it and
> reuse it on our kernel. 

I am not aware of a mobile usecase that benefits but there are server
workloads that make system more stable in the face of a kfree_rcu() flood.

For the KVA allocator work, I see it is quite similar to the way binder
allocates blocks. See function: binder_alloc_new_buf_locked(). Is there are
any chance to reuse any code? For one thing, binder also has an rbtree for
allocated blocks for fast lookup of allocated blocks. Does the KVA allocator
not have the need for that?

And, nice LPC presentation! I was there ;)

thanks,

 - Joel