Re: [PATCH RFC] rcu/kfree: Do not request RCU when not needed

From: Uladzislau Rezki <urezki@gmail.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	rcu@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC] rcu/kfree: Do not request RCU when not needed
Date: Fri, 4 Nov 2022 15:35:59 +0100	[thread overview]
Message-ID: <Y2UjT4yuFvU5d6FU@pc638.lan> (raw)
In-Reply-To: <20221103175143.GB5600@paulmck-ThinkPad-P17-Gen-1>

> > > 
> > This concern works in both cases. I mean in default configuration and
> > with a posted patch. The reclaim work, which name is kfree_rcu_work() only
> > does a progress when a gp is passed so the rcu_work_rcufn() can queue
> > our reclaim kworker.
> > 
> > As it is now:
> > 
> > 1. Collect pointers, then we decide to drop them we queue the
> >    monitro_work() worker to the system_wq.
> > 
> > 2. The monitor work, kfree_rcu_work(), tries to attach or saying
> > it by another words bypass a "backlog" to "free" channels.
> > 
> > 3. It invokes the queue_rcu_work() that does call_rcu_flush() and
> > in its turn it queues our worker from the handler. So the worker
> > is run after GP is passed.
> 
> So as it is now, we are not tying up a kworker kthread while waiting
> for the grace period, correct?  We instead have an RCU callback queued
> during that time, and the kworker kthread gets involved only after the
> grace period ends.
> 
Yes. The reclaim kworker is not kicked until a wrapper callback pushed
via queue_rcu_work() -> call_rcu_flush() is invoked by the nocb-kthread:

<snip>
static void rcu_work_rcufn(struct rcu_head *rcu)
{
...
	/* read the comment in __queue_work() */
	local_irq_disable();
	__queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
	local_irq_enable();
}

bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
{
...
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		rwork->wq = wq;
		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
..
<snip>

rcu_work_rcufn() callback runs our reclaim work after a full grace period.
There is not so good dependency. If it is number, say, 10 000 in the cblist
then we just wait for our time in a queue.

> > With a patch: 
> > 
> > [1] and [2] steps are the same. But on third step we do:
> > 
> > 1. Record the GP status for last in channel;
> > 2. Directly queue the drain work without any call_rcu() helpers;
> > 3. On the reclaim worker entry we check if GP is passed;
> > 4. If not it invokes synchronize_rcu().
> 
> And #4 changes that, by (sometimes) tying up a kworker kthread for the
> full grace period.
> 
Yes if GP is not passed we need to wait it anyway. Like this is done in
default case. call_rcu()'s callback waits for a full grace period also
and after that queues the work.

It is the same. But patched version can proceed reclaim right away if
"during the fly" the GP is passed. On my tests i see it is ~30-40% of
cases.

With a patch it also does not require go over RCU-core layer that is
definitely longer.

> > The patch eliminates extra steps by not going via RCU-core route
> > instead it directly invokes the reclaim worker where it either
> > proceed or wait a GP if needed.
> 
> I agree that the use of the polled API could be reducing delays, which
> is a good thing.  Just being my usual greedy self and asking "Why not
> both?", that is use queue_rcu_work() instead of synchronize_rcu() in
> conjunction with the polled APIs so as to avoid both the grace-period
> delay and the tying up of the kworker kthread.
> 
> Or am I missing something here?
> 
Looks like not :) 

1.
I was thinking about tracking a GP status for set of objects. With last
in one we need to keep it updated for the whole set. Such objects can be
freed faster, for example from monitor kworker. All other can go over
queue_rcu_work().

This is more complex design but probably could be better. It will require
split such chains for whom a GP is passed and not.

Please note there can be hidden problems and what i wrote might not be
easily applicable.

The patch i sent is an attempt to optimize it by not introducing the
big delta to current design.

2. 
call_rcu()
   -> queue_work();
          -> do reclaim

queue_work();
    -> synchronize_rcu() or not if passed, do reclaim

I can miss something though, why the latest is a bad way?

--
Uladzislau Rezki