From: Joel Fernandes <joel@joelfernandes.org>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: linux-kernel@vger.kernel.org, byungchul.park@lge.com,
	Josh Triplett <josh@joshtriplett.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	linux-doc@vger.kernel.org,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	rcu@vger.kernel.org, Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests
Date: Thu, 29 Aug 2019 16:56:37 -0400	[thread overview]
Message-ID: <20190829205637.GA162830@google.com> (raw)
In-Reply-To: <20190828211226.GW26530@linux.ibm.com>

On Wed, Aug 28, 2019 at 02:12:26PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 27, 2019 at 03:01:55PM -0400, Joel Fernandes (Google) wrote:
> > This test runs kfree_rcu() in a loop to measure performance of the new
> > kfree_rcu() batching functionality.
> > 
> > The following table shows results when booting with arguments:
> > rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1
> > 
> > In addition, rcuperf.kfree_no_batch is used to toggle the batching of
> > kfree_rcu()s for a test run.
> > 
> > patch applied		GPs	time (seconds)
> >  yes			1732	14.5
> >  no			9133 	11.5
> 
> This is really "rcuperf.kfree_no_batch" rather than "patch applied", right?
> (Yes, we did discuss this last time around, but this table combined with
> the prior paragraph is still ambiguous.)  Please make it unambiguous.
> One way to do that is as follows:
> 
> ------------------------------------------------------------------------
> 
> The following table shows results when booting with arguments:
> rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1  rcuperf.kfree_no_batch=X
> 
> rcuperf.kfree_no_batch=X    # Grace Periods	Test Duration (s)
>  X=1 (old behavior)              9133                 11.5
>  X=0 (new behavior)              1732                 14.5

Yes, you are right, will fix. The reason I changed it to 'patch applied' is
that the last patch in the series removes kfree_no_batch. Thanks!
 
> > On a 16 CPU system with the above boot parameters, we see that the total
> > number of grace periods that elapse during the test drops from 9133 when
> > not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
> > flood itself slows down a bit when batching, though, as shown.
> 
> This last sentence would be more clear as something like: "However,
> use of batching increases the duration of the kfree_rcu()-flood test."
> 
> > Note that the active memory consumption during the kfree_rcu() flood
> > does increase to around 200-250MB due to the batching (from around 50MB
> > without batching). However, this memory consumption is relatively
> > constant. In other words, the system is able to keep up with the
> > kfree_rcu() load. The memory consumption comes down considerably if
> > KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80.
> 
> That would be a decrease rather than an increase in KFREE_DRAIN_JIFFIES,
> correct?
> 
> This would also be a good place to mention that a later patch will
> decrease consumption, but that is strictly optional.  However, you did
> introduce the topic of changing KFREE_DRAIN_JIFFIES, so if a later patch
> changes this value, this would be an excellent place to mention this.

Fixed.
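
For reference, the knob under discussion is just the delay used when
scheduling the per-CPU drain work: a longer KFREE_DRAIN_JIFFIES batches more
objects per grace period but holds on to the queued memory for longer, which
is the tradeoff described above. A rough sketch of that structure, with names
only loosely following the batching patch in this series (illustrative, not
the exact code):

#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

#define KFREE_DRAIN_JIFFIES (HZ / 50)	/* drain queued objects every ~20 ms */

struct kfree_rcu_cpu {
	struct rcu_head *head;			/* objects queued since last drain */
	struct delayed_work monitor_work;	/* hands the queue over to RCU */
	spinlock_t lock;
	bool monitor_todo;			/* drain already scheduled? */
};

/* Queue one object and make sure a drain is pending within one interval. */
static void kfree_rcu_queue(struct kfree_rcu_cpu *krcp, struct rcu_head *head)
{
	unsigned long flags;

	spin_lock_irqsave(&krcp->lock, flags);
	head->next = krcp->head;
	krcp->head = head;
	if (!krcp->monitor_todo) {
		krcp->monitor_todo = true;
		/* Larger delay => fewer grace periods, but more memory held. */
		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
	}
	spin_unlock_irqrestore(&krcp->lock, flags);
}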

[snip]
> > +/*
> > + * kfree_rcu() performance tests: Start a kfree_rcu() loop on all CPUs for number
> > + * of iterations and measure total time and number of GP for all iterations to complete.
> > + */
> > +
> > +torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
> > +torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
> > +torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
> > +torture_param(int, kfree_no_batch, 0, "Use the non-batching (slower) version of kfree_rcu().");
> > +
> > +static struct task_struct **kfree_reader_tasks;
> > +static int kfree_nrealthreads;
> > +static atomic_t n_kfree_perf_thread_started;
> > +static atomic_t n_kfree_perf_thread_ended;
> > +
> > +struct kfree_obj {
> > +	char kfree_obj[8];
> > +	struct rcu_head rh;
> > +};
> > +
> > +static int
> > +kfree_perf_thread(void *arg)
> > +{
> > +	int i, loop = 0;
> > +	long me = (long)arg;
> > +	struct kfree_obj *alloc_ptr;
> > +	u64 start_time, end_time;
> > +
> > +	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
> > +	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> > +	set_user_nice(current, MAX_NICE);
> > +
> > +	start_time = ktime_get_mono_fast_ns();
> > +
> > +	if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
> > +		if (gp_exp)
> > +			b_rcu_gp_test_started = cur_ops->exp_completed() / 2;
> 
> At some point, it would be good to use the new grace-period
> sequence-counter functions (rcuperf_seq_diff(), for example) instead of
> the open-coded division by 2.  I freely admit that you are just copying
> my obsolete hack in this case, so not needed in this patch.

But I am using rcuperf_seq_diff() below in the pr_alert().

Anyway, I agree this can be a follow-on since this pattern is borrowed from
another part of rcuperf. However, I am also confused about the pattern
itself.

If I understand correctly, the "/ 2" is there because expedited_sequence
progresses by 2 for every expedited batch.

But does rcu_seq_diff() really work on these expedited GP numbers, and will
it be immune to changes in RCU_SEQ_STATE_MASK? Sorry for the silly questions,
but admittedly I have not yet looked closely into expedited RCU, so I could
be missing the point.
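
For my own understanding, my mental model of the sequence-counter helpers in
kernel/rcu/rcu.h is roughly the following (paraphrased from memory and
simplified, so please correct me if I have it wrong):

/* Simplified paraphrase of kernel/rcu/rcu.h, not a verbatim copy. */
#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)

/* High bits count grace periods; low bits carry in-progress state. */
static inline unsigned long rcu_seq_ctr(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT;
}

/*
 * Approximate number of grace periods between two snapshots. The real
 * rcu_seq_diff() also rounds up when either snapshot has state bits set.
 */
static inline unsigned long rcu_seq_diff(unsigned long new, unsigned long old)
{
	return rcu_seq_ctr(new - old);
}

So if expedited_sequence is kept in this same format, rcu_seq_diff() (or the
rcuperf_seq_diff() wrapper) would already hide the shift, whereas the
open-coded "/ 2" bakes in a particular value of RCU_SEQ_CTR_SHIFT. Is that
the right way to think about it?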

> > +		else
> > +			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
> > +
> > +		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
> > +		       (unsigned long long)(end_time - start_time), kfree_loops,
> > +		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
> > +		if (shutdown) {
> > +			smp_mb(); /* Assign before wake. */
> > +			wake_up(&shutdown_wq);
> > +		}
> > +	}
> > +
> > +	torture_kthread_stopping("kfree_perf_thread");
> > +	return 0;
> > +}
> > +
> > +static void
> > +kfree_perf_cleanup(void)
> > +{
> > +	int i;
> > +
> > +	if (torture_cleanup_begin())
> > +		return;
> > +
> > +	if (kfree_reader_tasks) {
> > +		for (i = 0; i < kfree_nrealthreads; i++)
> > +			torture_stop_kthread(kfree_perf_thread,
> > +					     kfree_reader_tasks[i]);
> > +		kfree(kfree_reader_tasks);
> > +	}
> > +
> > +	torture_cleanup_end();
> > +}
> > +
> > +/*
> > + * shutdown kthread.  Just waits to be awakened, then shuts down system.
> > + */
> > +static int
> > +kfree_perf_shutdown(void *arg)
> > +{
> > +	do {
> > +		wait_event(shutdown_wq,
> > +			   atomic_read(&n_kfree_perf_thread_ended) >=
> > +			   kfree_nrealthreads);
> > +	} while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
> > +
> > +	smp_mb(); /* Wake before output. */
> > +
> > +	kfree_perf_cleanup();
> > +	kernel_power_off();
> > +	return -EINVAL;
> 
> These last four lines should be combined with those of
> rcu_perf_shutdown().  Actually, you could fold the two functions together
> with only a pair of arguments and two one-line wrapper functions, which
> would be even better.

But the cleanup() function is different in the two cases and would have to be
passed in as a function pointer. I believe we discussed this during the last
review as well.
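
Just to be sure I follow the suggestion, do you mean something along these
lines? (Completely untested sketch; rcu_perf_cleanup(),
n_rcu_perf_writer_finished, and nrealwriters are the names I am assuming
from the existing rcuperf.c.)

/* Untested sketch of a common shutdown helper taking a cleanup callback. */
static int perf_shutdown_common(atomic_t *ended, int nthreads,
				void (*cleanup)(void))
{
	do {
		wait_event(shutdown_wq, atomic_read(ended) >= nthreads);
	} while (atomic_read(ended) < nthreads);

	smp_mb(); /* Wake before output. */
	cleanup();
	kernel_power_off();
	return -EINVAL;
}

static int kfree_perf_shutdown(void *arg)
{
	return perf_shutdown_common(&n_kfree_perf_thread_ended,
				    kfree_nrealthreads, kfree_perf_cleanup);
}

static int rcu_perf_shutdown(void *arg)
{
	return perf_shutdown_common(&n_rcu_perf_writer_finished,
				    nrealwriters, rcu_perf_cleanup);
}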

thanks,

 - Joel


