* [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests
@ 2019-08-27 19:01 Joel Fernandes (Google)
  2019-08-28 21:12 ` Paul E. McKenney
  0 siblings, 1 reply; 5+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-27 19:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	byungchul.park, Josh Triplett, Lai Jiangshan, linux-doc,
	Mathieu Desnoyers, Paul E. McKenney, rcu, Steven Rostedt,
	kernel-team

This test runs kfree_rcu() in a loop to measure performance of the new
kfree_rcu() batching functionality.

The following table shows results when booting with arguments:
rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1

In addition, rcuperf.kfree_no_batch is used to toggle the batching of
kfree_rcu()s for a test run.

patch applied		GPs	time (seconds)
 yes			1732	14.5
 no			9133 	11.5

On a 16 CPU system with the above boot parameters, we see that the total
number of grace periods that elapse during the test drops from 9133 when
not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
flood itself slows down a bit when batching, though, as shown.
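
For reference, one way such a run might be launched under qemu (the
invocation below is purely illustrative; the image path, -smp count, and
remaining flags are assumptions, not part of this patch):

	qemu-system-x86_64 -nographic -smp 16 -kernel bzImage \
		-append "rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 \
			 rcuperf.kfree_rcu_test=1 rcuperf.kfree_no_batch=0"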

Note that the active memory consumption during the kfree_rcu() flood
does increase to around 200-250MB due to the batching (from around 50MB
without batching). However, this memory consumption is relatively
constant. In other words, the system is able to keep up with the
kfree_rcu() load. The memory consumption comes down considerably if
KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80.
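
The drain interval itself is a compile-time constant defined in the
batching patch rather than here; its assumed form is something like:

	/* Hypothetical sketch of the tunable discussed above; the actual
	 * definition lives in the kfree_rcu() batching patch. */
	#define KFREE_DRAIN_JIFFIES (HZ / 50)	/* ~20 ms between drains */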

Also, when running the test, please disable CONFIG_DEBUG_PREEMPT and
CONFIG_PROVE_RCU for realistic comparisons with/without batching.
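
In .config terms, an illustrative fragment of how those options should
appear when disabled:

	# CONFIG_DEBUG_PREEMPT is not set
	# CONFIG_PROVE_RCU is not set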

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/kernel-parameters.txt         |  17 ++
 kernel/rcu/rcuperf.c                          | 181 +++++++++++++++++-
 2 files changed, 190 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 79b983bedcaa..24fe8aefb12c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3896,6 +3896,23 @@
 			test until boot completes in order to avoid
 			interference.
 
+	rcuperf.kfree_rcu_test= [KNL]
+			Set to measure performance of kfree_rcu() flooding.
+
+	rcuperf.kfree_nthreads= [KNL]
+			The number of threads running loops of kfree_rcu().
+
+	rcuperf.kfree_alloc_num= [KNL]
+			Number of allocations and frees done in an iteration.
+
+	rcuperf.kfree_loops= [KNL]
+			Number of loops doing rcuperf.kfree_alloc_num number
+			of allocations and frees.
+
+	rcuperf.kfree_no_batch= [KNL]
+			Use the non-batching (less efficient) version of kfree_rcu().
+			This is useful for comparing with the batched version.
+
 	rcuperf.nreaders= [KNL]
 			Set number of RCU readers.  The value -1 selects
 			N, where N is the number of CPUs.  A value
diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 5f884d560384..c1e25fd10f2a 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -86,6 +86,7 @@ torture_param(bool, shutdown, RCUPERF_SHUTDOWN,
 	      "Shutdown at end of performance tests.");
 torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
+torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() perf test?");
 
 static char *perf_type = "rcu";
 module_param(perf_type, charp, 0444);
@@ -105,8 +106,8 @@ static atomic_t n_rcu_perf_writer_finished;
 static wait_queue_head_t shutdown_wq;
 static u64 t_rcu_perf_writer_started;
 static u64 t_rcu_perf_writer_finished;
-static unsigned long b_rcu_perf_writer_started;
-static unsigned long b_rcu_perf_writer_finished;
+static unsigned long b_rcu_gp_test_started;
+static unsigned long b_rcu_gp_test_finished;
 static DEFINE_PER_CPU(atomic_t, n_async_inflight);
 
 #define MAX_MEAS 10000
@@ -378,10 +379,10 @@ rcu_perf_writer(void *arg)
 	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
 		t_rcu_perf_writer_started = t;
 		if (gp_exp) {
-			b_rcu_perf_writer_started =
+			b_rcu_gp_test_started =
 				cur_ops->exp_completed() / 2;
 		} else {
-			b_rcu_perf_writer_started = cur_ops->get_gp_seq();
+			b_rcu_gp_test_started = cur_ops->get_gp_seq();
 		}
 	}
 
@@ -429,10 +430,10 @@ rcu_perf_writer(void *arg)
 				PERFOUT_STRING("Test complete");
 				t_rcu_perf_writer_finished = t;
 				if (gp_exp) {
-					b_rcu_perf_writer_finished =
+					b_rcu_gp_test_finished =
 						cur_ops->exp_completed() / 2;
 				} else {
-					b_rcu_perf_writer_finished =
+					b_rcu_gp_test_finished =
 						cur_ops->get_gp_seq();
 				}
 				if (shutdown) {
@@ -515,8 +516,8 @@ rcu_perf_cleanup(void)
 			 t_rcu_perf_writer_finished -
 			 t_rcu_perf_writer_started,
 			 ngps,
-			 rcuperf_seq_diff(b_rcu_perf_writer_finished,
-					  b_rcu_perf_writer_started));
+			 rcuperf_seq_diff(b_rcu_gp_test_finished,
+					  b_rcu_gp_test_started));
 		for (i = 0; i < nrealwriters; i++) {
 			if (!writer_durations)
 				break;
@@ -584,6 +585,167 @@ rcu_perf_shutdown(void *arg)
 	return -EINVAL;
 }
 
+/*
+ * kfree_rcu() performance tests: Start a kfree_rcu() loop on all CPUs for number
+ * of iterations and measure total time and number of GP for all iterations to complete.
+ */
+
+torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
+torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
+torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
+torture_param(int, kfree_no_batch, 0, "Use the non-batching (slower) version of kfree_rcu().");
+
+static struct task_struct **kfree_reader_tasks;
+static int kfree_nrealthreads;
+static atomic_t n_kfree_perf_thread_started;
+static atomic_t n_kfree_perf_thread_ended;
+
+struct kfree_obj {
+	char kfree_obj[8];
+	struct rcu_head rh;
+};
+
+static int
+kfree_perf_thread(void *arg)
+{
+	int i, loop = 0;
+	long me = (long)arg;
+	struct kfree_obj *alloc_ptr;
+	u64 start_time, end_time;
+
+	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
+	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
+	set_user_nice(current, MAX_NICE);
+
+	start_time = ktime_get_mono_fast_ns();
+
+	if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
+		if (gp_exp)
+			b_rcu_gp_test_started = cur_ops->exp_completed() / 2;
+		else
+			b_rcu_gp_test_started = cur_ops->get_gp_seq();
+	}
+
+	do {
+		for (i = 0; i < kfree_alloc_num; i++) {
+			alloc_ptr = kmalloc(sizeof(struct kfree_obj), GFP_KERNEL);
+			if (!alloc_ptr)
+				return -ENOMEM;
+
+			if (!kfree_no_batch) {
+				kfree_rcu(alloc_ptr, rh);
+			} else {
+				rcu_callback_t cb;
+
+				cb = (rcu_callback_t)(unsigned long)offsetof(struct kfree_obj, rh);
+				kfree_call_rcu_nobatch(&(alloc_ptr->rh), cb);
+			}
+		}
+
+		cond_resched();
+	} while (!torture_must_stop() && ++loop < kfree_loops);
+
+	if (atomic_inc_return(&n_kfree_perf_thread_ended) >= kfree_nrealthreads) {
+		end_time = ktime_get_mono_fast_ns();
+
+		if (gp_exp)
+			b_rcu_gp_test_finished = cur_ops->exp_completed() / 2;
+		else
+			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
+
+		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
+		       (unsigned long long)(end_time - start_time), kfree_loops,
+		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
+		if (shutdown) {
+			smp_mb(); /* Assign before wake. */
+			wake_up(&shutdown_wq);
+		}
+	}
+
+	torture_kthread_stopping("kfree_perf_thread");
+	return 0;
+}
+
+static void
+kfree_perf_cleanup(void)
+{
+	int i;
+
+	if (torture_cleanup_begin())
+		return;
+
+	if (kfree_reader_tasks) {
+		for (i = 0; i < kfree_nrealthreads; i++)
+			torture_stop_kthread(kfree_perf_thread,
+					     kfree_reader_tasks[i]);
+		kfree(kfree_reader_tasks);
+	}
+
+	torture_cleanup_end();
+}
+
+/*
+ * shutdown kthread.  Just waits to be awakened, then shuts down system.
+ */
+static int
+kfree_perf_shutdown(void *arg)
+{
+	do {
+		wait_event(shutdown_wq,
+			   atomic_read(&n_kfree_perf_thread_ended) >=
+			   kfree_nrealthreads);
+	} while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
+
+	smp_mb(); /* Wake before output. */
+
+	kfree_perf_cleanup();
+	kernel_power_off();
+	return -EINVAL;
+}
+
+static int __init
+kfree_perf_init(void)
+{
+	long i;
+	int firsterr = 0;
+
+	kfree_nrealthreads = compute_real(kfree_nthreads);
+	/* Start up the kthreads. */
+	if (shutdown) {
+		init_waitqueue_head(&shutdown_wq);
+		firsterr = torture_create_kthread(kfree_perf_shutdown, NULL,
+						  shutdown_task);
+		if (firsterr)
+			goto unwind;
+		schedule_timeout_uninterruptible(1);
+	}
+
+	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
+			       GFP_KERNEL);
+	if (kfree_reader_tasks == NULL) {
+		firsterr = -ENOMEM;
+		goto unwind;
+	}
+
+	for (i = 0; i < kfree_nrealthreads; i++) {
+		firsterr = torture_create_kthread(kfree_perf_thread, (void *)i,
+						  kfree_reader_tasks[i]);
+		if (firsterr)
+			goto unwind;
+	}
+
+	while (atomic_read(&n_kfree_perf_thread_started) < kfree_nrealthreads)
+		schedule_timeout_uninterruptible(1);
+
+	torture_init_end();
+	return 0;
+
+unwind:
+	torture_init_end();
+	kfree_perf_cleanup();
+	return firsterr;
+}
+
 static int __init
 rcu_perf_init(void)
 {
@@ -616,6 +778,9 @@ rcu_perf_init(void)
 	if (cur_ops->init)
 		cur_ops->init();
 
+	if (kfree_rcu_test)
+		return kfree_perf_init();
+
 	nrealwriters = compute_real(nwriters);
 	nrealreaders = compute_real(nreaders);
 	atomic_set(&n_rcu_perf_reader_started, 0);
-- 
2.23.0.187.g17f5b7556c-goog



* Re: [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests
  2019-08-27 19:01 [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests Joel Fernandes (Google)
@ 2019-08-28 21:12 ` Paul E. McKenney
  2019-08-29 20:56   ` Joel Fernandes
  0 siblings, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2019-08-28 21:12 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, byungchul.park, Josh Triplett, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, rcu, Steven Rostedt, kernel-team

On Tue, Aug 27, 2019 at 03:01:55PM -0400, Joel Fernandes (Google) wrote:
> This test runs kfree_rcu() in a loop to measure performance of the new
> kfree_rcu() batching functionality.
> 
> The following table shows results when booting with arguments:
> rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1
> 
> In addition, rcuperf.kfree_no_batch is used to toggle the batching of
> kfree_rcu()s for a test run.
> 
> patch applied		GPs	time (seconds)
>  yes			1732	14.5
>  no			9133 	11.5

This is really "rcuperf.kfree_no_batch" rather than "patch applied", right?
(Yes, we did discuss this last time around, but this table combined with
the prior paragraph is still ambiguous.)  Please make it unambiguous.
One way to do that is as follows:

------------------------------------------------------------------------

The following table shows results when booting with arguments:
rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1  rcuperf.kfree_no_batch=X

rcuperf.kfree_no_batch=X    # Grace Periods	Test Duration (s)
 X=1 (old behavior)              9133                 11.5
 X=0 (new behavior)              1732                 14.5

------------------------------------------------------------------------

> On a 16 CPU system with the above boot parameters, we see that the total
> number of grace periods that elapse during the test drops from 9133 when
> not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
> flood itself slows down a bit when batching, though, as shown.

This last sentence would be more clear as something like: "However,
use of batching increases the duration of the kfree_rcu()-flood test."

> Note that the active memory consumption during the kfree_rcu() flood
> does increase to around 200-250MB due to the batching (from around 50MB
> without batching). However, this memory consumption is relatively
> constant. In other words, the system is able to keep up with the
> kfree_rcu() load. The memory consumption comes down considerably if
> KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80.

That would be a decrease rather than an increase in KFREE_DRAIN_JIFFIES,
correct?

This would also be a good place to mention that a later patch will
decrease consumption, but that is strictly optional.  However, you did
introduce the topic of changing KFREE_DRAIN_JIFFIES, so if a later patch
changes this value, this would be an excellent place to mention this.

> Also, when running the test, please disable CONFIG_DEBUG_PREEMPT and
> CONFIG_PROVE_RCU for realistic comparisons with/without batching.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

And the code is much better!  Just one duplication-avoidance nit below.
Plus a thought for a future patch.

							Thanx, Paul

> ---
>  .../admin-guide/kernel-parameters.txt         |  17 ++
>  kernel/rcu/rcuperf.c                          | 181 +++++++++++++++++-
>  2 files changed, 190 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 79b983bedcaa..24fe8aefb12c 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3896,6 +3896,23 @@
>  			test until boot completes in order to avoid
>  			interference.
>  
> +	rcuperf.kfree_rcu_test= [KNL]
> +			Set to measure performance of kfree_rcu() flooding.
> +
> +	rcuperf.kfree_nthreads= [KNL]
> +			The number of threads running loops of kfree_rcu().
> +
> +	rcuperf.kfree_alloc_num= [KNL]
> +			Number of allocations and frees done in an iteration.
> +
> +	rcuperf.kfree_loops= [KNL]
> +			Number of loops doing rcuperf.kfree_alloc_num number
> +			of allocations and frees.
> +
> +	rcuperf.kfree_no_batch= [KNL]
> +			Use the non-batching (less efficient) version of kfree_rcu().
> +			This is useful for comparing with the batched version.
> +
>  	rcuperf.nreaders= [KNL]
>  			Set number of RCU readers.  The value -1 selects
>  			N, where N is the number of CPUs.  A value
> diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> index 5f884d560384..c1e25fd10f2a 100644
> --- a/kernel/rcu/rcuperf.c
> +++ b/kernel/rcu/rcuperf.c
> @@ -86,6 +86,7 @@ torture_param(bool, shutdown, RCUPERF_SHUTDOWN,
>  	      "Shutdown at end of performance tests.");
>  torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
>  torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
> +torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() perf test?");
>  
>  static char *perf_type = "rcu";
>  module_param(perf_type, charp, 0444);
> @@ -105,8 +106,8 @@ static atomic_t n_rcu_perf_writer_finished;
>  static wait_queue_head_t shutdown_wq;
>  static u64 t_rcu_perf_writer_started;
>  static u64 t_rcu_perf_writer_finished;
> -static unsigned long b_rcu_perf_writer_started;
> -static unsigned long b_rcu_perf_writer_finished;
> +static unsigned long b_rcu_gp_test_started;
> +static unsigned long b_rcu_gp_test_finished;
>  static DEFINE_PER_CPU(atomic_t, n_async_inflight);
>  
>  #define MAX_MEAS 10000
> @@ -378,10 +379,10 @@ rcu_perf_writer(void *arg)
>  	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
>  		t_rcu_perf_writer_started = t;
>  		if (gp_exp) {
> -			b_rcu_perf_writer_started =
> +			b_rcu_gp_test_started =
>  				cur_ops->exp_completed() / 2;
>  		} else {
> -			b_rcu_perf_writer_started = cur_ops->get_gp_seq();
> +			b_rcu_gp_test_started = cur_ops->get_gp_seq();
>  		}
>  	}
>  
> @@ -429,10 +430,10 @@ rcu_perf_writer(void *arg)
>  				PERFOUT_STRING("Test complete");
>  				t_rcu_perf_writer_finished = t;
>  				if (gp_exp) {
> -					b_rcu_perf_writer_finished =
> +					b_rcu_gp_test_finished =
>  						cur_ops->exp_completed() / 2;
>  				} else {
> -					b_rcu_perf_writer_finished =
> +					b_rcu_gp_test_finished =
>  						cur_ops->get_gp_seq();
>  				}
>  				if (shutdown) {
> @@ -515,8 +516,8 @@ rcu_perf_cleanup(void)
>  			 t_rcu_perf_writer_finished -
>  			 t_rcu_perf_writer_started,
>  			 ngps,
> -			 rcuperf_seq_diff(b_rcu_perf_writer_finished,
> -					  b_rcu_perf_writer_started));
> +			 rcuperf_seq_diff(b_rcu_gp_test_finished,
> +					  b_rcu_gp_test_started));
>  		for (i = 0; i < nrealwriters; i++) {
>  			if (!writer_durations)
>  				break;
> @@ -584,6 +585,167 @@ rcu_perf_shutdown(void *arg)
>  	return -EINVAL;
>  }
>  
> +/*
> + * kfree_rcu() performance tests: Start a kfree_rcu() loop on all CPUs for number
> + * of iterations and measure total time and number of GP for all iterations to complete.
> + */
> +
> +torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
> +torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
> +torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
> +torture_param(int, kfree_no_batch, 0, "Use the non-batching (slower) version of kfree_rcu().");
> +
> +static struct task_struct **kfree_reader_tasks;
> +static int kfree_nrealthreads;
> +static atomic_t n_kfree_perf_thread_started;
> +static atomic_t n_kfree_perf_thread_ended;
> +
> +struct kfree_obj {
> +	char kfree_obj[8];
> +	struct rcu_head rh;
> +};
> +
> +static int
> +kfree_perf_thread(void *arg)
> +{
> +	int i, loop = 0;
> +	long me = (long)arg;
> +	struct kfree_obj *alloc_ptr;
> +	u64 start_time, end_time;
> +
> +	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
> +	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> +	set_user_nice(current, MAX_NICE);
> +
> +	start_time = ktime_get_mono_fast_ns();
> +
> +	if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
> +		if (gp_exp)
> +			b_rcu_gp_test_started = cur_ops->exp_completed() / 2;

At some point, it would be good to use the new grace-period
sequence-counter functions (rcuperf_seq_diff(), for example) instead of
the open-coded division by 2.  I freely admit that you are just copying
my obsolete hack in this case, so not needed in this patch.

> +		else
> +			b_rcu_gp_test_started = cur_ops->get_gp_seq();
> +	}
> +
> +	do {
> +		for (i = 0; i < kfree_alloc_num; i++) {
> +			alloc_ptr = kmalloc(sizeof(struct kfree_obj), GFP_KERNEL);
> +			if (!alloc_ptr)
> +				return -ENOMEM;
> +
> +			if (!kfree_no_batch) {
> +				kfree_rcu(alloc_ptr, rh);
> +			} else {
> +				rcu_callback_t cb;
> +
> +				cb = (rcu_callback_t)(unsigned long)offsetof(struct kfree_obj, rh);
> +				kfree_call_rcu_nobatch(&(alloc_ptr->rh), cb);
> +			}
> +		}

Nice, much simpler than the earlier version!

> +		cond_resched();
> +	} while (!torture_must_stop() && ++loop < kfree_loops);
> +
> +	if (atomic_inc_return(&n_kfree_perf_thread_ended) >= kfree_nrealthreads) {
> +		end_time = ktime_get_mono_fast_ns();
> +
> +		if (gp_exp)
> +			b_rcu_gp_test_finished = cur_ops->exp_completed() / 2;

Same here on open-coded division by 2.

> +		else
> +			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
> +
> +		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
> +		       (unsigned long long)(end_time - start_time), kfree_loops,
> +		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
> +		if (shutdown) {
> +			smp_mb(); /* Assign before wake. */
> +			wake_up(&shutdown_wq);
> +		}
> +	}
> +
> +	torture_kthread_stopping("kfree_perf_thread");
> +	return 0;
> +}
> +
> +static void
> +kfree_perf_cleanup(void)
> +{
> +	int i;
> +
> +	if (torture_cleanup_begin())
> +		return;
> +
> +	if (kfree_reader_tasks) {
> +		for (i = 0; i < kfree_nrealthreads; i++)
> +			torture_stop_kthread(kfree_perf_thread,
> +					     kfree_reader_tasks[i]);
> +		kfree(kfree_reader_tasks);
> +	}
> +
> +	torture_cleanup_end();
> +}
> +
> +/*
> + * shutdown kthread.  Just waits to be awakened, then shuts down system.
> + */
> +static int
> +kfree_perf_shutdown(void *arg)
> +{
> +	do {
> +		wait_event(shutdown_wq,
> +			   atomic_read(&n_kfree_perf_thread_ended) >=
> +			   kfree_nrealthreads);
> +	} while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
> +
> +	smp_mb(); /* Wake before output. */
> +
> +	kfree_perf_cleanup();
> +	kernel_power_off();
> +	return -EINVAL;

These last four lines should be combined with those of
rcu_perf_shutdown().  Actually, you could fold the two functions together
with only a pair of arguments and two one-line wrapper functions, which
would be even better.
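
A rough sketch of that folding (names here are illustrative only; the
differing wait conditions and cleanup functions become the two arguments,
with shutdown_wq already shared between the two tests):

	static bool kfree_shutdown_done(void)
	{
		return atomic_read(&n_kfree_perf_thread_ended) >=
		       kfree_nrealthreads;
	}

	static int perf_shutdown_common(bool (*done)(void),
					void (*cleanup)(void))
	{
		do {
			/* Re-check after each wakeup until all threads end. */
			wait_event(shutdown_wq, done());
		} while (!done());

		smp_mb(); /* Wake before output. */
		cleanup();
		kernel_power_off();
		return -EINVAL;
	}

	static int kfree_perf_shutdown(void *arg)
	{
		return perf_shutdown_common(kfree_shutdown_done,
					    kfree_perf_cleanup);
	}

rcu_perf_shutdown() would get an equally short wrapper and done() helper
of its own.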

> +}
> +
> +static int __init
> +kfree_perf_init(void)
> +{
> +	long i;
> +	int firsterr = 0;
> +
> +	kfree_nrealthreads = compute_real(kfree_nthreads);
> +	/* Start up the kthreads. */
> +	if (shutdown) {
> +		init_waitqueue_head(&shutdown_wq);
> +		firsterr = torture_create_kthread(kfree_perf_shutdown, NULL,
> +						  shutdown_task);
> +		if (firsterr)
> +			goto unwind;
> +		schedule_timeout_uninterruptible(1);
> +	}
> +
> +	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
> +			       GFP_KERNEL);
> +	if (kfree_reader_tasks == NULL) {
> +		firsterr = -ENOMEM;
> +		goto unwind;
> +	}
> +
> +	for (i = 0; i < kfree_nrealthreads; i++) {
> +		firsterr = torture_create_kthread(kfree_perf_thread, (void *)i,
> +						  kfree_reader_tasks[i]);
> +		if (firsterr)
> +			goto unwind;
> +	}
> +
> +	while (atomic_read(&n_kfree_perf_thread_started) < kfree_nrealthreads)
> +		schedule_timeout_uninterruptible(1);
> +
> +	torture_init_end();
> +	return 0;
> +
> +unwind:
> +	torture_init_end();
> +	kfree_perf_cleanup();
> +	return firsterr;
> +}
> +
>  static int __init
>  rcu_perf_init(void)
>  {
> @@ -616,6 +778,9 @@ rcu_perf_init(void)
>  	if (cur_ops->init)
>  		cur_ops->init();
>  
> +	if (kfree_rcu_test)
> +		return kfree_perf_init();
> +
>  	nrealwriters = compute_real(nwriters);
>  	nrealreaders = compute_real(nreaders);
>  	atomic_set(&n_rcu_perf_reader_started, 0);
> -- 
> 2.23.0.187.g17f5b7556c-goog
> 


* Re: [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests
  2019-08-28 21:12 ` Paul E. McKenney
@ 2019-08-29 20:56   ` Joel Fernandes
  2019-09-03 20:08     ` Paul E. McKenney
  0 siblings, 1 reply; 5+ messages in thread
From: Joel Fernandes @ 2019-08-29 20:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, byungchul.park, Josh Triplett, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, rcu, Steven Rostedt

On Wed, Aug 28, 2019 at 02:12:26PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 27, 2019 at 03:01:55PM -0400, Joel Fernandes (Google) wrote:
> > This test runs kfree_rcu() in a loop to measure performance of the new
> > kfree_rcu() batching functionality.
> > 
> > The following table shows results when booting with arguments:
> > rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1
> > 
> > In addition, rcuperf.kfree_no_batch is used to toggle the batching of
> > kfree_rcu()s for a test run.
> > 
> > patch applied		GPs	time (seconds)
> >  yes			1732	14.5
> >  no			9133 	11.5
> 
> This is really "rcuperf.kfree_no_batch" rather than "patch applied", right?
> (Yes, we did discuss this last time around, but this table combined with
> the prior paragraph is still ambiguous.)  Please make it unambiguous.
> One way to do that is as follows:
> 
> ------------------------------------------------------------------------
> 
> The following table shows results when booting with arguments:
> rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1  rcuperf.kfree_no_batch=X
> 
> rcuperf.kfree_no_batch=X    # Grace Periods	Test Duration (s)
>  X=1 (old behavior)              9133                 11.5
>  X=0 (new behavior)              1732                 14.5

Yes, you are right, will fix. The reason I changed it to 'patch applied' is
that the last patch in the series removes kfree_no_batch. Thanks!
 
> > On a 16 CPU system with the above boot parameters, we see that the total
> > number of grace periods that elapse during the test drops from 9133 when
> > not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
> > flood itself slows down a bit when batching, though, as shown.
> 
> This last sentence would be more clear as something like: "However,
> use of batching increases the duration of the kfree_rcu()-flood test."
> 
> > Note that the active memory consumption during the kfree_rcu() flood
> > does increase to around 200-250MB due to the batching (from around 50MB
> > without batching). However, this memory consumption is relatively
> > constant. In other words, the system is able to keep up with the
> > kfree_rcu() load. The memory consumption comes down considerably if
> > KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80.
> 
> That would be a decrease rather than an increase in KFREE_DRAIN_JIFFIES,
> correct?
> 
> This would also be a good place to mention that a later patch will
> decrease consumption, but that is strictly optional.  However, you did
> introduce the topic of changing KFREE_DRAIN_JIFFIES, so if a later patch
> changes this value, this would be an excellent place to mention this.

Fixed.

[snip]
> > +/*
> > + * kfree_rcu() performance tests: Start a kfree_rcu() loop on all CPUs for number
> > + * of iterations and measure total time and number of GP for all iterations to complete.
> > + */
> > +
> > +torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
> > +torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
> > +torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
> > +torture_param(int, kfree_no_batch, 0, "Use the non-batching (slower) version of kfree_rcu().");
> > +
> > +static struct task_struct **kfree_reader_tasks;
> > +static int kfree_nrealthreads;
> > +static atomic_t n_kfree_perf_thread_started;
> > +static atomic_t n_kfree_perf_thread_ended;
> > +
> > +struct kfree_obj {
> > +	char kfree_obj[8];
> > +	struct rcu_head rh;
> > +};
> > +
> > +static int
> > +kfree_perf_thread(void *arg)
> > +{
> > +	int i, loop = 0;
> > +	long me = (long)arg;
> > +	struct kfree_obj *alloc_ptr;
> > +	u64 start_time, end_time;
> > +
> > +	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
> > +	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> > +	set_user_nice(current, MAX_NICE);
> > +
> > +	start_time = ktime_get_mono_fast_ns();
> > +
> > +	if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
> > +		if (gp_exp)
> > +			b_rcu_gp_test_started = cur_ops->exp_completed() / 2;
> 
> At some point, it would be good to use the new grace-period
> sequence-counter functions (rcuperf_seq_diff(), for example) instead of
> the open-coded division by 2.  I freely admit that you are just copying
> my obsolete hack in this case, so not needed in this patch.

But I am using rcu_seq_diff() below in the pr_alert().

Anyway, I agree this can be a follow-on since this pattern is borrowed from
another part of rcuperf. However, I am also confused about the pattern
itself.

If I understand, you are doing the "/ 2" because expedited_sequence
progresses by 2 for every expedited batch.

But does rcu_seq_diff() really work on these expedited GP numbers, and will
it be immune to changes in RCU_SEQ_STATE_MASK? Sorry for the silly questions,
but admittedly I have not looked too much yet into expedited RCU so I could
be missing the point.

> > +		else
> > +			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
> > +
> > +		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
> > +		       (unsigned long long)(end_time - start_time), kfree_loops,
> > +		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
> > +		if (shutdown) {
> > +			smp_mb(); /* Assign before wake. */
> > +			wake_up(&shutdown_wq);
> > +		}
> > +	}
> > +
> > +	torture_kthread_stopping("kfree_perf_thread");
> > +	return 0;
> > +}
> > +
> > +static void
> > +kfree_perf_cleanup(void)
> > +{
> > +	int i;
> > +
> > +	if (torture_cleanup_begin())
> > +		return;
> > +
> > +	if (kfree_reader_tasks) {
> > +		for (i = 0; i < kfree_nrealthreads; i++)
> > +			torture_stop_kthread(kfree_perf_thread,
> > +					     kfree_reader_tasks[i]);
> > +		kfree(kfree_reader_tasks);
> > +	}
> > +
> > +	torture_cleanup_end();
> > +}
> > +
> > +/*
> > + * shutdown kthread.  Just waits to be awakened, then shuts down system.
> > + */
> > +static int
> > +kfree_perf_shutdown(void *arg)
> > +{
> > +	do {
> > +		wait_event(shutdown_wq,
> > +			   atomic_read(&n_kfree_perf_thread_ended) >=
> > +			   kfree_nrealthreads);
> > +	} while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
> > +
> > +	smp_mb(); /* Wake before output. */
> > +
> > +	kfree_perf_cleanup();
> > +	kernel_power_off();
> > +	return -EINVAL;
> 
> These last four lines should be combined with those of
> rcu_perf_shutdown().  Actually, you could fold the two functions together
> with only a pair of arguments and two one-line wrapper functions, which
> would be even better.

But the cleanup() function is different in the 2 cases and will have to be
passed in as a function pointer. I believe we discussed this last review as
well.

thanks,

 - Joel



* Re: [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests
  2019-08-29 20:56   ` Joel Fernandes
@ 2019-09-03 20:08     ` Paul E. McKenney
  2019-09-03 20:21       ` Joel Fernandes
  0 siblings, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2019-09-03 20:08 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, byungchul.park, Josh Triplett, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, rcu, Steven Rostedt

On Thu, Aug 29, 2019 at 04:56:37PM -0400, Joel Fernandes wrote:
> On Wed, Aug 28, 2019 at 02:12:26PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > +static int
> > > +kfree_perf_thread(void *arg)
> > > +{
> > > +	int i, loop = 0;
> > > +	long me = (long)arg;
> > > +	struct kfree_obj *alloc_ptr;
> > > +	u64 start_time, end_time;
> > > +
> > > +	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
> > > +	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> > > +	set_user_nice(current, MAX_NICE);
> > > +
> > > +	start_time = ktime_get_mono_fast_ns();
> > > +
> > > +	if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
> > > +		if (gp_exp)
> > > +			b_rcu_gp_test_started = cur_ops->exp_completed() / 2;
> > 
> > At some point, it would be good to use the new grace-period
> > sequence-counter functions (rcuperf_seq_diff(), for example) instead of
> > the open-coded division by 2.  I freely admit that you are just copying
> > my obsolete hack in this case, so not needed in this patch.
> 
> But I am using rcu_seq_diff() below in the pr_alert().
> 
> Anyway, I agree this can be a follow-on since this pattern is borrowed from
> another part of rcuperf. However, I am also confused about the pattern
> itself.
> 
> If I understand, you are doing the "/ 2" because expedited_sequence
> progresses by 2 for every expedited batch.
> 
> But does rcu_seq_diff() really work on these expedited GP numbers, and will
> it be immune to changes in RCU_SEQ_STATE_MASK? Sorry for the silly questions,
> but admittedly I have not looked too much yet into expedited RCU so I could
> be missing the point.

Yes, expedited grace periods use the common sequence-number functions.
Oddly enough, normal grace periods were the last to make use of these.
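
For reference, the encoding is roughly as follows (simplified from
kernel/rcu/rcu.h; the real rcu_seq_diff() additionally copes with grace
periods in progress at either snapshot):

	#define RCU_SEQ_CTR_SHIFT	2
	#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)

	/* The grace-period count lives above the state bits. */
	static inline unsigned long rcu_seq_ctr(unsigned long s)
	{
		return s >> RCU_SEQ_CTR_SHIFT;
	}

	/* The low bits track the phase of the current grace period. */
	static inline int rcu_seq_state(unsigned long s)
	{
		return s & RCU_SEQ_STATE_MASK;
	}

Since the diff helpers strip the state bits by shifting, callers remain
correct even if RCU_SEQ_STATE_MASK changes.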

> > > +		else
> > > +			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
> > > +
> > > +		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
> > > +		       (unsigned long long)(end_time - start_time), kfree_loops,
> > > +		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
> > > +		if (shutdown) {
> > > +			smp_mb(); /* Assign before wake. */
> > > +			wake_up(&shutdown_wq);
> > > +		}
> > > +	}
> > > +
> > > +	torture_kthread_stopping("kfree_perf_thread");
> > > +	return 0;
> > > +}
> > > +
> > > +static void
> > > +kfree_perf_cleanup(void)
> > > +{
> > > +	int i;
> > > +
> > > +	if (torture_cleanup_begin())
> > > +		return;
> > > +
> > > +	if (kfree_reader_tasks) {
> > > +		for (i = 0; i < kfree_nrealthreads; i++)
> > > +			torture_stop_kthread(kfree_perf_thread,
> > > +					     kfree_reader_tasks[i]);
> > > +		kfree(kfree_reader_tasks);
> > > +	}
> > > +
> > > +	torture_cleanup_end();
> > > +}
> > > +
> > > +/*
> > > + * shutdown kthread.  Just waits to be awakened, then shuts down system.
> > > + */
> > > +static int
> > > +kfree_perf_shutdown(void *arg)
> > > +{
> > > +	do {
> > > +		wait_event(shutdown_wq,
> > > +			   atomic_read(&n_kfree_perf_thread_ended) >=
> > > +			   kfree_nrealthreads);
> > > +	} while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
> > > +
> > > +	smp_mb(); /* Wake before output. */
> > > +
> > > +	kfree_perf_cleanup();
> > > +	kernel_power_off();
> > > +	return -EINVAL;
> > 
> > These last four lines should be combined with those of
> > rcu_perf_shutdown().  Actually, you could fold the two functions together
> > with only a pair of arguments and two one-line wrapper functions, which
> > would be even better.
> 
> But the cleanup() function is different in the 2 cases and will have to be
> passed in as a function pointer. I believe we discussed this last review as
> well.

Calling through a pointer should be a non-problem in this case.  We are
nowhere near a fastpath.

							Thanx, Paul


* Re: [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests
  2019-09-03 20:08     ` Paul E. McKenney
@ 2019-09-03 20:21       ` Joel Fernandes
  0 siblings, 0 replies; 5+ messages in thread
From: Joel Fernandes @ 2019-09-03 20:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, byungchul.park, Josh Triplett, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, rcu, Steven Rostedt

On Tue, Sep 03, 2019 at 01:08:49PM -0700, Paul E. McKenney wrote:
> On Thu, Aug 29, 2019 at 04:56:37PM -0400, Joel Fernandes wrote:
> > On Wed, Aug 28, 2019 at 02:12:26PM -0700, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > > +static int
> > > > +kfree_perf_thread(void *arg)
> > > > +{
> > > > +	int i, loop = 0;
> > > > +	long me = (long)arg;
> > > > +	struct kfree_obj *alloc_ptr;
> > > > +	u64 start_time, end_time;
> > > > +
> > > > +	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
> > > > +	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> > > > +	set_user_nice(current, MAX_NICE);
> > > > +
> > > > +	start_time = ktime_get_mono_fast_ns();
> > > > +
> > > > +	if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
> > > > +		if (gp_exp)
> > > > +			b_rcu_gp_test_started = cur_ops->exp_completed() / 2;
> > > 
> > > At some point, it would be good to use the new grace-period
> > > sequence-counter functions (rcuperf_seq_diff(), for example) instead of
> > > the open-coded division by 2.  I freely admit that you are just copying
> > > my obsolete hack in this case, so not needed in this patch.
> > 
> > But I am using rcu_seq_diff() below in the pr_alert().
> > 
> > Anyway, I agree this can be a follow-on since this pattern is borrowed from
> > another part of rcuperf. However, I am also confused about the pattern
> > itself.
> > 
> > If I understand, you are doing the "/ 2" because expedited_sequence
> > progresses by 2 for every expedited batch.
> > 
> > But does rcu_seq_diff() really work on these expedited GP numbers, and will
> > it be immune to changes in RCU_SEQ_STATE_MASK? Sorry for the silly questions,
> > but admittedly I have not looked too much yet into expedited RCU so I could
> > be missing the point.
> 
> Yes, expedited grace periods use the common sequence-number functions.
> Oddly enough, normal grace periods were the last to make use of these.

Ok, will clean up in a follow-on patch as we agreed, so as not to block this series.

> > > > +		else
> > > > +			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
> > > > +
> > > > +		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
> > > > +		       (unsigned long long)(end_time - start_time), kfree_loops,
> > > > +		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
> > > > +		if (shutdown) {
> > > > +			smp_mb(); /* Assign before wake. */
> > > > +			wake_up(&shutdown_wq);
> > > > +		}
> > > > +	}
> > > > +
> > > > +	torture_kthread_stopping("kfree_perf_thread");
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void
> > > > +kfree_perf_cleanup(void)
> > > > +{
> > > > +	int i;
> > > > +
> > > > +	if (torture_cleanup_begin())
> > > > +		return;
> > > > +
> > > > +	if (kfree_reader_tasks) {
> > > > +		for (i = 0; i < kfree_nrealthreads; i++)
> > > > +			torture_stop_kthread(kfree_perf_thread,
> > > > +					     kfree_reader_tasks[i]);
> > > > +		kfree(kfree_reader_tasks);
> > > > +	}
> > > > +
> > > > +	torture_cleanup_end();
> > > > +}
> > > > +
> > > > +/*
> > > > + * shutdown kthread.  Just waits to be awakened, then shuts down system.
> > > > + */
> > > > +static int
> > > > +kfree_perf_shutdown(void *arg)
> > > > +{
> > > > +	do {
> > > > +		wait_event(shutdown_wq,
> > > > +			   atomic_read(&n_kfree_perf_thread_ended) >=
> > > > +			   kfree_nrealthreads);
> > > > +	} while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
> > > > +
> > > > +	smp_mb(); /* Wake before output. */
> > > > +
> > > > +	kfree_perf_cleanup();
> > > > +	kernel_power_off();
> > > > +	return -EINVAL;
> > > 
> > > These last four lines should be combined with those of
> > > rcu_perf_shutdown().  Actually, you could fold the two functions together
> > > with only a pair of arguments and two one-line wrapper functions, which
> > > would be even better.
> > 
> > But the cleanup() function is different in the 2 cases and will have to be
> > passed in as a function pointer. I believe we discussed this last review as
> > well.
> 
> Calling through a pointer should be a non-problem in this case.  We are
> nowhere near a fastpath.

There's also the wait_event() condition, which is different. I don't see how
we can combine this; it would likely end up with about the same number of
lines of code. Can this function stay as is? Pretty please :-D. Or perhaps,
if you feel it is a trivial cleanup, could you do it so I can understand what
you mean? Sorry!

thanks!

 - Joel


