* [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
@ 2019-08-13 17:00 Joel Fernandes (Google)
  2019-08-13 17:00 ` [PATCH v3 2/2] rcuperf: Add kfree_rcu performance Tests Joel Fernandes (Google)
  2019-08-13 19:07 ` [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching Paul E. McKenney
  0 siblings, 2 replies; 8+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-13 17:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Rao Shoaib, max.byungchul.park, byungchul.park, kernel-team,
	kernel-team, Andrew Morton, Davidlohr Bueso, Jonathan Corbet,
	Josh Triplett, Kees Cook, Lai Jiangshan, linux-doc,
	Mathieu Desnoyers, Mauro Carvalho Chehab, Paul E. McKenney, rcu,
	Steven Rostedt, Thomas Gleixner

Recently, a discussion about the stability and performance of a system
involving a high rate of kfree_rcu() calls surfaced on the list [1],
which led to another discussion about how to prepare for this situation.

This patch adds basic batching support for kfree_rcu(). It is "basic"
because we do none of the slab management, dynamic allocation, code
moving or any of the other things, some of which previous attempts did
[2]. These fancier improvements can be follow-up patches, and there are
different ideas being discussed in that regard.

Torture tests follow in the next patch and show improvements of around
400% in reduction of the number of grace periods on a 16 CPU system. More
details and test data are in that patch.

This is an effort to start simple, and build up from there. In the
future, an extension to use kfree_bulk and possibly per-slab batching
could be done to further improve performance due to cache locality and
slab-specific bulk-free optimizations. By using an array of pointers,
the worker thread processing the work would need to read less data,
since it would no longer need to deal with the large rcu_head structures.
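
For illustration only, here is a rough sketch of what such a follow-up
could look like (the names, the array size and the draining policy below
are made up for this sketch; only kfree_bulk() itself is the existing
slab API):

	#define KFREE_BULK_MAX 64	/* arbitrary size, for the sketch only */

	struct kfree_bulk_batch {
		void *ptrs[KFREE_BULK_MAX];	/* plain object pointers, no rcu_head walk */
		int nr;
	};

	static void kfree_bulk_batch_drain(struct kfree_bulk_batch *b)
	{
		/* Hand the whole array to the slab bulk-free path at once. */
		if (b->nr) {
			kfree_bulk(b->nr, b->ptrs);
			b->nr = 0;
		}
	}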

This patch has an implication for rcu_barrier(). Since kfree_rcu() calls
can be batched and may not yet have been handed to the RCU machinery (in
fact, the monitor may not even have run yet to do the queue_rcu_work()),
there seems to be no easy way of implementing rcu_barrier() so that it
waits for kfree_rcu()s that have already been issued. This means that a
kfree_rcu() followed by an rcu_barrier() does not imply that the memory
will be freed once rcu_barrier() returns.
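
For illustration, consider the following hypothetical caller (the object
and its rcu_head field "rh" are made up; this is not code from this patch):

	kfree_rcu(obj, rh);	/* obj only sits in the per-CPU batch for now      */
	rcu_barrier();		/* waits only for callbacks already queued with RCU */
	/* obj may still not be freed here, because the monitor may not yet
	 * have handed the batch to queue_rcu_work(). */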

Another implication is higher active memory usage (although not runaway
growth) until the kfree_rcu() flooding ends, in comparison to the
non-batching case. More details about this are in the second patch,
which adds an rcuperf test.

Finally, in the near future we will get rid of the kfree_rcu special
casing within RCU, such as in rcu_do_batch, and switch everything to just
batching. Currently we don't do that since the timer subsystem is not yet
up at that point and we cannot schedule the kfree_rcu() monitor because
the timer subsystem's locks are not initialized. That would also mean
getting rid of kfree_call_rcu_nobatch() entirely.

[1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@kernel.org
[2] https://lkml.org/lkml/2017/12/19/824

Cc: Rao Shoaib <rao.shoaib@oracle.com>
Cc: max.byungchul.park@gmail.com
Cc: byungchul.park@lge.com
Cc: kernel-team@android.com
Cc: kernel-team@lge.com
Co-developed-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
v2->v3: Just some code comment changes thanks to Byungchul.

RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
                   Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce OOM occurrence.
                   Removed sleeps in rcuperf test, just using cond_resched() in loop.
                   Better code comments ;)

 .../admin-guide/kernel-parameters.txt         |  17 ++
 include/linux/rcutiny.h                       |   5 +
 include/linux/rcutree.h                       |   1 +
 kernel/rcu/tree.c                             | 204 +++++++++++++++++-
 4 files changed, 221 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7ccd158b3894..a9156ca5de24 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3895,6 +3895,23 @@
 			test until boot completes in order to avoid
 			interference.
 
+	rcuperf.kfree_rcu_test= [KNL]
+			Set to measure performance of kfree_rcu() flooding.
+
+	rcuperf.kfree_nthreads= [KNL]
+			The number of threads running loops of kfree_rcu().
+
+	rcuperf.kfree_alloc_num= [KNL]
+			Number of allocations and frees done in an iteration.
+
+	rcuperf.kfree_loops= [KNL]
+			Number of loops doing rcuperf.kfree_alloc_num number
+			of allocations and frees.
+
+	rcuperf.kfree_no_batch= [KNL]
+			Use the non-batching (slower) version of kfree_rcu.
+			This is useful for comparing with the batched version.
+
 	rcuperf.nreaders= [KNL]
 			Set number of RCU readers.  The value -1 selects
 			N, where N is the number of CPUs.  A value
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 8e727f57d814..383f2481750f 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -39,6 +39,11 @@ static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 	call_rcu(head, func);
 }
 
+static inline void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
+{
+	call_rcu(head, func);
+}
+
 void rcu_qs(void);
 
 static inline void rcu_softirq_qs(void)
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 735601ac27d3..7e38b39ec634 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -34,6 +34,7 @@ static inline void rcu_virt_note_context_switch(int cpu)
 
 void synchronize_rcu_expedited(void);
 void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
+void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func);
 
 void rcu_barrier(void);
 bool rcu_eqs_special_set(int cpu);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..3e62bbb0e1fa 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2593,17 +2593,195 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 }
 EXPORT_SYMBOL_GPL(call_rcu);
 
+
+/* Maximum number of jiffies to wait before draining a batch. */
+#define KFREE_DRAIN_JIFFIES (HZ / 50)
+
 /*
- * Queue an RCU callback for lazy invocation after a grace period.
- * This will likely be later named something like "call_rcu_lazy()",
- * but this change will require some way of tagging the lazy RCU
- * callbacks in the list of pending callbacks. Until then, this
- * function may only be called from __kfree_rcu().
+ * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
+ * kfree(s) is queued for freeing after a grace period, right away.
  */
-void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
+struct kfree_rcu_cpu {
+	/* The rcu_work node for queuing work with queue_rcu_work(). The work
+	 * is done after a grace period.
+	 */
+	struct rcu_work rcu_work;
+
+	/* The list of objects being queued in a batch but are not yet
+	 * scheduled to be freed.
+	 */
+	struct rcu_head *head;
+
+	/* The list of objects that have now left ->head and are queued for
+	 * freeing after a grace period.
+	 */
+	struct rcu_head *head_free;
+
+	/* Protect concurrent access to this structure. */
+	spinlock_t lock;
+
+	/* The delayed work that flushes ->head to ->head_free after
+	 * KFREE_DRAIN_JIFFIES. If flushing cannot be done because RCU
+	 * is busy, ->head just continues to grow and we retry flushing later.
+	 */
+	struct delayed_work monitor_work;
+	bool monitor_todo;	/* Is a delayed work pending execution? */
+};
+
+static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc);
+
+/*
+ * This function is invoked in workqueue context after a grace period.
+ * It frees all the objects queued on ->head_free.
+ */
+static void kfree_rcu_work(struct work_struct *work)
+{
+	unsigned long flags;
+	struct rcu_head *head, *next;
+	struct kfree_rcu_cpu *krcp = container_of(to_rcu_work(work),
+					struct kfree_rcu_cpu, rcu_work);
+
+	spin_lock_irqsave(&krcp->lock, flags);
+	head = krcp->head_free;
+	krcp->head_free = NULL;
+	spin_unlock_irqrestore(&krcp->lock, flags);
+
+	/*
+	 * The head is detached and not referenced from anywhere, so lockless
+	 * access is Ok.
+	 */
+	for (; head; head = next) {
+		next = head->next;
+		head->next = NULL;
+		/* Could be possible to optimize with kfree_bulk in future */
+		__rcu_reclaim(rcu_state.name, head);
+		cond_resched_tasks_rcu_qs();
+	}
+}
+
+/*
+ * Schedule the kfree batch RCU work to run in workqueue context after a GP.
+ *
+ * This function is invoked by kfree_rcu_monitor() when the KFREE_DRAIN_JIFFIES
+ * timeout has been reached.
+ */
+static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
+{
+	lockdep_assert_held(&krcp->lock);
+
+	/* If a previous RCU batch work is already in progress, we cannot queue
+	 * another one, just refuse the optimization and it will be retried
+	 * again in KFREE_DRAIN_JIFFIES time.
+	 */
+	if (krcp->head_free)
+		return false;
+
+	krcp->head_free = krcp->head;
+	krcp->head = NULL;
+	INIT_RCU_WORK(&krcp->rcu_work, kfree_rcu_work);
+	queue_rcu_work(system_wq, &krcp->rcu_work);
+
+	return true;
+}
+
+static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
+				   unsigned long flags)
+{
+	/* Flush ->head to ->head_free, all objects on ->head_free will be
+	 * kfree'd after a grace period.
+	 */
+	if (queue_kfree_rcu_work(krcp)) {
+		/* Success! Our job is done here. */
+		spin_unlock_irqrestore(&krcp->lock, flags);
+		return;
+	}
+
+	/* Previous batch did not get free yet, let us try again soon. */
+	if (krcp->monitor_todo == false) {
+		schedule_delayed_work_on(smp_processor_id(),
+				&krcp->monitor_work,  KFREE_DRAIN_JIFFIES);
+		krcp->monitor_todo = true;
+	}
+	spin_unlock_irqrestore(&krcp->lock, flags);
+}
+
+/*
+ * This function is invoked after the KFREE_DRAIN_JIFFIES timeout has elapsed,
+ * so it drains the specified kfree_rcu_cpu structure's ->head list.
+ */
+static void kfree_rcu_monitor(struct work_struct *work)
+{
+	bool todo;
+	unsigned long flags;
+	struct kfree_rcu_cpu *krcp = container_of(work, struct kfree_rcu_cpu,
+						 monitor_work.work);
+
+	spin_lock_irqsave(&krcp->lock, flags);
+	todo = krcp->monitor_todo;
+	krcp->monitor_todo = false;
+	if (todo)
+		kfree_rcu_drain_unlock(krcp, flags);
+	else
+		spin_unlock_irqrestore(&krcp->lock, flags);
+}
+
+/*
+ * This version of kfree_call_rcu does not do batching of kfree_rcu() requests.
+ * Used only by rcuperf torture test for comparison with kfree_rcu_batch().
+ */
+void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
 {
 	__call_rcu(head, func, -1, 1);
 }
+EXPORT_SYMBOL_GPL(kfree_call_rcu_nobatch);
+
+/*
+ * Queue a request for lazy invocation of kfree() after a grace period.
+ *
+ * Each kfree_call_rcu() request is added to a batch. The batch will be drained
+ * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch
+ * will be kfree'd in workqueue context. This allows us to:
+ *
+ * 1. Batch requests together to reduce the number of grace periods during
+ * heavy kfree_rcu() load.
+ *
+ * 2. In the future, makes it possible to use kfree_bulk() on a large number of
+ * kfree_rcu() requests thus reducing the per-object overhead of kfree() and
+ * also reducing cache misses.
+ */
+void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	unsigned long flags;
+	struct kfree_rcu_cpu *krcp;
+	bool monitor_todo;
+
+	/* kfree_call_rcu() batching requires timers to be up. If the scheduler
+	 * is not yet up, just skip batching and do non-batched kfree_call_rcu().
+	 */
+	if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING)
+		return kfree_call_rcu_nobatch(head, func);
+
+	local_irq_save(flags);
+	krcp = this_cpu_ptr(&krc);
+
+	spin_lock(&krcp->lock);
+
+	/* Queue the kfree but don't yet schedule the batch. */
+	head->func = func;
+	head->next = krcp->head;
+	krcp->head = head;
+
+	/* Schedule monitor for timely drain after KFREE_DRAIN_JIFFIES. */
+	monitor_todo = krcp->monitor_todo;
+	krcp->monitor_todo = true;
+	spin_unlock(&krcp->lock);
+
+	if (!monitor_todo) {
+		schedule_delayed_work_on(smp_processor_id(),
+				&krcp->monitor_work,  KFREE_DRAIN_JIFFIES);
+	}
+	local_irq_restore(flags);
+}
 EXPORT_SYMBOL_GPL(kfree_call_rcu);
 
 /*
@@ -3455,10 +3633,24 @@ static void __init rcu_dump_rcu_node_tree(void)
 struct workqueue_struct *rcu_gp_wq;
 struct workqueue_struct *rcu_par_gp_wq;
 
+static void __init kfree_rcu_batch_init(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
+
+		spin_lock_init(&krcp->lock);
+		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
+	}
+}
+
 void __init rcu_init(void)
 {
 	int cpu;
 
+	kfree_rcu_batch_init();
+
 	rcu_early_boot_tests();
 
 	rcu_bootup_announce();
-- 
2.23.0.rc1.153.gdeed80330f-goog


* [PATCH v3 2/2] rcuperf: Add kfree_rcu performance Tests
  2019-08-13 17:00 [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching Joel Fernandes (Google)
@ 2019-08-13 17:00 ` Joel Fernandes (Google)
  2019-08-13 19:07 ` [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching Paul E. McKenney
  1 sibling, 0 replies; 8+ messages in thread
From: Joel Fernandes (Google) @ 2019-08-13 17:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Andrew Morton, byungchul.park, Davidlohr Bueso, Jonathan Corbet,
	Josh Triplett, Kees Cook, kernel-team, kernel-team,
	Lai Jiangshan, linux-doc, Mathieu Desnoyers,
	Mauro Carvalho Chehab, max.byungchul.park, Paul E. McKenney,
	Rao Shoaib, rcu, Steven Rostedt, Thomas Gleixner

This test runs kfree_rcu in a loop to measure performance of the new
kfree_rcu batching functionality.

The following table shows results when booting with arguments:
rcuperf.kfree_loops=200000 rcuperf.kfree_alloc_num=1000 rcuperf.kfree_rcu_test=1

In addition, rcuperf.kfree_no_batch is used to toggle the batching of
kfree_rcu()s for a test run.

rcuperf.kfree_no_batch	GPs	time (seconds)
 0 (default)		1732	15.9
 1			9133 	14.5

Note that the results are the same for the case:
1. Patch is not applied and rcuperf.kfree_no_batch=0
2. Patch is applied     and rcuperf.kfree_no_batch=1

On a 16 CPU system with the above boot parameters, we see that the total
number of grace periods that elapse during the test drops from 9133 when
not batching to 1732 when batching (roughly a 5.3x reduction). The
kfree_rcu() flood itself slows down a bit when batching, though, as
shown. This is likely due to rcuperf threads contending with the
additional worker threads that are now running both before (the monitor)
and after (the work done to kfree()) the grace period.

Note that the active memory consumption during the kfree_rcu() flood
does increase to around 300-400MB due to the batching (from around 50MB
without batching). However, this memory consumption is relatively
constant and is just an effect of the buffering. In other words, the
system is able to keep up with the kfree_rcu() load.

Also, when running the test, please disable CONFIG_DEBUG_PREEMPT and
CONFIG_PROVE_RCU for realistic comparisons with/without batching.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcuperf.c | 189 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 181 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 7a6890b23c5f..70d6ac19cbff 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -86,6 +86,7 @@ torture_param(bool, shutdown, RCUPERF_SHUTDOWN,
 	      "Shutdown at end of performance tests.");
 torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
+torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu perf test?");
 
 static char *perf_type = "rcu";
 module_param(perf_type, charp, 0444);
@@ -105,8 +106,8 @@ static atomic_t n_rcu_perf_writer_finished;
 static wait_queue_head_t shutdown_wq;
 static u64 t_rcu_perf_writer_started;
 static u64 t_rcu_perf_writer_finished;
-static unsigned long b_rcu_perf_writer_started;
-static unsigned long b_rcu_perf_writer_finished;
+static unsigned long b_rcu_gp_test_started;
+static unsigned long b_rcu_gp_test_finished;
 static DEFINE_PER_CPU(atomic_t, n_async_inflight);
 
 static int rcu_perf_writer_state;
@@ -379,10 +380,10 @@ rcu_perf_writer(void *arg)
 	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
 		t_rcu_perf_writer_started = t;
 		if (gp_exp) {
-			b_rcu_perf_writer_started =
+			b_rcu_gp_test_started =
 				cur_ops->exp_completed() / 2;
 		} else {
-			b_rcu_perf_writer_started = cur_ops->get_gp_seq();
+			b_rcu_gp_test_started = cur_ops->get_gp_seq();
 		}
 	}
 
@@ -435,10 +436,10 @@ rcu_perf_writer(void *arg)
 				PERFOUT_STRING("Test complete");
 				t_rcu_perf_writer_finished = t;
 				if (gp_exp) {
-					b_rcu_perf_writer_finished =
+					b_rcu_gp_test_finished =
 						cur_ops->exp_completed() / 2;
 				} else {
-					b_rcu_perf_writer_finished =
+					b_rcu_gp_test_finished =
 						cur_ops->get_gp_seq();
 				}
 				if (shutdown) {
@@ -523,8 +524,8 @@ rcu_perf_cleanup(void)
 			 t_rcu_perf_writer_finished -
 			 t_rcu_perf_writer_started,
 			 ngps,
-			 rcuperf_seq_diff(b_rcu_perf_writer_finished,
-					  b_rcu_perf_writer_started));
+			 rcuperf_seq_diff(b_rcu_gp_test_finished,
+					  b_rcu_gp_test_started));
 		for (i = 0; i < nrealwriters; i++) {
 			if (!writer_durations)
 				break;
@@ -592,6 +593,175 @@ rcu_perf_shutdown(void *arg)
 	return -EINVAL;
 }
 
+/*
+ * kfree_rcu performance tests: Start a kfree_rcu loop on all CPUs for number
+ * of iterations and measure total time and number of GP for all iterations to complete.
+ */
+
+torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
+torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
+torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
+torture_param(int, kfree_no_batch, 0, "Use the non-batching (slower) version of kfree_rcu.");
+
+static struct task_struct **kfree_reader_tasks;
+static int kfree_nrealthreads;
+static atomic_t n_kfree_perf_thread_started;
+static atomic_t n_kfree_perf_thread_ended;
+
+struct kfree_obj {
+	char kfree_obj[8];
+	struct rcu_head rh;
+};
+
+static int
+kfree_perf_thread(void *arg)
+{
+	int i, loop = 0;
+	long me = (long)arg;
+	struct kfree_obj **alloc_ptrs;
+	u64 start_time, end_time;
+
+	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
+	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
+	set_user_nice(current, MAX_NICE);
+
+	alloc_ptrs = (struct kfree_obj **)kmalloc(sizeof(struct kfree_obj *) * kfree_alloc_num,
+						  GFP_KERNEL);
+	if (!alloc_ptrs)
+		return -ENOMEM;
+
+	start_time = ktime_get_mono_fast_ns();
+
+	if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
+		if (gp_exp)
+			b_rcu_gp_test_started = cur_ops->exp_completed() / 2;
+		else
+			b_rcu_gp_test_started = cur_ops->get_gp_seq();
+	}
+
+	do {
+		for (i = 0; i < kfree_alloc_num; i++) {
+			alloc_ptrs[i] = kmalloc(sizeof(struct kfree_obj), GFP_KERNEL);
+			if (!alloc_ptrs[i])
+				return -ENOMEM;
+		}
+
+		for (i = 0; i < kfree_alloc_num; i++) {
+			if (!kfree_no_batch) {
+				kfree_rcu(alloc_ptrs[i], rh);
+			} else {
+				rcu_callback_t cb;
+
+				cb = (rcu_callback_t)(unsigned long)offsetof(struct kfree_obj, rh);
+				kfree_call_rcu_nobatch(&(alloc_ptrs[i]->rh), cb);
+			}
+		}
+
+		cond_resched();
+	} while (!torture_must_stop() && ++loop < kfree_loops);
+
+	if (atomic_inc_return(&n_kfree_perf_thread_ended) >= kfree_nrealthreads) {
+		end_time = ktime_get_mono_fast_ns();
+
+		if (gp_exp)
+			b_rcu_gp_test_finished = cur_ops->exp_completed() / 2;
+		else
+			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
+
+		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
+		       (unsigned long long)(end_time - start_time), kfree_loops,
+		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
+		if (shutdown) {
+			smp_mb(); /* Assign before wake. */
+			wake_up(&shutdown_wq);
+		}
+	}
+
+	kfree(alloc_ptrs);
+	torture_kthread_stopping("kfree_perf_thread");
+	return 0;
+}
+
+static void
+kfree_perf_cleanup(void)
+{
+	int i;
+
+	if (torture_cleanup_begin())
+		return;
+
+	if (kfree_reader_tasks) {
+		for (i = 0; i < kfree_nrealthreads; i++)
+			torture_stop_kthread(kfree_perf_thread,
+					     kfree_reader_tasks[i]);
+		kfree(kfree_reader_tasks);
+	}
+
+	torture_cleanup_end();
+}
+
+/*
+ * shutdown kthread.  Just waits to be awakened, then shuts down system.
+ */
+static int
+kfree_perf_shutdown(void *arg)
+{
+	do {
+		wait_event(shutdown_wq,
+			   atomic_read(&n_kfree_perf_thread_ended) >=
+			   kfree_nrealthreads);
+	} while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
+
+	smp_mb(); /* Wake before output. */
+
+	kfree_perf_cleanup();
+	kernel_power_off();
+	return -EINVAL;
+}
+
+static int __init
+kfree_perf_init(void)
+{
+	long i;
+	int firsterr = 0;
+
+	kfree_nrealthreads = compute_real(kfree_nthreads);
+	/* Start up the kthreads. */
+	if (shutdown) {
+		init_waitqueue_head(&shutdown_wq);
+		firsterr = torture_create_kthread(kfree_perf_shutdown, NULL,
+						  shutdown_task);
+		if (firsterr)
+			goto unwind;
+		schedule_timeout_uninterruptible(1);
+	}
+
+	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
+			       GFP_KERNEL);
+	if (kfree_reader_tasks == NULL) {
+		firsterr = -ENOMEM;
+		goto unwind;
+	}
+
+	for (i = 0; i < kfree_nrealthreads; i++) {
+		firsterr = torture_create_kthread(kfree_perf_thread, (void *)i,
+						  kfree_reader_tasks[i]);
+		if (firsterr)
+			goto unwind;
+	}
+
+	while (atomic_read(&n_kfree_perf_thread_started) < kfree_nrealthreads)
+		schedule_timeout_uninterruptible(1);
+
+	torture_init_end();
+	return 0;
+
+unwind:
+	torture_init_end();
+	kfree_perf_cleanup();
+	return firsterr;
+}
+
 static int __init
 rcu_perf_init(void)
 {
@@ -624,6 +794,9 @@ rcu_perf_init(void)
 	if (cur_ops->init)
 		cur_ops->init();
 
+	if (kfree_rcu_test)
+		return kfree_perf_init();
+
 	nrealwriters = compute_real(nwriters);
 	nrealreaders = compute_real(nreaders);
 	atomic_set(&n_rcu_perf_reader_started, 0);
-- 
2.23.0.rc1.153.gdeed80330f-goog



* Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
  2019-08-13 17:00 [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching Joel Fernandes (Google)
  2019-08-13 17:00 ` [PATCH v3 2/2] rcuperf: Add kfree_rcu performance Tests Joel Fernandes (Google)
@ 2019-08-13 19:07 ` Paul E. McKenney
  2019-08-14 14:38   ` Joel Fernandes
  1 sibling, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2019-08-13 19:07 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Rao Shoaib, max.byungchul.park, byungchul.park,
	kernel-team, kernel-team, Andrew Morton, Davidlohr Bueso,
	Jonathan Corbet, Josh Triplett, Kees Cook, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, Mauro Carvalho Chehab, rcu,
	Steven Rostedt, Thomas Gleixner

On Tue, Aug 13, 2019 at 01:00:45PM -0400, Joel Fernandes (Google) wrote:
> Recently a discussion about stability and performance of a system
> involving a high rate of kfree_rcu() calls surfaced on the list [1]
> which led to another discussion how to prepare for this situation.

Looks much improved!  More questions and comments inline.

							Thanx, Paul

> This patch adds basic batching support for kfree_rcu(). It is "basic"
> because we do none of the slab management, dynamic allocation, code
> moving or any of the other things, some of which previous attempts did
> [2]. These fancier improvements can be follow-up patches and there are
> different ideas being discussed in those regards.
> 
> Torture tests follow in the next patch and show improvements of around
> 400% in reduction of number of  grace periods on a 16 CPU system. More

s/400% in reduction/a 5x reduction/

Oddly enough, a reduction of more than 100% would require a negative
number of grace periods.  RCU being what it is, although this seems
unlikely, I hesitate to rule it out.  ;-)
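
(For reference, using the rcuperf numbers from the second patch: 9133/1732
is about 5.3, hence "roughly a 5x reduction".)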

> details and test data are in that patch.
> 
> This is an effort to start simple, and build up from there. In the
> future, an extension to use kfree_bulk and possibly per-slab batching
> could be done to further improve performance due to cache-locality and
> slab-specific bulk free optimizations. By using an array of pointers,
> the worker thread processing the work would need to read lesser data
> since it does not need to deal with large rcu_head(s) any longer.

This should be combined with the second paragraph -- they both discuss
possible follow-on work.

> There is an implication with rcu_barrier() with this patch. Since the
> kfree_rcu() calls can be batched, and may not be handed yet to the RCU
> machinery in fact, the monitor may not have even run yet to do the
> queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
> to wait for those kfree_rcu()s that are already made. So this means a
> kfree_rcu() followed by an rcu_barrier() does not imply that memory will
> be freed once rcu_barrier() returns.

The usual approach (should kfree_rcu_barrier() in fact be needed) is to
record the address of the most recent block being kfree_rcu()'d on
each CPU, count the number of recorded addresses, and have the
kfree() side check the address, and do the usual atomic_dec_and_test()
and wakeup dance.  This gets a bit more crazy should the kfree_rcu()
requests be grouped per slab, of course!  So yes, good to avoid in
the meantime.
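
Very roughly, and only as a sketch (hypothetical helper names, no locking
or CPU-hotplug handling, and ignoring the per-slab grouping case), the
idea would look something like this:

	struct kfree_rcu_barrier_state {
		struct rcu_head *marker[NR_CPUS];	/* most recent block per CPU */
		atomic_t outstanding;			/* markers not yet kfree()'d */
		struct completion done;
	};

	static struct kfree_rcu_barrier_state krb;

	/* Hypothetical kfree_rcu_barrier(): snapshot the markers, then wait. */
	static void kfree_rcu_barrier_sketch(void)
	{
		int cpu;

		init_completion(&krb.done);
		atomic_set(&krb.outstanding, 1);	/* bias against hitting zero early */
		for_each_possible_cpu(cpu) {
			struct rcu_head *last = READ_ONCE(per_cpu_ptr(&krc, cpu)->head);

			if (last) {
				krb.marker[cpu] = last;
				atomic_inc(&krb.outstanding);
			}
		}
		if (!atomic_dec_and_test(&krb.outstanding))	/* drop the bias */
			wait_for_completion(&krb.done);
	}

	/* Hook on the kfree() side, called as each marked rcu_head is reclaimed. */
	static void kfree_rcu_barrier_note_free(struct rcu_head *head, int cpu)
	{
		if (krb.marker[cpu] == head &&
		    atomic_dec_and_test(&krb.outstanding))
			complete(&krb.done);		/* the usual wakeup dance */
	}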

> Another implication is higher active memory usage (although not
> run-away..) until the kfree_rcu() flooding ends, in comparison to
> without batching. More details about this are in the second patch which
> adds an rcuperf test.

But this is adjustable, at least via patching at build time, via
KFREE_DRAIN_JIFFIES, right?

> Finally, in the near future we will get rid of kfree_rcu special casing
> within RCU such as in rcu_do_batch and switch everything to just
> batching. Currently we don't do that since timer subsystem is not yet up
> and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
> lock are not initialized. That would also mean getting rid of
> kfree_call_rcu_nobatch() entirely.

Please consistently put "()" after function names.

> [1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@kernel.org
> [2] https://lkml.org/lkml/2017/12/19/824
> 
> Cc: Rao Shoaib <rao.shoaib@oracle.com>
> Cc: max.byungchul.park@gmail.com
> Cc: byungchul.park@lge.com
> Cc: kernel-team@android.com
> Cc: kernel-team@lge.com
> Co-developed-by: Byungchul Park <byungchul.park@lge.com>
> Signed-off-by: Byungchul Park <byungchul.park@lge.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> ---
> v2->v3: Just some code comment changes thanks to Byungchul.
> 
> RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
>                    Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce OOM occurrence.
>                    Removed sleeps in rcuperf test, just using cond_resched()in loop.
>                    Better code comments ;)
> 
>  .../admin-guide/kernel-parameters.txt         |  17 ++
>  include/linux/rcutiny.h                       |   5 +
>  include/linux/rcutree.h                       |   1 +
>  kernel/rcu/tree.c                             | 204 +++++++++++++++++-
>  4 files changed, 221 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 7ccd158b3894..a9156ca5de24 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3895,6 +3895,23 @@
>  			test until boot completes in order to avoid
>  			interference.
>  
> +	rcuperf.kfree_rcu_test= [KNL]
> +			Set to measure performance of kfree_rcu() flooding.
> +
> +	rcuperf.kfree_nthreads= [KNL]
> +			The number of threads running loops of kfree_rcu().
> +
> +	rcuperf.kfree_alloc_num= [KNL]
> +			Number of allocations and frees done in an iteration.
> +
> +	rcuperf.kfree_loops= [KNL]
> +			Number of loops doing rcuperf.kfree_alloc_num number
> +			of allocations and frees.
> +
> +	rcuperf.kfree_no_batch= [KNL]
> +			Use the non-batching (slower) version of kfree_rcu.
> +			This is useful for comparing with the batched version.
> +
>  	rcuperf.nreaders= [KNL]
>  			Set number of RCU readers.  The value -1 selects
>  			N, where N is the number of CPUs.  A value

This change to kernel-parameters.txt really needs to be in the
rcuperf patch rather than this one.

> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 8e727f57d814..383f2481750f 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -39,6 +39,11 @@ static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  	call_rcu(head, func);
>  }
>  
> +static inline void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
> +{
> +	call_rcu(head, func);
> +}
> +
>  void rcu_qs(void);
>  
>  static inline void rcu_softirq_qs(void)
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 735601ac27d3..7e38b39ec634 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -34,6 +34,7 @@ static inline void rcu_virt_note_context_switch(int cpu)
>  
>  void synchronize_rcu_expedited(void);
>  void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
> +void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func);
>  
>  void rcu_barrier(void);
>  bool rcu_eqs_special_set(int cpu);
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a14e5fbbea46..3e62bbb0e1fa 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2593,17 +2593,195 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  }
>  EXPORT_SYMBOL_GPL(call_rcu);
>  
> +
> +/* Maximum number of jiffies to wait before draining a batch. */
> +#define KFREE_DRAIN_JIFFIES (HZ / 50)
> +
>  /*
> - * Queue an RCU callback for lazy invocation after a grace period.
> - * This will likely be later named something like "call_rcu_lazy()",
> - * but this change will require some way of tagging the lazy RCU
> - * callbacks in the list of pending callbacks. Until then, this
> - * function may only be called from __kfree_rcu().
> + * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> + * kfree(s) is queued for freeing after a grace period, right away.
>   */
> -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> +struct kfree_rcu_cpu {
> +	/* The rcu_work node for queuing work with queue_rcu_work(). The work
> +	 * is done after a grace period.
> +	 */
> +	struct rcu_work rcu_work;
> +
> +	/* The list of objects being queued in a batch but are not yet
> +	 * scheduled to be freed.
> +	 */
> +	struct rcu_head *head;
> +
> +	/* The list of objects that have now left ->head and are queued for
> +	 * freeing after a grace period.
> +	 */
> +	struct rcu_head *head_free;

So this is not yet the one that does multiple batches concurrently
awaiting grace periods, correct?  Or am I missing something subtle?

If it is not doing the multiple-batch optimization, what does the
400% in the commit log refer to?

> +	/* Protect concurrent access to this structure. */
> +	spinlock_t lock;
> +
> +	/* The delayed work that flushes ->head to ->head_free incase ->head
> +	 * within KFREE_DRAIN_JIFFIES. In case flushing cannot be done if RCU
> +	 * is busy, ->head just continues to grow and we retry flushing later.
> +	 */
> +	struct delayed_work monitor_work;
> +	bool monitor_todo;	/* Is a delayed work pending execution? */
> +};
> +
> +static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc);
> +
> +/*
> + * This function is invoked in workqueue context after a grace period.
> + * It frees all the objects queued on ->head_free.
> + */
> +static void kfree_rcu_work(struct work_struct *work)
> +{
> +	unsigned long flags;
> +	struct rcu_head *head, *next;
> +	struct kfree_rcu_cpu *krcp = container_of(to_rcu_work(work),
> +					struct kfree_rcu_cpu, rcu_work);
> +
> +	spin_lock_irqsave(&krcp->lock, flags);
> +	head = krcp->head_free;
> +	krcp->head_free = NULL;
> +	spin_unlock_irqrestore(&krcp->lock, flags);
> +
> +	/*
> +	 * The head is detached and not referenced from anywhere, so lockless
> +	 * access is Ok.
> +	 */
> +	for (; head; head = next) {
> +		next = head->next;
> +		head->next = NULL;

Why do we need to NULL ->next?  Could produce an expensive cache miss,
and I don't see how it helps anything.  What am I missing here?

> +		/* Could be possible to optimize with kfree_bulk in future */
> +		__rcu_reclaim(rcu_state.name, head);
> +		cond_resched_tasks_rcu_qs();
> +	}
> +}
> +
> +/*
> + * Schedule the kfree batch RCU work to run in workqueue context after a GP.
> + *
> + * This function is invoked by kfree_rcu_monitor() when the KFREE_DRAIN_JIFFIES
> + * timeout has been reached.
> + */
> +static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
> +{
> +	lockdep_assert_held(&krcp->lock);
> +
> +	/* If a previous RCU batch work is already in progress, we cannot queue
> +	 * another one, just refuse the optimization and it will be retried
> +	 * again in KFREE_DRAIN_JIFFIES time.
> +	 */
> +	if (krcp->head_free)
> +		return false;
> +
> +	krcp->head_free = krcp->head;
> +	krcp->head = NULL;
> +	INIT_RCU_WORK(&krcp->rcu_work, kfree_rcu_work);
> +	queue_rcu_work(system_wq, &krcp->rcu_work);
> +
> +	return true;
> +}
> +
> +static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
> +				   unsigned long flags)
> +{
> +	/* Flush ->head to ->head_free, all objects on ->head_free will be
> +	 * kfree'd after a grace period.
> +	 */
> +	if (queue_kfree_rcu_work(krcp)) {
> +		/* Success! Our job is done here. */
> +		spin_unlock_irqrestore(&krcp->lock, flags);
> +		return;
> +	}
> +
> +	/* Previous batch did not get free yet, let us try again soon. */
> +	if (krcp->monitor_todo == false) {
> +		schedule_delayed_work_on(smp_processor_id(),
> +				&krcp->monitor_work,  KFREE_DRAIN_JIFFIES);

Given that we are using a lock, do we need to restrict to the current
CPU?  In any case, we do need to handle the case where the current
CPU has gone offline before we get around to sending a batch off to
wait for a grace period, correct?

> +		krcp->monitor_todo = true;
> +	}
> +	spin_unlock_irqrestore(&krcp->lock, flags);
> +}
> +
> +/*
> + * This function is invoked after the KFREE_DRAIN_JIFFIES timeout has elapsed,
> + * so it drains the specified kfree_rcu_cpu structure's ->head list.
> + */
> +static void kfree_rcu_monitor(struct work_struct *work)
> +{
> +	bool todo;
> +	unsigned long flags;
> +	struct kfree_rcu_cpu *krcp = container_of(work, struct kfree_rcu_cpu,
> +						 monitor_work.work);
> +
> +	spin_lock_irqsave(&krcp->lock, flags);
> +	todo = krcp->monitor_todo;
> +	krcp->monitor_todo = false;
> +	if (todo)
> +		kfree_rcu_drain_unlock(krcp, flags);
> +	else
> +		spin_unlock_irqrestore(&krcp->lock, flags);
> +}
> +
> +/*
> + * This version of kfree_call_rcu does not do batching of kfree_rcu() requests.
> + * Used only by rcuperf torture test for comparison with kfree_rcu_batch().
> + */
> +void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
>  {
>  	__call_rcu(head, func, -1, 1);
>  }
> +EXPORT_SYMBOL_GPL(kfree_call_rcu_nobatch);
> +
> +/*
> + * Queue a request for lazy invocation of kfree() after a grace period.
> + *
> + * Each kfree_call_rcu() request is added to a batch. The batch will be drained
> + * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch
> + * will be kfree'd in workqueue context. This allows us to:
> + *
> + * 1. Batch requests together to reduce the number of grace periods during
> + * heavy kfree_rcu() load.
> + *
> + * 2. In the future, makes it possible to use kfree_bulk() on a large number of
> + * kfree_rcu() requests thus reducing the per-object overhead of kfree() and
> + * also reducing cache misses.
> + */
> +void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> +{
> +	unsigned long flags;
> +	struct kfree_rcu_cpu *krcp;
> +	bool monitor_todo;
> +
> +	/* kfree_call_rcu() batching requires timers to be up. If the scheduler
> +	 * is not yet up, just skip batching and do non-batched kfree_call_rcu().
> +	 */
> +	if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING)
> +		return kfree_call_rcu_nobatch(head, func);
> +
> +	local_irq_save(flags);
> +	krcp = this_cpu_ptr(&krc);
> +
> +	spin_lock(&krcp->lock);
> +
> +	/* Queue the kfree but don't yet schedule the batch. */
> +	head->func = func;
> +	head->next = krcp->head;
> +	krcp->head = head;
> +
> +	/* Schedule monitor for timely drain after KFREE_DRAIN_JIFFIES. */
> +	monitor_todo = krcp->monitor_todo;
> +	krcp->monitor_todo = true;
> +	spin_unlock(&krcp->lock);
> +
> +	if (!monitor_todo) {
> +		schedule_delayed_work_on(smp_processor_id(),
> +				&krcp->monitor_work,  KFREE_DRAIN_JIFFIES);
> +	}
> +	local_irq_restore(flags);
> +}
>  EXPORT_SYMBOL_GPL(kfree_call_rcu);
>  
>  /*
> @@ -3455,10 +3633,24 @@ static void __init rcu_dump_rcu_node_tree(void)
>  struct workqueue_struct *rcu_gp_wq;
>  struct workqueue_struct *rcu_par_gp_wq;
>  
> +static void __init kfree_rcu_batch_init(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> +
> +		spin_lock_init(&krcp->lock);
> +		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> +	}
> +}
> +
>  void __init rcu_init(void)
>  {
>  	int cpu;
>  
> +	kfree_rcu_batch_init();
> +
>  	rcu_early_boot_tests();
>  
>  	rcu_bootup_announce();
> -- 
> 2.23.0.rc1.153.gdeed80330f-goog
> 



* Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
  2019-08-13 19:07 ` [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching Paul E. McKenney
@ 2019-08-14 14:38   ` Joel Fernandes
  2019-08-14 17:22     ` Joel Fernandes
  0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2019-08-14 14:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, Rao Shoaib, max.byungchul.park, byungchul.park,
	kernel-team, kernel-team, Andrew Morton, Davidlohr Bueso,
	Jonathan Corbet, Josh Triplett, Kees Cook, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, Mauro Carvalho Chehab, rcu,
	Steven Rostedt, Thomas Gleixner

On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote:
[snip]
> > This patch adds basic batching support for kfree_rcu(). It is "basic"
> > because we do none of the slab management, dynamic allocation, code
> > moving or any of the other things, some of which previous attempts did
> > [2]. These fancier improvements can be follow-up patches and there are
> > different ideas being discussed in those regards.
> > 
> > Torture tests follow in the next patch and show improvements of around
> > 400% in reduction of number of  grace periods on a 16 CPU system. More
> 
> s/400% in reduction/a 5x reduction/

Ok. That's more clear.

> > details and test data are in that patch.
> > 
> > This is an effort to start simple, and build up from there. In the
> > future, an extension to use kfree_bulk and possibly per-slab batching
> > could be done to further improve performance due to cache-locality and
> > slab-specific bulk free optimizations. By using an array of pointers,
> > the worker thread processing the work would need to read lesser data
> > since it does not need to deal with large rcu_head(s) any longer.
> 
> This should be combined with the second paragraph -- they both discuss
> possible follow-on work.

Ack.

> > There is an implication with rcu_barrier() with this patch. Since the
> > kfree_rcu() calls can be batched, and may not be handed yet to the RCU
> > machinery in fact, the monitor may not have even run yet to do the
> > queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
> > to wait for those kfree_rcu()s that are already made. So this means a
> > kfree_rcu() followed by an rcu_barrier() does not imply that memory will
> > be freed once rcu_barrier() returns.
> 
> The usual approach (should kfree_rcu_barrier() in fact be needed) is to
> record the address of the most recent block being kfree_rcu()'d on
> each CPU, count the number of recorded addresses, and have the
> kfree() side check the address, and do the usual atomic_dec_and_test()
> and wakeup dance.  This gets a bit more crazy should the kfree_rcu()
> requests be grouped per slab, of course!  So yes, good to avoid in
> the meantime.

Good idea!

> > Another implication is higher active memory usage (although not
> > run-away..) until the kfree_rcu() flooding ends, in comparison to
> > without batching. More details about this are in the second patch which
> > adds an rcuperf test.
> 
> But this is adjustable, at least via patching at build time, via
> KFREE_DRAIN_JIFFIES, right?

Yes.

> > Finally, in the near future we will get rid of kfree_rcu special casing
> > within RCU such as in rcu_do_batch and switch everything to just
> > batching. Currently we don't do that since timer subsystem is not yet up
> > and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
> > lock are not initialized. That would also mean getting rid of
> > kfree_call_rcu_nobatch() entirely.
> 
> Please consistently put "()" after function names.

Done.

> > [1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@kernel.org
> > [2] https://lkml.org/lkml/2017/12/19/824
> > 
> > Cc: Rao Shoaib <rao.shoaib@oracle.com>
> > Cc: max.byungchul.park@gmail.com
> > Cc: byungchul.park@lge.com
> > Cc: kernel-team@android.com
> > Cc: kernel-team@lge.com
> > Co-developed-by: Byungchul Park <byungchul.park@lge.com>
> > Signed-off-by: Byungchul Park <byungchul.park@lge.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > ---
> > v2->v3: Just some code comment changes thanks to Byungchul.
> > 
> > RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
> >                    Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce OOM occurrence.
> >                    Removed sleeps in rcuperf test, just using cond_resched()in loop.
> >                    Better code comments ;)
> > 
> >  .../admin-guide/kernel-parameters.txt         |  17 ++
> >  include/linux/rcutiny.h                       |   5 +
> >  include/linux/rcutree.h                       |   1 +
> >  kernel/rcu/tree.c                             | 204 +++++++++++++++++-
> >  4 files changed, 221 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 7ccd158b3894..a9156ca5de24 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3895,6 +3895,23 @@
> >  			test until boot completes in order to avoid
> >  			interference.
> >  
> > +	rcuperf.kfree_rcu_test= [KNL]
> > +			Set to measure performance of kfree_rcu() flooding.
> > +
> > +	rcuperf.kfree_nthreads= [KNL]
> > +			The number of threads running loops of kfree_rcu().
> > +
> > +	rcuperf.kfree_alloc_num= [KNL]
> > +			Number of allocations and frees done in an iteration.
> > +
> > +	rcuperf.kfree_loops= [KNL]
> > +			Number of loops doing rcuperf.kfree_alloc_num number
> > +			of allocations and frees.
> > +
> > +	rcuperf.kfree_no_batch= [KNL]
> > +			Use the non-batching (slower) version of kfree_rcu.
> > +			This is useful for comparing with the batched version.
> > +
> >  	rcuperf.nreaders= [KNL]
> >  			Set number of RCU readers.  The value -1 selects
> >  			N, where N is the number of CPUs.  A value
> 
> This change to kernel-parameters.txt really needs to be in the
> rcuperf patch rather than this one.

Fixed.

FWIW, my new tool (sort of under wraps) to move hunks from one patch to
another in a git tree makes this easy: https://github.com/joelagnel/git-edit
(I hit 'p' to edit patches and it commits back the results when I close the
editor).

[snip]
> > - * Queue an RCU callback for lazy invocation after a grace period.
> > - * This will likely be later named something like "call_rcu_lazy()",
> > - * but this change will require some way of tagging the lazy RCU
> > - * callbacks in the list of pending callbacks. Until then, this
> > - * function may only be called from __kfree_rcu().
> > + * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> > + * kfree(s) is queued for freeing after a grace period, right away.
> >   */
> > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > +struct kfree_rcu_cpu {
> > +	/* The rcu_work node for queuing work with queue_rcu_work(). The work
> > +	 * is done after a grace period.
> > +	 */
> > +	struct rcu_work rcu_work;
> > +
> > +	/* The list of objects being queued in a batch but are not yet
> > +	 * scheduled to be freed.
> > +	 */
> > +	struct rcu_head *head;
> > +
> > +	/* The list of objects that have now left ->head and are queued for
> > +	 * freeing after a grace period.
> > +	 */
> > +	struct rcu_head *head_free;
> 
> So this is not yet the one that does multiple batches concurrently
> awaiting grace periods, correct?  Or am I missing something subtle?

Yes, it is not. I honestly still did not understand that idea, or how it
would improve things. Maybe we can discuss at LPC with pen and paper? But I
think that can also be a follow-up optimization.

> If it is not doing the multiple-batch optimization, what does the
> 400% in the commit log refer to?

The 400% is the reduction in the number of grace periods just with this patch
(single batch processed at a time). The improvement comes from batching: we end
up having to start fewer grace periods while still managing to attend to the
kfree_rcu() load.

There is concurrency, because ->head continues to grow while ->head_free is
being processed.

> > +	/* Protect concurrent access to this structure. */
> > +	spinlock_t lock;
> > +
> > +	/* The delayed work that flushes ->head to ->head_free incase ->head
> > +	 * within KFREE_DRAIN_JIFFIES. In case flushing cannot be done if RCU
> > +	 * is busy, ->head just continues to grow and we retry flushing later.
> > +	 */
> > +	struct delayed_work monitor_work;
> > +	bool monitor_todo;	/* Is a delayed work pending execution? */
> > +};
> > +
> > +static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc);
> > +
> > +/*
> > + * This function is invoked in workqueue context after a grace period.
> > + * It frees all the objects queued on ->head_free.
> > + */
> > +static void kfree_rcu_work(struct work_struct *work)
> > +{
> > +	unsigned long flags;
> > +	struct rcu_head *head, *next;
> > +	struct kfree_rcu_cpu *krcp = container_of(to_rcu_work(work),
> > +					struct kfree_rcu_cpu, rcu_work);
> > +
> > +	spin_lock_irqsave(&krcp->lock, flags);
> > +	head = krcp->head_free;
> > +	krcp->head_free = NULL;
> > +	spin_unlock_irqrestore(&krcp->lock, flags);
> > +
> > +	/*
> > +	 * The head is detached and not referenced from anywhere, so lockless
> > +	 * access is Ok.
> > +	 */
> > +	for (; head; head = next) {
> > +		next = head->next;
> > +		head->next = NULL;
> 
> Why do we need to NULL ->next?  Could produce an expensive cache miss,
> and I don't see how it helps anything.  What am I missing here?

You're right. I will get rid of it; there is no reason to have it.

[snip]
> > +static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
> > +				   unsigned long flags)
> > +{
> > +	/* Flush ->head to ->head_free, all objects on ->head_free will be
> > +	 * kfree'd after a grace period.
> > +	 */
> > +	if (queue_kfree_rcu_work(krcp)) {
> > +		/* Success! Our job is done here. */
> > +		spin_unlock_irqrestore(&krcp->lock, flags);
> > +		return;
> > +	}
> > +
> > +	/* Previous batch did not get free yet, let us try again soon. */
> > +	if (krcp->monitor_todo == false) {
> > +		schedule_delayed_work_on(smp_processor_id(),
> > +				&krcp->monitor_work,  KFREE_DRAIN_JIFFIES);
> 
> Given that we are using a lock, do we need to restrict to the current
> CPU?  In any case, we do need to handle the case where the current
> CPU has gone offline before we get around to sending a batch off to
> wait for a grace period, correct?

The results either way are similar, but I agree it is better not to use the
CPU-specific API and to use the generic one instead, especially because it
tries to queue on the local CPU if possible.

Will respin!

thanks,

 - Joel



* Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
  2019-08-14 14:38   ` Joel Fernandes
@ 2019-08-14 17:22     ` Joel Fernandes
  2019-08-14 18:44       ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2019-08-14 17:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, Rao Shoaib, max.byungchul.park, byungchul.park,
	kernel-team, kernel-team, Andrew Morton, Davidlohr Bueso,
	Jonathan Corbet, Josh Triplett, Kees Cook, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, Mauro Carvalho Chehab, rcu,
	Steven Rostedt, Thomas Gleixner

On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote:
> On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote:
 [snip]
> > > - * Queue an RCU callback for lazy invocation after a grace period.
> > > - * This will likely be later named something like "call_rcu_lazy()",
> > > - * but this change will require some way of tagging the lazy RCU
> > > - * callbacks in the list of pending callbacks. Until then, this
> > > - * function may only be called from __kfree_rcu().
> > > + * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> > > + * kfree(s) is queued for freeing after a grace period, right away.
> > >   */
> > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > > +struct kfree_rcu_cpu {
> > > +	/* The rcu_work node for queuing work with queue_rcu_work(). The work
> > > +	 * is done after a grace period.
> > > +	 */
> > > +	struct rcu_work rcu_work;
> > > +
> > > +	/* The list of objects being queued in a batch but are not yet
> > > +	 * scheduled to be freed.
> > > +	 */
> > > +	struct rcu_head *head;
> > > +
> > > +	/* The list of objects that have now left ->head and are queued for
> > > +	 * freeing after a grace period.
> > > +	 */
> > > +	struct rcu_head *head_free;
> > 
> > So this is not yet the one that does multiple batches concurrently
> > awaiting grace periods, correct?  Or am I missing something subtle?
> 
> Yes, it is not. I honestly, still did not understand that idea. Or how it
> would improve things. May be we can discuss at LPC on pen and paper? But I
> think that can also be a follow-up optimization.

I got it now. Basically we can benefit a bit more by having another list
(that is, having multiple kfree_rcu batches in flight). I will think more
about it - but hopefully we don't need to gate this patch on that.

It'll be interesting to see what rcuperf says about such an improvement :)

thanks,

 - Joel



* Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
  2019-08-14 17:22     ` Joel Fernandes
@ 2019-08-14 18:44       ` Paul E. McKenney
  2019-08-14 22:34         ` Joel Fernandes
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2019-08-14 18:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Rao Shoaib, max.byungchul.park, byungchul.park,
	kernel-team, kernel-team, Andrew Morton, Davidlohr Bueso,
	Jonathan Corbet, Josh Triplett, Kees Cook, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, Mauro Carvalho Chehab, rcu,
	Steven Rostedt, Thomas Gleixner

On Wed, Aug 14, 2019 at 01:22:33PM -0400, Joel Fernandes wrote:
> On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote:
> > On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote:
>  [snip]
> > > > - * Queue an RCU callback for lazy invocation after a grace period.
> > > > - * This will likely be later named something like "call_rcu_lazy()",
> > > > - * but this change will require some way of tagging the lazy RCU
> > > > - * callbacks in the list of pending callbacks. Until then, this
> > > > - * function may only be called from __kfree_rcu().
> > > > + * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> > > > + * kfree(s) is queued for freeing after a grace period, right away.
> > > >   */
> > > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > > > +struct kfree_rcu_cpu {
> > > > +	/* The rcu_work node for queuing work with queue_rcu_work(). The work
> > > > +	 * is done after a grace period.
> > > > +	 */
> > > > +	struct rcu_work rcu_work;
> > > > +
> > > > +	/* The list of objects being queued in a batch but are not yet
> > > > +	 * scheduled to be freed.
> > > > +	 */
> > > > +	struct rcu_head *head;
> > > > +
> > > > +	/* The list of objects that have now left ->head and are queued for
> > > > +	 * freeing after a grace period.
> > > > +	 */
> > > > +	struct rcu_head *head_free;
> > > 
> > > So this is not yet the one that does multiple batches concurrently
> > > awaiting grace periods, correct?  Or am I missing something subtle?
> > 
> > Yes, it is not. I honestly, still did not understand that idea. Or how it
> > would improve things. May be we can discuss at LPC on pen and paper? But I
> > think that can also be a follow-up optimization.
> 
> I got it now. Basically we can benefit a bit more by having another list
> (that is have multiple kfree_rcu batches in flight). I will think more about
> it - but hopefully we don't need to gate this patch by that.

I am willing to take this as a later optimization.

> It'll be interesting to see what rcuperf says about such an improvement :)

Indeed, no guarantees either way.  The reason for hope assumes a busy
system where each grace period is immediately followed by another
grace period.  On such a system, the current setup allows each CPU to
make use only of every second grace period for its kfree_rcu() work.
The hope would therefore be that this would reduce the memory footprint
substantially with no increase in overhead.

But no way to know without trying it!  ;-)

							Thanx, Paul



* Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
  2019-08-14 18:44       ` Paul E. McKenney
@ 2019-08-14 22:34         ` Joel Fernandes
  2019-08-14 23:02           ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Joel Fernandes @ 2019-08-14 22:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, Rao Shoaib, max.byungchul.park, byungchul.park,
	kernel-team, kernel-team, Andrew Morton, Davidlohr Bueso,
	Jonathan Corbet, Josh Triplett, Kees Cook, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, Mauro Carvalho Chehab, rcu,
	Steven Rostedt, Thomas Gleixner

On Wed, Aug 14, 2019 at 11:44:29AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 14, 2019 at 01:22:33PM -0400, Joel Fernandes wrote:
> > On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote:
> > > On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote:
> >  [snip]
> > > > > - * Queue an RCU callback for lazy invocation after a grace period.
> > > > > - * This will likely be later named something like "call_rcu_lazy()",
> > > > > - * but this change will require some way of tagging the lazy RCU
> > > > > - * callbacks in the list of pending callbacks. Until then, this
> > > > > - * function may only be called from __kfree_rcu().
> > > > > + * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> > > > > + * kfree(s) is queued for freeing after a grace period, right away.
> > > > >   */
> > > > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > > > > +struct kfree_rcu_cpu {
> > > > > +	/* The rcu_work node for queuing work with queue_rcu_work(). The work
> > > > > +	 * is done after a grace period.
> > > > > +	 */
> > > > > +	struct rcu_work rcu_work;
> > > > > +
> > > > > +	/* The list of objects being queued in a batch but are not yet
> > > > > +	 * scheduled to be freed.
> > > > > +	 */
> > > > > +	struct rcu_head *head;
> > > > > +
> > > > > +	/* The list of objects that have now left ->head and are queued for
> > > > > +	 * freeing after a grace period.
> > > > > +	 */
> > > > > +	struct rcu_head *head_free;
> > > > 
> > > > So this is not yet the one that does multiple batches concurrently
> > > > awaiting grace periods, correct?  Or am I missing something subtle?
> > > 
> > > Yes, it is not. Honestly, I still did not understand that idea, or how it
> > > would improve things. Maybe we can discuss it at LPC with pen and paper? But I
> > > think that can also be a follow-up optimization.
> > 
> > I got it now. Basically we can benefit a bit more by having another list
> > (that is, multiple kfree_rcu batches in flight). I will think more about
> > it - but hopefully we don't need to gate this patch on that.
> 
> I am willing to take this as a later optimization.
> 
> > It'll be interesting to see what rcuperf says about such an improvement :)
> 
> Indeed, no guarantees either way.  The reason for hope assumes a busy
> system where each grace period is immediately followed by another
> grace period.  On such a system, the current setup allows each CPU to
> make use only of every second grace period for its kfree_rcu() work.
> The hope would therefore be that this would reduce the memory footprint
> substantially with no increase in overhead.

Good news! I was able to bring down the memory footprint by almost 30% by adding
another batch. Below is the patch. Thanks for the suggestion!

I can add this as a patch on top of the initial one, for easier review.

The memory consumed drops from 300-350MB to 200-250MB. Increasing
KFREE_N_BATCHES beyond 2 did not reduce memory any further, though.
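
As an aside, to make the slot-selection logic in the diff below easier to
follow, here is a tiny user-space toy of the same idea. The struct and field
names mirror the patch, but none of this is kernel code, try_queue_batch() is
just a stand-in for queue_kfree_rcu_work(), and the "objects" are fake
pointers:

	#include <stdbool.h>
	#include <stdio.h>

	#define KFREE_N_BATCHES 2	/* set to 1 to model the single-batch case */

	struct kfree_rcu_work {
		void *head_free;	/* non-NULL: this slot has a batch in flight */
	};

	struct kfree_rcu_cpu {
		void *head;		/* objects batched, not yet handed to RCU */
		struct kfree_rcu_work krw[KFREE_N_BATCHES];
	};

	/* Toy queue_kfree_rcu_work(): find an idle slot, detach ->head onto it. */
	static bool try_queue_batch(struct kfree_rcu_cpu *krcp)
	{
		int i;

		for (i = 0; i < KFREE_N_BATCHES; i++) {
			if (!krcp->krw[i].head_free) {
				krcp->krw[i].head_free = krcp->head;
				krcp->head = NULL;
				printf("batch queued on slot %d\n", i);
				return true;
			}
		}
		printf("all slots busy, retry after KFREE_DRAIN_JIFFIES\n");
		return false;
	}

	int main(void)
	{
		struct kfree_rcu_cpu krcp = { .head = (void *)0x1 };

		try_queue_batch(&krcp);		/* slot 0, batch waits for GP #1 */
		krcp.head = (void *)0x2;	/* more kfree_rcu() calls arrive */
		try_queue_batch(&krcp);		/* slot 1; fails if KFREE_N_BATCHES == 1 */
		return 0;
	}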

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] WIP: Multiple batches

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 58 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 41 insertions(+), 17 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1d1847cadea2..a272c893dbdc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2596,26 +2596,35 @@ EXPORT_SYMBOL_GPL(call_rcu);
 
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (HZ / 50)
+#define KFREE_N_BATCHES 2
+
+struct kfree_rcu_work {
+	/* The rcu_work node for queuing work with queue_rcu_work(). The work
+	 * is done after a grace period.
+	 */
+	struct rcu_work rcu_work;
+
+	/* The list of objects that have now left ->head and are queued for
+	 * freeing after a grace period.
+	 */
+	struct rcu_head *head_free;
+
+	struct kfree_rcu_cpu *krc;
+};
+static DEFINE_PER_CPU(__typeof__(struct kfree_rcu_work)[KFREE_N_BATCHES], krw);
 
 /*
  * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
  * kfree(s) is queued for freeing after a grace period, right away.
  */
 struct kfree_rcu_cpu {
-	/* The rcu_work node for queuing work with queue_rcu_work(). The work
-	 * is done after a grace period.
-	 */
-	struct rcu_work rcu_work;
 
 	/* The list of objects being queued in a batch but are not yet
 	 * scheduled to be freed.
 	 */
 	struct rcu_head *head;
 
-	/* The list of objects that have now left ->head and are queued for
-	 * freeing after a grace period.
-	 */
-	struct rcu_head *head_free;
+	struct kfree_rcu_work *krw;
 
 	/* Protect concurrent access to this structure. */
 	spinlock_t lock;
@@ -2638,12 +2647,15 @@ static void kfree_rcu_work(struct work_struct *work)
 {
 	unsigned long flags;
 	struct rcu_head *head, *next;
-	struct kfree_rcu_cpu *krcp = container_of(to_rcu_work(work),
-					struct kfree_rcu_cpu, rcu_work);
+	struct kfree_rcu_work *krw = container_of(to_rcu_work(work),
+					struct kfree_rcu_work, rcu_work);
+	struct kfree_rcu_cpu *krcp;
+
+	krcp = krw->krc;
 
 	spin_lock_irqsave(&krcp->lock, flags);
-	head = krcp->head_free;
-	krcp->head_free = NULL;
+	head = krw->head_free;
+	krw->head_free = NULL;
 	spin_unlock_irqrestore(&krcp->lock, flags);
 
 	/*
@@ -2666,19 +2678,30 @@ static void kfree_rcu_work(struct work_struct *work)
  */
 static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
 {
+	int i = 0;
+	struct kfree_rcu_work *krw = NULL;
+
 	lockdep_assert_held(&krcp->lock);
+	while (i < KFREE_N_BATCHES) {
+		if (!krcp->krw[i].head_free) {
+			krw = &(krcp->krw[i]);
+			break;
+		}
+		i++;
+	}
 
-	/* If a previous RCU batch work is already in progress, we cannot queue
+	/* If both RCU batches are already in progress, we cannot queue
 	 * another one, just refuse the optimization and it will be retried
 	 * again in KFREE_DRAIN_JIFFIES time.
 	 */
-	if (krcp->head_free)
+	if (!krw)
 		return false;
 
-	krcp->head_free = krcp->head;
+	krw->head_free = krcp->head;
+	krw->krc = krcp;   /* Should need to do only once, optimize later. */
 	krcp->head = NULL;
-	INIT_RCU_WORK(&krcp->rcu_work, kfree_rcu_work);
-	queue_rcu_work(system_wq, &krcp->rcu_work);
+	INIT_RCU_WORK(&krw->rcu_work, kfree_rcu_work);
+	queue_rcu_work(system_wq, &krw->rcu_work);
 
 	return true;
 }
@@ -3631,6 +3654,7 @@ static void __init kfree_rcu_batch_init(void)
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
 		spin_lock_init(&krcp->lock);
+		krcp->krw = &(per_cpu(krw, cpu)[0]);
 		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
 	}
 }
-- 
2.23.0.rc1.153.gdeed80330f-goog



* Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching
  2019-08-14 22:34         ` Joel Fernandes
@ 2019-08-14 23:02           ` Paul E. McKenney
  0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2019-08-14 23:02 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Rao Shoaib, max.byungchul.park, byungchul.park,
	kernel-team, kernel-team, Andrew Morton, Davidlohr Bueso,
	Jonathan Corbet, Josh Triplett, Kees Cook, Lai Jiangshan,
	linux-doc, Mathieu Desnoyers, Mauro Carvalho Chehab, rcu,
	Steven Rostedt, Thomas Gleixner

On Wed, Aug 14, 2019 at 06:34:13PM -0400, Joel Fernandes wrote:
> On Wed, Aug 14, 2019 at 11:44:29AM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 14, 2019 at 01:22:33PM -0400, Joel Fernandes wrote:
> > > On Wed, Aug 14, 2019 at 10:38:17AM -0400, Joel Fernandes wrote:
> > > > On Tue, Aug 13, 2019 at 12:07:38PM -0700, Paul E. McKenney wrote:
> > >  [snip]
> > > > > > - * Queue an RCU callback for lazy invocation after a grace period.
> > > > > > - * This will likely be later named something like "call_rcu_lazy()",
> > > > > > - * but this change will require some way of tagging the lazy RCU
> > > > > > - * callbacks in the list of pending callbacks. Until then, this
> > > > > > - * function may only be called from __kfree_rcu().
> > > > > > + * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> > > > > > + * kfree(s) is queued for freeing after a grace period, right away.
> > > > > >   */
> > > > > > -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > > > > > +struct kfree_rcu_cpu {
> > > > > > +	/* The rcu_work node for queuing work with queue_rcu_work(). The work
> > > > > > +	 * is done after a grace period.
> > > > > > +	 */
> > > > > > +	struct rcu_work rcu_work;
> > > > > > +
> > > > > > +	/* The list of objects being queued in a batch but are not yet
> > > > > > +	 * scheduled to be freed.
> > > > > > +	 */
> > > > > > +	struct rcu_head *head;
> > > > > > +
> > > > > > +	/* The list of objects that have now left ->head and are queued for
> > > > > > +	 * freeing after a grace period.
> > > > > > +	 */
> > > > > > +	struct rcu_head *head_free;
> > > > > 
> > > > > So this is not yet the one that does multiple batches concurrently
> > > > > awaiting grace periods, correct?  Or am I missing something subtle?
> > > > 
> > > > Yes, it is not. Honestly, I still did not understand that idea, or how it
> > > > would improve things. Maybe we can discuss it at LPC with pen and paper? But I
> > > > think that can also be a follow-up optimization.
> > > 
> > > I got it now. Basically we can benefit a bit more by having another list
> > > (that is, multiple kfree_rcu batches in flight). I will think more about
> > > it - but hopefully we don't need to gate this patch on that.
> > 
> > I am willing to take this as a later optimization.
> > 
> > > It'll be interesting to see what rcuperf says about such an improvement :)
> > 
> > Indeed, no guarantees either way.  The reason for hope assumes a busy
> > system where each grace period is immediately followed by another
> > grace period.  On such a system, the current setup allows each CPU to
> > make use only of every second grace period for its kfree_rcu() work.
> > The hope would therefore be that this would reduce the memory footprint
> > substantially with no increase in overhead.
> 
> Good news! I was able to bring down the memory footprint by almost 30% by adding
> another batch. Below is the patch. Thanks for the suggestion!

Nice!

> I can add this as a patch on top of the initial one, for easier review.

Yes, please!

> The memory consumed drops from 300-350MB to 200-250MB. Increasing
> KFREE_N_BATCHES beyond 2 did not reduce memory any further, though.

OK, good to know.

						Thanx, Paul

> [snip]



end of thread (newest message: 2019-08-14 23:02 UTC)

Thread overview: 8+ messages
2019-08-13 17:00 [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching Joel Fernandes (Google)
2019-08-13 17:00 ` [PATCH v3 2/2] rcuperf: Add kfree_rcu performance Tests Joel Fernandes (Google)
2019-08-13 19:07 ` [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching Paul E. McKenney
2019-08-14 14:38   ` Joel Fernandes
2019-08-14 17:22     ` Joel Fernandes
2019-08-14 18:44       ` Paul E. McKenney
2019-08-14 22:34         ` Joel Fernandes
2019-08-14 23:02           ` Paul E. McKenney
