[PATCH] Remove GP_REPLAY state from rcu_sync

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
To: linux-kernel@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>,
	bristot@redhat.com, peterz@infradead.org, oleg@redhat.com,
	paulmck@kernel.org, rcu@vger.kernel.org,
	Josh Triplett <josh@joshtriplett.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	"Paul E. McKenney" <paulmck@linux.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: [PATCH] Remove GP_REPLAY state from rcu_sync
Date: Fri,  4 Oct 2019 10:57:41 -0400	[thread overview]
Message-ID: <20191004145741.118292-1-joel@joelfernandes.org> (raw)

From: Joel Fernandes <joel@joelfernandes.org>

Please consider this is an RFC for discussion only. Just want to discuss
why the GP_REPLAY state is needed at all.

Here's the intention AFAICS:
When rcu_sync_exit() has happened, the gp_state changes to GP_EXIT while
we wait for a grace period before transitioning to GP_IDLE. In the
meanwhile, if we receive another rcu_sync_exit(), then we want to wait
for another GP to account for that.

Drawing some timing diagrams, it looks like this:

Legend:
rse = rcu_sync_enter
rsx = rcu_sync_exit
i = GP_IDLE
x = GP_EXIT
r = GP_REPLAY
e = GP_ENTER
p = GP_PASSED
rx = GP_REPLAY changes to GP_EXIT

GP num = The GP we are one.

note: A GP passes between the states:
  e and p
  x and i
  x and rx
  rx and i

In a simple case, the timing and states look like:
time
---------------------->
GP num         1111111    2222222
GP state  i    e     p    x     i
CPU0 :         rse	  rsx

However we can enter the replay state like this:
time
---------------------->
GP num         1111111    2222222222222222222223333333
GP state  i    e     p    x              r     rx    i
CPU0 :         rse	  rsx
CPU1 :                         rse     rsx

Due to the second rse + rsx, we had to wait for another GP.

I believe the rationale is, if another rsx happens, another GP has to
happen.

But this is not always true if you consider the following events:

time
---------------------->
GP num         111111     22222222222222222222222222222222233333333
GP state  i    e     p    x                 r              rx     i
CPU0 :         rse	  rsx
CPU1 :                         rse     rsx
CPU2 :                                         rse     rsx

Here, we had 3 grace periods that elapsed, 1 for the rcu_sync_enter(),
and 2 for the rcu_sync_exit(s).

However, we had 3 rcu_sync_exit()s, not 2. In other words, the
rcu_sync_exit() got batched.

So my point here is, rcu_sync_exit() does not really always cause a new
GP to happen and we can end up having the rcu_sync_exit()s as batched
and sharing the same grace period.

Then what is the point of the GP_REPLAY state at all if it does not
always wait for a new GP?  Taking a step back, why did we intend to have
to wait for a new GP if another rcu_sync_exit() comes while one is still
in progress?

Cc: bristot@redhat.com
Cc: peterz@infradead.org
Cc: oleg@redhat.com
Cc: paulmck@kernel.org
Cc: rcu@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/sync.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index d4558ab7a07d..4f3aad67992c 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -10,7 +10,7 @@
 #include <linux/rcu_sync.h>
 #include <linux/sched.h>
 
-enum { GP_IDLE = 0, GP_ENTER, GP_PASSED, GP_EXIT, GP_REPLAY };
+enum { GP_IDLE = 0, GP_ENTER, GP_PASSED, GP_EXIT };
 
 #define	rss_lock	gp_wait.lock
 
@@ -85,13 +85,6 @@ static void rcu_sync_func(struct rcu_head *rhp)
 		 */
 		WRITE_ONCE(rsp->gp_state, GP_PASSED);
 		wake_up_locked(&rsp->gp_wait);
-	} else if (rsp->gp_state == GP_REPLAY) {
-		/*
-		 * A new rcu_sync_exit() has happened; requeue the callback to
-		 * catch a later GP.
-		 */
-		WRITE_ONCE(rsp->gp_state, GP_EXIT);
-		rcu_sync_call(rsp);
 	} else {
 		/*
 		 * We're at least a GP after the last rcu_sync_exit(); eveybody
@@ -167,16 +160,13 @@ void rcu_sync_enter(struct rcu_sync *rsp)
  */
 void rcu_sync_exit(struct rcu_sync *rsp)
 {
-	WARN_ON_ONCE(READ_ONCE(rsp->gp_state) == GP_IDLE);
-	WARN_ON_ONCE(READ_ONCE(rsp->gp_count) == 0);
+	WARN_ON_ONCE(READ_ONCE(rsp->gp_state) < GP_PASSED);
 
 	spin_lock_irq(&rsp->rss_lock);
 	if (!--rsp->gp_count) {
 		if (rsp->gp_state == GP_PASSED) {
 			WRITE_ONCE(rsp->gp_state, GP_EXIT);
 			rcu_sync_call(rsp);
-		} else if (rsp->gp_state == GP_EXIT) {
-			WRITE_ONCE(rsp->gp_state, GP_REPLAY);
 		}
 	}
 	spin_unlock_irq(&rsp->rss_lock);
-- 
2.23.0.581.g78d2f28ef7-goog