RCU Archive on lore.kernel.org
 help / color / Atom feed
From: Joel Fernandes <joel@joelfernandes.org>
To: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: rcu@vger.kernel.org
Subject: Re: need_heavy_qs flag for PREEMPT=y kernels
Date: Mon, 12 Aug 2019 17:20:13 -0400
Message-ID: <20190812212013.GB48751@google.com> (raw)
In-Reply-To: <20190812035306.GE28441@linux.ibm.com>

On Sun, Aug 11, 2019 at 08:53:06PM -0700, Paul E. McKenney wrote:
> On Sun, Aug 11, 2019 at 11:21:42PM -0400, Joel Fernandes wrote:
> > On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
> > [snip]
> > > This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> > > more aggressive about using resched_cpu() on CPUs that have not yet
> > > reported a quiescent state for the current grace period.
> > 
> > Just wanted to ask something - how does resched_cpu() help for this case?
> > 
> > Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
> > in kernel mode with scheduler tick off. As we discussed, we have no help from
> > cond_resched() (since its a PREEMPT=y kernel).  Because enough time has
> > passed (jtsq*3), we send the CPU a re-scheduling IPI.
> > 
> > This seems not that useful. Even if we enter the scheduler due to the
> > rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
> > or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
> > quiescent state to the leaf node. Neither will anything to do a
> > rcu_momentary_dyntick_idle() for that CPU. Without this, the grace period
> > will still end up getting blocked.
> > 
> > Could you clarify which code paths on a nohz_full CPU running PREEMPT=y
> > kernel actually helps to end the grace period when we call resched_cpu() on
> > it?  Don't we need atleast do a rcu_momentary_dyntick_idle() from the
> > scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
> > CPU? Maybe I should do an experiment to see this all play out.
> 
> An experiment would be good!

Hi Paul,
Some very interesting findings!

Experiment: a tight loop rcuperf thread bound to a nohz_full CPU with
CONFIG_PREEMPT=y and CONFIG_NO_HZ_FULL=y. Executes for 5000 jiffies and
exits. Diff for test is appended.

Inference: I see that the tick is off on the nohz_full CPU 3 (the looping
thread is affined to CPU 3). The FQS loop does resched_cpu on 3, but the
grace period is unable to come to an end with the hold up seemingly due to
CPU 3.

I see that the scheduler tick is off mostly, but occasionally is turned back
on during the test loop. However it has no effect and the grace period is
stuck on the same rcu_state.gp_seq value for the duration of the test. I
think the scheduler-tick ineffectiveness could be because of this patch?
https://lore.kernel.org/patchwork/patch/446044/

Relevant traces, sorry I did not wrap it for better readability:

Here I marked the start of the test:

[   24.852639] rcu_perf-163    13.... 1828278us : rcu_perf_writer: Start of rcuperf test
[   24.853651] rcu_perf-163    13dN.1 1828284us : rcu_grace_period: rcu_preempt 18446744073709550677 cpuqs
[   24.971709] rcu_pree-10      0d..1 1833228us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   26.493607] rcu_pree-10      0d..1 2137286us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   28.090618] rcu_pree-10      0d..1 2441290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3

Scheduler tick came in but almost 6 seconds of start of test, probably because migrate thread increase number of tasks queued on RQ:

[   28.355513] rcu_perf-163     3d.h. 2487447us : rcu_sched_clock_irq: sched-tick
[   29.592912] rcu_pree-10      0d..1 2745293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   30.238041] rcu_perf-163     3d..2 2879562us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=migration/3 next_pid=26 next_prio=0
[   30.269203] migratio-26      3d..2 2879745us : sched_switch: prev_comm=migration/3 prev_pid=26 prev_prio=0 prev_state=S ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98
[   31.109635] rcu_pree-10      0d..1 3049301us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   32.627686] rcu_pree-10      0d..1 3353298us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   34.071163] rcu_pree-10      0d..1 3657299us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3

Several context-switches on CPU 3, but grace-period is still extended:

[   34.634814] rcu_perf-163     3d..2 3775310us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
[   34.638532] kworker/-28      3d..2 3775343us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=D ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
[   34.640338]  cpuhp/3-25      3d..2 3775348us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   34.653831]   <idle>-0       3d..2 3775522us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
[   34.657770] kworker/-28      3d..2 3775536us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=I ==> next_comm=kworker/3:1 next_pid=177 next_prio=120
[   34.661758] kworker/-177     3d..2 3775543us : sched_switch: prev_comm=kworker/3:1 prev_pid=177 prev_prio=120 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   34.866007]   <idle>-0       3d..2 3789967us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0H next_pid=29 next_prio=100
[   34.874200] kworker/-29      3d..2 3789988us : sched_switch: prev_comm=kworker/3:0H prev_pid=29 prev_prio=100 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   35.128506]   <idle>-0       3d..2 3816391us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
[   35.130481]  cpuhp/3-25      3d..2 3816503us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   35.509469]   <idle>-0       3d..2 3828462us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98

Scheduler tick doesn't do much to help:

[   35.512436] rcu_perf-163     3d.h. 3829210us : rcu_sched_clock_irq: sched-tick

Now scheduler tick is completely off for the next 29 seconds. FQS loops keeps
doing resched_cpu on CPU 3 at 300 jiffy intervals (my HZ=1000):

[   36.160145] rcu_pree-10      0d..1 3958214us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   38.778574] rcu_pree-10      0d..1 4262288us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   40.604108] rcu_pree-10      0d..1 4566246us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   42.565345] rcu_pree-10      0d..1 4870294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   44.322041] rcu_pree-10      0d..1 5174293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   46.083710] rcu_pree-10      0d..1 5478290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   47.851449] rcu_pree-10      0d..1 5782289us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   49.693127] rcu_pree-10      0d..1 6086293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   51.432018] rcu_pree-10      0d..1 6390283us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   53.187991] rcu_pree-10      0d..1 6694294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   53.970850] rcu_perf-163     3.... 6828237us : rcu_perf_writer: End of rcuperf test

I feel we could do one of:
1. Call rcu_momentary_dyntick_idle() from the re-schedule IPI handler
2. Raise the RCU softirq for NOHZ_FULL CPUs from re-schedule IPI handler or timer tick.

What do you think?  Also test diff is below.

thanks,

 - Joel

---8<-----------------------

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 70d6ac19cbff..181cd9ce733a 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -366,17 +366,15 @@ rcu_perf_writer(void *arg)
 	u64 t;
 	u64 *wdp;
 	u64 *wdpp = writer_durations[me];
+	unsigned long jiffies_init;
 
 	VERBOSE_PERFOUT_STRING("rcu_perf_writer task started");
+	trace_printk("Start of rcuperf test\n");
 	WARN_ON(!wdpp);
-	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
+	set_cpus_allowed_ptr(current, cpumask_of(3));
 	sp.sched_priority = 1;
 	sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
 
-	if (holdoff)
-		schedule_timeout_uninterruptible(holdoff * HZ);
-
-	t = ktime_get_mono_fast_ns();
 	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
 		t_rcu_perf_writer_started = t;
 		if (gp_exp) {
@@ -387,80 +385,21 @@ rcu_perf_writer(void *arg)
 		}
 	}
 
+	jiffies_init = jiffies;
 	do {
-		if (writer_holdoff)
-			udelay(writer_holdoff);
-		wdp = &wdpp[i];
-		*wdp = ktime_get_mono_fast_ns();
-		if (gp_async) {
-retry:
-			if (!rhp)
-				rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
-			if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
-				rcu_perf_writer_state = RTWS_ASYNC;
-				atomic_inc(this_cpu_ptr(&n_async_inflight));
-				cur_ops->async(rhp, rcu_perf_async_cb);
-				rhp = NULL;
-			} else if (!kthread_should_stop()) {
-				rcu_perf_writer_state = RTWS_BARRIER;
-				cur_ops->gp_barrier();
-				goto retry;
-			} else {
-				kfree(rhp); /* Because we are stopping. */
-			}
-		} else if (gp_exp) {
-			rcu_perf_writer_state = RTWS_EXP_SYNC;
-			cur_ops->exp_sync();
-		} else {
-			rcu_perf_writer_state = RTWS_SYNC;
-			cur_ops->sync();
-		}
-		rcu_perf_writer_state = RTWS_IDLE;
-		t = ktime_get_mono_fast_ns();
-		*wdp = t - *wdp;
-		i_max = i;
-		if (!started &&
-		    atomic_read(&n_rcu_perf_writer_started) >= nrealwriters)
-			started = true;
-		if (!done && i >= MIN_MEAS) {
-			done = true;
-			sp.sched_priority = 0;
-			sched_setscheduler_nocheck(current,
-						   SCHED_NORMAL, &sp);
-			pr_alert("%s%s rcu_perf_writer %ld has %d measurements\n",
-				 perf_type, PERF_FLAG, me, MIN_MEAS);
-			if (atomic_inc_return(&n_rcu_perf_writer_finished) >=
-			    nrealwriters) {
-				schedule_timeout_interruptible(10);
-				rcu_ftrace_dump(DUMP_ALL);
-				PERFOUT_STRING("Test complete");
-				t_rcu_perf_writer_finished = t;
-				if (gp_exp) {
-					b_rcu_gp_test_finished =
-						cur_ops->exp_completed() / 2;
-				} else {
-					b_rcu_gp_test_finished =
-						cur_ops->get_gp_seq();
-				}
-				if (shutdown) {
-					smp_mb(); /* Assign before wake. */
-					wake_up(&shutdown_wq);
-				}
-			}
-		}
-		if (done && !alldone &&
-		    atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
-			alldone = true;
-		if (started && !alldone && i < MAX_MEAS - 1)
-			i++;
-		rcu_perf_wait_shutdown();
-	} while (!torture_must_stop());
-	if (gp_async) {
-		rcu_perf_writer_state = RTWS_BARRIER;
-		cur_ops->gp_barrier();
+		i++;
+		cond_resched();
+	} while ((jiffies < jiffies_init + 5000) && !torture_must_stop());
+
+	trace_printk("End of rcuperf test\n");
+	schedule_timeout_interruptible(10);
+	rcu_ftrace_dump(DUMP_ALL);
+	PERFOUT_STRING("Test complete");
+	if (shutdown) {
+		smp_mb(); /* Assign before wake. */
+		wake_up(&shutdown_wq);
 	}
-	rcu_perf_writer_state = RTWS_STOPPING;
-	writer_n_durations[me] = i_max;
+
 	torture_kthread_stopping("rcu_perf_writer");
 	return 0;
 }
@@ -797,7 +736,7 @@ rcu_perf_init(void)
 	if (kfree_rcu_test)
 		return kfree_perf_init();
 
-	nrealwriters = compute_real(nwriters);
+	nrealwriters = 1;
 	nrealreaders = compute_real(nreaders);
 	atomic_set(&n_rcu_perf_reader_started, 0);
 	atomic_set(&n_rcu_perf_writer_started, 0);
@@ -814,6 +753,8 @@ rcu_perf_init(void)
 			goto unwind;
 		schedule_timeout_uninterruptible(1);
 	}
+
+#if 0
 	reader_tasks = kcalloc(nrealreaders, sizeof(reader_tasks[0]),
 			       GFP_KERNEL);
 	if (reader_tasks == NULL) {
@@ -829,6 +770,7 @@ rcu_perf_init(void)
 	}
 	while (atomic_read(&n_rcu_perf_reader_started) < nrealreaders)
 		schedule_timeout_uninterruptible(1);
+#endif
 	writer_tasks = kcalloc(nrealwriters, sizeof(reader_tasks[0]),
 			       GFP_KERNEL);
 	writer_durations = kcalloc(nrealwriters, sizeof(*writer_durations),
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 102a5f606d78..e3deb9431544 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1066,6 +1066,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 	if (tick_nohz_full_cpu(rdp->cpu) &&
 		   time_after(jiffies,
 			      READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
+		trace_printk("Sending urgent resched to cpu %d\n", rdp->cpu);
 		resched_cpu(rdp->cpu);
 		WRITE_ONCE(rdp->last_fqs_resched, jiffies);
 	}
@@ -2168,6 +2169,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
  */
 void rcu_sched_clock_irq(int user)
 {
+	trace_printk("sched-tick\n");
+
 	trace_rcu_utilization(TPS("Start scheduler-tick"));
 	raw_cpu_inc(rcu_data.ticks_this_gp);
 	/* The load-acquire pairs with the store-release setting to true. */

  reply index

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-11 18:08 Joel Fernandes
2019-08-11 18:34 ` Joel Fernandes
2019-08-11 21:16   ` Paul E. McKenney
2019-08-11 21:25     ` Joel Fernandes
2019-08-11 23:30       ` Paul E. McKenney
2019-08-12  1:24         ` Joel Fernandes
2019-08-12  1:40           ` Joel Fernandes
2019-08-12  3:57             ` Paul E. McKenney
2019-08-11 21:13 ` Paul E. McKenney
2019-08-12  3:21   ` Joel Fernandes
2019-08-12  3:53     ` Paul E. McKenney
2019-08-12 21:20       ` Joel Fernandes [this message]
2019-08-12 23:01         ` Paul E. McKenney
2019-08-13  1:02           ` Joel Fernandes
2019-08-13  1:05             ` Joel Fernandes
2019-08-13  2:28               ` Paul E. McKenney
2019-08-13  2:27             ` Paul E. McKenney
2019-08-13  2:50               ` Paul E. McKenney
2019-08-15 17:17             ` Paul E. McKenney
2019-08-15 20:04               ` Joel Fernandes
2019-08-15 20:31                 ` Paul E. McKenney
2019-08-15 21:22                   ` Joel Fernandes
2019-08-15 21:27                     ` Joel Fernandes
2019-08-15 21:34                       ` Joel Fernandes
2019-08-15 21:57                         ` Paul E. McKenney
2019-08-15 21:45                     ` Paul E. McKenney
2019-08-16  0:02                       ` Joel Fernandes
2019-08-19 12:34                         ` Frederic Weisbecker
2019-08-19 12:09                   ` Frederic Weisbecker
2019-08-19 16:57                   ` Frederic Weisbecker
2019-08-19 22:31                     ` Paul E. McKenney

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190812212013.GB48751@google.com \
    --to=joel@joelfernandes.org \
    --cc=paulmck@linux.ibm.com \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

RCU Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/rcu/0 rcu/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 rcu rcu/ https://lore.kernel.org/rcu \
		rcu@vger.kernel.org
	public-inbox-index rcu

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.rcu


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git