Possible issue with commit 4961b6e11825?

* Possible issue with commit 4961b6e11825?
@ 2015-12-04 23:20 Paul E. McKenney
  2015-12-05 19:01 ` Paul E. McKenney
  2015-12-07 19:01 ` Frederic Weisbecker
  0 siblings, 2 replies; 7+ messages in thread
From: Paul E. McKenney @ 2015-12-04 23:20 UTC (permalink / raw)
  To: tglx, peterz, preeti, viresh.kumar, mtosatti, fweisbec
  Cc: linux-kernel, sasha.levin

Hello!

Are there any known issues with commit 4961b6e11825 (sched: core: Use
hrtimer_start[_expires]())?

The reason that I ask is that I am about 90% sure that an rcutorture
failure bisects to that commit.  I will be running more tests on
3497d206c4d9 (perf: core: Use hrtimer_start()), which is the predecessor
of 4961b6e11825, and which, unlike 4961b6e11825, passes a 12-hour
rcutorture test with scenario TREE03.  In contrast, 4961b6e11825 gets
131 RCU CPU stall warnings, 132 reports of one of RCU's grace-period
kthreads being starved, and 525 reports of one of rcutorture's kthreads
being starved.  Most of the test runs hang on shutdown, which is no
surprise if an RCU CPU stall is happening at about that time.

But perhaps 3497d206c4d9 was just getting lucky, hence additional testing
over the weekend.

Reproducing this takes some doing.  A multisocket x86 box with significant
background computation noise seems to be able to reproduce this with
high probability in a twelve-hour test.  I -can- make it happen on
a single-socket four-core system (eight hardware threads, and with
significant background computational noise), but I ran the test for
several days before seeing the first error.  In addition, the probability
of hitting this is greatly reduced when running the tests on the
multisocket x86 box without the background computational noise.
(I recently taught some IBMers about ppcmem and herd, and gave them
some problems to solve, which is where the background noise came from,
in case you were wondering.  An unexpected benefit from those tools!)

The starving of RCU's grace-period kthreads is quite surprising, as
diagnostics indicate that they are in a wait_event_interruptible_timeout()
with a three-jiffy timeout.  The starvation is not subtle: 21-second
starvation periods are quite common, and 84-second starvation periods
occur from time to time.  In addition, rcutorture goes idle every few
seconds in order to test ramp-up and ramp-down effects, which should rule
out starvation due to heavy load.  Besides, I never see any softlockup
warnings, which should appear in the heavy-load-starvation case.

The commit log for 4961b6e11825 is as follows:

	sched: core: Use hrtimer_start[_expires]()

	hrtimer_start() now enforces a timer interrupt when an already
	expired timer is enqueued.

	Get rid of the __hrtimer_start_range_ns() invocations and the
	loops around it.

Is it possible that I need to adjust RCU or rcutorture code to account
for these newly enforced timer interrupts?  Or is there a known bug with
this commit whose fix I need to apply when bisecting?  (There were two
other fixes that I needed to do this with, so I figured I should ask.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread