linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle
@ 2017-08-22  8:43 Nicholas Piggin
  2017-08-23  3:21 ` [v2] " jeffy
  2017-08-24  9:42 ` [tip:timers/urgent] " tip-bot for Nicholas Piggin
  0 siblings, 2 replies; 6+ messages in thread
From: Nicholas Piggin @ 2017-08-22  8:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nicholas Piggin, akpm, John Stultz, Stephen Boyd,
	Jonathan.Cameron, paulmck, mpe, dzickus, sfr, linuxarm, abdhalee,
	linux-kernel, torvalds

When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.

However there are several problems:

- If an existing timer is modified, the base is forwarded only after
  the index is calculated.

- The base is not forwarded by add_timer_on.

- There is a window after a timer is restarted from a nohz ide, after
  it is marked not-idle and before the timer tick on this CPU, where a
  timer may be added but the ancient base does not get forwarded.

These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.

Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.

There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.

This has been tested and found to fix the RCU softlockup messages.

Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:

             max     avg      std
upstream   506.0    1.20     4.68
patched      2.0    1.08     0.15

The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:

	     max     avg      std
upstream     9.0    1.05     0.19
patched      2.0    1.04     0.11

Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: David Miller <davem@davemloft.net>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---

- Address Thomas' comments (hopefully).
- Reword the changelog a bit.
- Added Jonathan's tested-by

 kernel/time/timer.c | 50 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 9 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8f5d1bf18854..f2674a056c26 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -203,6 +203,7 @@ struct timer_base {
 	bool			migration_enabled;
 	bool			nohz_active;
 	bool			is_idle;
+	bool			must_forward_clk;
 	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
 	struct hlist_head	vectors[WHEEL_SIZE];
 } ____cacheline_aligned;
@@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-	unsigned long jnow = READ_ONCE(jiffies);
+	unsigned long jnow;
 
 	/*
-	 * We only forward the base when it's idle and we have a delta between
-	 * base clock and jiffies.
+	 * We only forward the base when we are idle or have just come out of
+	 * idle (must_forward_clk logic), and have a delta between base clock
+	 * and jiffies. In the common case, run_timers will take care of it.
 	 */
-	if (!base->is_idle || (long) (jnow - base->clk) < 2)
+	if (likely(!base->must_forward_clk))
+		return;
+
+	jnow = READ_ONCE(jiffies);
+	base->must_forward_clk = base->is_idle;
+	if ((long)(jnow - base->clk) < 2)
 		return;
 
 	/*
@@ -938,6 +945,11 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 	 * same array bucket then just return:
 	 */
 	if (timer_pending(timer)) {
+		/*
+		 * The downside of this optimization is that it can result in
+		 * larger granularity than you would get from adding a new
+		 * timer with this expiry.
+		 */
 		if (timer->expires == expires)
 			return 1;
 
@@ -948,6 +960,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 		 * dequeue/enqueue dance.
 		 */
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 
 		clk = base->clk;
 		idx = calc_wheel_index(expires, clk);
@@ -964,6 +977,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 		}
 	} else {
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 	}
 
 	ret = detach_if_pending(timer, base, false);
@@ -991,12 +1005,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 			raw_spin_lock(&base->lock);
 			WRITE_ONCE(timer->flags,
 				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
+			forward_timer_base(base);
 		}
 	}
 
-	/* Try to forward a stale timer base clock */
-	forward_timer_base(base);
-
 	timer->expires = expires;
 	/*
 	 * If 'idx' was calculated above and the base time did not advance
@@ -1112,6 +1124,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 		WRITE_ONCE(timer->flags,
 			   (timer->flags & ~TIMER_BASEMASK) | cpu);
 	}
+	forward_timer_base(base);
 
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
@@ -1497,10 +1510,16 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 		if (!is_max_delta)
 			expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
 		/*
-		 * If we expect to sleep more than a tick, mark the base idle:
+		 * If we expect to sleep more than a tick, mark the base idle.
+		 * Also the tick is stopped so any added timer must forward
+		 * the base clk itself to keep granularity small. This idle
+		 * logic is only maintained for the BASE_STD base, deferrable
+		 * timers may still see large granularity skew (by design).
 		 */
-		if ((expires - basem) > TICK_NSEC)
+		if ((expires - basem) > TICK_NSEC) {
+			base->must_forward_clk = true;
 			base->is_idle = true;
+		}
 	}
 	raw_spin_unlock(&base->lock);
 
@@ -1611,6 +1630,19 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
 {
 	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
 
+	/*
+	 * must_forward_clk must be cleared before running timers so that any
+	 * timer functions that call mod_timer will not try to forward the
+	 * base. idle trcking / clock forwarding logic is only used with
+	 * BASE_STD timers.
+	 *
+	 * The deferrable base does not do idle tracking at all, so we do
+	 * not forward it. This can result in very large variations in
+	 * granularity for deferrable timers, but they can be deferred for
+	 * long periods due to idle.
+	 */
+	base->must_forward_clk = false;
+
 	__run_timers(base);
 	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
 		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
  2017-08-22  8:43 [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle Nicholas Piggin
@ 2017-08-23  3:21 ` jeffy
  2017-08-23  3:35   ` Stephen Rothwell
  2017-08-23  7:01   ` Thomas Gleixner
  2017-08-24  9:42 ` [tip:timers/urgent] " tip-bot for Nicholas Piggin
  1 sibling, 2 replies; 6+ messages in thread
From: jeffy @ 2017-08-23  3:21 UTC (permalink / raw)
  To: Nicholas Piggin, Thomas Gleixner
  Cc: akpm, John Stultz, Stephen Boyd, Jonathan.Cameron, paulmck, mpe,
	dzickus, sfr, linuxarm, abdhalee, linux-kernel, torvalds

Hi guys,

I was testing a arm64 based device(chromebook bob), and the kernel hang 
a lot(during booting or suspending) on for-next(next-20170822) with this 
commit:
71acb768f5b3 (timers: Fix excessive granularity of new timers after a 
nohz idle)

After revert it(or just add "!base->is_idle" check back), things work 
well...


my dts:
arch/arm64/boot/dts/rockchip/rk3399-gru-kevin.dts

my cpu info:
localhost / # cat /proc/cpuinfo
processor       : 0
model name      : ARMv8 Processor rev 4 (v8l)
BogoMIPS        : 48.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 1
model name      : ARMv8 Processor rev 4 (v8l)
BogoMIPS        : 48.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 2
model name      : ARMv8 Processor rev 4 (v8l)
BogoMIPS        : 48.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 3
model name      : ARMv8 Processor rev 4 (v8l)
BogoMIPS        : 48.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 4
model name      : ARMv8 Processor rev 2 (v8l)
BogoMIPS        : 48.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 2

processor       : 5
model name      : ARMv8 Processor rev 2 (v8l)
BogoMIPS        : 48.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 2


my hang dmesg:
[  304.797633] EXT4-fs (mmcblk1p3): mounted filesystem without journal. 
Opts: (null
)
[  304.806070] VFS: Mounted root (ext2 filesystem) on device 179:3.
[  304.815775] devtmpfs: mounted
[  304.823110] Freeing unused kernel memory: 2048K
[  304.891944] audit: type=1404 audit(304.617:2): enforcing=1 
old_enforcing=0 auid=
4294967295 ses=4294967295
[  364.655978] INFO: rcu_preempt self-detected stall on CPU
[  364.661924]  3-...: (23 GPs behind) idle=486/2/0 softirq=39/40 fqs=14991
[  364.669513]   (t=60000 jiffies g=-244 c=-245 q=71)
[  364.674872] Task dump for CPU 3:
[  364.678478] swapper/3       R  running task        0     0      1 
0x00000022
[  364.686365] Call trace:
[  364.689095] [<ffffff8008089ed4>] dump_backtrace+0x0/0x220
[  364.695133] [<ffffff800808a118>] show_stack+0x24/0x30
[  364.700784] [<ffffff80080d84fc>] sched_show_task+0x138/0x168
[  364.707112] [<ffffff80080d96bc>] dump_cpu_task+0x48/0x54
[  364.713054] [<ffffff800810ce80>] rcu_dump_cpu_stacks+0xa0/0xe0
[  364.719577] [<ffffff800810b9bc>] rcu_check_callbacks+0x328/0x934
[  364.726295] [<ffffff8008112b20>] update_process_times+0x34/0x5c
[  364.732919] [<ffffff8008123164>] tick_sched_handle.isra.3+0x54/0x68
[  364.739928] [<ffffff80081231c0>] tick_sched_timer+0x48/0x84
[  364.746160] [<ffffff8008113b40>] __hrtimer_run_queues+0x194/0x31c
[  364.752974] [<ffffff8008113eb0>] hrtimer_interrupt+0xac/0x1bc
[  364.759402] [<ffffff80086152a4>] arch_timer_handler_phys+0x3c/0x4c
[  364.766316] [<ffffff80080fe838>] handle_percpu_devid_irq+0xf4/0x270
[  364.773324] [<ffffff80080f8df0>] generic_handle_irq+0x2c/0x44
[  364.779749] [<ffffff80080f94b0>] __handle_domain_irq+0x90/0xb8
[  364.786271] [<ffffff8008080ef8>] gic_handle_irq+0x140/0x1a8
[  364.792502] Exception stack(0xffffff800801bba0 to 0xffffff800801bce0)
[  364.799696] bba0: 000000000000b7e9 0000000000000003 00000040eee72000 
ffffffc0f7d
601f8
[  364.808456] bbc0: 000000000000003f ffffffc0f7d5efb8 ffffff800801bde8 
80000000000
00000
[  364.817215] bbe0: 0000000000000001 0000000000000000 ffffff8009028338 
00000000000
00000
[  364.825975] bc00: 0000000000000105 0000000000000130 0000000000005dc0 
ffffff8008a
58000
[  364.834734] bc20: 0000000000000007 000000000000000e 000000000000005c 
ffffffc0002
18000
[  364.843494] bc40: ffffff8009007000 ffffffc000218058 ffffff80080c1034 
ffffffc0002
18000
[  364.852253] bc60: ffffff800801bde0 ffffffc000218000 000000000000000a 
00000000000
00001
[  364.861013] bc80: ffffff80090e8000 ffffff800801bce0 ffffff8008885c00 
ffffff80080
1bce0
[  364.869773] bca0: ffffff8008885c04 0000000060000145 ffffffc000218058 
ffffff80080
c1034
[  364.878532] bcc0: 0000008000000000 ffffff8008885c00 ffffff800801bce0 
ffffff80088
85c04
[  364.887291] [<ffffff80080828f0>] el1_irq+0xb0/0x124
[  364.892747] [<ffffff8008885c04>] _raw_spin_unlock_irq+0x24/0x50
[  364.899368] [<ffffff80080c10a0>] idle_worker_timeout+0x6c/0x78
[  364.905892] [<ffffff8008111924>] call_timer_fn+0xb0/0x1c8
[  364.911929] [<ffffff8008111b48>] expire_timers+0x10c/0x158
[  364.918055] [<ffffff8008111d54>] run_timer_softirq.part.1+0xb0/0xe8
[  364.925062] [<ffffff8008111e50>] run_timer_softirq+0xc4/0xec
[  364.931381] [<ffffff80080810ec>] __do_softirq+0x18c/0x348
[  364.937419] [<ffffff80080af650>] irq_exit+0x7c/0xc4
[  364.942872] [<ffffff80080f94b4>] __handle_domain_irq+0x94/0xb8
[  364.949393] [<ffffff8008080ef8>] gic_handle_irq+0x140/0x1a8
[  364.955623] Exception stack(0xffffff80092ebe10 to 0xffffff80092ebf50)
[  364.962826] be00:                                   00000040eee72000 
ffffff8008e
e8000
[  364.971586] be20: 0000000000000000 ffffff80092ebf60 00000040eee72000 
ffffff80090
2f900
[  364.980345] be40: 00000046de94b940 0000000000000040 ffffffc0f2d35360 
ffffff80092
ebed0
[  364.989105] be60: 0000000000000a80 0000000000000003 0000000000003c93 
00000000000
03c93
[  364.997864] be80: 00000000000c8000 ffffff8008a58000 0000000000000007 
00000000000
0000e
[  365.006624] bea0: 000000000000005c ffffff8008ee8000 ffffff80090e9000 
ffffff80090
e9160
[  365.015383] bec0: ffffff8009010a30 0000000000000000 0000000000000000 
00000000000
00000
[  365.024143] bee0: 0000000000000000 0000000000000000 0000000000000000 
ffffff80092
ebf50
[  365.032902] bf00: ffffff80080855e0 ffffff80092ebf50 ffffff80080855e4 
00000000600
00145
[  365.041662] bf20: ffffff8008ee8000 ffffff8009010a30 ffffffffffffffff 
00000000000
00000
[  365.050420] bf40: ffffff80092ebf50 ffffff80080855e4
[  365.055873] [<ffffff80080828f0>] el1_irq+0xb0/0x124
[  365.061326] [<ffffff80080855e4>] arch_cpu_idle+0x74/0x19c
[  365.067364] [<ffffff8008885a34>] default_idle_call+0x28/0x30
[  365.073694] [<ffffff80080eb114>] do_idle+0xd8/0x198
[  365.079148] [<ffffff80080eb390>] cpu_startup_entry+0x28/0x2c
[  365.085478] [<ffffff8008090c1c>] secondary_start_kernel+0x184/0x190
[  492.786002] INFO: task init:1 blocked for more than 120 seconds.
[  492.792730]       Not tainted 4.13.0-rc6-next-20170822 #100
[  492.799000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this mes
sage.
[  492.807773] init            D    0     1      0 0x00400000
[  492.813922] Call trace:
[  492.816662] [<ffffff8008085c28>] __switch_to+0x7c/0x88
[  492.822415] [<ffffff8008880ebc>] __schedule+0x40c/0x6c4
[  492.828265] [<ffffff8008881204>] schedule+0x90/0xb0
[  492.833726] [<ffffff8008884b54>] schedule_timeout+0x4c/0x354
[  492.840062] [<ffffff8008881f0c>] wait_for_common+0x164/0x19c
[  492.846397] [<ffffff8008881f6c>] wait_for_completion+0x28/0x34
[  492.852928] [<ffffff80081051cc>] __wait_rcu_gp+0x160/0x18c
[  492.859062] [<ffffff800810b248>] synchronize_rcu+0x6c/0x90
[  492.865205] [<ffffff80086b455c>] synchronize_net+0x38/0x40
[  492.871348] [<ffffff8008309c4c>] selinux_netcache_avc_callback+0x30/0x40
[  492.878851] [<ffffff8008302d2c>] avc_ss_reset+0x44/0xa4
[  492.884703] [<ffffff800830e97c>] sel_write_enforce+0x138/0x198
[  492.891235] [<ffffff80081f2a00>] __vfs_write+0x60/0x16c
[  492.897086] [<ffffff80081f3e90>] vfs_write+0xc4/0x160
[  492.902740] [<ffffff80081f5348>] SyS_write+0x70/0xc8
[  492.908297] Exception stack(0xffffff800804bec0 to 0xffffff800804c000)
[  492.915513] bec0: 0000000000000003 00000000ff7f46a0 0000000000000001 
00000000fff
fffff
[  492.924285] bee0: 00000000ff7f46a0 0000000000000003 0000000005a417b8 
00000000000
00004
[  492.933075] bf00: 00000000ff7f5708 00000000ffffffff 00000000045f8d0c 
00000000000
00000
[  492.941843] bf20: 0000000000000000 00000000ff7f4694 00000000ea720eaf 
00000000000
00000
[  492.950609] bf40: 0000000000000000 0000000000000000 0000000000000000 
00000000000
00000
[  492.959375] bf60: 0000000000000000 0000000000000000 0000000000000000 
00000000000
00000
[  492.968141] bf80: 0000000000000000 0000000000000000 0000000000000000 
00000000000
00000
[  492.976907] bfa0: 0000000000000000 0000000000000000 0000000000000000 
00000000000
00000
[  492.985674] bfc0: 00000000ea6aba56 0000000060000030 0000000000000003 
00000000000
00004
[  492.994440] bfe0: 0000000000000000 0000000000000000 0000000000000000 
00000000000
00000
[  493.003207] [<ffffff8008082fb0>] el0_svc_naked+0x24/0x28
[  493.009153] Kernel panic - not syncing: hung_task: blocked tasks
[  493.015872] CPU: 0 PID: 43 Comm: khungtaskd Not tainted 
4.13.0-rc6-next-20170822
  #100
[  493.024630] Hardware name: Google Kevin (DT)
[  493.029400] Call trace:
[  493.032132] [<ffffff8008089ed4>] dump_backtrace+0x0/0x220
[  493.038169] [<ffffff800808a118>] show_stack+0x24/0x30
[  493.043819] [<ffffff800886fb6c>] dump_stack+0x98/0xb8
[  493.049468] [<ffffff80080a99e0>] panic+0x144/0x2a0
[  493.054826] [<ffffff800814e788>] watchdog+0x348/0x380
[  493.060475] [<ffffff80080c9de0>] kthread+0x134/0x13c
[  493.066025] [<ffffff8008084bf0>] ret_from_fork+0x10/0x20
[  493.071966] SMP: stopping secondary CPUs
[  493.076352] Kernel Offset: disabled
[  493.080249] CPU features: 0x00200c
[  493.084048] Memory Limit: none

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
  2017-08-23  3:21 ` [v2] " jeffy
@ 2017-08-23  3:35   ` Stephen Rothwell
  2017-08-23  7:01   ` Thomas Gleixner
  1 sibling, 0 replies; 6+ messages in thread
From: Stephen Rothwell @ 2017-08-23  3:35 UTC (permalink / raw)
  To: jeffy
  Cc: Nicholas Piggin, Thomas Gleixner, akpm, John Stultz,
	Stephen Boyd, Jonathan.Cameron, paulmck, mpe, dzickus, linuxarm,
	abdhalee, linux-kernel, torvalds

Hi jeffy,

On Wed, 23 Aug 2017 11:21:40 +0800 jeffy <jeffy.chen@rock-chips.com> wrote:
>
> I was testing a arm64 based device(chromebook bob), and the kernel hang 
> a lot(during booting or suspending) on for-next(next-20170822) with this 
> commit:
> 71acb768f5b3 (timers: Fix excessive granularity of new timers after a 
> nohz idle)

I had the same problem with a powerpc le qemu boot.  Paul has now
removed v1 from his rcu tree, so it will not be in linux-next today.

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
  2017-08-23  3:21 ` [v2] " jeffy
  2017-08-23  3:35   ` Stephen Rothwell
@ 2017-08-23  7:01   ` Thomas Gleixner
  2017-08-23  7:29     ` jeffy
  1 sibling, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2017-08-23  7:01 UTC (permalink / raw)
  To: jeffy
  Cc: Nicholas Piggin, akpm, John Stultz, Stephen Boyd,
	Jonathan.Cameron, paulmck, mpe, dzickus, sfr, linuxarm, abdhalee,
	linux-kernel, torvalds

On Wed, 23 Aug 2017, jeffy wrote:

> Hi guys,
> 
> I was testing a arm64 based device(chromebook bob), and the kernel hang a
> lot(during booting or suspending) on for-next(next-20170822) with this commit:
> 71acb768f5b3 (timers: Fix excessive granularity of new timers after a nohz
> idle)
> 
> After revert it(or just add "!base->is_idle" check back), things work well...

Can you please test V2?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
  2017-08-23  7:01   ` Thomas Gleixner
@ 2017-08-23  7:29     ` jeffy
  0 siblings, 0 replies; 6+ messages in thread
From: jeffy @ 2017-08-23  7:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nicholas Piggin, akpm, John Stultz, Stephen Boyd,
	Jonathan.Cameron, paulmck, mpe, dzickus, sfr, linuxarm, abdhalee,
	linux-kernel, torvalds

Hi Thomas,

On 08/23/2017 03:01 PM, Thomas Gleixner wrote:
> On Wed, 23 Aug 2017, jeffy wrote:
>
>> Hi guys,
>>
>> I was testing a arm64 based device(chromebook bob), and the kernel hang a
>> lot(during booting or suspending) on for-next(next-20170822) with this commit:
>> 71acb768f5b3 (timers: Fix excessive granularity of new timers after a nohz
>> idle)
>>
>> After revert it(or just add "!base->is_idle" check back), things work well...
>
> Can you please test V2?
>
hmm, v2 works...

Tested-by: Jeffy Chen <jeffy.chen@rock-chips.com>

> Thanks,
>
> 	tglx
>
>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tip:timers/urgent] timers: Fix excessive granularity of new timers after a nohz idle
  2017-08-22  8:43 [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle Nicholas Piggin
  2017-08-23  3:21 ` [v2] " jeffy
@ 2017-08-24  9:42 ` tip-bot for Nicholas Piggin
  1 sibling, 0 replies; 6+ messages in thread
From: tip-bot for Nicholas Piggin @ 2017-08-24  9:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Jonathan.Cameron, sboyd, mingo, hpa, tglx, john.stultz,
	linux-kernel, npiggin, davem

Commit-ID:  2fe59f507a65dbd734b990a11ebc7488f6f87a24
Gitweb:     http://git.kernel.org/tip/2fe59f507a65dbd734b990a11ebc7488f6f87a24
Author:     Nicholas Piggin <npiggin@gmail.com>
AuthorDate: Tue, 22 Aug 2017 18:43:48 +1000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 24 Aug 2017 11:40:18 +0200

timers: Fix excessive granularity of new timers after a nohz idle

When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.

However there are several problems:

- If an existing timer is modified, the base is forwarded only after
  the index is calculated.

- The base is not forwarded by add_timer_on.

- There is a window after a timer is restarted from a nohz idle, after
  it is marked not-idle and before the timer tick on this CPU, where a
  timer may be added but the ancient base does not get forwarded.

These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.

Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.

There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.

This has been tested and found to fix the RCU softlockup messages.

Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:

             max     avg      std
upstream   506.0    1.20     4.68
patched      2.0    1.08     0.15

The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:

	     max     avg      std
upstream     9.0    1.05     0.19
patched      2.0    1.04     0.11

Fixes: Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: David Miller <davem@davemloft.net>
Cc: dzickus@redhat.com
Cc: sfr@canb.auug.org.au
Cc: mpe@ellerman.id.au
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: linuxarm@huawei.com
Cc: abdhalee@linux.vnet.ibm.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: akpm@linux-foundation.org
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com

---
 kernel/time/timer.c | 50 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 9 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8f5d1bf..f2674a0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -203,6 +203,7 @@ struct timer_base {
 	bool			migration_enabled;
 	bool			nohz_active;
 	bool			is_idle;
+	bool			must_forward_clk;
 	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
 	struct hlist_head	vectors[WHEEL_SIZE];
 } ____cacheline_aligned;
@@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-	unsigned long jnow = READ_ONCE(jiffies);
+	unsigned long jnow;
 
 	/*
-	 * We only forward the base when it's idle and we have a delta between
-	 * base clock and jiffies.
+	 * We only forward the base when we are idle or have just come out of
+	 * idle (must_forward_clk logic), and have a delta between base clock
+	 * and jiffies. In the common case, run_timers will take care of it.
 	 */
-	if (!base->is_idle || (long) (jnow - base->clk) < 2)
+	if (likely(!base->must_forward_clk))
+		return;
+
+	jnow = READ_ONCE(jiffies);
+	base->must_forward_clk = base->is_idle;
+	if ((long)(jnow - base->clk) < 2)
 		return;
 
 	/*
@@ -938,6 +945,11 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 	 * same array bucket then just return:
 	 */
 	if (timer_pending(timer)) {
+		/*
+		 * The downside of this optimization is that it can result in
+		 * larger granularity than you would get from adding a new
+		 * timer with this expiry.
+		 */
 		if (timer->expires == expires)
 			return 1;
 
@@ -948,6 +960,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 		 * dequeue/enqueue dance.
 		 */
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 
 		clk = base->clk;
 		idx = calc_wheel_index(expires, clk);
@@ -964,6 +977,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 		}
 	} else {
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 	}
 
 	ret = detach_if_pending(timer, base, false);
@@ -991,12 +1005,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 			raw_spin_lock(&base->lock);
 			WRITE_ONCE(timer->flags,
 				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
+			forward_timer_base(base);
 		}
 	}
 
-	/* Try to forward a stale timer base clock */
-	forward_timer_base(base);
-
 	timer->expires = expires;
 	/*
 	 * If 'idx' was calculated above and the base time did not advance
@@ -1112,6 +1124,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
 		WRITE_ONCE(timer->flags,
 			   (timer->flags & ~TIMER_BASEMASK) | cpu);
 	}
+	forward_timer_base(base);
 
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
@@ -1497,10 +1510,16 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 		if (!is_max_delta)
 			expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
 		/*
-		 * If we expect to sleep more than a tick, mark the base idle:
+		 * If we expect to sleep more than a tick, mark the base idle.
+		 * Also the tick is stopped so any added timer must forward
+		 * the base clk itself to keep granularity small. This idle
+		 * logic is only maintained for the BASE_STD base, deferrable
+		 * timers may still see large granularity skew (by design).
 		 */
-		if ((expires - basem) > TICK_NSEC)
+		if ((expires - basem) > TICK_NSEC) {
+			base->must_forward_clk = true;
 			base->is_idle = true;
+		}
 	}
 	raw_spin_unlock(&base->lock);
 
@@ -1611,6 +1630,19 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
 {
 	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
 
+	/*
+	 * must_forward_clk must be cleared before running timers so that any
+	 * timer functions that call mod_timer will not try to forward the
+	 * base. idle trcking / clock forwarding logic is only used with
+	 * BASE_STD timers.
+	 *
+	 * The deferrable base does not do idle tracking at all, so we do
+	 * not forward it. This can result in very large variations in
+	 * granularity for deferrable timers, but they can be deferred for
+	 * long periods due to idle.
+	 */
+	base->must_forward_clk = false;
+
 	__run_timers(base);
 	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
 		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-08-24  9:46 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-22  8:43 [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle Nicholas Piggin
2017-08-23  3:21 ` [v2] " jeffy
2017-08-23  3:35   ` Stephen Rothwell
2017-08-23  7:01   ` Thomas Gleixner
2017-08-23  7:29     ` jeffy
2017-08-24  9:42 ` [tip:timers/urgent] " tip-bot for Nicholas Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).