* [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle
@ 2017-08-22 8:43 Nicholas Piggin
2017-08-23 3:21 ` [v2] " jeffy
2017-08-24 9:42 ` [tip:timers/urgent] " tip-bot for Nicholas Piggin
0 siblings, 2 replies; 6+ messages in thread
From: Nicholas Piggin @ 2017-08-22 8:43 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Nicholas Piggin, akpm, John Stultz, Stephen Boyd,
Jonathan.Cameron, paulmck, mpe, dzickus, sfr, linuxarm, abdhalee,
linux-kernel, torvalds
When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.
However there are several problems:
- If an existing timer is modified, the base is forwarded only after
the index is calculated.
- The base is not forwarded by add_timer_on.
- There is a window after a timer is restarted from a nohz ide, after
it is marked not-idle and before the timer tick on this CPU, where a
timer may be added but the ancient base does not get forwarded.
These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.
Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.
There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.
This has been tested and found to fix the RCU softlockup messages.
Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:
max avg std
upstream 506.0 1.20 4.68
patched 2.0 1.08 0.15
The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:
max avg std
upstream 9.0 1.05 0.19
patched 2.0 1.04 0.11
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: David Miller <davem@davemloft.net>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
- Address Thomas' comments (hopefully).
- Reword the changelog a bit.
- Added Jonathan's tested-by
kernel/time/timer.c | 50 +++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 41 insertions(+), 9 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8f5d1bf18854..f2674a056c26 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -203,6 +203,7 @@ struct timer_base {
bool migration_enabled;
bool nohz_active;
bool is_idle;
+ bool must_forward_clk;
DECLARE_BITMAP(pending_map, WHEEL_SIZE);
struct hlist_head vectors[WHEEL_SIZE];
} ____cacheline_aligned;
@@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
static inline void forward_timer_base(struct timer_base *base)
{
- unsigned long jnow = READ_ONCE(jiffies);
+ unsigned long jnow;
/*
- * We only forward the base when it's idle and we have a delta between
- * base clock and jiffies.
+ * We only forward the base when we are idle or have just come out of
+ * idle (must_forward_clk logic), and have a delta between base clock
+ * and jiffies. In the common case, run_timers will take care of it.
*/
- if (!base->is_idle || (long) (jnow - base->clk) < 2)
+ if (likely(!base->must_forward_clk))
+ return;
+
+ jnow = READ_ONCE(jiffies);
+ base->must_forward_clk = base->is_idle;
+ if ((long)(jnow - base->clk) < 2)
return;
/*
@@ -938,6 +945,11 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
* same array bucket then just return:
*/
if (timer_pending(timer)) {
+ /*
+ * The downside of this optimization is that it can result in
+ * larger granularity than you would get from adding a new
+ * timer with this expiry.
+ */
if (timer->expires == expires)
return 1;
@@ -948,6 +960,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
* dequeue/enqueue dance.
*/
base = lock_timer_base(timer, &flags);
+ forward_timer_base(base);
clk = base->clk;
idx = calc_wheel_index(expires, clk);
@@ -964,6 +977,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
}
} else {
base = lock_timer_base(timer, &flags);
+ forward_timer_base(base);
}
ret = detach_if_pending(timer, base, false);
@@ -991,12 +1005,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
raw_spin_lock(&base->lock);
WRITE_ONCE(timer->flags,
(timer->flags & ~TIMER_BASEMASK) | base->cpu);
+ forward_timer_base(base);
}
}
- /* Try to forward a stale timer base clock */
- forward_timer_base(base);
-
timer->expires = expires;
/*
* If 'idx' was calculated above and the base time did not advance
@@ -1112,6 +1124,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
WRITE_ONCE(timer->flags,
(timer->flags & ~TIMER_BASEMASK) | cpu);
}
+ forward_timer_base(base);
debug_activate(timer, timer->expires);
internal_add_timer(base, timer);
@@ -1497,10 +1510,16 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
if (!is_max_delta)
expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
/*
- * If we expect to sleep more than a tick, mark the base idle:
+ * If we expect to sleep more than a tick, mark the base idle.
+ * Also the tick is stopped so any added timer must forward
+ * the base clk itself to keep granularity small. This idle
+ * logic is only maintained for the BASE_STD base, deferrable
+ * timers may still see large granularity skew (by design).
*/
- if ((expires - basem) > TICK_NSEC)
+ if ((expires - basem) > TICK_NSEC) {
+ base->must_forward_clk = true;
base->is_idle = true;
+ }
}
raw_spin_unlock(&base->lock);
@@ -1611,6 +1630,19 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ /*
+ * must_forward_clk must be cleared before running timers so that any
+ * timer functions that call mod_timer will not try to forward the
+ * base. idle trcking / clock forwarding logic is only used with
+ * BASE_STD timers.
+ *
+ * The deferrable base does not do idle tracking at all, so we do
+ * not forward it. This can result in very large variations in
+ * granularity for deferrable timers, but they can be deferred for
+ * long periods due to idle.
+ */
+ base->must_forward_clk = false;
+
__run_timers(base);
if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
--
2.13.3
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
2017-08-22 8:43 [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle Nicholas Piggin
@ 2017-08-23 3:21 ` jeffy
2017-08-23 3:35 ` Stephen Rothwell
2017-08-23 7:01 ` Thomas Gleixner
2017-08-24 9:42 ` [tip:timers/urgent] " tip-bot for Nicholas Piggin
1 sibling, 2 replies; 6+ messages in thread
From: jeffy @ 2017-08-23 3:21 UTC (permalink / raw)
To: Nicholas Piggin, Thomas Gleixner
Cc: akpm, John Stultz, Stephen Boyd, Jonathan.Cameron, paulmck, mpe,
dzickus, sfr, linuxarm, abdhalee, linux-kernel, torvalds
Hi guys,
I was testing a arm64 based device(chromebook bob), and the kernel hang
a lot(during booting or suspending) on for-next(next-20170822) with this
commit:
71acb768f5b3 (timers: Fix excessive granularity of new timers after a
nohz idle)
After revert it(or just add "!base->is_idle" check back), things work
well...
my dts:
arch/arm64/boot/dts/rockchip/rk3399-gru-kevin.dts
my cpu info:
localhost / # cat /proc/cpuinfo
processor : 0
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 1
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 2
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 3
model name : ARMv8 Processor rev 4 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4
processor : 4
model name : ARMv8 Processor rev 2 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 2
processor : 5
model name : ARMv8 Processor rev 2 (v8l)
BogoMIPS : 48.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt lpa
e evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 2
my hang dmesg:
[ 304.797633] EXT4-fs (mmcblk1p3): mounted filesystem without journal.
Opts: (null
)
[ 304.806070] VFS: Mounted root (ext2 filesystem) on device 179:3.
[ 304.815775] devtmpfs: mounted
[ 304.823110] Freeing unused kernel memory: 2048K
[ 304.891944] audit: type=1404 audit(304.617:2): enforcing=1
old_enforcing=0 auid=
4294967295 ses=4294967295
[ 364.655978] INFO: rcu_preempt self-detected stall on CPU
[ 364.661924] 3-...: (23 GPs behind) idle=486/2/0 softirq=39/40 fqs=14991
[ 364.669513] (t=60000 jiffies g=-244 c=-245 q=71)
[ 364.674872] Task dump for CPU 3:
[ 364.678478] swapper/3 R running task 0 0 1
0x00000022
[ 364.686365] Call trace:
[ 364.689095] [<ffffff8008089ed4>] dump_backtrace+0x0/0x220
[ 364.695133] [<ffffff800808a118>] show_stack+0x24/0x30
[ 364.700784] [<ffffff80080d84fc>] sched_show_task+0x138/0x168
[ 364.707112] [<ffffff80080d96bc>] dump_cpu_task+0x48/0x54
[ 364.713054] [<ffffff800810ce80>] rcu_dump_cpu_stacks+0xa0/0xe0
[ 364.719577] [<ffffff800810b9bc>] rcu_check_callbacks+0x328/0x934
[ 364.726295] [<ffffff8008112b20>] update_process_times+0x34/0x5c
[ 364.732919] [<ffffff8008123164>] tick_sched_handle.isra.3+0x54/0x68
[ 364.739928] [<ffffff80081231c0>] tick_sched_timer+0x48/0x84
[ 364.746160] [<ffffff8008113b40>] __hrtimer_run_queues+0x194/0x31c
[ 364.752974] [<ffffff8008113eb0>] hrtimer_interrupt+0xac/0x1bc
[ 364.759402] [<ffffff80086152a4>] arch_timer_handler_phys+0x3c/0x4c
[ 364.766316] [<ffffff80080fe838>] handle_percpu_devid_irq+0xf4/0x270
[ 364.773324] [<ffffff80080f8df0>] generic_handle_irq+0x2c/0x44
[ 364.779749] [<ffffff80080f94b0>] __handle_domain_irq+0x90/0xb8
[ 364.786271] [<ffffff8008080ef8>] gic_handle_irq+0x140/0x1a8
[ 364.792502] Exception stack(0xffffff800801bba0 to 0xffffff800801bce0)
[ 364.799696] bba0: 000000000000b7e9 0000000000000003 00000040eee72000
ffffffc0f7d
601f8
[ 364.808456] bbc0: 000000000000003f ffffffc0f7d5efb8 ffffff800801bde8
80000000000
00000
[ 364.817215] bbe0: 0000000000000001 0000000000000000 ffffff8009028338
00000000000
00000
[ 364.825975] bc00: 0000000000000105 0000000000000130 0000000000005dc0
ffffff8008a
58000
[ 364.834734] bc20: 0000000000000007 000000000000000e 000000000000005c
ffffffc0002
18000
[ 364.843494] bc40: ffffff8009007000 ffffffc000218058 ffffff80080c1034
ffffffc0002
18000
[ 364.852253] bc60: ffffff800801bde0 ffffffc000218000 000000000000000a
00000000000
00001
[ 364.861013] bc80: ffffff80090e8000 ffffff800801bce0 ffffff8008885c00
ffffff80080
1bce0
[ 364.869773] bca0: ffffff8008885c04 0000000060000145 ffffffc000218058
ffffff80080
c1034
[ 364.878532] bcc0: 0000008000000000 ffffff8008885c00 ffffff800801bce0
ffffff80088
85c04
[ 364.887291] [<ffffff80080828f0>] el1_irq+0xb0/0x124
[ 364.892747] [<ffffff8008885c04>] _raw_spin_unlock_irq+0x24/0x50
[ 364.899368] [<ffffff80080c10a0>] idle_worker_timeout+0x6c/0x78
[ 364.905892] [<ffffff8008111924>] call_timer_fn+0xb0/0x1c8
[ 364.911929] [<ffffff8008111b48>] expire_timers+0x10c/0x158
[ 364.918055] [<ffffff8008111d54>] run_timer_softirq.part.1+0xb0/0xe8
[ 364.925062] [<ffffff8008111e50>] run_timer_softirq+0xc4/0xec
[ 364.931381] [<ffffff80080810ec>] __do_softirq+0x18c/0x348
[ 364.937419] [<ffffff80080af650>] irq_exit+0x7c/0xc4
[ 364.942872] [<ffffff80080f94b4>] __handle_domain_irq+0x94/0xb8
[ 364.949393] [<ffffff8008080ef8>] gic_handle_irq+0x140/0x1a8
[ 364.955623] Exception stack(0xffffff80092ebe10 to 0xffffff80092ebf50)
[ 364.962826] be00: 00000040eee72000
ffffff8008e
e8000
[ 364.971586] be20: 0000000000000000 ffffff80092ebf60 00000040eee72000
ffffff80090
2f900
[ 364.980345] be40: 00000046de94b940 0000000000000040 ffffffc0f2d35360
ffffff80092
ebed0
[ 364.989105] be60: 0000000000000a80 0000000000000003 0000000000003c93
00000000000
03c93
[ 364.997864] be80: 00000000000c8000 ffffff8008a58000 0000000000000007
00000000000
0000e
[ 365.006624] bea0: 000000000000005c ffffff8008ee8000 ffffff80090e9000
ffffff80090
e9160
[ 365.015383] bec0: ffffff8009010a30 0000000000000000 0000000000000000
00000000000
00000
[ 365.024143] bee0: 0000000000000000 0000000000000000 0000000000000000
ffffff80092
ebf50
[ 365.032902] bf00: ffffff80080855e0 ffffff80092ebf50 ffffff80080855e4
00000000600
00145
[ 365.041662] bf20: ffffff8008ee8000 ffffff8009010a30 ffffffffffffffff
00000000000
00000
[ 365.050420] bf40: ffffff80092ebf50 ffffff80080855e4
[ 365.055873] [<ffffff80080828f0>] el1_irq+0xb0/0x124
[ 365.061326] [<ffffff80080855e4>] arch_cpu_idle+0x74/0x19c
[ 365.067364] [<ffffff8008885a34>] default_idle_call+0x28/0x30
[ 365.073694] [<ffffff80080eb114>] do_idle+0xd8/0x198
[ 365.079148] [<ffffff80080eb390>] cpu_startup_entry+0x28/0x2c
[ 365.085478] [<ffffff8008090c1c>] secondary_start_kernel+0x184/0x190
[ 492.786002] INFO: task init:1 blocked for more than 120 seconds.
[ 492.792730] Not tainted 4.13.0-rc6-next-20170822 #100
[ 492.799000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this mes
sage.
[ 492.807773] init D 0 1 0 0x00400000
[ 492.813922] Call trace:
[ 492.816662] [<ffffff8008085c28>] __switch_to+0x7c/0x88
[ 492.822415] [<ffffff8008880ebc>] __schedule+0x40c/0x6c4
[ 492.828265] [<ffffff8008881204>] schedule+0x90/0xb0
[ 492.833726] [<ffffff8008884b54>] schedule_timeout+0x4c/0x354
[ 492.840062] [<ffffff8008881f0c>] wait_for_common+0x164/0x19c
[ 492.846397] [<ffffff8008881f6c>] wait_for_completion+0x28/0x34
[ 492.852928] [<ffffff80081051cc>] __wait_rcu_gp+0x160/0x18c
[ 492.859062] [<ffffff800810b248>] synchronize_rcu+0x6c/0x90
[ 492.865205] [<ffffff80086b455c>] synchronize_net+0x38/0x40
[ 492.871348] [<ffffff8008309c4c>] selinux_netcache_avc_callback+0x30/0x40
[ 492.878851] [<ffffff8008302d2c>] avc_ss_reset+0x44/0xa4
[ 492.884703] [<ffffff800830e97c>] sel_write_enforce+0x138/0x198
[ 492.891235] [<ffffff80081f2a00>] __vfs_write+0x60/0x16c
[ 492.897086] [<ffffff80081f3e90>] vfs_write+0xc4/0x160
[ 492.902740] [<ffffff80081f5348>] SyS_write+0x70/0xc8
[ 492.908297] Exception stack(0xffffff800804bec0 to 0xffffff800804c000)
[ 492.915513] bec0: 0000000000000003 00000000ff7f46a0 0000000000000001
00000000fff
fffff
[ 492.924285] bee0: 00000000ff7f46a0 0000000000000003 0000000005a417b8
00000000000
00004
[ 492.933075] bf00: 00000000ff7f5708 00000000ffffffff 00000000045f8d0c
00000000000
00000
[ 492.941843] bf20: 0000000000000000 00000000ff7f4694 00000000ea720eaf
00000000000
00000
[ 492.950609] bf40: 0000000000000000 0000000000000000 0000000000000000
00000000000
00000
[ 492.959375] bf60: 0000000000000000 0000000000000000 0000000000000000
00000000000
00000
[ 492.968141] bf80: 0000000000000000 0000000000000000 0000000000000000
00000000000
00000
[ 492.976907] bfa0: 0000000000000000 0000000000000000 0000000000000000
00000000000
00000
[ 492.985674] bfc0: 00000000ea6aba56 0000000060000030 0000000000000003
00000000000
00004
[ 492.994440] bfe0: 0000000000000000 0000000000000000 0000000000000000
00000000000
00000
[ 493.003207] [<ffffff8008082fb0>] el0_svc_naked+0x24/0x28
[ 493.009153] Kernel panic - not syncing: hung_task: blocked tasks
[ 493.015872] CPU: 0 PID: 43 Comm: khungtaskd Not tainted
4.13.0-rc6-next-20170822
#100
[ 493.024630] Hardware name: Google Kevin (DT)
[ 493.029400] Call trace:
[ 493.032132] [<ffffff8008089ed4>] dump_backtrace+0x0/0x220
[ 493.038169] [<ffffff800808a118>] show_stack+0x24/0x30
[ 493.043819] [<ffffff800886fb6c>] dump_stack+0x98/0xb8
[ 493.049468] [<ffffff80080a99e0>] panic+0x144/0x2a0
[ 493.054826] [<ffffff800814e788>] watchdog+0x348/0x380
[ 493.060475] [<ffffff80080c9de0>] kthread+0x134/0x13c
[ 493.066025] [<ffffff8008084bf0>] ret_from_fork+0x10/0x20
[ 493.071966] SMP: stopping secondary CPUs
[ 493.076352] Kernel Offset: disabled
[ 493.080249] CPU features: 0x00200c
[ 493.084048] Memory Limit: none
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
2017-08-23 3:21 ` [v2] " jeffy
@ 2017-08-23 3:35 ` Stephen Rothwell
2017-08-23 7:01 ` Thomas Gleixner
1 sibling, 0 replies; 6+ messages in thread
From: Stephen Rothwell @ 2017-08-23 3:35 UTC (permalink / raw)
To: jeffy
Cc: Nicholas Piggin, Thomas Gleixner, akpm, John Stultz,
Stephen Boyd, Jonathan.Cameron, paulmck, mpe, dzickus, linuxarm,
abdhalee, linux-kernel, torvalds
Hi jeffy,
On Wed, 23 Aug 2017 11:21:40 +0800 jeffy <jeffy.chen@rock-chips.com> wrote:
>
> I was testing a arm64 based device(chromebook bob), and the kernel hang
> a lot(during booting or suspending) on for-next(next-20170822) with this
> commit:
> 71acb768f5b3 (timers: Fix excessive granularity of new timers after a
> nohz idle)
I had the same problem with a powerpc le qemu boot. Paul has now
removed v1 from his rcu tree, so it will not be in linux-next today.
--
Cheers,
Stephen Rothwell
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
2017-08-23 3:21 ` [v2] " jeffy
2017-08-23 3:35 ` Stephen Rothwell
@ 2017-08-23 7:01 ` Thomas Gleixner
2017-08-23 7:29 ` jeffy
1 sibling, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2017-08-23 7:01 UTC (permalink / raw)
To: jeffy
Cc: Nicholas Piggin, akpm, John Stultz, Stephen Boyd,
Jonathan.Cameron, paulmck, mpe, dzickus, sfr, linuxarm, abdhalee,
linux-kernel, torvalds
On Wed, 23 Aug 2017, jeffy wrote:
> Hi guys,
>
> I was testing a arm64 based device(chromebook bob), and the kernel hang a
> lot(during booting or suspending) on for-next(next-20170822) with this commit:
> 71acb768f5b3 (timers: Fix excessive granularity of new timers after a nohz
> idle)
>
> After revert it(or just add "!base->is_idle" check back), things work well...
Can you please test V2?
Thanks,
tglx
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [v2] timers: Fix excessive granularity of new timers after a nohz idle
2017-08-23 7:01 ` Thomas Gleixner
@ 2017-08-23 7:29 ` jeffy
0 siblings, 0 replies; 6+ messages in thread
From: jeffy @ 2017-08-23 7:29 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Nicholas Piggin, akpm, John Stultz, Stephen Boyd,
Jonathan.Cameron, paulmck, mpe, dzickus, sfr, linuxarm, abdhalee,
linux-kernel, torvalds
Hi Thomas,
On 08/23/2017 03:01 PM, Thomas Gleixner wrote:
> On Wed, 23 Aug 2017, jeffy wrote:
>
>> Hi guys,
>>
>> I was testing a arm64 based device(chromebook bob), and the kernel hang a
>> lot(during booting or suspending) on for-next(next-20170822) with this commit:
>> 71acb768f5b3 (timers: Fix excessive granularity of new timers after a nohz
>> idle)
>>
>> After revert it(or just add "!base->is_idle" check back), things work well...
>
> Can you please test V2?
>
hmm, v2 works...
Tested-by: Jeffy Chen <jeffy.chen@rock-chips.com>
> Thanks,
>
> tglx
>
>
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* [tip:timers/urgent] timers: Fix excessive granularity of new timers after a nohz idle
2017-08-22 8:43 [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle Nicholas Piggin
2017-08-23 3:21 ` [v2] " jeffy
@ 2017-08-24 9:42 ` tip-bot for Nicholas Piggin
1 sibling, 0 replies; 6+ messages in thread
From: tip-bot for Nicholas Piggin @ 2017-08-24 9:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: Jonathan.Cameron, sboyd, mingo, hpa, tglx, john.stultz,
linux-kernel, npiggin, davem
Commit-ID: 2fe59f507a65dbd734b990a11ebc7488f6f87a24
Gitweb: http://git.kernel.org/tip/2fe59f507a65dbd734b990a11ebc7488f6f87a24
Author: Nicholas Piggin <npiggin@gmail.com>
AuthorDate: Tue, 22 Aug 2017 18:43:48 +1000
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 24 Aug 2017 11:40:18 +0200
timers: Fix excessive granularity of new timers after a nohz idle
When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.
However there are several problems:
- If an existing timer is modified, the base is forwarded only after
the index is calculated.
- The base is not forwarded by add_timer_on.
- There is a window after a timer is restarted from a nohz idle, after
it is marked not-idle and before the timer tick on this CPU, where a
timer may be added but the ancient base does not get forwarded.
These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.
Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.
There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.
This has been tested and found to fix the RCU softlockup messages.
Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:
max avg std
upstream 506.0 1.20 4.68
patched 2.0 1.08 0.15
The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:
max avg std
upstream 9.0 1.05 0.19
patched 2.0 1.04 0.11
Fixes: Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: David Miller <davem@davemloft.net>
Cc: dzickus@redhat.com
Cc: sfr@canb.auug.org.au
Cc: mpe@ellerman.id.au
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: linuxarm@huawei.com
Cc: abdhalee@linux.vnet.ibm.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: akpm@linux-foundation.org
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
---
kernel/time/timer.c | 50 +++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 41 insertions(+), 9 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8f5d1bf..f2674a0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -203,6 +203,7 @@ struct timer_base {
bool migration_enabled;
bool nohz_active;
bool is_idle;
+ bool must_forward_clk;
DECLARE_BITMAP(pending_map, WHEEL_SIZE);
struct hlist_head vectors[WHEEL_SIZE];
} ____cacheline_aligned;
@@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
static inline void forward_timer_base(struct timer_base *base)
{
- unsigned long jnow = READ_ONCE(jiffies);
+ unsigned long jnow;
/*
- * We only forward the base when it's idle and we have a delta between
- * base clock and jiffies.
+ * We only forward the base when we are idle or have just come out of
+ * idle (must_forward_clk logic), and have a delta between base clock
+ * and jiffies. In the common case, run_timers will take care of it.
*/
- if (!base->is_idle || (long) (jnow - base->clk) < 2)
+ if (likely(!base->must_forward_clk))
+ return;
+
+ jnow = READ_ONCE(jiffies);
+ base->must_forward_clk = base->is_idle;
+ if ((long)(jnow - base->clk) < 2)
return;
/*
@@ -938,6 +945,11 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
* same array bucket then just return:
*/
if (timer_pending(timer)) {
+ /*
+ * The downside of this optimization is that it can result in
+ * larger granularity than you would get from adding a new
+ * timer with this expiry.
+ */
if (timer->expires == expires)
return 1;
@@ -948,6 +960,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
* dequeue/enqueue dance.
*/
base = lock_timer_base(timer, &flags);
+ forward_timer_base(base);
clk = base->clk;
idx = calc_wheel_index(expires, clk);
@@ -964,6 +977,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
}
} else {
base = lock_timer_base(timer, &flags);
+ forward_timer_base(base);
}
ret = detach_if_pending(timer, base, false);
@@ -991,12 +1005,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
raw_spin_lock(&base->lock);
WRITE_ONCE(timer->flags,
(timer->flags & ~TIMER_BASEMASK) | base->cpu);
+ forward_timer_base(base);
}
}
- /* Try to forward a stale timer base clock */
- forward_timer_base(base);
-
timer->expires = expires;
/*
* If 'idx' was calculated above and the base time did not advance
@@ -1112,6 +1124,7 @@ void add_timer_on(struct timer_list *timer, int cpu)
WRITE_ONCE(timer->flags,
(timer->flags & ~TIMER_BASEMASK) | cpu);
}
+ forward_timer_base(base);
debug_activate(timer, timer->expires);
internal_add_timer(base, timer);
@@ -1497,10 +1510,16 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
if (!is_max_delta)
expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
/*
- * If we expect to sleep more than a tick, mark the base idle:
+ * If we expect to sleep more than a tick, mark the base idle.
+ * Also the tick is stopped so any added timer must forward
+ * the base clk itself to keep granularity small. This idle
+ * logic is only maintained for the BASE_STD base, deferrable
+ * timers may still see large granularity skew (by design).
*/
- if ((expires - basem) > TICK_NSEC)
+ if ((expires - basem) > TICK_NSEC) {
+ base->must_forward_clk = true;
base->is_idle = true;
+ }
}
raw_spin_unlock(&base->lock);
@@ -1611,6 +1630,19 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ /*
+ * must_forward_clk must be cleared before running timers so that any
+ * timer functions that call mod_timer will not try to forward the
+ * base. idle trcking / clock forwarding logic is only used with
+ * BASE_STD timers.
+ *
+ * The deferrable base does not do idle tracking at all, so we do
+ * not forward it. This can result in very large variations in
+ * granularity for deferrable timers, but they can be deferred for
+ * long periods due to idle.
+ */
+ base->must_forward_clk = false;
+
__run_timers(base);
if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
^ permalink raw reply related [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-08-24 9:46 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-22 8:43 [PATCH v2] timers: Fix excessive granularity of new timers after a nohz idle Nicholas Piggin
2017-08-23 3:21 ` [v2] " jeffy
2017-08-23 3:35 ` Stephen Rothwell
2017-08-23 7:01 ` Thomas Gleixner
2017-08-23 7:29 ` jeffy
2017-08-24 9:42 ` [tip:timers/urgent] " tip-bot for Nicholas Piggin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).