linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* x86-tip tsc/tick gripage
@ 2017-04-26  8:02 Mike Galbraith
  2017-04-26  8:16 ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26  8:02 UTC (permalink / raw)
  To: LKML; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker

Greetings,

After picking up the pieces of my tip-rt tree, I'm seeing grumbling on
two boxen, and it ain't me, the below is virgin tip.  The second box
has crap BIOS (replacement ready to be installed), but works fine if
the sync code gets the things synchronized before giving up (bumping
loop count a wee bit makes that reliable).  This first box is my
primary RT test box, not a racehorse (old nag), but highly reliable.

tip v4.11-rc8-893-g8ec9e12aff06, trusty ole 8 socket (X7560) DL980 G7

[   11.856619] tsc: Refined TSC clocksource calibration: 2260.999 MHz
[   11.867703] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
[   12.951432] clocksource: Switched to clocksource tsc
[  255.674776] clocksource: timekeeping watchdog on CPU54: Marking clocksource 'tsc' as unstable because the skew is too large:
[  255.895200] clocksource:                       'tsc' cs_now: 11b2a29f798 cs_last: c4faa1e90a mask: ffffffffffffffff
[  256.105724] tsc: Marking TSC unstable due to clocksource watchdog
[  316.976942] ------------[ cut here ]------------
[  316.980923] WARNING: CPU: 0 PID: 0 at kernel/time/tick-sched.c:874 tick_nohz_stop_sched_tick+0x339/0x380
[  316.980923] Modules linked in: autofs4(E) edd(E) af_packet(E) cpufreq_conservative(E) cpufreq_userspace(E) cpufreq_powersave(E) fuse(E) loop(E) md_mod(E) dm_mod(E) vhost_net(E) vhost(E) tap(E) tun(E) kvm_intel(E) iTCO_wdt(E) kvm(E) gpio_ich(E) iTCO_vendor_support(E) i7core_edac(E) ipmi_ssif(E) joydev(E) lpc_ich(E) edac_core(E) hpilo(E) bnx2(E) netxen_nic(E) sr_mod(E) hid_generic(E) hpwdt(E) irqbypass(E) mfd_core(E) shpchp(E) ipmi_si(E) pcc_cpufreq(E) ehci_pci(E) ipmi_msghandler(E) acpi_cpufreq(E) cdrom(E) sg(E) acpi_power_meter(E) pcspkr(E) button(E) ext4(E) mbcache(E) jbd2(E) crc16(E) usbhid(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) thermal(E) sd_mod(E) scsi_dh_hp_sw(E) scsi_dh_emc(E) scsi_dh_rdac(E) scsi_dh_alua(E) ata_generic(E) ata_piix(E) libata(E) hpsa(E) scsi_transport_sas(E) cciss(E) scsi_mod(E)
[  316.980923] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E   4.11.0-tip-default #57
[  316.980923] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[  316.980923] task: ffffffff81c104c0 task.stack: ffffffff81c00000
[  316.980923] RIP: 0010:tick_nohz_stop_sched_tick+0x339/0x380
[  316.980923] RSP: 0018:ffff88027fc03f40 EFLAGS: 00010093
[  316.980923] RAX: 0000000000000001 RBX: ffff88027fc15d20 RCX: 00000049cc0cd28a
[  316.980923] RDX: 00000068e05e6aa0 RSI: ffff88017e286010 RDI: ffff88027fc15e00
[  316.980923] RBP: 00000049cc0cbf00 R08: ffffffffffffffdd R09: 0000000040001000
[  316.980923] R10: 00000001000010b3 R11: 0000000000000000 R12: 00000049cc0cf0de
[  316.980923] R13: ffff88027fc0d640 R14: 00000049a9b7af00 R15: 00000049a9b7af00
[  316.980923] FS:  0000000000000000(0000) GS:ffff88027fc00000(0000) knlGS:0000000000000000
[  316.980923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  316.980923] CR2: 00007f229a37e000 CR3: 0000000001c09000 CR4: 00000000000006f0
[  316.980923] Call Trace:
[  316.980923]  <IRQ>
[  316.980923]  ? __tick_nohz_idle_enter+0x8e/0x150
[  316.980923]  ? irq_exit+0x77/0xe0
[  316.980923]  ? do_IRQ+0x4c/0xd0
[  316.980923]  ? common_interrupt+0x8c/0x8c
[  316.980923]  </IRQ>
[  316.980923]  ? poll_idle+0x2d/0x58
[  316.980923]  ? cpuidle_enter_state+0x9d/0x260
[  316.980923]  ? do_idle+0x165/0x1d0
[  316.980923]  ? cpu_startup_entry+0x5d/0x70
[  316.980923]  ? start_kernel+0x481/0x48c
[  316.980923]  ? set_init_arg+0x50/0x50
[  316.980923]  ? early_idt_handler_array+0x120/0x120
[  316.980923]  ? x86_64_start_kernel+0x147/0x156
[  316.980923]  ? secondary_startup_64+0x9f/0x9f
[  316.980923] Code: 49 39 cf 7c 26 48 89 c8 e9 94 fe ff ff f3 90 e9 03 fd ff ff 83 7b 48 02 75 d9 48 89 df e8 d0 1a ff ff 49 8b 45 18 e9 76 fe ff ff <0f> ff 80 3d ee 59 cd 00 00 0f 85 8d fe ff ff 31 c0 4c 89 fa 48 
[  316.980923] ---[ end trace 996f71f82094d582 ]---
[  316.980923] basemono: 316956000000 ts->next_tick: 316380000000 dev->next_event: 316956005002

tip v4.11-rc8-887-gefb93ce741c6, 2 socket E7-8890 v3 (72 core) box

[   29.473651] tsc: Refined TSC clocksource calibration: 2493.986 MHz
[   29.473701] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23f308a7388, max_idle_ns: 440795280393 ns
[   30.506471] clocksource: Switched to clocksource tsc
[  716.538171] clocksource: timekeeping watchdog on CPU67: Marking clocksource 'tsc' as unstable because the skew is too large:
[  716.584589] clocksource:                       'tsc' cs_now: d171cd5ce006e cs_last: d165a9ce422d1 mask: ffffffffffffffff
[  716.609048] tsc: Marking TSC unstable due to clocksource watchdog
[ 1067.093433] ------------[ cut here ]------------
[ 1067.097416] WARNING: CPU: 0 PID: 0 at kernel/time/tick-sched.c:874 tick_nohz_stop_sched_tick+0x339/0x380
[ 1067.097416] Modules linked in: af_packet(E) iscsi_ibft(E) intel_rapl(E) iscsi_boot_sysfs(E) sb_edac(E) edac_core(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) ext4(E) crc16(E) irqbypass(E) jbd2(E) mbcache(E) joydev(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) ipmi_ssif(E) ixgbe(E) iTCO_wdt(E) aesni_intel(E) i40e(E) iTCO_vendor_support(E) aes_x86_64(E) mdio(E) crypto_simd(E) lpc_ich(E) mptctl(E) ptp(E) glue_helper(E) ipmi_si(E) cryptd(E) pcspkr(E) pps_core(E) dca(E) mfd_core(E) i2c_i801(E) mptbase(E) ipmi_devintf(E) ipmi_msghandler(E) wmi(E) button(E) acpi_pad(E) shpchp(E) sunrpc(E) btrfs(E) xor(E) raid6_pq(E) sd_mod(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ahci(E)
[ 1067.097416]  fb_sys_fops(E) ehci_pci(E) libahci(E) ttm(E) ehci_hcd(E) crc32c_intel(E) libata(E) drm(E) usbcore(E) mpt3sas(E) raid_class(E) scsi_transport_sas(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E)
[ 1067.097416] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E   4.11.0-tip-default #226
[ 1067.097416] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014
[ 1067.097416] task: ffffffff81c104c0 task.stack: ffffffff81c00000
[ 1067.097416] RIP: 0010:tick_nohz_stop_sched_tick+0x339/0x380
[ 1067.097416] RSP: 0018:ffff88085f803f40 EFLAGS: 00010087
[ 1067.097416] RAX: 0000000000000001 RBX: ffff88085f815d20 RCX: 000000f86d81fe00
[ 1067.097416] RDX: 800000f86d70d2ff RSI: 0000000000000048 RDI: ffff88085f815e00
[ 1067.097416] RBP: 000000f86d70d300 R08: ffffffffffffff2c R09: 000000004002ed00
[ 1067.097416] R10: 000000010002edd8 R11: 0000000000000000 R12: 000000f86d82121a
[ 1067.097416] R13: ffff88085f80d640 R14: 000000de6fa7af00 R15: 000000de6fa7af00
[ 1067.097416] FS:  0000000000000000(0000) GS:ffff88085f800000(0000) knlGS:0000000000000000
[ 1067.097416] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1067.097416] CR2: 00007fba363082a7 CR3: 000000017c3f9000 CR4: 00000000001406f0
[ 1067.097416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1067.097416] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1067.097416] Call Trace:
[ 1067.097416]  <IRQ>
[ 1067.097416]  ? __tick_nohz_idle_enter+0x8e/0x150
[ 1067.097416]  ? irq_exit+0x77/0xe0
[ 1067.097416]  ? do_IRQ+0x4c/0xd0
[ 1067.097416]  ? common_interrupt+0x8c/0x8c
[ 1067.097416]  </IRQ>
[ 1067.097416]  ? cpuidle_enter_state+0xe3/0x260
[ 1067.097416]  ? do_idle+0x165/0x1d0
[ 1067.097416]  ? cpu_startup_entry+0x5d/0x70
[ 1067.097416]  ? start_kernel+0x481/0x48c
[ 1067.097416]  ? set_init_arg+0x50/0x50
[ 1067.097416]  ? early_idt_handler_array+0x120/0x120
[ 1067.097416]  ? x86_64_start_kernel+0x147/0x156
[ 1067.097416]  ? secondary_startup_64+0x9f/0x9f
[ 1067.097416] Code: 49 39 cf 7c 26 48 89 c8 e9 94 fe ff ff f3 90 e9 03 fd ff ff 83 7b 48 02 75 d9 48 89 df e8 d0 1a ff ff 49 8b 45 18 e9 76 fe ff ff <0f> ff 80 3d 6e 5a cd 00 00 0f 85 8d fe ff ff 31 c0 4c 89 fa 48 
[ 1067.097416] ---[ end trace 65463b720a8bd23d ]---
[ 1067.097416] basemono: 1066988000000 ts->next_tick: 955356000000 dev->next_event: 1066989125120

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26  8:02 x86-tip tsc/tick gripage Mike Galbraith
@ 2017-04-26  8:16 ` Mike Galbraith
  2017-04-26  8:21   ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26  8:16 UTC (permalink / raw)
  To: LKML; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 10:02 +0200, Mike Galbraith wrote:

> tip v4.11-rc8-893-g8ec9e12aff06, trusty ole 8 socket (X7560) DL980 G7

Ew, DL980 then turned into unhappy RCU camper.

[  316.980923] basemono: 316956000000 ts->next_tick: 316380000000 dev->next_event: 316956005002
[  689.893122] INFO: rcu_sched detected stalls on CPUs/tasks:
[  689.897115] 	6-...: (1456 GPs behind) idle=114/0/0 softirq=935/935 fqs=0 
[  689.897115] 	7-...: (11 GPs behind) idle=358/0/0 softirq=1724/1724 fqs=0 
[  689.897115] 	8-...: (11 GPs behind) idle=5ac/0/0 softirq=1618/1619 fqs=0 
[  689.897115] 	9-...: (13 GPs behind) idle=318/0/0 softirq=500/500 fqs=0 
[  689.897115] 	10-...: (2045 GPs behind) idle=42c/0/0 softirq=480/480 fqs=0 
[  689.897115] 	12-...: (5442 GPs behind) idle=270/0/0 softirq=437/437 fqs=0 
[  689.897115] 	13-...: (418 GPs behind) idle=844/0/0 softirq=449/449 fqs=0 
[  689.897115] 	14-...: (14 GPs behind) idle=95c/0/0 softirq=508/509 fqs=0 
[  689.897115] 	15-...: (14 GPs behind) idle=b78/0/0 softirq=475/476 fqs=0 
[  689.897115] 	16-...: (9825 GPs behind) idle=22c/0/0 softirq=1308/1308 fqs=0 
[  689.897115] 	17-...: (10286 GPs behind) idle=454/0/0 softirq=402/402 fqs=0 
[  689.897115] 	18-...: (6734 GPs behind) idle=8e0/0/0 softirq=440/440 fqs=0 
[  689.897115] 	19-...: (6734 GPs behind) idle=604/0/0 softirq=378/378 fqs=0 
[  689.897115] 	22-...: (5162 GPs behind) idle=3c8/0/0 softirq=391/391 fqs=0 
[  689.897115] 	24-...: (229 GPs behind) idle=3d0/0/0 softirq=1470/1470 fqs=0 
[  689.897115] 	25-...: (9809 GPs behind) idle=2f0/0/0 softirq=436/436 fqs=0 
[  689.897115] 	26-...: (9809 GPs behind) idle=418/0/0 softirq=413/413 fqs=0 
[  689.897115] 	27-...: (9809 GPs behind) idle=458/0/0 softirq=438/438 fqs=0 
[  689.897115] 	28-...: (9809 GPs behind) idle=dd4/0/0 softirq=494/494 fqs=0 
[  689.897115] 	29-...: (9572 GPs behind) idle=3fc/0/0 softirq=400/400 fqs=0 
[  689.897115] 	30-...: (4950 GPs behind) idle=084/0/0 softirq=422/422 fqs=0 
[  689.897115] 	31-...: (4993 GPs behind) idle=ba4/0/0 softirq=366/366 fqs=0 
[  689.897115] 	32-...: (2048 GPs behind) idle=608/0/0 softirq=1258/1259 fqs=0 
[  689.897115] 	33-...: (2046 GPs behind) idle=f14/0/0 softirq=1267/1268 fqs=0 
[  689.897115] 	34-...: (0 ticks this GP) idle=950/0/0 softirq=357/357 fqs=0 
[  689.897115] 	35-...: (9813 GPs behind) idle=7a8/0/0 softirq=301/301 fqs=0 
[  689.897115] 	36-...: (9813 GPs behind) idle=5c0/0/0 softirq=299/299 fqs=0 
[  689.897115] 	37-...: (9828 GPs behind) idle=478/0/0 softirq=286/286 fqs=0 
[  689.897115] 	38-...: (0 ticks this GP) idle=8d4/0/0 softirq=307/307 fqs=0 
[  689.897115] 	39-...: (2048 GPs behind) idle=2ac/0/0 softirq=304/305 fqs=0 
[  689.897115] 	40-...: (0 ticks this GP) idle=bd0/0/0 softirq=1184/1184 fqs=0 
[  689.897115] 	41-...: (428 GPs behind) idle=d14/0/0 softirq=339/339 fqs=0 
[  689.897115] 	42-...: (243 GPs behind) idle=c38/0/0 softirq=338/338 fqs=0 
[  689.897115] 	43-...: (9809 GPs behind) idle=870/0/0 softirq=289/289 fqs=0 
[  689.897115] 	46-...: (9815 GPs behind) idle=654/0/0 softirq=261/261 fqs=0 
[  689.897115] 	47-...: (2134 GPs behind) idle=c98/0/0 softirq=303/303 fqs=0 
[  689.897115] 	48-...: (571 GPs behind) idle=4c0/0/0 softirq=1135/1135 fqs=0 
[  689.897115] 	49-...: (2211 GPs behind) idle=bcc/0/0 softirq=267/267 fqs=0 
[  689.897115] 	50-...: (2195 GPs behind) idle=334/0/0 softirq=232/233 fqs=0 
[  689.897115] 	51-...: (97 GPs behind) idle=e30/0/0 softirq=374/374 fqs=0 
[  689.897115] 	52-...: (252 GPs behind) idle=c14/0/0 softirq=207/208 fqs=0 
[  689.897115] 	53-...: (0 ticks this GP) idle=dd4/0/0 softirq=218/218 fqs=0 
[  689.897115] 	54-...: (526 GPs behind) idle=ce8/0/0 softirq=292/293 fqs=0 
[  689.897115] 	55-...: (573 GPs behind) idle=924/0/0 softirq=191/192 fqs=0 
[  689.897115] 	56-...: (68 GPs behind) idle=028/0/0 softirq=1141/1141 fqs=0 
[  689.897115] 	57-...: (5442 GPs behind) idle=f48/0/0 softirq=319/319 fqs=0 
[  689.897115] 	58-...: (5211 GPs behind) idle=18c/0/0 softirq=185/185 fqs=0 
[  689.897115] 	59-...: (255 GPs behind) idle=a30/0/0 softirq=276/277 fqs=0 
[  689.897115] 	60-...: (5442 GPs behind) idle=e54/0/0 softirq=155/155 fqs=0 
[  689.897115] 	61-...: (5151 GPs behind) idle=270/0/0 softirq=169/170 fqs=0 
[  689.897115] 	62-...: (9819 GPs behind) idle=c50/0/0 softirq=116/116 fqs=0 
[  689.897115] 	63-...: (2195 GPs behind) idle=e4c/0/0 softirq=141/142 fqs=0 
[  689.897115] 	(detected by 2, t=15689 jiffies, g=12275, c=12274, q=520)
[  689.897115] Sending NMI from CPU 2 to CPUs 6:
[  693.749331] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 7:
[  693.750354] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 8:
[  693.751358] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff8167bdc8
[  689.897115] Sending NMI from CPU 2 to CPUs 9:
[  693.752357] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 10:
[  693.753360] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 12:
[  693.754358] NMI backtrace for cpu 12 skipped: idling at pc 0xffffffff8167bdcd
[  689.897115] Sending NMI from CPU 2 to CPUs 13:
[  693.755360] NMI backtrace for cpu 13 skipped: idling at pc 0xffffffff8167bdd0
[  689.897115] Sending NMI from CPU 2 to CPUs 14:
[  693.756361] NMI backtrace for cpu 14 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 15:
[  693.757364] NMI backtrace for cpu 15 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 16:
[  693.758561] NMI backtrace for cpu 16 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 17:
[  693.759368] NMI backtrace for cpu 17 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 18:
[  693.760370] NMI backtrace for cpu 18 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 19:
[  693.761377] NMI backtrace for cpu 19 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 22:
[  693.762377] NMI backtrace for cpu 22 skipped: idling at pc 0xffffffff8167bdd0
[  689.897115] Sending NMI from CPU 2 to CPUs 24:
[  693.763390] NMI backtrace for cpu 24 skipped: idling at pc 0xffffffff8167bdcd
[  689.897115] Sending NMI from CPU 2 to CPUs 25:
[  693.764381] NMI backtrace for cpu 25 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 26:
[  693.765390] NMI backtrace for cpu 26 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 27:
[  693.766389] NMI backtrace for cpu 27 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 28:
[  693.767390] NMI backtrace for cpu 28 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 29:
[  693.768392] NMI backtrace for cpu 29 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 30:
[  693.769396] NMI backtrace for cpu 30 skipped: idling at pc 0xffffffff8167bdc8
[  689.897115] Sending NMI from CPU 2 to CPUs 31:
[  689.897115] NMI backtrace for cpu 31
[  689.897115] CPU: 31 PID: 176 Comm: rcuos/24 Tainted: G        W   E   4.11.0-tip-default #57
[  689.897115] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[  689.897115] task: ffff88017dac1540 task.stack: ffffc900013dc000
[  689.897115] RIP: 0010:native_queued_spin_lock_slowpath+0x170/0x1a0
[  689.897115] RSP: 0018:ffffc900013dfe88 EFLAGS: 00000002
[  689.897115] RAX: 0000000000780101 RBX: 0000000000000202 RCX: 0000000000000001
[  689.897115] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffff81caac40
[  689.897115] RBP: ffff88027741c600 R08: 0000000000000101 R09: 000000000000025d
[  689.897115] R10: 0000000000000265 R11: 0000000000000005 R12: 0000000000000001
[  689.897115] R13: ffff88017dac1540 R14: 0000000000000000 R15: ffff88017dac1540
[  689.897115] FS:  0000000000000000(0000) GS:ffff8802775c0000(0000) knlGS:0000000000000000
[  689.897115] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  689.897115] CR2: 00007f30fd0ce000 CR3: 0000000001c09000 CR4: 00000000000006e0
[  689.897115] Call Trace:
[  689.897115]  ? queued_spin_lock_slowpath+0x7/0xa
[  689.897115]  ? _raw_spin_lock_irqsave+0x39/0x40
[  689.897115]  ? rcu_nocb_kthread+0x24f/0x480
[  689.897115]  ? kthread+0x107/0x140
[  689.897115]  ? rcu_gp_kthread+0x8c0/0x8c0
[  689.897115]  ? kthread_park+0x60/0x60
[  689.897115]  ? ret_from_fork+0x26/0x40
[  689.897115] Code: 89 d0 66 31 c0 39 c1 74 e7 4d 85 c9 c6 07 01 74 2d 41 c7 41 08 01 00 00 00 e9 52 ff ff ff 83 fa 01 74 17 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f3 c3 f3 90 4d 8b 
[  689.897115] Sending NMI from CPU 2 to CPUs 32:
[  693.771432] NMI backtrace for cpu 32 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 33:
[  693.772426] NMI backtrace for cpu 33 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 34:
[  693.773431] NMI backtrace for cpu 34 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 35:
[  693.774432] NMI backtrace for cpu 35 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 36:
[  693.775434] NMI backtrace for cpu 36 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 37:
[  693.776436] NMI backtrace for cpu 37 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 38:
[  693.777442] NMI backtrace for cpu 38 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 39:
[  693.778444] NMI backtrace for cpu 39 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 40:
[  693.779456] NMI backtrace for cpu 40 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 41:
[  693.780448] NMI backtrace for cpu 41 skipped: idling at pc 0xffffffff8167bdd0
[  689.897115] Sending NMI from CPU 2 to CPUs 42:
[  693.781453] NMI backtrace for cpu 42 skipped: idling at pc 0xffffffff8167bdca
[  689.897115] Sending NMI from CPU 2 to CPUs 43:
[  693.782457] NMI backtrace for cpu 43 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 46:
[  693.783459] NMI backtrace for cpu 46 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 47:
[  693.784463] NMI backtrace for cpu 47 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 48:
[  693.785477] NMI backtrace for cpu 48 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 49:
[  693.786470] NMI backtrace for cpu 49 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 50:
[  693.787471] NMI backtrace for cpu 50 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 51:
[  693.788475] NMI backtrace for cpu 51 skipped: idling at pc 0xffffffff8167bdcd
[  689.897115] Sending NMI from CPU 2 to CPUs 52:
[  693.789481] NMI backtrace for cpu 52 skipped: idling at pc 0xffffffff8167bdca
[  689.897115] Sending NMI from CPU 2 to CPUs 53:
[  693.790482] NMI backtrace for cpu 53 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 54:
[  693.791585] NMI backtrace for cpu 54 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 55:
[  693.792587] NMI backtrace for cpu 55 skipped: idling at pc 0xffffffff8167bdca
[  689.897115] Sending NMI from CPU 2 to CPUs 56:
[  693.793606] NMI backtrace for cpu 56 skipped: idling at pc 0xffffffff8167bdca
[  689.897115] Sending NMI from CPU 2 to CPUs 57:
[  693.794593] NMI backtrace for cpu 57 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 58:
[  693.795595] NMI backtrace for cpu 58 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 59:
[  693.796599] NMI backtrace for cpu 59 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 60:
[  693.797605] NMI backtrace for cpu 60 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 61:
[  693.798605] NMI backtrace for cpu 61 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 62:
[  693.799611] NMI backtrace for cpu 62 skipped: idling at pc 0xffffffff8167bcc6
[  689.897115] Sending NMI from CPU 2 to CPUs 63:
[  693.800610] NMI backtrace for cpu 63 skipped: idling at pc 0xffffffff8167bdcd
[ 1447.897093] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1447.901088] 	5-...: (0 ticks this GP) idle=a6c/0/0 softirq=840/840 fqs=0 
[ 1447.901088] 	7-...: (46 GPs behind) idle=4f8/0/0 softirq=1841/1841 fqs=0 
[ 1447.901088] 	8-...: (58 GPs behind) idle=ab8/0/0 softirq=1994/1995 fqs=0 
[ 1447.901088] 	9-...: (58 GPs behind) idle=ebc/0/0 softirq=540/540 fqs=0 
[ 1447.901088] 	10-...: (1681 GPs behind) idle=c14/0/0 softirq=488/489 fqs=0 
[ 1447.901088] 	11-...: (0 ticks this GP) idle=dd8/0/0 softirq=1363/1363 fqs=0 
[ 1447.901088] 	12-...: (17237 GPs behind) idle=a34/0/0 softirq=437/438 fqs=0 
[ 1447.901088] 	13-...: (12213 GPs behind) idle=004/0/0 softirq=449/450 fqs=0 
[ 1447.901088] 	14-...: (58 GPs behind) idle=eec/0/0 softirq=704/705 fqs=0 
[ 1447.901088] 	15-...: (58 GPs behind) idle=0f8/0/0 softirq=3031/3031 fqs=0 
[ 1447.901088] 	16-...: (1374 GPs behind) idle=3a0/0/0 softirq=1310/1310 fqs=0 
[ 1447.901088] 	17-...: (1032 GPs behind) idle=ac4/0/0 softirq=412/412 fqs=0 
[ 1447.901088] 	18-...: (847 GPs behind) idle=d00/0/0 softirq=449/449 fqs=0 
[ 1447.901088] 	19-...: (1374 GPs behind) idle=8ec/0/0 softirq=381/381 fqs=0 
[ 1447.901088] 	20-...: (960 GPs behind) idle=2e4/0/0 softirq=407/407 fqs=0 
[ 1447.901088] 	21-...: (900 GPs behind) idle=6e8/0/0 softirq=463/463 fqs=0 
[ 1447.901088] 	22-...: (1036 GPs behind) idle=a70/0/0 softirq=399/399 fqs=0 
[ 1447.901088] 	23-...: (1373 GPs behind) idle=f08/0/0 softirq=377/377 fqs=0 
[ 1447.901088] 	24-...: (6627 GPs behind) idle=3d0/0/0 softirq=1492/1492 fqs=0 
[ 1447.901088] 	25-...: (11795 GPs behind) idle=2ec/0/0 softirq=438/438 fqs=0 
[ 1447.901088] 	26-...: (11795 GPs behind) idle=4c4/0/0 softirq=415/415 fqs=0 
[ 1447.901088] 	27-...: (11795 GPs behind) idle=4d0/0/0 softirq=440/440 fqs=0 
[ 1447.901088] 	28-...: (21604 GPs behind) idle=dec/0/0 softirq=494/495 fqs=0 
[ 1447.901088] 	29-...: (21367 GPs behind) idle=394/0/0 softirq=400/401 fqs=0 
[ 1447.901088] 	30-...: (16745 GPs behind) idle=0f0/0/0 softirq=422/423 fqs=0 
[ 1447.901088] 	31-...: (11795 GPs behind) idle=b94/0/0 softirq=367/367 fqs=0 
[ 1447.901088] 	32-...: (2042 GPs behind) idle=fa0/0/0 softirq=1263/1264 fqs=0 
[ 1447.901088] 	33-...: (120 GPs behind) idle=d88/0/0 softirq=1283/1284 fqs=0 
[ 1447.901088] 	34-...: (123 GPs behind) idle=4d8/0/0 softirq=359/359 fqs=0 
[ 1447.901088] 	35-...: (21608 GPs behind) idle=184/0/0 softirq=301/302 fqs=0 
[ 1447.901088] 	36-...: (21608 GPs behind) idle=fa0/0/0 softirq=299/300 fqs=0 
[ 1447.901088] 	37-...: (21623 GPs behind) idle=f0c/0/0 softirq=286/287 fqs=0 
[ 1447.901088] 	38-...: (214 GPs behind) idle=e54/0/0 softirq=324/324 fqs=0 
[ 1447.901088] 	39-...: (84 GPs behind) idle=c74/0/0 softirq=316/316 fqs=0 
[ 1447.901088] 	40-...: (0 ticks this GP) idle=0b0/0/0 softirq=1201/1201 fqs=0 
[ 1447.901088] 	42-...: (12038 GPs behind) idle=76c/0/0 softirq=338/339 fqs=0 
[ 1447.901088] 	43-...: (11795 GPs behind) idle=404/0/0 softirq=290/290 fqs=0 
[ 1447.901088] 	45-...: (462 GPs behind) idle=e90/0/0 softirq=336/336 fqs=0 
[ 1447.901088] 	46-...: (266 GPs behind) idle=de8/0/0 softirq=269/269 fqs=0 
[ 1447.901088] 	47-...: (161 GPs behind) idle=20c/0/0 softirq=309/309 fqs=0 
[ 1447.901088] 	48-...: (11795 GPs behind) idle=12c/0/0 softirq=1137/1138 fqs=0 
[ 1447.901088] 	49-...: (14006 GPs behind) idle=7a0/0/0 softirq=267/268 fqs=0 
[ 1447.901088] 	50-...: (13990 GPs behind) idle=f4c/0/0 softirq=232/234 fqs=0 
[ 1447.901088] 	52-...: (0 ticks this GP) idle=7bc/0/0 softirq=213/213 fqs=0 
[ 1447.901088] 	53-...: (63 GPs behind) idle=df0/0/0 softirq=238/238 fqs=0 
[ 1447.901088] 	54-...: (5326 GPs behind) idle=01c/0/0 softirq=300/300 fqs=0 
[ 1447.901088] 	55-...: (12368 GPs behind) idle=460/0/0 softirq=191/193 fqs=0 
[ 1447.901088] 	57-...: (17237 GPs behind) idle=ed8/0/0 softirq=319/320 fqs=0 
[ 1447.901088] 	58-...: (7471 GPs behind) idle=f18/0/0 softirq=190/190 fqs=0 
[ 1447.901088] 	59-...: (7474 GPs behind) idle=c70/0/0 softirq=279/280 fqs=0 
[ 1447.901088] 	60-...: (11788 GPs behind) idle=96c/0/0 softirq=157/158 fqs=0 
[ 1447.901088] 	61-...: (759 GPs behind) idle=73c/0/0 softirq=178/178 fqs=0 
[ 1447.901088] 	62-...: (11795 GPs behind) idle=b7c/0/0 softirq=117/117 fqs=0 
[ 1447.901088] 	63-...: (11795 GPs behind) idle=d8c/0/0 softirq=143/143 fqs=0 
[ 1447.901088] 	(detected by 2, t=15938 jiffies, g=24070, c=24069, q=1057)
[ 1447.901088] Sending NMI from CPU 2 to CPUs 5:
[ 1451.999285] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 7:
[ 1452.000294] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 8:
[ 1452.001297] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff8167bdcd
[ 1447.901088] Sending NMI from CPU 2 to CPUs 9:
[ 1452.002296] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 10:
[ 1452.003298] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 11:
[ 1452.004347] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 12:
[ 1452.005301] NMI backtrace for cpu 12 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 13:
[ 1452.006300] NMI backtrace for cpu 13 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 14:
[ 1452.007301] NMI backtrace for cpu 14 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 15:
[ 1452.008303] NMI backtrace for cpu 15 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 16:
[ 1452.009308] NMI backtrace for cpu 16 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 17:
[ 1452.010307] NMI backtrace for cpu 17 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 18:
[ 1452.011309] NMI backtrace for cpu 18 skipped: idling at pc 0xffffffff8167bdc8
[ 1447.901088] Sending NMI from CPU 2 to CPUs 19:
[ 1452.012313] NMI backtrace for cpu 19 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 20:
[ 1452.013319] NMI backtrace for cpu 20 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 21:
[ 1452.014316] NMI backtrace for cpu 21 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 22:
[ 1452.015318] NMI backtrace for cpu 22 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 23:
[ 1452.016320] NMI backtrace for cpu 23 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 24:
[ 1452.017332] NMI backtrace for cpu 24 skipped: idling at pc 0xffffffff8167bdcd
[ 1447.901088] Sending NMI from CPU 2 to CPUs 25:
[ 1452.018325] NMI backtrace for cpu 25 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 26:
[ 1452.019329] NMI backtrace for cpu 26 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 27:
[ 1452.020330] NMI backtrace for cpu 27 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 28:
[ 1452.021335] NMI backtrace for cpu 28 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 29:
[ 1452.022334] NMI backtrace for cpu 29 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 30:
[ 1452.023335] NMI backtrace for cpu 30 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 31:
[ 1452.024339] NMI backtrace for cpu 31 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 32:
[ 1452.025347] NMI backtrace for cpu 32 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 33:
[ 1452.026346] NMI backtrace for cpu 33 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 34:
[ 1452.027346] NMI backtrace for cpu 34 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 35:
[ 1452.028348] NMI backtrace for cpu 35 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 36:
[ 1452.029356] NMI backtrace for cpu 36 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 37:
[ 1452.030354] NMI backtrace for cpu 37 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 38:
[ 1452.031358] NMI backtrace for cpu 38 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 39:
[ 1452.032359] NMI backtrace for cpu 39 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 40:
[ 1452.033368] NMI backtrace for cpu 40 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 42:
[ 1452.034365] NMI backtrace for cpu 42 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 43:
[ 1452.035367] NMI backtrace for cpu 43 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 45:
[ 1452.036371] NMI backtrace for cpu 45 skipped: idling at pc 0xffffffff8167bdc8
[ 1447.901088] Sending NMI from CPU 2 to CPUs 46:
[ 1452.037374] NMI backtrace for cpu 46 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 47:
[ 1452.038476] NMI backtrace for cpu 47 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 48:
[ 1452.039483] NMI backtrace for cpu 48 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 49:
[ 1452.040484] NMI backtrace for cpu 49 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 50:
[ 1452.041486] NMI backtrace for cpu 50 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 52:
[ 1452.042488] NMI backtrace for cpu 52 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 53:
[ 1452.043489] NMI backtrace for cpu 53 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 54:
[ 1452.044493] NMI backtrace for cpu 54 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 55:
[ 1452.045500] NMI backtrace for cpu 55 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 57:
[ 1452.046499] NMI backtrace for cpu 57 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 58:
[ 1452.047502] NMI backtrace for cpu 58 skipped: idling at pc 0xffffffff8167bdca
[ 1447.901088] Sending NMI from CPU 2 to CPUs 59:
[ 1452.048504] NMI backtrace for cpu 59 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 60:
[ 1452.049511] NMI backtrace for cpu 60 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 61:
[ 1452.050509] NMI backtrace for cpu 61 skipped: idling at pc 0xffffffff8167bdcd
[ 1447.901088] Sending NMI from CPU 2 to CPUs 62:
[ 1452.051510] NMI backtrace for cpu 62 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] Sending NMI from CPU 2 to CPUs 63:
[ 1452.052514] NMI backtrace for cpu 63 skipped: idling at pc 0xffffffff8167bcc6
[ 1447.901088] rcu_sched kthread starved for 16040 jiffies! g24070 c24069 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[ 1447.901088] rcu_sched       S    0     9      2 0x00000000
[ 1447.901088] Call Trace:
[ 1447.901088]  ? __schedule+0x28f/0x8b0
[ 1447.901088]  ? schedule+0x39/0x90
[ 1447.901088]  ? schedule_timeout+0x180/0x2f0
[ 1447.901088]  ? call_timer_fn+0x150/0x150
[ 1447.901088]  ? rcu_gp_kthread+0x3b4/0x8c0
[ 1447.901088]  ? kthread+0x107/0x140
[ 1447.901088]  ? force_qs_rnp+0x1a0/0x1a0
[ 1447.901088]  ? kthread_park+0x60/0x60
[ 1447.901088]  ? ret_from_fork+0x26/0x40

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26  8:16 ` Mike Galbraith
@ 2017-04-26  8:21   ` Ingo Molnar
  2017-04-26  8:31     ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2017-04-26  8:21 UTC (permalink / raw)
  To: Mike Galbraith, Paul E. McKenney
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker


* Mike Galbraith <efault@gmx.de> wrote:

> On Wed, 2017-04-26 at 10:02 +0200, Mike Galbraith wrote:
> 
> > tip v4.11-rc8-893-g8ec9e12aff06, trusty ole 8 socket (X7560) DL980 G7
> 
> Ew, DL980 then turned into unhappy RCU camper.
> 
> [  316.980923] basemono: 316956000000 ts->next_tick: 316380000000 dev->next_event: 316956005002
> [  689.893122] INFO: rcu_sched detected stalls on CPUs/tasks:

Probably RCU unrelated.

I have temporarily removed the current timers/urgent lineup from -tip:

 098991fccfc7: nohz: Print more debug info in tick_nohz_stop_sched_tick()
 22aa2ad45fd8: tick: Make sure tick timer is active when bypassing reprogramming
 d58bd60c773d: nohz: Fix again collision between tick and other hrtimers

... and have reintegrated tip:master, so it should be back to working I think.

Does this solve the warning and the RCU stalls on your boxes?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26  8:21   ` Ingo Molnar
@ 2017-04-26  8:31     ` Mike Galbraith
  2017-04-26  8:57       ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26  8:31 UTC (permalink / raw)
  To: Ingo Molnar, Paul E. McKenney
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 10:21 +0200, Ingo Molnar wrote:

> I have temporarily removed the current timers/urgent lineup from -tip:
> 
>  098991fccfc7: nohz: Print more debug info in tick_nohz_stop_sched_tick()
>  22aa2ad45fd8: tick: Make sure tick timer is active when bypassing reprogramming
>  d58bd60c773d: nohz: Fix again collision between tick and other hrtimers
> 
> ... and have reintegrated tip:master, so it should be back to working I think.
> 
> Does this solve the warning and the RCU stalls on your boxes?

(both boxen building)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26  8:31     ` Mike Galbraith
@ 2017-04-26  8:57       ` Mike Galbraith
  2017-04-26  9:18         ` Mike Galbraith
  2017-04-26 10:26         ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26  8:57 UTC (permalink / raw)
  To: Ingo Molnar, Paul E. McKenney
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 10:31 +0200, Mike Galbraith wrote:
> On Wed, 2017-04-26 at 10:21 +0200, Ingo Molnar wrote:
> 
> > I have temporarily removed the current timers/urgent lineup from -tip:
> > 
> >  098991fccfc7: nohz: Print more debug info in tick_nohz_stop_sched_tick()
> >  22aa2ad45fd8: tick: Make sure tick timer is active when bypassing reprogramming
> >  d58bd60c773d: nohz: Fix again collision between tick and other hrtimers
> > 
> > ... and have reintegrated tip:master, so it should be back to working I think.
> > 
> > Does this solve the warning and the RCU stalls on your boxes?
> 
> (both boxen building)

(oh come on, boot already... zzz)

Both still lose their TSC.

[   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
[   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
[   13.064172] clocksource: Switched to clocksource tsc
[  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
[  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
[  240.675057] tsc: Marking TSC unstable due to clocksource watchdog

[  349.192460] clocksource: timekeeping watchdog on CPU63: Marking clocksource 'tsc' as unstable because the skew is too large:
[  349.217756] clocksource:                       'hpet' wd_now: 2bb0dad7 wd_last: 5d9ac241 mask: ffffffff
[  349.239021] clocksource:                       'tsc' cs_now: d2413ebfc1c28 cs_last: d2387b328a38c mask: ffffffffffffffff
[  349.263525] sched_clock: Marking unstable (349228856508, 34664619)<-(351146178141, -1882657014)
[  349.283245] tsc: Marking TSC unstable due to clocksource watchdog
[  349.298075] clocksource: Switched to clocksource hpet

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26  8:57       ` Mike Galbraith
@ 2017-04-26  9:18         ` Mike Galbraith
  2017-04-26 10:26         ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26  9:18 UTC (permalink / raw)
  To: Ingo Molnar, Paul E. McKenney
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Frederic Weisbecker, Paul E. McKenney

On Wed, 2017-04-26 at 10:57 +0200, Mike Galbraith wrote:
> On Wed, 2017-04-26 at 10:31 +0200, Mike Galbraith wrote:
> > On Wed, 2017-04-26 at 10:21 +0200, Ingo Molnar wrote:
> > 
> > > I have temporarily removed the current timers/urgent lineup from -tip:
> > > 
> > >  098991fccfc7: nohz: Print more debug info in tick_nohz_stop_sched_tick()
> > >  22aa2ad45fd8: tick: Make sure tick timer is active when bypassing reprogramming
> > >  d58bd60c773d: nohz: Fix again collision between tick and other hrtimers
> > > 
> > > ... and have reintegrated tip:master, so it should be back to working I think.
> > > 
> > > Does this solve the warning and the RCU stalls on your boxes?
> > 
> > (both boxen building)
> 
> (oh come on, boot already... zzz)
> 
> Both still lose their TSC.

Aw crap, and the DL980 RCU stalled while idling again.

> [   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
> [   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
> [   13.064172] clocksource: Switched to clocksource tsc
> [  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
> [  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
> [  240.675057] tsc: Marking TSC unstable due to clocksource watchdog
> 
> [  349.192460] clocksource: timekeeping watchdog on CPU63: Marking clocksource 'tsc' as unstable because the skew is too large:
> [  349.217756] clocksource:                       'hpet' wd_now: 2bb0dad7 wd_last: 5d9ac241 mask: ffffffff
> [  349.239021] clocksource:                       'tsc' cs_now: d2413ebfc1c28 cs_last: d2387b328a38c mask: ffffffffffffffff
> [  349.263525] sched_clock: Marking unstable (349228856508, 34664619)<-(351146178141, -1882657014)
> [  349.283245] tsc: Marking TSC unstable due to clocksource watchdog
> [  349.298075] clocksource: Switched to clocksource hpet

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26  8:57       ` Mike Galbraith
  2017-04-26  9:18         ` Mike Galbraith
@ 2017-04-26 10:26         ` Peter Zijlstra
  2017-04-26 11:39           ` Mike Galbraith
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2017-04-26 10:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

On Wed, Apr 26, 2017 at 10:57:42AM +0200, Mike Galbraith wrote:

> Both still lose their TSC.
> 
> [   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
> [   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
> [   13.064172] clocksource: Switched to clocksource tsc
> [  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
> [  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
> [  240.675057] tsc: Marking TSC unstable due to clocksource watchdog


And they didn't use to? We don't typically write to TSC or TSC_ADJUST
and thus would not cause such behaviour.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26 10:26         ` Peter Zijlstra
@ 2017-04-26 11:39           ` Mike Galbraith
  2017-04-26 12:30             ` Mike Galbraith
  2017-04-28  7:35             ` [patch] Re: x86-tip tsc/tick gripage Mike Galbraith
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26 11:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 12:26 +0200, Peter Zijlstra wrote:
> On Wed, Apr 26, 2017 at 10:57:42AM +0200, Mike Galbraith wrote:
> 
> > Both still lose their TSC.
> > 
> > [   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
> > [   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
> > [   13.064172] clocksource: Switched to clocksource tsc
> > [  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
> > [  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
> > [  240.675057] tsc: Marking TSC unstable due to clocksource watchdog
> 
> 
> And they didn't use to? We don't typically write to TSC or TSC_ADJUST
> and thus would not cause such behaviour.

Nope.  The DL980 is my RT jitter test box, which I use all the time. 
 The other makes very spiffy jitter numbers when TSCs get synched up
(crap bios).. well used to before suse maintenance elves got around to
doing the BIOS replacement ;-/  (poor box, say hi to the goldfish)

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26 11:39           ` Mike Galbraith
@ 2017-04-26 12:30             ` Mike Galbraith
  2017-04-26 18:13               ` Mike Galbraith
                                 ` (2 more replies)
  2017-04-28  7:35             ` [patch] Re: x86-tip tsc/tick gripage Mike Galbraith
  1 sibling, 3 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26 12:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 13:39 +0200, Mike Galbraith wrote:
> On Wed, 2017-04-26 at 12:26 +0200, Peter Zijlstra wrote:
> > On Wed, Apr 26, 2017 at 10:57:42AM +0200, Mike Galbraith wrote:
> > 
> > > Both still lose their TSC.
> > > 
> > > [   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
> > > [   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
> > > [   13.064172] clocksource: Switched to clocksource tsc
> > > [  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
> > > [  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
> > > [  240.675057] tsc: Marking TSC unstable due to clocksource watchdog
> > 
> > 
> > And they didn't use to? We don't typically write to TSC or TSC_ADJUST
> > and thus would not cause such behaviour.
> 
> Nope.

DL980 seems perfectly happy with master.today.. so off we go.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26 12:30             ` Mike Galbraith
@ 2017-04-26 18:13               ` Mike Galbraith
  2017-04-27  5:26               ` Mike Galbraith
  2017-04-29 16:06               ` [patch] timer: Fix timers_update_migration(), and call it in tmigr_init() Mike Galbraith
  2 siblings, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-26 18:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 14:30 +0200, Mike Galbraith wrote:
> On Wed, 2017-04-26 at 13:39 +0200, Mike Galbraith wrote:
> > On Wed, 2017-04-26 at 12:26 +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 26, 2017 at 10:57:42AM +0200, Mike Galbraith wrote:
> > > 
> > > > Both still lose their TSC.
> > > > 
> > > > [   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
> > > > [   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
> > > > [   13.064172] clocksource: Switched to clocksource tsc
> > > > [  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
> > > > [  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
> > > > [  240.675057] tsc: Marking TSC unstable due to clocksource watchdog
> > > 
> > > 
> > > And they didn't use to? We don't typically write to TSC or TSC_ADJUST
> > > and thus would not cause such behaviour.
> > 
> > Nope.
> 
> DL980 seems perfectly happy with master.today.. so off we go.

ec2206b91d430da57c856869b5a37dc1e569e80d is the first bad commit
commit ec2206b91d430da57c856869b5a37dc1e569e80d
Author: Anna-Maria Gleixner <anna-maria@linutronix.de>
Date:   Mon Mar 20 10:34:20 2017 +0100

    timer: Implement the hierarchical pull model
    
    Placing timers at enqueue time on a target CPU based on dubious heuristics
    does not make any sense:
    
     1) Most timer wheel timers are canceled or rearmed before they expire.
    
     2) The heuristics to predict which CPU will be busy when the timer expires
        are wrong by definition.
    
    So we waste precious cycles to place timers at enqueue time.
    
    The proper solution to this problem is to always queue the timers on the
    local CPU and allow the non pinned timers to be pulled onto a busy CPU at
    expiry time.
    
    To achieve this the timer storage has been split into local pinned and
    global timers. Local pinned timers are always expired on the CPU on which
    they have been queued. Global timers can be expired on any CPU.
    
    As long as a CPU is busy it expires both local and global timers. When a
    CPU goes idle it arms for the first expiring local timer. If the first
    expiring pinned (local) timer is before the first expiring movable timer,
    then no action is required because the CPU will wake up before the first
    movable timer expires. If the first expiring movable timer is before the
    first expiring pinned (local) timer, then this timer is queued into a idle
    timerqueue and eventually expired by some other active CPU.
    
    To avoid global locking the timerqueues are implemented as a hierarchy. The
    lowest level of the hierarchy holds the CPUs. The CPUs are associated to
    groups of 8, which are seperated per node. If more than one CPU group
    exist, then a second level in the hierarchy collects the groups. Depending
    on the size of the system more than 2 levels are required. Each group has a
    "migrator" which checks the timerqueue during the tick for remote expirable
    timers.
    
    If the last CPU in a group goes idle it reports the first expiring event in
    the group up to the next group(s) in the hierarchy. If the last CPU goes
    idle it arms its timer for the first system wide expiring timer to ensure
    that no timer event is missed.
    
    Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

:040000 040000 76086a473919931466dfeea76a17566e2bb15130 5432a1bbc5ace9796e70e91447c6f26eb9eb51be M	include
:040000 040000 102292e74099b7c3e0b2125a4deef728267b4d9d 24385b3c68140fbb3362f32753f296d127e22097 M	kernel

git bisect start
# good: [ea839b41744dffe5c77b8d9842c9bb7073460901] Merge tag 'arc-4.11-final' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
git bisect good ea839b41744dffe5c77b8d9842c9bb7073460901
# bad: [d02c59825bbfd2726430f410321f9d0bfc9d1ef4] Merge branch 'WIP.x86/fpu'
git bisect bad d02c59825bbfd2726430f410321f9d0bfc9d1ef4
# good: [1212249de8142df8c64c4165d1dee8ed4c3d766c] Merge branch 'perf/core'
git bisect good 1212249de8142df8c64c4165d1dee8ed4c3d766c
# good: [28a68c6ee0837004c87845a09717a99b4e96f7d4] Merge branch 'x86-mm-for-xen'
git bisect good 28a68c6ee0837004c87845a09717a99b4e96f7d4
# good: [7fd018291abbc076dc8073d48bba9614ca67b818] Merge branch 'x86/cleanups'
git bisect good 7fd018291abbc076dc8073d48bba9614ca67b818
# good: [6f029d98bc76cf9aa21e73e4beba9014e7e3adbb] Merge branch 'x86/platform'
git bisect good 6f029d98bc76cf9aa21e73e4beba9014e7e3adbb
# bad: [4691eff5d6475014b4c04481b03aff89f5ec87c1] Merge branch 'WIP.timers'
git bisect bad 4691eff5d6475014b4c04481b03aff89f5ec87c1
# good: [0605ab6fd5aa0c78133fc6611fa5e4f17e46c396] sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
git bisect good 0605ab6fd5aa0c78133fc6611fa5e4f17e46c396
# bad: [1e1c48e6befca01abb2ed224a76fbe0ea8c73e30] timer: Always queue timers on the local CPU
git bisect bad 1e1c48e6befca01abb2ed224a76fbe0ea8c73e30
# good: [c82a8b6b216e6910f36a94939de7e8793b25feaa] timer: Retrieve next expiry of pinned/non-pinned timers seperately
git bisect good c82a8b6b216e6910f36a94939de7e8793b25feaa
# good: [e2e1214438bb25ec4e37b8aedb3d3e502535b09b] tick/sched: Split out jiffies update helper function
git bisect good e2e1214438bb25ec4e37b8aedb3d3e502535b09b
# bad: [270c4e558cc26a0d89e89dda46c35ac39457a896] timer_migration: Add tracepoints
git bisect bad 270c4e558cc26a0d89e89dda46c35ac39457a896
# bad: [ec2206b91d430da57c856869b5a37dc1e569e80d] timer: Implement the hierarchical pull model
git bisect bad ec2206b91d430da57c856869b5a37dc1e569e80d

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: x86-tip tsc/tick gripage
  2017-04-26 12:30             ` Mike Galbraith
  2017-04-26 18:13               ` Mike Galbraith
@ 2017-04-27  5:26               ` Mike Galbraith
  2017-04-29 16:06               ` [patch] timer: Fix timers_update_migration(), and call it in tmigr_init() Mike Galbraith
  2 siblings, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-27  5:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 14:30 +0200, Mike Galbraith wrote:
> On Wed, 2017-04-26 at 13:39 +0200, Mike Galbraith wrote:
> > On Wed, 2017-04-26 at 12:26 +0200, Peter Zijlstra wrote:
> > > On Wed, Apr 26, 2017 at 10:57:42AM +0200, Mike Galbraith wrote:
> > > 
> > > > Both still lose their TSC.
> > > > 
> > > > [   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
> > > > [   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
> > > > [   13.064172] clocksource: Switched to clocksource tsc
> > > > [  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
> > > > [  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
> > > > [  240.675057] tsc: Marking TSC unstable due to clocksource watchdog
> > > 
> > > 
> > > And they didn't use to? We don't typically write to TSC or TSC_ADJUST
> > > and thus would not cause such behaviour.
> > 
> > Nope.
> 
> DL980 seems perfectly happy with master.today.. so off we go.

hm, this bit of huge trace looks less than wonderful.

          <idle>-0     [041] ..s2   317.304657: timer_expire_entry: timer=ffffffff820d6600 function=clocksource_watchdog now=4294971392
          <idle>-0     [041] d.s4   317.304660: timer_start: timer=ffffffff820d6600 function=clocksource_watchdog expires=4294916631 [timeout=-54761] cpu=42 idx=19 flags=
          <idle>-0     [041] ..s2   317.304660: timer_expire_exit: timer=ffffffff820d6600                                             ^^^^^^^^^^^^^^

1.1 megalines later, we finally meet function=clocksource_watchdog
again, and have a cow.
          <idle>-0     [043] d.s3   489.511620: timer_cancel: timer=ffffffff820d6600
          <idle>-0     [043] ..s2   489.511621: timer_expire_entry: timer=ffffffff820d6600 function=clocksource_watchdog now=4295014443
          <idle>-0     [043] ..s2   489.511628: clocksource_watchdog: timekeeping watchdog on CPU43: Marking clocksource 'tsc' as unstable because the skew is too large:
          <idle>-0     [043] ..s2   489.511630: clocksource_watchdog:                       'hpet' wd_now: a1cbfa1a wd_last: ed7acfe mask: ffffffff
          <idle>-0     [043] ..s2   489.511630: clocksource_watchdog:                       'tsc' cs_now: 1c24ee60f22 cs_last: 167a92e836f mask: ffffffffffffffff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [patch] Re: x86-tip tsc/tick gripage
  2017-04-26 11:39           ` Mike Galbraith
  2017-04-26 12:30             ` Mike Galbraith
@ 2017-04-28  7:35             ` Mike Galbraith
  2017-04-28  8:45               ` Mike Galbraith
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-28  7:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

On Wed, 2017-04-26 at 13:39 +0200, Mike Galbraith wrote:
> On Wed, 2017-04-26 at 12:26 +0200, Peter Zijlstra wrote:
> > On Wed, Apr 26, 2017 at 10:57:42AM +0200, Mike Galbraith wrote:
> > 
> > > Both still lose their TSC.
> > > 
> > > [   11.982468] tsc: Refined TSC clocksource calibration: 2260.999 MHz
> > > [   11.994275] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x20974a4d8bb, max_idle_ns: 440795246623 ns
> > > [   13.064172] clocksource: Switched to clocksource tsc
> > > [  240.247851] clocksource: timekeeping watchdog on CPU23: Marking clocksource 'tsc' as unstable because the skew is too large:
> > > [  240.462501] clocksource:                       'tsc' cs_now: 108fe5be09f cs_last: b90a6a0676 mask: ffffffffffffffff
> > > [  240.675057] tsc: Marking TSC unstable due to clocksource watchdog
> > 
> > 
> > And they didn't use to? We don't typically write to TSC or TSC_ADJUST
> > and thus would not cause such behaviour.
> 
> Nope.  The DL980 is my RT jitter test box, which I use all the time. 

I found evidence of lysdexia at play (hunk 2), below is an RFFS (frozen
shark) which makes my DL980 a happy camper with tip.

timer/nohz: Call tick_tmigr_idle() with proper parameter order

Extremely late timer expiration soon leads to the clocksource watchdog
killing the TSC clocksource on multisocket boxen.  Call tmigr_cpu_idle(
) with proper parameter order, and pass next_local to tmigr_cpu_idle()
so it can return what it says it returns.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Fixes: ec2206b91d43 timer: Implement the hierarchical pull model
---
 kernel/time/tick-sched.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -695,7 +695,7 @@ tick_tmigr_idle(struct tick_sched *ts, u
 	if (next_global >= next_local)
 		next_global = KTIME_MAX;
 
-	next_global = tmigr_cpu_idle(next_global);
+	next_local = tmigr_cpu_idle(next_local);
 
 	return min_t(u64, next_local, next_global);
 }
@@ -768,7 +768,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	delta = next_local - basemono;
 	if (delta > (u64)TICK_NSEC)
-		next_local = tick_tmigr_idle(ts, next_local, next_global);
+		next_local = tick_tmigr_idle(ts, next_global, next_local);
 
 	ts->next_timer = next_local;
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] Re: x86-tip tsc/tick gripage
  2017-04-28  7:35             ` [patch] Re: x86-tip tsc/tick gripage Mike Galbraith
@ 2017-04-28  8:45               ` Mike Galbraith
  2017-04-28 14:17                 ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-28  8:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

> On Fri, 2017-04-28 at 09:35 +0200, Mike Galbraith wrote:

> ...which makes my DL980 a happy camper with tip.

Bah, nevermind.  I forgot to restore command line.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] Re: x86-tip tsc/tick gripage
  2017-04-28  8:45               ` Mike Galbraith
@ 2017-04-28 14:17                 ` Mike Galbraith
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-28 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul E. McKenney, LKML, Ingo Molnar,
	Thomas Gleixner, Frederic Weisbecker

On Fri, 2017-04-28 at 10:45 +0200, Mike Galbraith wrote:

> Bah, nevermind.  I forgot to restore command line.

Well how 'bout that, it's not only old multi-socket boxen.  I just
reproduced on my i4790 desktop box.  Boot virgin tip to init 3 with
nowatchdog on command line, let box idle...

[  316.387436] clocksource: timekeeping watchdog on CPU7: Marking clocksource 'tsc' as unstable because the skew is too large:
[  316.393887] clocksource:                       'hpet' wd_now: e05e937 wd_last: 75c4e7c8 mask: ffffffff
[  316.400491] clocksource:                       'tsc' cs_now: 12134c63e80 cs_last: 8c0439451e mask: ffffffffffffffff
[  316.406899] sched_clock: Marking unstable (316377815439, 9598524)<-(316386903152, 510811)
[  316.413257] tsc: Marking TSC unstable due to clocksource watchdog
[  316.419223] clocksource: Switched to clocksource hpet
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-26 12:30             ` Mike Galbraith
  2017-04-26 18:13               ` Mike Galbraith
  2017-04-27  5:26               ` Mike Galbraith
@ 2017-04-29 16:06               ` Mike Galbraith
  2017-04-29 18:06                 ` Paul E. McKenney
  2 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-29 16:06 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Paul E. McKenney, Ingo Molnar, Thomas Gleixner,
	PeterZijlstra, Frederic Weisbecker

Note: there is more.  With this applied, my desktop box will no longer
reproduce when booted to init 3 with nowatchdog on the command line. 
 My 8 socket DL980 OTOH still will, though it takes longer, and is
seemingly no longer interested in following up  with a permanent RCU
stall after the tsc clocksource is killed, as it does in virgin source.

---

timers_update_migration() is called by tick_nohz_activate() before
the late initcall tmigr_init() sets tmigr_enabled to true, resulting
in it updating neither timer_base.nohz_active nor .migration_enabled,
meaning we'll not kick an idling cpu in add_timer_on().

Remove redundant loop avoidance such that tick_nohz_activate() updates
timer_bases[].nohz_active as intended, and call it in tmigr_init() to
update timer_bases[].migration_enabled.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Fixes: ec2206b91d43 timer: Implement the hierarchical pull model
---
 kernel/time/timer.c           |    4 ----
 kernel/time/timer_migration.c |    1 +
 2 files changed, 1 insertion(+), 4 deletions(-)

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -224,10 +224,6 @@ void timers_update_migration(bool update
 	bool on = sysctl_timer_migration && tick_nohz_active && tmigr_enabled;
 	unsigned int cpu;
 
-	/* Avoid the loop, if nothing to update */
-	if (this_cpu_read(timer_bases[BASE_GLOBAL].migration_enabled) == on)
-		return;
-
 	for_each_possible_cpu(cpu) {
 		per_cpu(timer_bases[BASE_LOCAL].migration_enabled, cpu) = on;
 		per_cpu(timer_bases[BASE_GLOBAL].migration_enabled, cpu) = on;
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -649,6 +649,7 @@ static int __init tmigr_init(void)
 		goto hp_err;
 
 	tmigr_enabled = true;
+	timers_update_migration(false);
 	pr_info("Timer migration: %d hierarchy levels\n", tmigr_hierarchy_levels);
 	return 0;
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-29 16:06               ` [patch] timer: Fix timers_update_migration(), and call it in tmigr_init() Mike Galbraith
@ 2017-04-29 18:06                 ` Paul E. McKenney
  2017-04-29 18:20                   ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2017-04-29 18:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sat, Apr 29, 2017 at 06:06:40PM +0200, Mike Galbraith wrote:
> Note: there is more.  With this applied, my desktop box will no longer
> reproduce when booted to init 3 with nowatchdog on the command line. 
>  My 8 socket DL980 OTOH still will, though it takes longer, and is
> seemingly no longer interested in following up  with a permanent RCU
> stall after the tsc clocksource is killed, as it does in virgin source.
> 
> ---
> 
> timers_update_migration() is called by tick_nohz_activate() before
> the late initcall tmigr_init() sets tmigr_enabled to true, resulting
> in it updating neither timer_base.nohz_active nor .migration_enabled,
> meaning we'll not kick an idling cpu in add_timer_on().
> 
> Remove redundant loop avoidance such that tick_nohz_activate() updates
> timer_bases[].nohz_active as intended, and call it in tmigr_init() to
> update timer_bases[].migration_enabled.
> 
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Fixes: ec2206b91d43 timer: Implement the hierarchical pull model

If someone will either repost a fresh series or point me at exactly
the set of patches to use, I will run it through rcutorture again.

							Thanx, Paul

> ---
>  kernel/time/timer.c           |    4 ----
>  kernel/time/timer_migration.c |    1 +
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -224,10 +224,6 @@ void timers_update_migration(bool update
>  	bool on = sysctl_timer_migration && tick_nohz_active && tmigr_enabled;
>  	unsigned int cpu;
> 
> -	/* Avoid the loop, if nothing to update */
> -	if (this_cpu_read(timer_bases[BASE_GLOBAL].migration_enabled) == on)
> -		return;
> -
>  	for_each_possible_cpu(cpu) {
>  		per_cpu(timer_bases[BASE_LOCAL].migration_enabled, cpu) = on;
>  		per_cpu(timer_bases[BASE_GLOBAL].migration_enabled, cpu) = on;
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -649,6 +649,7 @@ static int __init tmigr_init(void)
>  		goto hp_err;
> 
>  	tmigr_enabled = true;
> +	timers_update_migration(false);
>  	pr_info("Timer migration: %d hierarchy levels\n", tmigr_hierarchy_levels);
>  	return 0;
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-29 18:06                 ` Paul E. McKenney
@ 2017-04-29 18:20                   ` Mike Galbraith
  2017-04-29 21:45                     ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-29 18:20 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:

> If someone will either repost a fresh series or point me at exactly
> the set of patches to use, I will run it through rcutorture again.

Patchlet is against x86-tip/master.today.

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-29 18:20                   ` Mike Galbraith
@ 2017-04-29 21:45                     ` Paul E. McKenney
  2017-04-30  1:21                       ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2017-04-29 21:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sat, Apr 29, 2017 at 08:20:33PM +0200, Mike Galbraith wrote:
> On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:
> 
> > If someone will either repost a fresh series or point me at exactly
> > the set of patches to use, I will run it through rcutorture again.
> 
> Patchlet is against x86-tip/master.today.

So today's (as in Saturday April 29) x86-tip/master with the following
patch applied?

							Thanx, Paul

------------------------------------------------------------------------

timers_update_migration() is called by tick_nohz_activate() before
the late initcall tmigr_init() sets tmigr_enabled to true, resulting
in it updating neither timer_base.nohz_active nor .migration_enabled,
meaning we'll not kick an idling cpu in add_timer_on().

Remove redundant loop avoidance such that tick_nohz_activate() updates
timer_bases[].nohz_active as intended, and call it in tmigr_init() to
update timer_bases[].migration_enabled.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Fixes: ec2206b91d43 timer: Implement the hierarchical pull model
---
 kernel/time/timer.c           |    4 ----
 kernel/time/timer_migration.c |    1 +
 2 files changed, 1 insertion(+), 4 deletions(-)

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -224,10 +224,6 @@ void timers_update_migration(bool update
 	bool on = sysctl_timer_migration && tick_nohz_active && tmigr_enabled;
 	unsigned int cpu;

-	/* Avoid the loop, if nothing to update */
-	if (this_cpu_read(timer_bases[BASE_GLOBAL].migration_enabled) == on)
-		return;
-
 	for_each_possible_cpu(cpu) {
 		per_cpu(timer_bases[BASE_LOCAL].migration_enabled, cpu) = on;
 		per_cpu(timer_bases[BASE_GLOBAL].migration_enabled, cpu) = on;
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -649,6 +649,7 @@ static int __init tmigr_init(void)
 		goto hp_err;

 	tmigr_enabled = true;
+	timers_update_migration(false);
 	pr_info("Timer migration: %d hierarchy levels\n", tmigr_hierarchy_levels);
 	return 0;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-29 21:45                     ` Paul E. McKenney
@ 2017-04-30  1:21                       ` Mike Galbraith
  2017-04-30  3:43                         ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Galbraith @ 2017-04-30  1:21 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sat, 2017-04-29 at 14:45 -0700, Paul E. McKenney wrote:
> On Sat, Apr 29, 2017 at 08:20:33PM +0200, Mike Galbraith wrote:
> > On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:
> > 
> > > If someone will either repost a fresh series or point me at exactly
> > > the set of patches to use, I will run it through rcutorture again.
> > 
> > Patchlet is against x86-tip/master.today.
> 
> So today's (as in Saturday April 29) x86-tip/master with the following
> patch applied?

Yeah.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-30  1:21                       ` Mike Galbraith
@ 2017-04-30  3:43                         ` Paul E. McKenney
  2017-04-30  4:20                           ` Mike Galbraith
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2017-04-30  3:43 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sun, Apr 30, 2017 at 03:21:58AM +0200, Mike Galbraith wrote:
> On Sat, 2017-04-29 at 14:45 -0700, Paul E. McKenney wrote:
> > On Sat, Apr 29, 2017 at 08:20:33PM +0200, Mike Galbraith wrote:
> > > On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:
> > > 
> > > > If someone will either repost a fresh series or point me at exactly
> > > > the set of patches to use, I will run it through rcutorture again.
> > > 
> > > Patchlet is against x86-tip/master.today.
> > 
> > So today's (as in Saturday April 29) x86-tip/master with the following
> > patch applied?
> 
> Yeah.

OK, will fire it up once the current set of overnight tests complete.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-30  3:43                         ` Paul E. McKenney
@ 2017-04-30  4:20                           ` Mike Galbraith
  2017-04-30  4:36                             ` Paul E. McKenney
  2017-04-30  5:07                             ` Mike Galbraith
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-30  4:20 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sat, 2017-04-29 at 20:43 -0700, Paul E. McKenney wrote:
> On Sun, Apr 30, 2017 at 03:21:58AM +0200, Mike Galbraith wrote:
> > On Sat, 2017-04-29 at 14:45 -0700, Paul E. McKenney wrote:
> > > On Sat, Apr 29, 2017 at 08:20:33PM +0200, Mike Galbraith wrote:
> > > > On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:
> > > > 
> > > > > If someone will either repost a fresh series or point me at exactly
> > > > > the set of patches to use, I will run it through rcutorture again.
> > > > 
> > > > Patchlet is against x86-tip/master.today.
> > > 
> > > So today's (as in Saturday April 29) x86-tip/master with the following
> > > patch applied?
> > 
> > Yeah.
> 
> OK, will fire it up once the current set of overnight tests complete.

I certainly don't want to discourage you from beating hell outta tip,
just want to make sure you know that I'm seeing zero RCU woes, only
late timer expiry (sharpening rocks/sticks to focus trace).

	-Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-30  4:20                           ` Mike Galbraith
@ 2017-04-30  4:36                             ` Paul E. McKenney
  2017-04-30 22:41                               ` Paul E. McKenney
  2017-04-30  5:07                             ` Mike Galbraith
  1 sibling, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2017-04-30  4:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sun, Apr 30, 2017 at 06:20:15AM +0200, Mike Galbraith wrote:
> On Sat, 2017-04-29 at 20:43 -0700, Paul E. McKenney wrote:
> > On Sun, Apr 30, 2017 at 03:21:58AM +0200, Mike Galbraith wrote:
> > > On Sat, 2017-04-29 at 14:45 -0700, Paul E. McKenney wrote:
> > > > On Sat, Apr 29, 2017 at 08:20:33PM +0200, Mike Galbraith wrote:
> > > > > On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > > If someone will either repost a fresh series or point me at exactly
> > > > > > the set of patches to use, I will run it through rcutorture again.
> > > > > 
> > > > > Patchlet is against x86-tip/master.today.
> > > > 
> > > > So today's (as in Saturday April 29) x86-tip/master with the following
> > > > patch applied?
> > > 
> > > Yeah.
> > 
> > OK, will fire it up once the current set of overnight tests complete.
> 
> I certainly don't want to discourage you from beating hell outta tip,
> just want to make sure you know that I'm seeing zero RCU woes, only
> late timer expiry (sharpening rocks/sticks to focus trace).

I got timer_migration splats from an earlier rcutorture run.  Please see
message-ID <20170421192853.GD3956@linux.vnet.ibm.com> on LKML on April
21st in reply to Thomas's V2 00/10 cover letter.  So I am curious to
learn if your patches fix them.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-30  4:20                           ` Mike Galbraith
  2017-04-30  4:36                             ` Paul E. McKenney
@ 2017-04-30  5:07                             ` Mike Galbraith
  1 sibling, 0 replies; 26+ messages in thread
From: Mike Galbraith @ 2017-04-30  5:07 UTC (permalink / raw)
  To: paulmck
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sun, 2017-04-30 at 06:20 +0200, Mike Galbraith wrote:
> On Sat, 2017-04-29 at 20:43 -0700, Paul E. McKenney wrote:
> > On Sun, Apr 30, 2017 at 03:21:58AM +0200, Mike Galbraith wrote:
> > > On Sat, 2017-04-29 at 14:45 -0700, Paul E. McKenney wrote:
> > > > On Sat, Apr 29, 2017 at 08:20:33PM +0200, Mike Galbraith wrote:
> > > > > On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > > If someone will either repost a fresh series or point me at
> > > > > > exactly
> > > > > > the set of patches to use, I will run it through rcutorture
> > > > > > again.
> > > > > 
> > > > > Patchlet is against x86-tip/master.today.
> > > > 
> > > > So today's (as in Saturday April 29) x86-tip/master with the
> > > > following
> > > > patch applied?
> > > 
> > > Yeah.
> > 
> > OK, will fire it up once the current set of overnight tests
> > complete.
> 
> I certainly don't want to discourage you from beating hell outta tip,
> just want to make sure you know that I'm seeing zero RCU woes, only
> late timer expiry (sharpening rocks/sticks to focus trace).

Ah, seems a cpu shutting down the tick can race with add_timer_on(),
leaving the timer stranded until some other event kicks the cpu awake.

          <idle>-0     [025] d..4    92.087954: tmigr_group_set_cpu_active: group=ffff88017d03d000 lvl=1 numa=-1 active=1 migrator=25 num_childs=8 parent=          (null) nextevt=125916000000 evtcpu=2
          <idle>-0     [025] d..4    92.087956: tmigr_group_removeevt: group=ffff88017d03d000 lvl=1 numa=-1 active=1 migrator=25 num_childs=8 parent=          (null) nextevt=125916000000 evtcpu=2
          <idle>-0     [025] d..3    92.087956: tmigr_group_set_cpu_active: group=ffff880179cd1000 lvl=0 numa=3 active=1 migrator=25 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [025] d..3    92.087957: tmigr_group_removeevt: group=ffff880179cd1000 lvl=0 numa=3 active=1 migrator=25 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [025] d.h3    92.087959: hrtimer_cancel: hrtimer=ffff880277455d20
          <idle>-0     [025] d.h2    92.087959: hrtimer_expire_entry: hrtimer=ffff880277455d20 function=tick_sched_timer now=91100168287
          <idle>-0     [025] d.h2    92.087964: hrtimer_expire_exit: hrtimer=ffff880277455d20
          <idle>-0     [025] d.s3    92.087968: timer_cancel: timer=ffffffff820dd640
          <idle>-0     [025] ..s2    92.087969: timer_expire_entry: timer=ffffffff820dd640 function=clocksource_watchdog now=4294915072
          <idle>-0     [025] d.s4    92.087971: timer_start: timer=ffffffff820dd640 function=clocksource_watchdog expires=4294915197 [timeout=125] cpu=26 idx=218 flags=
          <idle>-0     [025] d.s3    92.087974: wake_up_idle_cpu: NO KICK 26 - !set_nr_and_not_polling(rq->idle)
          <idle>-0     [025] ..s2    92.087974: timer_expire_exit: timer=ffffffff820dd640
          <idle>-0     [025] .Ns2    92.087982: tmigr_handle_remote: group=ffff880179cd1000 lvl=0 numa=3 active=1 migrator=25 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27 cpu=25
          <idle>-0     [025] .Ns2    92.087982: tmigr_handle_remote: group=ffff88017d03d000 lvl=1 numa=-1 active=1 migrator=25 num_childs=8 parent=          (null) nextevt=125916000000 evtcpu=2 cpu=25
          <idle>-0     [025] dN.3    92.087995: hrtimer_start: hrtimer=ffff880277455d20 function=tick_sched_timer expires=91104000000 softexpires=91104000000
          <idle>-0     [026] dN.3    92.088009: tmigr_group_set_cpu_active: group=ffff880179cd1000 lvl=0 numa=3 active=2 migrator=25 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [026] dN.3    92.088011: tmigr_group_removeevt: group=ffff880179cd1000 lvl=0 numa=3 active=2 migrator=25 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [025] d..4    92.088016: tmigr_group_addevt: group=ffff880179cd1000 lvl=0 numa=3 active=1 migrator=25 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [026] dN.3    92.088016: hrtimer_start: hrtimer=ffff880277495d20 function=tick_sched_timer expires=91104000000 softexpires=91104000000
          <idle>-0     [025] d..5    92.088017: tmigr_group_set_cpu_inactive: group=ffff88017d03d000 lvl=1 numa=-1 active=1 migrator=26 num_childs=8 parent=          (null) nextevt=125916000000 evtcpu=2
          <idle>-0     [025] d..4    92.088017: tmigr_group_set_cpu_inactive: group=ffff880179cd1000 lvl=0 numa=3 active=1 migrator=26 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [025] d..2    92.088018: tick_stop: success=1 dependency=NONE
          <idle>-0     [025] d..3    92.088018: hrtimer_cancel: hrtimer=ffff880277455d20
          <idle>-0     [025] d..3    92.088020: hrtimer_start: hrtimer=ffff880277455d20 function=tick_sched_timer expires=125916000000 softexpires=125916000000
          <idle>-0     [026] d..4    92.088022: tmigr_group_addevt: group=ffff880179cd1000 lvl=0 numa=3 active=0 migrator=26 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [026] d..5    92.088023: tmigr_group_addevt: group=ffff88017d03d000 lvl=1 numa=-1 active=0 migrator=26 num_childs=8 parent=          (null) nextevt=125916000000 evtcpu=2
          <idle>-0     [026] d..5    92.088024: tmigr_group_set_cpu_inactive: group=ffff88017d03d000 lvl=1 numa=-1 active=0 migrator=-1 num_childs=8 parent=          (null) nextevt=125916000000 evtcpu=2
          <idle>-0     [026] d..4    92.088025: tmigr_group_set_cpu_inactive: group=ffff880179cd1000 lvl=0 numa=3 active=0 migrator=-1 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [026] d..2    92.088025: tick_stop: success=1 dependency=NONE
          <idle>-0     [026] d..3    92.088026: hrtimer_cancel: hrtimer=ffff880277495d20
          <idle>-0     [026] d..3    92.088027: hrtimer_start: hrtimer=ffff880277495d20 function=tick_sched_timer expires=93148000000 softexpires=93148000000
          <idle>-0     [026] d..4    94.135877: tmigr_group_set_cpu_active: group=ffff88017d03d000 lvl=1 numa=-1 active=2 migrator=1 num_childs=8 parent=          (null) nextevt=316380000000 evtcpu=56
          <idle>-0     [026] d..4    94.135879: tmigr_group_removeevt: group=ffff88017d03d000 lvl=1 numa=-1 active=2 migrator=1 num_childs=8 parent=          (null) nextevt=316380000000 evtcpu=56
          <idle>-0     [026] d..3    94.135879: tmigr_group_set_cpu_active: group=ffff880179cd1000 lvl=0 numa=3 active=1 migrator=26 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [026] d..3    94.135880: tmigr_group_removeevt: group=ffff880179cd1000 lvl=0 numa=3 active=1 migrator=26 num_childs=8 parent=ffff88017d03d000 nextevt=316380000000 evtcpu=27
          <idle>-0     [026] d.h3    94.135881: hrtimer_cancel: hrtimer=ffff880277495d20
          <idle>-0     [026] d.h2    94.135882: hrtimer_expire_entry: hrtimer=ffff880277495d20 function=tick_sched_timer now=93148247903
          <idle>-0     [026] d.h2    94.136137: hrtimer_expire_exit: hrtimer=ffff880277495d20
          <idle>-0     [026] d.s3    94.136141: timer_cancel: timer=ffffffff820dd640
          <idle>-0     [026] ..s2    94.136141: timer_expire_entry: timer=ffffffff820dd640 function=clocksource_watchdog now=4294915584
          <idle>-0     [026] d.s4    94.136144: timer_start: timer=ffffffff820dd640 function=clocksource_watchdog expires=4294915709 [timeout=125] cpu=27 idx=219 flags=
          <idle>-0     [026] d.s3    94.136146: wake_up_idle_cpu: KICK 27
          <idle>-0     [026] ..s1    94.136148: clocksource_watchdog: LATE by 387 ticks

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-30  4:36                             ` Paul E. McKenney
@ 2017-04-30 22:41                               ` Paul E. McKenney
  2017-05-01  7:54                                 ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2017-04-30 22:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: LKML, Ingo Molnar, Ingo Molnar, Thomas Gleixner, PeterZijlstra,
	Frederic Weisbecker

On Sat, Apr 29, 2017 at 09:36:37PM -0700, Paul E. McKenney wrote:
> On Sun, Apr 30, 2017 at 06:20:15AM +0200, Mike Galbraith wrote:
> > On Sat, 2017-04-29 at 20:43 -0700, Paul E. McKenney wrote:
> > > On Sun, Apr 30, 2017 at 03:21:58AM +0200, Mike Galbraith wrote:
> > > > On Sat, 2017-04-29 at 14:45 -0700, Paul E. McKenney wrote:
> > > > > On Sat, Apr 29, 2017 at 08:20:33PM +0200, Mike Galbraith wrote:
> > > > > > On Sat, 2017-04-29 at 11:06 -0700, Paul E. McKenney wrote:
> > > > > > 
> > > > > > > If someone will either repost a fresh series or point me at exactly
> > > > > > > the set of patches to use, I will run it through rcutorture again.
> > > > > > 
> > > > > > Patchlet is against x86-tip/master.today.
> > > > > 
> > > > > So today's (as in Saturday April 29) x86-tip/master with the following
> > > > > patch applied?
> > > > 
> > > > Yeah.
> > > 
> > > OK, will fire it up once the current set of overnight tests complete.
> > 
> > I certainly don't want to discourage you from beating hell outta tip,
> > just want to make sure you know that I'm seeing zero RCU woes, only
> > late timer expiry (sharpening rocks/sticks to focus trace).
> 
> I got timer_migration splats from an earlier rcutorture run.  Please see
> message-ID <20170421192853.GD3956@linux.vnet.ibm.com> on LKML on April
> 21st in reply to Thomas's V2 00/10 cover letter.  So I am curious to
> learn if your patches fix them.

And sadly, the splats are still there.  Please see the following for
the relevant console output and .config files:

http://www2.rdrop.com/users/paulmck/submission/TREE04.2017.04.30a.config
http://www2.rdrop.com/users/paulmck/submission/TREE04.2017.04.30a.console.log
http://www2.rdrop.com/users/paulmck/submission/TREE04.3.2017.04.30a.console.log

http://www2.rdrop.com/users/paulmck/submission/TREE07.2017.04.30a.config
http://www2.rdrop.com/users/paulmck/submission/TREE07.2.2017.04.30a.bzImage
http://www2.rdrop.com/users/paulmck/submission/TREE07.2017.04.30a.console.log
http://www2.rdrop.com/users/paulmck/submission/TREE07.2.2017.04.30a.console.log

Please let me know if you have any trouble accessing these.

Here is the first splat from the first TREE04 run:

[    3.310642] WARNING: CPU: 1 PID: 0 at /home/paulmck/public_git/timer-tip/kernel/time/timer_migration.c:387 tmigr_set_cpu_active+0xc6/0xe0
[    3.313210] Modules linked in:
[    3.313861] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.11.0-rc8+ #1
[    3.315196] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[    3.317196] task: ffff8c1fde9f0000 task.stack: ffff8e4fc0124000
[    3.318433] RIP: 0010:tmigr_set_cpu_active+0xc6/0xe0
[    3.319464] RSP: 0000:ffff8e4fc0127e90 EFLAGS: 00010046
[    3.320598] RAX: 0000000000000004 RBX: 0000000000000001 RCX: 000000000000001f
[    3.322146] RDX: 0000000000000001 RSI: ffff8c1fdfc54cc8 RDI: ffff8c1fdeb26f80
[    3.323652] RBP: ffff8e4fc0127ea8 R08: 0000000000000000 R09: 0000000000000008
[    3.325237] R10: ffff8e4fc0127e80 R11: 0000000000000400 R12: ffff8c1fdeb26f80
[    3.326699] R13: ffff8c1fdfc54cc8 R14: 0000000000000000 R15: 0000000000000000
[    3.328149] FS:  0000000000000000(0000) GS:ffff8c1fdfc40000(0000) knlGS:0000000000000000
[    3.329845] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.331078] CR2: ffff8e4fc02f0000 CR3: 0000000015e0a000 CR4: 00000000000006e0
[    3.332509] Call Trace:
[    3.333107]  tmigr_cpu_activate+0x36/0x40
[    3.333972]  tick_nohz_idle_exit+0xd1/0xf0
[    3.334845]  do_idle+0x113/0x170
[    3.335501]  cpu_startup_entry+0x18/0x20
[    3.336338]  start_secondary+0xe8/0xf0
[    3.337147]  secondary_startup_64+0x9f/0x9f
[    3.337998] Code: d0 48 8b 03 48 85 c0 75 eb eb a0 49 8b 7c 24 50 41 89 5c 24 08 48 85 ff 74 8c 49 8d 74 24 20 89 da e8 3f ff ff ff e9 7b ff ff ff <0f> ff 41 c6 04 24 00 5b 41 5c 41 5d 5d c3 66 90 66 2e 0f 1f 8

This is the first WARN_ON() in tmigr_set_cpu_active().  I got four splats
in 12 hours of running the rcutorture TREE04 test scenario, that is, three
runs of four hours each.

The TREE07 runs fared worse, with many more splats, starting with a
page fault.  The scripting claimed a hang, but that looks to have instead
been so many splats that the test failed to terminate itself in time.
I ran two TREE07 runs of four hours each.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-04-30 22:41                               ` Paul E. McKenney
@ 2017-05-01  7:54                                 ` Thomas Gleixner
  2017-05-01 19:33                                   ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2017-05-01  7:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mike Galbraith, LKML, Ingo Molnar, Ingo Molnar, PeterZijlstra,
	Frederic Weisbecker

On Sun, 30 Apr 2017, Paul E. McKenney wrote:
> 
> And sadly, the splats are still there.  Please see the following for
> the relevant console output and .config files:

Right. We are working on fixing the issues. I'll zap it from tip for now.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] timer: Fix timers_update_migration(), and call it in tmigr_init()
  2017-05-01  7:54                                 ` Thomas Gleixner
@ 2017-05-01 19:33                                   ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2017-05-01 19:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, LKML, Ingo Molnar, Ingo Molnar, PeterZijlstra,
	Frederic Weisbecker

On Mon, May 01, 2017 at 09:54:10AM +0200, Thomas Gleixner wrote:
> On Sun, 30 Apr 2017, Paul E. McKenney wrote:
> > 
> > And sadly, the splats are still there.  Please see the following for
> > the relevant console output and .config files:
> 
> Right. We are working on fixing the issues. I'll zap it from tip for now.

In case it helps...

I looked over the rest of the rcutorture scenarios, and it looks like
the problems occur for NO_HZ_FULL and more than two CPUs (though larger
numbers of CPUs might well just be increasing the probability of this
happening).  The TEST07 scenario (which had the segfaults) also had CPU
hotplug enabled, but !NO_HZ_FULL scenarios with CPU hotplug enabled did
just fine (or at least had a much lower probability of occurrence).

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-05-01 19:33 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-26  8:02 x86-tip tsc/tick gripage Mike Galbraith
2017-04-26  8:16 ` Mike Galbraith
2017-04-26  8:21   ` Ingo Molnar
2017-04-26  8:31     ` Mike Galbraith
2017-04-26  8:57       ` Mike Galbraith
2017-04-26  9:18         ` Mike Galbraith
2017-04-26 10:26         ` Peter Zijlstra
2017-04-26 11:39           ` Mike Galbraith
2017-04-26 12:30             ` Mike Galbraith
2017-04-26 18:13               ` Mike Galbraith
2017-04-27  5:26               ` Mike Galbraith
2017-04-29 16:06               ` [patch] timer: Fix timers_update_migration(), and call it in tmigr_init() Mike Galbraith
2017-04-29 18:06                 ` Paul E. McKenney
2017-04-29 18:20                   ` Mike Galbraith
2017-04-29 21:45                     ` Paul E. McKenney
2017-04-30  1:21                       ` Mike Galbraith
2017-04-30  3:43                         ` Paul E. McKenney
2017-04-30  4:20                           ` Mike Galbraith
2017-04-30  4:36                             ` Paul E. McKenney
2017-04-30 22:41                               ` Paul E. McKenney
2017-05-01  7:54                                 ` Thomas Gleixner
2017-05-01 19:33                                   ` Paul E. McKenney
2017-04-30  5:07                             ` Mike Galbraith
2017-04-28  7:35             ` [patch] Re: x86-tip tsc/tick gripage Mike Galbraith
2017-04-28  8:45               ` Mike Galbraith
2017-04-28 14:17                 ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).