On 3/15/20 7:43 PM, Waiman Long wrote:
> On 3/15/20 10:20 PM, Guenter Roeck wrote:
>> Hi,
>>
>> On Fri, Feb 07, 2020 at 02:39:29PM -0500, Waiman Long wrote:
>>> The tick_periodic() function is used at the beginning part of the
>>> bootup process for time keeping while the other clock sources are
>>> being initialized.
>>>
>>> The current code assumes that all the timer interrupts are handled in
>>> a timely manner with no missing ticks. That is not actually true. Some
>>> ticks are missed and there are some discrepancies between the tick time
>>> (jiffies) and the timestamp reported in the kernel log. Some systems,
>>> however, are more prone to missing ticks than others. In the extreme
>>> case, the discrepancy can actually cause a soft lockup message to be
>>> printed by the watchdog kthread. For example, on a Cavium ThunderX2
>>> Sabre arm64 system:
>>>
>>> [   25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s!
>>>
>>> On that system, the missing ticks are especially prevalent during the
>>> smp_init() phase of the boot process. With an instrumented kernel,
>>> it was found that it took about 24s as reported by the timestamp for
>>> the tick to accumulate 4s of time.
>>>
>>> Investigation and bisection done by others seemed to point to the
>>> commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or
>>> lack thereof") as the culprit. It could also be a firmware issue as
>>> new firmware was promised that would fix the issue.
>>>
>>> To properly address this problem, we cannot assume that there will
>>> be no missing tick in tick_periodic(). This function is now modified
>>> to follow the example of tick_do_update_jiffies64() by using another
>>> reference clock to check for missing ticks. Since the watchdog timer
>>> uses running_clock(), it is used here as the reference. With this patch
>>> applied, the soft lockup problem in the arm64 system is gone and tick
>>> time tracks much more closely to the timestamp time.
>>>
>>> Signed-off-by: Waiman Long
>>
>> Since this patch is in linux-next, roughly 10% of my x86 and x86_64
>> qemu emulation boots are stalling. Typical log:
>>
>> [    0.002016] smpboot: Total of 1 processors activated (7576.40 BogoMIPS)
>> [    0.002016] devtmpfs: initialized
>> [    0.002016] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
>> [    0.002016] futex hash table entries: 256 (order: 3, 32768 bytes, linear)
>> [    0.002016] xor: measuring software checksum speed
>>
>> another:
>>
>> [    0.002653] Freeing SMP alternatives memory: 44K
>> [    0.002653] smpboot: CPU0: Intel Westmere E56xx/L56xx/X56xx (IBRS update) (family: 0x6, model: 0x2c, stepping: 0x1)
>> [    0.002653] Performance Events: unsupported p6 CPU model 44 no PMU driver, software events only.
>> [    0.002653] rcu: Hierarchical SRCU implementation.
>> [    0.002653] smp: Bringing up secondary CPUs ...
>> [    0.002653] x86: Booting SMP configuration:
>> [    0.002653] .... node #0, CPUs: #1
>> [    0.000000] smpboot: CPU 1 Converting physical 0 to logical die 1
>>
>> ... and then there is silence until the test aborts.
>>
>> This is only (or at least predominantly) seen if the system running
>> the emulation is under load.
>>
>> Reverting this patch fixes the problem.
>
> I was aware that there are some problems with this patch, but it is hard
> to reproduce them. Do you have a more consistent way to reproduce it?
> When you say under load, you mean that the host system is also busy so
> that there are a lot of vcpu preemptions. Right? Could you give me the

Correct. I am able to reproduce the problem quite reliably (i.e. 2-3
boots out of ~25 fail) if I run a kernel compilation in parallel, but
not (or rarely) if the system is otherwise idle.

> x86-64 .config file that you use?
>
Attached. It is pretty much defconfig with various debug and test
options enabled.

Guenter
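
The approach described in the quoted commit message can be sketched
roughly as follows. This is only an illustration, not the actual patch:
CLOCK_MONOTONIC stands in for running_clock(), and TICK_NSEC, the
variable names, and tick_periodic_sketch() are made up for the example.

/*
 * Userspace sketch of the idea in the quoted commit message: instead of
 * blindly doing jiffies++ on every periodic tick, compare a reference
 * clock against the time implied by the tick count and credit however
 * many whole tick periods have actually elapsed, so that missed ticks
 * do not make jiffies fall behind the reference time.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define TICK_NSEC 10000000ULL		/* 10 ms tick period (HZ = 100) */

static uint64_t jiffies;		/* tick counter being maintained */
static uint64_t last_ref_ns;		/* reference time of last update */

static uint64_t ref_clock_ns(void)	/* stand-in for running_clock() */
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Called from the periodic "tick"; may run late or be skipped entirely. */
static void tick_periodic_sketch(void)
{
	uint64_t now = ref_clock_ns();
	uint64_t ticks = (now - last_ref_ns) / TICK_NSEC;

	if (!ticks)
		return;			/* less than one full period elapsed */

	jiffies += ticks;		/* credit every elapsed tick, not just one */
	last_ref_ns += ticks * TICK_NSEC;
}

int main(void)
{
	last_ref_ns = ref_clock_ns();

	for (int i = 0; i < 5; i++) {
		/* Sleep ~3.5 tick periods to emulate missed ticks. */
		usleep(35000);
		tick_periodic_sketch();
		printf("iteration %d: jiffies=%llu\n", i,
		       (unsigned long long)jiffies);
	}
	return 0;
}

Advancing last_ref_ns by whole tick periods rather than snapping it to
"now" keeps the sub-tick remainder, so rounding error does not accumulate
across updates; this is the same kind of catch-up bookkeeping the commit
message points to in tick_do_update_jiffies64().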