* 4.10 sparc64 regression: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:153]
From: Meelis Roos @ 2017-03-10 17:42 UTC
  To: sparclinux

I am seeing the following soft lockup on multiple Ultrasparc II era 
sparc machines - Ultra 2, Netra X1, Blade 100 at least. 4.9 was fine. 
Will attempt to bisect some time.

The lockups are detected routinely (every 2-3 files when compiling new 
kernel) but the machine keeps running.

CC added because of NMI changes - no idea if it is related.

I have THP enabled and always on in most machines' kernel config.
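Note: the THP mode actually in effect on a running machine can differ from the build-time default. A minimal sketch for checking it, assuming the standard sysfs file /sys/kernel/mm/transparent_hugepage/enabled is present; the helper names parse_thp_mode and current_thp_mode are hypothetical and not part of the original report:

```python
from pathlib import Path

# Standard sysfs location of the THP mode selector on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the active mode, i.e. the bracketed token in a line
    such as '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED) -> str:
    """Read the sysfs file and return the active THP mode."""
    return parse_thp_mode(path.read_text())


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print(current_thp_mode())
    else:
        print("THP sysfs file not present on this system")
```

A result of "always" would match the configuration described above.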

[503216.376072] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:153]
[503216.470787] Modules linked in: ipv6 loop snd_sun_cs4231 snd_pcm snd_timer sr_mod sg evdev cdrom snd soundcore sunhme parport_sunbpp parport flash
[503216.636654] CPU: 0 PID: 153 Comm: khugepaged Tainted: G             L  4.10.0-11624-g0710f3f #162
[503216.747322] task: fffff8006c57e180 task.stack: fffff8006c5e0000
[503216.822501] TSTATE: 0000004480001607 TPC: 00000000007dfa68 TNPC: 00000000007dfa6c Y: 01ceecf2    Tainted: G             L 
[503216.959282] TPC: <_raw_spin_unlock_irqrestore+0x8/0x20>
[503217.026217] g0: 00000000007dfa68 g1: 0000400000000000 g2: 0000000000008000 g3: 000000006c014000
[503217.134716] g4: fffff8006c57e180 g5: fffff8006eef8000 g6: fffff8006c5e0000 g7: 0000000000000016
[503217.243359] o0: fffff8006cd2d6c0 o1: 0000000000000000 o2: 0000000000000000 o3: 0000000000000000
[503217.351857] o4: 000000007143c000 o5: fffff8006cd2d400 sp: fffff8006c5e3191 ret_pc: 0000000000442efc
[503217.464434] RPC: <tlb_batch_add_one+0x5c/0x100>
[503217.522754] l0: fffff8006eef8000 l1: 0000000000000002 l2: 00000000009160c0 l3: 000000000000000e
[503217.631244] l4: 0000000000000001 l5: 0000000000000001 l6: fffff8006da38000 l7: 00000000000003ef
[503217.739733] i0: fffff8006cd2d400 i1: 000000007143c000 i2: 0000000000000000 i3: 0000000000000000
[503217.848185] i4: fffff8006f8006e0 i5: 00000000009086e0 i6: fffff8006c5e3241 i7: 0000000000443230
[503217.956690] I7: <set_pmd_at+0x130/0x1a0>
[503218.007747] Call Trace:
[503218.040947]  [0000000000443230] set_pmd_at+0x130/0x1a0
[503218.106503]  [00000000005060f4] pmdp_collapse_flush+0x14/0x40
[503218.179404]  [0000000000529f38] khugepaged_scan_pmd+0x3d8/0xac0
[503218.254370]  [000000000052acd0] khugepaged+0x6b0/0xa80
[503218.319827]  [0000000000470ef0] kthread+0xd0/0x120
[503218.381186]  [0000000000406044] ret_from_fork+0x1c/0x2c
[503218.447700]  [0000000000000000]           (null)
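
Note: when comparing many such reports across machines, it can help to reduce each trace to its list of symbols. A rough sketch, assuming the "[timestamp]  [address] symbol+0xoff/0xsize" frame layout seen in the trace above; FRAME_RE and parse_frames are illustrative names, not kernel tooling:

```python
import re

# One call-trace frame: "[ts]  [16-hex-digit address] symbol+0xoff/0xsize".
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[(?P<addr>[0-9a-f]{16})\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(log: str):
    """Return (symbol, offset, size) tuples for every frame line in log.
    Lines without a symbol, such as the trailing '(null)' entry, are skipped."""
    frames = []
    for line in log.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((m["sym"], int(m["off"], 16), int(m["size"], 16)))
    return frames


SAMPLE = """\
[503218.007747] Call Trace:
[503218.040947]  [0000000000443230] set_pmd_at+0x130/0x1a0
[503218.106503]  [00000000005060f4] pmdp_collapse_flush+0x14/0x40
[503218.179404]  [0000000000529f38] khugepaged_scan_pmd+0x3d8/0xac0
[503218.254370]  [000000000052acd0] khugepaged+0x6b0/0xa80
[503218.319827]  [0000000000470ef0] kthread+0xd0/0x120
[503218.381186]  [0000000000406044] ret_from_fork+0x1c/0x2c
[503218.447700]  [0000000000000000]           (null)
"""

if __name__ == "__main__":
    for sym, off, size in parse_frames(SAMPLE):
        print(f"{sym} (+{off:#x} of {size:#x})")
```

Two reports whose symbol lists agree from set_pmd_at down to khugepaged are likely the same lockup.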


-- 
Meelis Roos (mroos@linux.ee)


* Re: 4.10 sparc64 regression: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:153]
From: Meelis Roos @ 2017-03-10 17:46 UTC
  To: sparclinux

> I am seeing the following soft lockup on multiple Ultrasparc II era 
> sparc machines - Ultra 2, Netra X1, Blade 100 at least. 4.9 was fine. 
> Will attempt to bisect some time.
> 
> The lockups are detected routinely (every 2-3 files when compiling new 
> kernel) but the machine keeps running.
> 
> CC added because of NMI changes - no idea if it is related.
> 
> I have THP enabled and always on in most machines' kernel config.
> 
> [503216.376072] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:153]

Occasionally, I also get this on the U2:

[500680.792223] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 39s!
[500680.891127] Showing busy workqueues and worker pools:
[500680.955396] workqueue events: flags=0x0
[500681.005072]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[500681.080732]     pending: vmstat_shepherd
[500681.131337] workqueue events_freezable_power_: flags=0x84
[500681.199644]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[500681.275233]     pending: disk_events_workfn
[500681.328847] workqueue kblockd: flags=0x18
[500681.380364]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
[500681.457924]     pending: blk_mq_timeout_work, blk_mq_timeout_work
[500681.534525] workqueue vmstat: flags=0xc
[500681.583956]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[500681.659357]     pending: vmstat_update
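
Note: the dump above lists which work items are stuck behind the stalled pool. A small sketch that groups the pending items by workqueue, assuming the "workqueue NAME: flags=" and "pending: a, b" line shapes shown in this log; pending_by_workqueue is a hypothetical helper name:

```python
import re
from collections import defaultdict

WQ_RE = re.compile(r"\]\s+workqueue (?P<name>\S+): flags=")
PENDING_RE = re.compile(r"\]\s+pending: (?P<items>.+)$")


def pending_by_workqueue(log: str) -> dict:
    """Map each workqueue name to the list of its pending work items."""
    result = defaultdict(list)
    current = None
    for line in log.splitlines():
        wq = WQ_RE.search(line)
        if wq:
            current = wq["name"]  # following "pending:" lines belong to this queue
            continue
        pending = PENDING_RE.search(line)
        if pending and current is not None:
            result[current].extend(
                item.strip() for item in pending["items"].split(",")
            )
    return dict(result)


SAMPLE = """\
[500680.955396] workqueue events: flags=0x0
[500681.005072]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[500681.080732]     pending: vmstat_shepherd
[500681.328847] workqueue kblockd: flags=0x18
[500681.380364]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
[500681.457924]     pending: blk_mq_timeout_work, blk_mq_timeout_work
"""

if __name__ == "__main__":
    for name, items in pending_by_workqueue(SAMPLE).items():
        print(name, items)
```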


-- 
Meelis Roos (mroos@linux.ee)


* Re: 4.10 sparc64 regression: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:153]
From: Meelis Roos @ 2017-03-10 19:54 UTC
  To: sparclinux

> I am seeing the following soft lockup on multiple Ultrasparc II era 
> sparc machines - Ultra 2, Netra X1, Blade 100 at least. 4.9 was fine. 
> Will attempt to bisect some time.

Also seeing both warnings on T2000, so not specific to UltraSparc II.

-- 
Meelis Roos (mroos@linux.ee)


* Re: 4.10 sparc64 regression: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:153]
From: Meelis Roos @ 2017-04-09 12:09 UTC
  To: sparclinux

> > I am seeing the following soft lockup on multiple Ultrasparc II era 
> > sparc machines - Ultra 2, Netra X1, Blade 100 at least. 4.9 was fine. 
> > Will attempt to bisect some time.
> 
> Also seeing both warnings on T2000, so not specific to UltraSparc II.

Yesterday evening's git has fixed that - at first glance, all affected 
servers seem OK.

I do not know when it was fixed - did not have time to test anything 
more since I found the problem.

-- 
Meelis Roos (mroos@linux.ee)


* Re: 4.10 sparc64 regression: BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:153]
From: David Miller @ 2017-04-12  2:14 UTC
  To: sparclinux

From: Meelis Roos <mroos@linux.ee>
Date: Sun, 9 Apr 2017 15:09:07 +0300 (EEST)

>> > I am seeing the following soft lockup on multiple Ultrasparc II era 
>> > sparc machines - Ultra 2, Netra X1, Blade 100 at least. 4.9 was fine. 
>> > Will attempt to bisect some time.
>> 
>> Also seeing both warnings on T2000, so not specific to UltraSparc II.
> 
> Yesterday evening's git has fixed that - at first glance, all affected 
> servers seem OK.
> 
> I do not know when it was fixed - did not have time to test anything 
> more since I found the problem.

It was probably the THP bug fix:

commit 76811263b3fa6347699a446cddeb63badf3e6095
Author: Nitin Gupta <nitin.m.gupta@oracle.com>
Date:   Fri Mar 31 15:48:53 2017 -0700

    sparc64: Fix memory corruption when THP is enabled
    
    The memory corruption was happening due to incorrect
    TLB/TSB flushing of hugepages.
    
    Reported-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
