All of lore.kernel.org
 help / color / mirror / Atom feed
* raid1 lockups on 4.12.x
@ 2017-07-31 18:36 Simon Kirby
  2017-08-14  7:08 ` Simon Kirby
  0 siblings, 1 reply; 3+ messages in thread
From: Simon Kirby @ 2017-07-31 18:36 UTC (permalink / raw)
  To: linux-raid

Hello,

I recently upgraded two old boxes to 4.12.3 only to find that they
started to lock up several times per day. I upgraded to 4.12.4 and the
problem persisted. Downgrading to 4.11.11 seems to have stopped the
issue.

I captured the following with nmi_watchdog=1 and a serial console while
on 4.12.4.

[ 9672.236351] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md2_raid1:1704]
[ 9672.236351] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack serio_raw evdev e1000
[ 9672.236351] CPU: 0 PID: 1704 Comm: md2_raid1 Not tainted 4.12.4-hw #6
[ 9672.236351] Hardware name: Dell Computer Corporation PowerEdge 750              /0R1479, BIOS A03 10/22/2004
[ 9672.236351] task: f4a1a100 task.stack: f47fa000
[ 9672.236351] EIP: md_check_recovery+0x2d4/0x460 
[ 9672.236351] EFLAGS: 00000246 CPU: 0
[ 9672.236351] EAX: 00000000 EBX: f448ec00 ECX: 00000001 EDX: 00000000
[ 9672.236351] ESI: 00000000 EDI: 00000000 EBP: f47fbe38 ESP: f47fbe28
[ 9672.236351]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 9672.236351] CR0: 80050033 CR2: b77c0000 CR3: 31501000 CR4: 000006d0
[ 9672.236351] Call Trace:
[ 9672.236351]  raid1d+0x27/0x16f0
[ 9672.236351]  ? ktime_get+0x44/0x100
[ 9672.236351]  ? lapic_next_event+0x13/0x20
[ 9672.236351]  ? clockevents_program_event+0x9d/0x150
[ 9672.236351]  ? __hrtimer_run_queues+0x13a/0x210
[ 9672.236351]  ? hrtimer_interrupt+0xb0/0x190
[ 9672.236351]  ? smp_apic_timer_interrupt+0x33/0x40
[ 9672.236351]  ? apic_timer_interrupt+0x32/0x38
[ 9672.236351]  ? perf_trace_thermal_power_cpu_limit+0x68/0x120
[ 9672.344374]  md_thread+0x116/0x140
[ 9672.344374]  ? do_wait_intr+0x70/0x70
[ 9672.344374]  kthread+0xe3/0x110
[ 9672.344374]  ? md_register_thread+0xc0/0xc0
[ 9672.344374]  ? __kthread_init_worker+0x90/0x90
[ 9672.344374]  ret_from_fork+0x19/0x24
[ 9672.344374] Code: 68 01 00 00 ef f0 0f ba b3 68 01 00 00 05 72 47 8b b3 08 01 00 00 85 f6 0f 85 d9 fe ff ff f0 80 a3 68 01 00 00 fe b9 01 00 00 00 <ba> 03 00 00 00 c7 04 24 00 00 00 00 b8 c0 a8 bc c1 e8 86 d9 97

later, there were (looping) backtraces such as:

[34552.236353] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [md0_raid1:1684]
[34552.236353] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack serio_raw evdev e1000
[34552.236353] CPU: 0 PID: 1684 Comm: md0_raid1 Tainted: G             L  4.12.4-hw #6
[34552.236353] Hardware name: Dell Computer Corporation PowerEdge 750              /0R1479, BIOS A03 10/22/2004
[34552.236353] task: f4a95280 task.stack: f52c0000
[34552.236353] EIP: _raw_spin_unlock_irqrestore+0xa/0x10
[34552.236353] EFLAGS: 00000292 CPU: 0
[34552.236353] EAX: 00000292 EBX: f46a3d08 ECX: 00000001 EDX: 00000292
[34552.236353] ESI: 00000292 EDI: 00000003 EBP: f52c1de4 ESP: f52c1de4
[34552.236353]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[34552.236353] CR0: 80050033 CR2: 08062e00 CR3: 347af000 CR4: 000006d0
[34552.236353] Call Trace:
[34552.236353]  __wake_up+0x40/0x50
[34552.236353]  md_wakeup_thread+0x2b/0x30
[34552.236353]  mddev_unlock+0x74/0xc0
[34552.236353]  md_check_recovery+0x1c3/0x460
[34552.324380]  raid1d+0x27/0x16f0
[34552.324380]  ? ktime_get+0x44/0x100
[34552.324380]  ? lapic_next_event+0x13/0x20
[34552.324380]  ? clockevents_program_event+0x9d/0x150
[34552.324380]  ? __hrtimer_run_queues+0x13a/0x210
[34552.344366]  ? smp_apic_timer_interrupt+0x33/0x40
[34552.344366]  ? apic_timer_interrupt+0x32/0x38
[34552.344366]  ? perf_trace_thermal_power_cpu_limit+0xb/0x120
[34552.344366]  md_thread+0x116/0x140
[34552.344366]  ? do_wait_intr+0x70/0x70
[34552.344366]  kthread+0xe3/0x110
[34552.344366]  ? md_register_thread+0xc0/0xc0
[34552.344366]  ? __kthread_init_worker+0x90/0x90
[34552.344366]  ret_from_fork+0x19/0x24
[34552.344366] Code: 00 f0 0f b1 0a 85 c0 75 08 b0 01 5d c3 8d 74 26 00 31 c0 5d c3 8d b6 00 00 00 00 8d bf 00 00 00 00 55 89 e5 c6 00 00 89 d0 50 9d <8d> 74 26 00 5d c3 55 89 c1 89 e5 b8 00 ff ff ff f0 0f c1 01 89
[34567.020353] INFO: rcu_sched self-detected stall on CPU
[34567.020361]  0-...: (104347 ticks this GP) idle=d52/140000000000001/0 softirq=615378/615378 fqs=52025
[34567.020361]   (t=105006 jiffies g=349461 c=349460 q=85)
[34567.020361] NMI backtrace for cpu 0
[34567.020361] CPU: 0 PID: 1684 Comm: md0_raid1 Tainted: G             L  4.12.4-hw #6
[34567.020361] Hardware name: Dell Computer Corporation PowerEdge 750              /0R1479, BIOS A03 10/22/2004
[34567.020361] Call Trace:
[34567.020361]  dump_stack+0x58/0x78
[34567.020361]  nmi_cpu_backtrace+0xa6/0xb0
[34567.020361]  nmi_trigger_cpumask_backtrace+0x92/0xd0
[34567.020361]  ? irq_force_complete_move+0xd0/0xd0
[34567.020361]  arch_trigger_cpumask_backtrace+0x10/0x20
[34567.020361]  rcu_dump_cpu_stacks+0x64/0xa7
[34567.020361]  rcu_check_callbacks+0x6af/0x710
[34567.020361]  ? account_process_tick+0x54/0x130
[34567.020361]  update_process_times+0x2a/0x60   
[34567.020361]  tick_sched_handle.isra.11+0x30/0x60
[34567.020361]  tick_sched_timer+0x3b/0x80
[34567.020361]  __hrtimer_run_queues+0xb0/0x210
[34567.020361]  ? tick_nohz_handler+0xa0/0xa0  
[34567.020361]  hrtimer_interrupt+0x95/0x190   
[34567.020361]  local_apic_timer_interrupt+0x2a/0x50
[34567.020361]  smp_apic_timer_interrupt+0x2e/0x40  
[34567.020361]  apic_timer_interrupt+0x32/0x38
[34567.020361] EIP: _raw_spin_unlock_irqrestore+0xa/0x10
[34567.020361] EFLAGS: 00000292 CPU: 0
[34567.020361] EAX: 00000292 EBX: f448e5f0 ECX: 00000001 EDX: 00000292
[34567.020361] ESI: 00000292 EDI: 00000003 EBP: f52c1e00 ESP: f52c1e00
[34567.020361]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[34567.020361]  __wake_up+0x40/0x50
[34567.020361]  md_check_recovery+0x1bc/0x460
[34567.020361]  raid1d+0x27/0x16f0
[34567.020361]  ? ktime_get+0x44/0x100
[34567.020361]  ? lapic_next_event+0x13/0x20
[34567.020361]  ? clockevents_program_event+0x9d/0x150
[34567.020361]  ? __hrtimer_run_queues+0x13a/0x210
[34567.020361]  ? hrtimer_interrupt+0xb0/0x190
[34567.020361]  ? hrtimer_interrupt+0xb0/0x190
[34567.020361]  ? smp_apic_timer_interrupt+0x33/0x40
[34567.020361]  ? apic_timer_interrupt+0x32/0x38
[34567.020361]  md_thread+0x116/0x140
[34567.020361]  ? do_wait_intr+0x70/0x70
[34567.020361]  kthread+0xe3/0x110
[34567.020361]  ? md_register_thread+0xc0/0xc0
[34567.020361]  ? __kthread_init_worker+0x90/0x90
[34567.020361]  ret_from_fork+0x19/0x24
[34592.236352] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md0_raid1:1684]
[34592.236352] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack serio_raw evdev e1000
[34592.236352] CPU: 0 PID: 1684 Comm: md0_raid1 Tainted: G             L  4.12.4-hw #6
[34592.236352] Hardware name: Dell Computer Corporation PowerEdge 750              /0R1479, BIOS A03 10/22/2004
[34592.236352] task: f4a95280 task.stack: f52c0000
[34592.236352] EIP: _raw_spin_unlock_irqrestore+0xa/0x10
[34592.236352] EFLAGS: 00000292 CPU: 0
[34592.236352] EAX: 00000292 EBX: c1bca8c0 ECX: 00000001 EDX: 00000292
[34592.236352] ESI: 00000292 EDI: 00000003 EBP: f52c1e00 ESP: f52c1e00
[34592.236352]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[34592.236352] CR0: 80050033 CR2: 08062e00 CR3: 347af000 CR4: 000006d0
[34592.236352] Call Trace:
[34592.236352]  __wake_up+0x40/0x50
[34592.236352]  md_check_recovery+0x2ea/0x460
[34592.236352]  raid1d+0x27/0x16f0
[34592.236352]  ? ktime_get+0x44/0x100
[34592.236352]  ? lapic_next_event+0x13/0x20
[34592.236352]  ? clockevents_program_event+0x9d/0x150
[34592.236352]  ? __hrtimer_run_queues+0x13a/0x210
[34592.236352]  ? tick_program_event+0x41/0x80
[34592.236352]  ? hrtimer_interrupt+0xb0/0x190
[34592.344366]  ? irq_exit+0x58/0xb0
[34592.344366]  ? smp_apic_timer_interrupt+0x33/0x40
[34592.344366]  ? apic_timer_interrupt+0x32/0x38
[34592.356384]  ? perf_trace_thermal_power_cpu_limit+0xb/0x120
[34592.356384]  md_thread+0x116/0x140
[34592.356384]  ? do_wait_intr+0x70/0x70
[34592.356384]  kthread+0xe3/0x110
[34592.356384]  ? md_register_thread+0xc0/0xc0
[34592.356384]  ? __kthread_init_worker+0x90/0x90
[34592.356384]  ret_from_fork+0x19/0x24
[34592.356384] Code: 00 f0 0f b1 0a 85 c0 75 08 b0 01 5d c3 8d 74 26 00 31 c0 5d c3 8d b6 00 00 00 00 8d bf 00 00 00 00 55 89 e5 c6 00 00 89 d0 50 9d <8d> 74 26 00 5d c3 55 89 c1 89 e5 b8 00 ff ff ff f0 0f c1 01 89 

Config here: http://0x.ca/sim/ref/4.12.4-32/config

Simon-

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: raid1 lockups on 4.12.x
  2017-07-31 18:36 raid1 lockups on 4.12.x Simon Kirby
@ 2017-08-14  7:08 ` Simon Kirby
  2017-08-14  8:40   ` Jack Wang
  0 siblings, 1 reply; 3+ messages in thread
From: Simon Kirby @ 2017-08-14  7:08 UTC (permalink / raw)
  To: linux-raid

On Mon, Jul 31, 2017 at 11:36:16AM -0700, Simon Kirby wrote:

> Hello,
> 
> I recently upgraded two old boxes to 4.12.3 only to find that they
> started to lock up several times per day. I upgraded to 4.12.4 and the
> problem persisted. Downgrading to 4.11.11 seems to have stopped the
> issue.

I am still seeing this, even on 4.12.6, now also on a home server using
RAID 1, even with the "mddev->in_sync || !mddev->sync_checkers" condition
fixed to drop the '!':

[37795.380559] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [md0_raid1:214]
[37795.380581] Modules linked in: xt_conntrack xt_nat xt_recent xt_state xt_owner xt_REDIRECT nf_nat_redirect ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_nat_ftp iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack mxm_wmi nvidia_drm(O) nvidia_modeset(O) nvidia(O) drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm hid_logitech_hidpp firewire_ohci hid_logitech_dj usb_storage
[37795.380585] irq event stamp: 262
[37795.380585] hardirqs last  enabled at (261): [<ffffffff81a1108a>] _raw_spin_unlock_irqrestore+0x3a/0x60
[37795.380585] hardirqs last disabled at (262): [<ffffffff81a0b18e>] __schedule+0x9e/0x9f0
[37795.380585] softirqs last  enabled at (246): [<ffffffff810cddc6>] __do_softirq+0x366/0x470
[37795.380585] softirqs last disabled at (239): [<ffffffff810ce037>] irq_exit+0x87/0x90
[37795.380585] CPU: 3 PID: 214 Comm: md0_raid1 Tainted: G           O    4.12.6-flick+ #107
[37795.380585] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./990FXA-UD3, BIOS F4x 04/01/2016
[37795.380585] task: ffff8801374fbb40 task.stack: ffff88013770c000
[37795.380585] RIP: 0010:_raw_spin_unlock_irqrestore+0x3c/0x60
[37795.380585] RSP: 0018:ffff88013770fd68 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
[37795.380585] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 0000000000000000
[37795.380585] RDX: ffffffff81108706 RSI: 0000000000000001 RDI: ffffffff81a1108a
[37795.380585] RBP: ffff88013770fd78 R08: 0000000000000000 R09: 0000000000000001
[37795.380585] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880137c7d3c8
[37795.380585] R13: 0000000000000000 R14: ffff8801374fbb40 R15: ffff8801374bba00
[37795.380585] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
[37795.380585] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37795.380585] CR2: 00007f79edc093a0 CR3: 000000012cfe9000 CR4: 00000000000006e0
[37795.380585] Call Trace:
[37795.380585]  __wake_up+0x46/0x60
[37795.380585]  md_check_recovery+0x147/0x4b0
[37795.380585]  raid1d+0x39/0x870
[37795.380585]  ? irq_exit+0xa/0x90
[37795.380585]  ? retint_kernel+0x10/0x10
[37795.380585]  md_thread+0x122/0x150
[37795.380585]  ? wake_bit_function+0x50/0x50
[37795.380585]  kthread+0x120/0x140
[37795.380585]  ? md_register_thread+0xe0/0xe0
[37795.380585]  ? kthread_create_on_node+0x40/0x40
[37795.380585]  ret_from_fork+0x27/0x40
[37795.380585] Code: 89 f3 4c 89 65 f8 be 01 00 00 00 49 89 fc 48 83 c7 18 e8 58 24 70 ff 4c 89 e7 e8 e0 4e 70 ff f6 c7 02 74 11 e8 c6 03 70 ff 53 9d <48> 8b 5d f0 4c 8b 65 f8 c9 c3 53 9d e8 03 dd 6f ff 48 8b 5d f0 

I'll try a build with a415c0f106279 and 4ad23a976413a reverted to see if
doing so stops these lockups. It appears perhaps something to do with the
spinlocks and IRQs...? At one point top was able to see the md0_raid1
thread spinning (typically I/O hangs and then the host needs rebooting).

Simon-

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: raid1 lockups on 4.12.x
  2017-08-14  7:08 ` Simon Kirby
@ 2017-08-14  8:40   ` Jack Wang
  0 siblings, 0 replies; 3+ messages in thread
From: Jack Wang @ 2017-08-14  8:40 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-raid, NeilBrown, Shaohua Li

2017-08-14 9:08 GMT+02:00 Simon Kirby <sim@hostway.ca>:
> On Mon, Jul 31, 2017 at 11:36:16AM -0700, Simon Kirby wrote:
>
>> Hello,
>>
>> I recently upgraded two old boxes to 4.12.3 only to find that they
>> started to lock up several times per day. I upgraded to 4.12.4 and the
>> problem persisted. Downgrading to 4.11.11 seems to have stopped the
>> issue.
>
> I am still seeing this, even on 4.12.6, now also on a home server using
> RAID 1, even with the "mddev->in_sync || !mddev->sync_checkers" condition
> fixed to drop the '!':

Hi,

Neil and Shaohua have 3 fixes in this regards:

https://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git/commit/?h=for-next&id=33182d15c6bf182f7ae32a66ea4a547d979cd6d7
https://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git/commit/?h=for-next&id=81fe48e9aa00bdd509bd3c37a76d1132da6b9f09
https://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git/commit/?h=for-next&id=afc1f55ca44e257f69da8f43e0714a76686ae8d1

Hope this helps!
Jack

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2017-08-14  8:40 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-31 18:36 raid1 lockups on 4.12.x Simon Kirby
2017-08-14  7:08 ` Simon Kirby
2017-08-14  8:40   ` Jack Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.