Shaohua,

(really CC'ing Tejun now)

On Mon, Aug 07, 2017 at 09:51:03PM -0700, Shaohua Li wrote:
> On Mon, Aug 07, 2017 at 01:20:25PM +0200, Dominik Brodowski wrote:
> > Neil, Shaohua,
> >
> > following up on David R's bug message: I have observed something similar
> > on v4.12.[345] and v4.13-rc4, but not on v4.11. This is a RAID1 (on bare
> > metal partitions, /dev/sdaX and /dev/sdbY linked together). In case it
> > matters: Further upwards are cryptsetup, a DM volume group, then logical
> > volumes, and then filesystems (ext4, but also happened with xfs).
> >
> > In a tedious bisect (the bug wasn't as quickly reproducible as I would
> > like, but happened when I repeatedly created large lvs and filled them
> > with some content, while compiling kernels in parallel), I was able to
> > track this down to:
> >
> >
> > 	commit 4ad23a976413aa57fe5ba7a25953dc35ccca5b71
> > 	Author: NeilBrown
> > 	Date:   Wed Mar 15 14:05:14 2017 +1100
> >
> > 	    MD: use per-cpu counter for writes_pending
> >
> > 	    The 'writes_pending' counter is used to determine when the
> > 	    array is stable so that it can be marked in the superblock
> > 	    as "Clean". Consequently it needs to be updated frequently
> > 	    but only checked for zero occasionally. Recent changes to
> > 	    raid5 cause the count to be updated even more often - once
> > 	    per 4K rather than once per bio. This provided
> > 	    justification for making the updates more efficient.
> >
> > 	...
> >
> >
> > CC'ing tj@kernel.org, as 4ad23a976413 is the first (and only?) user
> > of percpu_ref_switch_to_atomic_sync() introduced in 210f7cdcf088.
> >
> > Applying a415c0f10627 on top of 4ad23a976413 does *not* fix the issue,
> > but reverting all of a2bfc6753065, a415c0f10627 and 4ad23a976413 seems
> > to fix the issue for v4.12.5.
>
> Spent some time to check this one,

Thanks for taking a look at this issue!

> unfortunately I can't find how that patch makes rcu stall. the percpu part
> looks good to me too. Can you double check if reverting 4ad23a976413aa57
> makes the issue go away?

v4.12.5 with the three patches reverted survived many hours of testing
yesterday. Rebooting into 4ad23a976413aa57 (rebuilt with lockup detection)
got me backtraces for mdX_raid1 after a few minutes. The following test was
done with plain v4.12.5, with the first stall after ~1 minute:

> When the rcu stall happens, what the /sys/block/md/md0/array_state?

First, a "ps axf | grep 'md'" snippet:

  498 ?        S<     0:00  \_ [md]
 1163 ?        D      0:01  \_ [md0_raid1]
 1172 ?        D      0:07  \_ [md1_raid1]
 1182 ?        D      0:00  \_ [md2_raid1]
 1192 ?        R      1:35  \_ [md3_raid1]

(why are md[012]_raid1 continuously in "D" state?)

array_state for md0: active
array_state for md1: active-idle
array_state for md2: active-idle
array_state for md3: clean

> please also attach /proc/mdstat.

Personalities : [raid1]
md3 : active raid1 sda8[0] sdb5[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md2 : active raid1 sda7[0] sdb6[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md1 : active raid1 sda6[0] sdb7[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md0 : active raid1 sda5[0] sdb8[1]
      174981120 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices:

> When you say the mdx_raid1 threads are in 'R' state, can you double check
> if the /proc/pid/stack always 0xffffffffff?
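For context, since every stack below ends up there: the writes_pending
pattern that 4ad23a976413 relies on looks roughly like the following
sketch. This is illustrative only -- the names (writes_pending_demo,
no_writes_left, write_start/write_done, writes_quiesced) are made up and
this is not the actual md code -- but it shows that checking the counter
for zero first requires a synchronous switch of the per-cpu ref into
atomic mode, and that switch may sleep:

#include <linux/percpu-refcount.h>
#include <linux/wait.h>
#include <linux/gfp.h>

static struct percpu_ref writes_pending_demo;		/* hypothetical */
static DECLARE_WAIT_QUEUE_HEAD(writes_waitq);		/* hypothetical */

/* release callback: invoked once the last reference is dropped */
static void no_writes_left(struct percpu_ref *ref)
{
	wake_up(&writes_waitq);
}

static int demo_setup(void)
{
	/* start out in per-cpu (fast) mode */
	return percpu_ref_init(&writes_pending_demo, no_writes_left,
			       0, GFP_KERNEL);
}

/* hot path: one cheap per-cpu get/put pair per write request */
static void write_start(void) { percpu_ref_get(&writes_pending_demo); }
static void write_done(void)  { percpu_ref_put(&writes_pending_demo); }

/*
 * cold path, analogous to what set_in_sync() does: the _sync variant
 * sleeps until the switch to atomic mode has completed (which needs an
 * RCU grace period); only then can the count be tested for zero.
 */
static bool writes_quiesced(void)
{
	bool idle;

	percpu_ref_switch_to_atomic_sync(&writes_pending_demo);
	idle = percpu_ref_is_zero(&writes_pending_demo);
	percpu_ref_switch_to_percpu(&writes_pending_demo);
	return idle;
}

The trade-off is what the commit message describes: get/put stay cheap
per-cpu operations, while the occasional "is it zero?" check pays for an
RCU-synchronized mode switch. That synchronous switch is where three of
the four mdX_raid1 threads below turn out to be stuck. Here are the stacks: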
root@i3test ~ # cat /proc/1192/stack	# md3_raid1
[] 0xffffffffffffffff

root@i3test ~ # cat /proc/1182/stack
[] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[] set_in_sync+0x68/0xe0
[] md_check_recovery+0x284/0x4c0
[] raid1d+0x4c/0x910
[] md_thread+0x12d/0x160
[] kthread+0x131/0x170
[] ret_from_fork+0x27/0x40
[] 0xffffffffffffffff

root@i3test ~ # cat /proc/1172/stack
[] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[] set_in_sync+0x68/0xe0
[] md_check_recovery+0x284/0x4c0
[] raid1d+0x4c/0x910
[] md_thread+0x12d/0x160
[] kthread+0x131/0x170
[] ret_from_fork+0x27/0x40
[] 0xffffffffffffffff

root@i3test ~ # cat /proc/1163/stack
[] percpu_ref_switch_to_atomic_sync+0x3f/0x80
[] set_in_sync+0x68/0xe0
[] md_check_recovery+0x284/0x4c0
[] raid1d+0x4c/0x910
[] md_thread+0x12d/0x160
[] kthread+0x131/0x170
[] ret_from_fork+0x27/0x40
[] 0xffffffffffffffff

So the other md[012]_raid1 threads are waiting in
percpu_ref_switch_to_atomic_sync(). And md3_raid1 -- according to

  NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [md3_raid1:1192]

-- is stuck somewhere in raid1d:

[   92.564717] task: ffff880234442e80 task.stack: ffffc9000102c000
[   92.564781] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
[   92.564843] RSP: 0018:ffffc9000102fdb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[   92.564913] RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffff8802321a1e00
[   92.564995] RDX: ffffffff815cca08 RSI: 0000000000000001 RDI: ffffffff81793291
[   92.565077] RBP: ffffc9000102fdc0 R08: 0000000000000000 R09: 0000000000000000
[   92.565159] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802321a1e18
[   92.565241] R13: ffff8802321a1e70 R14: ffff880234442e80 R15: 0000000000000000
[   92.565323] FS:  0000000000000000(0000) GS:ffff880236e00000(0000) knlGS:0000000000000000
[   92.565425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   92.565502] CR2: 00000083b78e4e30 CR3: 0000000001c23000 CR4: 00000000000006e0
[   92.565585] Call Trace:
[   92.565657]  raid1d+0x5f8/0x910
[   92.565729]  ? retint_kernel+0x10/0x10
[   92.565802]  md_thread+0x12d/0x160
[   92.565874]  ? md_thread+0x12d/0x160
[   92.565947]  ? __wake_up_common+0x80/0x80
[   92.566022]  kthread+0x131/0x170
[   92.566093]  ? find_pers+0x70/0x70
[   92.566165]  ? __kthread_create_on_node+0x220/0x220
[   92.566241]  ret_from_fork+0x27/0x40
[   92.566312] Code: fc 48 8b 55 08 53 48 8d 7f 18 48 89 f3 be 01 00 00 00 e8 b1 ce 9d ff 4c 89 e7 e8 19 0c 9e ff f6 c7 02 74 13 e8 ff 90 9d ff 53 9d <65> ff 0d 3e 94 87 7e 5b 41 5c 5d c3 53 9d e8 4a 93 9d ff eb eb

Thanks,
	Dominik