* raid10 resync hangs in 4.2.6, 4.3
@ 2015-11-20 18:14 Andre Tomt
2015-11-20 20:30 ` John Stoffel
2015-11-23 9:20 ` Artur Paszkiewicz
0 siblings, 2 replies; 7+ messages in thread
From: Andre Tomt @ 2015-11-20 18:14 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 5582 bytes --]
[resend with compressed attachments, first may have gotten eaten by a grue]
Hi
I'm seeing hangs with RAID10 resyncs on my system. RAID5/6 recovery on
the same drive set works without any problems, and BTRFS RAID6 is
problem-free on a different set of (very busy) drives on the same
controllers.
It happens shortly after array creation, anywhere from seconds to a
couple of minutes in.
wchan for the md0_resync kernel thread shows it sitting in
raise_barrier() forever, while md0_raid10 keeps a CPU core 100% busy
(but shows no wchan), and no resyncing or I/O to the array gets done
anymore.
After a short while the kernel starts spitting out "rcu_sched
self-detected stall on CPU" warnings, and other RCU use starts getting
iffy (I think).
I/O directly to the RAID member disks (below the md layer, e.g.
/dev/sdX directly) continues to work after the hang, and there are no
I/O errors in the kernel log.
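For reference, the per-thread state above comes straight out of /proc; a small helper along these lines collects the same info for any PID (a sketch — the `mdstate` name is made up, and /proc/&lt;pid&gt;/stack is only readable as root):

```shell
# mdstate: dump the per-task state referenced above -- command name,
# wchan (the kernel function a sleeping task is waiting in), and the
# kernel stack when readable.
mdstate() {
    pid=$1
    printf 'comm:  %s\n' "$(cat "/proc/$pid/comm")"
    # A running (spinning) task shows wchan "0"; a stuck sleeper shows
    # the function it is blocked in, e.g. raise_barrier
    printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null || echo 'n/a')"
    # /proc/<pid>/stack needs root (and CONFIG_STACKTRACE)
    if [ -r "/proc/$pid/stack" ]; then
        echo 'stack:'
        cat "/proc/$pid/stack"
    fi
}

# Example: inspect the current shell
mdstate $$
```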
The array is an 8-drive array spread over 3 HBAs, created with:
mdadm --create /dev/md0 --level=10 --chunk=128 --bitmap=none
--raid-devices=8 /dev/sda1 /dev/sdg1 /dev/sdl1 /dev/sdm1 /dev/sdi1
/dev/sdj1 /dev/sdp1 /dev/sds1
The HBAs are LSI SAS2008 in IT mode (mpt2sas driver), with oldish 2TB
SATA drives. Dual-socket Xeon E5 v3 system with both sockets populated.
This happens on at least 4.2.6 and 4.3; I'm going to test some earlier
kernels.
Attached is some more info.
md0_resync stack:
> root@mental:~# cat /proc/1663/stack
> [<ffffffffc042dd02>] raise_barrier+0x11b/0x14d [raid10]
> [<ffffffffc0432830>] sync_request+0x193/0x14fc [raid10]
> [<ffffffffc0559ac6>] md_do_sync+0x7d2/0xd78 [md_mod]
> [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
> [<ffffffff9d061db2>] kthread+0xcd/0xd5
> [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
> [<ffffffffffffffff>] 0xffffffffffffffff
md0_raid10 stack:
> root@mental:~# cat /proc/1662/stack
> [<ffffffffffffffff>] 0xffffffffffffffff
cat stack trying to read /dev/md0 after hang:
> root@mental:~# cat /proc/1737/stack
> [<ffffffffc042de2b>] wait_barrier+0xd8/0x118 [raid10]
> [<ffffffffc042f83c>] __make_request+0x3e/0xb17 [raid10]
> [<ffffffffc0430399>] make_request+0x84/0xdc [raid10]
> [<ffffffffc0557aab>] md_make_request+0xf6/0x1cc [md_mod]
> [<ffffffff9d1a369a>] generic_make_request+0x97/0xd6
> [<ffffffff9d1a37d1>] submit_bio+0xf8/0x140
> [<ffffffff9d140e71>] mpage_bio_submit+0x25/0x2c
> [<ffffffff9d141499>] mpage_readpages+0x10e/0x11f
> [<ffffffff9d13c76c>] blkdev_readpages+0x18/0x1a
> [<ffffffff9d0d09e4>] __do_page_cache_readahead+0x13c/0x1e0
> [<ffffffff9d0d0c67>] ondemand_readahead+0x1df/0x1f2
> [<ffffffff9d0d0da0>] page_cache_sync_readahead+0x38/0x3a
> [<ffffffff9d0c70b3>] generic_file_read_iter+0x184/0x50b
> [<ffffffff9d13c8f7>] blkdev_read_iter+0x33/0x38
> [<ffffffff9d1132fd>] __vfs_read+0x8d/0xb1
> [<ffffffff9d113820>] vfs_read+0x95/0x120
> [<ffffffff9d11402d>] SyS_read+0x49/0x84
> [<ffffffff9d3b772e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff
First stall dump (more in dmesg.txt):
> [ 150.183473] md0: detected capacity change from 0 to 8001054310400
> [ 150.183647] md: resync of RAID array md0
> [ 150.183652] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [ 150.183654] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
> [ 150.183678] md: using 128k window, over a total of 7813529600k.
> [ 233.068271] INFO: rcu_sched self-detected stall on CPU
> [ 233.068308] 5: (17999 ticks this GP) idle=695/140000000000001/0 softirq=1235/1235 fqs=8999
> [ 233.068335] (t=18000 jiffies g=935 c=934 q=37)
> [ 233.068354] Task dump for CPU 5:
> [ 233.068356] md0_raid10 R running task 0 1662 2 0x00000008
> [ 233.068358] 0000000000000000 ffff88103fca3de0 ffffffff9d069f07 0000000000000005
> [ 233.068360] ffffffff9d63d0c0 ffff88103fca3df8 ffffffff9d06beb9 ffffffff9d63d0c0
> [ 233.068361] ffff88103fca3e28 ffffffff9d08c836 ffffffff9d63d0c0 ffff88103fcb4e80
> [ 233.068363] Call Trace:
> [ 233.068364] <IRQ> [<ffffffff9d069f07>] sched_show_task+0xb9/0xbe
> [ 233.068372] [<ffffffff9d06beb9>] dump_cpu_task+0x32/0x35
> [ 233.068375] [<ffffffff9d08c836>] rcu_dump_cpu_stacks+0x71/0x8c
> [ 233.068378] [<ffffffff9d08f32c>] rcu_check_callbacks+0x20f/0x5a3
> [ 233.068382] [<ffffffff9d0b72da>] ? acct_account_cputime+0x17/0x19
> [ 233.068384] [<ffffffff9d0911a2>] update_process_times+0x2a/0x4f
> [ 233.068387] [<ffffffff9d09cd55>] tick_sched_handle.isra.5+0x31/0x33
> [ 233.068388] [<ffffffff9d09cd8f>] tick_sched_timer+0x38/0x60
> [ 233.068390] [<ffffffff9d0917e1>] __hrtimer_run_queues+0xa1/0x10c
> [ 233.068392] [<ffffffff9d091c52>] hrtimer_interrupt+0xa0/0x172
> [ 233.068395] [<ffffffff9d0367a4>] smp_trace_apic_timer_interrupt+0x76/0x88
> [ 233.068397] [<ffffffff9d0367bf>] smp_apic_timer_interrupt+0x9/0xb
> [ 233.068400] [<ffffffff9d3b8402>] apic_timer_interrupt+0x82/0x90
> [ 233.068401] <EOI> [<ffffffff9d19e9aa>] ? bio_copy_data+0xce/0x2af
> [ 233.068410] [<ffffffffc04320e5>] raid10d+0x974/0xf2c [raid10]
> [ 233.068417] [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
> [ 233.068421] [<ffffffffc0556df9>] ? md_thread+0x12f/0x145 [md_mod]
> [ 233.068424] [<ffffffff9d07ad2e>] ? wait_woken+0x6d/0x6d
> [ 233.068428] [<ffffffffc0556cca>] ? md_wait_for_blocked_rdev+0x102/0x102 [md_mod]
> [ 233.068431] [<ffffffff9d061db2>] kthread+0xcd/0xd5
> [ 233.068434] [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f
> [ 233.068436] [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
> [ 233.068438] [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f
[-- Attachment #2: config.txt.gz --]
[-- Type: application/gzip, Size: 31864 bytes --]
[-- Attachment #3: dmesg.txt.gz --]
[-- Type: application/gzip, Size: 31060 bytes --]
[-- Attachment #4: wchan.txt.gz --]
[-- Type: application/gzip, Size: 3051 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: raid10 resync hangs in 4.2.6, 4.3
2015-11-20 18:14 raid10 resync hangs in 4.2.6, 4.3 Andre Tomt
@ 2015-11-20 20:30 ` John Stoffel
2015-11-21 1:27 ` Andre Tomt
2015-11-23 9:20 ` Artur Paszkiewicz
1 sibling, 1 reply; 7+ messages in thread
From: John Stoffel @ 2015-11-20 20:30 UTC (permalink / raw)
To: Andre Tomt; +Cc: linux-raid
Andre,
I don't have time to test this myself, but since I'm running 4.2.6
with triple-mirror RAID1 arrays, I'm slightly worried now. Can you
make this failure happen with loop-back devices by any chance? That
might help narrow down whether it's an MD-layer, block-layer, or
controller issue.
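A loop-back attempt along those lines might look like the sketch below (untested here: it needs root, losetup, and mdadm; the sizes and the /dev/md_test name are illustrative, everything else mirrors the geometry from the report):

```shell
# Sketch: try to reproduce the hang on loop devices instead of real
# disks. Run as root; sizes and the md device name are illustrative.
loopback_raid10_test() {
    dir=$(mktemp -d)
    devs=""
    for i in 0 1 2 3 4 5 6 7; do
        truncate -s 512M "$dir/disk$i.img"   # sparse backing files
        devs="$devs $(losetup --find --show "$dir/disk$i.img")"
    done
    # Same geometry as the failing array: raid10, 128k chunk, no bitmap
    mdadm --create /dev/md_test --run --level=10 --chunk=128 \
          --bitmap=none --raid-devices=8 $devs
    # Resync progress should keep moving; the reported hang shows it
    # stalling while the raid10 thread spins at 100% CPU.
    cat /proc/mdstat
}

if [ "$(id -u)" -eq 0 ] && command -v mdadm >/dev/null 2>&1; then
    loopback_raid10_test
else
    echo "skipping: needs root and mdadm"
fi
```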
Andre> [resend with compressed attachments, first may have gotten eaten by a grue]
Andre> Hi
Andre> I'm seeing hangs with RAID10 resyncs on my system. RAID5/6 recovery on
Andre> the same drive set works without any problems, and BTRFS RAID6 is
Andre> problem-free on a different set of (very busy) drives on the same
Andre> controllers.
Andre> It happens shortly after array creation, anywhere from seconds to a
Andre> couple of minutes in.
Andre> wchan for the md0_resync kernel thread shows it sitting in
Andre> raise_barrier() forever, while md0_raid10 keeps a CPU core 100% busy
Andre> (but shows no wchan), and no resyncing or I/O to the array gets done
Andre> anymore.
Andre> After a short while the kernel starts spitting out "rcu_sched
Andre> self-detected stall on CPU" warnings, and other RCU use starts getting
Andre> iffy (I think).
Andre> I/O directly to the RAID member disks (below the md layer, e.g.
Andre> /dev/sdX directly) continues to work after the hang, and there are no
Andre> I/O errors in the kernel log.
Andre> The array is an 8-drive array spread over 3 HBAs, created with:
Andre> mdadm --create /dev/md0 --level=10 --chunk=128 --bitmap=none
Andre> --raid-devices=8 /dev/sda1 /dev/sdg1 /dev/sdl1 /dev/sdm1 /dev/sdi1
Andre> /dev/sdj1 /dev/sdp1 /dev/sds1
Andre> The HBAs are LSI SAS2008 in IT mode (mpt2sas driver), with oldish 2TB
Andre> SATA drives. Dual-socket Xeon E5 v3 system with both sockets populated.
Andre> This happens on at least 4.2.6 and 4.3; I'm going to test some earlier
Andre> kernels.
Andre> Attached is some more info.
Andre> md0_resync stack:
>> root@mental:~# cat /proc/1663/stack
>> [<ffffffffc042dd02>] raise_barrier+0x11b/0x14d [raid10]
>> [<ffffffffc0432830>] sync_request+0x193/0x14fc [raid10]
>> [<ffffffffc0559ac6>] md_do_sync+0x7d2/0xd78 [md_mod]
>> [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
>> [<ffffffff9d061db2>] kthread+0xcd/0xd5
>> [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
>> [<ffffffffffffffff>] 0xffffffffffffffff
Andre> md0_raid10 stack:
>> root@mental:~# cat /proc/1662/stack
>> [<ffffffffffffffff>] 0xffffffffffffffff
Andre> cat stack trying to read /dev/md0 after hang:
>> root@mental:~# cat /proc/1737/stack
>> [<ffffffffc042de2b>] wait_barrier+0xd8/0x118 [raid10]
>> [<ffffffffc042f83c>] __make_request+0x3e/0xb17 [raid10]
>> [<ffffffffc0430399>] make_request+0x84/0xdc [raid10]
>> [<ffffffffc0557aab>] md_make_request+0xf6/0x1cc [md_mod]
>> [<ffffffff9d1a369a>] generic_make_request+0x97/0xd6
>> [<ffffffff9d1a37d1>] submit_bio+0xf8/0x140
>> [<ffffffff9d140e71>] mpage_bio_submit+0x25/0x2c
>> [<ffffffff9d141499>] mpage_readpages+0x10e/0x11f
>> [<ffffffff9d13c76c>] blkdev_readpages+0x18/0x1a
>> [<ffffffff9d0d09e4>] __do_page_cache_readahead+0x13c/0x1e0
>> [<ffffffff9d0d0c67>] ondemand_readahead+0x1df/0x1f2
>> [<ffffffff9d0d0da0>] page_cache_sync_readahead+0x38/0x3a
>> [<ffffffff9d0c70b3>] generic_file_read_iter+0x184/0x50b
>> [<ffffffff9d13c8f7>] blkdev_read_iter+0x33/0x38
>> [<ffffffff9d1132fd>] __vfs_read+0x8d/0xb1
>> [<ffffffff9d113820>] vfs_read+0x95/0x120
>> [<ffffffff9d11402d>] SyS_read+0x49/0x84
>> [<ffffffff9d3b772e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> [<ffffffffffffffff>] 0xffffffffffffffff
Andre> First stall dump (more in dmesg.txt):
>> [ 150.183473] md0: detected capacity change from 0 to 8001054310400
>> [ 150.183647] md: resync of RAID array md0
>> [ 150.183652] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>> [ 150.183654] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
>> [ 150.183678] md: using 128k window, over a total of 7813529600k.
>> [ 233.068271] INFO: rcu_sched self-detected stall on CPU
>> [ 233.068308] 5: (17999 ticks this GP) idle=695/140000000000001/0 softirq=1235/1235 fqs=8999
>> [ 233.068335] (t=18000 jiffies g=935 c=934 q=37)
>> [ 233.068354] Task dump for CPU 5:
>> [ 233.068356] md0_raid10 R running task 0 1662 2 0x00000008
>> [ 233.068358] 0000000000000000 ffff88103fca3de0 ffffffff9d069f07 0000000000000005
>> [ 233.068360] ffffffff9d63d0c0 ffff88103fca3df8 ffffffff9d06beb9 ffffffff9d63d0c0
>> [ 233.068361] ffff88103fca3e28 ffffffff9d08c836 ffffffff9d63d0c0 ffff88103fcb4e80
>> [ 233.068363] Call Trace:
>> [ 233.068364] <IRQ> [<ffffffff9d069f07>] sched_show_task+0xb9/0xbe
>> [ 233.068372] [<ffffffff9d06beb9>] dump_cpu_task+0x32/0x35
>> [ 233.068375] [<ffffffff9d08c836>] rcu_dump_cpu_stacks+0x71/0x8c
>> [ 233.068378] [<ffffffff9d08f32c>] rcu_check_callbacks+0x20f/0x5a3
>> [ 233.068382] [<ffffffff9d0b72da>] ? acct_account_cputime+0x17/0x19
>> [ 233.068384] [<ffffffff9d0911a2>] update_process_times+0x2a/0x4f
>> [ 233.068387] [<ffffffff9d09cd55>] tick_sched_handle.isra.5+0x31/0x33
>> [ 233.068388] [<ffffffff9d09cd8f>] tick_sched_timer+0x38/0x60
>> [ 233.068390] [<ffffffff9d0917e1>] __hrtimer_run_queues+0xa1/0x10c
>> [ 233.068392] [<ffffffff9d091c52>] hrtimer_interrupt+0xa0/0x172
>> [ 233.068395] [<ffffffff9d0367a4>] smp_trace_apic_timer_interrupt+0x76/0x88
>> [ 233.068397] [<ffffffff9d0367bf>] smp_apic_timer_interrupt+0x9/0xb
>> [ 233.068400] [<ffffffff9d3b8402>] apic_timer_interrupt+0x82/0x90
>> [ 233.068401] <EOI> [<ffffffff9d19e9aa>] ? bio_copy_data+0xce/0x2af
>> [ 233.068410] [<ffffffffc04320e5>] raid10d+0x974/0xf2c [raid10]
>> [ 233.068417] [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
>> [ 233.068421] [<ffffffffc0556df9>] ? md_thread+0x12f/0x145 [md_mod]
>> [ 233.068424] [<ffffffff9d07ad2e>] ? wait_woken+0x6d/0x6d
>> [ 233.068428] [<ffffffffc0556cca>] ? md_wait_for_blocked_rdev+0x102/0x102 [md_mod]
>> [ 233.068431] [<ffffffff9d061db2>] kthread+0xcd/0xd5
>> [ 233.068434] [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f
>> [ 233.068436] [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
>> [ 233.068438] [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f
Andre> [DELETED ATTACHMENT config.txt.gz, application/gzip]
Andre> [DELETED ATTACHMENT dmesg.txt.gz, application/gzip]
Andre> [DELETED ATTACHMENT wchan.txt.gz, application/gzip]
* Re: raid10 resync hangs in 4.2.6, 4.3
2015-11-20 20:30 ` John Stoffel
@ 2015-11-21 1:27 ` Andre Tomt
2015-11-22 19:35 ` John Stoffel
0 siblings, 1 reply; 7+ messages in thread
From: Andre Tomt @ 2015-11-21 1:27 UTC (permalink / raw)
To: John Stoffel; +Cc: linux-raid
On 20. nov. 2015 21:30, John Stoffel wrote:
>
> Andre,
> I don't have time to test this myself, but since I'm running 4.2.6
> with triple-mirror RAID1 arrays, I'm slightly worried now. Can you
> make this failure happen with loop-back devices by any chance? That
> might help narrow down whether it's an MD-layer, block-layer, or
> controller issue.
I've failed to reproduce it using loop devices, but I have found that
the problem appeared somewhere between 4.1 and 4.2:
4.1.13 is stable; 4.2.0 through 4.2.6 and 4.3 are not.
Guess I'm starting a bisect now ;-)
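For the record, the mechanics of that bisect look roughly like this, shown on a throwaway repo with a marker file standing in for the regression (in the real run, v4.1 would be the good point, v4.2 the bad one, and the test would be "does the resync hang"):

```shell
# Toy illustration of the `git bisect` workflow: `git bisect run`
# drives the binary search with a pass/fail test script.
set -eu
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com
export GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q
for i in 1 2 3 4 5 6 7 8; do
    echo "change $i" > file
    # Commit 5 "introduces the bug" (a marker file, standing in for
    # the regression)
    if [ "$i" -ge 5 ]; then echo bug > bug.txt; fi
    git add -A
    git commit -qm "commit $i"
done
# Start bisecting between the known-bad tip and the known-good root
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" >/dev/null
# Exit 0 = good, nonzero = bad; bisect checks out midpoints and retests
git bisect run test ! -f bug.txt >/dev/null
echo "first bad: $(git log -1 --format=%s refs/bisect/bad)"
# prints: first bad: commit 5
```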
* Re: raid10 resync hangs in 4.2.6, 4.3
2015-11-21 1:27 ` Andre Tomt
@ 2015-11-22 19:35 ` John Stoffel
0 siblings, 0 replies; 7+ messages in thread
From: John Stoffel @ 2015-11-22 19:35 UTC (permalink / raw)
To: Andre Tomt; +Cc: John Stoffel, linux-raid
>>>>> "Andre" == Andre Tomt <andre@tomt.net> writes:
Andre> On 20. nov. 2015 21:30, John Stoffel wrote:
>>
>> Andre,
>> I don't have time to test this myself, but since I'm running 4.2.6
>> with triple-mirror RAID1 arrays, I'm slightly worried now. Can you
>> make this failure happen with loop-back devices by any chance? That
>> might help narrow down whether it's an MD-layer, block-layer, or
>> controller issue.
Andre> I've failed to reproduce it using loop devices, but I have
Andre> found that the problem appeared somewhere between 4.1 and 4.2:
Andre> 4.1.13 is stable; 4.2.0 through 4.2.6 and 4.3 are not.
Andre> Guess I'm starting a bisect now ;-)
Good luck! I just got back from a weekend of camping and am digging
out from email, tent cleanup, and other tasks.
* Re: raid10 resync hangs in 4.2.6, 4.3
2015-11-20 18:14 raid10 resync hangs in 4.2.6, 4.3 Andre Tomt
2015-11-20 20:30 ` John Stoffel
@ 2015-11-23 9:20 ` Artur Paszkiewicz
2015-11-23 9:52 ` Andre Tomt
1 sibling, 1 reply; 7+ messages in thread
From: Artur Paszkiewicz @ 2015-11-23 9:20 UTC (permalink / raw)
To: Andre Tomt, linux-raid
On 11/20/2015 07:14 PM, Andre Tomt wrote:
> [resend with compressed attachments, first may have gotten eaten by a grue]
>
> Hi
>
> I'm seeing hangs with RAID10 resyncs on my system. RAID5/6 recovery on the same drive set works without any problems, and BTRFS RAID6 is problem-free on a different set of (very busy) drives on the same controllers.
>
> It happens shortly after array creation, anywhere from seconds to a couple of minutes in.
>
> wchan for the md0_resync kernel thread shows it sitting in raise_barrier() forever, while md0_raid10 keeps a CPU core 100% busy (but shows no wchan), and no resyncing or I/O to the array gets done anymore.
>
> After a short while the kernel starts spitting out "rcu_sched self-detected stall on CPU" warnings, and other RCU use starts getting iffy (I think).
>
> I/O directly to the RAID member disks (below the md layer, e.g. /dev/sdX directly) continues to work after the hang, and there are no I/O errors in the kernel log.
>
> The array is an 8-drive array spread over 3 HBAs, created with:
> mdadm --create /dev/md0 --level=10 --chunk=128 --bitmap=none --raid-devices=8 /dev/sda1 /dev/sdg1 /dev/sdl1 /dev/sdm1 /dev/sdi1 /dev/sdj1 /dev/sdp1 /dev/sds1
>
> The HBAs are LSI SAS2008 in IT mode (mpt2sas driver), with oldish 2TB SATA drives. Dual-socket Xeon E5 v3 system with both sockets populated.
>
> This happens on at least 4.2.6 and 4.3; I'm going to test some earlier kernels.
>
> Attached is some more info.
>
> md0_resync stack:
>> root@mental:~# cat /proc/1663/stack
>> [<ffffffffc042dd02>] raise_barrier+0x11b/0x14d [raid10]
>> [<ffffffffc0432830>] sync_request+0x193/0x14fc [raid10]
>> [<ffffffffc0559ac6>] md_do_sync+0x7d2/0xd78 [md_mod]
>> [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
>> [<ffffffff9d061db2>] kthread+0xcd/0xd5
>> [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> md0_raid10 stack:
>> root@mental:~# cat /proc/1662/stack
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> cat stack trying to read /dev/md0 after hang:
>> root@mental:~# cat /proc/1737/stack
>> [<ffffffffc042de2b>] wait_barrier+0xd8/0x118 [raid10]
>> [<ffffffffc042f83c>] __make_request+0x3e/0xb17 [raid10]
>> [<ffffffffc0430399>] make_request+0x84/0xdc [raid10]
>> [<ffffffffc0557aab>] md_make_request+0xf6/0x1cc [md_mod]
>> [<ffffffff9d1a369a>] generic_make_request+0x97/0xd6
>> [<ffffffff9d1a37d1>] submit_bio+0xf8/0x140
>> [<ffffffff9d140e71>] mpage_bio_submit+0x25/0x2c
>> [<ffffffff9d141499>] mpage_readpages+0x10e/0x11f
>> [<ffffffff9d13c76c>] blkdev_readpages+0x18/0x1a
>> [<ffffffff9d0d09e4>] __do_page_cache_readahead+0x13c/0x1e0
>> [<ffffffff9d0d0c67>] ondemand_readahead+0x1df/0x1f2
>> [<ffffffff9d0d0da0>] page_cache_sync_readahead+0x38/0x3a
>> [<ffffffff9d0c70b3>] generic_file_read_iter+0x184/0x50b
>> [<ffffffff9d13c8f7>] blkdev_read_iter+0x33/0x38
>> [<ffffffff9d1132fd>] __vfs_read+0x8d/0xb1
>> [<ffffffff9d113820>] vfs_read+0x95/0x120
>> [<ffffffff9d11402d>] SyS_read+0x49/0x84
>> [<ffffffff9d3b772e>] entry_SYSCALL_64_fastpath+0x12/0x71
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
>
> First stall dump (more in dmesg.txt):
>> [ 150.183473] md0: detected capacity change from 0 to 8001054310400
>> [ 150.183647] md: resync of RAID array md0
>> [ 150.183652] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>> [ 150.183654] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
>> [ 150.183678] md: using 128k window, over a total of 7813529600k.
>> [ 233.068271] INFO: rcu_sched self-detected stall on CPU
>> [ 233.068308] 5: (17999 ticks this GP) idle=695/140000000000001/0 softirq=1235/1235 fqs=8999
>> [ 233.068335] (t=18000 jiffies g=935 c=934 q=37)
>> [ 233.068354] Task dump for CPU 5:
>> [ 233.068356] md0_raid10 R running task 0 1662 2 0x00000008
>> [ 233.068358] 0000000000000000 ffff88103fca3de0 ffffffff9d069f07 0000000000000005
>> [ 233.068360] ffffffff9d63d0c0 ffff88103fca3df8 ffffffff9d06beb9 ffffffff9d63d0c0
>> [ 233.068361] ffff88103fca3e28 ffffffff9d08c836 ffffffff9d63d0c0 ffff88103fcb4e80
>> [ 233.068363] Call Trace:
>> [ 233.068364] <IRQ> [<ffffffff9d069f07>] sched_show_task+0xb9/0xbe
>> [ 233.068372] [<ffffffff9d06beb9>] dump_cpu_task+0x32/0x35
>> [ 233.068375] [<ffffffff9d08c836>] rcu_dump_cpu_stacks+0x71/0x8c
>> [ 233.068378] [<ffffffff9d08f32c>] rcu_check_callbacks+0x20f/0x5a3
>> [ 233.068382] [<ffffffff9d0b72da>] ? acct_account_cputime+0x17/0x19
>> [ 233.068384] [<ffffffff9d0911a2>] update_process_times+0x2a/0x4f
>> [ 233.068387] [<ffffffff9d09cd55>] tick_sched_handle.isra.5+0x31/0x33
>> [ 233.068388] [<ffffffff9d09cd8f>] tick_sched_timer+0x38/0x60
>> [ 233.068390] [<ffffffff9d0917e1>] __hrtimer_run_queues+0xa1/0x10c
>> [ 233.068392] [<ffffffff9d091c52>] hrtimer_interrupt+0xa0/0x172
>> [ 233.068395] [<ffffffff9d0367a4>] smp_trace_apic_timer_interrupt+0x76/0x88
>> [ 233.068397] [<ffffffff9d0367bf>] smp_apic_timer_interrupt+0x9/0xb
>> [ 233.068400] [<ffffffff9d3b8402>] apic_timer_interrupt+0x82/0x90
>> [ 233.068401] <EOI> [<ffffffff9d19e9aa>] ? bio_copy_data+0xce/0x2af
>> [ 233.068410] [<ffffffffc04320e5>] raid10d+0x974/0xf2c [raid10]
>> [ 233.068417] [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
>> [ 233.068421] [<ffffffffc0556df9>] ? md_thread+0x12f/0x145 [md_mod]
>> [ 233.068424] [<ffffffff9d07ad2e>] ? wait_woken+0x6d/0x6d
>> [ 233.068428] [<ffffffffc0556cca>] ? md_wait_for_blocked_rdev+0x102/0x102 [md_mod]
>> [ 233.068431] [<ffffffff9d061db2>] kthread+0xcd/0xd5
>> [ 233.068434] [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f
>> [ 233.068436] [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
>> [ 233.068438] [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f
>
Hi Andre,
I've recently sent a patch for a similar problem, and I suspect the
root cause is the same here. Please check out this patch:
http://marc.info/?l=linux-raid&m=144665464232126&w=2
Artur
* Re: raid10 resync hangs in 4.2.6, 4.3
2015-11-23 9:20 ` Artur Paszkiewicz
@ 2015-11-23 9:52 ` Andre Tomt
2015-11-23 19:48 ` Andre Tomt
0 siblings, 1 reply; 7+ messages in thread
From: Andre Tomt @ 2015-11-23 9:52 UTC (permalink / raw)
To: Artur Paszkiewicz, linux-raid, John Stoffel
On 23. nov. 2015 10:20, Artur Paszkiewicz wrote:
> Hi Andre,
>
> I've recently sent a patch for a similar problem, and I suspect the
> root cause is the same here. Please check out this patch:
>
> http://marc.info/?l=linux-raid&m=144665464232126&w=2
Indeed, this is where the bisect just landed me. I'm going to test the
fix later today.
Thanks!
And thanks also to Roy Sigurd Karlsbakk for confirming I had not gone
completely off the rails :-)
> atomt@mental:~/build/linux-upstream$ git bisect bad
> c31df25f20e35add6a453328c61eca15434fae18 is the first bad commit
> commit c31df25f20e35add6a453328c61eca15434fae18
> Author: Kent Overstreet <kent.overstreet@gmail.com>
> Date: Wed May 6 23:34:20 2015 -0700
>
> md/raid10: make sync_request_write() call bio_copy_data()
>
> Refactor sync_request_write() of md/raid10 to use bio_copy_data()
> instead of open coding bio_vec iterations.
>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Neil Brown <neilb@suse.de>
> Cc: linux-raid@vger.kernel.org
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Acked-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> [dpark: add more description in commit message]
> Signed-off-by: Dongsu Park <dpark@posteo.net>
> Signed-off-by: Ming Lin <mlin@kernel.org>
> Signed-off-by: NeilBrown <neilb@suse.de>
>
> :040000 040000 de7fe22262d763cd544d0dbc53039926e5c9a6f4 8cf70fec46bcc652fd3756eb6edd1a746b41a4cd M drivers
* Re: raid10 resync hangs in 4.2.6, 4.3
2015-11-23 9:52 ` Andre Tomt
@ 2015-11-23 19:48 ` Andre Tomt
0 siblings, 0 replies; 7+ messages in thread
From: Andre Tomt @ 2015-11-23 19:48 UTC (permalink / raw)
To: Artur Paszkiewicz, linux-raid, John Stoffel
On 23. nov. 2015 10:52, Andre Tomt wrote:
> On 23. nov. 2015 10:20, Artur Paszkiewicz wrote:
>> Hi Andre,
>>
>> I've recently sent a patch for a similar problem, and I suspect the
>> root cause is the same here. Please check out this patch:
>>
>> http://marc.info/?l=linux-raid&m=144665464232126&w=2
>
> Indeed, this is where the bisect just landed me. I'm going to test
> the fix later today.
>
> Thanks!
>
> And thanks also to Roy Sigurd Karlsbakk for confirming I had not gone
> completely off the rails :-)
Artur Paszkiewicz's patch fixes the crashes. I'm going to do some
data-integrity checking later today/tomorrow.
It should probably be promoted to stable, as both 4.2.x and 4.3.x are
currently very broken.
end of thread, other threads:[~2015-11-23 19:48 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-20 18:14 raid10 resync hangs in 4.2.6, 4.3 Andre Tomt
2015-11-20 20:30 ` John Stoffel
2015-11-21 1:27 ` Andre Tomt
2015-11-22 19:35 ` John Stoffel
2015-11-23 9:20 ` Artur Paszkiewicz
2015-11-23 9:52 ` Andre Tomt
2015-11-23 19:48 ` Andre Tomt