* raid10 resync hangs in 4.2.6, 4.3
@ 2015-11-20 18:14 Andre Tomt
  2015-11-20 20:30 ` John Stoffel
  2015-11-23  9:20 ` Artur Paszkiewicz
  0 siblings, 2 replies; 7+ messages in thread
From: Andre Tomt @ 2015-11-20 18:14 UTC (permalink / raw)
  To: linux-raid

[resend with compressed attachments, first may have gotten eaten by a grue]

Hi

I'm seeing hangs with RAID10 resyncs on my system. RAID5/6 recovery on
the same drive set works without any problems, and BTRFS RAID6 is
problem-free on a different set of (very busy) drives on the same
controllers as well.

It happens shortly after array creation, anywhere from a few seconds to
a couple of minutes in.

The wchan for the md0_resync kernel thread shows it sitting in
raise_barrier() forever, while md0_raid10 keeps a CPU core 100% busy
(but shows no wchan), and no resyncing or I/O to the array gets done
anymore.

After a short while the kernel starts spitting out rcu_sched
self-detected CPU stall warnings, and other RCU use starts getting
iffy (I think).

I/O directly to the RAID member disks (below the md layer, e.g.
/dev/sdX directly) continues to work after the hang, and there are no
I/O errors in the kernel log.

The array is an 8-drive array spread over 3 HBAs, created with:
mdadm --create /dev/md0 --level=10 --chunk=128 --bitmap=none \
      --raid-devices=8 /dev/sda1 /dev/sdg1 /dev/sdl1 /dev/sdm1 \
      /dev/sdi1 /dev/sdj1 /dev/sdp1 /dev/sds1

The HBAs are LSI SAS2008 in IT mode (mpt2sas driver) and the drives are
oldish 2TB SATA drives. It's a dual-socket Xeon E5 v3 system with both
sockets populated.

This happens on at least 4.2.6 and 4.3. I'm going to test some earlier
kernels.

Attached is some more info.

md0_resync stack:
> root@mental:~# cat /proc/1663/stack
> [<ffffffffc042dd02>] raise_barrier+0x11b/0x14d [raid10]
> [<ffffffffc0432830>] sync_request+0x193/0x14fc [raid10]
> [<ffffffffc0559ac6>] md_do_sync+0x7d2/0xd78 [md_mod]
> [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
> [<ffffffff9d061db2>] kthread+0xcd/0xd5
> [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
> [<ffffffffffffffff>] 0xffffffffffffffff

md0_raid10 stack:
> root@mental:~# cat /proc/1662/stack
> [<ffffffffffffffff>] 0xffffffffffffffff

cat stack trying to read /dev/md0 after hang:
> root@mental:~# cat /proc/1737/stack
> [<ffffffffc042de2b>] wait_barrier+0xd8/0x118 [raid10]
> [<ffffffffc042f83c>] __make_request+0x3e/0xb17 [raid10]
> [<ffffffffc0430399>] make_request+0x84/0xdc [raid10]
> [<ffffffffc0557aab>] md_make_request+0xf6/0x1cc [md_mod]
> [<ffffffff9d1a369a>] generic_make_request+0x97/0xd6
> [<ffffffff9d1a37d1>] submit_bio+0xf8/0x140
> [<ffffffff9d140e71>] mpage_bio_submit+0x25/0x2c
> [<ffffffff9d141499>] mpage_readpages+0x10e/0x11f
> [<ffffffff9d13c76c>] blkdev_readpages+0x18/0x1a
> [<ffffffff9d0d09e4>] __do_page_cache_readahead+0x13c/0x1e0
> [<ffffffff9d0d0c67>] ondemand_readahead+0x1df/0x1f2
> [<ffffffff9d0d0da0>] page_cache_sync_readahead+0x38/0x3a
> [<ffffffff9d0c70b3>] generic_file_read_iter+0x184/0x50b
> [<ffffffff9d13c8f7>] blkdev_read_iter+0x33/0x38
> [<ffffffff9d1132fd>] __vfs_read+0x8d/0xb1
> [<ffffffff9d113820>] vfs_read+0x95/0x120
> [<ffffffff9d11402d>] SyS_read+0x49/0x84
> [<ffffffff9d3b772e>] entry_SYSCALL_64_fastpath+0x12/0x71
> [<ffffffffffffffff>] 0xffffffffffffffff


First OOPS (more in dmesg.txt):
> [  150.183473] md0: detected capacity change from 0 to 8001054310400
> [  150.183647] md: resync of RAID array md0
> [  150.183652] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> [  150.183654] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
> [  150.183678] md: using 128k window, over a total of 7813529600k.
> [  233.068271] INFO: rcu_sched self-detected stall on CPU
> [  233.068308] 	5: (17999 ticks this GP) idle=695/140000000000001/0 softirq=1235/1235 fqs=8999
> [  233.068335] 	 (t=18000 jiffies g=935 c=934 q=37)
> [  233.068354] Task dump for CPU 5:
> [  233.068356] md0_raid10      R  running task        0  1662      2 0x00000008
> [  233.068358]  0000000000000000 ffff88103fca3de0 ffffffff9d069f07 0000000000000005
> [  233.068360]  ffffffff9d63d0c0 ffff88103fca3df8 ffffffff9d06beb9 ffffffff9d63d0c0
> [  233.068361]  ffff88103fca3e28 ffffffff9d08c836 ffffffff9d63d0c0 ffff88103fcb4e80
> [  233.068363] Call Trace:
> [  233.068364]  <IRQ>  [<ffffffff9d069f07>] sched_show_task+0xb9/0xbe
> [  233.068372]  [<ffffffff9d06beb9>] dump_cpu_task+0x32/0x35
> [  233.068375]  [<ffffffff9d08c836>] rcu_dump_cpu_stacks+0x71/0x8c
> [  233.068378]  [<ffffffff9d08f32c>] rcu_check_callbacks+0x20f/0x5a3
> [  233.068382]  [<ffffffff9d0b72da>] ? acct_account_cputime+0x17/0x19
> [  233.068384]  [<ffffffff9d0911a2>] update_process_times+0x2a/0x4f
> [  233.068387]  [<ffffffff9d09cd55>] tick_sched_handle.isra.5+0x31/0x33
> [  233.068388]  [<ffffffff9d09cd8f>] tick_sched_timer+0x38/0x60
> [  233.068390]  [<ffffffff9d0917e1>] __hrtimer_run_queues+0xa1/0x10c
> [  233.068392]  [<ffffffff9d091c52>] hrtimer_interrupt+0xa0/0x172
> [  233.068395]  [<ffffffff9d0367a4>] smp_trace_apic_timer_interrupt+0x76/0x88
> [  233.068397]  [<ffffffff9d0367bf>] smp_apic_timer_interrupt+0x9/0xb
> [  233.068400]  [<ffffffff9d3b8402>] apic_timer_interrupt+0x82/0x90
> [  233.068401]  <EOI>  [<ffffffff9d19e9aa>] ? bio_copy_data+0xce/0x2af
> [  233.068410]  [<ffffffffc04320e5>] raid10d+0x974/0xf2c [raid10]
> [  233.068417]  [<ffffffffc0556df9>] md_thread+0x12f/0x145 [md_mod]
> [  233.068421]  [<ffffffffc0556df9>] ? md_thread+0x12f/0x145 [md_mod]
> [  233.068424]  [<ffffffff9d07ad2e>] ? wait_woken+0x6d/0x6d
> [  233.068428]  [<ffffffffc0556cca>] ? md_wait_for_blocked_rdev+0x102/0x102 [md_mod]
> [  233.068431]  [<ffffffff9d061db2>] kthread+0xcd/0xd5
> [  233.068434]  [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f
> [  233.068436]  [<ffffffff9d3b7a8f>] ret_from_fork+0x3f/0x70
> [  233.068438]  [<ffffffff9d061ce5>] ? kthread_worker_fn+0x13f/0x13f


[-- Attachment #2: config.txt.gz --]
[-- Type: application/gzip, Size: 31864 bytes --]

[-- Attachment #3: dmesg.txt.gz --]
[-- Type: application/gzip, Size: 31060 bytes --]

[-- Attachment #4: wchan.txt.gz --]
[-- Type: application/gzip, Size: 3051 bytes --]


* Re: raid10 resync hangs in 4.2.6, 4.3
  2015-11-20 18:14 raid10 resync hangs in 4.2.6, 4.3 Andre Tomt
@ 2015-11-20 20:30 ` John Stoffel
  2015-11-21  1:27   ` Andre Tomt
  2015-11-23  9:20 ` Artur Paszkiewicz
  1 sibling, 1 reply; 7+ messages in thread
From: John Stoffel @ 2015-11-20 20:30 UTC (permalink / raw)
  To: Andre Tomt; +Cc: linux-raid


Andre,
I don't have time to test this myself, but since I'm running 4.2.6
with triple-mirror RAID1 arrays, I'm slightly worried now.  Can you
make this failure happen with loop-back devices by any chance?  That
might help narrow down whether it's an MD layer, block layer, or
controller issue.
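
Something like this is roughly what I had in mind (only a sketch;
sizes, paths and loop device names are made up, and it assumes
/dev/loop0../dev/loop7 are free):

for i in $(seq 0 7); do
    truncate -s 2G /tmp/md-test-$i.img
    losetup /dev/loop$i /tmp/md-test-$i.img
done

mdadm --create /dev/md1 --level=10 --chunk=128 --bitmap=none \
      --raid-devices=8 /dev/loop[0-7]

watch cat /proc/mdstat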



* Re: raid10 resync hangs in 4.2.6, 4.3
  2015-11-20 20:30 ` John Stoffel
@ 2015-11-21  1:27   ` Andre Tomt
  2015-11-22 19:35     ` John Stoffel
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Tomt @ 2015-11-21  1:27 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid

On 20. nov. 2015 21:30, John Stoffel wrote:
>
> Andre,
> I don't have time to test this myself, but since I'm running 4.2.6
> with triple-mirror RAID1 arrays, I'm slightly worried now.  Can you
> make this failure happen with loop-back devices by any chance?  That
> might help narrow down whether it's an MD layer, block layer, or
> controller issue.

I've failed to reproduce it using loop devices, but I've found out that
the problem appeared somewhere between 4.1 and 4.2.

4.1.13 is stable; 4.2.0 through 4.2.6 and 4.3 are not.

Guess I'm starting a bisect now ;-)
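
Rough plan, for anyone following along (a sketch; the test at each step
is just re-creating the array and seeing whether the resync survives):

cd ~/build/linux-upstream
git bisect start
git bisect good v4.1
git bisect bad v4.2
# build and boot each candidate kernel, re-test, then mark it:
git bisect good     # resync completes
git bisect bad      # md0_resync hangs in raise_barrier()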


* Re: raid10 resync hangs in 4.2.6, 4.3
  2015-11-21  1:27   ` Andre Tomt
@ 2015-11-22 19:35     ` John Stoffel
  0 siblings, 0 replies; 7+ messages in thread
From: John Stoffel @ 2015-11-22 19:35 UTC (permalink / raw)
  To: Andre Tomt; +Cc: John Stoffel, linux-raid

>>>>> "Andre" == Andre Tomt <andre@tomt.net> writes:


Andre> I've failed to reproduce it using loop devices, but I've found
Andre> out that the problem appeared somewhere between 4.1 and 4.2.

Andre> 4.1.13 is stable; 4.2.0 through 4.2.6 and 4.3 are not.

Andre> Guess I'm starting a bisect now ;-)

Good luck!  I just got back from a weekend of camping and am digging
out from email, tent cleanup, and other tasks.


* Re: raid10 resync hangs in 4.2.6, 4.3
  2015-11-20 18:14 raid10 resync hangs in 4.2.6, 4.3 Andre Tomt
  2015-11-20 20:30 ` John Stoffel
@ 2015-11-23  9:20 ` Artur Paszkiewicz
  2015-11-23  9:52   ` Andre Tomt
  1 sibling, 1 reply; 7+ messages in thread
From: Artur Paszkiewicz @ 2015-11-23  9:20 UTC (permalink / raw)
  To: Andre Tomt, linux-raid

On 11/20/2015 07:14 PM, Andre Tomt wrote:
> I'm seeing hangs with RAID10 resyncs on my system. RAID5/6 recovery on
> the same drive set works without any problems. [...]
>
> The wchan for the md0_resync kernel thread shows it sitting in
> raise_barrier() forever, while md0_raid10 keeps a CPU core 100% busy
> (but shows no wchan), and no resyncing or I/O to the array gets done
> anymore.
>
> [snip - rest of the report and stack traces trimmed]

Hi Andre,

I've recently sent a patch for a similar problem; I suspect the root
cause is the same here. Please check out this patch:

http://marc.info/?l=linux-raid&m=144665464232126&w=2

Artur


* Re: raid10 resync hangs in 4.2.6, 4.3
  2015-11-23  9:20 ` Artur Paszkiewicz
@ 2015-11-23  9:52   ` Andre Tomt
  2015-11-23 19:48     ` Andre Tomt
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Tomt @ 2015-11-23  9:52 UTC (permalink / raw)
  To: Artur Paszkiewicz, linux-raid, John Stoffel

On 23. nov. 2015 10:20, Artur Paszkiewicz wrote:
> Hi Andre,
>
> I've recently sent a patch for a similar problem, I suspect the root
> cause is the same here. Please check out this patch:
>
> http://marc.info/?l=linux-raid&m=144665464232126&w=2

Indeed, this is exactly where the bisect just landed me. I'm going to
test the fix out later today (rough plan further down).

Thanks!

And also thanks to Roy Sigurd Karlsbakk for confirming I had not gone 
completely off the rails :-)

> atomt@mental:~/build/linux-upstream$ git bisect bad
> c31df25f20e35add6a453328c61eca15434fae18 is the first bad commit
> commit c31df25f20e35add6a453328c61eca15434fae18
> Author: Kent Overstreet <kent.overstreet@gmail.com>
> Date:   Wed May 6 23:34:20 2015 -0700
>
>     md/raid10: make sync_request_write() call bio_copy_data()
>
>     Refactor sync_request_write() of md/raid10 to use bio_copy_data()
>     instead of open coding bio_vec iterations.
>
>     Cc: Christoph Hellwig <hch@infradead.org>
>     Cc: Neil Brown <neilb@suse.de>
>     Cc: linux-raid@vger.kernel.org
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>     Acked-by: NeilBrown <neilb@suse.de>
>     Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
>     [dpark: add more description in commit message]
>     Signed-off-by: Dongsu Park <dpark@posteo.net>
>     Signed-off-by: Ming Lin <mlin@kernel.org>
>     Signed-off-by: NeilBrown <neilb@suse.de>
>
> :040000 040000 de7fe22262d763cd544d0dbc53039926e5c9a6f4 8cf70fec46bcc652fd3756eb6edd1a746b41a4cd M	drivers
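
For testing I'll just put the fix on top of v4.3 and rebuild, roughly
like this (a sketch; it assumes the patch from Artur's link has been
saved locally as raid10-fix.patch):

cd ~/build/linux-upstream
git bisect reset
git checkout v4.3
git am raid10-fix.patch          # or: patch -p1 < raid10-fix.patch
make -j$(nproc) && sudo make modules_install install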



* Re: raid10 resync hangs in 4.2.6, 4.3
  2015-11-23  9:52   ` Andre Tomt
@ 2015-11-23 19:48     ` Andre Tomt
  0 siblings, 0 replies; 7+ messages in thread
From: Andre Tomt @ 2015-11-23 19:48 UTC (permalink / raw)
  To: Artur Paszkiewicz, linux-raid, John Stoffel

On 23. nov. 2015 10:52, Andre Tomt wrote:
> On 23. nov. 2015 10:20, Artur Paszkiewicz wrote:
>> Hi Andre,
>>
>> I've recently sent a patch for a similar problem, I suspect the root
>> cause is the same here. Please check out this patch:
>>
>> http://marc.info/?l=linux-raid&m=144665464232126&w=2
>
> Indeed this is where the bisect just landed me. Going to test the fix
> out later today.
>
> Thanks!
>
> And also thanks to Roy Sigurd Karlsbakk for confirming I had not gone
> completely off the rails :-)

Artur Paszkiewicz's patch fixes the crashes. I'm going to do some data 
integrity checking later today/tomorrow.
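
The plan for that is basically the usual md consistency scrub (a
sketch):

echo check > /sys/block/md0/md/sync_action
watch cat /proc/mdstat                    # wait for the check to finish
cat /sys/block/md0/md/mismatch_cnt        # should end up 0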

It probably should be promoted to stable, as both 4.2.x and 4.3.x are
currently very broken.
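
(For reference, that just means the fix needs the stable tag when it
goes upstream, something along the lines of:

    Cc: stable@vger.kernel.org # 4.2+

in the tag area of the patch.)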

