* problem with recovered array
@ 2023-10-30 13:35 eyal
2023-10-30 16:14 ` Roger Heflin
2023-10-31 21:44 ` problem with recovered array [more details] eyal
0 siblings, 2 replies; 35+ messages in thread
From: eyal @ 2023-10-30 13:35 UTC (permalink / raw)
To: linux-raid
F38
I know this is a bit long but I wanted to provide as much detail as I thought needed.
I have a 7-member raid6. The other day I needed to send a disk for replacement.
I have done this before and all looked well. The array is now degraded until I get the new disk.
At one point my system got into trouble and I am not sure why, but it started to have
very slow responses to opening/closing files, or even keystrokes. In the end I decided to reboot.
It refused to complete the shutdown and after a while I used the sysrq feature to force it.
On the restart it dropped into emergency shell, the array had all members listed as spares.
I tried to '--run' the array but mdadm refused with 'cannot start dirty degraded array',
though the array was now listed in mdstat and looked as expected.
Since mdadm suggested I use '--force', I did so:
mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1
Q0) was I correct to use this command?
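(A hedged sketch of one sanity check for this situation: before or after forcing an assembly, compare the per-member superblock event counters; a small spread suggests little or nothing was lost. The counter values below are invented for illustration, not taken from this array.)

```shell
# Compare md superblock event counters across members; on a live system
# they would come from:
#   for d in /dev/sd{b,c,d,e,f,g}1; do mdadm --examine "$d" | grep Events; done
counters="12001 12001 12000 11998 12001 12001"   # illustrative values only
min=$(echo $counters | tr ' ' '\n' | sort -n | head -n 1)
max=$(echo $counters | tr ' ' '\n' | sort -n | tail -n 1)
echo "event spread: $((max - min))"
```

A spread of only a few events is the usual sign that a forced assembly is reasonable; a large spread means some members are far behind the others.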
2023-10-30T01:08:25+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
2023-10-30T01:08:25+1100 kernel: md127: detected capacity change from 0 to 117187522560
2023-10-30T01:08:25+1100 kernel: md: requested-resync of RAID array md127
Q1) What does this last line mean?
Now that the array came up I still could not mount the fs (still in emergency shell).
I rebooted and all came up, the array was there and the fs was mounted and so far
I did not notice any issues with the fs.
However, it is not perfect. I tried to copy some data from an external (USB) disk to the array
and it went very slowly (as in 10KB/s, the USB could do 120MB/s). The copy (rsync) was running
at 100% CPU which is unexpected. I then stopped it. As a test, I rsync'ed the USB disk to another
SATA disk on the server and it went fast, so the USB disk is OK.
I then noticed (in 'top') that there is a kworker running at 100% CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
944760 root 20 0 0 0 0 R 100.0 0.0 164:00.85 kworker/u16:3+flush-9:127
It did it for many hours and I do not know what it is doing.
Q2) what does this worker do?
I also noticed that mdstat shows a high bitmap usage:
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sde1[4] sdg1[6] sdf1[5] sdd1[7] sdb1[8] sdc1[9]
58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
bitmap: 87/88 pages [348KB], 65536KB chunk
Q3) Is this OK? Should the usage go down? It does not change at all.
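(A rough reading of that line, not authoritative: 87 of 88 bitmap pages in use means almost every 65536KB chunk is flagged dirty, i.e. the write-intent bitmap still considers essentially the whole array out of sync. Parsing the line quoted above:)

```shell
# Compute what fraction of write-intent bitmap pages are in use,
# from the mdstat line quoted above.
line='bitmap: 87/88 pages [348KB], 65536KB chunk'
pct=$(echo "$line" | awk '{split($2, p, "/"); printf "%d", 100 * p[1] / p[2]}')
echo "bitmap pages in use: ${pct}%"
```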
While looking at everything, I started iostat on md127 and I see that there is a constant
trickle of writes, about 5KB/s. There is no activity on this fs.
Also, I see similar activity on all the members, at the same rate, so md127 does not show
6 times the members activity. I guess this is just how md works?
Q4) What is this write activity? Is it related to the abovementioned 'requested-resync'?
If this is a background thing, how can I monitor it?
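(One way to monitor a background resync, sketched; the sysfs paths are the standard md ones, but the sync_completed value below is invented for illustration:)

```shell
# On a live system the resync state and progress are readable from sysfs:
#   cat /sys/block/md127/md/sync_action      # resync / recover / check / idle
#   cat /sys/block/md127/md/sync_completed   # "done / total" in sectors
# Converting a sync_completed reading to a percentage:
sample='58593760 / 117187522560'             # illustrative value only
pct=$(echo "$sample" | awk '{printf "%.2f", 100 * $1 / $3}')
echo "sync ${pct}% complete"
```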
Q5) Finally, will the array come up (degraded) if I reboot, or will I need to coerce it to start?
What is the correct way to bring up a degraded array? What about the 'dirty' part?
'mdadm -D /dev/md127' mentions 'sync':
Number Major Minor RaidDevice State
- 0 0 0 removed
8 8 17 1 active sync /dev/sdb1
9 8 33 2 active sync /dev/sdc1
7 8 49 3 active sync /dev/sdd1
4 8 65 4 active sync /dev/sde1
5 8 81 5 active sync /dev/sdf1
6 8 97 6 active sync /dev/sdg1
Is this related?
BTW I plan to run a 'check' at some point.
TIA
--
Eyal at Home (eyal@eyal.emu.id.au)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: problem with recovered array
2023-10-30 13:35 problem with recovered array eyal
@ 2023-10-30 16:14 ` Roger Heflin
2023-10-31 2:35 ` eyal
2023-10-31 3:21 ` Carlos Carvalho
2023-10-31 21:44 ` problem with recovered array [more details] eyal
1 sibling, 2 replies; 35+ messages in thread
From: Roger Heflin @ 2023-10-30 16:14 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
Look at 'sar -d' output for all the disks in the raid6. It may be a
disk issue (though I suspect not, given the 100% CPU shown in raid).
Clearly something very expensive/deadlockish is happening because of
the raid having to rebuild the data from the missing disk; not sure
what could be wrong with it.
If there is a newer kernel version available, upgrading and
rebooting may change something. If you are close to the newest
kernel version, going back a minor version may help (if on 6.5, go back
to the last 6.4 kernel).
You might also install the perf package and run "perf top" to see
what sorts of calls the kernel is spending all of its time in.
On Mon, Oct 30, 2023 at 8:44 AM <eyal@eyal.emu.id.au> wrote:
> [original message quoted in full -- trimmed]
* Re: problem with recovered array
2023-10-30 16:14 ` Roger Heflin
@ 2023-10-31 2:35 ` eyal
2023-10-31 2:50 ` eyal
2023-10-31 3:21 ` Carlos Carvalho
1 sibling, 1 reply; 35+ messages in thread
From: eyal @ 2023-10-31 2:35 UTC (permalink / raw)
To: linux-raid
Thanks Roger,
On 31/10/2023 03.14, Roger Heflin wrote:
> look at SAR -d output for all the disks in the raid6. It may be a
> disk issue (though I suspect not given the 100% cpu show in raid).
It seems that sar is installed but not active here. Latest data is from 2017...
> Clearly something very expensive/deadlockish is happening because of
> the raid having to rebuild the data from the missing disk, not sure
> what could be wrong with it.
>
> If you are on a kernel that has a newer version upgrading and
> rebooting may change something If it is one close to the newest
> kernel version going back a minor version (if 6.5 go back to the last
> 6.4 kernel).
I am running the latest available kernel
6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
> You might also install the perf package and run "perf top" and see
> what sorts of calls the kernel is spending all of its time in.
I now installed perf. Below is a grab from the console. Should I use different options?
Does anyone understand it?
BTW, I now ran rsync into the array and it barely moved. In 2 hours it copied about 5GB.
iostat shows about 5KB/s (kilobytes!) all that time.
TIA
top shows
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1385760 root 20 0 0 0 0 R 99.7 0.0 161:32.64 kworker/u16:4+flush-9:127
$ sudo perf top --pid 1385760 -g
+ 100.00% 0.49% [kernel] [k] ext4_mb_regular_allocator
+ 99.96% 0.00% [kernel] [k] ext4_do_writepages
+ 99.10% 0.00% [kernel] [k] ext4_ext_map_blocks
+ 89.39% 7.85% [kernel] [k] ext4_mb_find_good_group_avg_frag_lists
+ 81.78% 55.43% [kernel] [k] ext4_mb_good_group
+ 28.74% 14.63% [kernel] [k] ext4_get_group_info
+ 12.40% 0.00% [kernel] [k] ret_from_fork_asm
+ 12.40% 0.00% [kernel] [k] ret_from_fork
+ 12.40% 0.00% [kernel] [k] kthread
+ 12.40% 0.00% [kernel] [k] worker_thread
+ 12.40% 0.00% [kernel] [k] process_one_work
+ 12.40% 0.00% [kernel] [k] wb_workfn
+ 12.40% 0.00% [kernel] [k] wb_writeback
+ 12.40% 0.00% [kernel] [k] __writeback_inodes_wb
+ 12.40% 0.00% [kernel] [k] writeback_sb_inodes
+ 12.40% 0.00% [kernel] [k] __writeback_single_inode
+ 12.40% 0.00% [kernel] [k] do_writepages
+ 12.40% 0.00% [kernel] [k] ext4_writepages
+ 12.40% 0.00% [kernel] [k] ext4_map_blocks
+ 12.40% 0.00% [kernel] [k] ext4_mb_new_blocks
+ 7.21% 7.12% [kernel] [k] __rcu_read_unlock
+ 6.21% 3.23% [kernel] [k] ext4_mb_scan_aligned
+ 4.77% 4.67% [kernel] [k] __rcu_read_lock
+ 2.96% 0.56% [kernel] [k] mb_find_extent
+ 2.63% 0.55% [kernel] [k] ext4_mb_load_buddy_gfp
+ 2.32% 2.29% [kernel] [k] mb_find_order_for_block
+ 2.00% 0.08% [kernel] [k] pagecache_get_page
+ 1.88% 0.14% [kernel] [k] __filemap_get_folio
+ 1.71% 0.51% [kernel] [k] filemap_get_entry
+ 1.12% 0.43% [kernel] [k] xas_load
+ 0.68% 0.63% [kernel] [k] xas_descend
0.34% 0.34% [kernel] [k] ext4_mb_unload_buddy
0.29% 0.22% [kernel] [k] _raw_read_unlock
0.25% 0.25% [kernel] [k] _raw_spin_trylock
0.17% 0.17% [kernel] [k] _raw_read_lock
0.11% 0.10% [kernel] [k] mb_find_buddy
0.11% 0.00% [kernel] [k] asm_sysvec_apic_timer_interrupt
0.10% 0.00% [kernel] [k] sysvec_apic_timer_interrupt
0.09% 0.00% [kernel] [k] __sysvec_apic_timer_interrupt
0.08% 0.00% [kernel] [k] hrtimer_interrupt
0.08% 0.07% [kernel] [k] xas_start
0.07% 0.00% [kernel] [k] __hrtimer_run_queues
0.06% 0.00% [kernel] [k] tick_sched_timer
0.05% 0.04% [kernel] [k] folio_mark_accessed
0.04% 0.04% [kernel] [k] _raw_spin_unlock
0.04% 0.00% [kernel] [k] update_process_times
0.04% 0.02% [kernel] [k] __cond_resched
0.04% 0.04% [kernel] [k] folio_test_hugetlb
0.04% 0.00% [kernel] [k] tick_sched_handle
0.03% 0.00% [kernel] [k] scheduler_tick
0.02% 0.00% [kernel] [k] tick_sched_do_timer
0.02% 0.00% [kernel] [k] timekeeping_advance
0.01% 0.00% [kernel] [k] task_tick_fair
0.01% 0.00% [kernel] [k] perf_adjust_freq_unthr_context
0.01% 0.00% [kernel] [k] clockevents_program_event
0.01% 0.01% [kernel] [k] _raw_spin_lock_irqsave
0.01% 0.00% [kernel] [k] timekeeping_update
0.01% 0.00% [kernel] [k] arch_scale_freq_tick
0.01% 0.01% [kernel] [k] native_irq_return_iret
0.01% 0.01% [kernel] [k] native_write_msr
0.01% 0.00% [kernel] [k] update_load_avg
0.01% 0.00% [kernel] [k] lapic_next_deadline
0.01% 0.00% [kernel] [k] sched_clock_cpu
0.01% 0.00% [kernel] [k] irqtime_account_irq
0.00% 0.00% [kernel] [k] sched_clock
0.00% 0.00% [kernel] [k] __irq_exit_rcu
0.00% 0.00% [kernel] [k] native_read_msr
0.00% 0.00% [kernel] [k] native_apic_msr_eoi_write
0.00% 0.00% [kernel] [k] update_rq_clock
0.00% 0.00% [kernel] [k] smp_call_function_single_async
0.00% 0.00% [kernel] [k] native_sched_clock
0.00% 0.00% [kernel] [k] update_vsyscall
0.00% 0.00% [kernel] [k] _raw_spin_lock
0.00% 0.00% [kernel] [k] generic_exec_single
0.00% 0.00% [kernel] [k] notifier_call_chain
0.00% 0.00% [kernel] [k] __update_load_avg_cfs_rq
0.00% 0.00% [kernel] [k] tick_do_update_jiffies64
0.00% 0.00% [kernel] [k] update_irq_load_avg
0.00% 0.00% [kernel] [k] tick_program_event
0.00% 0.00% [kernel] [k] trigger_load_balance
0.00% 0.00% [kernel] [k] rb_erase
0.00% 0.00% [kernel] [k] update_wall_time
0.00% 0.00% [kernel] [k] ktime_get
0.00% 0.00% [kernel] [k] llist_add_batch
0.00% 0.00% [kernel] [k] hrtimer_active
0.00% 0.00% [kernel] [k] wq_worker_tick
0.00% 0.00% [kernel] [k] timerqueue_add
0.00% 0.00% [kernel] [k] irqentry_enter
0.00% 0.00% [kernel] [k] psi_account_irqtime
0.00% 0.00% [kernel] [k] run_posix_cpu_timers
0.00% 0.00% [kernel] [k] ktime_get_update_offsets_now
0.00% 0.00% [kernel] [k] rcu_sched_clock_irq
0.00% 0.00% [kernel] [k] read_tsc
0.00% 0.00% [kernel] [k] x86_pmu_disable
0.00% 0.00% [kernel] [k] update_curr
0.00% 0.00% [kernel] [k] _raw_spin_lock_irq
0.00% 0.00% [kernel] [k] account_system_index_time
0.00% 0.00% [kernel] [k] __update_load_avg_se
0.00% 0.00% [kernel] [k] error_entry
0.00% 0.00% [kernel] [k] update_fast_timekeeper
0.00% 0.00% [kernel] [k] _raw_spin_unlock_irqrestore
0.00% 0.00% [kernel] [k] hrtimer_run_queues
0.00% 0.00% [kernel] [k] pvclock_gtod_notify
0.00% 0.00% [kernel] [k] call_function_single_prep_ipi
0.00% 0.00% [kernel] [k] crypto_shash_update
0.00% 0.00% [kernel] [k] cpuacct_charge
0.00% 0.00% [kernel] [k] perf_pmu_nop_void
0.00% 0.00% [kernel] [k] kthread_data
0.00% 0.00% [kernel] [k] hrtimer_update_next_event
0.00% 0.00% [kernel] [k] hrtimer_forward
0.00% 0.00% [kernel] [k] __hrtimer_next_event_base
0.00% 0.00% [kernel] [k] __do_softirq
0.00% 0.00% [kernel] [k] __schedule
0.00% 0.00% [kernel] [k] __run_timers
0.00% 0.00% [kernel] [k] rcu_note_context_switch
0.00% 0.00% [kernel] [k] _find_next_and_bit
0.00% 0.00% [kernel] [k] tcp_orphan_count_sum
0.00% 0.00% [kernel] [k] cpuacct_account_field
0.00% 0.00% [kernel] [k] pick_next_task_fair
0.00% 0.00% [kernel] [k] irqtime_account_process_tick
0.00% 0.00% [kernel] [k] md_handle_request
0.00% 0.00% [kernel] [k] kmem_cache_alloc
0.00% 0.00% [kernel] [k] calc_global_load_tick
0.00% 0.00% [kernel] [k] irq_exit_rcu
0.00% 0.00% [kernel] [k] blake2s_compress_ssse3
0.00% 0.00% [kernel] [k] find_get_stripe
0.00% 0.00% [kernel] [k] rb_next
0.00% 0.00% [kernel] [k] error_return
0.00% 0.00% [kernel] [k] update_min_vruntime
0.00% 0.00% [kernel] [k] netlink_broadcast
0.00% 0.00% [kernel] [k] prb_retire_rx_blk_timer_expired
0.00% 0.00% [kernel] [k] __accumulate_pelt_segments
0.00% 0.00% [kernel] [k] kernfs_notify
0.00% 0.00% [kernel] [k] ntp_get_next_leap
0.00% 0.00% [kernel] [k] ext4_mb_prefetch
0.00% 0.00% [kernel] [k] __radix_tree_lookup
0.00% 0.00% [kernel] [k] irqentry_exit
0.00% 0.00% [kernel] [k] asm_sysvec_reschedule_ipi
0.00% 0.00% [kernel] [k] ext4_ext_insert_extent
0.00% 0.00% [kernel] [k] fast_mix
0.00% 0.00% [kernel] [k] record_times
0.00% 0.00% [kernel] [k] irq_enter_rcu
0.00% 0.00% [kernel] [k] early_xen_iret_patch
0.00% 0.00% [kernel] [k] __orc_find
0.00% 0.00% [kernel] [k] irq_work_tick
0.00% 0.00% [kernel] [k] kmem_cache_free
> On Mon, Oct 30, 2023 at 8:44 AM <eyal@eyal.emu.id.au> wrote:
>> [original message quoted in full -- trimmed]
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-10-31 2:35 ` eyal
@ 2023-10-31 2:50 ` eyal
0 siblings, 0 replies; 35+ messages in thread
From: eyal @ 2023-10-31 2:50 UTC (permalink / raw)
To: eyal, linux-raid
On 31/10/2023 13.35, eyal@eyal.emu.id.au wrote:
> [earlier quoted discussion trimmed]
> BTW, I now ran rsync into the array and it barely moved. In 2 hours it copied about 5GB.
> iostat shows about 5KB/s (kilobytes!) all that time.
rsync kept the thread running at 100% CPU. After stopping the copy, the thread was
no longer at the top of 'top' (could not see it listed at all).
> [rest of quoted message trimmed]
>> On Mon, Oct 30, 2023 at 8:44 AM <eyal@eyal.emu.id.au> wrote:
>>> [original message quoted in full -- trimmed]
>
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-10-30 16:14 ` Roger Heflin
2023-10-31 2:35 ` eyal
@ 2023-10-31 3:21 ` Carlos Carvalho
2023-10-31 6:25 ` eyal
2023-10-31 9:29 ` eyal
1 sibling, 2 replies; 35+ messages in thread
From: Carlos Carvalho @ 2023-10-31 3:21 UTC (permalink / raw)
To: linux-raid
Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> look at SAR -d output for all the disks in the raid6. It may be a
> disk issue (though I suspect not given the 100% cpu show in raid).
>
> Clearly something very expensive/deadlockish is happening because of
> the raid having to rebuild the data from the missing disk, not sure
> what could be wrong with it.
This is very similar to what I complained about some 3 months ago. For me it happens
with an array in a normal state: writes to the array stall (reads happen
normally), sar shows no disk activity, and the flush thread uses 100%
CPU.
For the latest 6.5.* I can reliably reproduce it with
% xzcat linux-6.5.tar.xz | tar x -f -
This leaves the machine with ~1.5GB of dirty pages (as reported by
/proc/meminfo) that it never manages to write to the array. I've waited for
several hours to no avail. After a reboot the kernel tree had only about 220MB
instead of ~1.5GB...
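(A small sketch for watching whether those dirty pages ever drain; the sample lines below mimic /proc/meminfo output and the values are illustrative:)

```shell
# Summarise Dirty/Writeback counters in MB from meminfo-style lines.
sample='Dirty:           5069080 kB
Writeback:            28 kB'
echo "$sample" | awk '{printf "%s %.1f MB\n", $1, $2 / 1024}'
# On a live system: watch -n 5 "grep -E 'Dirty|Writeback:' /proc/meminfo"
```

If the Dirty figure stays flat for hours while the flush kworker spins, writeback is making no forward progress rather than merely being slow.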
* Re: problem with recovered array
2023-10-31 3:21 ` Carlos Carvalho
@ 2023-10-31 6:25 ` eyal
2023-10-31 9:29 ` eyal
1 sibling, 0 replies; 35+ messages in thread
From: eyal @ 2023-10-31 6:25 UTC (permalink / raw)
To: linux-raid
On 31/10/2023 14.21, Carlos Carvalho wrote:
> [quoted message trimmed]
I rebooted the machine, so all is pristine.
This is F38, kernel 6.5.8-200.fc38.x86_64, with 32GB RAM.
I started a copy (rsync from a SATA disk into the array). Within seconds the kworker started running at 100% CPU.
In less than 1 minute it almost stopped(*1), after transferring about 5GB.
Below(*2) is the meminfo at that time.
10 minutes later it was still copying the same (32KB!) file.
20m later I stopped the copy but the kworker remained.
10m later I removed the target directory and the kworker was immediately gone.
I still suspect that after the array came up 'all spares' and I re-assembled it, even though it then looked good,
something is not completely right.
What other information should I provide to help resolve this issue?
TIA
(*1) the read side had zero activity and the write side (the array) had a trickle of about 50KB/s.
(*2) $ cat /proc/meminfo
MemTotal: 32704880 kB
MemFree: 654660 kB
MemAvailable: 29397832 kB
Buffers: 951396 kB
Cached: 26962316 kB
SwapCached: 32 kB
Active: 2698164 kB
Inactive: 26738212 kB
Active(anon): 1755572 kB
Inactive(anon): 51596 kB
Active(file): 942592 kB
Inactive(file): 26686616 kB
Unevictable: 177112 kB
Mlocked: 0 kB
SwapTotal: 16777212 kB
SwapFree: 16776444 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 5069080 kB
Writeback: 28 kB
AnonPages: 1699812 kB
Mapped: 669464 kB
Shmem: 284500 kB
KReclaimable: 1452908 kB
Slab: 1731432 kB
SReclaimable: 1452908 kB
SUnreclaim: 278524 kB
KernelStack: 16160 kB
PageTables: 29476 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 33129652 kB
Committed_AS: 8020984 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 58744 kB
VmallocChunk: 0 kB
Percpu: 5792 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Unaccepted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 338060 kB
DirectMap2M: 4757504 kB
DirectMap1G: 28311552 kB
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-10-31 3:21 ` Carlos Carvalho
2023-10-31 6:25 ` eyal
@ 2023-10-31 9:29 ` eyal
2023-10-31 10:24 ` Roger Heflin
2023-10-31 12:39 ` Carlos Carvalho
1 sibling, 2 replies; 35+ messages in thread
From: eyal @ 2023-10-31 9:29 UTC (permalink / raw)
To: linux-raid
On 31/10/2023 14.21, Carlos Carvalho wrote:
> [quoted message trimmed]
More evidence that the problem relates to the cache not being flushed to disk.
If I run 'rsync --fsync ...' it slows down, as the write is flushed to disk for each file.
But that also evicts the data from the cache, so nothing accumulates.
The result is slower copying overall, but it streams with no pauses.
It seems that the array is somehow slow to sync files. Mythtv has no problems because it writes
only a few large files. rsync copies a very large number of small files, which somehow triggers
the problem.
This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
Just my guess.
BTW I ran fsck on the fs (on the array) and it found no fault.
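One quick, non-invasive way to test the stuck-cache theory from another terminal is to watch the dirty page counters while the copy runs (standard /proc interface, nothing array-specific):

```shell
# Watch whether dirty pages are actually draining: if Dirty stays large and
# flat while a flush kworker spins at 100% CPU, writeback is stuck, not slow.
awk '/^(Dirty|Writeback):/ {printf "%s %s kB\n", $1, $2}' /proc/meminfo
```

Running it under `watch -n5` during the rsync makes the trend obvious.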
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-10-31 9:29 ` eyal
@ 2023-10-31 10:24 ` Roger Heflin
2023-10-31 21:40 ` eyal
2023-10-31 12:39 ` Carlos Carvalho
1 sibling, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-10-31 10:24 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
If your directory entries are large (lots of small files in a
directory), then the recovery of the missing data could be just enough
to push your array too hard.
find /<mount> -type d -size +1M -ls will find large directories.
do a ls -l <largedirname> | wc -l and see how many files are in there.
ext3/4 has issues with really big directories.
The perf top output showed that just about all of the time was being spent in
ext3/4 threads allocating new blocks/directory entries and such.
How much free space does the disk show in df?
On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
>
> On 31/10/2023 14.21, Carlos Carvalho wrote:
> > Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> >> look at SAR -d output for all the disks in the raid6. It may be a
> >> disk issue (though I suspect not given the 100% cpu show in raid).
> >>
> >> Clearly something very expensive/deadlockish is happening because of
> >> the raid having to rebuild the data from the missing disk, not sure
> >> what could be wrong with it.
> >
> > This is very similar to what I complained some 3 months ago. For me it happens
> > with an array in normal state. sar shows no disk activity yet there are no
> > writes to the array (reads happen normally) and the flushd thread uses 100%
> > cpu.
> >
> > For the latest 6.5.* I can reliably reproduce it with
> > % xzcat linux-6.5.tar.xz | tar x -f -
> >
> > This leaves the machine with ~1.5GB of dirty pages (as reported by
> > /proc/meminfo) that it never manages to write to the array. I've waited for
> > several hours to no avail. After a reboot the kernel tree had only about 220MB
> > instead of ~1.5GB...
>
> More evidence that the problem relates to the cache not flushed to disk.
>
> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
> But it also evicts it from the cache, so nothing accumulates.
> The result is a slower than otherwise copying but it streams with no pauses.
>
> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> only a few large files. rsync copies a very large number of small files which somehow triggers
> the problem.
>
> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
>
> Just my guess.
>
> BTW I ran fsck on the fs (on the array) and it found no fault.
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>
* Re: problem with recovered array
2023-10-31 9:29 ` eyal
2023-10-31 10:24 ` Roger Heflin
@ 2023-10-31 12:39 ` Carlos Carvalho
2023-10-31 14:19 ` Roger Heflin
1 sibling, 1 reply; 35+ messages in thread
From: Carlos Carvalho @ 2023-10-31 12:39 UTC (permalink / raw)
To: linux-raid
eyal@eyal.emu.id.au (eyal@eyal.emu.id.au) wrote on Tue, Oct 31, 2023 at 06:29:14AM -03:
> More evidence that the problem relates to the cache not flushed to disk.
Yes.
> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> only a few large files. rsync copies a very large number of small files which somehow triggers
> the problem.
Me too. Writing a few files works fine; the problem happens when many files need
flushing. That's why expanding the kernel tree blocks the machine. After many
hours it either crashes or I have to do a hard reboot because all services
stop.
It also happens with 6.1, but 6.5 is a lot more susceptible. Further, the longer
the uptime, the more prone the machine becomes to deadlock...
* Re: problem with recovered array
2023-10-31 12:39 ` Carlos Carvalho
@ 2023-10-31 14:19 ` Roger Heflin
2023-10-31 19:20 ` Carlos Carvalho
0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-10-31 14:19 UTC (permalink / raw)
To: Carlos Carvalho; +Cc: linux-raid
If it is a spinning disk, there is a limit to the total number of
updates (IOPS), read or write, that can be done per second.
In RAID5/6 a small read is one small read, but a small write requires
several disks to be read and/or written to complete that single
write. When a disk has failed, a read that would have had its data on
the missing disk needs to read all the other disks to recalculate the
missing data.
If your array was already busy, the extra reads and recalculations could
push it over its IO limit.
I set these two parameters to limit the total amount of dirty cache, and this
appears to make the system more responsive.
[root@bm-server ~]# sysctl -a | grep -i dirty | grep -i bytes
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
The issue being that once you hit dirty_bytes, all writes stop until
the cache is flushed back down to dirty_background_bytes, and the
greater the difference between the two, the longer the freeze lasts.
Note that by default most distributions don't use the _bytes knobs; they use
dirty_ratio/dirty_background_ratio, which are a % of memory and can cause
the write freeze to last for a significant amount of time, since even a
single % is quite large if you have a lot of RAM.
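To see which set of knobs is in effect on a given box (these are the standard /proc/sys/vm files; the byte values below are just Roger's examples, tune for your hardware):

```shell
# Show the current dirty-writeback thresholds; if the *_bytes values are 0,
# the *_ratio percentages apply instead, which can mean many GB on a big box.
for k in dirty_bytes dirty_background_bytes dirty_ratio dirty_background_ratio; do
    printf '%-28s %s\n' "vm.$k" "$(cat /proc/sys/vm/$k)"
done
# To make the byte caps persistent, a drop-in such as
# /etc/sysctl.d/90-dirty-bytes.conf could contain:
#   vm.dirty_background_bytes = 3000000
#   vm.dirty_bytes = 5000000
# applied with 'sysctl --system' (root required).
```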
You might also want to look at the io for the sd* devices under the md
device as that would show what MD is doing under the covers to make
the io happen (sar and/or iostat).
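If sar/iostat are not handy, the same per-device counters are available straight from the kernel (md127 and sd[b-g] are the names from this thread; adjust the pattern for your system):

```shell
# Reads completed (field 4) and writes completed (field 8) per device,
# for the md array and its sd members, from /proc/diskstats.
awk '$3 ~ /^(md127|sd[b-g])$/ {printf "%-7s reads=%-10s writes=%s\n", $3, $4, $8}' /proc/diskstats
```

Sampling it twice a few seconds apart and diffing shows whether the members are doing any IO at all.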
On Tue, Oct 31, 2023 at 7:39 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>
> eyal@eyal.emu.id.au (eyal@eyal.emu.id.au) wrote on Tue, Oct 31, 2023 at 06:29:14AM -03:
> > More evidence that the problem relates to the cache not flushed to disk.
>
> Yes.
>
> > It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> > only a few large files. rsync copies a very large number of small files which somehow triggers
> > the problem.
>
> Mee too. Writing few files works fine, the problem happens when many files need
> flushing. That's why expanding the kernel tree blocks the machine. After many
> hours it either crashes or I have to do a hard reboot because all service
> stops.
>
> It also happens with 6.1 but 6.5 is a lot more susceptible. Further, the longer
> the uptime the more prone to deadlock the machine becomes...
* Re: problem with recovered array
2023-10-31 14:19 ` Roger Heflin
@ 2023-10-31 19:20 ` Carlos Carvalho
0 siblings, 0 replies; 35+ messages in thread
From: Carlos Carvalho @ 2023-10-31 19:20 UTC (permalink / raw)
To: linux-raid
Roger Heflin (rogerheflin@gmail.com) wrote on Tue, Oct 31, 2023 at 11:19:24AM -03:
> You might also want to look at the io for the sd* devices under the md
> device as that would show what MD is doing under the covers to make
> the io happen (sar and/or iostat).
Exactly. Guess what the disks are doing? Nothing! That's the whole point. The
kernel becomes unable to send writes to the disks, and the flushd thread remains
at 100% cpu.
Again: this is not an excess of writes that the array cannot cope with. On the
contrary, it's an absence of writes.
* Re: problem with recovered array
2023-10-31 10:24 ` Roger Heflin
@ 2023-10-31 21:40 ` eyal
2023-11-01 10:30 ` Roger Heflin
0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-10-31 21:40 UTC (permalink / raw)
To: linux-raid
On 31/10/2023 21.24, Roger Heflin wrote:
> If you directory entries are large (lots of small files in a
> directory) then the recovery of the missing data could be just enough
> to push your array too hard.
Nah, the directory I am copying has nothing really large, and the target directory is created new.
> find /<mount> -type d -size +1M -ls will find large directories.
>
> do a ls -l <largedirname> | wc -l and see how many files are in there.
>
> ext3/4 has issues with really big directories.
>
> The perf top showed just about all of the time was being spend in
> ext3/4 threads allocating new blocks/directory entries and such.
Just in case there is an issue, I will copy another directory as a test.
[later] Same issue. This time the files were pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
After a few minutes (9GB copied) it took a long pause and a second kworker started. That one went away after I killed the copy.
However, this same content was copied from an external USB disk (NOT to the array) without a problem.
> How much free space does the disk show in df?
Enough room:
/dev/md127 55T 45T 9.8T 83% /data1
I still suspect an issue with the array after it was recovered.
A related issue is that there is a constant trickle of writes to the array (per iostat) of about 5KB/s
when there is no activity on this fs. In the past I saw zero reads/writes in iostat in this situation.
Is there some background md process? Can it be hurried to completion?
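To answer that question directly, md exposes any background operation through sysfs (a standard md interface; /dev/md127 is this thread's array and the path will not exist on other hosts):

```shell
# See whether md is running a background operation on the array, and what it is.
sa=/sys/block/md127/md/sync_action
if [ -r "$sa" ]; then
    cat "$sa"                       # idle / resync / recover / check / repair
    grep -A2 '^md127' /proc/mdstat  # a progress bar appears here when active
else
    echo "md127 not present on this host"
fi
# 'mdadm --wait /dev/md127' blocks until any such operation completes.
```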
> On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
>>
>> On 31/10/2023 14.21, Carlos Carvalho wrote:
>>> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
>>>> look at SAR -d output for all the disks in the raid6. It may be a
>>>> disk issue (though I suspect not given the 100% cpu show in raid).
>>>>
>>>> Clearly something very expensive/deadlockish is happening because of
>>>> the raid having to rebuild the data from the missing disk, not sure
>>>> what could be wrong with it.
>>>
>>> This is very similar to what I complained some 3 months ago. For me it happens
>>> with an array in normal state. sar shows no disk activity yet there are no
>>> writes to the array (reads happen normally) and the flushd thread uses 100%
>>> cpu.
>>>
>>> For the latest 6.5.* I can reliably reproduce it with
>>> % xzcat linux-6.5.tar.xz | tar x -f -
>>>
>>> This leaves the machine with ~1.5GB of dirty pages (as reported by
>>> /proc/meminfo) that it never manages to write to the array. I've waited for
>>> several hours to no avail. After a reboot the kernel tree had only about 220MB
>>> instead of ~1.5GB...
>>
>> More evidence that the problem relates to the cache not flushed to disk.
>>
>> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
>> But it also evicts it from the cache, so nothing accumulates.
>> The result is a slower than otherwise copying but it streams with no pauses.
>>
>> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
>> only a few large files. rsync copies a very large number of small files which somehow triggers
>> the problem.
>>
>> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
>>
>> Just my guess.
>>
>> BTW I ran fsck on the fs (on the array) and it found no fault.
>>
>> --
>> Eyal at Home (eyal@eyal.emu.id.au)
>>
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array [more details]
2023-10-30 13:35 problem with recovered array eyal
2023-10-30 16:14 ` Roger Heflin
@ 2023-10-31 21:44 ` eyal
2023-10-31 22:00 ` eyal
2023-11-01 4:31 ` [now urgent] problem with recovered array eyal
1 sibling, 2 replies; 35+ messages in thread
From: eyal @ 2023-10-31 21:44 UTC (permalink / raw)
To: linux-raid
On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
> F38
>
> I know this is a bit long but I wanted to provide as much detail as I thought needed.
>
> I have a 7-member raid6. The other day I needed to send a disk for replacement.
> I have done this before and all looked well. The array is now degraded until I get the new disk.
>
> At one point my system got into trouble and I am not sure why, but it started to have
> very slow response to open/close files, or even keystrokes. At the end I decided to reboot.
> It refused to complete the shutdown and after a while I used the sysrq feature for this.
>
> On the restart it dropped into emergency shell, the array had all members listed as spares.
> I tried to '--run' the array but mdadm refused 'cannot start dirty degraded array'
> though the array was now listed in mdstat and looked as expected.
>
> Since mdadm suggested I use --force', I did so
> mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1
>
> Q0) was I correct to use this command?
>
> 2023-10-30T01:08:25+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
> 2023-10-30T01:08:25+1100 kernel: md127: detected capacity change from 0 to 117187522560
> 2023-10-30T01:08:25+1100 kernel: md: requested-resync of RAID array md127
>
> Q1) What does this last line mean?
>
> Now that the array came up I still could not mount the fs (still in emergency shell).
> I rebooted and all came up, the array was there and the fs was mounted and so far
> I did not notice any issues with the fs.
>
> However, it is not perfect. I tried to copy some data from an external (USB) disk to the array
> and it went very slowly (as in 10KB/s, the USB could do 120MB/s). The copy (rsync) was running
> at 100% CPU which is unexpected. I then stopped it. As a test, I rsync'ed the USB disk to another
> SATA disk on the server and it went fast, so the USB disk is OK.
>
> I then noticed (in 'top') that there is a kworker running at 100% CPU:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 944760 root 20 0 0 0 0 R 100.0 0.0 164:00.85 kworker/u16:3+flush-9:127
>
> It did it for many hours and I do not know what it is doing.
>
> Q2) what does this worker do?
>
> I also noticed that mdstat shows a high bitmap usage:
>
> Personalities : [raid6] [raid5] [raid4]
> md127 : active raid6 sde1[4] sdg1[6] sdf1[5] sdd1[7] sdb1[8] sdc1[9]
> 58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
> bitmap: 87/88 pages [348KB], 65536KB chunk
>
> Q3) Is this OK? Should the usage go down? It does not change at all.
>
> While looking at everything, I started iostat on md127 and I see that there is a constant
> trickle of writes, about 5KB/s. There is no activity on this fs.
> Also, I see similar activity on all the members, at the same rate, so md127 does not show
> 6 times the members activity. I guess this is just how md works?
>
> Q4) What is this write activity? Is it related to the abovementioned 'requested-resync'?
> If this is a background thing, how can I monitor it?
>
> Q5) Finally, will the array come up (degraded) if I reboot or will I need to coerse it to start?
> What is the correct way to bring up a degraded array? What about the 'dirty' part?
> 'mdadm -D /dev/md127' mention'sync':
> Number Major Minor RaidDevice State
> - 0 0 0 removed
> 8 8 17 1 active sync /dev/sdb1
> 9 8 33 2 active sync /dev/sdc1
> 7 8 49 3 active sync /dev/sdd1
> 4 8 65 4 active sync /dev/sde1
> 5 8 81 5 active sync /dev/sdf1
> 6 8 97 6 active sync /dev/sdg1
> Is this related?
>
> BTW I plan to run a 'check' at some point.
I ran a 'check' of a few segments of the array, just to see if there is any wholesale issue, but it was clean.
> TIA
It is time to describe in more detail what happened, because it happened again yesterday and I kept notes.
The system became somewhat non responsive so I decided to reboot. The shutdown did not complete.
It took over 10m for the system to slowly give up on waiting for about 10 services, then stopping them hard.
The last line on the screen was
systemd[1]: Stopped systemd-sysctl.service - Apply Kernel Variables.
After waiting for almost 15m with no activity I rebooted using sysrq e,i,s,u,b.
The restart dropped into emergency shell.
mdstat showed all array members as spares, same as before.
I repeated what worked last time
mdadm --assemble /dev/md127 /dev/sd{b,c,d,e,f,g}1
failed, then:
mdadm --stop md127
mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1
Some messages, then /data1 was (automatically) mounted on md127. The log shows
2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): recovery complete
2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): mounted filesystem 378e74a6-e379-4bd5-ade5-f3cd85952099 r/w with ordered data mode. Quota mode: none.
There was also a message saying a recovery (maybe a different term was used?) was in progress.
mdstat shows the array is up and a recovery is running.
After waiting a little while it was time to boot.
At this point I used 'exit', hoping the boot would complete, but it did not; there was no response at all.
Ctrl-Alt-Del brought the system up. The system log suggests it was a proper restart.
On the fresh boot I do not see the recovery and the array looks like this:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sde1[4] sdc1[9] sdf1[5] sdb1[8] sdd1[7] sdg1[6]
58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
bitmap: 88/88 pages [352KB], 65536KB chunk
It now shows the bitmap as full. I left the system alone overnight with iostat running and I see
some activity every few minutes. Nothing is running on the array.
01:02:07 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
01:02:07 md127 43.32 1098.67 3659.18 0.00 4257297 14179212 0
01:03:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:04:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:05:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:06:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:07:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:08:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:09:07 md127 0.02 0.00 0.07 0.00 0 4 0
01:10:07 md127 0.38 0.00 1.53 0.00 0 92 0
01:11:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:12:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:13:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:14:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:15:07 md127 0.02 0.00 1.60 0.00 0 96 0
01:16:07 md127 0.18 0.00 7.27 0.00 0 436 0
01:17:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:18:07 md127 0.00 0.00 0.00 0.00 0 0 0
01:19:07 md127 0.02 0.00 0.07 0.00 0 4 0
01:20:07 md127 0.33 0.00 1.33 0.00 0 80 0
01:21:07 md127 0.00 0.00 0.00 0.00 0 0 0
Later, some mythtv recordings were made; I checked and they play just fine.
These are the relevant log messages from this time:
2023-10-31T23:47:41+1100 kernel: EXT4-fs (nvme0n1p3): mounted filesystem 62fbafb4-86ba-497d-ae82-77b4d386b708 ro with ordered data mode. Quota mode: none.
2023-10-31T23:47:47+1100 kernel: EXT4-fs (nvme0n1p3): re-mounted 62fbafb4-86ba-497d-ae82-77b4d386b708 r/w. Quota mode: none.
2023-10-31T23:47:48+1100 kernel: EXT4-fs (nvme0n1p1): mounted filesystem 43e95102-07d4-4012-9d69-cf5e4100e682 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:47:48+1100 kernel: EXT4-fs (nvme0n1p4): mounted filesystem f415fb92-0a9c-49cc-b4b9-48d9135acd70 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:47:49+1100 kernel: EXT4-fs (sda1): mounted filesystem c7b4b181-a62e-4e66-96e1-dfa249049f70 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:48:32+1100 systemd[1]: Dependency failed for data1.mount - /data1.
2023-10-31T23:48:32+1100 systemd[1]: data1.mount: Job data1.mount/start failed with result 'dependency'.
2023-10-31T23:50:50+1100 kernel: md/raid:md127: not clean -- starting background reconstruction
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdf1 operational as raid disk 5
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sde1 operational as raid disk 4
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdg1 operational as raid disk 6
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdb1 operational as raid disk 1
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdd1 operational as raid disk 3
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdc1 operational as raid disk 2
2023-10-31T23:50:50+1100 kernel: md/raid:md127: cannot start dirty degraded array.
2023-10-31T23:50:50+1100 kernel: md/raid:md127: failed to run raid set.
2023-10-31T23:50:50+1100 kernel: md: pers->run() failed ...
2023-10-31T23:52:58+1100 kernel: md: md127 stopped.
2023-10-31T23:53:55+1100 kernel: md: md127 stopped.
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdb1 operational as raid disk 1
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdg1 operational as raid disk 6
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdf1 operational as raid disk 5
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sde1 operational as raid disk 4
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdd1 operational as raid disk 3
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdc1 operational as raid disk 2
2023-10-31T23:53:55+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
2023-10-31T23:53:55+1100 kernel: md127: detected capacity change from 0 to 117187522560
2023-10-31T23:53:55+1100 kernel: md: requested-resync of RAID array md127
2023-10-31T23:53:56+1100 systemd[1]: Mounting data1.mount - /data1...
2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): recovery complete
2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): mounted filesystem 378e74a6-e379-4bd5-ade5-f3cd85952099 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:54:02+1100 systemd[1]: Mounted data1.mount - /data1.
2023-10-31T23:54:02+1100 mdadm[1286]: DeviceDisappeared event detected on md device /dev/md/md127
2023-10-31T23:54:02+1100 mdadm[1286]: NewArray event detected on md device /dev/md127
2023-10-31T23:54:02+1100 mdadm[1286]: DegradedArray event detected on md device /dev/md127
2023-10-31T23:54:02+1100 mdadm[1286]: RebuildStarted event detected on md device /dev/md127
Of note:
Starting the array without --force fails
md/raid:md127: not clean -- starting background reconstruction
md/raid:md127: cannot start dirty degraded array.
md/raid:md127: failed to run raid set.
Then with --force it starts, /data1 is (automatically) mounted but says
kernel: EXT4-fs (md127): recovery complete
so the shutdown was not clean.
Soon after
mdadm[1286]: RebuildStarted event detected on md device /dev/md127
What I do not understand is what these messages mean; is there something going on that does not show in mdstat?
starting background reconstruction
RebuildStarted event detected
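If those log lines mean what they appear to, the operation and its progress should be visible in md's sysfs files even when /proc/mdstat shows no progress bar (standard md interface; md127 is this thread's array name, so on other hosts these files will not exist):

```shell
# Report the current, last, and in-progress md sync state for md127.
for f in sync_action last_sync_action sync_completed; do
    p="/sys/block/md127/md/$f"
    if [ -r "$p" ]; then
        printf '%-18s %s\n' "$f:" "$(cat "$p")"
    else
        printf '%-18s (not available)\n' "$f:"
    fi
done
```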
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array [more details]
2023-10-31 21:44 ` problem with recovered array [more details] eyal
@ 2023-10-31 22:00 ` eyal
2023-11-01 4:31 ` [now urgent] problem with recovered array eyal
1 sibling, 0 replies; 35+ messages in thread
From: eyal @ 2023-10-31 22:00 UTC (permalink / raw)
To: linux-raid
On 01/11/2023 08.44, eyal@eyal.emu.id.au wrote:
> On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
>> F38
>>
>> I know this is a bit long but I wanted to provide as much detail as I thought needed.
[trimmed]
Responding with the array details:
mdadm --query
=============
/dev/md127: 54.57TiB raid6 7 devices, 0 spares. Use mdadm --detail for more detail.
mdadm --detail
==============
/dev/md127:
Version : 1.2
Creation Time : Fri Oct 26 17:24:59 2018
Raid Level : raid6
Array Size : 58593761280 (54.57 TiB 60.00 TB)
Used Dev Size : 11718752256 (10.91 TiB 12.00 TB)
Raid Devices : 7
Total Devices : 6
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Nov 1 05:09:17 2023
State : clean, degraded
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : e4.eyal.emu.id.au:127
UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
Events : 4802229
Number Major Minor RaidDevice State
- 0 0 0 removed
8 8 17 1 active sync /dev/sdb1
9 8 33 2 active sync /dev/sdc1
7 8 49 3 active sync /dev/sdd1
4 8 65 4 active sync /dev/sde1
5 8 81 5 active sync /dev/sdf1
6 8 97 6 active sync /dev/sdg1
mdadm --examine "/dev/sd?1"
==========================
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
Name : e4.eyal.emu.id.au:127
Creation Time : Fri Oct 26 17:24:59 2018
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=0 sectors
State : clean
Device UUID : 4bfcc6d6:165b6a78:b8fad560:51ff21de
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Nov 1 05:09:17 2023
Bad Block Log : 512 entries available at offset 56 sectors
Checksum : d90d6d62 - correct
Events : 4802229
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
Name : e4.eyal.emu.id.au:127
Creation Time : Fri Oct 26 17:24:59 2018
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=0 sectors
State : clean
Device UUID : 7fb0b851:c3d0463b:6460ff66:170088b9
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Nov 1 05:09:17 2023
Bad Block Log : 512 entries available at offset 56 sectors
Checksum : f86afdb5 - correct
Events : 4802229
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
Name : e4.eyal.emu.id.au:127
Creation Time : Fri Oct 26 17:24:59 2018
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=0 sectors
State : clean
Device UUID : ce5bd1bc:3e5c3054:658888dd:9038e09f
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Nov 1 05:09:17 2023
Bad Block Log : 512 entries available at offset 56 sectors
Checksum : d94e94f8 - correct
Events : 4802229
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
Name : e4.eyal.emu.id.au:127
Creation Time : Fri Oct 26 17:24:59 2018
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=0 sectors
State : clean
Device UUID : 4706a5ab:e7cea085:9af96e3a:81ac89b1
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Nov 1 05:09:17 2023
Bad Block Log : 512 entries available at offset 56 sectors
Checksum : 6822973d - correct
Events : 4802229
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 4
Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
Name : e4.eyal.emu.id.au:127
Creation Time : Fri Oct 26 17:24:59 2018
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=0 sectors
State : clean
Device UUID : b1732c74:a34e121d:8347018e:c42b5085
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Nov 1 05:09:17 2023
Bad Block Log : 512 entries available at offset 56 sectors
Checksum : ef74518f - correct
Events : 4802229
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 5
Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
Name : e4.eyal.emu.id.au:127
Creation Time : Fri Oct 26 17:24:59 2018
Raid Level : raid6
Raid Devices : 7
Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=0 sectors
State : clean
Device UUID : b44b1807:ed20c6f9:ec0436d7:744d144a
Internal Bitmap : 8 sectors from superblock
Update Time : Wed Nov 1 05:09:17 2023
Bad Block Log : 512 entries available at offset 56 sectors
Checksum : 6d0cdaf7 - correct
Events : 4802229
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 6
Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
HTH
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: [now urgent] problem with recovered array
2023-10-31 21:44 ` problem with recovered array [more details] eyal
2023-10-31 22:00 ` eyal
@ 2023-11-01 4:31 ` eyal
2023-11-01 6:44 ` eyal
1 sibling, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-01 4:31 UTC (permalink / raw)
To: linux-raid
Why I see it as urgent now:
I suspect that my system will soon run out of cache (because dirty blocks are not being written to the array)
and fail in an ugly way.
If you can help then please read this thread.
Another unusual observation of a situation where the system looks OK but evidently is not.
While pondering the above situation, I noticed that when I watched a mythtv recording,
which is on the array (md127), I see no activity on md127 or the components '/dev/sd[b-g]1'.(*1)
Now, if I watch a program from earlier, from before the last boot(*2), I do see it in iostat.
Is it possible that all the recent recordings are still cached and were never written to disk?
What is holding the system back from draining the cache? Is the array somehow in readonly mode? How do I fix this?
I now think that on shutdown, a sync is done that ends up preventing the shutdown from completing.
BTW, earlier in the same situation, running 'sync' never completed.
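One way to probe the stuck-sync theory without risking an unbounded hang is to bound the sync and compare the dirty counters before and after (assumes coreutils' 'timeout' is available):

```shell
# Bound the sync so a stuck writeback cannot hang the shell indefinitely,
# then check whether the dirty count actually moved.
grep -E '^(Dirty|Writeback):' /proc/meminfo
timeout 30 sync && echo "sync completed" || echo "sync still blocked after 30s"
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

If Dirty barely changes and sync times out, writeback to the array really is wedged.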
HTH
(*1)
$ df -h /data1
Filesystem Size Used Avail Use% Mounted on
/dev/md127 55T 45T 9.8T 83% /data1
]$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sde1[4] sdc1[9] sdf1[5] sdb1[8] sdd1[7] sdg1[6]
58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
bitmap: 88/88 pages [352KB], 65536KB chunk
$ locate 2050_20231031093600.ts
/data1/mythtv/lib/2050_20231031093600.ts
$ ls -l /var/lib/mythtv
lrwxrwxrwx 1 root root 17 Feb 26 2023 /var/lib/mythtv -> /data1/mythtv/lib
$ ls -l /data1/mythtv/lib/2050_20231031093600.ts
-rw-r--r-- 1 mythtv mythtv 2362511964 Oct 31 21:24 /data1/mythtv/lib/2050_20231031093600.ts
(*2)
$ uptime
15:02:37 up 15:05, 27 users, load average: 0.49, 0.59, 0.67
On 01/11/2023 08.44, eyal@eyal.emu.id.au wrote:
> On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
>> F38
>>
>> I know this is a bit long but I wanted to provide as much detail as I thought needed.
>>
>> I have a 7-member raid6. The other day I needed to send a disk for replacement.
>> I have done this before and all looked well. The array is now degraded until I get the new disk.
[trimmed] See original posting.
--
Eyal at Home (eyal@eyal.emu.id.au)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [now urgent] problem with recovered array
2023-11-01 4:31 ` [now urgent] problem with recovered array eyal
@ 2023-11-01 6:44 ` eyal
0 siblings, 0 replies; 35+ messages in thread
From: eyal @ 2023-11-01 6:44 UTC (permalink / raw)
To: linux-raid
Bugger, I changed my mind. I now see that recordings do show write activity.
The reason why playing a recording from 12h ago does not show any reads is probably
because it is still in the cache, but not dirty. I guess.
My original problem remains though: rsync'ing a large tree of small files to the array
chokes after a minute, and after that writing progresses at a very slow pace
of a few tens of KB/s. A kthread then starts and runs at 100% CPU for hours, even
though only 5 GB were copied.
See my recent thread 'problem with recovered array' for details. The problem is still urgent.
I cannot safely shut down the machine.
On 01/11/2023 15.31, eyal@eyal.emu.id.au wrote:
> Why do I see it as urgent now:
> I suspect that my system will soon run out of cache (due to no writes of dirty blocks to the array)
> and fail in an ugly way.
>
> If you can help then please read this thread.
>
> Another unusual observation of the situation where the system looks OK but is evidently not.
>
> While pondering the above situation, I noticed that when I watched a mythtv recording,
> which is on the array (md127), I see no activity on md127 or the components '/dev/sd[b-g]1'.(*1)
>
> Now, if I watch a program from earlier, from before the last boot(*2), I do see it in iostat.
> Is it possible that all the recent recordings are still cached and were never written to disk?
>
> What is holding the system from draining the cache? Is the array somehow in readonly mode? How do I fix this?
> I now think that on shutdown, a sync is done that ends up preventing the shutdown from completion.
>
> BTW, before, in the same situation, running 'sync' never completes.
>
> HTH
>
> (*1)
> $ df -h /data1
> Filesystem Size Used Avail Use% Mounted on
> /dev/md127 55T 45T 9.8T 83% /data1
>
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md127 : active raid6 sde1[4] sdc1[9] sdf1[5] sdb1[8] sdd1[7] sdg1[6]
> 58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
> bitmap: 88/88 pages [352KB], 65536KB chunk
>
> $ locate 2050_20231031093600.ts
> /data1/mythtv/lib/2050_20231031093600.ts
>
> $ ls -l /var/lib/mythtv
> lrwxrwxrwx 1 root root 17 Feb 26 2023 /var/lib/mythtv -> /data1/mythtv/lib
>
> $ ls -l /data1/mythtv/lib/2050_20231031093600.ts
> -rw-r--r-- 1 mythtv mythtv 2362511964 Oct 31 21:24 /data1/mythtv/lib/2050_20231031093600.ts
>
> (*2)
> $ uptime
> 15:02:37 up 15:05, 27 users, load average: 0.49, 0.59, 0.67
>
>
> On 01/11/2023 08.44, eyal@eyal.emu.id.au wrote:
>> On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
>>> F38
>>>
>>> I know this is a bit long but I wanted to provide as much detail as I thought needed.
>>>
>>> I have a 7-member raid6. The other day I needed to send a disk for replacement.
>>> I have done this before and all looked well. The array is now degraded until I get the new disk.
>
> [trimmed] See original posting.
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-10-31 21:40 ` eyal
@ 2023-11-01 10:30 ` Roger Heflin
2023-11-01 13:08 ` eyal
0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-01 10:30 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
Did you check with iostat/sar? (iostat -x 2 5). The md
housekeeping/background stuff does not show on the md device itself,
but it shows on the underlying disks.
It might also be related to the bitmap keeping track of how far the
missing disk is behind.
Small files are troublesome. A tiny file takes several reads and writes.
I think the bitmap is tracking how many writes need to be done on the
missing disk, and if so then until the new disk gets put back will not
start cleaning up.
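For scale, the mdstat output in this thread shows a 65536KB bitmap chunk. Assuming the per-member data size is simply the array size divided by the five data disks of the 7-disk RAID6 (an assumption, since the exact member size is not quoted), the write-intent bitmap only has to track on the order of 180k chunks per member:

```python
# Back-of-envelope: how many write-intent bitmap chunks cover one member.
# The array size and bitmap chunk size come from the /proc/mdstat output in
# this thread; the per-member size is an assumption (array size / 5 data
# disks in a 7-disk RAID6).
array_kb = 58593761280          # "blocks" (1 kB units) from /proc/mdstat
data_disks = 7 - 2              # RAID6: two disks' worth of parity
chunk_kb = 65536                # bitmap chunk size from /proc/mdstat

member_kb = array_kb // data_disks
chunks = -(-member_kb // chunk_kb)   # ceiling division
print(f"{member_kb} kB per member -> {chunks} bitmap chunks")
```

So tracking the absent disk is cheap in memory; the cost shows up as extra bitmap updates on every write, not as a growing backlog.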
On Tue, Oct 31, 2023 at 4:40 PM <eyal@eyal.emu.id.au> wrote:
>
> On 31/10/2023 21.24, Roger Heflin wrote:
> > If you directory entries are large (lots of small files in a
> > directory) then the recovery of the missing data could be just enough
> > to push your array too hard.
>
> Nah, the directory I am copying has nothing really large, and the target directory is created new.
>
> > find /<mount> -type d -size +1M -ls will find large directories.
> >
> > do a ls -l <largedirname> | wc -l and see how many files are in there.
> >
> > ext3/4 has issues with really big directories.
> >
> > The perf top showed just about all of the time was being spent in
> > ext3/4 threads allocating new blocks/directory entries and such.
>
> Just in case there is an issue, I will copy another directory as a test.
> [later] Same issue. This time the files were Pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
> After a few minutes (9GB copied) it took a long pause and a second kworker started. This one went away after I killed the copy.
>
> However, this same content was copied from an external USB disk (NOT to the array) without a problem.
>
> > How much free space does the disk show in df?
>
> Enough room:
> /dev/md127 55T 45T 9.8T 83% /data1
>
> I still suspect an issue with the array after it was recovered.
>
> A related issue is that there is a constant rate of writes to the array (iostat says) at about 5KB/s
> when there is no activity on this fs. In the past I saw zero read/write in iostat in this situation.
>
> Is there some background md process? Can it be hurried to completion?
>
> > On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
> >>
> >> On 31/10/2023 14.21, Carlos Carvalho wrote:
> >>> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> >>>> look at SAR -d output for all the disks in the raid6. It may be a
> >>>> disk issue (though I suspect not given the 100% cpu show in raid).
> >>>>
> >>>> Clearly something very expensive/deadlockish is happening because of
> >>>> the raid having to rebuild the data from the missing disk, not sure
> >>>> what could be wrong with it.
> >>>
> >>> This is very similar to what I complained about some 3 months ago. For me it happens
> >>> with an array in normal state. sar shows no disk activity yet there are no
> >>> writes to the array (reads happen normally) and the flushd thread uses 100%
> >>> cpu.
> >>>
> >>> For the latest 6.5.* I can reliably reproduce it with
> >>> % xzcat linux-6.5.tar.xz | tar x -f -
> >>>
> >>> This leaves the machine with ~1.5GB of dirty pages (as reported by
> >>> /proc/meminfo) that it never manages to write to the array. I've waited for
> >>> several hours to no avail. After a reboot the kernel tree had only about 220MB
> >>> instead of ~1.5GB...
> >>
> >> More evidence that the problem relates to the cache not flushed to disk.
> >>
> >> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
> >> But it also evicts it from the cache, so nothing accumulates.
> >> The result is a slower than otherwise copying but it streams with no pauses.
> >>
> >> It seems that the array is slow to sync files somehow. Mythtv has no problems because it writes
> >> only a few large files. rsync copies a very large number of small files which somehow triggers
> >> the problem.
> >>
> >> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
> >>
> >> Just my guess.
> >>
> >> BTW I ran fsck on the fs (on the array) and it found no fault.
> >>
> >> --
> >> Eyal at Home (eyal@eyal.emu.id.au)
> >>
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>
* Re: problem with recovered array
2023-11-01 10:30 ` Roger Heflin
@ 2023-11-01 13:08 ` eyal
2023-11-01 14:29 ` Roger Heflin
0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-01 13:08 UTC (permalink / raw)
To: linux-raid
On 01/11/2023 21.30, Roger Heflin wrote:
> Did you check with iostat/sar? (iostat -x 2 5). The md
> housekeeping/background stuff does not show on the md device itself,
> but it shows on the underlying disks.
Yes, I have iostat on both the md device and the components and I see sparse activity on both.
> It might also be related to the bitmap keeping track of how far the
> missing disk is behind.
This may be the case. In case the disk is re-added it may help to know this,
but if a new disk is added then the full disk will be recreated anyway.
> Small files are troublesome. A tiny file takes several reads and writes.
Yes, but the array is so fast that this should not be a problem.
The rsync source is 175MB/s and the target array is 800MB/s so I do not
see how the writing can slow the copy.
In one case I (virsh) saved a VM, which created one 8GB file, which took
many hours to be written.
> I think the bitmap is tracking how many writes need to be done on the
> missing disk, and if so then until the new disk gets put back will not
> start cleaning up.
>
>
> On Tue, Oct 31, 2023 at 4:40 PM <eyal@eyal.emu.id.au> wrote:
>>
>> On 31/10/2023 21.24, Roger Heflin wrote:
>>> If you directory entries are large (lots of small files in a
>>> directory) then the recovery of the missing data could be just enough
>>> to push your array too hard.
>>
>> Nah, the directory I am copying has nothing really large, and the target directory is created new.
>>
>>> find /<mount> -type d -size +1M -ls will find large directories.
>>>
>>> do a ls -l <largedirname> | wc -l and see how many files are in there.
>>>
>>> ext3/4 has issues with really big directories.
>>>
>>> The perf top showed just about all of the time was being spent in
>>> ext3/4 threads allocating new blocks/directory entries and such.
>>
>> Just in case there is an issue, I will copy another directory as a test.
>> [later] Same issue. This time the files were Pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
>> After a few minutes (9GB copied) it took a long pause and a second kworker started. This one went away after I killed the copy.
>>
>> However, this same content was copied from an external USB disk (NOT to the array) without a problem.
>>
>>> How much free space does the disk show in df?
>>
>> Enough room:
>> /dev/md127 55T 45T 9.8T 83% /data1
>>
>> I still suspect an issue with the array after it was recovered.
>>
>> A related issue is that there is a constant rate of writes to the array (iostat says) at about 5KB/s
>> when there is no activity on this fs. In the past I saw zero read/write in iostat in this situation.
>>
>> Is there some background md process? Can it be hurried to completion?
>>
>>> On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
>>>>
>>>> On 31/10/2023 14.21, Carlos Carvalho wrote:
>>>>> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
>>>>>> look at SAR -d output for all the disks in the raid6. It may be a
>>>>>> disk issue (though I suspect not given the 100% cpu show in raid).
>>>>>>
>>>>>> Clearly something very expensive/deadlockish is happening because of
>>>>>> the raid having to rebuild the data from the missing disk, not sure
>>>>>> what could be wrong with it.
>>>>>
>>>>> This is very similar to what I complained about some 3 months ago. For me it happens
>>>>> with an array in normal state. sar shows no disk activity yet there are no
>>>>> writes to the array (reads happen normally) and the flushd thread uses 100%
>>>>> cpu.
>>>>>
>>>>> For the latest 6.5.* I can reliably reproduce it with
>>>>> % xzcat linux-6.5.tar.xz | tar x -f -
>>>>>
>>>>> This leaves the machine with ~1.5GB of dirty pages (as reported by
>>>>> /proc/meminfo) that it never manages to write to the array. I've waited for
>>>>> several hours to no avail. After a reboot the kernel tree had only about 220MB
>>>>> instead of ~1.5GB...
>>>>
>>>> More evidence that the problem relates to the cache not flushed to disk.
>>>>
>>>> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
>>>> But it also evicts it from the cache, so nothing accumulates.
>>>> The result is a slower than otherwise copying but it streams with no pauses.
>>>>
>>>> It seems that the array is slow to sync files somehow. Mythtv has no problems because it writes
>>>> only a few large files. rsync copies a very large number of small files which somehow triggers
>>>> the problem.
>>>>
>>>> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
>>>>
>>>> Just my guess.
>>>>
>>>> BTW I ran fsck on the fs (on the array) and it found no fault.
>>>>
>>>> --
>>>> Eyal at Home (eyal@eyal.emu.id.au)
>>>>
>>
>> --
>> Eyal at Home (eyal@eyal.emu.id.au)
>>
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-11-01 13:08 ` eyal
@ 2023-11-01 14:29 ` Roger Heflin
2023-11-02 8:34 ` Johannes Truschnigg
0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-01 14:29 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
Did you test writing on the array? And if you did test it did you
use a big enough file and/or bypass/disable cache? Having the default
dirty* settings high/default it can cache up to 20% of RAM (without
writing it to disk). Raid5/6 writing is much slower than reads. If
you have a single raid5/6 array of spinning disks I would be surprised
if it was able to sustain anything close to 175MB/sec. Mine
struggles to sustain even 100MB/sec on writes.
But with my 3MB/5MB dirty_*_bytes settings (which let the cache work but
prevent massive amounts of data accumulating in the cache):
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
dd if=/dev/zero bs=256K of=testfile.out count=4000 status=progress
501481472 bytes (501 MB, 478 MiB) copied, 8 s, 62.5 MB/s^C
2183+0 records in
2183+0 records out
572260352 bytes (572 MB, 546 MiB) copied, 8.79945 s, 65.0 MB/s
Using the fedora default:
vm.dirty_background_ratio = 20
vm.dirty_ratio = 20
dd if=/dev/zero bs=256K of=testfile.out count=4000 status=progress
4000+0 records in
4000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.687841 s, 1.5 GB/s
Note that on the 2nd test, writing was still happening for the next
5-8 seconds after dd exited, giving the appearance of being much faster
than it is. As the test size is increased the speed gets lower, as
more of the data has to actually be written. On my system with the fedora
default settings (64G RAM) and a 16G file I still get
300MB/sec even though the underlying array can only really do around
70-100MB/sec.
Raid5/6 typically breaks down to about the write rate of a single disk
because of all of the extra work that has to be done for raid5/6 to
work.
Do you happen to know how fragmented the 8GB VM file is? (filefrag
<filename>). Each fragment requires a separate, several-ms seek.
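The seek cost compounds quickly. A rough model with hypothetical round numbers (8 ms per seek, 70 MB/s streaming rate; neither is a measurement from this array):

```python
# Rough model of write time for a fragmented file: streaming time plus one
# seek per fragment. seek_ms and stream_mb_s are hypothetical round numbers
# for a spinning-disk array, not measurements from the system in question.
def write_time_s(size_mb, fragments, stream_mb_s=70.0, seek_ms=8.0):
    return size_mb / stream_mb_s + fragments * seek_ms / 1000.0

size_mb = 8 * 1024  # the 8GB VM save file mentioned earlier
for frags in (100, 10_000, 1_000_000):
    print(f"{frags:>9} fragments -> {write_time_s(size_mb, frags):8.1f} s")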
On Wed, Nov 1, 2023 at 8:08 AM <eyal@eyal.emu.id.au> wrote:
>
> On 01/11/2023 21.30, Roger Heflin wrote:
> > Did you check with iostat/sar? (iostat -x 2 5). The md
> > housekeeping/background stuff does not show on the md device itself,
> > but it shows on the underlying disks.
>
> Yes, I have iostat on both the md device and the components and I see sparse activity on both.
>
> > It might also be related to the bitmap keeping track of how far the
> > missing disk is behind.
>
> This may be the case. In case the disk is re-added it may help to know this,
> but if a new disk is added then the full disk will be recreated anyway.
>
> > Small files are troublesome. A tiny file takes several reads and writes.
>
> Yes, but the array is so fast that this should not be a problem.
> The rsync source is 175MB/s and the target array is 800MB/s so I do not
> see how the writing can slow the copy.
>
> In one case I (virsh) saved a VM, which created one 8GB file, which took
> many hours to be written.
>
> > I think the bitmap is tracking how many writes need to be done on the
> > missing disk, and if so then until the new disk gets put back will not
> > start cleaning up.
> >
> >
> > On Tue, Oct 31, 2023 at 4:40 PM <eyal@eyal.emu.id.au> wrote:
> >>
> >> On 31/10/2023 21.24, Roger Heflin wrote:
> >>> If you directory entries are large (lots of small files in a
> >>> directory) then the recovery of the missing data could be just enough
> >>> to push your array too hard.
> >>
> >> Nah, the directory I am copying has nothing really large, and the target directory is created new.
> >>
> >>> find /<mount> -type d -size +1M -ls will find large directories.
> >>>
> >>> do a ls -l <largedirname> | wc -l and see how many files are in there.
> >>>
> >>> ext3/4 has issues with really big directories.
> >>>
> >>> The perf top showed just about all of the time was being spent in
> >>> ext3/4 threads allocating new blocks/directory entries and such.
> >>
> >> Just in case there is an issue, I will copy another directory as a test.
> >> [later] Same issue. This time the files were Pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
> >> After a few minutes (9GB copied) it took a long pause and a second kworker started. This one went away after I killed the copy.
> >>
> >> However, this same content was copied from an external USB disk (NOT to the array) without a problem.
> >>
> >>> How much free space does the disk show in df?
> >>
> >> Enough room:
> >> /dev/md127 55T 45T 9.8T 83% /data1
> >>
> >> I still suspect an issue with the array after it was recovered.
> >>
> >> A related issue is that there is a constant rate of writes to the array (iostat says) at about 5KB/s
> >> when there is no activity on this fs. In the past I saw zero read/write in iostat in this situation.
> >>
> >> Is there some background md process? Can it be hurried to completion?
> >>
> >>> On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
> >>>>
> >>>> On 31/10/2023 14.21, Carlos Carvalho wrote:
> >>>>> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> >>>>>> look at SAR -d output for all the disks in the raid6. It may be a
> >>>>>> disk issue (though I suspect not given the 100% cpu show in raid).
> >>>>>>
> >>>>>> Clearly something very expensive/deadlockish is happening because of
> >>>>>> the raid having to rebuild the data from the missing disk, not sure
> >>>>>> what could be wrong with it.
> >>>>>
> >>>>> This is very similar to what I complained about some 3 months ago. For me it happens
> >>>>> with an array in normal state. sar shows no disk activity yet there are no
> >>>>> writes to the array (reads happen normally) and the flushd thread uses 100%
> >>>>> cpu.
> >>>>>
> >>>>> For the latest 6.5.* I can reliably reproduce it with
> >>>>> % xzcat linux-6.5.tar.xz | tar x -f -
> >>>>>
> >>>>> This leaves the machine with ~1.5GB of dirty pages (as reported by
> >>>>> /proc/meminfo) that it never manages to write to the array. I've waited for
> >>>>> several hours to no avail. After a reboot the kernel tree had only about 220MB
> >>>>> instead of ~1.5GB...
> >>>>
> >>>> More evidence that the problem relates to the cache not flushed to disk.
> >>>>
> >>>> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
> >>>> But it also evicts it from the cache, so nothing accumulates.
> >>>> The result is a slower than otherwise copying but it streams with no pauses.
> >>>>
> >>>> It seems that the array is slow to sync files somehow. Mythtv has no problems because it writes
> >>>> only a few large files. rsync copies a very large number of small files which somehow triggers
> >>>> the problem.
> >>>>
> >>>> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
> >>>>
> >>>> Just my guess.
> >>>>
> >>>> BTW I ran fsck on the fs (on the array) and it found no fault.
> >>>>
> >>>> --
> >>>> Eyal at Home (eyal@eyal.emu.id.au)
> >>>>
> >>
> >> --
> >> Eyal at Home (eyal@eyal.emu.id.au)
> >>
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>
* Re: problem with recovered array
2023-11-01 14:29 ` Roger Heflin
@ 2023-11-02 8:34 ` Johannes Truschnigg
2023-11-02 11:27 ` eyal
2023-11-03 14:16 ` Carlos Carvalho
0 siblings, 2 replies; 35+ messages in thread
From: Johannes Truschnigg @ 2023-11-02 8:34 UTC (permalink / raw)
To: Roger Heflin; +Cc: eyal, linux-raid
Hi list,
for the record, I do not think that any of the observations the OP made can be
explained by non-pathological phenomena/patterns of behavior. Something is
very clearly wrong with how this system behaves (the reported figures do not
at all match the expected performance of even a degraded RAID6 array in my
experience) and how data written to the filesystem apparently fails to make it
into the backing devices in acceptable time.
The whole affair reeks either of "subtle kernel bug", or maybe "subtle
hardware failure", I think.
Still, it'd be interesting to know what happens when writes to the array through
the file system are performed with O_DIRECT in effect, i.e., using `dd
oflag=direct status=progress ...` - does this yield any observable (via
`iostat` et al.) throughput to the disks beneath? What transfer speeds does `dd`
report this way with varying block sizes? Are there no concerning messages in
the debug ringbuffer (`dmesg`) at this time?
I'm not sure how we'd best learn more about what the participating busy kernel
threads (Fedora 38 might have more convenient devices up its sleeve than
ftrace?) are spending their time on in particular, but I think that's what's
needed to be known to pin down the underlying cause of the problem.
--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )
www: https://johannes.truschnigg.info/
phone: +436502133337
xmpp: johannes@truschnigg.info
* Re: problem with recovered array
2023-11-02 8:34 ` Johannes Truschnigg
@ 2023-11-02 11:27 ` eyal
2023-11-02 11:57 ` Roger Heflin
2023-11-02 12:29 ` eyal
2023-11-03 14:16 ` Carlos Carvalho
1 sibling, 2 replies; 35+ messages in thread
From: eyal @ 2023-11-02 11:27 UTC (permalink / raw)
To: linux-raid
On 02/11/2023 19.34, Johannes Truschnigg wrote:
> Hi list,
>
> for the record, I do not think that any of the observations the OP made can be
> explained by non-pathological phenomena/patterns of behavior. Something is
> very clearly wrong with how this system behaves (the reported figures do not
> at all match the expected performance of even a degraded RAID6 array in my
> experience) and how data written to the filesystem apparently fails to make it
> into the backing devices in acceptable time.
>
> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> hardware failure", I think.
>
> Still, it'd be interesting to know what happens when writes to the array through
> the file system are performed with O_DIRECT in effect, i.e., using `dd
> oflag=direct status=progress ...` - does this yield any observable (via
> `iostat` et al.) throughput to the disks beneath? What transfer speeds does `dd`
> report this way with varying block sizes? Are there no concerning messages in
> the debug ringbuffer (`dmesg`) at this time?
>
> I'm not sure how we'd best learn more about what the participating busy kernel
> threads (Fedora 38 might have more convenient devices up its sleeve than
> ftrace?) are spending their time on in particular, but I think that's what's
> needed to be known to pin down the underlying cause of the problem.
To clarify what may have been lost in this thread.
I can dd to this fs just fine; the Dirty count does not go above about 4GB. The machine has 32GB of RAM.
The cache is cleared very fast. Tested writing 100GB.
The Dirty blocks rise to the max then drain quickly.
2023-11-02 21:28:27 Dirty: 236 kB
2023-11-02 21:28:37 Dirty: 8 kB
2023-11-02 21:28:47 Dirty: 40 kB
2023-11-02 21:28:57 Dirty: 68 kB
2023-11-02 21:29:07 Dirty: 216 kB
2023-11-02 21:29:17 Dirty: 364 kB
2023-11-02 21:29:27 Dirty: 20 kB
2023-11-02 21:29:37 Dirty: 48 kB
2023-11-02 21:29:47 Dirty: 24 kB
2023-11-02 21:29:57 Dirty: 36 kB
2023-11-02 21:30:07 Dirty: 28 kB
2023-11-02 21:30:17 Dirty: 52 kB
2023-11-02 21:30:27 Dirty: 36 kB
2023-11-02 21:30:37 Dirty: 4112980 kB
2023-11-02 21:30:47 Dirty: 3772396 kB
2023-11-02 21:30:57 Dirty: 3849688 kB
2023-11-02 21:31:07 Dirty: 3761132 kB
2023-11-02 21:31:17 Dirty: 3846216 kB
2023-11-02 21:31:27 Dirty: 3855060 kB
2023-11-02 21:31:37 Dirty: 3902212 kB
2023-11-02 21:31:47 Dirty: 4173524 kB
2023-11-02 21:31:57 Dirty: 3849856 kB
2023-11-02 21:32:07 Dirty: 3940632 kB
2023-11-02 21:32:17 Dirty: 2212008 kB
2023-11-02 21:32:27 Dirty: 244 kB
2023-11-02 21:32:37 Dirty: 288 kB
2023-11-02 21:32:47 Dirty: 60 kB
rsync'ing a large tree into this same fs is different.
When I killed it, less than 3GB had been copied: mostly small files in many
directories, something like 92,059 files using 2,519MB.
Note how slowly the dirty blocks are cleared.
2023-11-02 21:36:28 Dirty: 296 kB
2023-11-02 21:36:38 Dirty: 277412 kB
2023-11-02 21:36:48 Dirty: 403928 kB
2023-11-02 21:36:58 Dirty: 606916 kB
2023-11-02 21:37:08 Dirty: 753888 kB
2023-11-02 21:37:18 Dirty: 641624 kB
2023-11-02 21:37:28 Dirty: 744560 kB
2023-11-02 21:37:38 Dirty: 946864 kB
2023-11-02 21:37:48 Dirty: 1355964 kB
2023-11-02 21:37:58 Dirty: 2365632 kB
2023-11-02 21:38:08 Dirty: 2451948 kB
### at this point the rsync was cancelled as I see the kthread start
2023-11-02 21:38:18 Dirty: 2451752 kB
2023-11-02 21:38:28 Dirty: 2440496 kB
2023-11-02 21:38:38 Dirty: 2440308 kB
2023-11-02 21:38:48 Dirty: 2440136 kB
2023-11-02 21:38:58 Dirty: 2440036 kB
2023-11-02 21:39:08 Dirty: 2440240 kB
2023-11-02 21:39:18 Dirty: 2405768 kB
2023-11-02 21:39:28 Dirty: 2405784 kB
2023-11-02 21:39:38 Dirty: 2406012 kB
2023-11-02 21:39:48 Dirty: 2405908 kB
2023-11-02 21:39:58 Dirty: 2405848 kB
2023-11-02 21:40:08 Dirty: 2405876 kB
2023-11-02 21:40:18 Dirty: 2405704 kB
2023-11-02 21:40:28 Dirty: 2405628 kB
2023-11-02 21:40:38 Dirty: 2405544 kB
2023-11-02 21:40:48 Dirty: 2405484 kB
2023-11-02 21:40:58 Dirty: 2405416 kB
2023-11-02 21:41:08 Dirty: 2405412 kB
2023-11-02 21:41:18 Dirty: 2405240 kB
2023-11-02 21:41:28 Dirty: 2405148 kB
2023-11-02 21:41:38 Dirty: 2405132 kB
2023-11-02 21:41:48 Dirty: 2404428 kB
2023-11-02 21:41:58 Dirty: 2405056 kB
2023-11-02 21:42:08 Dirty: 2404944 kB
2023-11-02 21:42:18 Dirty: 2404904 kB
2023-11-02 21:42:28 Dirty: 2404316 kB
2023-11-02 21:42:38 Dirty: 2395340 kB
2023-11-02 21:42:48 Dirty: 2394540 kB
2023-11-02 21:42:58 Dirty: 2394368 kB
2023-11-02 21:43:08 Dirty: 2394520 kB
2023-11-02 21:43:18 Dirty: 2394132 kB
2023-11-02 21:43:28 Dirty: 2394032 kB
2023-11-02 21:43:38 Dirty: 2394276 kB
2023-11-02 21:43:48 Dirty: 2386960 kB
2023-11-02 21:43:58 Dirty: 2387420 kB
2023-11-02 21:44:08 Dirty: 2386620 kB
2023-11-02 21:44:18 Dirty: 2386828 kB
2023-11-02 21:44:28 Dirty: 2386104 kB
2023-11-02 21:44:38 Dirty: 2386328 kB
2023-11-02 21:44:48 Dirty: 2382520 kB
2023-11-02 21:44:58 Dirty: 2382024 kB
2023-11-02 21:45:08 Dirty: 2381344 kB
2023-11-02 21:45:18 Dirty: 2380776 kB
2023-11-02 21:45:28 Dirty: 2380424 kB
2023-11-02 21:45:38 Dirty: 2379672 kB
2023-11-02 21:45:48 Dirty: 2380180 kB
2023-11-02 21:45:58 Dirty: 2373272 kB
2023-11-02 21:46:08 Dirty: 2372416 kB
2023-11-02 21:46:18 Dirty: 2372052 kB
2023-11-02 21:46:28 Dirty: 2372176 kB
... and so on
At the same time I see the kthread in top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1491995 root 20 0 0 0 0 R 97.3 0.0 12:50.67 kworker/u16:2+flush-9:127
And iostat of md127 shows slow writing like this:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
21:59:52 md127 3.50 0.00 30.40 0.00 0 304 0
22:00:02 md127 3.10 0.00 31.20 0.00 0 312 0
22:00:12 md127 2.00 0.00 64.40 0.00 0 644 0
22:00:22 md127 3.40 0.00 22.80 0.00 0 228 0
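Reducing the two Dirty logs above to drain rates (endpoint values and timestamps copied from the logs, intervals of 20 s and 490 s worked out by hand):

```python
# Drain rates from the two Dirty logs above. The dd test cleared its cache
# between 21:32:07 and 21:32:27 (20 s); after the rsync was cancelled the
# cache drained from 21:38:18 to 21:46:28 (490 s) with almost no progress.
dd_rate_kb_s = (3940632 - 244) / 20            # dd test endpoints
rsync_rate_kb_s = (2451752 - 2372176) / 490    # rsync-tail endpoints

print(f"dd drain:    {dd_rate_kb_s / 1024:8.1f} MB/s")
print(f"rsync drain: {rsync_rate_kb_s:8.1f} kB/s")
```

Roughly 190 MB/s versus about 160 kB/s: a factor of over a thousand between the two workloads on the same array, which matches the iostat figures above.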
So, why the slow clearing of the dirty blocks?
I will leave the machine to clear it into the night...
BTW, if I try to shut down in this situation, it will get stuck, and if I force a sysrq boot I get
an array of all spares on restart. At least this is what happened twice already.
HTH
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-11-02 11:27 ` eyal
@ 2023-11-02 11:57 ` Roger Heflin
2023-11-02 13:05 ` eyal
2023-11-02 12:29 ` eyal
1 sibling, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-02 11:57 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
You need the iostat data from the disks under md127 to see what
mdraid is having to do internally. Get the list of sd devices and
do iostat -x | grep -E 'Device|sdx|sdy|sdz' (putting into the list
the whole non-partitioned sd devices in your array).
The slowness of clearing dirty and what perf top showed (ext4* calls)
indicates it is creating entries for small files and putting data in
those files. Creating entries for small files requires a lot of both
read and write seeks and seeks are expensive. At most I would expect
you to be able to rsync/copy around 30-50 files/second (when the array
is healthy and once the size gets small enough the mb/sec does not
matter anymore as the seeks dominate).
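The 30-50 files/second figure falls out of simple seek arithmetic (the seek count per file and seek time below are assumptions for a spinning-disk array, not measurements):

```python
# Seek-bound small-file throughput: if each tiny file costs a handful of
# seeks (directory entry, inode, block bitmap, data) at typical spinning-
# disk seek times, the achievable rate is tens of files per second no
# matter how high the array's streaming bandwidth is. Both parameters
# below are assumptions.
def files_per_second(seeks_per_file, seek_ms):
    return 1000.0 / (seeks_per_file * seek_ms)

for seeks, ms in ((4, 8.0), (6, 8.0), (4, 5.0)):
    print(f"{seeks} seeks x {ms} ms -> {files_per_second(seeks, ms):5.1f} files/s")
```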
On my array I do a nightly copy in of data written during the day to a
SSD (security cams). And without any array issues that makes my
array really busy. Anything wrong with the array and that process
takes several times as long.
On Thu, Nov 2, 2023 at 6:34 AM <eyal@eyal.emu.id.au> wrote:
>
> On 02/11/2023 19.34, Johannes Truschnigg wrote:
> > Hi list,
> >
> > for the record, I do not think that any of the observations the OP made can be
> > explained by non-pathological phenomena/patterns of behavior. Something is
> > very clearly wrong with how this system behaves (the reported figures do not
> > at all match the expected performance of even a degraded RAID6 array in my
> > experience) and how data written to the filesystem apparently fails to make it
> > into the backing devices in acceptable time.
> >
> > The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> > hardware failure", I think.
> >
> > Still, it'd be interesting to know what happens when writes to the array thru
> > the file system are performed with O_DIRECT in effect, i.e., using `dd
> > oflag=direct status=progress ...` - does this yield any observable (via
> > `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
> > report this way with varying block sizes? Are there no concerning messages in
> > the debug ringbuffer (`dmesg`) at this time?
> >
> > I'm not sure how we'd best learn more about what the participating busy kernel
> > threads (Fedora 38 might have more convenient devices up its sleeve than
> > ftrace?) are spending their time on in particular, but I think that's what's
> > needed to be known to pin down the underlying cause of the problem.
>
> To clarify what may be lost on this thread.
>
> I can dd to this fs just fine; the Dirty count does not go above about 4GB. The machine has 32GB of RAM.
> It is cleared very fast. Tested writing 100GB.
>
> The Dirty blocks rise to the max then drain quickly.
>
> 2023-11-02 21:28:27 Dirty: 236 kB
> 2023-11-02 21:28:37 Dirty: 8 kB
> 2023-11-02 21:28:47 Dirty: 40 kB
> 2023-11-02 21:28:57 Dirty: 68 kB
> 2023-11-02 21:29:07 Dirty: 216 kB
> 2023-11-02 21:29:17 Dirty: 364 kB
> 2023-11-02 21:29:27 Dirty: 20 kB
> 2023-11-02 21:29:37 Dirty: 48 kB
> 2023-11-02 21:29:47 Dirty: 24 kB
> 2023-11-02 21:29:57 Dirty: 36 kB
> 2023-11-02 21:30:07 Dirty: 28 kB
> 2023-11-02 21:30:17 Dirty: 52 kB
> 2023-11-02 21:30:27 Dirty: 36 kB
> 2023-11-02 21:30:37 Dirty: 4112980 kB
> 2023-11-02 21:30:47 Dirty: 3772396 kB
> 2023-11-02 21:30:57 Dirty: 3849688 kB
> 2023-11-02 21:31:07 Dirty: 3761132 kB
> 2023-11-02 21:31:17 Dirty: 3846216 kB
> 2023-11-02 21:31:27 Dirty: 3855060 kB
> 2023-11-02 21:31:37 Dirty: 3902212 kB
> 2023-11-02 21:31:47 Dirty: 4173524 kB
> 2023-11-02 21:31:57 Dirty: 3849856 kB
> 2023-11-02 21:32:07 Dirty: 3940632 kB
> 2023-11-02 21:32:17 Dirty: 2212008 kB
> 2023-11-02 21:32:27 Dirty: 244 kB
> 2023-11-02 21:32:37 Dirty: 288 kB
> 2023-11-02 21:32:47 Dirty: 60 kB
>
> rsync'ing a large tree into this same fs is different.
> When killed less than 3GB were copied. Mostly small files in many directories.
> Something like 92,059 files using 2,519MB.
>
> Note how slowly the dirty blocks are cleared.
>
> 2023-11-02 21:36:28 Dirty: 296 kB
> 2023-11-02 21:36:38 Dirty: 277412 kB
> 2023-11-02 21:36:48 Dirty: 403928 kB
> 2023-11-02 21:36:58 Dirty: 606916 kB
> 2023-11-02 21:37:08 Dirty: 753888 kB
> 2023-11-02 21:37:18 Dirty: 641624 kB
> 2023-11-02 21:37:28 Dirty: 744560 kB
> 2023-11-02 21:37:38 Dirty: 946864 kB
> 2023-11-02 21:37:48 Dirty: 1355964 kB
> 2023-11-02 21:37:58 Dirty: 2365632 kB
> 2023-11-02 21:38:08 Dirty: 2451948 kB
> ### at this point the rsync was cancelled as I see the kthread start
> 2023-11-02 21:38:18 Dirty: 2451752 kB
> 2023-11-02 21:38:28 Dirty: 2440496 kB
> 2023-11-02 21:38:38 Dirty: 2440308 kB
> 2023-11-02 21:38:48 Dirty: 2440136 kB
> 2023-11-02 21:38:58 Dirty: 2440036 kB
> 2023-11-02 21:39:08 Dirty: 2440240 kB
> 2023-11-02 21:39:18 Dirty: 2405768 kB
> 2023-11-02 21:39:28 Dirty: 2405784 kB
> 2023-11-02 21:39:38 Dirty: 2406012 kB
> 2023-11-02 21:39:48 Dirty: 2405908 kB
> 2023-11-02 21:39:58 Dirty: 2405848 kB
> 2023-11-02 21:40:08 Dirty: 2405876 kB
> 2023-11-02 21:40:18 Dirty: 2405704 kB
> 2023-11-02 21:40:28 Dirty: 2405628 kB
> 2023-11-02 21:40:38 Dirty: 2405544 kB
> 2023-11-02 21:40:48 Dirty: 2405484 kB
> 2023-11-02 21:40:58 Dirty: 2405416 kB
> 2023-11-02 21:41:08 Dirty: 2405412 kB
> 2023-11-02 21:41:18 Dirty: 2405240 kB
> 2023-11-02 21:41:28 Dirty: 2405148 kB
> 2023-11-02 21:41:38 Dirty: 2405132 kB
> 2023-11-02 21:41:48 Dirty: 2404428 kB
> 2023-11-02 21:41:58 Dirty: 2405056 kB
> 2023-11-02 21:42:08 Dirty: 2404944 kB
> 2023-11-02 21:42:18 Dirty: 2404904 kB
> 2023-11-02 21:42:28 Dirty: 2404316 kB
> 2023-11-02 21:42:38 Dirty: 2395340 kB
> 2023-11-02 21:42:48 Dirty: 2394540 kB
> 2023-11-02 21:42:58 Dirty: 2394368 kB
> 2023-11-02 21:43:08 Dirty: 2394520 kB
> 2023-11-02 21:43:18 Dirty: 2394132 kB
> 2023-11-02 21:43:28 Dirty: 2394032 kB
> 2023-11-02 21:43:38 Dirty: 2394276 kB
> 2023-11-02 21:43:48 Dirty: 2386960 kB
> 2023-11-02 21:43:58 Dirty: 2387420 kB
> 2023-11-02 21:44:08 Dirty: 2386620 kB
> 2023-11-02 21:44:18 Dirty: 2386828 kB
> 2023-11-02 21:44:28 Dirty: 2386104 kB
> 2023-11-02 21:44:38 Dirty: 2386328 kB
> 2023-11-02 21:44:48 Dirty: 2382520 kB
> 2023-11-02 21:44:58 Dirty: 2382024 kB
> 2023-11-02 21:45:08 Dirty: 2381344 kB
> 2023-11-02 21:45:18 Dirty: 2380776 kB
> 2023-11-02 21:45:28 Dirty: 2380424 kB
> 2023-11-02 21:45:38 Dirty: 2379672 kB
> 2023-11-02 21:45:48 Dirty: 2380180 kB
> 2023-11-02 21:45:58 Dirty: 2373272 kB
> 2023-11-02 21:46:08 Dirty: 2372416 kB
> 2023-11-02 21:46:18 Dirty: 2372052 kB
> 2023-11-02 21:46:28 Dirty: 2372176 kB
> ... and so on
>
> At the same time I see the kthread in top:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 1491995 root 20 0 0 0 0 R 97.3 0.0 12:50.67 kworker/u16:2+flush-9:127
>
> And iostat of md127 shows slow writing like this:
> Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> 21:59:52 md127 3.50 0.00 30.40 0.00 0 304 0
> 22:00:02 md127 3.10 0.00 31.20 0.00 0 312 0
> 22:00:12 md127 2.00 0.00 64.40 0.00 0 644 0
> 22:00:22 md127 3.40 0.00 22.80 0.00 0 228 0
>
> So, why the slow clearing of the dirty blocks?
>
> I will leave the machine to clear it into the night...
>
> BTW, if I try to shutdown in this situation, it will get stuck, and if I force a sysrq boot I get
> an array of all spares on restart. At least this is what happened twice already.
>
> HTH
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: problem with recovered array
2023-11-02 11:27 ` eyal
2023-11-02 11:57 ` Roger Heflin
@ 2023-11-02 12:29 ` eyal
2023-11-02 12:46 ` Reindl Harald
1 sibling, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-02 12:29 UTC (permalink / raw)
To: linux-raid
See update further down.
On 02/11/2023 22.27, eyal@eyal.emu.id.au wrote:
> On 02/11/2023 19.34, Johannes Truschnigg wrote:
>> Hi list,
>>
>> for the record, I do not think that any of the observations the OP made can be
>> explained by non-pathological phenomena/patterns of behavior. Something is
>> very clearly wrong with how this system behaves (the reported figures do not
>> at all match the expected performance of even a degraded RAID6 array in my
>> experience) and how data written to the filesystem apparently fails to make it
>> into the backing devices in acceptable time.
>>
>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>> hardware failure", I think.
>>
>> Still, it'd be interesting to know what happens when writes to the array thru
>> the file system are performed with O_DIRECT in effect, i.e., using `dd
>> oflag=direct status=progress ...` - does this yield any observable (via
>> `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
>> report this way with varying block sizes? Are there no concerning messages in
>> the debug ringbuffer (`dmesg`) at this time?
>>
>> I'm not sure how we'd best learn more about what the participating busy kernel
>> threads (Fedora 38 might have more convenient devices up its sleeve than
>> ftrace?) are spending their time on in particular, but I think that's what's
>> needed to be known to pin down the underlying cause of the problem.
>
> To clarify what may be lost on this thread.
>
> I can dd to this fs just fine; the Dirty count does not go above about 4GB. The machine has 32GB of RAM.
> It is cleared very fast. Tested writing 100GB.
>
> The Dirty blocks rise to the max then drain quickly.
>
> 2023-11-02 21:28:27 Dirty: 236 kB
> 2023-11-02 21:28:37 Dirty: 8 kB
> 2023-11-02 21:28:47 Dirty: 40 kB
> 2023-11-02 21:28:57 Dirty: 68 kB
> 2023-11-02 21:29:07 Dirty: 216 kB
> 2023-11-02 21:29:17 Dirty: 364 kB
> 2023-11-02 21:29:27 Dirty: 20 kB
> 2023-11-02 21:29:37 Dirty: 48 kB
> 2023-11-02 21:29:47 Dirty: 24 kB
> 2023-11-02 21:29:57 Dirty: 36 kB
> 2023-11-02 21:30:07 Dirty: 28 kB
> 2023-11-02 21:30:17 Dirty: 52 kB
> 2023-11-02 21:30:27 Dirty: 36 kB
> 2023-11-02 21:30:37 Dirty: 4112980 kB
> 2023-11-02 21:30:47 Dirty: 3772396 kB
> 2023-11-02 21:30:57 Dirty: 3849688 kB
> 2023-11-02 21:31:07 Dirty: 3761132 kB
> 2023-11-02 21:31:17 Dirty: 3846216 kB
> 2023-11-02 21:31:27 Dirty: 3855060 kB
> 2023-11-02 21:31:37 Dirty: 3902212 kB
> 2023-11-02 21:31:47 Dirty: 4173524 kB
> 2023-11-02 21:31:57 Dirty: 3849856 kB
> 2023-11-02 21:32:07 Dirty: 3940632 kB
> 2023-11-02 21:32:17 Dirty: 2212008 kB
> 2023-11-02 21:32:27 Dirty: 244 kB
> 2023-11-02 21:32:37 Dirty: 288 kB
> 2023-11-02 21:32:47 Dirty: 60 kB
>
> rsync'ing a large tree into this same fs is different.
> When killed less than 3GB were copied. Mostly small files in many directories.
> Something like 92,059 files using 2,519MB.
>
> Note how slowly the dirty blocks are cleared.
>
> 2023-11-02 21:36:28 Dirty: 296 kB
> 2023-11-02 21:36:38 Dirty: 277412 kB
> 2023-11-02 21:36:48 Dirty: 403928 kB
> 2023-11-02 21:36:58 Dirty: 606916 kB
> 2023-11-02 21:37:08 Dirty: 753888 kB
> 2023-11-02 21:37:18 Dirty: 641624 kB
> 2023-11-02 21:37:28 Dirty: 744560 kB
> 2023-11-02 21:37:38 Dirty: 946864 kB
> 2023-11-02 21:37:48 Dirty: 1355964 kB
> 2023-11-02 21:37:58 Dirty: 2365632 kB
> 2023-11-02 21:38:08 Dirty: 2451948 kB
> ### at this point the rsync was cancelled as I see the kthread start
> 2023-11-02 21:38:18 Dirty: 2451752 kB
> 2023-11-02 21:38:28 Dirty: 2440496 kB
> 2023-11-02 21:38:38 Dirty: 2440308 kB
> 2023-11-02 21:38:48 Dirty: 2440136 kB
> 2023-11-02 21:38:58 Dirty: 2440036 kB
> 2023-11-02 21:39:08 Dirty: 2440240 kB
> 2023-11-02 21:39:18 Dirty: 2405768 kB
> 2023-11-02 21:39:28 Dirty: 2405784 kB
> 2023-11-02 21:39:38 Dirty: 2406012 kB
> 2023-11-02 21:39:48 Dirty: 2405908 kB
> 2023-11-02 21:39:58 Dirty: 2405848 kB
> 2023-11-02 21:40:08 Dirty: 2405876 kB
> 2023-11-02 21:40:18 Dirty: 2405704 kB
> 2023-11-02 21:40:28 Dirty: 2405628 kB
> 2023-11-02 21:40:38 Dirty: 2405544 kB
> 2023-11-02 21:40:48 Dirty: 2405484 kB
> 2023-11-02 21:40:58 Dirty: 2405416 kB
> 2023-11-02 21:41:08 Dirty: 2405412 kB
> 2023-11-02 21:41:18 Dirty: 2405240 kB
> 2023-11-02 21:41:28 Dirty: 2405148 kB
> 2023-11-02 21:41:38 Dirty: 2405132 kB
> 2023-11-02 21:41:48 Dirty: 2404428 kB
> 2023-11-02 21:41:58 Dirty: 2405056 kB
> 2023-11-02 21:42:08 Dirty: 2404944 kB
> 2023-11-02 21:42:18 Dirty: 2404904 kB
> 2023-11-02 21:42:28 Dirty: 2404316 kB
> 2023-11-02 21:42:38 Dirty: 2395340 kB
> 2023-11-02 21:42:48 Dirty: 2394540 kB
> 2023-11-02 21:42:58 Dirty: 2394368 kB
> 2023-11-02 21:43:08 Dirty: 2394520 kB
> 2023-11-02 21:43:18 Dirty: 2394132 kB
> 2023-11-02 21:43:28 Dirty: 2394032 kB
> 2023-11-02 21:43:38 Dirty: 2394276 kB
> 2023-11-02 21:43:48 Dirty: 2386960 kB
> 2023-11-02 21:43:58 Dirty: 2387420 kB
> 2023-11-02 21:44:08 Dirty: 2386620 kB
> 2023-11-02 21:44:18 Dirty: 2386828 kB
> 2023-11-02 21:44:28 Dirty: 2386104 kB
> 2023-11-02 21:44:38 Dirty: 2386328 kB
> 2023-11-02 21:44:48 Dirty: 2382520 kB
> 2023-11-02 21:44:58 Dirty: 2382024 kB
> 2023-11-02 21:45:08 Dirty: 2381344 kB
> 2023-11-02 21:45:18 Dirty: 2380776 kB
> 2023-11-02 21:45:28 Dirty: 2380424 kB
> 2023-11-02 21:45:38 Dirty: 2379672 kB
> 2023-11-02 21:45:48 Dirty: 2380180 kB
> 2023-11-02 21:45:58 Dirty: 2373272 kB
> 2023-11-02 21:46:08 Dirty: 2372416 kB
> 2023-11-02 21:46:18 Dirty: 2372052 kB
> 2023-11-02 21:46:28 Dirty: 2372176 kB
> ... and so on
Interestingly, after about 1.5 hours, when about 1GB of dirty blocks remained, the whole lot was cleared quickly:
2023-11-02 23:08:49 Dirty: 1018924 kB
2023-11-02 23:08:59 Dirty: 1018640 kB
2023-11-02 23:09:09 Dirty: 1018732 kB
2023-11-02 23:09:19 Dirty: 592196 kB
2023-11-02 23:09:29 Dirty: 1188 kB
2023-11-02 23:09:39 Dirty: 944 kB
2023-11-02 23:09:49 Dirty: 804 kB
2023-11-02 23:09:59 Dirty: 60 kB
And iostat saw it too:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
23:09:12 md127 2.80 0.00 40.40 0.00 0 404 0
23:09:22 md127 1372.33 0.80 47026.17 0.00 8 470732 0
23:09:32 md127 75.80 0.80 54763.20 0.00 8 547632 0
23:09:42 md127 0.00 0.00 0.00 0.00 0 0 0
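For reference, the "Dirty:" figures in these logs come from sampling /proc/meminfo; a minimal sketch of such a sampler (the count/interval defaults and the parameterised meminfo path are my additions, not from the thread):

```shell
#!/bin/sh
# Sample the Dirty line from /proc/meminfo at a fixed interval, with a
# timestamp prefix, matching the log format used in this thread.
# Args: sample count, interval in seconds, meminfo path (parameterised
# so it can be pointed at a test fixture; defaults to /proc/meminfo).
sample_dirty() {
    count=$1
    interval=$2
    meminfo=${3:-/proc/meminfo}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
            "$(grep '^Dirty:' "$meminfo")"
        i=$((i + 1))
        sleep "$interval"
    done
}

sample_dirty 3 1
```

Run alongside iostat, this makes it easy to correlate a stuck Dirty plateau with the per-device write rates.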
> At the same time I see the kthread in top:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 1491995 root 20 0 0 0 0 R 97.3 0.0 12:50.67 kworker/u16:2+flush-9:127
>
> And iostat of md127 shows slow writing like this:
> Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> 21:59:52 md127 3.50 0.00 30.40 0.00 0 304 0
> 22:00:02 md127 3.10 0.00 31.20 0.00 0 312 0
> 22:00:12 md127 2.00 0.00 64.40 0.00 0 644 0
> 22:00:22 md127 3.40 0.00 22.80 0.00 0 228 0
>
> So, why the slow clearing of the dirty blocks?
>
> I will leave the machine to clear it into the night...
>
> BTW, if I try to shutdown in this situation, it will get stuck, and if I force a sysrq boot I get
> an array of all spares on restart. At least this is what happened twice already.
>
> HTH
>
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-11-02 12:29 ` eyal
@ 2023-11-02 12:46 ` Reindl Harald
0 siblings, 0 replies; 35+ messages in thread
From: Reindl Harald @ 2023-11-02 12:46 UTC (permalink / raw)
To: linux-raid
On 02.11.23 at 13:29, eyal@eyal.emu.id.au wrote:
> See update further down.
>
> Interestingly, after about 1.5 hours, when there were 1GB of dirty
> blocks, the whole lot was cleared fast:
>
> 2023-11-02 23:08:49 Dirty: 1018924 kB
> 2023-11-02 23:08:59 Dirty: 1018640 kB
> 2023-11-02 23:09:09 Dirty: 1018732 kB
> 2023-11-02 23:09:19 Dirty: 592196 kB
> 2023-11-02 23:09:29 Dirty: 1188 kB
> 2023-11-02 23:09:39 Dirty: 944 kB
> 2023-11-02 23:09:49 Dirty: 804 kB
> 2023-11-02 23:09:59 Dirty: 60 kB
>
> And iostat saw it too:
> Device               tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 23:09:12 md127      2.80         0.00        40.40         0.00          0        404          0
> 23:09:22 md127   1372.33         0.80     47026.17         0.00          8     470732          0
> 23:09:32 md127     75.80         0.80     54763.20         0.00          8     547632          0
> 23:09:42 md127      0.00         0.00         0.00         0.00          0          0          0
it's pretty easy: RAID6 behaves terribly in a degraded state, especially
*with rotating disks*, and for the sake of god, as long as it is degraded and
not fully rebuilt you should avoid any load which isn't strictly necessary
the chance that another disk dies increases, especially in the
rebuild phase, and then start to pray, because the next unrecoverable read
error will kill the array
a RAID10 couldn't care less at that point because it doesn't need to seek
like crazy on the drives
---------
what I don't understand is why people don't have replacement disks on
the shelf for every array they operate: replace the drive and leave it
in peace until the rebuild is finished
I am responsible for 7 machines at 5 locations with mdadm RAIDs of
different sizes, and there is a replacement disk for each of them - if a
disk dies or smartd complains, it's replaced and the next drive is
ordered
* Re: problem with recovered array
2023-11-02 11:57 ` Roger Heflin
@ 2023-11-02 13:05 ` eyal
2023-11-02 17:05 ` Roger Heflin
0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-02 13:05 UTC (permalink / raw)
To: linux-raid
On 02/11/2023 22.57, Roger Heflin wrote:
> you need the iostat data from the disks under md127 to see what
> mdraid is having to do to internally. Get the list of sd devices and
> do iostat -x | grep -E 'Device|sdX|sxy|sdz' (putting into the list
> the whole non-partitioned sd devices in your array)
I do have iostat running on the components too, and the writing mostly reflects what md127 sees.
However, while there are no reads on md127, there are often some on the components,
very visible in the last two iostat minutes. Is there some RMW (read-modify-write) going on, on the components?
Here is a snippet, when things were slow:
## meminfo data:
2023-11-02 23:05:07 Dirty: 1037680 kB
2023-11-02 23:05:17 Dirty: 1037288 kB
2023-11-02 23:05:27 Dirty: 1036560 kB
2023-11-02 23:05:37 Dirty: 1036064 kB
2023-11-02 23:05:47 Dirty: 1035720 kB
2023-11-02 23:05:57 Dirty: 1035732 kB
2023-11-02 23:06:07 Dirty: 1035860 kB
2023-11-02 23:06:17 Dirty: 1035708 kB
2023-11-02 23:06:27 Dirty: 1035648 kB
2023-11-02 23:06:37 Dirty: 1035388 kB
2023-11-02 23:06:47 Dirty: 1035336 kB
2023-11-02 23:06:57 Dirty: 1035248 kB
2023-11-02 23:07:07 Dirty: 1037312 kB
2023-11-02 23:07:17 Dirty: 1037172 kB
2023-11-02 23:07:27 Dirty: 1035224 kB
2023-11-02 23:07:37 Dirty: 1035152 kB
2023-11-02 23:07:47 Dirty: 1035064 kB
2023-11-02 23:07:57 Dirty: 1025316 kB
2023-11-02 23:08:07 Dirty: 1020000 kB
2023-11-02 23:08:17 Dirty: 1019628 kB
2023-11-02 23:08:27 Dirty: 1019784 kB
2023-11-02 23:08:37 Dirty: 1019284 kB
2023-11-02 23:08:47 Dirty: 1018908 kB
2023-11-02 23:08:57 Dirty: 1018700 kB
## iostat
23:05:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
23:05:08 md127 2.22 0.00 14.73 0.00 0 884 0
23:05:08 sdb 1.13 1.93 12.10 0.00 116 726 0
23:05:08 sdc 1.32 0.00 16.17 0.00 0 970 0
23:05:08 sdd 1.18 0.00 5.57 0.00 0 334 0
23:05:08 sde 0.83 0.00 1.50 0.00 0 90 0
23:05:08 sdf 0.83 0.00 1.50 0.00 0 90 0
23:05:08 sdg 1.27 4.47 5.97 0.00 268 358 0
23:06:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
23:06:08 md127 2.75 0.00 36.20 0.00 0 2172 0
23:06:08 sdb 1.30 4.47 22.40 0.00 268 1344 0
23:06:08 sdc 1.13 0.00 14.40 0.00 0 864 0
23:06:08 sdd 1.13 0.80 14.20 0.00 48 852 0
23:06:08 sde 0.82 8.53 9.73 0.00 512 584 0
23:06:08 sdf 0.87 8.53 7.13 0.00 512 428 0
23:06:08 sdg 0.75 8.53 1.20 0.00 512 72 0
23:07:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
23:07:08 md127 2.30 0.00 10.93 0.00 0 656 0
23:07:08 sdb 1.17 2.13 5.82 0.00 128 349 0
23:07:08 sdc 1.12 0.00 5.35 0.00 0 321 0
23:07:08 sdd 1.35 4.33 9.68 0.00 260 581 0
23:07:08 sde 1.00 4.33 5.68 0.00 260 341 0
23:07:08 sdf 0.82 0.00 3.95 0.00 0 237 0
23:07:08 sdg 1.00 4.33 5.68 0.00 260 341 0
23:08:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
23:08:08 md127 13.55 0.13 238.73 0.00 8 14324 0
23:08:08 sdb 1.92 36.47 72.25 0.00 2188 4335 0
23:08:08 sdc 1.78 27.67 66.38 0.00 1660 3983 0
23:08:08 sdd 2.25 47.67 95.52 0.00 2860 5731 0
23:08:08 sde 2.03 43.87 77.72 0.00 2632 4663 0
23:08:08 sdf 1.73 49.20 59.65 0.00 2952 3579 0
23:08:08 sdg 1.72 29.80 56.45 0.00 1788 3387 0
23:09:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
23:09:08 md127 3.22 0.00 59.47 0.00 0 3568 0
23:09:08 sdb 1.88 26.60 36.00 0.00 1596 2160 0
23:09:08 sdc 1.82 32.53 13.60 0.00 1952 816 0
23:09:08 sdd 1.83 17.07 41.80 0.00 1024 2508 0
23:09:08 sde 1.10 9.07 33.40 0.00 544 2004 0
23:09:08 sdf 0.93 21.93 3.80 0.00 1316 228 0
23:09:08 sdg 0.90 15.20 6.73 0.00 912 404 0
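Component reads with no reads on md127 would be consistent with read-modify-write of partial stripes. One place to look is the raid456 stripe-cache state in sysfs; a small sketch (md127 follows this thread, but the device name and sysfs path are assumptions for any other system):

```shell
#!/bin/sh
# Print raid4/5/6 stripe-cache state from sysfs. The attribute names
# (stripe_cache_size, stripe_cache_active, preread_bypass_threshold)
# exist for md raid456 arrays; a missing file prints "n/a" instead.
# Arg: the array's md sysfs directory.
show_stripe_state() {
    dir=$1
    for f in stripe_cache_size stripe_cache_active preread_bypass_threshold; do
        printf '%s: %s\n' "$f" "$(cat "$dir/$f" 2>/dev/null || echo n/a)"
    done
}

show_stripe_state /sys/block/md127/md
```

A persistently high stripe_cache_active while the array writes only tens of kB/s would point at stripes stuck waiting on preread.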
> The slowness of clearing dirty and what perf top showed (ext4* calls)
> indicates it is creating entries for small files and putting data in
> those files. Creating entries for small files requires a lot of both
> read and write seeks and seeks are expensive. At most I would expect
> you to be able to rsync/copy around 30-50 files/second (when the array
> is healthy and once the size gets small enough the mb/sec does not
> matter anymore as the seeks dominate).
My experience is that the array can handle all of this easily, and I never noticed
any hiccups.
> On my array I do a nightly copy in of data written during the day to a
> SSD (security cams). And without any array issues that makes my
> array really busy. Anything wrong with the array and that process
> takes several times as long.
What I see here is that a few seconds of usual work takes a few hours to perform.
This does not look like a merely stressed array; something more substantial is going on.
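One way to see where the busy flush kworker is spending its time is its kernel stack. A sketch (PID 1491995 comes from the top output quoted above and will differ on a live system; reading /proc/<pid>/stack generally requires root):

```shell
#!/bin/sh
# Dump the kernel stack of a (k)thread, e.g. the flush kworker seen in
# top. Falls back to a message when the stack is unreadable (the PID
# does not exist, or we are not running as root).
dump_kthread_stack() {
    pid=$1
    cat "/proc/$pid/stack" 2>/dev/null \
        || echo "no readable stack for pid $pid (not running, or need root?)"
}

dump_kthread_stack 1491995
```

Repeated samples of the stack, or `perf top -p <pid>`, would show whether the thread is spinning in ext4/writeback code or waiting on md.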
> On Thu, Nov 2, 2023 at 6:34 AM <eyal@eyal.emu.id.au> wrote:
>>
>> On 02/11/2023 19.34, Johannes Truschnigg wrote:
>>> Hi list,
>>>
>>> for the record, I do not think that any of the observations the OP made can be
>>> explained by non-pathological phenomena/patterns of behavior. Something is
>>> very clearly wrong with how this system behaves (the reported figures do not
>>> at all match the expected performance of even a degraded RAID6 array in my
>>> experience) and how data written to the filesystem apparently fails to make it
>>> into the backing devices in acceptable time.
>>>
>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>>> hardware failure", I think.
>>>
>>> Still, it'd be interesting to know what happens when writes to the array thru
>>> the file system are performed with O_DIRECT in effect, i.e., using `dd
>>> oflag=direct status=progress ...` - does this yield any observable (via
>>> `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
>>> report this way with varying block sizes? Are there no concerning messages in
>>> the debug ringbuffer (`dmesg`) at this time?
>>>
>>> I'm not sure how we'd best learn more about what the participating busy kernel
>>> threads (Fedora 38 might have more convenient devices up its sleeve than
>>> ftrace?) are spending their time on in particular, but I think that's what's
>>> needed to be known to pin down the underlying cause of the problem.
>>
>> To clarify what may be lost on this thread.
>>
>> I can dd to this fs just fine; the Dirty count does not go above about 4GB. The machine has 32GB of RAM.
>> It is cleared very fast. Tested writing 100GB.
>>
>> The Dirty blocks rise to the max then drain quickly.
>>
>> 2023-11-02 21:28:27 Dirty: 236 kB
>> 2023-11-02 21:28:37 Dirty: 8 kB
>> 2023-11-02 21:28:47 Dirty: 40 kB
>> 2023-11-02 21:28:57 Dirty: 68 kB
>> 2023-11-02 21:29:07 Dirty: 216 kB
>> 2023-11-02 21:29:17 Dirty: 364 kB
>> 2023-11-02 21:29:27 Dirty: 20 kB
>> 2023-11-02 21:29:37 Dirty: 48 kB
>> 2023-11-02 21:29:47 Dirty: 24 kB
>> 2023-11-02 21:29:57 Dirty: 36 kB
>> 2023-11-02 21:30:07 Dirty: 28 kB
>> 2023-11-02 21:30:17 Dirty: 52 kB
>> 2023-11-02 21:30:27 Dirty: 36 kB
>> 2023-11-02 21:30:37 Dirty: 4112980 kB
>> 2023-11-02 21:30:47 Dirty: 3772396 kB
>> 2023-11-02 21:30:57 Dirty: 3849688 kB
>> 2023-11-02 21:31:07 Dirty: 3761132 kB
>> 2023-11-02 21:31:17 Dirty: 3846216 kB
>> 2023-11-02 21:31:27 Dirty: 3855060 kB
>> 2023-11-02 21:31:37 Dirty: 3902212 kB
>> 2023-11-02 21:31:47 Dirty: 4173524 kB
>> 2023-11-02 21:31:57 Dirty: 3849856 kB
>> 2023-11-02 21:32:07 Dirty: 3940632 kB
>> 2023-11-02 21:32:17 Dirty: 2212008 kB
>> 2023-11-02 21:32:27 Dirty: 244 kB
>> 2023-11-02 21:32:37 Dirty: 288 kB
>> 2023-11-02 21:32:47 Dirty: 60 kB
>>
>> rsync'ing a large tree into this same fs is different.
>> When killed less than 3GB were copied. Mostly small files in many directories.
>> Something like 92,059 files using 2,519MB.
>>
>> Note how slowly the dirty blocks are cleared.
>>
>> 2023-11-02 21:36:28 Dirty: 296 kB
>> 2023-11-02 21:36:38 Dirty: 277412 kB
>> 2023-11-02 21:36:48 Dirty: 403928 kB
>> 2023-11-02 21:36:58 Dirty: 606916 kB
>> 2023-11-02 21:37:08 Dirty: 753888 kB
>> 2023-11-02 21:37:18 Dirty: 641624 kB
>> 2023-11-02 21:37:28 Dirty: 744560 kB
>> 2023-11-02 21:37:38 Dirty: 946864 kB
>> 2023-11-02 21:37:48 Dirty: 1355964 kB
>> 2023-11-02 21:37:58 Dirty: 2365632 kB
>> 2023-11-02 21:38:08 Dirty: 2451948 kB
>> ### at this point the rsync was cancelled as I see the kthread start
>> 2023-11-02 21:38:18 Dirty: 2451752 kB
>> 2023-11-02 21:38:28 Dirty: 2440496 kB
>> 2023-11-02 21:38:38 Dirty: 2440308 kB
>> 2023-11-02 21:38:48 Dirty: 2440136 kB
>> 2023-11-02 21:38:58 Dirty: 2440036 kB
>> 2023-11-02 21:39:08 Dirty: 2440240 kB
>> 2023-11-02 21:39:18 Dirty: 2405768 kB
>> 2023-11-02 21:39:28 Dirty: 2405784 kB
>> 2023-11-02 21:39:38 Dirty: 2406012 kB
>> 2023-11-02 21:39:48 Dirty: 2405908 kB
>> 2023-11-02 21:39:58 Dirty: 2405848 kB
>> 2023-11-02 21:40:08 Dirty: 2405876 kB
>> 2023-11-02 21:40:18 Dirty: 2405704 kB
>> 2023-11-02 21:40:28 Dirty: 2405628 kB
>> 2023-11-02 21:40:38 Dirty: 2405544 kB
>> 2023-11-02 21:40:48 Dirty: 2405484 kB
>> 2023-11-02 21:40:58 Dirty: 2405416 kB
>> 2023-11-02 21:41:08 Dirty: 2405412 kB
>> 2023-11-02 21:41:18 Dirty: 2405240 kB
>> 2023-11-02 21:41:28 Dirty: 2405148 kB
>> 2023-11-02 21:41:38 Dirty: 2405132 kB
>> 2023-11-02 21:41:48 Dirty: 2404428 kB
>> 2023-11-02 21:41:58 Dirty: 2405056 kB
>> 2023-11-02 21:42:08 Dirty: 2404944 kB
>> 2023-11-02 21:42:18 Dirty: 2404904 kB
>> 2023-11-02 21:42:28 Dirty: 2404316 kB
>> 2023-11-02 21:42:38 Dirty: 2395340 kB
>> 2023-11-02 21:42:48 Dirty: 2394540 kB
>> 2023-11-02 21:42:58 Dirty: 2394368 kB
>> 2023-11-02 21:43:08 Dirty: 2394520 kB
>> 2023-11-02 21:43:18 Dirty: 2394132 kB
>> 2023-11-02 21:43:28 Dirty: 2394032 kB
>> 2023-11-02 21:43:38 Dirty: 2394276 kB
>> 2023-11-02 21:43:48 Dirty: 2386960 kB
>> 2023-11-02 21:43:58 Dirty: 2387420 kB
>> 2023-11-02 21:44:08 Dirty: 2386620 kB
>> 2023-11-02 21:44:18 Dirty: 2386828 kB
>> 2023-11-02 21:44:28 Dirty: 2386104 kB
>> 2023-11-02 21:44:38 Dirty: 2386328 kB
>> 2023-11-02 21:44:48 Dirty: 2382520 kB
>> 2023-11-02 21:44:58 Dirty: 2382024 kB
>> 2023-11-02 21:45:08 Dirty: 2381344 kB
>> 2023-11-02 21:45:18 Dirty: 2380776 kB
>> 2023-11-02 21:45:28 Dirty: 2380424 kB
>> 2023-11-02 21:45:38 Dirty: 2379672 kB
>> 2023-11-02 21:45:48 Dirty: 2380180 kB
>> 2023-11-02 21:45:58 Dirty: 2373272 kB
>> 2023-11-02 21:46:08 Dirty: 2372416 kB
>> 2023-11-02 21:46:18 Dirty: 2372052 kB
>> 2023-11-02 21:46:28 Dirty: 2372176 kB
>> ... and so on
>>
>> At the same time I see the kthread in top:
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 1491995 root 20 0 0 0 0 R 97.3 0.0 12:50.67 kworker/u16:2+flush-9:127
>>
>> And iostat of md127 shows slow writing like this:
>> Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
>> 21:59:52 md127 3.50 0.00 30.40 0.00 0 304 0
>> 22:00:02 md127 3.10 0.00 31.20 0.00 0 312 0
>> 22:00:12 md127 2.00 0.00 64.40 0.00 0 644 0
>> 22:00:22 md127 3.40 0.00 22.80 0.00 0 228 0
>>
>> So, why the slow clearing of the dirty blocks?
>>
>> I will leave the machine to clear it into the night...
>>
>> BTW, if I try to shutdown in this situation, it will get stuck, and if I force a sysrq boot I get
>> an array of all spares on restart. At least this is what happened twice already.
>>
>> HTH
>>
>> --
>> Eyal at Home (eyal@eyal.emu.id.au)
>>
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-11-02 13:05 ` eyal
@ 2023-11-02 17:05 ` Roger Heflin
2023-11-02 23:23 ` eyal
0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-02 17:05 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
You need to add the -x for extended stats on iostat. That will catch
if one of the disks has difficulty recovering bad blocks and is being
super slow.
And that super slow will come and go based on if you are touching the
bad blocks.
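A concrete form of that suggestion (a sketch; sdb..sdg are the member names from this thread, and the interval/count are arbitrary):

```shell
#!/bin/sh
# Build a grep pattern matching the iostat header plus the given
# devices, then run `iostat -x` filtered to them. With extended stats,
# a quietly failing disk shows up as one member with much higher
# r_await/w_await and %util than its peers.
member_pattern() {
    echo "Device|$(echo "$@" | tr ' ' '|')"
}

watch_members() {
    interval=$1
    count=$2
    shift 2
    if command -v iostat >/dev/null 2>&1; then
        iostat -x "$interval" "$count" | grep -E "$(member_pattern "$@")"
    else
        echo "iostat not installed (package sysstat)"
    fi
}

watch_members 10 3 sdb sdc sdd sde sdf sdg
```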
On Thu, Nov 2, 2023 at 8:06 AM <eyal@eyal.emu.id.au> wrote:
>
> On 02/11/2023 22.57, Roger Heflin wrote:
> > you need the iostat data from the disks under md127 to see what
> > mdraid is having to do to internally. Get the list of sd devices and
> > do iostat -x | grep -E 'Device|sdX|sxy|sdz' (putting into the list
> > the whole non-partitioned sd devices in your array)
>
> I do have iostat running on the components too, and the writing mostly reflects what md127 sees.
> However, while there are no reads on md127, there are often some on the components,
> very visible in the last two iostat minutes. Is there some RMW (read-modify-write) going on, on the components?
>
> Here is a snippet, when things were slow:
>
> ## meminfo data:
> 2023-11-02 23:05:07 Dirty: 1037680 kB
> 2023-11-02 23:05:17 Dirty: 1037288 kB
> 2023-11-02 23:05:27 Dirty: 1036560 kB
> 2023-11-02 23:05:37 Dirty: 1036064 kB
> 2023-11-02 23:05:47 Dirty: 1035720 kB
> 2023-11-02 23:05:57 Dirty: 1035732 kB
> 2023-11-02 23:06:07 Dirty: 1035860 kB
> 2023-11-02 23:06:17 Dirty: 1035708 kB
> 2023-11-02 23:06:27 Dirty: 1035648 kB
> 2023-11-02 23:06:37 Dirty: 1035388 kB
> 2023-11-02 23:06:47 Dirty: 1035336 kB
> 2023-11-02 23:06:57 Dirty: 1035248 kB
> 2023-11-02 23:07:07 Dirty: 1037312 kB
> 2023-11-02 23:07:17 Dirty: 1037172 kB
> 2023-11-02 23:07:27 Dirty: 1035224 kB
> 2023-11-02 23:07:37 Dirty: 1035152 kB
> 2023-11-02 23:07:47 Dirty: 1035064 kB
> 2023-11-02 23:07:57 Dirty: 1025316 kB
> 2023-11-02 23:08:07 Dirty: 1020000 kB
> 2023-11-02 23:08:17 Dirty: 1019628 kB
> 2023-11-02 23:08:27 Dirty: 1019784 kB
> 2023-11-02 23:08:37 Dirty: 1019284 kB
> 2023-11-02 23:08:47 Dirty: 1018908 kB
> 2023-11-02 23:08:57 Dirty: 1018700 kB
>
> ## iostat
> 23:05:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> 23:05:08 md127 2.22 0.00 14.73 0.00 0 884 0
> 23:05:08 sdb 1.13 1.93 12.10 0.00 116 726 0
> 23:05:08 sdc 1.32 0.00 16.17 0.00 0 970 0
> 23:05:08 sdd 1.18 0.00 5.57 0.00 0 334 0
> 23:05:08 sde 0.83 0.00 1.50 0.00 0 90 0
> 23:05:08 sdf 0.83 0.00 1.50 0.00 0 90 0
> 23:05:08 sdg 1.27 4.47 5.97 0.00 268 358 0
>
> 23:06:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> 23:06:08 md127 2.75 0.00 36.20 0.00 0 2172 0
> 23:06:08 sdb 1.30 4.47 22.40 0.00 268 1344 0
> 23:06:08 sdc 1.13 0.00 14.40 0.00 0 864 0
> 23:06:08 sdd 1.13 0.80 14.20 0.00 48 852 0
> 23:06:08 sde 0.82 8.53 9.73 0.00 512 584 0
> 23:06:08 sdf 0.87 8.53 7.13 0.00 512 428 0
> 23:06:08 sdg 0.75 8.53 1.20 0.00 512 72 0
>
> 23:07:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> 23:07:08 md127 2.30 0.00 10.93 0.00 0 656 0
> 23:07:08 sdb 1.17 2.13 5.82 0.00 128 349 0
> 23:07:08 sdc 1.12 0.00 5.35 0.00 0 321 0
> 23:07:08 sdd 1.35 4.33 9.68 0.00 260 581 0
> 23:07:08 sde 1.00 4.33 5.68 0.00 260 341 0
> 23:07:08 sdf 0.82 0.00 3.95 0.00 0 237 0
> 23:07:08 sdg 1.00 4.33 5.68 0.00 260 341 0
>
> 23:08:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> 23:08:08 md127 13.55 0.13 238.73 0.00 8 14324 0
> 23:08:08 sdb 1.92 36.47 72.25 0.00 2188 4335 0
> 23:08:08 sdc 1.78 27.67 66.38 0.00 1660 3983 0
> 23:08:08 sdd 2.25 47.67 95.52 0.00 2860 5731 0
> 23:08:08 sde 2.03 43.87 77.72 0.00 2632 4663 0
> 23:08:08 sdf 1.73 49.20 59.65 0.00 2952 3579 0
> 23:08:08 sdg 1.72 29.80 56.45 0.00 1788 3387 0
>
> 23:09:08 Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> 23:09:08 md127 3.22 0.00 59.47 0.00 0 3568 0
> 23:09:08 sdb 1.88 26.60 36.00 0.00 1596 2160 0
> 23:09:08 sdc 1.82 32.53 13.60 0.00 1952 816 0
> 23:09:08 sdd 1.83 17.07 41.80 0.00 1024 2508 0
> 23:09:08 sde 1.10 9.07 33.40 0.00 544 2004 0
> 23:09:08 sdf 0.93 21.93 3.80 0.00 1316 228 0
> 23:09:08 sdg 0.90 15.20 6.73 0.00 912 404 0
>
> > The slowness of clearing dirty and what perf top showed (ext4* calls)
> > indicates it is creating entries for small files and putting data in
> > those files. Creating entries for small files requires a lot of both
> > read and write seeks and seeks are expensive. At most I would expect
> > you to be able to rsync/copy around 30-50 files/second (when the array
> > is healthy and once the size gets small enough the mb/sec does not
> > matter anymore as the seeks dominate).
>
> My experience is that the array can handle all of this easily, and I never noticed
> any hiccups.
>
> > On my array I do a nightly copy in of data written during the day to a
> > SSD (security cams). And without any array issues that makes my
> > array really busy. Anything wrong with the array and that process
> > takes several times as long.
>
> What I see here is that a few seconds of usual work take a few hours to perform.
> This does not look like the array is stressed, something more substantial is going on.
>
> > On Thu, Nov 2, 2023 at 6:34 AM <eyal@eyal.emu.id.au> wrote:
> >>
> >> On 02/11/2023 19.34, Johannes Truschnigg wrote:
> >>> Hi list,
> >>>
> >>> for the record, I do not think that any of the observations the OP made can be
> >>> explained by non-pathological phenomena/patterns of behavior. Something is
> >>> very clearly wrong with how this system behaves (the reported figures do not
> >>> at all match the expected performance of even a degraded RAID6 array in my
> >>> experience) and how data written to the filesystem apparently fails to make it
> >>> into the backing devices in acceptable time.
> >>>
> >>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> >>> hardware failure", I think.
> >>>
> >>> Still, it'd be interesting to know what happens when writes to the array thru
> >>> the file system are performed with O_DIRECT in effect, i.e., using `dd
> >>> oflag=direct status=progress ...` - does this yield any observable (via
> >>> `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
> >>> report this way with varying block sizes? Are there no concerning messages in
> >>> the debug ringbuffer (`dmesg`) at this time?
> >>>
> >>> I'm not sure how we'd best learn more about what the participating busy kernel
> >>> threads (Fedora 38 might have more convenient devices up its sleeve than
> >>> ftrace?) are spending their time on in particular, but I think that's what's
> >>> needed to be known to pin down the underlying cause of the problem.
> >>
> >> To clarify what may be lost on this thread.
> >>
> >> I can dd to this fs just fine; the Dirty count does not go above about 4GB. The machine has 32GB of RAM.
> >> It is cleared very fast. Tested writing 100GB.
> >>
> >> The Dirty blocks rise to the max then drain quickly.
> >>
> >> 2023-11-02 21:28:27 Dirty: 236 kB
> >> 2023-11-02 21:28:37 Dirty: 8 kB
> >> 2023-11-02 21:28:47 Dirty: 40 kB
> >> 2023-11-02 21:28:57 Dirty: 68 kB
> >> 2023-11-02 21:29:07 Dirty: 216 kB
> >> 2023-11-02 21:29:17 Dirty: 364 kB
> >> 2023-11-02 21:29:27 Dirty: 20 kB
> >> 2023-11-02 21:29:37 Dirty: 48 kB
> >> 2023-11-02 21:29:47 Dirty: 24 kB
> >> 2023-11-02 21:29:57 Dirty: 36 kB
> >> 2023-11-02 21:30:07 Dirty: 28 kB
> >> 2023-11-02 21:30:17 Dirty: 52 kB
> >> 2023-11-02 21:30:27 Dirty: 36 kB
> >> 2023-11-02 21:30:37 Dirty: 4112980 kB
> >> 2023-11-02 21:30:47 Dirty: 3772396 kB
> >> 2023-11-02 21:30:57 Dirty: 3849688 kB
> >> 2023-11-02 21:31:07 Dirty: 3761132 kB
> >> 2023-11-02 21:31:17 Dirty: 3846216 kB
> >> 2023-11-02 21:31:27 Dirty: 3855060 kB
> >> 2023-11-02 21:31:37 Dirty: 3902212 kB
> >> 2023-11-02 21:31:47 Dirty: 4173524 kB
> >> 2023-11-02 21:31:57 Dirty: 3849856 kB
> >> 2023-11-02 21:32:07 Dirty: 3940632 kB
> >> 2023-11-02 21:32:17 Dirty: 2212008 kB
> >> 2023-11-02 21:32:27 Dirty: 244 kB
> >> 2023-11-02 21:32:37 Dirty: 288 kB
> >> 2023-11-02 21:32:47 Dirty: 60 kB
> >>
> >> rsync'ing a large tree into this same fs is different.
> >> When killed, less than 3GB had been copied: mostly small files in many directories,
> >> something like 92,059 files using 2,519MB.
> >>
> >> Note how slowly the dirty blocks are cleared.
> >>
> >> 2023-11-02 21:36:28 Dirty: 296 kB
> >> 2023-11-02 21:36:38 Dirty: 277412 kB
> >> 2023-11-02 21:36:48 Dirty: 403928 kB
> >> 2023-11-02 21:36:58 Dirty: 606916 kB
> >> 2023-11-02 21:37:08 Dirty: 753888 kB
> >> 2023-11-02 21:37:18 Dirty: 641624 kB
> >> 2023-11-02 21:37:28 Dirty: 744560 kB
> >> 2023-11-02 21:37:38 Dirty: 946864 kB
> >> 2023-11-02 21:37:48 Dirty: 1355964 kB
> >> 2023-11-02 21:37:58 Dirty: 2365632 kB
> >> 2023-11-02 21:38:08 Dirty: 2451948 kB
> >> ### at this point the rsync was cancelled as I see the kthread start
> >> 2023-11-02 21:38:18 Dirty: 2451752 kB
> >> 2023-11-02 21:38:28 Dirty: 2440496 kB
> >> 2023-11-02 21:38:38 Dirty: 2440308 kB
> >> 2023-11-02 21:38:48 Dirty: 2440136 kB
> >> 2023-11-02 21:38:58 Dirty: 2440036 kB
> >> 2023-11-02 21:39:08 Dirty: 2440240 kB
> >> 2023-11-02 21:39:18 Dirty: 2405768 kB
> >> 2023-11-02 21:39:28 Dirty: 2405784 kB
> >> 2023-11-02 21:39:38 Dirty: 2406012 kB
> >> 2023-11-02 21:39:48 Dirty: 2405908 kB
> >> 2023-11-02 21:39:58 Dirty: 2405848 kB
> >> 2023-11-02 21:40:08 Dirty: 2405876 kB
> >> 2023-11-02 21:40:18 Dirty: 2405704 kB
> >> 2023-11-02 21:40:28 Dirty: 2405628 kB
> >> 2023-11-02 21:40:38 Dirty: 2405544 kB
> >> 2023-11-02 21:40:48 Dirty: 2405484 kB
> >> 2023-11-02 21:40:58 Dirty: 2405416 kB
> >> 2023-11-02 21:41:08 Dirty: 2405412 kB
> >> 2023-11-02 21:41:18 Dirty: 2405240 kB
> >> 2023-11-02 21:41:28 Dirty: 2405148 kB
> >> 2023-11-02 21:41:38 Dirty: 2405132 kB
> >> 2023-11-02 21:41:48 Dirty: 2404428 kB
> >> 2023-11-02 21:41:58 Dirty: 2405056 kB
> >> 2023-11-02 21:42:08 Dirty: 2404944 kB
> >> 2023-11-02 21:42:18 Dirty: 2404904 kB
> >> 2023-11-02 21:42:28 Dirty: 2404316 kB
> >> 2023-11-02 21:42:38 Dirty: 2395340 kB
> >> 2023-11-02 21:42:48 Dirty: 2394540 kB
> >> 2023-11-02 21:42:58 Dirty: 2394368 kB
> >> 2023-11-02 21:43:08 Dirty: 2394520 kB
> >> 2023-11-02 21:43:18 Dirty: 2394132 kB
> >> 2023-11-02 21:43:28 Dirty: 2394032 kB
> >> 2023-11-02 21:43:38 Dirty: 2394276 kB
> >> 2023-11-02 21:43:48 Dirty: 2386960 kB
> >> 2023-11-02 21:43:58 Dirty: 2387420 kB
> >> 2023-11-02 21:44:08 Dirty: 2386620 kB
> >> 2023-11-02 21:44:18 Dirty: 2386828 kB
> >> 2023-11-02 21:44:28 Dirty: 2386104 kB
> >> 2023-11-02 21:44:38 Dirty: 2386328 kB
> >> 2023-11-02 21:44:48 Dirty: 2382520 kB
> >> 2023-11-02 21:44:58 Dirty: 2382024 kB
> >> 2023-11-02 21:45:08 Dirty: 2381344 kB
> >> 2023-11-02 21:45:18 Dirty: 2380776 kB
> >> 2023-11-02 21:45:28 Dirty: 2380424 kB
> >> 2023-11-02 21:45:38 Dirty: 2379672 kB
> >> 2023-11-02 21:45:48 Dirty: 2380180 kB
> >> 2023-11-02 21:45:58 Dirty: 2373272 kB
> >> 2023-11-02 21:46:08 Dirty: 2372416 kB
> >> 2023-11-02 21:46:18 Dirty: 2372052 kB
> >> 2023-11-02 21:46:28 Dirty: 2372176 kB
> >> ... and so on
> >>
> >> At the same time I see the kthread in top:
> >>
> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >> 1491995 root 20 0 0 0 0 R 97.3 0.0 12:50.67 kworker/u16:2+flush-9:127
> >>
> >> And iostat of md127 shows slow writing like this:
> >> Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
> >> 21:59:52 md127 3.50 0.00 30.40 0.00 0 304 0
> >> 22:00:02 md127 3.10 0.00 31.20 0.00 0 312 0
> >> 22:00:12 md127 2.00 0.00 64.40 0.00 0 644 0
> >> 22:00:22 md127 3.40 0.00 22.80 0.00 0 228 0
> >>
> >> So, why the slow clearing of the dirty blocks?
> >>
> >> I will leave the machine to clear it into the night...
> >>
> >> BTW, if I try to shut down in this situation, it will get stuck, and if I force a sysrq boot I get
> >> an array of all spares on restart. At least this is what happened twice already.
> >>
> >> HTH
> >>
> >> --
> >> Eyal at Home (eyal@eyal.emu.id.au)
> >>
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: problem with recovered array
2023-11-02 17:05 ` Roger Heflin
@ 2023-11-02 23:23 ` eyal
2023-11-03 12:08 ` Roger Heflin
0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-02 23:23 UTC (permalink / raw)
To: linux-raid
On 03/11/2023 04.05, Roger Heflin wrote:
> You need to add the -x for extended stats on iostat. That will catch
> if one of the disks has difficulty recovering bad blocks and is being
> super slow.
>
> And that super slow will come and go based on if you are touching the
> bad blocks.
I did not know about '-x'. I see that the total columns (kB_read, kB_wrtn) are not included:-(
Here is one.
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
md127 1.88 116.72 0.00 0.00 11.27 62.19 6.31 1523.93 0.00 0.00 218.42 241.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.40 1.72
sdb 0.67 67.42 16.17 96.02 11.61 100.68 3.74 367.79 89.35 95.98 7.65 98.33 0.00 0.00 0.00 0.00 0.00 0.00 2.02 6.25 0.05 1.92
sdc 0.81 89.74 21.61 96.39 15.30 110.94 3.74 367.58 89.29 95.98 7.70 98.20 0.00 0.00 0.00 0.00 0.00 0.00 2.02 5.15 0.05 1.73
sdd 0.87 102.17 24.66 96.59 16.75 117.28 3.73 367.34 89.24 95.99 15.00 98.45 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.28 0.08 3.92
sde 0.87 101.87 24.58 96.56 19.38 116.46 3.72 367.45 89.28 96.00 16.20 98.71 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.30 0.08 3.94
sdf 0.81 90.11 21.70 96.39 16.24 110.80 3.73 367.15 89.20 95.99 14.19 98.51 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.17 0.07 3.91
sdg 0.68 67.91 16.28 95.97 12.17 99.30 3.73 367.20 89.21 95.98 13.28 98.32 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.10 0.06 3.86
Interesting to see that sd[bc] have lower w_await, aqu-sz and %util, and higher f_await.
Even without yet understanding what these mean, I see that sd[bc] are model ST12000NM001G (recently replaced) while the rest are the original ST12000NM0007 (now 5yo).
I expect this reflects different tuning in the device fw.
I do not expect this to be relevant to the current situation.
I also need to understand the r vs w difference. I see wkB/s is identical for all members, while rkB/s is not.
I expected these to be similar, but maybe md reads different disks at different times to make up for the missing one?
Still, thanks for your help.
> On Thu, Nov 2, 2023 at 8:06 AM <eyal@eyal.emu.id.au> wrote:
[discussion trimmed]
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-11-02 23:23 ` eyal
@ 2023-11-03 12:08 ` Roger Heflin
0 siblings, 0 replies; 35+ messages in thread
From: Roger Heflin @ 2023-11-03 12:08 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
On Thu, Nov 2, 2023 at 6:31 PM <eyal@eyal.emu.id.au> wrote:
>
> On 03/11/2023 04.05, Roger Heflin wrote:
> > You need to add the -x for extended stats on iostat. That will catch
> > if one of the disks has difficulty recovering bad blocks and is being
> > super slow.
> >
> > And that super slow will come and go based on if you are touching the
> > bad blocks.
>
> I did not know about '-x'. I see that the total columns (kB_read, kB_wrtn) are not included:-(
>
> Here is one.
>
> Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
> md127 1.88 116.72 0.00 0.00 11.27 62.19 6.31 1523.93 0.00 0.00 218.42 241.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.40 1.72
> sdb 0.67 67.42 16.17 96.02 11.61 100.68 3.74 367.79 89.35 95.98 7.65 98.33 0.00 0.00 0.00 0.00 0.00 0.00 2.02 6.25 0.05 1.92
> sdc 0.81 89.74 21.61 96.39 15.30 110.94 3.74 367.58 89.29 95.98 7.70 98.20 0.00 0.00 0.00 0.00 0.00 0.00 2.02 5.15 0.05 1.73
> sdd 0.87 102.17 24.66 96.59 16.75 117.28 3.73 367.34 89.24 95.99 15.00 98.45 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.28 0.08 3.92
> sde 0.87 101.87 24.58 96.56 19.38 116.46 3.72 367.45 89.28 96.00 16.20 98.71 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.30 0.08 3.94
> sdf 0.81 90.11 21.70 96.39 16.24 110.80 3.73 367.15 89.20 95.99 14.19 98.51 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.17 0.07 3.91
> sdg 0.68 67.91 16.28 95.97 12.17 99.30 3.73 367.20 89.21 95.98 13.28 98.32 0.00 0.00 0.00 0.00 0.00 0.00 2.02 3.10 0.06 3.86
>
> Interesting to see that sd[bc] have lower w_await,aqu-sz and %util and higher f_await.
> Even not yet understanding what these mean, I see that sd[bc] are model ST12000NM001G (recently replaced) while the rest are the original ST12000NM0007 (now 5yo).
> I expect this shows different tuning in the device fw.
>
> I do not expect this to be relevant to the current situation.
>
> I need to understand the r vs w also. I see wkB/s identical for all members, rkB/s is not.
> I expected this to be similar, but maybe md reads different disks at different times to make up for the missing one?
>
> Still, thanks for your help.
I would expect the reads to be slightly different. Note MD is
reading 116kB/sec but the underlying disks are having to do about
500kB/sec, and MD is doing 1523kB/sec of writes while the disks do
about 2200kB/sec. So the reads need to be about 4x the real reads
to recover/rebuild the data. The interesting columns are r/s, rkB/s,
r_await (how long a read takes in ms), then w/s, wkB/s, w_await (how
long a write takes in ms), and the %util. rrqm/s is merged read
requests; dividing kB/s by requests/s indicates the average io is
around 4k.
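That arithmetic can be checked against the sdb row quoted above (the figures come from the iostat output in this thread; only the little awk wrapper is mine):

```shell
# Average read request size = rkB/s divided by total read requests
# submitted per second (r/s issued plus rrqm/s merged).
awk 'BEGIN { rkbs = 67.42; rs = 0.67; rrqms = 16.17
             printf "avg read ~ %.1f kB\n", rkbs / (rs + rrqms) }'
```

which lands right at the ~4k average io size mentioned above.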
The %util column is the one to watch. If the disk is having
internal issues, %util will hit close to 100% even at lowish
read/write rates, and that is a really bad sign.
You might check what that data looks like when the disk is having
issues. You might also start setting dirty_bytes and
dirty_background_bytes, which makes the io suck less when your array
gets slow.
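A minimal sketch of that tuning (the byte values below are illustrative assumptions, not recommendations from this thread; the idea is to cap dirty data so writers block before gigabytes pile up in front of a slow array):

```shell
# Illustrative thresholds only: start background writeback early and
# block writers well before the default percentage-based limits.
dirty_bg=$((64 * 1024 * 1024))     # 64 MiB: background writeback kicks in
dirty_max=$((256 * 1024 * 1024))   # 256 MiB: writers block above this
echo "sysctl -w vm.dirty_background_bytes=$dirty_bg"
echo "sysctl -w vm.dirty_bytes=$dirty_max"
# the echoed commands would be run as root, or the two keys persisted
# in a file under /etc/sysctl.d/ to survive reboots
```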
My array has mythtv stuff and security cam images. During the day I
save all of that to a 500GB ssd, and then at midnight move it to the
long term spinning disk, and during that window my disks are really
busy. And depending on if a rebuild is running and/or if something
else is going on with the array how long that takes varies with the
amount collected during the day and if there are array issues it takes
longer. I have been keeping a spare to use in emergencies.
* Re: problem with recovered array
2023-11-02 8:34 ` Johannes Truschnigg
2023-11-02 11:27 ` eyal
@ 2023-11-03 14:16 ` Carlos Carvalho
2023-11-03 14:32 ` Dirty page flushing regression in 6.5.x vs 6.1.x Roman Mamedov
2023-11-03 15:57 ` problem with recovered array Roger Heflin
1 sibling, 2 replies; 35+ messages in thread
From: Carlos Carvalho @ 2023-11-03 14:16 UTC (permalink / raw)
To: linux-raid
Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> for the record, I do not think that any of the observations the OP made can be
> explained by non-pathological phenomena/patterns of behavior. Something is
> very clearly wrong with how this system behaves (the reported figures do not
> at all match the expected performance of even a degraded RAID6 array in my
> experience) and how data written to the filesystem apparently fails to make it
> into the backing devices in acceptable time.
>
> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> hardware failure", I think.
Exactly. That's what I've been saying for months...
I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
Limiting the dirty*bytes in /proc/sys/vm keeps the dirty pages low, BUT tar is
then blocked in D state and the tarball expansion proceeds so slowly that it'd
take days to complete (checked with quota).
So 6.5 (and 6.4) are unusable in this case. In another machine, which does
hundreds of rsync downloads every day, the same problem exists and I also get
frequent random rsync timeouts.
This is all with raid6 and ext4. One of the machines has a journal disk in the
raid and the filesystem is mounted with nobarriers. Both show the same
behavior. It'd be interesting to try a different filesystem but these are
production machines with many disks and I cannot create another big array to
transfer the contents.
* Dirty page flushing regression in 6.5.x vs 6.1.x
2023-11-03 14:16 ` Carlos Carvalho
@ 2023-11-03 14:32 ` Roman Mamedov
2023-11-03 15:57 ` problem with recovered array Roger Heflin
1 sibling, 0 replies; 35+ messages in thread
From: Roman Mamedov @ 2023-11-03 14:32 UTC (permalink / raw)
To: Carlos Carvalho; +Cc: linux-raid
On Fri, 3 Nov 2023 11:16:57 -0300
Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> > [quoted text trimmed]
>
> Exactly. That's what I've been saying for months...
>
> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
If you have a 100% way to reproduce, next what would be ideal to do is a
bisect to narrow it down to which commit introduced the problem. Of course it
might be not feasible to reboot dozens of times on production machines. Still,
maybe for a start you could narrow it down some more, such as check kernels
around 6.2, 6.3? Those are not offered anymore on kernel.org, but should be
retrievable from distro repositories or git.
Also check the 6.6 kernel which has been released recently.
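The bisect workflow can be demonstrated in a throwaway repo (the repo, the ten commits, and the `bug` marker file are all invented for illustration; a real kernel bisect would build and boot each candidate and run the flushing reproducer instead of testing for the marker):

```shell
# Demo: commits 1..10, where commit 7 "introduces the regression".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.invalid
git config user.name demo
for i in $(seq 1 10); do
    [ "$i" -ge 7 ] && touch bug      # commit 7 onward carries the bug
    echo "$i" > version
    git add -A
    git commit -qm "commit $i"
done
# HEAD is known bad, the root commit is known good
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" >/dev/null
# the test script exits 0 for good, non-zero for bad
git bisect run sh -c '! test -f bug' >/dev/null 2>&1
git log -1 --format='first bad: %s' refs/bisect/bad
```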
--
With respect,
Roman
* Re: problem with recovered array
2023-11-03 14:16 ` Carlos Carvalho
2023-11-03 14:32 ` Dirty page flushing regression in 6.5.x vs 6.1.x Roman Mamedov
@ 2023-11-03 15:57 ` Roger Heflin
2023-11-03 22:38 ` eyal
1 sibling, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-03 15:57 UTC (permalink / raw)
To: Carlos Carvalho; +Cc: linux-raid
On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>
> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> > [quoted text trimmed]
>
> Exactly. That's what I've been saying for months...
>
> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>
> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
> blocked in D state and the tarball expansion proceeds so slowly that it'd take
> days to complete (checked with quota).
>
> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
> hundreds of rsync downloads every day, the same problem exists and I also get
> frequent random rsync timeouts.
>
> This is all with raid6 and ext4. One of the machines has a journal disk in the
> raid and the filesystem is mounted with nobarriers. Both show the same
> behavior. It'd be interesting to try a different filesystem but these are
> production machines with many disks and I cannot create another big array to
> transfer the contents.
My array is running 6.5 + xfs, and mine all seems to work normally
(speed wise). And in the perf top call he ran, all of the busy
kworkers were in ext4* calls, spending a lot of time doing various
filesystem work.
I did find/debug a situation where dropping the cache caused ext4
performance to be a disaster (large directories, lots of files). It
was tracked back to the fact that ext4 relies on the Buffers: data
space in /proc/meminfo for at least directory-entry caching, and that
with a lot of directories and/or files in directories, that Buffers:
space getting dropped and/or pruned for some reason forced the
fragmented directory entries to be reloaded from a spinning disk,
requiring the disk to seek for *MINUTES* to reload them (in this case
there were several million files in a couple of directories, with the
directory entries allocated over time, so very likely heavily
fragmented).
I wonder if there was some change to how Buffers is
used/sized/pruned in the recent kernels. The same drop_cache on an
XFS filesystem had no effect that I could identify, and doing an ls -lR
on a big xfs filesystem does not make Buffers grow, but doing the same
ls -lR against an ext3/4 makes Buffers grow quite a bit (how much
depends on how many files/directories are on the filesystem).
He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
see if the poor performance correlates with Buffers suddenly being
smaller for some reason.
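A minimal sketch of that monitoring (the 10-second interval matches the logs earlier in the thread; the `sample` function name is just for illustration):

```shell
# One timestamped sample of Dirty and Buffers from /proc/meminfo;
# wrap the call in `while sleep 10; do sample; done` to log continuously.
sample() {
    printf '%s  ' "$(date '+%F %T')"
    awk '/^(Dirty|Buffers):/ { printf "%s %s kB  ", $1, $2 } END { print "" }' \
        /proc/meminfo
}
sample
```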
* Re: problem with recovered array
2023-11-03 15:57 ` problem with recovered array Roger Heflin
@ 2023-11-03 22:38 ` eyal
2023-11-04 0:48 ` eyal
0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-03 22:38 UTC (permalink / raw)
To: linux-raid
On 04/11/2023 02.57, Roger Heflin wrote:
> On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>> [quoted discussion trimmed]
>
> My array is running 6.5 + xfs, and mine all seems to work normally
> (speed wise). And in the perf top call he ran all of the busy
> kworkers were ext4* calls spending a lot of time doing various
> filesystem work.
>
> I did find/debug a situation where dumping the cache caused ext4
> performance to be a disaster (large directories, lots of files). It
> was tracked back to ext4 relies on the Buffers: data space in
> /proc/meminfo for at least directory entry caching, and that if there
> were a lot of directories and/or files in directories that Buffer:
> getting dropped and/or getting pruned for any some reason caused the
> fragmented directory entries to have to get reloaded from a spinning
> disk and require the disk to be seeking for *MINUTES* to reload it
> (there were in this case several million files in a couple of
> directories with the directory entries being allocated over time so
> very likely heavily fragmented).
>
> I wonder if there was some change with how Buffers is
> used/sized/pruned in the recent kernels. The same drop_cache on an
> XFS filesystem had no effect that I could identify and doing a ls -lR
> on a big xfs filesystem does not make Buffers grow, but doing the same
> ls -lR against an ext3/4 makes Buffers grow quite a bit (how much
> depends on how many files/directories are on the filesystem).
>
> He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
> see if the poor performance correlates with Buffers suddenly being
> smaller for some reason.
As much as I hate this, I started another small test.
$ uname -a
Linux e7.eyal.emu.id.au 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
$ df -h /data1
Filesystem Size Used Avail Use% Mounted on
/dev/md127 55T 45T 9.8T 83% /data1
$ sudo du -sm /data2/no-backup/old-backups/tapes/01
2519 /data2/no-backup/old-backups/tapes/01
$ sudo find /data2/no-backup/old-backups/tapes/01|wc -l
92059
$ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/01 /data1/no-backup/old-backups/
It completed in about one minute and it was enough to trigger the problem.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2781097 root 20 0 0 0 0 R 98.3 0.0 36:04.35 kworker/u16:0+flush-9:127
iostat shows nothing unusual: writes to md127 at 10-20KB/s, and %util is in the low single digits for all members.
Here is what meminfo showed in the first 10 minutes (it is still going but the trend is clear):
Dirty change Buffers change
2023-11-04 08:56:35 44 -580 1170536 8
2023-11-04 08:56:45 48 4 1170552 16
2023-11-04 08:56:55 812456 812408 1171008 456
2023-11-04 08:57:05 538436 -274020 1180820 9812
2023-11-04 08:57:15 698708 160272 1189368 8548
2023-11-04 08:57:25 874208 175500 1195620 6252
2023-11-04 08:57:35 742300 -131908 1202124 6504
2023-11-04 08:57:45 973528 231228 1209580 7456
2023-11-04 08:57:55 1269320 295792 1214900 5320
2023-11-04 08:58:05 1624428 355108 1219764 4864
2023-11-04 08:58:15 1629484 5056 1219816 52
2023-11-04 08:58:25 1629372 -112 1219832 16
2023-11-04 08:58:35 1629028 -344 1219856 24
2023-11-04 08:58:45 1628928 -100 1219880 24
2023-11-04 08:58:55 1628552 -376 1219908 28
2023-11-04 08:59:05 1629252 700 1220072 164
2023-11-04 08:59:15 1628696 -556 1220132 60
2023-11-04 08:59:25 1628304 -392 1220156 24
2023-11-04 08:59:35 1628264 -40 1220188 32
2023-11-04 08:59:45 1628184 -80 1220340 152
2023-11-04 08:59:55 1628144 -40 1220364 24
2023-11-04 09:00:05 1628124 -20 1219940 -424
2023-11-04 09:00:15 1627908 -216 1219976 36
2023-11-04 09:00:25 1627840 -68 1220000 24
2023-11-04 09:00:35 1624276 -3564 1220024 24
2023-11-04 09:00:45 1624100 -176 1220060 36
2023-11-04 09:00:55 1623912 -188 1220092 32
2023-11-04 09:01:05 1624076 164 1220112 20
2023-11-04 09:01:15 1623368 -708 1220160 48
2023-11-04 09:01:25 1623176 -192 1220196 36
2023-11-04 09:01:35 1621872 -1304 1220232 36
2023-11-04 09:01:45 1621732 -140 1220308 76
2023-11-04 09:01:55 1612304 -9428 1220392 84
2023-11-04 09:02:05 1612256 -48 1220420 28
2023-11-04 09:02:15 1612040 -216 1220444 24
2023-11-04 09:02:25 1611968 -72 1220476 32
2023-11-04 09:02:35 1611872 -96 1220492 16
2023-11-04 09:02:45 1609932 -1940 1220524 32
2023-11-04 09:02:55 1609828 -104 1220556 32
2023-11-04 09:03:05 1609916 88 1220572 16
2023-11-04 09:03:15 1609496 -420 1220608 36
2023-11-04 09:03:25 1609392 -104 1220632 24
2023-11-04 09:03:35 1609320 -72 1220648 16
2023-11-04 09:03:45 1609240 -80 1220672 24
2023-11-04 09:03:55 1609152 -88 1220688 16
2023-11-04 09:04:05 1609332 180 1220712 24
2023-11-04 09:04:15 1608892 -440 1220748 36
2023-11-04 09:04:25 1608848 -44 1220764 16
2023-11-04 09:04:35 1608744 -104 1220796 32
2023-11-04 09:04:45 1608436 -308 1220820 24
2023-11-04 09:04:55 1607916 -520 1220836 16
2023-11-04 09:05:05 1608624 708 1220876 40
2023-11-04 09:05:15 1606556 -2068 1220928 52
2023-11-04 09:05:25 1602692 -3864 1221016 88
2023-11-04 09:05:35 1602080 -612 1221052 36
2023-11-04 09:05:45 1602000 -80 1221080 28
2023-11-04 09:05:55 1601928 -72 1221096 16
2023-11-04 09:06:05 1602228 300 1221124 28
2023-11-04 09:06:15 1601848 -380 1221156 32
2023-11-04 09:06:25 1601656 -192 1221180 24
2023-11-04 09:06:35 1601532 -124 1221212 32
2023-11-04 09:06:45 1601476 -56 1221228 16
2023-11-04 09:06:55 1601364 -112 1221252 24
--
Eyal at Home (eyal@eyal.emu.id.au)
* Re: problem with recovered array
2023-11-03 22:38 ` eyal
@ 2023-11-04 0:48 ` eyal
2023-11-04 1:01 ` Roger Heflin
0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-04 0:48 UTC (permalink / raw)
To: linux-raid
On 04/11/2023 09.38, eyal@eyal.emu.id.au wrote:
> On 04/11/2023 02.57, Roger Heflin wrote:
>> [quoted discussion trimmed]
>
> As much as I hate this, I started another small test.
>
> $ uname -a
> Linux e7.eyal.emu.id.au 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
>
> $ df -h /data1
> Filesystem Size Used Avail Use% Mounted on
> /dev/md127 55T 45T 9.8T 83% /data1
>
> $ sudo du -sm /data2/no-backup/old-backups/tapes/01
> 2519 /data2/no-backup/old-backups/tapes/01
>
> $ sudo find /data2/no-backup/old-backups/tapes/01|wc -l
> 92059
>
> $ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/01 /data1/no-backup/old-backups/
>
> It completed in about one minute and it was enough to trigger the problem.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2781097 root 20 0 0 0 0 R 98.3 0.0 36:04.35 kworker/u16:0+flush-9:127
>
> iostat has nothing unusual, writes to md127 at 10-20KB/s and %util is in the low single digit for all members.
>
> Here is what meminfo showed in the first 10 minutes (it is still going but the trend is clear):
>
> Dirty change Buffers change
> 2023-11-04 08:56:35 44 -580 1170536 8
> 2023-11-04 08:56:45 48 4 1170552 16
> 2023-11-04 08:56:55 812456 812408 1171008 456
> 2023-11-04 08:57:05 538436 -274020 1180820 9812
> 2023-11-04 08:57:15 698708 160272 1189368 8548
> 2023-11-04 08:57:25 874208 175500 1195620 6252
> 2023-11-04 08:57:35 742300 -131908 1202124 6504
> 2023-11-04 08:57:45 973528 231228 1209580 7456
> 2023-11-04 08:57:55 1269320 295792 1214900 5320
> 2023-11-04 08:58:05 1624428 355108 1219764 4864
> 2023-11-04 08:58:15 1629484 5056 1219816 52
> 2023-11-04 08:58:25 1629372 -112 1219832 16
> 2023-11-04 08:58:35 1629028 -344 1219856 24
> 2023-11-04 08:58:45 1628928 -100 1219880 24
> 2023-11-04 08:58:55 1628552 -376 1219908 28
> 2023-11-04 08:59:05 1629252 700 1220072 164
> 2023-11-04 08:59:15 1628696 -556 1220132 60
> 2023-11-04 08:59:25 1628304 -392 1220156 24
> 2023-11-04 08:59:35 1628264 -40 1220188 32
> 2023-11-04 08:59:45 1628184 -80 1220340 152
> 2023-11-04 08:59:55 1628144 -40 1220364 24
> 2023-11-04 09:00:05 1628124 -20 1219940 -424
> 2023-11-04 09:00:15 1627908 -216 1219976 36
> 2023-11-04 09:00:25 1627840 -68 1220000 24
> 2023-11-04 09:00:35 1624276 -3564 1220024 24
> 2023-11-04 09:00:45 1624100 -176 1220060 36
> 2023-11-04 09:00:55 1623912 -188 1220092 32
> 2023-11-04 09:01:05 1624076 164 1220112 20
> 2023-11-04 09:01:15 1623368 -708 1220160 48
> 2023-11-04 09:01:25 1623176 -192 1220196 36
> 2023-11-04 09:01:35 1621872 -1304 1220232 36
> 2023-11-04 09:01:45 1621732 -140 1220308 76
> 2023-11-04 09:01:55 1612304 -9428 1220392 84
> 2023-11-04 09:02:05 1612256 -48 1220420 28
> 2023-11-04 09:02:15 1612040 -216 1220444 24
> 2023-11-04 09:02:25 1611968 -72 1220476 32
> 2023-11-04 09:02:35 1611872 -96 1220492 16
> 2023-11-04 09:02:45 1609932 -1940 1220524 32
> 2023-11-04 09:02:55 1609828 -104 1220556 32
> 2023-11-04 09:03:05 1609916 88 1220572 16
> 2023-11-04 09:03:15 1609496 -420 1220608 36
> 2023-11-04 09:03:25 1609392 -104 1220632 24
> 2023-11-04 09:03:35 1609320 -72 1220648 16
> 2023-11-04 09:03:45 1609240 -80 1220672 24
> 2023-11-04 09:03:55 1609152 -88 1220688 16
> 2023-11-04 09:04:05 1609332 180 1220712 24
> 2023-11-04 09:04:15 1608892 -440 1220748 36
> 2023-11-04 09:04:25 1608848 -44 1220764 16
> 2023-11-04 09:04:35 1608744 -104 1220796 32
> 2023-11-04 09:04:45 1608436 -308 1220820 24
> 2023-11-04 09:04:55 1607916 -520 1220836 16
> 2023-11-04 09:05:05 1608624 708 1220876 40
> 2023-11-04 09:05:15 1606556 -2068 1220928 52
> 2023-11-04 09:05:25 1602692 -3864 1221016 88
> 2023-11-04 09:05:35 1602080 -612 1221052 36
> 2023-11-04 09:05:45 1602000 -80 1221080 28
> 2023-11-04 09:05:55 1601928 -72 1221096 16
> 2023-11-04 09:06:05 1602228 300 1221124 28
> 2023-11-04 09:06:15 1601848 -380 1221156 32
> 2023-11-04 09:06:25 1601656 -192 1221180 24
> 2023-11-04 09:06:35 1601532 -124 1221212 32
> 2023-11-04 09:06:45 1601476 -56 1221228 16
> 2023-11-04 09:06:55 1601364 -112 1221252 24
Another view of interest: w_await, which I understand is the average total time to service a write request,
is high on md127, at 1-2s from this point onward. The component disks are significantly quicker, in the low teens of milliseconds.
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
09:00:00 md127 0.02 0.07 0.00 0.00 18.00 4.00 2.33 13.27 0.00 0.00 1268.36 5.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.96 1.12
09:00:00 sdb 0.10 3.80 0.85 89.47 2.50 38.00 0.85 1.57 0.00 0.00 3.00 1.84 0.00 0.00 0.00 0.00 0.00 0.00 0.83 2.82 0.01 0.35
09:00:00 sdc 0.13 3.93 0.85 86.44 6.00 29.50 1.00 4.17 0.50 33.33 4.38 4.17 0.00 0.00 0.00 0.00 0.00 0.00 0.83 4.72 0.01 0.53
09:00:00 sdd 0.13 3.93 0.85 86.44 7.12 29.50 1.05 3.70 0.33 24.10 13.78 3.52 0.00 0.00 0.00 0.00 0.00 0.00 0.83 2.90 0.02 1.54
09:00:00 sde 0.12 3.87 0.85 87.93 2.29 33.14 0.83 1.50 0.00 0.00 14.02 1.80 0.00 0.00 0.00 0.00 0.00 0.00 0.83 0.42 0.01 1.27
09:00:00 sdf 0.03 0.13 0.00 0.00 12.50 4.00 1.37 14.63 2.75 66.80 13.83 10.71 0.00 0.00 0.00 0.00 0.00 0.00 0.83 4.88 0.02 1.84
09:00:00 sdg 0.02 0.07 0.00 0.00 21.00 4.00 1.40 14.77 2.75 66.27 13.45 10.55 0.00 0.00 0.00 0.00 0.00 0.00 0.83 4.96 0.02 1.77
09:01:00 md127 0.05 0.20 0.00 0.00 26.00 4.00 7.85 83.00 0.00 0.00 2388.76 10.57 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 18.75 1.74
09:01:00 sdb 0.23 15.73 3.70 94.07 20.36 67.43 0.98 29.12 6.68 87.17 5.75 29.62 0.00 0.00 0.00 0.00 0.00 0.00 0.67 7.75 0.02 0.73
09:01:00 sdc 0.20 10.33 2.38 92.26 12.08 51.67 0.77 9.93 2.10 73.26 3.20 12.95 0.00 0.00 0.00 0.00 0.00 0.00 0.67 3.33 0.01 0.36
09:01:00 sdd 0.23 18.60 4.42 94.98 23.07 79.71 1.03 22.38 4.93 82.68 11.61 21.66 0.00 0.00 0.00 0.00 0.00 0.00 0.67 3.62 0.02 1.31
09:01:00 sde 0.22 18.20 4.33 95.24 19.62 84.00 0.98 26.18 5.93 85.78 11.61 26.63 0.00 0.00 0.00 0.00 0.00 0.00 0.67 3.40 0.02 1.35
09:01:00 sdf 0.18 18.67 4.48 96.07 42.64 101.82 1.47 49.72 11.33 88.54 12.03 33.90 0.00 0.00 0.00 0.00 0.00 0.00 0.67 7.67 0.03 1.62
09:01:00 sdg 0.18 19.27 4.63 96.19 39.00 105.09 1.48 46.98 10.63 87.76 12.37 31.67 0.00 0.00 0.00 0.00 0.00 0.00 0.67 7.35 0.03 1.68
09:02:00 md127 0.32 1.27 0.00 0.00 26.95 4.00 14.12 214.93 0.00 0.00 639.34 15.23 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9.03 2.80
09:02:00 sdb 0.48 51.47 12.38 96.24 18.03 106.48 1.20 63.79 15.13 92.65 3.72 53.16 0.00 0.00 0.00 0.00 0.00 0.00 0.78 3.79 0.02 0.79
09:02:00 sdc 0.53 58.07 13.98 96.33 16.16 108.88 1.30 71.99 17.08 92.93 4.95 55.38 0.00 0.00 0.00 0.00 0.00 0.00 0.78 7.15 0.02 1.08
09:02:00 sdd 0.52 69.13 16.77 97.01 14.13 133.81 1.52 92.27 21.95 93.54 10.38 60.84 0.00 0.00 0.00 0.00 0.00 0.00 0.78 3.60 0.03 1.78
09:02:00 sde 0.50 64.60 15.65 96.90 16.83 129.20 1.87 96.40 22.63 92.38 11.50 51.64 0.00 0.00 0.00 0.00 0.00 0.00 0.78 6.96 0.04 2.20
09:02:00 sdf 0.57 74.80 18.13 96.97 15.65 132.00 1.77 77.60 18.03 91.08 11.51 43.92 0.00 0.00 0.00 0.00 0.00 0.00 0.78 7.23 0.03 2.25
09:02:00 sdg 0.68 64.13 15.35 95.74 9.49 93.85 1.72 68.73 15.87 90.24 11.74 40.04 0.00 0.00 0.00 0.00 0.00 0.00 0.78 7.43 0.03 2.26
09:03:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 6.97 51.00 0.00 0.00 7200.34 7.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50.16 1.38
09:03:00 sdb 0.15 6.53 1.48 90.82 11.78 43.56 0.78 9.87 2.08 72.67 4.40 12.60 0.00 0.00 0.00 0.00 0.00 0.00 0.73 3.09 0.01 0.36
09:03:00 sdc 0.15 6.53 1.48 90.82 9.44 43.56 0.80 9.93 2.08 72.25 4.90 12.42 0.00 0.00 0.00 0.00 0.00 0.00 0.73 3.27 0.01 0.36
09:03:00 sdd 0.23 12.87 2.98 92.75 3.57 55.14 1.02 22.80 5.08 83.33 12.15 22.43 0.00 0.00 0.00 0.00 0.00 0.00 0.73 2.18 0.01 1.17
09:03:00 sde 0.45 17.73 3.98 89.85 1.52 39.41 1.37 29.27 6.35 82.29 13.65 21.41 0.00 0.00 0.00 0.00 0.00 0.00 0.73 5.34 0.02 1.60
09:03:00 sdf 0.37 11.40 2.48 87.13 5.59 31.09 1.13 16.33 3.35 74.72 13.82 14.41 0.00 0.00 0.00 0.00 0.00 0.00 0.73 3.98 0.02 1.60
09:03:00 sdg 0.47 21.67 4.95 91.38 3.11 46.43 1.17 16.60 3.38 74.36 14.50 14.23 0.00 0.00 0.00 0.00 0.00 0.00 0.73 5.23 0.02 1.58
09:04:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 1.88 11.93 0.00 0.00 1439.04 6.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.71 0.92
09:04:00 sdb 0.38 8.73 1.80 82.44 2.00 22.78 0.90 4.70 0.72 44.33 3.94 5.22 0.00 0.00 0.00 0.00 0.00 0.00 0.82 4.02 0.01 0.48
09:04:00 sdc 0.38 8.73 1.80 82.44 1.74 22.78 0.82 1.50 0.00 0.00 2.69 1.84 0.00 0.00 0.00 0.00 0.00 0.00 0.82 2.47 0.00 0.34
09:04:00 sdd 0.30 3.47 0.57 65.38 0.39 11.56 0.98 9.97 1.95 66.48 13.14 10.14 0.00 0.00 0.00 0.00 0.00 0.00 0.82 2.14 0.01 1.40
09:04:00 sde 0.00 0.00 0.00 0.00 0.00 0.00 1.28 13.43 2.52 66.23 13.51 10.47 0.00 0.00 0.00 0.00 0.00 0.00 0.82 4.53 0.02 1.73
09:04:00 sdf 0.08 5.27 1.23 93.67 3.40 63.20 1.12 4.97 0.57 33.66 14.93 4.45 0.00 0.00 0.00 0.00 0.00 0.00 0.82 3.55 0.02 1.74
09:04:00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.82 1.50 0.00 0.00 12.96 1.84 0.00 0.00 0.00 0.00 0.00 0.00 0.82 0.37 0.01 1.12
09:05:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 2.27 19.27 0.00 0.00 1670.65 8.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.79 1.40
09:05:00 sdb 0.32 4.27 0.75 70.31 0.42 13.47 0.85 8.00 1.55 64.58 3.61 9.41 0.00 0.00 0.00 0.00 0.00 0.00 0.70 3.98 0.01 0.40
09:05:00 sdc 0.42 13.33 2.92 87.50 4.12 32.00 0.87 14.27 3.10 78.15 3.88 16.46 0.00 0.00 0.00 0.00 0.00 0.00 0.70 4.33 0.01 0.52
09:05:00 sdd 0.35 5.67 1.07 75.29 0.76 16.19 0.87 11.73 2.47 74.00 12.62 13.54 0.00 0.00 0.00 0.00 0.00 0.00 0.70 2.86 0.01 1.19
09:05:00 sde 0.00 0.00 0.00 0.00 0.00 0.00 1.23 17.47 3.53 74.13 12.08 14.16 0.00 0.00 0.00 0.00 0.00 0.00 0.70 6.19 0.02 1.51
09:05:00 sdf 0.08 5.53 1.30 93.98 11.40 66.40 1.15 13.47 2.62 69.47 13.26 11.71 0.00 0.00 0.00 0.00 0.00 0.00 0.70 4.21 0.02 1.55
09:05:00 sdg 0.08 5.53 1.30 93.98 12.00 66.40 0.80 9.07 1.87 70.00 12.04 11.33 0.00 0.00 0.00 0.00 0.00 0.00 0.70 1.24 0.01 1.03
09:06:00 md127 0.20 0.80 0.00 0.00 23.17 4.00 3.03 113.80 0.00 0.00 1063.22 37.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.23 2.60
09:06:00 sdb 0.30 32.80 7.90 96.34 10.56 109.33 1.22 36.85 8.37 87.30 4.70 30.29 0.00 0.00 0.00 0.00 0.00 0.00 0.72 6.56 0.01 0.76
09:06:00 sdc 0.30 37.40 9.05 96.79 8.83 124.67 1.13 42.32 9.82 89.65 4.34 37.34 0.00 0.00 0.00 0.00 0.00 0.00 0.72 4.42 0.01 0.63
09:06:00 sdd 0.58 62.60 15.07 96.27 16.83 107.31 1.43 57.72 13.37 90.32 11.91 40.27 0.00 0.00 0.00 0.00 0.00 0.00 0.72 6.74 0.03 1.79
09:06:00 sde 0.47 64.73 15.72 97.12 10.14 138.71 1.48 45.18 10.18 87.29 11.97 30.46 0.00 0.00 0.00 0.00 0.00 0.00 0.72 7.09 0.03 1.78
09:06:00 sdf 0.35 42.53 10.28 96.71 15.48 121.52 1.48 46.72 10.57 87.69 14.58 31.49 0.00 0.00 0.00 0.00 0.00 0.00 0.72 7.40 0.03 1.97
09:06:00 sdg 0.27 31.93 7.72 96.66 16.38 119.75 1.18 52.58 12.33 91.25 11.63 44.44 0.00 0.00 0.00 0.00 0.00 0.00 0.72 3.70 0.02 1.40
09:07:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 2.10 13.93 0.00 0.00 1405.96 6.63 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.95 1.00
09:07:00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.97 3.36 0.32 24.68 3.19 3.47 0.00 0.00 0.00 0.00 0.00 0.00 0.80 3.54 0.01 0.39
09:07:00 sdc 0.15 7.47 1.72 91.96 0.44 49.78 1.17 13.43 2.63 69.30 4.29 11.51 0.00 0.00 0.00 0.00 0.00 0.00 0.80 4.31 0.01 0.48
09:07:00 sdd 0.15 7.47 1.72 91.96 0.44 49.78 1.00 11.43 2.30 69.70 11.55 11.43 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.00 0.01 1.23
09:07:00 sde 0.15 7.47 1.72 91.96 9.44 49.78 1.28 12.83 2.37 64.84 14.22 9.99 0.00 0.00 0.00 0.00 0.00 0.00 0.80 4.27 0.02 1.69
09:07:00 sdf 0.05 2.53 0.58 92.11 4.67 50.67 1.18 7.89 1.23 51.03 13.87 6.67 0.00 0.00 0.00 0.00 0.00 0.00 0.80 4.29 0.02 1.63
09:07:00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.43 0.00 0.00 13.48 1.78 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.08 0.01 1.14
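A small awk filter can pull the w_await outliers out of tables like this one. The field number (13) assumes the exact iostat -x column layout shown in the header row above; adjust it for other iostat versions. The sample rows are copied, truncated, from the 09:00:00 interval.

```shell
# Print device name and w_await for rows whose w_await (field 13 in the
# layout above) exceeds the given threshold, in milliseconds.
filter_high_w_await() {
    awk -v limit="$1" '$13 + 0 > limit {print $2, $13}'
}

# Two rows copied (truncated) from the 09:00:00 interval above:
printf '%s\n' \
  '09:00:00 md127 0.02 0.07 0.00 0.00 18.00 4.00 2.33 13.27 0.00 0.00 1268.36 5.69' \
  '09:00:00 sdb 0.10 3.80 0.85 89.47 2.50 38.00 0.85 1.57 0.00 0.00 3.00 1.84' |
  filter_high_w_await 1000
```

Run over the full log, this picks out only the md127 rows, which is exactly the asymmetry discussed below.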
--
Eyal at Home (eyal@eyal.emu.id.au)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: problem with recovered array
2023-11-04 0:48 ` eyal
@ 2023-11-04 1:01 ` Roger Heflin
2023-11-04 10:04 ` eyal
0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-04 1:01 UTC (permalink / raw)
To: eyal; +Cc: linux-raid
The await being so high on md127, but near normal on the underlying disks,
would seem to point at some kernel issue inside md's raid6 path when
one disk is out and the data has to be reconstructed.
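One generic way to probe this hypothesis (a debugging sketch, not something run in this thread) is to sample the kernel stack of the spinning flush worker and see whether it sits in md raid code. It needs root, and the thread name matches the kworker/u16:0+flush-9:127 seen in top earlier.

```shell
# Find the flush worker for md127 (device 9:127) and sample its kernel
# stack a few times. The [:] keeps the pattern from matching this
# pipeline's own command line if 'comm' is swapped for 'args'.
pid=$(ps -eo pid,comm | awk '/flush-9[:]127/ {print $1; exit}')
if [ -n "$pid" ]; then
    for i in 1 2 3; do
        cat "/proc/$pid/stack" 2>/dev/null   # readable by root only
        echo '---'
        sleep 2
    done
else
    echo 'no flush-9:127 worker found'
fi
```

Repeated samples landing in the same raid456/writeback functions would support the reconstruction-path theory; this is only a way to gather evidence, not a fix.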
On Fri, Nov 3, 2023 at 7:54 PM <eyal@eyal.emu.id.au> wrote:
>
> On 04/11/2023 09.38, eyal@eyal.emu.id.au wrote:
> > On 04/11/2023 02.57, Roger Heflin wrote:
> >> On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
> >>>
> >>> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> >>>> for the record, I do not think that any of the observations the OP made can be
> >>>> explained by non-pathological phenomena/patterns of behavior. Something is
> >>>> very clearly wrong with how this system behaves (the reported figures do not
> >>>> at all match the expected performance of even a degraded RAID6 array in my
> >>>> experience) and how data written to the filesystem apparently fails to make it
> >>>> into the backing devices in acceptable time.
> >>>>
> >>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> >>>> hardware failure", I think.
> >>>
> >>> Exactly. That's what I've been saying for months...
> >>>
> >>> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> >>> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> >>> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> >>> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> >>> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> >>> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
> >>>
> >>> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
> >>> blocked in D state and the tarball expansion proceeds so slowly that it'd take
> >>> days to complete (checked with quota).
> >>>
> >>> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
> >>> hundreds of rsync downloads every day, the same problem exists and I also get
> >>> frequent random rsync timeouts.
> >>>
> >>> This is all with raid6 and ext4. One of the machines has a journal disk in the
> >>> raid and the filesystem is mounted with nobarriers. Both show the same
> >>> behavior. It'd be interesting to try a different filesystem but these are
> >>> production machines with many disks and I cannot create another big array to
> >>> transfer the contents.
> >>
> >> My array is running 6.5 + xfs, and mine all seems to work normally
> >> (speed wise). And in the perf top call he ran all of the busy
> >> kworkers were ext4* calls spending a lot of time doing various
> >> filesystem work.
> >>
> >> I did find/debug a situation where dropping the cache caused ext4
> >> performance to be a disaster (large directories, lots of files). It
> >> was tracked back to ext4 relying on the Buffers: data space in
> >> /proc/meminfo for at least directory-entry caching: if there were a
> >> lot of directories and/or files in directories, Buffers: getting
> >> dropped and/or pruned for any reason caused the fragmented directory
> >> entries to have to be reloaded from a spinning disk, with the disk
> >> seeking for *MINUTES* to reload them
> [...]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: problem with recovered array
2023-11-04 1:01 ` Roger Heflin
@ 2023-11-04 10:04 ` eyal
0 siblings, 0 replies; 35+ messages in thread
From: eyal @ 2023-11-04 10:04 UTC (permalink / raw)
To: linux-raid
See update at the bottom.
On 04/11/2023 12.01, Roger Heflin wrote:
> The await being so high on md127, but normish on the underlying disks
> would seem to point at some kernel issue inside md using raid6 where
> one disk is out and the data has to be reconstructed.
Should I continue with this thread, or should I log a kernel bug report?
Or is there a more suitable list to post this to?
> On Fri, Nov 3, 2023 at 7:54 PM <eyal@eyal.emu.id.au> wrote:
>>
>> On 04/11/2023 09.38, eyal@eyal.emu.id.au wrote:
>>> On 04/11/2023 02.57, Roger Heflin wrote:
>>>> On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>>>>>
>>>>> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
>>>>>> for the record, I do not think that any of the observations the OP made can be
>>>>>> explained by non-pathological phenomena/patterns of behavior. Something is
>>>>>> very clearly wrong with how this system behaves (the reported figures do not
>>>>>> at all match the expected performance of even a degraded RAID6 array in my
>>>>>> experience) and how data written to the filesystem apparently fails to make it
>>>>>> into the backing devices in acceptable time.
>>>>>>
>>>>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>>>>>> hardware failure", I think.
>>>>>
>>>>> Exactly. That's what I've been saying for months...
>>>>>
>>>>> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
>>>>> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
>>>>> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
>>>>> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
>>>>> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
>>>>> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>>>>>
>>>>> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
>>>>> blocked in D state and the tarball expansion proceeds so slowly that it'd take
>>>>> days to complete (checked with quota).
>>>>>
>>>>> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
>>>>> hundreds of rsync downloads every day, the same problem exists and I also get
>>>>> frequent random rsync timeouts.
>>>>>
>>>>> This is all with raid6 and ext4. One of the machines has a journal disk in the
>>>>> raid and the filesystem is mounted with nobarriers. Both show the same
>>>>> behavior. It'd be interesting to try a different filesystem but these are
>>>>> production machines with many disks and I cannot create another big array to
>>>>> transfer the contents.
>>>>
>>>> My array is running 6.5 + xfs, and mine all seems to work normally
>>>> (speed wise). And in the perf top call he ran all of the busy
>>>> kworkers were ext4* calls spending a lot of time doing various
>>>> filesystem work.
>>>>
>>>> I did find/debug a situation where dumping the cache caused ext4
>>>> performance to be a disaster (large directories, lots of files). It
>>>> was tracked back to ext4 relies on the Buffers: data space in
>>>> /proc/meminfo for at least directory entry caching, and that if there
>>>> were a lot of directories and/or files in directories that Buffer:
>>>> getting dropped and/or getting pruned for any some reason caused the
>>>> fragmented directory entries to have to get reloaded from a spinning
>>>> disk and require the disk to be seeking for *MINUTES* to reload it
>>>> (there were in this case several million files in a couple of
>>>> directories with the directory entries being allocated over time so
>>>> very likely heavily fragmented).
>>>>
>>>> I wonder if there was some change with how Buffers is
>>>> used/sized/pruned in the recent kernels. The same drop_cache on an
>>>> XFS filesystem had no effect that I could identify and doing a ls -lR
>>>> on a big xfs filesystem does not make Buffers grow, but doing the same
>>>> ls -lR against an ext3/4 makes Buffers grow quite a bit (how much
>>>> depends on how many files/directories are on the filesystem).
>>>>
>>>> He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
>>>> see if the poor performance correlates with Buffers suddenly being
>>>> smaller for some reason.
>>>
>>> As much as I hate this, I started another small test.
>>>
>>> $ uname -a
>>> Linux e7.eyal.emu.id.au 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
>>>
>>> $ df -h /data1
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/md127 55T 45T 9.8T 83% /data1
>>>
>>> $ sudo du -sm /data2/no-backup/old-backups/tapes/01
>>> 2519 /data2/no-backup/old-backups/tapes/01
>>>
>>> $ sudo find /data2/no-backup/old-backups/tapes/01|wc -l
>>> 92059
>>>
>>> $ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/01 /data1/no-backup/old-backups/
>>>
>>> It completed in about one minute and it was enough to trigger the problem.
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 2781097 root 20 0 0 0 0 R 98.3 0.0 36:04.35 kworker/u16:0+flush-9:127
>>>
>>> iostat shows nothing unusual: writes to md127 at 10-20KB/s, and %util in the low single digits for all members.
>>>
>>> Here is what meminfo showed in the first 10 minutes (it is still going but the trend is clear):
>>>
>>>                        Dirty   change    Buffers   change    (values in kB)
>>> 2023-11-04 08:56:35 44 -580 1170536 8
>>> 2023-11-04 08:56:45 48 4 1170552 16
>>> 2023-11-04 08:56:55 812456 812408 1171008 456
>>> 2023-11-04 08:57:05 538436 -274020 1180820 9812
>>> 2023-11-04 08:57:15 698708 160272 1189368 8548
>>> 2023-11-04 08:57:25 874208 175500 1195620 6252
>>> 2023-11-04 08:57:35 742300 -131908 1202124 6504
>>> 2023-11-04 08:57:45 973528 231228 1209580 7456
>>> 2023-11-04 08:57:55 1269320 295792 1214900 5320
>>> 2023-11-04 08:58:05 1624428 355108 1219764 4864
>>> 2023-11-04 08:58:15 1629484 5056 1219816 52
>>> 2023-11-04 08:58:25 1629372 -112 1219832 16
>>> 2023-11-04 08:58:35 1629028 -344 1219856 24
>>> 2023-11-04 08:58:45 1628928 -100 1219880 24
>>> 2023-11-04 08:58:55 1628552 -376 1219908 28
>>> 2023-11-04 08:59:05 1629252 700 1220072 164
>>> 2023-11-04 08:59:15 1628696 -556 1220132 60
>>> 2023-11-04 08:59:25 1628304 -392 1220156 24
>>> 2023-11-04 08:59:35 1628264 -40 1220188 32
>>> 2023-11-04 08:59:45 1628184 -80 1220340 152
>>> 2023-11-04 08:59:55 1628144 -40 1220364 24
>>> 2023-11-04 09:00:05 1628124 -20 1219940 -424
>>> 2023-11-04 09:00:15 1627908 -216 1219976 36
>>> 2023-11-04 09:00:25 1627840 -68 1220000 24
>>> 2023-11-04 09:00:35 1624276 -3564 1220024 24
>>> 2023-11-04 09:00:45 1624100 -176 1220060 36
>>> 2023-11-04 09:00:55 1623912 -188 1220092 32
>>> 2023-11-04 09:01:05 1624076 164 1220112 20
>>> 2023-11-04 09:01:15 1623368 -708 1220160 48
>>> 2023-11-04 09:01:25 1623176 -192 1220196 36
>>> 2023-11-04 09:01:35 1621872 -1304 1220232 36
>>> 2023-11-04 09:01:45 1621732 -140 1220308 76
>>> 2023-11-04 09:01:55 1612304 -9428 1220392 84
>>> 2023-11-04 09:02:05 1612256 -48 1220420 28
>>> 2023-11-04 09:02:15 1612040 -216 1220444 24
>>> 2023-11-04 09:02:25 1611968 -72 1220476 32
>>> 2023-11-04 09:02:35 1611872 -96 1220492 16
>>> 2023-11-04 09:02:45 1609932 -1940 1220524 32
>>> 2023-11-04 09:02:55 1609828 -104 1220556 32
>>> 2023-11-04 09:03:05 1609916 88 1220572 16
>>> 2023-11-04 09:03:15 1609496 -420 1220608 36
>>> 2023-11-04 09:03:25 1609392 -104 1220632 24
>>> 2023-11-04 09:03:35 1609320 -72 1220648 16
>>> 2023-11-04 09:03:45 1609240 -80 1220672 24
>>> 2023-11-04 09:03:55 1609152 -88 1220688 16
>>> 2023-11-04 09:04:05 1609332 180 1220712 24
>>> 2023-11-04 09:04:15 1608892 -440 1220748 36
>>> 2023-11-04 09:04:25 1608848 -44 1220764 16
>>> 2023-11-04 09:04:35 1608744 -104 1220796 32
>>> 2023-11-04 09:04:45 1608436 -308 1220820 24
>>> 2023-11-04 09:04:55 1607916 -520 1220836 16
>>> 2023-11-04 09:05:05 1608624 708 1220876 40
>>> 2023-11-04 09:05:15 1606556 -2068 1220928 52
>>> 2023-11-04 09:05:25 1602692 -3864 1221016 88
>>> 2023-11-04 09:05:35 1602080 -612 1221052 36
>>> 2023-11-04 09:05:45 1602000 -80 1221080 28
>>> 2023-11-04 09:05:55 1601928 -72 1221096 16
>>> 2023-11-04 09:06:05 1602228 300 1221124 28
>>> 2023-11-04 09:06:15 1601848 -380 1221156 32
>>> 2023-11-04 09:06:25 1601656 -192 1221180 24
>>> 2023-11-04 09:06:35 1601532 -124 1221212 32
>>> 2023-11-04 09:06:45 1601476 -56 1221228 16
>>> 2023-11-04 09:06:55 1601364 -112 1221252 24
>>
>> Another view of interest is w_await, which I understand is the total time to service a write request:
>> on md127 it is high, at 1-2s from this point onward, while the member disks are significantly quicker,
>> in the low teens of milliseconds.
>>
>> Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
>> 09:00:00 md127 0.02 0.07 0.00 0.00 18.00 4.00 2.33 13.27 0.00 0.00 1268.36 5.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.96 1.12
>> 09:00:00 sdb 0.10 3.80 0.85 89.47 2.50 38.00 0.85 1.57 0.00 0.00 3.00 1.84 0.00 0.00 0.00 0.00 0.00 0.00 0.83 2.82 0.01 0.35
>> 09:00:00 sdc 0.13 3.93 0.85 86.44 6.00 29.50 1.00 4.17 0.50 33.33 4.38 4.17 0.00 0.00 0.00 0.00 0.00 0.00 0.83 4.72 0.01 0.53
>> 09:00:00 sdd 0.13 3.93 0.85 86.44 7.12 29.50 1.05 3.70 0.33 24.10 13.78 3.52 0.00 0.00 0.00 0.00 0.00 0.00 0.83 2.90 0.02 1.54
>> 09:00:00 sde 0.12 3.87 0.85 87.93 2.29 33.14 0.83 1.50 0.00 0.00 14.02 1.80 0.00 0.00 0.00 0.00 0.00 0.00 0.83 0.42 0.01 1.27
>> 09:00:00 sdf 0.03 0.13 0.00 0.00 12.50 4.00 1.37 14.63 2.75 66.80 13.83 10.71 0.00 0.00 0.00 0.00 0.00 0.00 0.83 4.88 0.02 1.84
>> 09:00:00 sdg 0.02 0.07 0.00 0.00 21.00 4.00 1.40 14.77 2.75 66.27 13.45 10.55 0.00 0.00 0.00 0.00 0.00 0.00 0.83 4.96 0.02 1.77
>> 09:01:00 md127 0.05 0.20 0.00 0.00 26.00 4.00 7.85 83.00 0.00 0.00 2388.76 10.57 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 18.75 1.74
>> 09:01:00 sdb 0.23 15.73 3.70 94.07 20.36 67.43 0.98 29.12 6.68 87.17 5.75 29.62 0.00 0.00 0.00 0.00 0.00 0.00 0.67 7.75 0.02 0.73
>> 09:01:00 sdc 0.20 10.33 2.38 92.26 12.08 51.67 0.77 9.93 2.10 73.26 3.20 12.95 0.00 0.00 0.00 0.00 0.00 0.00 0.67 3.33 0.01 0.36
>> 09:01:00 sdd 0.23 18.60 4.42 94.98 23.07 79.71 1.03 22.38 4.93 82.68 11.61 21.66 0.00 0.00 0.00 0.00 0.00 0.00 0.67 3.62 0.02 1.31
>> 09:01:00 sde 0.22 18.20 4.33 95.24 19.62 84.00 0.98 26.18 5.93 85.78 11.61 26.63 0.00 0.00 0.00 0.00 0.00 0.00 0.67 3.40 0.02 1.35
>> 09:01:00 sdf 0.18 18.67 4.48 96.07 42.64 101.82 1.47 49.72 11.33 88.54 12.03 33.90 0.00 0.00 0.00 0.00 0.00 0.00 0.67 7.67 0.03 1.62
>> 09:01:00 sdg 0.18 19.27 4.63 96.19 39.00 105.09 1.48 46.98 10.63 87.76 12.37 31.67 0.00 0.00 0.00 0.00 0.00 0.00 0.67 7.35 0.03 1.68
>> 09:02:00 md127 0.32 1.27 0.00 0.00 26.95 4.00 14.12 214.93 0.00 0.00 639.34 15.23 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9.03 2.80
>> 09:02:00 sdb 0.48 51.47 12.38 96.24 18.03 106.48 1.20 63.79 15.13 92.65 3.72 53.16 0.00 0.00 0.00 0.00 0.00 0.00 0.78 3.79 0.02 0.79
>> 09:02:00 sdc 0.53 58.07 13.98 96.33 16.16 108.88 1.30 71.99 17.08 92.93 4.95 55.38 0.00 0.00 0.00 0.00 0.00 0.00 0.78 7.15 0.02 1.08
>> 09:02:00 sdd 0.52 69.13 16.77 97.01 14.13 133.81 1.52 92.27 21.95 93.54 10.38 60.84 0.00 0.00 0.00 0.00 0.00 0.00 0.78 3.60 0.03 1.78
>> 09:02:00 sde 0.50 64.60 15.65 96.90 16.83 129.20 1.87 96.40 22.63 92.38 11.50 51.64 0.00 0.00 0.00 0.00 0.00 0.00 0.78 6.96 0.04 2.20
>> 09:02:00 sdf 0.57 74.80 18.13 96.97 15.65 132.00 1.77 77.60 18.03 91.08 11.51 43.92 0.00 0.00 0.00 0.00 0.00 0.00 0.78 7.23 0.03 2.25
>> 09:02:00 sdg 0.68 64.13 15.35 95.74 9.49 93.85 1.72 68.73 15.87 90.24 11.74 40.04 0.00 0.00 0.00 0.00 0.00 0.00 0.78 7.43 0.03 2.26
>> 09:03:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 6.97 51.00 0.00 0.00 7200.34 7.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50.16 1.38
>> 09:03:00 sdb 0.15 6.53 1.48 90.82 11.78 43.56 0.78 9.87 2.08 72.67 4.40 12.60 0.00 0.00 0.00 0.00 0.00 0.00 0.73 3.09 0.01 0.36
>> 09:03:00 sdc 0.15 6.53 1.48 90.82 9.44 43.56 0.80 9.93 2.08 72.25 4.90 12.42 0.00 0.00 0.00 0.00 0.00 0.00 0.73 3.27 0.01 0.36
>> 09:03:00 sdd 0.23 12.87 2.98 92.75 3.57 55.14 1.02 22.80 5.08 83.33 12.15 22.43 0.00 0.00 0.00 0.00 0.00 0.00 0.73 2.18 0.01 1.17
>> 09:03:00 sde 0.45 17.73 3.98 89.85 1.52 39.41 1.37 29.27 6.35 82.29 13.65 21.41 0.00 0.00 0.00 0.00 0.00 0.00 0.73 5.34 0.02 1.60
>> 09:03:00 sdf 0.37 11.40 2.48 87.13 5.59 31.09 1.13 16.33 3.35 74.72 13.82 14.41 0.00 0.00 0.00 0.00 0.00 0.00 0.73 3.98 0.02 1.60
>> 09:03:00 sdg 0.47 21.67 4.95 91.38 3.11 46.43 1.17 16.60 3.38 74.36 14.50 14.23 0.00 0.00 0.00 0.00 0.00 0.00 0.73 5.23 0.02 1.58
>> 09:04:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 1.88 11.93 0.00 0.00 1439.04 6.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.71 0.92
>> 09:04:00 sdb 0.38 8.73 1.80 82.44 2.00 22.78 0.90 4.70 0.72 44.33 3.94 5.22 0.00 0.00 0.00 0.00 0.00 0.00 0.82 4.02 0.01 0.48
>> 09:04:00 sdc 0.38 8.73 1.80 82.44 1.74 22.78 0.82 1.50 0.00 0.00 2.69 1.84 0.00 0.00 0.00 0.00 0.00 0.00 0.82 2.47 0.00 0.34
>> 09:04:00 sdd 0.30 3.47 0.57 65.38 0.39 11.56 0.98 9.97 1.95 66.48 13.14 10.14 0.00 0.00 0.00 0.00 0.00 0.00 0.82 2.14 0.01 1.40
>> 09:04:00 sde 0.00 0.00 0.00 0.00 0.00 0.00 1.28 13.43 2.52 66.23 13.51 10.47 0.00 0.00 0.00 0.00 0.00 0.00 0.82 4.53 0.02 1.73
>> 09:04:00 sdf 0.08 5.27 1.23 93.67 3.40 63.20 1.12 4.97 0.57 33.66 14.93 4.45 0.00 0.00 0.00 0.00 0.00 0.00 0.82 3.55 0.02 1.74
>> 09:04:00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.82 1.50 0.00 0.00 12.96 1.84 0.00 0.00 0.00 0.00 0.00 0.00 0.82 0.37 0.01 1.12
>> 09:05:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 2.27 19.27 0.00 0.00 1670.65 8.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.79 1.40
>> 09:05:00 sdb 0.32 4.27 0.75 70.31 0.42 13.47 0.85 8.00 1.55 64.58 3.61 9.41 0.00 0.00 0.00 0.00 0.00 0.00 0.70 3.98 0.01 0.40
>> 09:05:00 sdc 0.42 13.33 2.92 87.50 4.12 32.00 0.87 14.27 3.10 78.15 3.88 16.46 0.00 0.00 0.00 0.00 0.00 0.00 0.70 4.33 0.01 0.52
>> 09:05:00 sdd 0.35 5.67 1.07 75.29 0.76 16.19 0.87 11.73 2.47 74.00 12.62 13.54 0.00 0.00 0.00 0.00 0.00 0.00 0.70 2.86 0.01 1.19
>> 09:05:00 sde 0.00 0.00 0.00 0.00 0.00 0.00 1.23 17.47 3.53 74.13 12.08 14.16 0.00 0.00 0.00 0.00 0.00 0.00 0.70 6.19 0.02 1.51
>> 09:05:00 sdf 0.08 5.53 1.30 93.98 11.40 66.40 1.15 13.47 2.62 69.47 13.26 11.71 0.00 0.00 0.00 0.00 0.00 0.00 0.70 4.21 0.02 1.55
>> 09:05:00 sdg 0.08 5.53 1.30 93.98 12.00 66.40 0.80 9.07 1.87 70.00 12.04 11.33 0.00 0.00 0.00 0.00 0.00 0.00 0.70 1.24 0.01 1.03
>> 09:06:00 md127 0.20 0.80 0.00 0.00 23.17 4.00 3.03 113.80 0.00 0.00 1063.22 37.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.23 2.60
>> 09:06:00 sdb 0.30 32.80 7.90 96.34 10.56 109.33 1.22 36.85 8.37 87.30 4.70 30.29 0.00 0.00 0.00 0.00 0.00 0.00 0.72 6.56 0.01 0.76
>> 09:06:00 sdc 0.30 37.40 9.05 96.79 8.83 124.67 1.13 42.32 9.82 89.65 4.34 37.34 0.00 0.00 0.00 0.00 0.00 0.00 0.72 4.42 0.01 0.63
>> 09:06:00 sdd 0.58 62.60 15.07 96.27 16.83 107.31 1.43 57.72 13.37 90.32 11.91 40.27 0.00 0.00 0.00 0.00 0.00 0.00 0.72 6.74 0.03 1.79
>> 09:06:00 sde 0.47 64.73 15.72 97.12 10.14 138.71 1.48 45.18 10.18 87.29 11.97 30.46 0.00 0.00 0.00 0.00 0.00 0.00 0.72 7.09 0.03 1.78
>> 09:06:00 sdf 0.35 42.53 10.28 96.71 15.48 121.52 1.48 46.72 10.57 87.69 14.58 31.49 0.00 0.00 0.00 0.00 0.00 0.00 0.72 7.40 0.03 1.97
>> 09:06:00 sdg 0.27 31.93 7.72 96.66 16.38 119.75 1.18 52.58 12.33 91.25 11.63 44.44 0.00 0.00 0.00 0.00 0.00 0.00 0.72 3.70 0.02 1.40
>> 09:07:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 2.10 13.93 0.00 0.00 1405.96 6.63 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.95 1.00
>> 09:07:00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.97 3.36 0.32 24.68 3.19 3.47 0.00 0.00 0.00 0.00 0.00 0.00 0.80 3.54 0.01 0.39
>> 09:07:00 sdc 0.15 7.47 1.72 91.96 0.44 49.78 1.17 13.43 2.63 69.30 4.29 11.51 0.00 0.00 0.00 0.00 0.00 0.00 0.80 4.31 0.01 0.48
>> 09:07:00 sdd 0.15 7.47 1.72 91.96 0.44 49.78 1.00 11.43 2.30 69.70 11.55 11.43 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.00 0.01 1.23
>> 09:07:00 sde 0.15 7.47 1.72 91.96 9.44 49.78 1.28 12.83 2.37 64.84 14.22 9.99 0.00 0.00 0.00 0.00 0.00 0.00 0.80 4.27 0.02 1.69
>> 09:07:00 sdf 0.05 2.53 0.58 92.11 4.67 50.67 1.18 7.89 1.23 51.03 13.87 6.67 0.00 0.00 0.00 0.00 0.00 0.00 0.80 4.29 0.02 1.63
>> 09:07:00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.43 0.00 0.00 13.48 1.78 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.08 0.01 1.14
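To watch just that w_await column over time instead of scanning full iostat tables by eye, a small filter works; it locates the column by its name in the "Device" header line, so it is not tied to one sysstat version's layout (a sketch; the function name is my own, and the sd[b-g] names follow this array's members):

```shell
#!/bin/sh
# Print only the device name and its w_await value for md127 and its
# member disks, from `iostat -x` output on stdin. The w_await column
# index is discovered from the header line rather than hard-coded.
# Sketch only; function name and device-name pattern are assumptions.
extract_w_await() {
    awk '
        /^Device/ { for (i = 1; i <= NF; i++) if ($i == "w_await") col = i; next }
        col && $1 ~ /^(md127|sd[bcdefg])$/ { print $1, $col }
    '
}

# Usage:
#   iostat -x 60 | extract_w_await
```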
It finally completed with a bang, and the flush kworker thread is gone.
                       Dirty   change    Buffers   change    (values in kB)
2023-11-04 20:42:47 117128 -8 1407904 12
2023-11-04 20:42:57 117068 -60 1407944 40
2023-11-04 20:43:07 117096 28 1407972 28
2023-11-04 20:43:17 116972 -124 1408000 28
2023-11-04 20:43:27 116864 -108 1408032 32
2023-11-04 20:43:37 116664 -200 1408056 24
2023-11-04 20:43:47 116664 0 1408092 36
2023-11-04 20:43:57 116468 -196 1408132 40
2023-11-04 20:44:07 116480 12 1408168 36
2023-11-04 20:44:17 116364 -116 1408200 32
2023-11-04 20:44:27 116284 -80 1408228 28
2023-11-04 20:44:37 116284 0 1408268 40
2023-11-04 20:44:47 115956 -328 1408300 32
2023-11-04 20:44:57 115804 -152 1408324 24
2023-11-04 20:45:07 115680 -124 1408356 32
2023-11-04 20:45:17 116216 536 1408392 36
2023-11-04 20:45:27 115656 -560 1408432 40
2023-11-04 20:45:37 115744 88 1408460 28
2023-11-04 20:45:47 115348 -396 1408496 36
2023-11-04 20:45:57 115948 600 1408528 32
2023-11-04 20:46:07 428 -115520 1408580 52
2023-11-04 20:46:17 364 -64 1408592 12
2023-11-04 20:46:27 608 244 1408612 20
2023-11-04 20:46:37 808 200 1408636 24
So it had been running since around 9am: about 12 hours!
The final iostat:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
20:45:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 2.12 14.40 0.00 0.00 1402.68 6.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.97 0.69
20:45:00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.43 0.00 0.00 2.65 1.78 0.00 0.00 0.00 0.00 0.00 0.00 0.80 2.42 0.00 0.28
20:45:00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.80 1.43 0.00 0.00 2.35 1.78 0.00 0.00 0.00 0.00 0.00 0.00 0.80 2.10 0.00 0.26
20:45:00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 1.15 5.76 0.73 38.94 14.84 5.01 0.00 0.00 0.00 0.00 0.00 0.00 0.80 4.08 0.02 1.79
20:45:00 sde 0.03 1.53 0.35 91.30 0.50 46.00 0.98 11.49 2.33 70.35 13.66 11.69 0.00 0.00 0.00 0.00 0.00 0.00 0.80 3.33 0.02 1.37
20:45:00 sdf 0.03 1.53 0.35 91.30 8.00 46.00 1.33 15.82 3.07 69.70 14.16 11.87 0.00 0.00 0.00 0.00 0.00 0.00 0.80 5.81 0.02 1.89
20:45:00 sdg 0.03 1.53 0.35 91.30 8.00 46.00 1.33 15.82 3.07 69.70 13.65 11.87 0.00 0.00 0.00 0.00 0.00 0.00 0.80 7.00 0.02 1.82
20:46:00 md127 0.02 0.07 0.00 0.00 17.00 4.00 2.35 25.20 0.00 0.00 2141.33 10.72 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.03 1.15
20:46:00 sdb 0.12 5.73 1.32 91.86 12.14 49.14 0.75 12.79 2.83 79.07 2.91 17.06 0.00 0.00 0.00 0.00 0.00 0.00 0.68 2.88 0.01 0.32
20:46:00 sdc 0.12 5.73 1.32 91.86 14.14 49.14 0.72 3.46 0.53 42.67 3.65 4.83 0.00 0.00 0.00 0.00 0.00 0.00 0.68 3.54 0.01 0.36
20:46:00 sdd 0.10 4.60 1.05 91.30 11.00 46.00 0.82 6.39 1.17 58.82 13.76 7.83 0.00 0.00 0.00 0.00 0.00 0.00 0.68 2.76 0.01 1.22
20:46:00 sde 0.07 5.47 1.30 95.12 6.00 82.00 1.08 15.26 3.12 74.21 11.97 14.08 0.00 0.00 0.00 0.00 0.00 0.00 0.68 3.29 0.02 1.26
20:46:00 sdf 0.07 5.47 1.30 95.12 0.50 82.00 1.18 21.86 4.67 79.77 12.14 18.47 0.00 0.00 0.00 0.00 0.00 0.00 0.68 4.27 0.02 1.34
20:46:00 sdg 0.18 10.13 2.35 92.76 5.09 55.27 1.12 19.53 4.15 78.80 12.21 17.49 0.00 0.00 0.00 0.00 0.00 0.00 0.68 4.66 0.02 1.39
20:47:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 256.68 1918.53 0.00 0.00 1094.72 7.47 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 281.00 7.49
20:47:00 sdb 14.75 382.00 80.75 84.55 89.68 25.90 16.43 557.67 123.10 88.22 7.65 33.94 0.00 0.00 0.00 0.00 0.00 0.00 0.33 5.25 1.45 4.17
20:47:00 sdc 20.15 596.60 129.00 86.49 144.70 29.61 15.37 558.88 124.47 89.01 18.56 36.37 0.00 0.00 0.00 0.00 0.00 0.00 0.33 14.50 3.21 6.11
20:47:00 sdd 20.92 613.27 132.40 86.36 330.56 29.32 15.65 544.01 120.47 88.50 37.67 34.76 0.00 0.00 0.00 0.00 0.00 0.00 0.32 41.16 7.52 8.31
20:47:00 sde 20.07 598.07 129.45 86.58 128.61 29.80 16.03 587.81 131.03 89.10 12.36 36.66 0.00 0.00 0.00 0.00 0.00 0.00 0.32 5.89 2.78 5.96
20:47:00 sdf 19.93 562.07 120.58 85.81 122.18 28.20 16.92 570.48 125.82 88.15 11.14 33.72 0.00 0.00 0.00 0.00 0.00 0.00 0.32 13.47 2.63 5.82
20:47:00 sdg 14.47 345.53 71.92 83.25 79.60 23.88 16.88 563.94 124.22 88.03 9.24 33.40 0.00 0.00 0.00 0.00 0.00 0.00 0.32 6.95 1.31 4.42
20:48:00 md127 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:48:00 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:48:00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:48:00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:48:00 sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:48:00 sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:48:00 sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I expect to receive a new disk early next week and I hope the issue goes away when the array is not degraded.
--
Eyal at Home (eyal@eyal.emu.id.au)