* problem with recovered array
@ 2023-10-30 13:35 eyal
  2023-10-30 16:14 ` Roger Heflin
  2023-10-31 21:44 ` problem with recovered array [more details] eyal
  0 siblings, 2 replies; 35+ messages in thread
From: eyal @ 2023-10-30 13:35 UTC (permalink / raw)
  To: linux-raid

F38

I know this is a bit long but I wanted to provide as much detail as I thought was needed.

I have a 7-member raid6. The other day I needed to send a disk for replacement.
I have done this before and all looked well. The array is now degraded until I get the new disk.

At one point my system got into trouble, I am not sure why, but it started to respond
very slowly to opening/closing files, or even to keystrokes. In the end I decided to reboot.
It refused to complete the shutdown and after a while I used the sysrq feature to force it.

On the restart it dropped into the emergency shell, with all the array members listed as spares.
I tried to '--run' the array but mdadm refused with 'cannot start dirty degraded array',
though the array was now listed in mdstat and looked as expected.

Since mdadm suggested I use '--force', I did so:
	mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1

Q0) was I correct to use this command?

2023-10-30T01:08:25+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
2023-10-30T01:08:25+1100 kernel: md127: detected capacity change from 0 to 117187522560
2023-10-30T01:08:25+1100 kernel: md: requested-resync of RAID array md127

Q1) What does this last line mean?

Now that the array came up I still could not mount the fs (still in the emergency shell).
I rebooted and all came up: the array was there, the fs was mounted, and so far
I have not noticed any issues with the fs.

However, it is not perfect. I tried to copy some data from an external (USB) disk to the array
and it went very slowly (as in 10KB/s, the USB could do 120MB/s). The copy (rsync) was running
at 100% CPU which is unexpected. I then stopped it. As a test, I rsync'ed the USB disk to another
SATA disk on the server and it went fast, so the USB disk is OK.

I then noticed (in 'top') that there is a kworker running at 100% CPU:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  944760 root      20   0       0      0      0 R 100.0   0.0 164:00.85 kworker/u16:3+flush-9:127

It ran like this for many hours and I do not know what it is doing.

Q2) what does this worker do?
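(I suppose I could at least sample what the worker is doing by looking at its kernel
stack, something like
	sudo cat /proc/944760/stack
	ps -o pid,wchan:32,comm -p 944760
if that is a reasonable way to inspect a kworker.)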

I also noticed that mdstat shows a high bitmap usage:

Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sde1[4] sdg1[6] sdf1[5] sdd1[7] sdb1[8] sdc1[9]
       58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
       bitmap: 87/88 pages [348KB], 65536KB chunk

Q3) Is this OK? Should the usage go down? It does not change at all.
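(If it helps, I assume the bitmap itself can be dumped from a member device with
something like
	sudo mdadm --examine-bitmap /dev/sdb1
though I have not tried that yet.)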

While looking at everything, I started iostat on md127 and I see that there is a constant
trickle of writes, about 5KB/s. There is no activity on this fs.
Also, I see similar activity on all the members, at the same rate, so md127 does not show
6 times the members' activity. I guess this is just how md works?

Q4) What is this write activity? Is it related to the abovementioned 'requested-resync'?
	If this is a background thing, how can I monitor it?
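	(If it is such a background operation, I assume I can watch it with something like
		cat /proc/mdstat
		cat /sys/block/md127/md/sync_action      # idle / resync / check, I believe
		cat /sys/block/md127/md/sync_completed   # progress, when something is running
		mdadm --detail /dev/md127
	but please correct me if these do not reflect a bitmap-driven resync.)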

Q5) Finally, will the array come up (degraded) if I reboot, or will I need to coerce it to start?
	What is the correct way to bring up a degraded array? What about the 'dirty' part?
'mdadm -D /dev/md127' mentions 'sync':
     Number   Major   Minor   RaidDevice State
        -       0        0        0      removed
        8       8       17        1      active sync   /dev/sdb1
        9       8       33        2      active sync   /dev/sdc1
        7       8       49        3      active sync   /dev/sdd1
        4       8       65        4      active sync   /dev/sde1
        5       8       81        5      active sync   /dev/sdf1
        6       8       97        6      active sync   /dev/sdg1
Is this related?

BTW I plan to run a 'check' at some point.
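(I assume the usual way is via the md sysfs node, something like
	echo check > /sys/block/md127/md/sync_action
and then looking at /sys/block/md127/md/mismatch_cnt when it finishes;
please correct me if that is not the recommended method.)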

TIA

-- 
Eyal at Home (eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-30 13:35 problem with recovered array eyal
@ 2023-10-30 16:14 ` Roger Heflin
  2023-10-31  2:35   ` eyal
  2023-10-31  3:21   ` Carlos Carvalho
  2023-10-31 21:44 ` problem with recovered array [more details] eyal
  1 sibling, 2 replies; 35+ messages in thread
From: Roger Heflin @ 2023-10-30 16:14 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

Look at 'sar -d' output for all the disks in the raid6.   It may be a
disk issue (though I suspect not, given the 100% CPU shown in the raid).

Clearly something very expensive/deadlockish is happening because of
the raid having to rebuild the data from the missing disk; not sure
what could be wrong with it.

If there is a newer kernel version available, upgrading and
rebooting may change something.  If you are already close to the newest
kernel version, try going back a minor version (if on 6.5, go back to the
last 6.4 kernel).

You might also install the perf package and run "perf top" to see
what sorts of calls the kernel is spending all of its time in.

On Mon, Oct 30, 2023 at 8:44 AM <eyal@eyal.emu.id.au> wrote:
>
> F38
>
> I know this is a bit long but I wanted to provide as much detail as I thought needed.
>
> I have a 7-member raid6. The other day I needed to send a disk for replacement.
> I have done this before and all looked well. The array is now degraded until I get the new disk.
>
> At one point my system got into trouble and I am not sure why, but it started to have
> very slow response to open/close files, or even keystrokes. At the end I decided to reboot.
> It refused to complete the shutdown and after a while I used the sysrq feature for this.
>
> On the restart it dropped into emergency shell, the array had all members listed as spares.
> I tried to '--run' the array but mdadm refused 'cannot start dirty degraded array'
> though the array was now listed in mdstat and looked as expected.
>
> Since mdadm suggested I use --force', I did so
>         mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1
>
> Q0) was I correct to use this command?
>
> 2023-10-30T01:08:25+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
> 2023-10-30T01:08:25+1100 kernel: md127: detected capacity change from 0 to 117187522560
> 2023-10-30T01:08:25+1100 kernel: md: requested-resync of RAID array md127
>
> Q1) What does this last line mean?
>
> Now that the array came up I still could not mount the fs (still in emergency shell).
> I rebooted and all came up, the array was there and the fs was mounted and so far
> I did not notice any issues with the fs.
>
> However, it is not perfect. I tried to copy some data from an external (USB) disk to the array
> and it went very slowly (as in 10KB/s, the USB could do 120MB/s). The copy (rsync) was running
> at 100% CPU which is unexpected. I then stopped it. As a test, I rsync'ed the USB disk to another
> SATA disk on the server and it went fast, so the USB disk is OK.
>
> I then noticed (in 'top') that there is a kworker running at 100% CPU:
>
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   944760 root      20   0       0      0      0 R 100.0   0.0 164:00.85 kworker/u16:3+flush-9:127
>
> It did it for many hours and I do not know what it is doing.
>
> Q2) what does this worker do?
>
> I also noticed that mdstat shows a high bitmap usage:
>
> Personalities : [raid6] [raid5] [raid4]
> md127 : active raid6 sde1[4] sdg1[6] sdf1[5] sdd1[7] sdb1[8] sdc1[9]
>        58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
>        bitmap: 87/88 pages [348KB], 65536KB chunk
>
> Q3) Is this OK? Should the usage go down? It does not change at all.
>
> While looking at everything, I started iostat on md127 and I see that there is a constant
> trickle of writes, about 5KB/s. There is no activity on this fs.
> Also, I see similar activity on all the members, at the same rate, so md127 does not show
> 6 times the members activity. I guess this is just how md works?
>
> Q4) What is this write activity? Is it related to the abovementioned 'requested-resync'?
>         If this is a background thing, how can I monitor it?
>
> Q5) Finally, will the array come up (degraded) if I reboot or will I need to coerse it to start?
>         What is the correct way to bring up a degraded array? What about the 'dirty' part?
> 'mdadm -D /dev/md127' mention'sync':
>      Number   Major   Minor   RaidDevice State
>         -       0        0        0      removed
>         8       8       17        1      active sync   /dev/sdb1
>         9       8       33        2      active sync   /dev/sdc1
>         7       8       49        3      active sync   /dev/sdd1
>         4       8       65        4      active sync   /dev/sde1
>         5       8       81        5      active sync   /dev/sdf1
>         6       8       97        6      active sync   /dev/sdg1
> Is this related?
>
> BTW I plan to run a 'check' at some point.
>
> TIA
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-30 16:14 ` Roger Heflin
@ 2023-10-31  2:35   ` eyal
  2023-10-31  2:50     ` eyal
  2023-10-31  3:21   ` Carlos Carvalho
  1 sibling, 1 reply; 35+ messages in thread
From: eyal @ 2023-10-31  2:35 UTC (permalink / raw)
  To: linux-raid

Thanks Roger,

On 31/10/2023 03.14, Roger Heflin wrote:
> look at  SAR -d output for all the disks in the raid6.   It may be a
> disk issue (though I suspect not given the 100% cpu show in raid).

It seems that sar is installed but not active here. The latest data is from 2017...
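For now I can run it ad hoc, e.g.
	sar -d -p 1 10
(or 'iostat -x 1') to get live per-disk numbers, and I assume
'systemctl enable --now sysstat' is what turns the regular collection back on here.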

> Clearly something very expensive/deadlockish is happening because of
> the raid having to rebuild the data from the missing disk, not sure
> what could be wrong with it.
> 
> If you are on a kernel that has a newer version upgrading and
> rebooting may change something  If it is one close to the newest
> kernel version going back a minor version (if 6.5 go back to the last
> 6.4 kernel).

I am running the latest available kernel
	6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux

> You might also install the perf package and run "perf top" and see
> what sorts of calls the kernel is spending all of its time in.

I now installed perf. Below is a grab from the console. Should I use different options?

Does anyone understand it?

BTW, I now ran rsync into the array and it barely moved. In 2 hours it copied about 5GB.
iostat shows about 5KB/s (kilobytes!) all that time.


TIA

top shows
     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1385760 root      20   0       0      0      0 R  99.7   0.0 161:32.64 kworker/u16:4+flush-9:127

$ sudo perf top --pid 1385760 -g
+  100.00%     0.49%  [kernel]  [k] ext4_mb_regular_allocator
+   99.96%     0.00%  [kernel]  [k] ext4_do_writepages
+   99.10%     0.00%  [kernel]  [k] ext4_ext_map_blocks
+   89.39%     7.85%  [kernel]  [k] ext4_mb_find_good_group_avg_frag_lists
+   81.78%    55.43%  [kernel]  [k] ext4_mb_good_group
+   28.74%    14.63%  [kernel]  [k] ext4_get_group_info
+   12.40%     0.00%  [kernel]  [k] ret_from_fork_asm
+   12.40%     0.00%  [kernel]  [k] ret_from_fork
+   12.40%     0.00%  [kernel]  [k] kthread
+   12.40%     0.00%  [kernel]  [k] worker_thread
+   12.40%     0.00%  [kernel]  [k] process_one_work
+   12.40%     0.00%  [kernel]  [k] wb_workfn
+   12.40%     0.00%  [kernel]  [k] wb_writeback
+   12.40%     0.00%  [kernel]  [k] __writeback_inodes_wb
+   12.40%     0.00%  [kernel]  [k] writeback_sb_inodes
+   12.40%     0.00%  [kernel]  [k] __writeback_single_inode
+   12.40%     0.00%  [kernel]  [k] do_writepages
+   12.40%     0.00%  [kernel]  [k] ext4_writepages
+   12.40%     0.00%  [kernel]  [k] ext4_map_blocks
+   12.40%     0.00%  [kernel]  [k] ext4_mb_new_blocks
+    7.21%     7.12%  [kernel]  [k] __rcu_read_unlock
+    6.21%     3.23%  [kernel]  [k] ext4_mb_scan_aligned
+    4.77%     4.67%  [kernel]  [k] __rcu_read_lock
+    2.96%     0.56%  [kernel]  [k] mb_find_extent
+    2.63%     0.55%  [kernel]  [k] ext4_mb_load_buddy_gfp
+    2.32%     2.29%  [kernel]  [k] mb_find_order_for_block
+    2.00%     0.08%  [kernel]  [k] pagecache_get_page
+    1.88%     0.14%  [kernel]  [k] __filemap_get_folio
+    1.71%     0.51%  [kernel]  [k] filemap_get_entry
+    1.12%     0.43%  [kernel]  [k] xas_load
+    0.68%     0.63%  [kernel]  [k] xas_descend
      0.34%     0.34%  [kernel]  [k] ext4_mb_unload_buddy
      0.29%     0.22%  [kernel]  [k] _raw_read_unlock
      0.25%     0.25%  [kernel]  [k] _raw_spin_trylock
      0.17%     0.17%  [kernel]  [k] _raw_read_lock
      0.11%     0.10%  [kernel]  [k] mb_find_buddy
      0.11%     0.00%  [kernel]  [k] asm_sysvec_apic_timer_interrupt
      0.10%     0.00%  [kernel]  [k] sysvec_apic_timer_interrupt
      0.09%     0.00%  [kernel]  [k] __sysvec_apic_timer_interrupt
      0.08%     0.00%  [kernel]  [k] hrtimer_interrupt
      0.08%     0.07%  [kernel]  [k] xas_start
      0.07%     0.00%  [kernel]  [k] __hrtimer_run_queues
      0.06%     0.00%  [kernel]  [k] tick_sched_timer
      0.05%     0.04%  [kernel]  [k] folio_mark_accessed
      0.04%     0.04%  [kernel]  [k] _raw_spin_unlock
      0.04%     0.00%  [kernel]  [k] update_process_times
      0.04%     0.02%  [kernel]  [k] __cond_resched
      0.04%     0.04%  [kernel]  [k] folio_test_hugetlb
      0.04%     0.00%  [kernel]  [k] tick_sched_handle
      0.03%     0.00%  [kernel]  [k] scheduler_tick
      0.02%     0.00%  [kernel]  [k] tick_sched_do_timer
      0.02%     0.00%  [kernel]  [k] timekeeping_advance
      0.01%     0.00%  [kernel]  [k] task_tick_fair
      0.01%     0.00%  [kernel]  [k] perf_adjust_freq_unthr_context
      0.01%     0.00%  [kernel]  [k] clockevents_program_event
      0.01%     0.01%  [kernel]  [k] _raw_spin_lock_irqsave
      0.01%     0.00%  [kernel]  [k] timekeeping_update
      0.01%     0.00%  [kernel]  [k] arch_scale_freq_tick
      0.01%     0.01%  [kernel]  [k] native_irq_return_iret
      0.01%     0.01%  [kernel]  [k] native_write_msr
      0.01%     0.00%  [kernel]  [k] update_load_avg
      0.01%     0.00%  [kernel]  [k] lapic_next_deadline
      0.01%     0.00%  [kernel]  [k] sched_clock_cpu
      0.01%     0.00%  [kernel]  [k] irqtime_account_irq
      0.00%     0.00%  [kernel]  [k] sched_clock
      0.00%     0.00%  [kernel]  [k] __irq_exit_rcu
      0.00%     0.00%  [kernel]  [k] native_read_msr
      0.00%     0.00%  [kernel]  [k] native_apic_msr_eoi_write
      0.00%     0.00%  [kernel]  [k] update_rq_clock
      0.00%     0.00%  [kernel]  [k] smp_call_function_single_async
      0.00%     0.00%  [kernel]  [k] native_sched_clock
      0.00%     0.00%  [kernel]  [k] update_vsyscall
      0.00%     0.00%  [kernel]  [k] _raw_spin_lock
      0.00%     0.00%  [kernel]  [k] generic_exec_single
      0.00%     0.00%  [kernel]  [k] notifier_call_chain
      0.00%     0.00%  [kernel]  [k] __update_load_avg_cfs_rq
      0.00%     0.00%  [kernel]  [k] tick_do_update_jiffies64
      0.00%     0.00%  [kernel]  [k] update_irq_load_avg
      0.00%     0.00%  [kernel]  [k] tick_program_event
      0.00%     0.00%  [kernel]  [k] trigger_load_balance
      0.00%     0.00%  [kernel]  [k] rb_erase
      0.00%     0.00%  [kernel]  [k] update_wall_time
      0.00%     0.00%  [kernel]  [k] ktime_get
      0.00%     0.00%  [kernel]  [k] llist_add_batch
      0.00%     0.00%  [kernel]  [k] hrtimer_active
      0.00%     0.00%  [kernel]  [k] wq_worker_tick
      0.00%     0.00%  [kernel]  [k] timerqueue_add
      0.00%     0.00%  [kernel]  [k] irqentry_enter
      0.00%     0.00%  [kernel]  [k] psi_account_irqtime
      0.00%     0.00%  [kernel]  [k] run_posix_cpu_timers
      0.00%     0.00%  [kernel]  [k] ktime_get_update_offsets_now
      0.00%     0.00%  [kernel]  [k] rcu_sched_clock_irq
      0.00%     0.00%  [kernel]  [k] read_tsc
      0.00%     0.00%  [kernel]  [k] x86_pmu_disable
      0.00%     0.00%  [kernel]  [k] update_curr
      0.00%     0.00%  [kernel]  [k] _raw_spin_lock_irq
      0.00%     0.00%  [kernel]  [k] account_system_index_time
      0.00%     0.00%  [kernel]  [k] __update_load_avg_se
      0.00%     0.00%  [kernel]  [k] error_entry
      0.00%     0.00%  [kernel]  [k] update_fast_timekeeper
      0.00%     0.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
      0.00%     0.00%  [kernel]  [k] hrtimer_run_queues
      0.00%     0.00%  [kernel]  [k] pvclock_gtod_notify
      0.00%     0.00%  [kernel]  [k] call_function_single_prep_ipi
      0.00%     0.00%  [kernel]  [k] crypto_shash_update
      0.00%     0.00%  [kernel]  [k] cpuacct_charge
      0.00%     0.00%  [kernel]  [k] perf_pmu_nop_void
      0.00%     0.00%  [kernel]  [k] kthread_data
      0.00%     0.00%  [kernel]  [k] hrtimer_update_next_event
      0.00%     0.00%  [kernel]  [k] hrtimer_forward
      0.00%     0.00%  [kernel]  [k] __hrtimer_next_event_base
      0.00%     0.00%  [kernel]  [k] __do_softirq
      0.00%     0.00%  [kernel]  [k] __schedule
      0.00%     0.00%  [kernel]  [k] __run_timers
      0.00%     0.00%  [kernel]  [k] rcu_note_context_switch
      0.00%     0.00%  [kernel]  [k] _find_next_and_bit
      0.00%     0.00%  [kernel]  [k] tcp_orphan_count_sum
      0.00%     0.00%  [kernel]  [k] cpuacct_account_field
      0.00%     0.00%  [kernel]  [k] pick_next_task_fair
      0.00%     0.00%  [kernel]  [k] irqtime_account_process_tick
      0.00%     0.00%  [kernel]  [k] md_handle_request
      0.00%     0.00%  [kernel]  [k] kmem_cache_alloc
      0.00%     0.00%  [kernel]  [k] calc_global_load_tick
      0.00%     0.00%  [kernel]  [k] irq_exit_rcu
      0.00%     0.00%  [kernel]  [k] blake2s_compress_ssse3
      0.00%     0.00%  [kernel]  [k] find_get_stripe
      0.00%     0.00%  [kernel]  [k] rb_next
      0.00%     0.00%  [kernel]  [k] error_return
      0.00%     0.00%  [kernel]  [k] update_min_vruntime
      0.00%     0.00%  [kernel]  [k] netlink_broadcast
      0.00%     0.00%  [kernel]  [k] prb_retire_rx_blk_timer_expired
      0.00%     0.00%  [kernel]  [k] __accumulate_pelt_segments
      0.00%     0.00%  [kernel]  [k] kernfs_notify
      0.00%     0.00%  [kernel]  [k] ntp_get_next_leap
      0.00%     0.00%  [kernel]  [k] ext4_mb_prefetch
      0.00%     0.00%  [kernel]  [k] __radix_tree_lookup
      0.00%     0.00%  [kernel]  [k] irqentry_exit
      0.00%     0.00%  [kernel]  [k] asm_sysvec_reschedule_ipi
      0.00%     0.00%  [kernel]  [k] ext4_ext_insert_extent
      0.00%     0.00%  [kernel]  [k] fast_mix
      0.00%     0.00%  [kernel]  [k] record_times
      0.00%     0.00%  [kernel]  [k] irq_enter_rcu
      0.00%     0.00%  [kernel]  [k] early_xen_iret_patch
      0.00%     0.00%  [kernel]  [k] __orc_find
      0.00%     0.00%  [kernel]  [k] irq_work_tick
      0.00%     0.00%  [kernel]  [k] kmem_cache_free

> On Mon, Oct 30, 2023 at 8:44 AM <eyal@eyal.emu.id.au> wrote:
>> [original message quoted in full - trimmed]

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31  2:35   ` eyal
@ 2023-10-31  2:50     ` eyal
  0 siblings, 0 replies; 35+ messages in thread
From: eyal @ 2023-10-31  2:50 UTC (permalink / raw)
  To: eyal, linux-raid

On 31/10/2023 13.35, eyal@eyal.emu.id.au wrote:
> Thanks Roger,
> 
> On 31/10/2023 03.14, Roger Heflin wrote:
>> look at  SAR -d output for all the disks in the raid6.   It may be a
>> disk issue (though I suspect not given the 100% cpu show in raid).
> 
> It seems that sar is installed but not active here. Latest data is from 2017...
> 
>> Clearly something very expensive/deadlockish is happening because of
>> the raid having to rebuild the data from the missing disk, not sure
>> what could be wrong with it.
>>
>> If you are on a kernel that has a newer version upgrading and
>> rebooting may change something  If it is one close to the newest
>> kernel version going back a minor version (if 6.5 go back to the last
>> 6.4 kernel).
> 
> I am running the latest available kernel
>      6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
> 
>> You might also install the perf package and run "perf top" and see
>> what sorts of calls the kernel is spending all of its time in.
> 
> I now installed perf. Below is a grab from the console. Should I use different options?
> 
> Does anyone understand it?
> 
> BTW, I now ran rsync into the array and it barely moved. In 2 hours it copied about 5GB.
> iostat shows about 5KB/s (kilobytes!) all that time.

rsync kept the thread running at 100% CPU. After stopping the copy, the thread was
no longer at the top of 'top' (could not see it listed at all).

> TIA
> 
> top shows
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 1385760 root      20   0       0      0      0 R  99.7   0.0 161:32.64 kworker/u16:4+flush-9:127
> 
> $ sudo perf top --pid 1385760 -g
> +  100.00%     0.49%  [kernel]  [k] ext4_mb_regular_allocator

[stack trimmed]

>> On Mon, Oct 30, 2023 at 8:44 AM <eyal@eyal.emu.id.au> wrote:
>>> [original message quoted in full - trimmed]
> 

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-30 16:14 ` Roger Heflin
  2023-10-31  2:35   ` eyal
@ 2023-10-31  3:21   ` Carlos Carvalho
  2023-10-31  6:25     ` eyal
  2023-10-31  9:29     ` eyal
  1 sibling, 2 replies; 35+ messages in thread
From: Carlos Carvalho @ 2023-10-31  3:21 UTC (permalink / raw)
  To: linux-raid

Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> look at  SAR -d output for all the disks in the raid6.   It may be a
> disk issue (though I suspect not given the 100% cpu show in raid).
> 
> Clearly something very expensive/deadlockish is happening because of
> the raid having to rebuild the data from the missing disk, not sure
> what could be wrong with it.

This is very similar to what I complained about some 3 months ago. For me it happens
with an array in a normal state: sar shows no disk activity, no writes reach
the array (reads happen normally), and the flushd thread uses 100%
cpu.

For the latest 6.5.* I can reliably reproduce it with
% xzcat linux-6.5.tar.xz | tar x -f -

This leaves the machine with ~1.5GB of dirty pages (as reported by
/proc/meminfo) that it never manages to write to the array. I've waited for
several hours to no avail. After a reboot the kernel tree had only about 220MB
instead of ~1.5GB...
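This is easy to watch while reproducing it; something like
% watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"
shows the Dirty number just sitting there instead of draining.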

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31  3:21   ` Carlos Carvalho
@ 2023-10-31  6:25     ` eyal
  2023-10-31  9:29     ` eyal
  1 sibling, 0 replies; 35+ messages in thread
From: eyal @ 2023-10-31  6:25 UTC (permalink / raw)
  To: linux-raid

On 31/10/2023 14.21, Carlos Carvalho wrote:
> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
>> look at  SAR -d output for all the disks in the raid6.   It may be a
>> disk issue (though I suspect not given the 100% cpu show in raid).
>>
>> Clearly something very expensive/deadlockish is happening because of
>> the raid having to rebuild the data from the missing disk, not sure
>> what could be wrong with it.
> 
> This is very similar to what I complained some 3 months ago. For me it happens
> with an array in normal state. sar shows no disk activity yet there are no
> writes to the array (reads happen normally) and the flushd thread uses 100%
> cpu.
> 
> For the latest 6.5.* I can reliably reproduce it with
> % xzcat linux-6.5.tar.xz | tar x -f -
> 
> This leaves the machine with ~1.5GB of dirty pages (as reported by
> /proc/meminfo) that it never manages to write to the array. I've waited for
> several hours to no avail. After a reboot the kernel tree had only about 220MB
> instead of ~1.5GB...

I rebooted the machine, so all is pristine.
This is F38, kernel 6.5.8-200.fc38.x86_64, with 32GB RAM.

I started a copy (rsync from a SATA disk) into the array. Within seconds the kworker started running at 100% CPU.
In less than 1 minute it almost stopped(*1), after transferring about 5GB.
Below(*2) is the meminfo at that time.

10 minutes later it is still copying the same (32KB!) file.
20m later I stopped the copy but the kworker remained.
10m later I removed the target directory and the kworker was immediately gone.

I still suspect that after the array went 'all spares' and I re-assembled it, even though it now looks good,
something is not completely right.

What other information should I provide to help resolve this issue?

TIA

(*1) the read side had zero activity and the write side (the array) had a trickle of about 50KB/s.

(*2) $ cat /proc/meminfo
MemTotal:       32704880 kB
MemFree:          654660 kB
MemAvailable:   29397832 kB
Buffers:          951396 kB
Cached:         26962316 kB
SwapCached:           32 kB
Active:          2698164 kB
Inactive:       26738212 kB
Active(anon):    1755572 kB
Inactive(anon):    51596 kB
Active(file):     942592 kB
Inactive(file): 26686616 kB
Unevictable:      177112 kB
Mlocked:               0 kB
SwapTotal:      16777212 kB
SwapFree:       16776444 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:           5069080 kB
Writeback:            28 kB
AnonPages:       1699812 kB
Mapped:           669464 kB
Shmem:            284500 kB
KReclaimable:    1452908 kB
Slab:            1731432 kB
SReclaimable:    1452908 kB
SUnreclaim:       278524 kB
KernelStack:       16160 kB
PageTables:        29476 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    33129652 kB
Committed_AS:    8020984 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       58744 kB
VmallocChunk:          0 kB
Percpu:             5792 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      338060 kB
DirectMap2M:     4757504 kB
DirectMap1G:    28311552 kB

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31  3:21   ` Carlos Carvalho
  2023-10-31  6:25     ` eyal
@ 2023-10-31  9:29     ` eyal
  2023-10-31 10:24       ` Roger Heflin
  2023-10-31 12:39       ` Carlos Carvalho
  1 sibling, 2 replies; 35+ messages in thread
From: eyal @ 2023-10-31  9:29 UTC (permalink / raw)
  To: linux-raid

On 31/10/2023 14.21, Carlos Carvalho wrote:
> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
>> look at  SAR -d output for all the disks in the raid6.   It may be a
>> disk issue (though I suspect not given the 100% cpu show in raid).
>>
>> Clearly something very expensive/deadlockish is happening because of
>> the raid having to rebuild the data from the missing disk, not sure
>> what could be wrong with it.
> 
> This is very similar to what I complained some 3 months ago. For me it happens
> with an array in normal state. sar shows no disk activity yet there are no
> writes to the array (reads happen normally) and the flushd thread uses 100%
> cpu.
> 
> For the latest 6.5.* I can reliably reproduce it with
> % xzcat linux-6.5.tar.xz | tar x -f -
> 
> This leaves the machine with ~1.5GB of dirty pages (as reported by
> /proc/meminfo) that it never manages to write to the array. I've waited for
> several hours to no avail. After a reboot the kernel tree had only about 220MB
> instead of ~1.5GB...

More evidence that the problem relates to the cache not being flushed to disk.

If I run 'rsync --fsync ...' it slows down, as the writes are flushed to disk for each file.
But this also evicts the data from the cache, so nothing accumulates.
The result is slower copying than otherwise, but it streams with no pauses.

It seems that the array is somehow slow to sync files. Mythtv has no problems because it writes
only a few large files. rsync copies a very large number of small files, which somehow triggers
the problem.

This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
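(To be clear, the kind of test I mean, with a final flush so the page cache does not
hide the real rate, is roughly
	dd if=/dev/zero of=/data1/ddtest bs=1M count=4096 conv=fdatasync
where 'ddtest' is just a scratch file name on the array's fs.)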

Just my guess.

BTW I ran fsck on the fs (on the array) and it found no fault.
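(I assume the safe way to repeat that check is a forced read-only pass with the fs
unmounted, roughly
	e2fsck -f -n /dev/md127
in case anyone wants to try the same.)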

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31  9:29     ` eyal
@ 2023-10-31 10:24       ` Roger Heflin
  2023-10-31 21:40         ` eyal
  2023-10-31 12:39       ` Carlos Carvalho
  1 sibling, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-10-31 10:24 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

If your directory entries are large (lots of small files in a
directory) then the recovery of the missing data could be just enough
to push your array too hard.

find /<mount> -type d -size +1M -ls     will find large directories.

Do an 'ls -l <largedirname> | wc -l' and see how many files are in there.

ext3/4 has issues with really big directories.

The perf top output showed that just about all of the time was being spent in
ext3/4 threads allocating new blocks/directory entries and such.

How much free space does the disk show in df?


On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
>
> On 31/10/2023 14.21, Carlos Carvalho wrote:
> > Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> >> look at  SAR -d output for all the disks in the raid6.   It may be a
> >> disk issue (though I suspect not given the 100% cpu show in raid).
> >>
> >> Clearly something very expensive/deadlockish is happening because of
> >> the raid having to rebuild the data from the missing disk, not sure
> >> what could be wrong with it.
> >
> > This is very similar to what I complained some 3 months ago. For me it happens
> > with an array in normal state. sar shows no disk activity yet there are no
> > writes to the array (reads happen normally) and the flushd thread uses 100%
> > cpu.
> >
> > For the latest 6.5.* I can reliably reproduce it with
> > % xzcat linux-6.5.tar.xz | tar x -f -
> >
> > This leaves the machine with ~1.5GB of dirty pages (as reported by
> > /proc/meminfo) that it never manages to write to the array. I've waited for
> > several hours to no avail. After a reboot the kernel tree had only about 220MB
> > instead of ~1.5GB...
>
> More evidence that the problem relates to the cache not flushed to disk.
>
> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
> But it also evicts it from the cache, so nothing accumulates.
> The result is a slower than otherwise copying but it streams with no pauses.
>
> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> only a few large files. rsync copies a very large number of small files which somehow triggers
> the problem.
>
> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
>
> Just my guess.
>
> BTW I ran fsck on the fs (on the array) and it found no fault.
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31  9:29     ` eyal
  2023-10-31 10:24       ` Roger Heflin
@ 2023-10-31 12:39       ` Carlos Carvalho
  2023-10-31 14:19         ` Roger Heflin
  1 sibling, 1 reply; 35+ messages in thread
From: Carlos Carvalho @ 2023-10-31 12:39 UTC (permalink / raw)
  To: linux-raid

eyal@eyal.emu.id.au (eyal@eyal.emu.id.au) wrote on Tue, Oct 31, 2023 at 06:29:14AM -03:
> More evidence that the problem relates to the cache not flushed to disk.

Yes.

> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> only a few large files. rsync copies a very large number of small files which somehow triggers
> the problem.

Me too. Writing a few files works fine; the problem happens when many files need
flushing. That's why expanding the kernel tree blocks the machine. After many
hours it either crashes or I have to do a hard reboot because all service
stops.

It also happens with 6.1 but 6.5 is a lot more susceptible. Further, the longer
the uptime the more prone to deadlock the machine becomes...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31 12:39       ` Carlos Carvalho
@ 2023-10-31 14:19         ` Roger Heflin
  2023-10-31 19:20           ` Carlos Carvalho
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-10-31 14:19 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-raid

If it is a spinning disk there is a limit to the total number of
updates (IOPS), read or write, that can be done per second.

In RAID5/6 a small read is one small read, but a small write requires
several disks to be read and/or written to complete that single
write. When a disk has failed, a read that would have had its data on
the missing disk needs to read all the other disks to recalculate the
missing data.

If your array was already busy, the extra reads and recalculations could
push your array over its IO limit.

I set these two parameters to limit the total amount of dirty cache, and this
appears to make the system more responsive:

[root@bm-server ~]# sysctl -a | grep -i dirty | grep -i bytes
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000

The issue is that once you hit dirty_bytes all writes stop until the
cache is flushed back down to dirty_background_bytes, and the greater
the difference between the two, the longer the freeze lasts.

Note that by default most distributions don't use the _bytes variants; they use
dirty_ratio/dirty_background_ratio, which are a % of memory and can make
the write freeze last for a significant amount of time, since even a
single % is quite large if you have a lot of RAM.
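To make the _bytes settings persistent, something like this should work (the file
name is arbitrary):

/etc/sysctl.d/99-dirty.conf:
    vm.dirty_background_bytes = 3000000
    vm.dirty_bytes = 5000000

then run "sysctl --system" (or reboot) to apply it.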

You might also want to look at the IO for the sd* devices under the md
device, as that would show what MD is doing under the covers to make
the IO happen (sar and/or iostat).
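Something like this (extended stats, 5 second samples) should show whether the
member disks are actually doing any writes:

iostat -xm sdb sdc sdd sde sdf sdg md127 5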

On Tue, Oct 31, 2023 at 7:39 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>
> eyal@eyal.emu.id.au (eyal@eyal.emu.id.au) wrote on Tue, Oct 31, 2023 at 06:29:14AM -03:
> > More evidence that the problem relates to the cache not flushed to disk.
>
> Yes.
>
> > It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> > only a few large files. rsync copies a very large number of small files which somehow triggers
> > the problem.
>
> Mee too. Writing few files works fine, the problem happens when many files need
> flushing. That's why expanding the kernel tree blocks the machine. After many
> hours it either crashes or I have to do a hard reboot because all service
> stops.
>
> It also happens with 6.1 but 6.5 is a lot more susceptible. Further, the longer
> the uptime the more prone to deadlock the machine becomes...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31 14:19         ` Roger Heflin
@ 2023-10-31 19:20           ` Carlos Carvalho
  0 siblings, 0 replies; 35+ messages in thread
From: Carlos Carvalho @ 2023-10-31 19:20 UTC (permalink / raw)
  To: linux-raid

Roger Heflin (rogerheflin@gmail.com) wrote on Tue, Oct 31, 2023 at 11:19:24AM -03:
> You might also want to look at the io for the sd* devices under the md
> device as that would show what MD is doing under the covers to make
> the io happen (sar and/or iostat).

Exactly. Guess what the disks are doing? Nothing! That's the whole point. The
kernel becomes unable to send writes to the disks and the flushd thread remains
at 100% cpu.

Again: this is not an excess of writes that the array cannot cope with. On the
contrary, it's an absence of writes.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31 10:24       ` Roger Heflin
@ 2023-10-31 21:40         ` eyal
  2023-11-01 10:30           ` Roger Heflin
  0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-10-31 21:40 UTC (permalink / raw)
  To: linux-raid

On 31/10/2023 21.24, Roger Heflin wrote:
> If you directory entries are large (lots of small files in a
> directory) then the recovery of the missing data could be just enough
> to push your array too hard.

Nah, the directory I am copying has nothing really large, and the target directory is newly created.

> find /<mount> -type d -size +1M -ls     will find large directories.
> 
> do a ls -l <largedirname> | wc -l and see how many files are in there.
> 
> ext3/4 has issues with really big directories.
> 
> The perf top showed just about all of the time was being spend in
> ext3/4 threads allocating new blocks/directory entries and such.

Just in case there is an issue, I will copy another directory as a test.
[later] Same issue. This time the files were pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
After a few minutes (9GB copied) it took a long pause and a second kworker started. This one was gone after I killed the copy.

However, this same content was copied from an external USB disk (NOT to the array) without a problem.

> How much free space does the disk show in df?

Enough  room:
	/dev/md127       55T   45T  9.8T  83% /data1

I still suspect an issue with the array after it was recovered.

A related issue is that there is a constant rate of writes to the array (iostat says about 5KB/s)
when there is no activity on this fs. In the past I saw zero reads/writes in iostat in this situation.

Is there some background md process? Can it be hurried to completion?
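(I assume the things to look at are
	cat /sys/block/md127/md/sync_action
	cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
and that raising speed_limit_min would hurry a real resync along, if one is running.)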

> On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
>> [earlier messages quoted in full - trimmed]

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array [more details]
  2023-10-30 13:35 problem with recovered array eyal
  2023-10-30 16:14 ` Roger Heflin
@ 2023-10-31 21:44 ` eyal
  2023-10-31 22:00   ` eyal
  2023-11-01  4:31   ` [now urgent] problem with recovered array eyal
  1 sibling, 2 replies; 35+ messages in thread
From: eyal @ 2023-10-31 21:44 UTC (permalink / raw)
  To: linux-raid

On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
> F38
> 
> I know this is a bit long but I wanted to provide as much detail as I thought needed.
> 
> I have a 7-member raid6. The other day I needed to send a disk for replacement.
> I have done this before and all looked well. The array is now degraded until I get the new disk.
> 
> At one point my system got into trouble and I am not sure why, but it started to have
> very slow response to open/close files, or even keystrokes. At the end I decided to reboot.
> It refused to complete the shutdown and after a while I used the sysrq feature for this.
> 
> On the restart it dropped into emergency shell, the array had all members listed as spares.
> I tried to '--run' the array but mdadm refused 'cannot start dirty degraded array'
> though the array was now listed in mdstat and looked as expected.
> 
> Since mdadm suggested I use --force', I did so
>      mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1
> 
> Q0) was I correct to use this command?
> 
> 2023-10-30T01:08:25+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
> 2023-10-30T01:08:25+1100 kernel: md127: detected capacity change from 0 to 117187522560
> 2023-10-30T01:08:25+1100 kernel: md: requested-resync of RAID array md127
> 
> Q1) What does this last line mean?
> 
> Now that the array came up I still could not mount the fs (still in emergency shell).
> I rebooted and all came up, the array was there and the fs was mounted and so far
> I did not notice any issues with the fs.
> 
> However, it is not perfect. I tried to copy some data from an external (USB) disk to the array
> and it went very slowly (as in 10KB/s, the USB could do 120MB/s). The copy (rsync) was running
> at 100% CPU which is unexpected. I then stopped it. As a test, I rsync'ed the USB disk to another
> SATA disk on the server and it went fast, so the USB disk is OK.
> 
> I then noticed (in 'top') that there is a kworker running at 100% CPU:
> 
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   944760 root      20   0       0      0      0 R 100.0   0.0 164:00.85 kworker/u16:3+flush-9:127
> 
> It did it for many hours and I do not know what it is doing.
> 
> Q2) what does this worker do?
> 
> I also noticed that mdstat shows a high bitmap usage:
> 
> Personalities : [raid6] [raid5] [raid4]
> md127 : active raid6 sde1[4] sdg1[6] sdf1[5] sdd1[7] sdb1[8] sdc1[9]
>        58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
>        bitmap: 87/88 pages [348KB], 65536KB chunk
> 
> Q3) Is this OK? Should the usage go down? It does not change at all.
> 
> While looking at everything, I started iostat on md127 and I see that there is a constant
> trickle of writes, about 5KB/s. There is no activity on this fs.
> Also, I see similar activity on all the members, at the same rate, so md127 does not show
> 6 times the members activity. I guess this is just how md works?
> 
> Q4) What is this write activity? Is it related to the abovementioned 'requested-resync'?
>      If this is a background thing, how can I monitor it?
> 
> Q5) Finally, will the array come up (degraded) if I reboot or will I need to coerse it to start?
>      What is the correct way to bring up a degraded array? What about the 'dirty' part?
> 'mdadm -D /dev/md127' mention'sync':
>      Number   Major   Minor   RaidDevice State
>         -       0        0        0      removed
>         8       8       17        1      active sync   /dev/sdb1
>         9       8       33        2      active sync   /dev/sdc1
>         7       8       49        3      active sync   /dev/sdd1
>         4       8       65        4      active sync   /dev/sde1
>         5       8       81        5      active sync   /dev/sdf1
>         6       8       97        6      active sync   /dev/sdg1
> Is this related?
> 
> BTW I plan to run a 'check' at some point.

I ran a 'check' on a few segments of the array, just to see if there is any wholesale issue, but it was clean.

> TIA

It is time to describe in more detail what happened, because it happened again yesterday and I kept notes.

The system became somewhat non-responsive so I decided to reboot. The shutdown did not complete.
It took over 10m for the system to slowly give up waiting for about 10 services, then stop them hard.
The last line on the screen was
	systemd[1]: Stopped systemd-sysctl.service - Apply Kernel Variables.
After waiting for almost 15m with no activity I rebooted using sysrq e,i,s,u,b.

The restart dropped into the emergency shell.
mdstat showed all the array members as spares, same as before.
I repeated what worked last time:
	mdadm --assemble         /dev/md127 /dev/sd{b,c,d,e,f,g}1
which failed, then:
	mdadm --stop md127
	mdadm --assemble --force /dev/md127 /dev/sd{b,c,d,e,f,g}1
Some messages followed, then /data1 was (automatically) mounted on md127. The log shows:
	2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): recovery complete
	2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): mounted filesystem 378e74a6-e379-4bd5-ade5-f3cd85952099 r/w with ordered data mode. Quota mode: none.

There was also a message saying a recovery (maybe a different term was used?) was in progress.
mdstat showed the array was up and a recovery was running.

After waiting for a little while it was time to boot.
At this point I used 'exit', hoping for the boot to complete, but it did not; there was no response at all.
Ctrl-Alt-Del brought the system up. The system log suggests it was a proper restart.
On the fresh boot I do not see the recovery and the array looks like this:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sde1[4] sdc1[9] sdf1[5] sdb1[8] sdd1[7] sdg1[6]
       58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
       bitmap: 88/88 pages [352KB], 65536KB chunk

It now shows the bitmap as full. I left the system alone overnight with iostat running and I see
some activity every few minutes. Nothing is running on the array.

01:02:07 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
01:02:07 md127            43.32      1098.67      3659.18         0.00    4257297   14179212          0
01:03:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:04:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:05:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:06:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:07:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:08:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:09:07 md127             0.02         0.00         0.07         0.00          0          4          0
01:10:07 md127             0.38         0.00         1.53         0.00          0         92          0
01:11:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:12:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:13:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:14:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:15:07 md127             0.02         0.00         1.60         0.00          0         96          0
01:16:07 md127             0.18         0.00         7.27         0.00          0        436          0
01:17:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:18:07 md127             0.00         0.00         0.00         0.00          0          0          0
01:19:07 md127             0.02         0.00         0.07         0.00          0          4          0
01:20:07 md127             0.33         0.00         1.33         0.00          0         80          0
01:21:07 md127             0.00         0.00         0.00         0.00          0          0          0

Later, some mythtv recordings were made; I checked and they play just fine.

These are the relevant log messages during this time:

2023-10-31T23:47:41+1100 kernel: EXT4-fs (nvme0n1p3): mounted filesystem 62fbafb4-86ba-497d-ae82-77b4d386b708 ro with ordered data mode. Quota mode: none.
2023-10-31T23:47:47+1100 kernel: EXT4-fs (nvme0n1p3): re-mounted 62fbafb4-86ba-497d-ae82-77b4d386b708 r/w. Quota mode: none.
2023-10-31T23:47:48+1100 kernel: EXT4-fs (nvme0n1p1): mounted filesystem 43e95102-07d4-4012-9d69-cf5e4100e682 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:47:48+1100 kernel: EXT4-fs (nvme0n1p4): mounted filesystem f415fb92-0a9c-49cc-b4b9-48d9135acd70 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:47:49+1100 kernel: EXT4-fs (sda1): mounted filesystem c7b4b181-a62e-4e66-96e1-dfa249049f70 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:48:32+1100 systemd[1]: Dependency failed for data1.mount - /data1.
2023-10-31T23:48:32+1100 systemd[1]: data1.mount: Job data1.mount/start failed with result 'dependency'.
2023-10-31T23:50:50+1100 kernel: md/raid:md127: not clean -- starting background reconstruction
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdf1 operational as raid disk 5
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sde1 operational as raid disk 4
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdg1 operational as raid disk 6
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdb1 operational as raid disk 1
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdd1 operational as raid disk 3
2023-10-31T23:50:50+1100 kernel: md/raid:md127: device sdc1 operational as raid disk 2
2023-10-31T23:50:50+1100 kernel: md/raid:md127: cannot start dirty degraded array.
2023-10-31T23:50:50+1100 kernel: md/raid:md127: failed to run raid set.
2023-10-31T23:50:50+1100 kernel: md: pers->run() failed ...
2023-10-31T23:52:58+1100 kernel: md: md127 stopped.
2023-10-31T23:53:55+1100 kernel: md: md127 stopped.
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdb1 operational as raid disk 1
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdg1 operational as raid disk 6
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdf1 operational as raid disk 5
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sde1 operational as raid disk 4
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdd1 operational as raid disk 3
2023-10-31T23:53:55+1100 kernel: md/raid:md127: device sdc1 operational as raid disk 2
2023-10-31T23:53:55+1100 kernel: md/raid:md127: raid level 6 active with 6 out of 7 devices, algorithm 2
2023-10-31T23:53:55+1100 kernel: md127: detected capacity change from 0 to 117187522560
2023-10-31T23:53:55+1100 kernel: md: requested-resync of RAID array md127
2023-10-31T23:53:56+1100 systemd[1]: Mounting data1.mount - /data1...
2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): recovery complete
2023-10-31T23:54:02+1100 kernel: EXT4-fs (md127): mounted filesystem 378e74a6-e379-4bd5-ade5-f3cd85952099 r/w with ordered data mode. Quota mode: none.
2023-10-31T23:54:02+1100 systemd[1]: Mounted data1.mount - /data1.
2023-10-31T23:54:02+1100 mdadm[1286]: DeviceDisappeared event detected on md device /dev/md/md127
2023-10-31T23:54:02+1100 mdadm[1286]: NewArray event detected on md device /dev/md127
2023-10-31T23:54:02+1100 mdadm[1286]: DegradedArray event detected on md device /dev/md127
2023-10-31T23:54:02+1100 mdadm[1286]: RebuildStarted event detected on md device /dev/md127

Of note:

Starting the array without --force fails
	md/raid:md127: not clean -- starting background reconstruction
	md/raid:md127: cannot start dirty degraded array.
	md/raid:md127: failed to run raid set.

Then with --force it starts, /data1 is (automatically) mounted but says
	kernel: EXT4-fs (md127): recovery complete
so the shutdown was not clean.

Soon after
	mdadm[1286]: RebuildStarted event detected on md device /dev/md127

What I do not understand is what these messages mean. Is there something going on that does not show in mdstat?
	starting background reconstruction
	RebuildStarted event detected
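
A way to check, assuming the usual md sysfs layout, is to look at the array state directly:

	cat /sys/block/md127/md/sync_action      # 'idle' means nothing is running; 'resync'/'recover'/'check' otherwise
	cat /sys/block/md127/md/sync_completed   # shows progress in sectors while a sync is running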

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array [more details]
  2023-10-31 21:44 ` problem with recovered array [more details] eyal
@ 2023-10-31 22:00   ` eyal
  2023-11-01  4:31   ` [now urgent] problem with recovered array eyal
  1 sibling, 0 replies; 35+ messages in thread
From: eyal @ 2023-10-31 22:00 UTC (permalink / raw)
  To: linux-raid

On 01/11/2023 08.44, eyal@eyal.emu.id.au wrote:
> On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
>> F38
>>
>> I know this is a bit long but I wanted to provide as much detail as I thought needed.

[trimmed]

Responding with the array details:

mdadm --query
=============
/dev/md127: 54.57TiB raid6 7 devices, 0 spares. Use mdadm --detail for more detail.

mdadm --detail
==============
/dev/md127:
            Version : 1.2
      Creation Time : Fri Oct 26 17:24:59 2018
         Raid Level : raid6
         Array Size : 58593761280 (54.57 TiB 60.00 TB)
      Used Dev Size : 11718752256 (10.91 TiB 12.00 TB)
       Raid Devices : 7
      Total Devices : 6
        Persistence : Superblock is persistent

      Intent Bitmap : Internal

        Update Time : Wed Nov  1 05:09:17 2023
              State : clean, degraded
     Active Devices : 6
    Working Devices : 6
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 512K

Consistency Policy : bitmap

               Name : e4.eyal.emu.id.au:127
               UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
             Events : 4802229

     Number   Major   Minor   RaidDevice State
        -       0        0        0      removed
        8       8       17        1      active sync   /dev/sdb1
        9       8       33        2      active sync   /dev/sdc1
        7       8       49        3      active sync   /dev/sdd1
        4       8       65        4      active sync   /dev/sde1
        5       8       81        5      active sync   /dev/sdf1
        6       8       97        6      active sync   /dev/sdg1

mdadm --examine /dev/sd?1
==========================
/dev/sdb1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x1
      Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
            Name : e4.eyal.emu.id.au:127
   Creation Time : Fri Oct 26 17:24:59 2018
      Raid Level : raid6
    Raid Devices : 7

  Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
      Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=0 sectors
           State : clean
     Device UUID : 4bfcc6d6:165b6a78:b8fad560:51ff21de

Internal Bitmap : 8 sectors from superblock
     Update Time : Wed Nov  1 05:09:17 2023
   Bad Block Log : 512 entries available at offset 56 sectors
        Checksum : d90d6d62 - correct
          Events : 4802229

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 1
    Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x1
      Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
            Name : e4.eyal.emu.id.au:127
   Creation Time : Fri Oct 26 17:24:59 2018
      Raid Level : raid6
    Raid Devices : 7

  Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
      Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=0 sectors
           State : clean
     Device UUID : 7fb0b851:c3d0463b:6460ff66:170088b9

Internal Bitmap : 8 sectors from superblock
     Update Time : Wed Nov  1 05:09:17 2023
   Bad Block Log : 512 entries available at offset 56 sectors
        Checksum : f86afdb5 - correct
          Events : 4802229

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 2
    Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x1
      Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
            Name : e4.eyal.emu.id.au:127
   Creation Time : Fri Oct 26 17:24:59 2018
      Raid Level : raid6
    Raid Devices : 7

  Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
      Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=0 sectors
           State : clean
     Device UUID : ce5bd1bc:3e5c3054:658888dd:9038e09f

Internal Bitmap : 8 sectors from superblock
     Update Time : Wed Nov  1 05:09:17 2023
   Bad Block Log : 512 entries available at offset 56 sectors
        Checksum : d94e94f8 - correct
          Events : 4802229

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 3
    Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x1
      Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
            Name : e4.eyal.emu.id.au:127
   Creation Time : Fri Oct 26 17:24:59 2018
      Raid Level : raid6
    Raid Devices : 7

  Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
      Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=0 sectors
           State : clean
     Device UUID : 4706a5ab:e7cea085:9af96e3a:81ac89b1

Internal Bitmap : 8 sectors from superblock
     Update Time : Wed Nov  1 05:09:17 2023
   Bad Block Log : 512 entries available at offset 56 sectors
        Checksum : 6822973d - correct
          Events : 4802229

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 4
    Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x1
      Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
            Name : e4.eyal.emu.id.au:127
   Creation Time : Fri Oct 26 17:24:59 2018
      Raid Level : raid6
    Raid Devices : 7

  Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
      Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=0 sectors
           State : clean
     Device UUID : b1732c74:a34e121d:8347018e:c42b5085

Internal Bitmap : 8 sectors from superblock
     Update Time : Wed Nov  1 05:09:17 2023
   Bad Block Log : 512 entries available at offset 56 sectors
        Checksum : ef74518f - correct
          Events : 4802229

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 5
    Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x1
      Array UUID : 15d250cf:fe43eafb:5779f3d8:7e79affc
            Name : e4.eyal.emu.id.au:127
   Creation Time : Fri Oct 26 17:24:59 2018
      Raid Level : raid6
    Raid Devices : 7

  Avail Dev Size : 23437504512 sectors (10.91 TiB 12.00 TB)
      Array Size : 58593761280 KiB (54.57 TiB 60.00 TB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262064 sectors, after=0 sectors
           State : clean
     Device UUID : b44b1807:ed20c6f9:ec0436d7:744d144a

Internal Bitmap : 8 sectors from superblock
     Update Time : Wed Nov  1 05:09:17 2023
   Bad Block Log : 512 entries available at offset 56 sectors
        Checksum : 6d0cdaf7 - correct
          Events : 4802229

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 6
    Array State : .AAAAAA ('A' == active, '.' == missing, 'R' == replacing)

HTH

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [now urgent] problem with recovered array
  2023-10-31 21:44 ` problem with recovered array [more details] eyal
  2023-10-31 22:00   ` eyal
@ 2023-11-01  4:31   ` eyal
  2023-11-01  6:44     ` eyal
  1 sibling, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-01  4:31 UTC (permalink / raw)
  To: linux-raid

Why I see this as urgent now:
I suspect that my system will soon run out of cache (because dirty blocks are not being written to the array)
and fail in an ugly way.

If you can help then please read this thread.

Another unusual observation of a system that looks OK but evidently is not.

While pondering the above situation, I noticed that when I watched a mythtv recording,
which is on the array (md127), I saw no activity on md127 or its components '/dev/sd[b-g]1'.(*1)

Now, if I watch a program from earlier, from before the last boot(*2), I do see it in iostat.
Is it possible that all the recent recordings are still cached and were never written to disk?

What is stopping the system from draining the cache? Is the array somehow in readonly mode? How do I fix this?
I now think that on shutdown a sync is attempted, which ends up preventing the shutdown from completing.

BTW, earlier, in the same situation, running 'sync' never completed.
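
A quick way to check whether md itself has dropped to read-only (assuming the usual sysfs layout):

	cat /sys/block/md127/md/array_state      # 'readonly' or 'read-auto' here would explain stalled writes
	mdadm --readwrite /dev/md127             # if so, this should switch it back to read-write

(the --readwrite step is only a guess at a remedy, not something verified here)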

HTH

(*1)
$ df -h /data1
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127       55T   45T  9.8T  83% /data1

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sde1[4] sdc1[9] sdf1[5] sdb1[8] sdd1[7] sdg1[6]
       58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
       bitmap: 88/88 pages [352KB], 65536KB chunk

$ locate 2050_20231031093600.ts
/data1/mythtv/lib/2050_20231031093600.ts

$ ls -l /var/lib/mythtv
lrwxrwxrwx 1 root root 17 Feb 26  2023 /var/lib/mythtv -> /data1/mythtv/lib

$ ls -l /data1/mythtv/lib/2050_20231031093600.ts
-rw-r--r-- 1 mythtv mythtv 2362511964 Oct 31 21:24 /data1/mythtv/lib/2050_20231031093600.ts

(*2)
$ uptime
  15:02:37 up 15:05, 27 users,  load average: 0.49, 0.59, 0.67


On 01/11/2023 08.44, eyal@eyal.emu.id.au wrote:
> On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
>> F38
>>
>> I know this is a bit long but I wanted to provide as much detail as I thought needed.
>>
>> I have a 7-member raid6. The other day I needed to send a disk for replacement.
>> I have done this before and all looked well. The array is now degraded until I get the new disk.

[trimmed] See original posting.

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [now urgent] problem with recovered array
  2023-11-01  4:31   ` [now urgent] problem with recovered array eyal
@ 2023-11-01  6:44     ` eyal
  0 siblings, 0 replies; 35+ messages in thread
From: eyal @ 2023-11-01  6:44 UTC (permalink / raw)
  To: linux-raid

Bugger, I changed my mind. I now see that recordings do show write activity.

The reason why playing a recording from 12h ago does not show any reads is probably
because it is still in the cache, but not dirty. I guess.

My original problem remains though. rsync'ing a large tree of small files to the array
chokes after a minute, and after that writing progresses at a very slow pace
of a few tens of KB/s. A kthread then starts and runs at 100% CPU for hours even though
only 5GB were copied.

See my recent thread 'problem with recovered array' for details. The problem is still urgent.
I cannot safely shut down the machine.
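
To see where that flush kworker is spending its time, something like this might help
(<pid> is whatever PID top shows for the kworker; needs root):

	cat /proc/<pid>/stack
	perf top -p <pid>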

On 01/11/2023 15.31, eyal@eyal.emu.id.au wrote:
> Why do I see it as urgent now:
> I suspect that my system will soon run out of cache (due to no writes of dirty blocks to the array)
> and fail in an ugly way.
> 
> If you can help then please read this thread.
> 
> Another unusual observation of the situation where the system looks OK but is evidently not.
> 
> While pondering the above situation, I noticed that when I watched a mythtv recording,
> which is on the array (md127), I see no activity on md127 or the components '/dev/sd[b-g]1'.(*1)
> 
> Now, if I watch a program from earlier, from before the last boot(*2), I do see it in iostat.
> Is it possible that all the recent recordings are still cached and were never written to disk?
> 
> What is holding the system from draining the cache? Is the array somehow in readonly mode? How do I fix this?
> I now think that on shutdown, a sync is done that ends up preventing the shutdown from completion.
> 
> BTW, before, in the same situation, running 'sync' never completes.
> 
> HTH
> 
> (*1)
> $ df -h /data1
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/md127       55T   45T  9.8T  83% /data1
> 
> ]$ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md127 : active raid6 sde1[4] sdc1[9] sdf1[5] sdb1[8] sdd1[7] sdg1[6]
>        58593761280 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
>        bitmap: 88/88 pages [352KB], 65536KB chunk
> 
> $ locate 2050_20231031093600.ts
> /data1/mythtv/lib/2050_20231031093600.ts
> 
> $ ls -l /var/lib/mythtv
> lrwxrwxrwx 1 root root 17 Feb 26  2023 /var/lib/mythtv -> /data1/mythtv/lib
> 
> $ ls -l /data1/mythtv/lib/2050_20231031093600.ts
> -rw-r--r-- 1 mythtv mythtv 2362511964 Oct 31 21:24 /data1/mythtv/lib/2050_20231031093600.ts
> 
> (*2)
> $ uptime
>   15:02:37 up 15:05, 27 users,  load average: 0.49, 0.59, 0.67
> 
> 
> On 01/11/2023 08.44, eyal@eyal.emu.id.au wrote:
>> On 31/10/2023 00.35, eyal@eyal.emu.id.au wrote:
>>> F38
>>>
>>> I know this is a bit long but I wanted to provide as much detail as I thought needed.
>>>
>>> I have a 7-member raid6. The other day I needed to send a disk for replacement.
>>> I have done this before and all looked well. The array is now degraded until I get the new disk.
> 
> [trimmed] See original posting.

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-10-31 21:40         ` eyal
@ 2023-11-01 10:30           ` Roger Heflin
  2023-11-01 13:08             ` eyal
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-01 10:30 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

Did you check with iostat/sar?    (iostat -x 2 5).  The md
housekeeping/background stuff does not show on the md device itself,
but it shows on the underlying disks.

It might also be related to the bitmap keeping track of how far the
missing disk is behind.

Small files are troublesome.    A tiny file takes several reads and writes.

I think the bitmap is tracking how many writes still need to be done on the
missing disk, and if so it will not start cleaning up until the new disk
is put back.
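
For this particular array that would be something like (member names taken from the
mdstat output earlier in the thread):

	iostat -x sdb sdc sdd sde sdf sdg 2 5

so the per-disk r/s, w/s and %util columns show the housekeeping I/O that is hidden
behind md127.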


On Tue, Oct 31, 2023 at 4:40 PM <eyal@eyal.emu.id.au> wrote:
>
> On 31/10/2023 21.24, Roger Heflin wrote:
> > If you directory entries are large (lots of small files in a
> > directory) then the recovery of the missing data could be just enough
> > to push your array too hard.
>
> Nah, the directory I am copying has nothing really large, and the target directory is created new.
>
> > find /<mount> -type d -size +1M -ls     will find large directories.
> >
> > do a ls -l <largedirname> | wc -l and see how many files are in there.
> >
> > ext3/4 has issues with really big directories.
> >
> > The perf top showed just about all of the time was being spend in
> > ext3/4 threads allocating new blocks/directory entries and such.
>
> Just in case there is an issue, I will copy another directory as a test.
> [later] Same issue. This time the files were Pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
> After a few minutes (9GB copied) it took a long pause and a second kworker started. This one gone after I killed the copy.
>
> However, this same content was copied from an external USB disk (NOT to the array) without a problem.
>
> > How much free space does the disk show in df?
>
> Enough  room:
>         /dev/md127       55T   45T  9.8T  83% /data1
>
> I still suspect an issue with the array after it was recovered.
>
> A replated issue is that there is a constant rate of writes to the array (iostat says) at about 5KB/s
> when there is no activity on this fs. In the past I saw zero read/write in iostat in this situation.
>
> Is there some background md process? Can it be hurried to completion?
>
> > On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
> >>
> >> On 31/10/2023 14.21, Carlos Carvalho wrote:
> >>> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> >>>> look at  SAR -d output for all the disks in the raid6.   It may be a
> >>>> disk issue (though I suspect not given the 100% cpu show in raid).
> >>>>
> >>>> Clearly something very expensive/deadlockish is happening because of
> >>>> the raid having to rebuild the data from the missing disk, not sure
> >>>> what could be wrong with it.
> >>>
> >>> This is very similar to what I complained some 3 months ago. For me it happens
> >>> with an array in normal state. sar shows no disk activity yet there are no
> >>> writes to the array (reads happen normally) and the flushd thread uses 100%
> >>> cpu.
> >>>
> >>> For the latest 6.5.* I can reliably reproduce it with
> >>> % xzcat linux-6.5.tar.xz | tar x -f -
> >>>
> >>> This leaves the machine with ~1.5GB of dirty pages (as reported by
> >>> /proc/meminfo) that it never manages to write to the array. I've waited for
> >>> several hours to no avail. After a reboot the kernel tree had only about 220MB
> >>> instead of ~1.5GB...
> >>
> >> More evidence that the problem relates to the cache not flushed to disk.
> >>
> >> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
> >> But it also evicts it from the cache, so nothing accumulates.
> >> The result is a slower than otherwise copying but it streams with no pauses.
> >>
> >> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> >> only a few large files. rsync copies a very large number of small files which somehow triggers
> >> the problem.
> >>
> >> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
> >>
> >> Just my guess.
> >>
> >> BTW I ran fsck on the fs (on the array) and it found no fault.
> >>
> >> --
> >> Eyal at Home (eyal@eyal.emu.id.au)
> >>
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-01 10:30           ` Roger Heflin
@ 2023-11-01 13:08             ` eyal
  2023-11-01 14:29               ` Roger Heflin
  0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-01 13:08 UTC (permalink / raw)
  To: linux-raid

On 01/11/2023 21.30, Roger Heflin wrote:
> Did you check with iostat/sar?    (iostat -x 2 5).  The md
> housekeeping/background stuff does not show on the md device itself,
> but it shows on the underlying disks.

Yes, I have iostat on both the md device and the components and I see sparse activity on both.

> It might also be related the the bitmap keeping track of how far the
> missing disk is behind.

This may be the case. If the same disk is re-added this tracking would help,
but if a new disk is added then the whole disk will be rebuilt anyway.

> Small files are troublesome.    A tiny file takes several reads and writes.

Yes, but the array is so fast that this should not be a problem.
The rsync source can do 175MB/s and the target array 800MB/s, so I do not
see how the writing can slow the copy.

In one case I (virsh) saved a VM, which created one 8GB file, which took
many hours to be written.
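
One way to measure the array's real write speed with the page cache out of the picture
(the file name is just an example):

	dd if=/dev/zero of=/data1/ddtest.bin bs=1M count=8192 oflag=direct status=progress

If this sustains hundreds of MB/s while the rsync still crawls, the raw array speed is
probably not the bottleneck.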

> I think the bitmap is tracking how many writes need to be done on the
> missing disk, and if so then until the new disk gets put back will not
> start cleaning up.
> 
> 
> On Tue, Oct 31, 2023 at 4:40 PM <eyal@eyal.emu.id.au> wrote:
>>
>> On 31/10/2023 21.24, Roger Heflin wrote:
>>> If you directory entries are large (lots of small files in a
>>> directory) then the recovery of the missing data could be just enough
>>> to push your array too hard.
>>
>> Nah, the directory I am copying has nothing really large, and the target directory is created new.
>>
>>> find /<mount> -type d -size +1M -ls     will find large directories.
>>>
>>> do a ls -l <largedirname> | wc -l and see how many files are in there.
>>>
>>> ext3/4 has issues with really big directories.
>>>
>>> The perf top showed just about all of the time was being spend in
>>> ext3/4 threads allocating new blocks/directory entries and such.
>>
>> Just in case there is an issue, I will copy another directory as a test.
>> [later] Same issue. This time the files were Pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
>> After a few minutes (9GB copied) it took a long pause and a second kworker started. This one gone after I killed the copy.
>>
>> However, this same content was copied from an external USB disk (NOT to the array) without a problem.
>>
>>> How much free space does the disk show in df?
>>
>> Enough  room:
>>          /dev/md127       55T   45T  9.8T  83% /data1
>>
>> I still suspect an issue with the array after it was recovered.
>>
>> A replated issue is that there is a constant rate of writes to the array (iostat says) at about 5KB/s
>> when there is no activity on this fs. In the past I saw zero read/write in iostat in this situation.
>>
>> Is there some background md process? Can it be hurried to completion?
>>
>>> On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
>>>>
>>>> On 31/10/2023 14.21, Carlos Carvalho wrote:
>>>>> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
>>>>>> look at  SAR -d output for all the disks in the raid6.   It may be a
>>>>>> disk issue (though I suspect not given the 100% cpu show in raid).
>>>>>>
>>>>>> Clearly something very expensive/deadlockish is happening because of
>>>>>> the raid having to rebuild the data from the missing disk, not sure
>>>>>> what could be wrong with it.
>>>>>
>>>>> This is very similar to what I complained some 3 months ago. For me it happens
>>>>> with an array in normal state. sar shows no disk activity yet there are no
>>>>> writes to the array (reads happen normally) and the flushd thread uses 100%
>>>>> cpu.
>>>>>
>>>>> For the latest 6.5.* I can reliably reproduce it with
>>>>> % xzcat linux-6.5.tar.xz | tar x -f -
>>>>>
>>>>> This leaves the machine with ~1.5GB of dirty pages (as reported by
>>>>> /proc/meminfo) that it never manages to write to the array. I've waited for
>>>>> several hours to no avail. After a reboot the kernel tree had only about 220MB
>>>>> instead of ~1.5GB...
>>>>
>>>> More evidence that the problem relates to the cache not flushed to disk.
>>>>
>>>> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
>>>> But it also evicts it from the cache, so nothing accumulates.
>>>> The result is a slower than otherwise copying but it streams with no pauses.
>>>>
>>>> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
>>>> only a few large files. rsync copies a very large number of small files which somehow triggers
>>>> the problem.
>>>>
>>>> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
>>>>
>>>> Just my guess.
>>>>
>>>> BTW I ran fsck on the fs (on the array) and it found no fault.
>>>>
>>>> --
>>>> Eyal at Home (eyal@eyal.emu.id.au)
>>>>
>>
>> --
>> Eyal at Home (eyal@eyal.emu.id.au)
>>

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-01 13:08             ` eyal
@ 2023-11-01 14:29               ` Roger Heflin
  2023-11-02  8:34                 ` Johannes Truschnigg
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-01 14:29 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

Did you test writing on the array?    And if you did, did you
use a big enough file and/or bypass/disable the cache?  With the
dirty* settings at their (high) defaults it can cache up to 20% of RAM
without writing it to disk.  Raid5/6 writing is much slower than reads.
If you have a single raid5/6 array of spinning disks I would be surprised
if it was able to sustain anything close to 175MB/sec.     Mine
struggles to sustain even 100MB/sec on writes.

But with my 3MB/5MB *_bytes settings (which let the cache work but
prevent massive amounts of data accumulating in cache):
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000
dd if=/dev/zero bs=256K of=testfile.out count=4000 status=progress
501481472 bytes (501 MB, 478 MiB) copied, 8 s, 62.5 MB/s^C
2183+0 records in
2183+0 records out
572260352 bytes (572 MB, 546 MiB) copied, 8.79945 s, 65.0 MB/s

Using the fedora default:
vm.dirty_background_ratio = 20
vm.dirty_ratio = 20
dd if=/dev/zero bs=256K of=testfile.out count=4000 status=progress
4000+0 records in
4000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.687841 s, 1.5 GB/s

Note that on the 2nd test writing was still happening for the next
5-8 seconds after dd exited, giving the appearance of being much faster
than it is.  As the test size is increased the speed gets lower, as
more of the data has to actually be written.  On my system with the fedora
default settings (64G ram) and a 16G file I still get
300MB/sec even though the underlying array can only really do around
70-100MB/sec.

Raid5/6 typically breaks down to about the write rate of a single disk
because of all of the extra work that has to be done for raid5/6 to
work.

Do you happen to know how fragmented the 8GB vm file is?   (filefrag
<filename>).   Each fragment requires a separate seek of several ms.
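
To try the smaller dirty limits temporarily (the values above; they revert on reboot
unless put in /etc/sysctl.d/), something like:

	sysctl -w vm.dirty_background_bytes=3000000
	sysctl -w vm.dirty_bytes=5000000

and to check the fragmentation of the saved image (the path is whatever that 8GB file is):

	filefrag -v /path/to/vm-save-file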

On Wed, Nov 1, 2023 at 8:08 AM <eyal@eyal.emu.id.au> wrote:
>
> On 01/11/2023 21.30, Roger Heflin wrote:
> > Did you check with iostat/sar?    (iostat -x 2 5).  The md
> > housekeeping/background stuff does not show on the md device itself,
> > but it shows on the underlying disks.
>
> Yes, I have iostat on both the md device and the components and I see sparse activity on both.
>
> > It might also be related the the bitmap keeping track of how far the
> > missing disk is behind.
>
> This may be the case. In case the disk is re-added it may help to know this,
> but if a new disk is added then the full disk will be recreated anyway.
>
> > Small files are troublesome.    A tiny file takes several reads and writes.
>
> Yes, but the array is so fast that this should not be a problem.
> The rsync source is 175MB/s and the target array is 800MB/s so I do not
> see how the writing can slow the copy.
>
> In one case I (virsh) saved a VM, which created one 8GB file, which took
> many hours to be written.
>
> > I think the bitmap is tracking how many writes need to be done on the
> > missing disk, and if so then until the new disk gets put back will not
> > start cleaning up.
> >
> >
> > On Tue, Oct 31, 2023 at 4:40 PM <eyal@eyal.emu.id.au> wrote:
> >>
> >> On 31/10/2023 21.24, Roger Heflin wrote:
> >>> If you directory entries are large (lots of small files in a
> >>> directory) then the recovery of the missing data could be just enough
> >>> to push your array too hard.
> >>
> >> Nah, the directory I am copying has nothing really large, and the target directory is created new.
> >>
> >>> find /<mount> -type d -size +1M -ls     will find large directories.
> >>>
> >>> do a ls -l <largedirname> | wc -l and see how many files are in there.
> >>>
> >>> ext3/4 has issues with really big directories.
> >>>
> >>> The perf top showed just about all of the time was being spend in
> >>> ext3/4 threads allocating new blocks/directory entries and such.
> >>
> >> Just in case there is an issue, I will copy another directory as a test.
> >> [later] Same issue. This time the files were Pictures, 1-3MB each, so it went faster (but not as fast as the array can sustain).
> >> After a few minutes (9GB copied) it took a long pause and a second kworker started. This one gone after I killed the copy.
> >>
> >> However, this same content was copied from an external USB disk (NOT to the array) without a problem.
> >>
> >>> How much free space does the disk show in df?
> >>
> >> Enough  room:
> >>          /dev/md127       55T   45T  9.8T  83% /data1
> >>
> >> I still suspect an issue with the array after it was recovered.
> >>
> >> A replated issue is that there is a constant rate of writes to the array (iostat says) at about 5KB/s
> >> when there is no activity on this fs. In the past I saw zero read/write in iostat in this situation.
> >>
> >> Is there some background md process? Can it be hurried to completion?
> >>
> >>> On Tue, Oct 31, 2023 at 4:29 AM <eyal@eyal.emu.id.au> wrote:
> >>>>
> >>>> On 31/10/2023 14.21, Carlos Carvalho wrote:
> >>>>> Roger Heflin (rogerheflin@gmail.com) wrote on Mon, Oct 30, 2023 at 01:14:49PM -03:
> >>>>>> look at  SAR -d output for all the disks in the raid6.   It may be a
> >>>>>> disk issue (though I suspect not given the 100% cpu show in raid).
> >>>>>>
> >>>>>> Clearly something very expensive/deadlockish is happening because of
> >>>>>> the raid having to rebuild the data from the missing disk, not sure
> >>>>>> what could be wrong with it.
> >>>>>
> >>>>> This is very similar to what I complained some 3 months ago. For me it happens
> >>>>> with an array in normal state. sar shows no disk activity yet there are no
> >>>>> writes to the array (reads happen normally) and the flushd thread uses 100%
> >>>>> cpu.
> >>>>>
> >>>>> For the latest 6.5.* I can reliably reproduce it with
> >>>>> % xzcat linux-6.5.tar.xz | tar x -f -
> >>>>>
> >>>>> This leaves the machine with ~1.5GB of dirty pages (as reported by
> >>>>> /proc/meminfo) that it never manages to write to the array. I've waited for
> >>>>> several hours to no avail. After a reboot the kernel tree had only about 220MB
> >>>>> instead of ~1.5GB...
> >>>>
> >>>> More evidence that the problem relates to the cache not flushed to disk.
> >>>>
> >>>> If I run 'rsync --fsync ...' it slows it down as the writing is flushed to disk for each file.
> >>>> But it also evicts it from the cache, so nothing accumulates.
> >>>> The result is a slower than otherwise copying but it streams with no pauses.
> >>>>
> >>>> It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> >>>> only a few large files. rsync copies a very large number of small files which somehow triggers
> >>>> the problem.
> >>>>
> >>>> This is why my 'dd if=/dev/zero of=file-on-array' goes fast without problems.
> >>>>
> >>>> Just my guess.
> >>>>
> >>>> BTW I ran fsck on the fs (on the array) and it found no fault.
> >>>>
> >>>> --
> >>>> Eyal at Home (eyal@eyal.emu.id.au)
> >>>>
> >>
> >> --
> >> Eyal at Home (eyal@eyal.emu.id.au)
> >>
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-01 14:29               ` Roger Heflin
@ 2023-11-02  8:34                 ` Johannes Truschnigg
  2023-11-02 11:27                   ` eyal
  2023-11-03 14:16                   ` Carlos Carvalho
  0 siblings, 2 replies; 35+ messages in thread
From: Johannes Truschnigg @ 2023-11-02  8:34 UTC (permalink / raw)
  To: Roger Heflin; +Cc: eyal, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1462 bytes --]

Hi list,

for the record, I do not think that any of the observations the OP made can be
explained by non-pathological phenomena/patterns of behavior. Something is
very clearly wrong with how this system behaves (the reported figures do not
at all match the expected performance of even a degraded RAID6 array in my
experience) and how data written to the filesystem apparently fails to make it
into the backing devices in acceptable time.

The whole affair reeks either of "subtle kernel bug", or maybe "subtle
hardware failure", I think.

Still, it'd be interesting to know what happens when writes to the array thru
the file system are performed with O_DIRECT in effect, i.e., using `dd
oflag=direct status=progress ...` - does this yield any observable (via
`iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
report this way with varying block sizes? Are there no concerning messages in
the debug ringbuffer (`dmesg`) at this time?

I'm not sure how we'd best learn more about what the participating busy kernel
threads (Fedora 38 might have more convenient devices up its sleeve than
ftrace?) are spending their time on in particular, but I think that's what's
needed to be known to pin down the underlying cause of the problem.
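
A concrete way to run that O_DIRECT test at a few block sizes (the target path is only
an example):

	for bs in 64k 512k 4M; do
	    dd if=/dev/zero of=/data1/direct-test.$bs bs=$bs count=1000 oflag=direct status=progress
	done
	dmesg -T | tail -n 50

Comparing the rates dd reports across block sizes, while watching iostat on the member
disks, should show whether these writes reach the disks at all.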

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +436502133337
xmpp:  johannes@truschnigg.info

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02  8:34                 ` Johannes Truschnigg
@ 2023-11-02 11:27                   ` eyal
  2023-11-02 11:57                     ` Roger Heflin
  2023-11-02 12:29                     ` eyal
  2023-11-03 14:16                   ` Carlos Carvalho
  1 sibling, 2 replies; 35+ messages in thread
From: eyal @ 2023-11-02 11:27 UTC (permalink / raw)
  To: linux-raid

On 02/11/2023 19.34, Johannes Truschnigg wrote:
> Hi list,
> 
> for the record, I do not think that any of the observations the OP made can be
> explained by non-pathological phenomena/patterns of behavior. Something is
> very clearly wrong with how this system behaves (the reported figures do not
> at all match the expected performance of even a degraded RAID6 array in my
> experience) and how data written to the filesystem apparently fails to make it
> into the backing devices in acceptable time.
> 
> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> hardware failure", I think.
> 
> Still, it'd be interesting to know what happens when writes to the array thru
> the file system are performed with O_DIRECT in effect, i.e., using `dd
> oflag=direct status=progress ...` - does this yield any observable (via
> `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
> report this way with varying block sizes? Are there no concerning messages in
> the debug ringbuffer (`dmesg`) at this time?
> 
> I'm not sure how we'd best learn more about what the participating busy kernel
> threads (Fedora 38 might have more convenient devices up its sleeve than
> ftrace?) are spending their time on in particular, but I think that's what's
> needed to be known to pin down the underlying cause of the problem.

To clarify what may have been lost in this thread.

I can dd to this fs just fine; the Dirty count does not go above about 4GB (the machine has 32GB of RAM).
It is cleared very fast. Tested writing 100GB.

The Dirty blocks rise to the max then drain quickly.

2023-11-02 21:28:27 Dirty:               236 kB
2023-11-02 21:28:37 Dirty:                 8 kB
2023-11-02 21:28:47 Dirty:                40 kB
2023-11-02 21:28:57 Dirty:                68 kB
2023-11-02 21:29:07 Dirty:               216 kB
2023-11-02 21:29:17 Dirty:               364 kB
2023-11-02 21:29:27 Dirty:                20 kB
2023-11-02 21:29:37 Dirty:                48 kB
2023-11-02 21:29:47 Dirty:                24 kB
2023-11-02 21:29:57 Dirty:                36 kB
2023-11-02 21:30:07 Dirty:                28 kB
2023-11-02 21:30:17 Dirty:                52 kB
2023-11-02 21:30:27 Dirty:                36 kB
2023-11-02 21:30:37 Dirty:           4112980 kB
2023-11-02 21:30:47 Dirty:           3772396 kB
2023-11-02 21:30:57 Dirty:           3849688 kB
2023-11-02 21:31:07 Dirty:           3761132 kB
2023-11-02 21:31:17 Dirty:           3846216 kB
2023-11-02 21:31:27 Dirty:           3855060 kB
2023-11-02 21:31:37 Dirty:           3902212 kB
2023-11-02 21:31:47 Dirty:           4173524 kB
2023-11-02 21:31:57 Dirty:           3849856 kB
2023-11-02 21:32:07 Dirty:           3940632 kB
2023-11-02 21:32:17 Dirty:           2212008 kB
2023-11-02 21:32:27 Dirty:               244 kB
2023-11-02 21:32:37 Dirty:               288 kB
2023-11-02 21:32:47 Dirty:                60 kB
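
(For reference, the samples above and below come from a simple loop along these lines;
the exact script is an assumption:)

	while sleep 10; do
	    echo "$(date '+%Y-%m-%d %H:%M:%S') $(grep Dirty /proc/meminfo)"
	done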

rsync'ing a large tree into this same fs is different.
When I killed it, less than 3GB had been copied: mostly small files in many directories,
something like 92,059 files using 2,519MB.

Note how slowly the dirty blocks are cleared.

2023-11-02 21:36:28 Dirty:               296 kB
2023-11-02 21:36:38 Dirty:            277412 kB
2023-11-02 21:36:48 Dirty:            403928 kB
2023-11-02 21:36:58 Dirty:            606916 kB
2023-11-02 21:37:08 Dirty:            753888 kB
2023-11-02 21:37:18 Dirty:            641624 kB
2023-11-02 21:37:28 Dirty:            744560 kB
2023-11-02 21:37:38 Dirty:            946864 kB
2023-11-02 21:37:48 Dirty:           1355964 kB
2023-11-02 21:37:58 Dirty:           2365632 kB
2023-11-02 21:38:08 Dirty:           2451948 kB
### at this point the rsync was cancelled as I see the kthread start
2023-11-02 21:38:18 Dirty:           2451752 kB
2023-11-02 21:38:28 Dirty:           2440496 kB
2023-11-02 21:38:38 Dirty:           2440308 kB
2023-11-02 21:38:48 Dirty:           2440136 kB
2023-11-02 21:38:58 Dirty:           2440036 kB
2023-11-02 21:39:08 Dirty:           2440240 kB
2023-11-02 21:39:18 Dirty:           2405768 kB
2023-11-02 21:39:28 Dirty:           2405784 kB
2023-11-02 21:39:38 Dirty:           2406012 kB
2023-11-02 21:39:48 Dirty:           2405908 kB
2023-11-02 21:39:58 Dirty:           2405848 kB
2023-11-02 21:40:08 Dirty:           2405876 kB
2023-11-02 21:40:18 Dirty:           2405704 kB
2023-11-02 21:40:28 Dirty:           2405628 kB
2023-11-02 21:40:38 Dirty:           2405544 kB
2023-11-02 21:40:48 Dirty:           2405484 kB
2023-11-02 21:40:58 Dirty:           2405416 kB
2023-11-02 21:41:08 Dirty:           2405412 kB
2023-11-02 21:41:18 Dirty:           2405240 kB
2023-11-02 21:41:28 Dirty:           2405148 kB
2023-11-02 21:41:38 Dirty:           2405132 kB
2023-11-02 21:41:48 Dirty:           2404428 kB
2023-11-02 21:41:58 Dirty:           2405056 kB
2023-11-02 21:42:08 Dirty:           2404944 kB
2023-11-02 21:42:18 Dirty:           2404904 kB
2023-11-02 21:42:28 Dirty:           2404316 kB
2023-11-02 21:42:38 Dirty:           2395340 kB
2023-11-02 21:42:48 Dirty:           2394540 kB
2023-11-02 21:42:58 Dirty:           2394368 kB
2023-11-02 21:43:08 Dirty:           2394520 kB
2023-11-02 21:43:18 Dirty:           2394132 kB
2023-11-02 21:43:28 Dirty:           2394032 kB
2023-11-02 21:43:38 Dirty:           2394276 kB
2023-11-02 21:43:48 Dirty:           2386960 kB
2023-11-02 21:43:58 Dirty:           2387420 kB
2023-11-02 21:44:08 Dirty:           2386620 kB
2023-11-02 21:44:18 Dirty:           2386828 kB
2023-11-02 21:44:28 Dirty:           2386104 kB
2023-11-02 21:44:38 Dirty:           2386328 kB
2023-11-02 21:44:48 Dirty:           2382520 kB
2023-11-02 21:44:58 Dirty:           2382024 kB
2023-11-02 21:45:08 Dirty:           2381344 kB
2023-11-02 21:45:18 Dirty:           2380776 kB
2023-11-02 21:45:28 Dirty:           2380424 kB
2023-11-02 21:45:38 Dirty:           2379672 kB
2023-11-02 21:45:48 Dirty:           2380180 kB
2023-11-02 21:45:58 Dirty:           2373272 kB
2023-11-02 21:46:08 Dirty:           2372416 kB
2023-11-02 21:46:18 Dirty:           2372052 kB
2023-11-02 21:46:28 Dirty:           2372176 kB
... and so on

At the same time I see the kthread in top:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1491995 root      20   0       0      0      0 R  97.3   0.0  12:50.67 kworker/u16:2+flush-9:127

And iostat of md127 shows slow writing like this:
          Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
21:59:52 md127             3.50         0.00        30.40         0.00          0        304          0
22:00:02 md127             3.10         0.00        31.20         0.00          0        312          0
22:00:12 md127             2.00         0.00        64.40         0.00          0        644          0
22:00:22 md127             3.40         0.00        22.80         0.00          0        228          0

So, why the slow clearing of the dirty blocks?

I will leave the machine to clear it into the night...

BTW, if I try to shut down in this situation it gets stuck, and if I then force a sysrq reboot I get
an array of all spares on the restart. At least this is what happened twice already.
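
If it gets stuck like that again, dumping the blocked tasks before forcing the reboot
might show what the flush thread is waiting on (needs root; output goes to the kernel log):

	echo w > /proc/sysrq-trigger
	dmesg -T | tail -n 100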

HTH

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02 11:27                   ` eyal
@ 2023-11-02 11:57                     ` Roger Heflin
  2023-11-02 13:05                       ` eyal
  2023-11-02 12:29                     ` eyal
  1 sibling, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-02 11:57 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

You need the iostat data from the disks under md127 to see what
mdraid is having to do internally.   Get the list of sd devices and
do iostat -x | grep -E 'Device|sdX|sdY|sdZ'    (putting into the list
the whole, non-partitioned sd devices in your array).

The slowness of clearing dirty, and what perf top showed (ext4* calls),
indicates it is creating entries for small files and putting data in
those files.   Creating entries for small files requires a lot of both
read and write seeks, and seeks are expensive.   At most I would expect
you to be able to rsync/copy around 30-50 files/second (when the array
is healthy; once the files get small enough the MB/sec does not
matter anymore as the seeks dominate).
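
As very rough arithmetic behind that estimate (the per-file seek count and seek time
are assumptions, not measurements):

	~4 seeks per small file (directory entry, inode, bitmap, data)
	x ~8 ms average seek on a spinning disk
	= ~32 ms per file, i.e. roughly 30 files/second, and less when the array is degraded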

On my array I do a nightly copy-in of data that was written to an SSD
during the day (security cams).   Even without any array issues that makes
my array really busy.   With anything wrong with the array, that process
takes several times as long.

On Thu, Nov 2, 2023 at 6:34 AM <eyal@eyal.emu.id.au> wrote:
>
> On 02/11/2023 19.34, Johannes Truschnigg wrote:
> > Hi list,
> >
> > for the record, I do not think that any of the observations the OP made can be
> > explained by non-pathological phenomena/patterns of behavior. Something is
> > very clearly wrong with how this system behaves (the reported figures do not
> > at all match the expected performance of even a degraded RAID6 array in my
> > experience) and how data written to the filesystem apparently fails to make it
> > into the backing devices in acceptable time.
> >
> > The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> > hardware failure", I think.
> >
> > Still, it'd be interesting to know what happens when writes to the array thru
> > the file system are performed with O_DIRECT in effect, i.e., using `dd
> > oflag=direct status=progress ...` - does this yield any observable (via
> > `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
> > report this way with varying block sizes? Are there no concerning messages in
> > the debug ringbuffer (`dmesg`) at this time?
> >
> > I'm not sure how we'd best learn more about what the participating busy kernel
> > threads (Fedora 38 might have more convenient devices up its sleeve than
> > ftrace?) are spending their time on in particular, but I think that's what's
> > needed to be known to pin down the underlying cause of the problem.
>
> To clarify what may be lost on this thread.
>
> I can dd to this fs just fine, The Dirty count does not go above about 4GB. The machine has 32KB.
> It is cleared very fast. Tested writing 100GB.
>
> The Dirty blocks rise to the max then drain quickly.
>
> 2023-11-02 21:28:27 Dirty:               236 kB
> 2023-11-02 21:28:37 Dirty:                 8 kB
> 2023-11-02 21:28:47 Dirty:                40 kB
> 2023-11-02 21:28:57 Dirty:                68 kB
> 2023-11-02 21:29:07 Dirty:               216 kB
> 2023-11-02 21:29:17 Dirty:               364 kB
> 2023-11-02 21:29:27 Dirty:                20 kB
> 2023-11-02 21:29:37 Dirty:                48 kB
> 2023-11-02 21:29:47 Dirty:                24 kB
> 2023-11-02 21:29:57 Dirty:                36 kB
> 2023-11-02 21:30:07 Dirty:                28 kB
> 2023-11-02 21:30:17 Dirty:                52 kB
> 2023-11-02 21:30:27 Dirty:                36 kB
> 2023-11-02 21:30:37 Dirty:           4112980 kB
> 2023-11-02 21:30:47 Dirty:           3772396 kB
> 2023-11-02 21:30:57 Dirty:           3849688 kB
> 2023-11-02 21:31:07 Dirty:           3761132 kB
> 2023-11-02 21:31:17 Dirty:           3846216 kB
> 2023-11-02 21:31:27 Dirty:           3855060 kB
> 2023-11-02 21:31:37 Dirty:           3902212 kB
> 2023-11-02 21:31:47 Dirty:           4173524 kB
> 2023-11-02 21:31:57 Dirty:           3849856 kB
> 2023-11-02 21:32:07 Dirty:           3940632 kB
> 2023-11-02 21:32:17 Dirty:           2212008 kB
> 2023-11-02 21:32:27 Dirty:               244 kB
> 2023-11-02 21:32:37 Dirty:               288 kB
> 2023-11-02 21:32:47 Dirty:                60 kB
>
> rsync'ing a large tree into this same fs is different.
> When killed less than 3GB were copied. Mostly small files in many directories.
> Something like 92,059 files using 2,519MB.
>
> Note how slowly the dirty blocks are cleared.
>
> 2023-11-02 21:36:28 Dirty:               296 kB
> 2023-11-02 21:36:38 Dirty:            277412 kB
> 2023-11-02 21:36:48 Dirty:            403928 kB
> 2023-11-02 21:36:58 Dirty:            606916 kB
> 2023-11-02 21:37:08 Dirty:            753888 kB
> 2023-11-02 21:37:18 Dirty:            641624 kB
> 2023-11-02 21:37:28 Dirty:            744560 kB
> 2023-11-02 21:37:38 Dirty:            946864 kB
> 2023-11-02 21:37:48 Dirty:           1355964 kB
> 2023-11-02 21:37:58 Dirty:           2365632 kB
> 2023-11-02 21:38:08 Dirty:           2451948 kB
> ### at this point the rsync was cancelled as I see the kthread start
> 2023-11-02 21:38:18 Dirty:           2451752 kB
> 2023-11-02 21:38:28 Dirty:           2440496 kB
> 2023-11-02 21:38:38 Dirty:           2440308 kB
> 2023-11-02 21:38:48 Dirty:           2440136 kB
> 2023-11-02 21:38:58 Dirty:           2440036 kB
> 2023-11-02 21:39:08 Dirty:           2440240 kB
> 2023-11-02 21:39:18 Dirty:           2405768 kB
> 2023-11-02 21:39:28 Dirty:           2405784 kB
> 2023-11-02 21:39:38 Dirty:           2406012 kB
> 2023-11-02 21:39:48 Dirty:           2405908 kB
> 2023-11-02 21:39:58 Dirty:           2405848 kB
> 2023-11-02 21:40:08 Dirty:           2405876 kB
> 2023-11-02 21:40:18 Dirty:           2405704 kB
> 2023-11-02 21:40:28 Dirty:           2405628 kB
> 2023-11-02 21:40:38 Dirty:           2405544 kB
> 2023-11-02 21:40:48 Dirty:           2405484 kB
> 2023-11-02 21:40:58 Dirty:           2405416 kB
> 2023-11-02 21:41:08 Dirty:           2405412 kB
> 2023-11-02 21:41:18 Dirty:           2405240 kB
> 2023-11-02 21:41:28 Dirty:           2405148 kB
> 2023-11-02 21:41:38 Dirty:           2405132 kB
> 2023-11-02 21:41:48 Dirty:           2404428 kB
> 2023-11-02 21:41:58 Dirty:           2405056 kB
> 2023-11-02 21:42:08 Dirty:           2404944 kB
> 2023-11-02 21:42:18 Dirty:           2404904 kB
> 2023-11-02 21:42:28 Dirty:           2404316 kB
> 2023-11-02 21:42:38 Dirty:           2395340 kB
> 2023-11-02 21:42:48 Dirty:           2394540 kB
> 2023-11-02 21:42:58 Dirty:           2394368 kB
> 2023-11-02 21:43:08 Dirty:           2394520 kB
> 2023-11-02 21:43:18 Dirty:           2394132 kB
> 2023-11-02 21:43:28 Dirty:           2394032 kB
> 2023-11-02 21:43:38 Dirty:           2394276 kB
> 2023-11-02 21:43:48 Dirty:           2386960 kB
> 2023-11-02 21:43:58 Dirty:           2387420 kB
> 2023-11-02 21:44:08 Dirty:           2386620 kB
> 2023-11-02 21:44:18 Dirty:           2386828 kB
> 2023-11-02 21:44:28 Dirty:           2386104 kB
> 2023-11-02 21:44:38 Dirty:           2386328 kB
> 2023-11-02 21:44:48 Dirty:           2382520 kB
> 2023-11-02 21:44:58 Dirty:           2382024 kB
> 2023-11-02 21:45:08 Dirty:           2381344 kB
> 2023-11-02 21:45:18 Dirty:           2380776 kB
> 2023-11-02 21:45:28 Dirty:           2380424 kB
> 2023-11-02 21:45:38 Dirty:           2379672 kB
> 2023-11-02 21:45:48 Dirty:           2380180 kB
> 2023-11-02 21:45:58 Dirty:           2373272 kB
> 2023-11-02 21:46:08 Dirty:           2372416 kB
> 2023-11-02 21:46:18 Dirty:           2372052 kB
> 2023-11-02 21:46:28 Dirty:           2372176 kB
> ... and so on
>
> At the same time I see the kthread in top:
>
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 1491995 root      20   0       0      0      0 R  97.3   0.0  12:50.67 kworker/u16:2+flush-9:127
>
> And iostat of md127 shows slow writing like this:
>           Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 21:59:52 md127             3.50         0.00        30.40         0.00          0        304          0
> 22:00:02 md127             3.10         0.00        31.20         0.00          0        312          0
> 22:00:12 md127             2.00         0.00        64.40         0.00          0        644          0
> 22:00:22 md127             3.40         0.00        22.80         0.00          0        228          0
>
> So, why the slow clearing of the dirty blocks?
>
> I will leave the machine to clear it into the night...
>
> BTW, if I try to shutdown in this situation, it will get stuck, and if I force a sysrq boot I get
> an array of all spares on restart. At least this is what happened twice already.
>
> HTH
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02 11:27                   ` eyal
  2023-11-02 11:57                     ` Roger Heflin
@ 2023-11-02 12:29                     ` eyal
  2023-11-02 12:46                       ` Reindl Harald
  1 sibling, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-02 12:29 UTC (permalink / raw)
  To: linux-raid

See update further down.

On 02/11/2023 22.27, eyal@eyal.emu.id.au wrote:
> On 02/11/2023 19.34, Johannes Truschnigg wrote:
>> Hi list,
>>
>> for the record, I do not think that any of the observations the OP made can be
>> explained by non-pathological phenomena/patterns of behavior. Something is
>> very clearly wrong with how this system behaves (the reported figures do not
>> at all match the expected performance of even a degraded RAID6 array in my
>> experience) and how data written to the filesystem apparently fails to make it
>> into the backing devices in acceptable time.
>>
>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>> hardware failure", I think.
>>
>> Still, it'd be interesting to know what happens when writes to the array thru
>> the file system are performed with O_DIRECT in effect, i.e., using `dd
>> oflag=direct status=progress ...` - does this yield any observable (via
>> `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
>> report this way with varying block sizes? Are there no concerning messages in
>> the debug ringbuffer (`dmesg`) at this time?
>>
>> I'm not sure how we'd best learn more about what the participating busy kernel
>> threads (Fedora 38 might have more convenient devices up its sleeve than
>> ftrace?) are spending their time on in particular, but I think that's what's
>> needed to be known to pin down the underlying cause of the problem.
> 
> To clarify what may be lost on this thread.
> 
> I can dd to this fs just fine, The Dirty count does not go above about 4GB. The machine has 32KB.
> It is cleared very fast. Tested writing 100GB.
> 
> The Dirty blocks rise to the max then drain quickly.
> 
> 2023-11-02 21:28:27 Dirty:               236 kB
> 2023-11-02 21:28:37 Dirty:                 8 kB
> 2023-11-02 21:28:47 Dirty:                40 kB
> 2023-11-02 21:28:57 Dirty:                68 kB
> 2023-11-02 21:29:07 Dirty:               216 kB
> 2023-11-02 21:29:17 Dirty:               364 kB
> 2023-11-02 21:29:27 Dirty:                20 kB
> 2023-11-02 21:29:37 Dirty:                48 kB
> 2023-11-02 21:29:47 Dirty:                24 kB
> 2023-11-02 21:29:57 Dirty:                36 kB
> 2023-11-02 21:30:07 Dirty:                28 kB
> 2023-11-02 21:30:17 Dirty:                52 kB
> 2023-11-02 21:30:27 Dirty:                36 kB
> 2023-11-02 21:30:37 Dirty:           4112980 kB
> 2023-11-02 21:30:47 Dirty:           3772396 kB
> 2023-11-02 21:30:57 Dirty:           3849688 kB
> 2023-11-02 21:31:07 Dirty:           3761132 kB
> 2023-11-02 21:31:17 Dirty:           3846216 kB
> 2023-11-02 21:31:27 Dirty:           3855060 kB
> 2023-11-02 21:31:37 Dirty:           3902212 kB
> 2023-11-02 21:31:47 Dirty:           4173524 kB
> 2023-11-02 21:31:57 Dirty:           3849856 kB
> 2023-11-02 21:32:07 Dirty:           3940632 kB
> 2023-11-02 21:32:17 Dirty:           2212008 kB
> 2023-11-02 21:32:27 Dirty:               244 kB
> 2023-11-02 21:32:37 Dirty:               288 kB
> 2023-11-02 21:32:47 Dirty:                60 kB
> 
> rsync'ing a large tree into this same fs is different.
> When killed less than 3GB were copied. Mostly small files in many directories.
> Something like 92,059 files using 2,519MB.
> 
> Note how slowly the dirty blocks are cleared.
> 
> 2023-11-02 21:36:28 Dirty:               296 kB
> 2023-11-02 21:36:38 Dirty:            277412 kB
> 2023-11-02 21:36:48 Dirty:            403928 kB
> 2023-11-02 21:36:58 Dirty:            606916 kB
> 2023-11-02 21:37:08 Dirty:            753888 kB
> 2023-11-02 21:37:18 Dirty:            641624 kB
> 2023-11-02 21:37:28 Dirty:            744560 kB
> 2023-11-02 21:37:38 Dirty:            946864 kB
> 2023-11-02 21:37:48 Dirty:           1355964 kB
> 2023-11-02 21:37:58 Dirty:           2365632 kB
> 2023-11-02 21:38:08 Dirty:           2451948 kB
> ### at this point the rsync was cancelled as I see the kthread start
> 2023-11-02 21:38:18 Dirty:           2451752 kB
> 2023-11-02 21:38:28 Dirty:           2440496 kB
> 2023-11-02 21:38:38 Dirty:           2440308 kB
> 2023-11-02 21:38:48 Dirty:           2440136 kB
> 2023-11-02 21:38:58 Dirty:           2440036 kB
> 2023-11-02 21:39:08 Dirty:           2440240 kB
> 2023-11-02 21:39:18 Dirty:           2405768 kB
> 2023-11-02 21:39:28 Dirty:           2405784 kB
> 2023-11-02 21:39:38 Dirty:           2406012 kB
> 2023-11-02 21:39:48 Dirty:           2405908 kB
> 2023-11-02 21:39:58 Dirty:           2405848 kB
> 2023-11-02 21:40:08 Dirty:           2405876 kB
> 2023-11-02 21:40:18 Dirty:           2405704 kB
> 2023-11-02 21:40:28 Dirty:           2405628 kB
> 2023-11-02 21:40:38 Dirty:           2405544 kB
> 2023-11-02 21:40:48 Dirty:           2405484 kB
> 2023-11-02 21:40:58 Dirty:           2405416 kB
> 2023-11-02 21:41:08 Dirty:           2405412 kB
> 2023-11-02 21:41:18 Dirty:           2405240 kB
> 2023-11-02 21:41:28 Dirty:           2405148 kB
> 2023-11-02 21:41:38 Dirty:           2405132 kB
> 2023-11-02 21:41:48 Dirty:           2404428 kB
> 2023-11-02 21:41:58 Dirty:           2405056 kB
> 2023-11-02 21:42:08 Dirty:           2404944 kB
> 2023-11-02 21:42:18 Dirty:           2404904 kB
> 2023-11-02 21:42:28 Dirty:           2404316 kB
> 2023-11-02 21:42:38 Dirty:           2395340 kB
> 2023-11-02 21:42:48 Dirty:           2394540 kB
> 2023-11-02 21:42:58 Dirty:           2394368 kB
> 2023-11-02 21:43:08 Dirty:           2394520 kB
> 2023-11-02 21:43:18 Dirty:           2394132 kB
> 2023-11-02 21:43:28 Dirty:           2394032 kB
> 2023-11-02 21:43:38 Dirty:           2394276 kB
> 2023-11-02 21:43:48 Dirty:           2386960 kB
> 2023-11-02 21:43:58 Dirty:           2387420 kB
> 2023-11-02 21:44:08 Dirty:           2386620 kB
> 2023-11-02 21:44:18 Dirty:           2386828 kB
> 2023-11-02 21:44:28 Dirty:           2386104 kB
> 2023-11-02 21:44:38 Dirty:           2386328 kB
> 2023-11-02 21:44:48 Dirty:           2382520 kB
> 2023-11-02 21:44:58 Dirty:           2382024 kB
> 2023-11-02 21:45:08 Dirty:           2381344 kB
> 2023-11-02 21:45:18 Dirty:           2380776 kB
> 2023-11-02 21:45:28 Dirty:           2380424 kB
> 2023-11-02 21:45:38 Dirty:           2379672 kB
> 2023-11-02 21:45:48 Dirty:           2380180 kB
> 2023-11-02 21:45:58 Dirty:           2373272 kB
> 2023-11-02 21:46:08 Dirty:           2372416 kB
> 2023-11-02 21:46:18 Dirty:           2372052 kB
> 2023-11-02 21:46:28 Dirty:           2372176 kB
> ... and so on

Interestingly, after about 1.5 hours, when there was about 1GB of dirty blocks left, the whole lot was cleared fast:

2023-11-02 23:08:49 Dirty:           1018924 kB
2023-11-02 23:08:59 Dirty:           1018640 kB
2023-11-02 23:09:09 Dirty:           1018732 kB
2023-11-02 23:09:19 Dirty:            592196 kB
2023-11-02 23:09:29 Dirty:              1188 kB
2023-11-02 23:09:39 Dirty:               944 kB
2023-11-02 23:09:49 Dirty:               804 kB
2023-11-02 23:09:59 Dirty:                60 kB

And iostat saw it too:
          Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
23:09:12 md127             2.80         0.00        40.40         0.00          0        404          0
23:09:22 md127          1372.33         0.80     47026.17         0.00          8     470732          0
23:09:32 md127            75.80         0.80     54763.20         0.00          8     547632          0
23:09:42 md127             0.00         0.00         0.00         0.00          0          0          0
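
For anyone wanting to reproduce this kind of log, a minimal sketch of such a sampler (assuming a
10-second interval, which matches the timestamps above) could be:

	$ while sleep 10; do echo "$(date '+%Y-%m-%d %H:%M:%S') $(grep '^Dirty:' /proc/meminfo)"; done >> dirty.log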

> At the same time I see the kthread in top:
> 
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 1491995 root      20   0       0      0      0 R  97.3   0.0  12:50.67 kworker/u16:2+flush-9:127
> 
> And iostat of md127 shows slow writing like this:
>           Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 21:59:52 md127             3.50         0.00        30.40         0.00          0        304          0
> 22:00:02 md127             3.10         0.00        31.20         0.00          0        312          0
> 22:00:12 md127             2.00         0.00        64.40         0.00          0        644          0
> 22:00:22 md127             3.40         0.00        22.80         0.00          0        228          0
> 
> So, why the slow clearing of the dirty blocks?
> 
> I will leave the machine to clear it into the night...
> 
> BTW, if I try to shutdown in this situation, it will get stuck, and if I force a sysrq boot I get
> an array of all spares on restart. At least this is what happened twice already.
> 
> HTH
> 

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02 12:29                     ` eyal
@ 2023-11-02 12:46                       ` Reindl Harald
  0 siblings, 0 replies; 35+ messages in thread
From: Reindl Harald @ 2023-11-02 12:46 UTC (permalink / raw)
  To: linux-raid



Am 02.11.23 um 13:29 schrieb eyal@eyal.emu.id.au:
> See update further down.
> 
> Interestingly, after about 1.5 hours, when there was about 1GB of dirty 
> blocks left, the whole lot was cleared fast:
> 
> 2023-11-02 23:08:49 Dirty:           1018924 kB
> 2023-11-02 23:08:59 Dirty:           1018640 kB
> 2023-11-02 23:09:09 Dirty:           1018732 kB
> 2023-11-02 23:09:19 Dirty:            592196 kB
> 2023-11-02 23:09:29 Dirty:              1188 kB
> 2023-11-02 23:09:39 Dirty:               944 kB
> 2023-11-02 23:09:49 Dirty:               804 kB
> 2023-11-02 23:09:59 Dirty:                60 kB
> 
> And iostat saw it too:
>           Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 23:09:12 md127             2.80         0.00        40.40         0.00          0        404          0
> 23:09:22 md127          1372.33         0.80     47026.17         0.00          8     470732          0
> 23:09:32 md127            75.80         0.80     54763.20         0.00          8     547632          0
> 23:09:42 md127             0.00         0.00         0.00         0.00          0          0          0

it's pretty easy: RAID6 behaves terribly in a degraded state, especially 
*with rotating disks*, and for the sake of god, as long as it is degraded and 
not fully rebuilt you should avoid any load which isn't strictly necessary

the chance that another disk dies increases, especially during the 
rebuild phase, and then start to pray, because the next unrecoverable read 
error will kill the array

a RAID10 couldn't care less at that point because it doesn't need to seek 
like crazy on the drives

---------

what i don't understand is why people don't have replacement disks on 
the shelf for every array they operate: replace the drive and leave it 
in peace until the rebuild is finished
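
(For completeness, a typical replacement goes something like the following; /dev/sdh1 here is only 
a placeholder for the new disk's partition, and the watch just tracks the rebuild:)

	# mdadm /dev/md127 --add /dev/sdh1
	$ watch -n 60 cat /proc/mdstat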

i am responsible for 7 machines at 5 locations with mdadm RAIDs of 
different sizes and there is a replacement disk for each of them - if a 
disk dies or smartd complains it's replaced and the next drive is 
ordered

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02 11:57                     ` Roger Heflin
@ 2023-11-02 13:05                       ` eyal
  2023-11-02 17:05                         ` Roger Heflin
  0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-02 13:05 UTC (permalink / raw)
  To: linux-raid

On 02/11/2023 22.57, Roger Heflin wrote:
> you need the iostat  data from the disks under md127 to see what
> mdraid is having to do to internally.   Get the list of sd devices and
> do iostat -x  | grep -E 'Device|sdX|sxy|sdz'    (putting into the list
> the whole non-partitioned sd devices in your array)

I do have iostat running on the components too, and the writing mostly reflects what md127 sees.
However, while there are no reads on md127, there are often some on the components,
very visible in the last two iostat minutes. Is there some RMW (read-modify-write) going on on the components?

Here is a snippet, when things were slow:

## meminfo data:
2023-11-02 23:05:07 Dirty:           1037680 kB
2023-11-02 23:05:17 Dirty:           1037288 kB
2023-11-02 23:05:27 Dirty:           1036560 kB
2023-11-02 23:05:37 Dirty:           1036064 kB
2023-11-02 23:05:47 Dirty:           1035720 kB
2023-11-02 23:05:57 Dirty:           1035732 kB
2023-11-02 23:06:07 Dirty:           1035860 kB
2023-11-02 23:06:17 Dirty:           1035708 kB
2023-11-02 23:06:27 Dirty:           1035648 kB
2023-11-02 23:06:37 Dirty:           1035388 kB
2023-11-02 23:06:47 Dirty:           1035336 kB
2023-11-02 23:06:57 Dirty:           1035248 kB
2023-11-02 23:07:07 Dirty:           1037312 kB
2023-11-02 23:07:17 Dirty:           1037172 kB
2023-11-02 23:07:27 Dirty:           1035224 kB
2023-11-02 23:07:37 Dirty:           1035152 kB
2023-11-02 23:07:47 Dirty:           1035064 kB
2023-11-02 23:07:57 Dirty:           1025316 kB
2023-11-02 23:08:07 Dirty:           1020000 kB
2023-11-02 23:08:17 Dirty:           1019628 kB
2023-11-02 23:08:27 Dirty:           1019784 kB
2023-11-02 23:08:37 Dirty:           1019284 kB
2023-11-02 23:08:47 Dirty:           1018908 kB
2023-11-02 23:08:57 Dirty:           1018700 kB

## iostat
23:05:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
23:05:08 md127             2.22         0.00        14.73         0.00          0        884          0
23:05:08 sdb               1.13         1.93        12.10         0.00        116        726          0
23:05:08 sdc               1.32         0.00        16.17         0.00          0        970          0
23:05:08 sdd               1.18         0.00         5.57         0.00          0        334          0
23:05:08 sde               0.83         0.00         1.50         0.00          0         90          0
23:05:08 sdf               0.83         0.00         1.50         0.00          0         90          0
23:05:08 sdg               1.27         4.47         5.97         0.00        268        358          0

23:06:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
23:06:08 md127             2.75         0.00        36.20         0.00          0       2172          0
23:06:08 sdb               1.30         4.47        22.40         0.00        268       1344          0
23:06:08 sdc               1.13         0.00        14.40         0.00          0        864          0
23:06:08 sdd               1.13         0.80        14.20         0.00         48        852          0
23:06:08 sde               0.82         8.53         9.73         0.00        512        584          0
23:06:08 sdf               0.87         8.53         7.13         0.00        512        428          0
23:06:08 sdg               0.75         8.53         1.20         0.00        512         72          0

23:07:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
23:07:08 md127             2.30         0.00        10.93         0.00          0        656          0
23:07:08 sdb               1.17         2.13         5.82         0.00        128        349          0
23:07:08 sdc               1.12         0.00         5.35         0.00          0        321          0
23:07:08 sdd               1.35         4.33         9.68         0.00        260        581          0
23:07:08 sde               1.00         4.33         5.68         0.00        260        341          0
23:07:08 sdf               0.82         0.00         3.95         0.00          0        237          0
23:07:08 sdg               1.00         4.33         5.68         0.00        260        341          0

23:08:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
23:08:08 md127            13.55         0.13       238.73         0.00          8      14324          0
23:08:08 sdb               1.92        36.47        72.25         0.00       2188       4335          0
23:08:08 sdc               1.78        27.67        66.38         0.00       1660       3983          0
23:08:08 sdd               2.25        47.67        95.52         0.00       2860       5731          0
23:08:08 sde               2.03        43.87        77.72         0.00       2632       4663          0
23:08:08 sdf               1.73        49.20        59.65         0.00       2952       3579          0
23:08:08 sdg               1.72        29.80        56.45         0.00       1788       3387          0

23:09:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
23:09:08 md127             3.22         0.00        59.47         0.00          0       3568          0
23:09:08 sdb               1.88        26.60        36.00         0.00       1596       2160          0
23:09:08 sdc               1.82        32.53        13.60         0.00       1952        816          0
23:09:08 sdd               1.83        17.07        41.80         0.00       1024       2508          0
23:09:08 sde               1.10         9.07        33.40         0.00        544       2004          0
23:09:08 sdf               0.93        21.93         3.80         0.00       1316        228          0
23:09:08 sdg               0.90        15.20         6.73         0.00        912        404          0
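
(The per-device samples above can be produced with something along these lines; the device names
match the array members listed earlier and -t adds a timestamp to each report:)

	$ iostat -t md127 sdb sdc sdd sde sdf sdg 60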

> The slowness of clearing dirty and what perf top showed (ext4* calls)
> indicates it is creating entries for small files and putting data in
> those files.   Creating entries for small files requires a lot of both
> read and write seeks and seeks are expensive.   At most I would expect
> you to be able to rsync/copy around 30-50 files/second (when the array
> is healthy and once the size gets small enough the mb/sec does not
> matter anymore as the seeks dominate).

My experience is that the array can handle all of this easily, and I never noticed
any hiccups.

> On my array I do a nightly copy in of data written during the day to a
> SSD (security cams).   And without any array issues that makes my
> array really busy.   Anything wrong with the array and that process
> takes several times as long.

What I see here is that a few seconds of usual work take a few hours to perform.
This does not look like the array just being stressed; something more substantial is going on.

> On Thu, Nov 2, 2023 at 6:34 AM <eyal@eyal.emu.id.au> wrote:
>>
>> On 02/11/2023 19.34, Johannes Truschnigg wrote:
>>> Hi list,
>>>
>>> for the record, I do not think that any of the observations the OP made can be
>>> explained by non-pathological phenomena/patterns of behavior. Something is
>>> very clearly wrong with how this system behaves (the reported figures do not
>>> at all match the expected performance of even a degraded RAID6 array in my
>>> experience) and how data written to the filesystem apparently fails to make it
>>> into the backing devices in acceptable time.
>>>
>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>>> hardware failure", I think.
>>>
>>> Still, it'd be interesting to know what happens when writes to the array thru
>>> the file system are performed with O_DIRECT in effect, i.e., using `dd
>>> oflag=direct status=progress ...` - does this yield any observable (via
>>> `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
>>> report this way with varying block sizes? Are there no concerning messages in
>>> the debug ringbuffer (`dmesg`) at this time?
>>>
>>> I'm not sure how we'd best learn more about what the participating busy kernel
>>> threads (Fedora 38 might have more convenient devices up its sleeve than
>>> ftrace?) are spending their time on in particular, but I think that's what's
>>> needed to be known to pin down the underlying cause of the problem.
>>
>> To clarify what may have been lost in this thread.
>>
>> I can dd to this fs just fine. The Dirty count does not go above about 4GB (the machine has 32GB)
>> and it is cleared very fast. Tested writing 100GB.
>>
>> The Dirty blocks rise to the max then drain quickly.
>>
>> 2023-11-02 21:28:27 Dirty:               236 kB
>> 2023-11-02 21:28:37 Dirty:                 8 kB
>> 2023-11-02 21:28:47 Dirty:                40 kB
>> 2023-11-02 21:28:57 Dirty:                68 kB
>> 2023-11-02 21:29:07 Dirty:               216 kB
>> 2023-11-02 21:29:17 Dirty:               364 kB
>> 2023-11-02 21:29:27 Dirty:                20 kB
>> 2023-11-02 21:29:37 Dirty:                48 kB
>> 2023-11-02 21:29:47 Dirty:                24 kB
>> 2023-11-02 21:29:57 Dirty:                36 kB
>> 2023-11-02 21:30:07 Dirty:                28 kB
>> 2023-11-02 21:30:17 Dirty:                52 kB
>> 2023-11-02 21:30:27 Dirty:                36 kB
>> 2023-11-02 21:30:37 Dirty:           4112980 kB
>> 2023-11-02 21:30:47 Dirty:           3772396 kB
>> 2023-11-02 21:30:57 Dirty:           3849688 kB
>> 2023-11-02 21:31:07 Dirty:           3761132 kB
>> 2023-11-02 21:31:17 Dirty:           3846216 kB
>> 2023-11-02 21:31:27 Dirty:           3855060 kB
>> 2023-11-02 21:31:37 Dirty:           3902212 kB
>> 2023-11-02 21:31:47 Dirty:           4173524 kB
>> 2023-11-02 21:31:57 Dirty:           3849856 kB
>> 2023-11-02 21:32:07 Dirty:           3940632 kB
>> 2023-11-02 21:32:17 Dirty:           2212008 kB
>> 2023-11-02 21:32:27 Dirty:               244 kB
>> 2023-11-02 21:32:37 Dirty:               288 kB
>> 2023-11-02 21:32:47 Dirty:                60 kB
>>
>> rsync'ing a large tree into this same fs is different.
>> When killed less than 3GB were copied. Mostly small files in many directories.
>> Something like 92,059 files using 2,519MB.
>>
>> Note how slowly the dirty blocks are cleared.
>>
>> 2023-11-02 21:36:28 Dirty:               296 kB
>> 2023-11-02 21:36:38 Dirty:            277412 kB
>> 2023-11-02 21:36:48 Dirty:            403928 kB
>> 2023-11-02 21:36:58 Dirty:            606916 kB
>> 2023-11-02 21:37:08 Dirty:            753888 kB
>> 2023-11-02 21:37:18 Dirty:            641624 kB
>> 2023-11-02 21:37:28 Dirty:            744560 kB
>> 2023-11-02 21:37:38 Dirty:            946864 kB
>> 2023-11-02 21:37:48 Dirty:           1355964 kB
>> 2023-11-02 21:37:58 Dirty:           2365632 kB
>> 2023-11-02 21:38:08 Dirty:           2451948 kB
>> ### at this point the rsync was cancelled as I see the kthread start
>> 2023-11-02 21:38:18 Dirty:           2451752 kB
>> 2023-11-02 21:38:28 Dirty:           2440496 kB
>> 2023-11-02 21:38:38 Dirty:           2440308 kB
>> 2023-11-02 21:38:48 Dirty:           2440136 kB
>> 2023-11-02 21:38:58 Dirty:           2440036 kB
>> 2023-11-02 21:39:08 Dirty:           2440240 kB
>> 2023-11-02 21:39:18 Dirty:           2405768 kB
>> 2023-11-02 21:39:28 Dirty:           2405784 kB
>> 2023-11-02 21:39:38 Dirty:           2406012 kB
>> 2023-11-02 21:39:48 Dirty:           2405908 kB
>> 2023-11-02 21:39:58 Dirty:           2405848 kB
>> 2023-11-02 21:40:08 Dirty:           2405876 kB
>> 2023-11-02 21:40:18 Dirty:           2405704 kB
>> 2023-11-02 21:40:28 Dirty:           2405628 kB
>> 2023-11-02 21:40:38 Dirty:           2405544 kB
>> 2023-11-02 21:40:48 Dirty:           2405484 kB
>> 2023-11-02 21:40:58 Dirty:           2405416 kB
>> 2023-11-02 21:41:08 Dirty:           2405412 kB
>> 2023-11-02 21:41:18 Dirty:           2405240 kB
>> 2023-11-02 21:41:28 Dirty:           2405148 kB
>> 2023-11-02 21:41:38 Dirty:           2405132 kB
>> 2023-11-02 21:41:48 Dirty:           2404428 kB
>> 2023-11-02 21:41:58 Dirty:           2405056 kB
>> 2023-11-02 21:42:08 Dirty:           2404944 kB
>> 2023-11-02 21:42:18 Dirty:           2404904 kB
>> 2023-11-02 21:42:28 Dirty:           2404316 kB
>> 2023-11-02 21:42:38 Dirty:           2395340 kB
>> 2023-11-02 21:42:48 Dirty:           2394540 kB
>> 2023-11-02 21:42:58 Dirty:           2394368 kB
>> 2023-11-02 21:43:08 Dirty:           2394520 kB
>> 2023-11-02 21:43:18 Dirty:           2394132 kB
>> 2023-11-02 21:43:28 Dirty:           2394032 kB
>> 2023-11-02 21:43:38 Dirty:           2394276 kB
>> 2023-11-02 21:43:48 Dirty:           2386960 kB
>> 2023-11-02 21:43:58 Dirty:           2387420 kB
>> 2023-11-02 21:44:08 Dirty:           2386620 kB
>> 2023-11-02 21:44:18 Dirty:           2386828 kB
>> 2023-11-02 21:44:28 Dirty:           2386104 kB
>> 2023-11-02 21:44:38 Dirty:           2386328 kB
>> 2023-11-02 21:44:48 Dirty:           2382520 kB
>> 2023-11-02 21:44:58 Dirty:           2382024 kB
>> 2023-11-02 21:45:08 Dirty:           2381344 kB
>> 2023-11-02 21:45:18 Dirty:           2380776 kB
>> 2023-11-02 21:45:28 Dirty:           2380424 kB
>> 2023-11-02 21:45:38 Dirty:           2379672 kB
>> 2023-11-02 21:45:48 Dirty:           2380180 kB
>> 2023-11-02 21:45:58 Dirty:           2373272 kB
>> 2023-11-02 21:46:08 Dirty:           2372416 kB
>> 2023-11-02 21:46:18 Dirty:           2372052 kB
>> 2023-11-02 21:46:28 Dirty:           2372176 kB
>> ... and so on
>>
>> At the same time I see the kthread in top:
>>
>>       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>> 1491995 root      20   0       0      0      0 R  97.3   0.0  12:50.67 kworker/u16:2+flush-9:127
>>
>> And iostat of md127 shows slow writing like this:
>>            Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
>> 21:59:52 md127             3.50         0.00        30.40         0.00          0        304          0
>> 22:00:02 md127             3.10         0.00        31.20         0.00          0        312          0
>> 22:00:12 md127             2.00         0.00        64.40         0.00          0        644          0
>> 22:00:22 md127             3.40         0.00        22.80         0.00          0        228          0
>>
>> So, why the slow clearing of the dirty blocks?
>>
>> I will leave the machine to clear it into the night...
>>
>> BTW, if I try to shutdown in this situation, it will get stuck, and if I force a sysrq boot I get
>> an array of all spares on restart. At least this is what happened twice already.
>>
>> HTH
>>
>> --
>> Eyal at Home (eyal@eyal.emu.id.au)
>>

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02 13:05                       ` eyal
@ 2023-11-02 17:05                         ` Roger Heflin
  2023-11-02 23:23                           ` eyal
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-02 17:05 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

You need to add the -x for extended stats on iostat.  That will catch
whether one of the disks has difficulty recovering bad blocks and is being
super slow.

And that slowness will come and go depending on whether you are touching the
bad blocks.
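
(A concrete form of that command, assuming the members are sdb..sdg as shown earlier, with a
60-second interval:)

	$ iostat -x 60 | grep -E 'Device|md127|sd[b-g]'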

On Thu, Nov 2, 2023 at 8:06 AM <eyal@eyal.emu.id.au> wrote:
>
> On 02/11/2023 22.57, Roger Heflin wrote:
> > you need the iostat  data from the disks under md127 to see what
> > mdraid is having to do to internally.   Get the list of sd devices and
> > do iostat -x  | grep -E 'Device|sdX|sxy|sdz'    (putting into the list
> > the whole non-partitioned sd devices in your array)
>
> I do have iostat running on the components too, and the writing mostly reflects what md127 sees.
> However, while there are no reads on md127, there are often some on the components,
> very visible in the last two iostat minutes. Is there some RMW (read-modify-write) going on on the components?
>
> Here is a snippet, when things were slow:
>
> ## meminfo data:
> 2023-11-02 23:05:07 Dirty:           1037680 kB
> 2023-11-02 23:05:17 Dirty:           1037288 kB
> 2023-11-02 23:05:27 Dirty:           1036560 kB
> 2023-11-02 23:05:37 Dirty:           1036064 kB
> 2023-11-02 23:05:47 Dirty:           1035720 kB
> 2023-11-02 23:05:57 Dirty:           1035732 kB
> 2023-11-02 23:06:07 Dirty:           1035860 kB
> 2023-11-02 23:06:17 Dirty:           1035708 kB
> 2023-11-02 23:06:27 Dirty:           1035648 kB
> 2023-11-02 23:06:37 Dirty:           1035388 kB
> 2023-11-02 23:06:47 Dirty:           1035336 kB
> 2023-11-02 23:06:57 Dirty:           1035248 kB
> 2023-11-02 23:07:07 Dirty:           1037312 kB
> 2023-11-02 23:07:17 Dirty:           1037172 kB
> 2023-11-02 23:07:27 Dirty:           1035224 kB
> 2023-11-02 23:07:37 Dirty:           1035152 kB
> 2023-11-02 23:07:47 Dirty:           1035064 kB
> 2023-11-02 23:07:57 Dirty:           1025316 kB
> 2023-11-02 23:08:07 Dirty:           1020000 kB
> 2023-11-02 23:08:17 Dirty:           1019628 kB
> 2023-11-02 23:08:27 Dirty:           1019784 kB
> 2023-11-02 23:08:37 Dirty:           1019284 kB
> 2023-11-02 23:08:47 Dirty:           1018908 kB
> 2023-11-02 23:08:57 Dirty:           1018700 kB
>
> ## iostat
> 23:05:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 23:05:08 md127             2.22         0.00        14.73         0.00          0        884          0
> 23:05:08 sdb               1.13         1.93        12.10         0.00        116        726          0
> 23:05:08 sdc               1.32         0.00        16.17         0.00          0        970          0
> 23:05:08 sdd               1.18         0.00         5.57         0.00          0        334          0
> 23:05:08 sde               0.83         0.00         1.50         0.00          0         90          0
> 23:05:08 sdf               0.83         0.00         1.50         0.00          0         90          0
> 23:05:08 sdg               1.27         4.47         5.97         0.00        268        358          0
>
> 23:06:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 23:06:08 md127             2.75         0.00        36.20         0.00          0       2172          0
> 23:06:08 sdb               1.30         4.47        22.40         0.00        268       1344          0
> 23:06:08 sdc               1.13         0.00        14.40         0.00          0        864          0
> 23:06:08 sdd               1.13         0.80        14.20         0.00         48        852          0
> 23:06:08 sde               0.82         8.53         9.73         0.00        512        584          0
> 23:06:08 sdf               0.87         8.53         7.13         0.00        512        428          0
> 23:06:08 sdg               0.75         8.53         1.20         0.00        512         72          0
>
> 23:07:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 23:07:08 md127             2.30         0.00        10.93         0.00          0        656          0
> 23:07:08 sdb               1.17         2.13         5.82         0.00        128        349          0
> 23:07:08 sdc               1.12         0.00         5.35         0.00          0        321          0
> 23:07:08 sdd               1.35         4.33         9.68         0.00        260        581          0
> 23:07:08 sde               1.00         4.33         5.68         0.00        260        341          0
> 23:07:08 sdf               0.82         0.00         3.95         0.00          0        237          0
> 23:07:08 sdg               1.00         4.33         5.68         0.00        260        341          0
>
> 23:08:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 23:08:08 md127            13.55         0.13       238.73         0.00          8      14324          0
> 23:08:08 sdb               1.92        36.47        72.25         0.00       2188       4335          0
> 23:08:08 sdc               1.78        27.67        66.38         0.00       1660       3983          0
> 23:08:08 sdd               2.25        47.67        95.52         0.00       2860       5731          0
> 23:08:08 sde               2.03        43.87        77.72         0.00       2632       4663          0
> 23:08:08 sdf               1.73        49.20        59.65         0.00       2952       3579          0
> 23:08:08 sdg               1.72        29.80        56.45         0.00       1788       3387          0
>
> 23:09:08 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> 23:09:08 md127             3.22         0.00        59.47         0.00          0       3568          0
> 23:09:08 sdb               1.88        26.60        36.00         0.00       1596       2160          0
> 23:09:08 sdc               1.82        32.53        13.60         0.00       1952        816          0
> 23:09:08 sdd               1.83        17.07        41.80         0.00       1024       2508          0
> 23:09:08 sde               1.10         9.07        33.40         0.00        544       2004          0
> 23:09:08 sdf               0.93        21.93         3.80         0.00       1316        228          0
> 23:09:08 sdg               0.90        15.20         6.73         0.00        912        404          0
>
> > The slowness of clearing dirty and what perf top showed (ext4* calls)
> > indicates it is creating entries for small files and putting data in
> > those files.   Creating entries for small files requires a lot of both
> > read and write seeks and seeks are expensive.   At most I would expect
> > you to be able to rsync/copy around 30-50 files/second (when the array
> > is healthy and once the size gets small enough the mb/sec does not
> > matter anymore as the seeks dominate).
>
> My experience is that the array can handle all of this easily, and I never noticed
> any hiccups.
>
> > On my array I do a nightly copy in of data written during the day to a
> > SSD (security cams).   And without any array issues that makes my
> > array really busy.   Anything wrong with the array and that process
> > takes several times as long.
>
> What I see here is that a few seconds of usual work take a few hours to perform.
> This does not look like the array just being stressed; something more substantial is going on.
>
> > On Thu, Nov 2, 2023 at 6:34 AM <eyal@eyal.emu.id.au> wrote:
> >>
> >> On 02/11/2023 19.34, Johannes Truschnigg wrote:
> >>> Hi list,
> >>>
> >>> for the record, I do not think that any of the observations the OP made can be
> >>> explained by non-pathological phenomena/patterns of behavior. Something is
> >>> very clearly wrong with how this system behaves (the reported figures do not
> >>> at all match the expected performance of even a degraded RAID6 array in my
> >>> experience) and how data written to the filesystem apparently fails to make it
> >>> into the backing devices in acceptable time.
> >>>
> >>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> >>> hardware failure", I think.
> >>>
> >>> Still, it'd be interesting to know what happens when writes to the array thru
> >>> the file system are performed with O_DIRECT in effect, i.e., using `dd
> >>> oflag=direct status=progress ...` - does this yield any observable (via
> >>> `iostat` et al.) thruput to the disks beneath? What transfer speeds does `dd`
> >>> report this way with varying block sizes? Are there no concerning messages in
> >>> the debug ringbuffer (`dmesg`) at this time?
> >>>
> >>> I'm not sure how we'd best learn more about what the participating busy kernel
> >>> threads (Fedora 38 might have more convenient devices up its sleeve than
> >>> ftrace?) are spending their time on in particular, but I think that's what's
> >>> needed to be known to pin down the underlying cause of the problem.
> >>
> >> To clarify what may have been lost in this thread.
> >>
> >> I can dd to this fs just fine. The Dirty count does not go above about 4GB (the machine has 32GB)
> >> and it is cleared very fast. Tested writing 100GB.
> >>
> >> The Dirty blocks rise to the max then drain quickly.
> >>
> >> 2023-11-02 21:28:27 Dirty:               236 kB
> >> 2023-11-02 21:28:37 Dirty:                 8 kB
> >> 2023-11-02 21:28:47 Dirty:                40 kB
> >> 2023-11-02 21:28:57 Dirty:                68 kB
> >> 2023-11-02 21:29:07 Dirty:               216 kB
> >> 2023-11-02 21:29:17 Dirty:               364 kB
> >> 2023-11-02 21:29:27 Dirty:                20 kB
> >> 2023-11-02 21:29:37 Dirty:                48 kB
> >> 2023-11-02 21:29:47 Dirty:                24 kB
> >> 2023-11-02 21:29:57 Dirty:                36 kB
> >> 2023-11-02 21:30:07 Dirty:                28 kB
> >> 2023-11-02 21:30:17 Dirty:                52 kB
> >> 2023-11-02 21:30:27 Dirty:                36 kB
> >> 2023-11-02 21:30:37 Dirty:           4112980 kB
> >> 2023-11-02 21:30:47 Dirty:           3772396 kB
> >> 2023-11-02 21:30:57 Dirty:           3849688 kB
> >> 2023-11-02 21:31:07 Dirty:           3761132 kB
> >> 2023-11-02 21:31:17 Dirty:           3846216 kB
> >> 2023-11-02 21:31:27 Dirty:           3855060 kB
> >> 2023-11-02 21:31:37 Dirty:           3902212 kB
> >> 2023-11-02 21:31:47 Dirty:           4173524 kB
> >> 2023-11-02 21:31:57 Dirty:           3849856 kB
> >> 2023-11-02 21:32:07 Dirty:           3940632 kB
> >> 2023-11-02 21:32:17 Dirty:           2212008 kB
> >> 2023-11-02 21:32:27 Dirty:               244 kB
> >> 2023-11-02 21:32:37 Dirty:               288 kB
> >> 2023-11-02 21:32:47 Dirty:                60 kB
> >>
> >> rsync'ing a large tree into this same fs is different.
> >> When killed less than 3GB were copied. Mostly small files in many directories.
> >> Something like 92,059 files using 2,519MB.
> >>
> >> Note how slowly the dirty blocks are cleared.
> >>
> >> 2023-11-02 21:36:28 Dirty:               296 kB
> >> 2023-11-02 21:36:38 Dirty:            277412 kB
> >> 2023-11-02 21:36:48 Dirty:            403928 kB
> >> 2023-11-02 21:36:58 Dirty:            606916 kB
> >> 2023-11-02 21:37:08 Dirty:            753888 kB
> >> 2023-11-02 21:37:18 Dirty:            641624 kB
> >> 2023-11-02 21:37:28 Dirty:            744560 kB
> >> 2023-11-02 21:37:38 Dirty:            946864 kB
> >> 2023-11-02 21:37:48 Dirty:           1355964 kB
> >> 2023-11-02 21:37:58 Dirty:           2365632 kB
> >> 2023-11-02 21:38:08 Dirty:           2451948 kB
> >> ### at this point the rsync was cancelled as I see the kthread start
> >> 2023-11-02 21:38:18 Dirty:           2451752 kB
> >> 2023-11-02 21:38:28 Dirty:           2440496 kB
> >> 2023-11-02 21:38:38 Dirty:           2440308 kB
> >> 2023-11-02 21:38:48 Dirty:           2440136 kB
> >> 2023-11-02 21:38:58 Dirty:           2440036 kB
> >> 2023-11-02 21:39:08 Dirty:           2440240 kB
> >> 2023-11-02 21:39:18 Dirty:           2405768 kB
> >> 2023-11-02 21:39:28 Dirty:           2405784 kB
> >> 2023-11-02 21:39:38 Dirty:           2406012 kB
> >> 2023-11-02 21:39:48 Dirty:           2405908 kB
> >> 2023-11-02 21:39:58 Dirty:           2405848 kB
> >> 2023-11-02 21:40:08 Dirty:           2405876 kB
> >> 2023-11-02 21:40:18 Dirty:           2405704 kB
> >> 2023-11-02 21:40:28 Dirty:           2405628 kB
> >> 2023-11-02 21:40:38 Dirty:           2405544 kB
> >> 2023-11-02 21:40:48 Dirty:           2405484 kB
> >> 2023-11-02 21:40:58 Dirty:           2405416 kB
> >> 2023-11-02 21:41:08 Dirty:           2405412 kB
> >> 2023-11-02 21:41:18 Dirty:           2405240 kB
> >> 2023-11-02 21:41:28 Dirty:           2405148 kB
> >> 2023-11-02 21:41:38 Dirty:           2405132 kB
> >> 2023-11-02 21:41:48 Dirty:           2404428 kB
> >> 2023-11-02 21:41:58 Dirty:           2405056 kB
> >> 2023-11-02 21:42:08 Dirty:           2404944 kB
> >> 2023-11-02 21:42:18 Dirty:           2404904 kB
> >> 2023-11-02 21:42:28 Dirty:           2404316 kB
> >> 2023-11-02 21:42:38 Dirty:           2395340 kB
> >> 2023-11-02 21:42:48 Dirty:           2394540 kB
> >> 2023-11-02 21:42:58 Dirty:           2394368 kB
> >> 2023-11-02 21:43:08 Dirty:           2394520 kB
> >> 2023-11-02 21:43:18 Dirty:           2394132 kB
> >> 2023-11-02 21:43:28 Dirty:           2394032 kB
> >> 2023-11-02 21:43:38 Dirty:           2394276 kB
> >> 2023-11-02 21:43:48 Dirty:           2386960 kB
> >> 2023-11-02 21:43:58 Dirty:           2387420 kB
> >> 2023-11-02 21:44:08 Dirty:           2386620 kB
> >> 2023-11-02 21:44:18 Dirty:           2386828 kB
> >> 2023-11-02 21:44:28 Dirty:           2386104 kB
> >> 2023-11-02 21:44:38 Dirty:           2386328 kB
> >> 2023-11-02 21:44:48 Dirty:           2382520 kB
> >> 2023-11-02 21:44:58 Dirty:           2382024 kB
> >> 2023-11-02 21:45:08 Dirty:           2381344 kB
> >> 2023-11-02 21:45:18 Dirty:           2380776 kB
> >> 2023-11-02 21:45:28 Dirty:           2380424 kB
> >> 2023-11-02 21:45:38 Dirty:           2379672 kB
> >> 2023-11-02 21:45:48 Dirty:           2380180 kB
> >> 2023-11-02 21:45:58 Dirty:           2373272 kB
> >> 2023-11-02 21:46:08 Dirty:           2372416 kB
> >> 2023-11-02 21:46:18 Dirty:           2372052 kB
> >> 2023-11-02 21:46:28 Dirty:           2372176 kB
> >> ... and so on
> >>
> >> At the same time I see the kthread in top:
> >>
> >>       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >> 1491995 root      20   0       0      0      0 R  97.3   0.0  12:50.67 kworker/u16:2+flush-9:127
> >>
> >> And iostat of md127 shows slow writing like this:
> >>            Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> >> 21:59:52 md127             3.50         0.00        30.40         0.00          0        304          0
> >> 22:00:02 md127             3.10         0.00        31.20         0.00          0        312          0
> >> 22:00:12 md127             2.00         0.00        64.40         0.00          0        644          0
> >> 22:00:22 md127             3.40         0.00        22.80         0.00          0        228          0
> >>
> >> So, why the slow clearing of the dirty blocks?
> >>
> >> I will leave the machine to clear it into the night...
> >>
> >> BTW, if I try to shutdown in this situation, it will get stuck, and if I force a sysrq boot I get
> >> an array of all spares on restart. At least this is what happened twice already.
> >>
> >> HTH
> >>
> >> --
> >> Eyal at Home (eyal@eyal.emu.id.au)
> >>
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02 17:05                         ` Roger Heflin
@ 2023-11-02 23:23                           ` eyal
  2023-11-03 12:08                             ` Roger Heflin
  0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-02 23:23 UTC (permalink / raw)
  To: linux-raid

On 03/11/2023 04.05, Roger Heflin wrote:
> You need to add the -x for extended stats on iostat.  That will catch
> if one of the disks has difficulty recovering bad blocks and is being
> super slow.
> 
> And that super slow will come and go based on if you are touching the
> bad blocks.

I did not know about '-x'. I see that the total columns (kB_read, kB_wrtn) are not included:-(

Here is one.

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
md127            1.88    116.72     0.00   0.00   11.27    62.19    6.31   1523.93     0.00   0.00  218.42   241.61    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.40   1.72
sdb              0.67     67.42    16.17  96.02   11.61   100.68    3.74    367.79    89.35  95.98    7.65    98.33    0.00      0.00     0.00   0.00    0.00     0.00    2.02    6.25    0.05   1.92
sdc              0.81     89.74    21.61  96.39   15.30   110.94    3.74    367.58    89.29  95.98    7.70    98.20    0.00      0.00     0.00   0.00    0.00     0.00    2.02    5.15    0.05   1.73
sdd              0.87    102.17    24.66  96.59   16.75   117.28    3.73    367.34    89.24  95.99   15.00    98.45    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.28    0.08   3.92
sde              0.87    101.87    24.58  96.56   19.38   116.46    3.72    367.45    89.28  96.00   16.20    98.71    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.30    0.08   3.94
sdf              0.81     90.11    21.70  96.39   16.24   110.80    3.73    367.15    89.20  95.99   14.19    98.51    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.17    0.07   3.91
sdg              0.68     67.91    16.28  95.97   12.17    99.30    3.73    367.20    89.21  95.98   13.28    98.32    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.10    0.06   3.86

Interesting to see that sd[bc] have a lower w_await, aqu-sz and %util and a higher f_await.
Even without yet understanding what these mean, I see that sd[bc] are model ST12000NM001G (recently replaced) while the rest are the original ST12000NM0007 (now 5yo).
I expect this reflects different tuning in the device firmware.
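
(For the record, the drive models can be cross-checked with something like:)

	$ lsblk -d -o NAME,MODEL,SIZE /dev/sd[b-g]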

I do not expect this to be relevant to the current situation.

I need to understand the reads vs the writes also. I see that wkB/s is identical for all members but rkB/s is not.
I expected these to be similar, but maybe md reads different disks at different times to make up for the missing one?

Still, thanks for your help.

> On Thu, Nov 2, 2023 at 8:06 AM <eyal@eyal.emu.id.au> wrote:

[discussion trimmed]

-- 
Eyal at Home (eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02 23:23                           ` eyal
@ 2023-11-03 12:08                             ` Roger Heflin
  0 siblings, 0 replies; 35+ messages in thread
From: Roger Heflin @ 2023-11-03 12:08 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

On Thu, Nov 2, 2023 at 6:31 PM <eyal@eyal.emu.id.au> wrote:
>
> On 03/11/2023 04.05, Roger Heflin wrote:
> > You need to add the -x for extended stats on iostat.  That will catch
> > if one of the disks has difficulty recovering bad blocks and is being
> > super slow.
> >
> > And that super slow will come and go based on if you are touching the
> > bad blocks.
>
> I did not know about '-x'. I see that the total columns (kB_read, kB_wrtn) are not included:-(
>
> Here is one.
>
> Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
> md127            1.88    116.72     0.00   0.00   11.27    62.19    6.31   1523.93     0.00   0.00  218.42   241.61    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.40   1.72
> sdb              0.67     67.42    16.17  96.02   11.61   100.68    3.74    367.79    89.35  95.98    7.65    98.33    0.00      0.00     0.00   0.00    0.00     0.00    2.02    6.25    0.05   1.92
> sdc              0.81     89.74    21.61  96.39   15.30   110.94    3.74    367.58    89.29  95.98    7.70    98.20    0.00      0.00     0.00   0.00    0.00     0.00    2.02    5.15    0.05   1.73
> sdd              0.87    102.17    24.66  96.59   16.75   117.28    3.73    367.34    89.24  95.99   15.00    98.45    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.28    0.08   3.92
> sde              0.87    101.87    24.58  96.56   19.38   116.46    3.72    367.45    89.28  96.00   16.20    98.71    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.30    0.08   3.94
> sdf              0.81     90.11    21.70  96.39   16.24   110.80    3.73    367.15    89.20  95.99   14.19    98.51    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.17    0.07   3.91
> sdg              0.68     67.91    16.28  95.97   12.17    99.30    3.73    367.20    89.21  95.98   13.28    98.32    0.00      0.00     0.00   0.00    0.00     0.00    2.02    3.10    0.06   3.86
>
> Interesting to see that sd[bc] have lower w_await,aqu-sz and %util and higher f_await.
> Even not yet understanding what these mean, I see that sd[bc] are model ST12000NM001G (recently replaced) while the rest are the original ST12000NM0007 (now 5yo).
> I expect this shows different tuning in the device fw.
>
> I do not expect this to be relevant to the current situation.
>
> I need to understand the r vs w also. I see wkB/s identical for all members, rkB/s is not.
> I expected this to be similar, but maybe md reads different disks at different times to make up for the missing one?
>
> Still, thanks for your help.

I would expect the reads to be slightly different.   Note MD is
reading 116kB/sec but the underlying disks are having to do about
500kB/sec.   MD is doing 1523kB/sec of writes while the disks do about 2200kB/sec.  So
the reads need to do roughly 4x the real reads to recover/rebuild
the data.   The interesting columns are r/s, rkB/s, r_await (how long
a read takes in ms) and w/s, wkB/s, w_await (how long a write takes
in ms), plus %util.   rrqm is read requests merged, and if you divide kB/s
by requests it indicates the average io is around 4k.
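
(For reference, working the numbers from the iostat -x sample quoted above:

	member reads:  67.42 + 89.74 + 102.17 + 101.87 + 90.11 + 67.91 = ~519 kB/s vs 116.72 kB/s at md127, i.e. ~4.5x
	member writes: 6 x ~367 kB/s = ~2204 kB/s vs 1523.93 kB/s at md127, i.e. ~1.4x)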

The %util column  is the one to watch.   If the disk is having
internal issues %util will hit close to 100% on lowish reads/writes.
If it gets close to 100% that  is a really bad sign.

You might see what that data looks like when the disk is having
issues.  You might also start using dirty_bytes and
dirty_background_bytes, which make the io suck less when your array
gets slow.
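
For example (the 256MB/64MB values here are purely illustrative, tune to taste):

	# sysctl -w vm.dirty_bytes=268435456
	# sysctl -w vm.dirty_background_bytes=67108864

To make them persistent they would go into a file under /etc/sysctl.d/.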

My array has mythtv stuff and security cam images.  During the day I
save all of that to a 500GB ssd, and then at midnight move it to the
long-term spinning disks, and during that window my disks are really
busy.  How long that takes varies with the amount collected during the
day and with whether a rebuild is running and/or something else is
going on with the array; if there are array issues it takes
longer.  I have been keeping a spare to use in emergencies.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-02  8:34                 ` Johannes Truschnigg
  2023-11-02 11:27                   ` eyal
@ 2023-11-03 14:16                   ` Carlos Carvalho
  2023-11-03 14:32                     ` Dirty page flushing regression in 6.5.x vs 6.1.x Roman Mamedov
  2023-11-03 15:57                     ` problem with recovered array Roger Heflin
  1 sibling, 2 replies; 35+ messages in thread
From: Carlos Carvalho @ 2023-11-03 14:16 UTC (permalink / raw)
  To: linux-raid

Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> for the record, I do not think that any of the observations the OP made can be
> explained by non-pathological phenomena/patterns of behavior. Something is
> very clearly wrong with how this system behaves (the reported figures do not
> at all match the expected performance of even a degraded RAID6 array in my
> experience) and how data written to the filesystem apparently fails to make it
> into the backing devices in acceptable time.
> 
> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> hardware failure", I think.

Exactly. That's what I've been saying for months...

I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
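
A reproduction along these lines (the path and kernel version are only illustrative):

	$ cd /srv/raid6/tmp
	$ time tar xf ~/linux-6.5.10.tar.xz
	$ grep -E '^(Dirty|Writeback):' /proc/meminfo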

With the dirty*bytes limits in /proc/sys/vm lowered, the dirty pages stay low BUT tar is
blocked in D state and the tarball expansion proceeds so slowly that it'd take
days to complete (checked with quota).

So 6.5 (and 6.4) are unusable in this case. In another machine, which does
hundreds of rsync downloads every day, the same problem exists and I also get
frequent random rsync timeouts.

This is all with raid6 and ext4. One of the machines has a journal disk in the
raid and the filesystem is mounted with nobarriers. Both show the same
behavior. It'd be interesting to try a different filesystem but these are
production machines with many disks and I cannot create another big array to
transfer the contents.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Dirty page flushing regression in 6.5.x vs 6.1.x
  2023-11-03 14:16                   ` Carlos Carvalho
@ 2023-11-03 14:32                     ` Roman Mamedov
  2023-11-03 15:57                     ` problem with recovered array Roger Heflin
  1 sibling, 0 replies; 35+ messages in thread
From: Roman Mamedov @ 2023-11-03 14:32 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-raid

On Fri, 3 Nov 2023 11:16:57 -0300
Carlos Carvalho <carlos@fisica.ufpr.br> wrote:

> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> > for the record, I do not think that any of the observations the OP made can be
> > explained by non-pathological phenomena/patterns of behavior. Something is
> > very clearly wrong with how this system behaves (the reported figures do not
> > at all match the expected performance of even a degraded RAID6 array in my
> > experience) and how data written to the filesystem apparently fails to make it
> > into the backing devices in acceptable time.
> > 
> > The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> > hardware failure", I think.
> 
> Exactly. That's what I've been saying for months...
> 
> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.

If you have a 100% reliable way to reproduce this, the ideal next step would be a
bisect to narrow down which commit introduced the problem. Of course it
might not be feasible to reboot dozens of times on production machines. Still,
maybe for a start you could narrow it down some more, such as checking kernels
around 6.2 and 6.3? Those are not offered anymore on kernel.org, but should be
retrievable from distro repositories or git.

Also check the 6.6 kernel, which has been released recently.
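
Roughly, against Linus' tree (the good/bad tags follow the versions mentioned above):

	$ git bisect start
	$ git bisect bad v6.5
	$ git bisect good v6.1
	# build, boot and test the checked-out commit, then mark it:
	$ git bisect good    # or: git bisect bad
	# repeat until git names the first bad commit, then: git bisect reset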

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-03 14:16                   ` Carlos Carvalho
  2023-11-03 14:32                     ` Dirty page flushing regression in 6.5.x vs 6.1.x Roman Mamedov
@ 2023-11-03 15:57                     ` Roger Heflin
  2023-11-03 22:38                       ` eyal
  1 sibling, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-03 15:57 UTC (permalink / raw)
  To: Carlos Carvalho; +Cc: linux-raid

On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>
> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> > for the record, I do not think that any of the observations the OP made can be
> > explained by non-pathological phenomena/patterns of behavior. Something is
> > very clearly wrong with how this system behaves (the reported figures do not
> > at all match the expected performance of even a degraded RAID6 array in my
> > experience) and how data written to the filesystem apparently fails to make it
> > into the backing devices in acceptable time.
> >
> > The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> > hardware failure", I think.
>
> Exactly. That's what I've been saying for months...
>
> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>
> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
> blocked in D state and the tarball expansion proceeds so slowly that it'd take
> days to complete (checked with quota).
>
> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
> hundreds of rsync downloads every day, the same problem exists and I also get
> frequent random rsync timeouts.
>
> This is all with raid6 and ext4. One of the machines has a journal disk in the
> raid and the filesystem is mounted with nobarriers. Both show the same
> behavior. It'd be interesting to try a different filesystem but these are
> production machines with many disks and I cannot create another big array to
> transfer the contents.

My array is running 6.5 + xfs, and mine all seems to work normally
(speed-wise).  And in the perf top call he ran, all of the busy
kworkers were in ext4* calls, spending a lot of time doing various
filesystem work.

I did find/debug a situation where dropping the cache caused ext4
performance to be a disaster (large directories, lots of files).  It
was tracked back to the fact that ext4 relies on the Buffers: space in
/proc/meminfo for at least directory entry caching, and that if there
are a lot of directories and/or files in directories, then Buffers:
getting dropped and/or pruned for some reason forces the
fragmented directory entries to be reloaded from a spinning
disk, requiring the disk to seek for *MINUTES* to reload them
(in this case there were several million files in a couple of
directories, with the directory entries allocated over time, so
very likely heavily fragmented).

I wonder if there was some change in how Buffers is
used/sized/pruned in recent kernels.   The same drop_caches on an
XFS filesystem had no effect that I could identify, and doing an ls -lR
on a big xfs filesystem does not make Buffers grow, but doing the same
ls -lR against an ext3/4 filesystem makes Buffers grow quite a bit (how much
depends on how many files/directories are on the filesystem).
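
A quick way to see that effect, assuming a large directory tree on the ext4 filesystem (the path is illustrative):

	$ grep Buffers: /proc/meminfo
	$ ls -lR /data1 > /dev/null
	$ grep Buffers: /proc/meminfo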

He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
see if the poor performance correlates with Buffers suddenly being
smaller for some reason.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-03 15:57                     ` problem with recovered array Roger Heflin
@ 2023-11-03 22:38                       ` eyal
  2023-11-04  0:48                         ` eyal
  0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-03 22:38 UTC (permalink / raw)
  To: linux-raid

On 04/11/2023 02.57, Roger Heflin wrote:
> On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>>
>> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
>>> for the record, I do not think that any of the observations the OP made can be
>>> explained by non-pathological phenomena/patterns of behavior. Something is
>>> very clearly wrong with how this system behaves (the reported figures do not
>>> at all match the expected performance of even a degraded RAID6 array in my
>>> experience) and how data written to the filesystem apparently fails to make it
>>> into the backing devices in acceptable time.
>>>
>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>>> hardware failure", I think.
>>
>> Exactly. That's what I've been saying for months...
>>
>> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
>> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
>> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
>> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
>> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
>> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>>
>> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
>> blocked in D state and the tarball expansion proceeds so slowly that it'd take
>> days to complete (checked with quota).
>>
>> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
>> hundreds of rsync downloads every day, the same problem exists and I also get
>> frequent random rsync timeouts.
>>
>> This is all with raid6 and ext4. One of the machines has a journal disk in the
>> raid and the filesystem is mounted with nobarriers. Both show the same
>> behavior. It'd be interesting to try a different filesystem but these are
>> production machines with many disks and I cannot create another big array to
>> transfer the contents.
> 
> My array is running 6.5 + xfs, and mine all seems to work normally
> (speed wise).  And in the perf top call he ran all of the busy
> kworkers were ext4* calls spending a lot of time doing various
> filesystem work.
> 
> I did find/debug a situation where dumping the cache caused ext4
> performance to be a disaster (large directories, lots of files).  It
> was tracked back to ext4 relies on the Buffers:  data space in
> /proc/meminfo for at least directory entry caching, and that if there
> were a lot of directories and/or files in directories that Buffer:
> getting dropped and/or getting pruned for any some reason caused the
> fragmented directory entries to have to get reloaded from a spinning
> disk and require the disk to be seeking for  *MINUTES* to reload it
> (there were in this case several million files in a couple of
> directories with the directory entries being allocated over time so
> very likely heavily fragmented).
> 
> I wonder if there was some change with how Buffers is
> used/sized/pruned in the recent kernels.   The same drop_cache on an
> XFS filesystem had no effect that I could identify and doing a ls -lR
> on a big xfs filesystem does not make Buffers grow, but doing the same
> ls -lR against an ext3/4 makes Buffers grow quite a bit (how much
> depends on how many files/directories are on the filesystem).
> 
> He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
> see if the poor performance correlates with Buffers suddenly being
> smaller for some reason.

As much as I hate this, I started another small test.

$ uname -a
Linux e7.eyal.emu.id.au 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux

$ df -h /data1
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127       55T   45T  9.8T  83% /data1

$ sudo du -sm /data2/no-backup/old-backups/tapes/01
2519    /data2/no-backup/old-backups/tapes/01

$ sudo find /data2/no-backup/old-backups/tapes/01|wc -l
92059

$ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/01 /data1/no-backup/old-backups/

It completed in about one minute and it was enough to trigger the problem.

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2781097 root      20   0       0      0      0 R  98.3   0.0  36:04.35 kworker/u16:0+flush-9:127

iostat shows nothing unusual: writes to md127 at 10-20KB/s, and %util is in the low single digits for all members.

Here is what meminfo showed in the first 10 minutes (it is still going but the trend is clear):

                         Dirty   change   Buffers   change
2023-11-04 08:56:35        44     -580   1170536        8
2023-11-04 08:56:45        48        4   1170552       16
2023-11-04 08:56:55    812456   812408   1171008      456
2023-11-04 08:57:05    538436  -274020   1180820     9812
2023-11-04 08:57:15    698708   160272   1189368     8548
2023-11-04 08:57:25    874208   175500   1195620     6252
2023-11-04 08:57:35    742300  -131908   1202124     6504
2023-11-04 08:57:45    973528   231228   1209580     7456
2023-11-04 08:57:55   1269320   295792   1214900     5320
2023-11-04 08:58:05   1624428   355108   1219764     4864
2023-11-04 08:58:15   1629484     5056   1219816       52
2023-11-04 08:58:25   1629372     -112   1219832       16
2023-11-04 08:58:35   1629028     -344   1219856       24
2023-11-04 08:58:45   1628928     -100   1219880       24
2023-11-04 08:58:55   1628552     -376   1219908       28
2023-11-04 08:59:05   1629252      700   1220072      164
2023-11-04 08:59:15   1628696     -556   1220132       60
2023-11-04 08:59:25   1628304     -392   1220156       24
2023-11-04 08:59:35   1628264      -40   1220188       32
2023-11-04 08:59:45   1628184      -80   1220340      152
2023-11-04 08:59:55   1628144      -40   1220364       24
2023-11-04 09:00:05   1628124      -20   1219940     -424
2023-11-04 09:00:15   1627908     -216   1219976       36
2023-11-04 09:00:25   1627840      -68   1220000       24
2023-11-04 09:00:35   1624276    -3564   1220024       24
2023-11-04 09:00:45   1624100     -176   1220060       36
2023-11-04 09:00:55   1623912     -188   1220092       32
2023-11-04 09:01:05   1624076      164   1220112       20
2023-11-04 09:01:15   1623368     -708   1220160       48
2023-11-04 09:01:25   1623176     -192   1220196       36
2023-11-04 09:01:35   1621872    -1304   1220232       36
2023-11-04 09:01:45   1621732     -140   1220308       76
2023-11-04 09:01:55   1612304    -9428   1220392       84
2023-11-04 09:02:05   1612256      -48   1220420       28
2023-11-04 09:02:15   1612040     -216   1220444       24
2023-11-04 09:02:25   1611968      -72   1220476       32
2023-11-04 09:02:35   1611872      -96   1220492       16
2023-11-04 09:02:45   1609932    -1940   1220524       32
2023-11-04 09:02:55   1609828     -104   1220556       32
2023-11-04 09:03:05   1609916       88   1220572       16
2023-11-04 09:03:15   1609496     -420   1220608       36
2023-11-04 09:03:25   1609392     -104   1220632       24
2023-11-04 09:03:35   1609320      -72   1220648       16
2023-11-04 09:03:45   1609240      -80   1220672       24
2023-11-04 09:03:55   1609152      -88   1220688       16
2023-11-04 09:04:05   1609332      180   1220712       24
2023-11-04 09:04:15   1608892     -440   1220748       36
2023-11-04 09:04:25   1608848      -44   1220764       16
2023-11-04 09:04:35   1608744     -104   1220796       32
2023-11-04 09:04:45   1608436     -308   1220820       24
2023-11-04 09:04:55   1607916     -520   1220836       16
2023-11-04 09:05:05   1608624      708   1220876       40
2023-11-04 09:05:15   1606556    -2068   1220928       52
2023-11-04 09:05:25   1602692    -3864   1221016       88
2023-11-04 09:05:35   1602080     -612   1221052       36
2023-11-04 09:05:45   1602000      -80   1221080       28
2023-11-04 09:05:55   1601928      -72   1221096       16
2023-11-04 09:06:05   1602228      300   1221124       28
2023-11-04 09:06:15   1601848     -380   1221156       32
2023-11-04 09:06:25   1601656     -192   1221180       24
2023-11-04 09:06:35   1601532     -124   1221212       32
2023-11-04 09:06:45   1601476      -56   1221228       16
2023-11-04 09:06:55   1601364     -112   1221252       24
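
(For the record, a minimal sketch of the kind of sampling loop that
produces the table above -- assuming a 10s interval; my actual script
may differ in the details, and the first delta of each column is of
course bogus:)

	#!/bin/bash
	# sample Dirty and Buffers (kB, as reported by /proc/meminfo) every 10s and print deltas
	prev_d=0; prev_b=0
	while sleep 10; do
	    read -r d b < <(awk '/^Dirty:/{d=$2} /^Buffers:/{b=$2} END{print d, b}' /proc/meminfo)
	    printf '%s %10d %8d %10d %8d\n' "$(date '+%F %T')" "$d" "$((d - prev_d))" "$b" "$((b - prev_b))"
	    prev_d=$d; prev_b=$b
	done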

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-03 22:38                       ` eyal
@ 2023-11-04  0:48                         ` eyal
  2023-11-04  1:01                           ` Roger Heflin
  0 siblings, 1 reply; 35+ messages in thread
From: eyal @ 2023-11-04  0:48 UTC (permalink / raw)
  To: linux-raid

On 04/11/2023 09.38, eyal@eyal.emu.id.au wrote:
> On 04/11/2023 02.57, Roger Heflin wrote:
>> On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>>>
>>> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
>>>> for the record, I do not think that any of the observations the OP made can be
>>>> explained by non-pathological phenomena/patterns of behavior. Something is
>>>> very clearly wrong with how this system behaves (the reported figures do not
>>>> at all match the expected performance of even a degraded RAID6 array in my
>>>> experience) and how data written to the filesystem apparently fails to make it
>>>> into the backing devices in acceptable time.
>>>>
>>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>>>> hardware failure", I think.
>>>
>>> Exactly. That's what I've been saying for months...
>>>
>>> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
>>> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
>>> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
>>> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
>>> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
>>> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>>>
>>> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
>>> blocked in D state and the tarball expansion proceeds so slowly that it'd take
>>> days to complete (checked with quota).
>>>
>>> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
>>> hundreds of rsync downloads every day, the same problem exists and I also get
>>> frequent random rsync timeouts.
>>>
>>> This is all with raid6 and ext4. One of the machines has a journal disk in the
>>> raid and the filesystem is mounted with nobarriers. Both show the same
>>> behavior. It'd be interesting to try a different filesystem but these are
>>> production machines with many disks and I cannot create another big array to
>>> transfer the contents.
>>
>> My array is running 6.5 + xfs, and mine all seems to work normally
>> (speed wise).  And in the perf top call he ran all of the busy
>> kworkers were ext4* calls spending a lot of time doing various
>> filesystem work.
>>
>> I did find/debug a situation where dumping the cache caused ext4
>> performance to be a disaster (large directories, lots of files).  It
>> was tracked back to ext4 relies on the Buffers:  data space in
>> /proc/meminfo for at least directory entry caching, and that if there
>> were a lot of directories and/or files in directories that Buffer:
>> getting dropped and/or getting pruned for any some reason caused the
>> fragmented directory entries to have to get reloaded from a spinning
>> disk and require the disk to be seeking for  *MINUTES* to reload it
>> (there were in this case several million files in a couple of
>> directories with the directory entries being allocated over time so
>> very likely heavily fragmented).
>>
>> I wonder if there was some change with how Buffers is
>> used/sized/pruned in the recent kernels.   The same drop_cache on an
>> XFS filesystem had no effect that I could identify and doing a ls -lR
>> on a big xfs filesystem does not make Buffers grow, but doing the same
>> ls -lR against an ext3/4 makes Buffers grow quite a bit (how much
>> depends on how many files/directories are on the filesystem).
>>
>> He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
>> see if the poor performance correlates with Buffers suddenly being
>> smaller for some reason.
> 
> As much as I hate this, I started another small test.
> 
> $ uname -a
> Linux e7.eyal.emu.id.au 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
> 
> $ df -h /data1
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/md127       55T   45T  9.8T  83% /data1
> 
> $ sudo du -sm /data2/no-backup/old-backups/tapes/01
> 2519    /data2/no-backup/old-backups/tapes/01
> 
> $ sudo find /data2/no-backup/old-backups/tapes/01|wc -l
> 92059
> 
> $ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/01 /data1/no-backup/old-backups/
> 
> It completed in about one minute and it was enough to trigger the problem.
> 
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 2781097 root      20   0       0      0      0 R  98.3   0.0  36:04.35 kworker/u16:0+flush-9:127
> 
> iostat has nothing unusual, writes to md127 at 10-20KB/s and %util is in the low single digit for all members.
> 
> Here is what meminfo showed in the first 10 minutes (it is still going but the trend is clear):
> 
>                          Dirty   change   Buffers   change
> 2023-11-04 08:56:35        44     -580   1170536        8
> 2023-11-04 08:56:45        48        4   1170552       16
> 2023-11-04 08:56:55    812456   812408   1171008      456
> 2023-11-04 08:57:05    538436  -274020   1180820     9812
> 2023-11-04 08:57:15    698708   160272   1189368     8548
> 2023-11-04 08:57:25    874208   175500   1195620     6252
> 2023-11-04 08:57:35    742300  -131908   1202124     6504
> 2023-11-04 08:57:45    973528   231228   1209580     7456
> 2023-11-04 08:57:55   1269320   295792   1214900     5320
> 2023-11-04 08:58:05   1624428   355108   1219764     4864
> 2023-11-04 08:58:15   1629484     5056   1219816       52
> 2023-11-04 08:58:25   1629372     -112   1219832       16
> 2023-11-04 08:58:35   1629028     -344   1219856       24
> 2023-11-04 08:58:45   1628928     -100   1219880       24
> 2023-11-04 08:58:55   1628552     -376   1219908       28
> 2023-11-04 08:59:05   1629252      700   1220072      164
> 2023-11-04 08:59:15   1628696     -556   1220132       60
> 2023-11-04 08:59:25   1628304     -392   1220156       24
> 2023-11-04 08:59:35   1628264      -40   1220188       32
> 2023-11-04 08:59:45   1628184      -80   1220340      152
> 2023-11-04 08:59:55   1628144      -40   1220364       24
> 2023-11-04 09:00:05   1628124      -20   1219940     -424
> 2023-11-04 09:00:15   1627908     -216   1219976       36
> 2023-11-04 09:00:25   1627840      -68   1220000       24
> 2023-11-04 09:00:35   1624276    -3564   1220024       24
> 2023-11-04 09:00:45   1624100     -176   1220060       36
> 2023-11-04 09:00:55   1623912     -188   1220092       32
> 2023-11-04 09:01:05   1624076      164   1220112       20
> 2023-11-04 09:01:15   1623368     -708   1220160       48
> 2023-11-04 09:01:25   1623176     -192   1220196       36
> 2023-11-04 09:01:35   1621872    -1304   1220232       36
> 2023-11-04 09:01:45   1621732     -140   1220308       76
> 2023-11-04 09:01:55   1612304    -9428   1220392       84
> 2023-11-04 09:02:05   1612256      -48   1220420       28
> 2023-11-04 09:02:15   1612040     -216   1220444       24
> 2023-11-04 09:02:25   1611968      -72   1220476       32
> 2023-11-04 09:02:35   1611872      -96   1220492       16
> 2023-11-04 09:02:45   1609932    -1940   1220524       32
> 2023-11-04 09:02:55   1609828     -104   1220556       32
> 2023-11-04 09:03:05   1609916       88   1220572       16
> 2023-11-04 09:03:15   1609496     -420   1220608       36
> 2023-11-04 09:03:25   1609392     -104   1220632       24
> 2023-11-04 09:03:35   1609320      -72   1220648       16
> 2023-11-04 09:03:45   1609240      -80   1220672       24
> 2023-11-04 09:03:55   1609152      -88   1220688       16
> 2023-11-04 09:04:05   1609332      180   1220712       24
> 2023-11-04 09:04:15   1608892     -440   1220748       36
> 2023-11-04 09:04:25   1608848      -44   1220764       16
> 2023-11-04 09:04:35   1608744     -104   1220796       32
> 2023-11-04 09:04:45   1608436     -308   1220820       24
> 2023-11-04 09:04:55   1607916     -520   1220836       16
> 2023-11-04 09:05:05   1608624      708   1220876       40
> 2023-11-04 09:05:15   1606556    -2068   1220928       52
> 2023-11-04 09:05:25   1602692    -3864   1221016       88
> 2023-11-04 09:05:35   1602080     -612   1221052       36
> 2023-11-04 09:05:45   1602000      -80   1221080       28
> 2023-11-04 09:05:55   1601928      -72   1221096       16
> 2023-11-04 09:06:05   1602228      300   1221124       28
> 2023-11-04 09:06:15   1601848     -380   1221156       32
> 2023-11-04 09:06:25   1601656     -192   1221180       24
> 2023-11-04 09:06:35   1601532     -124   1221212       32
> 2023-11-04 09:06:45   1601476      -56   1221228       16
> 2023-11-04 09:06:55   1601364     -112   1221252       24

Another view of interest is w_await, which I understand is the average time to service a write request (including time spent queued).
On md127 it is high, at 1-2s from this point onward, while the component disks are significantly quicker, in the low teens of milliseconds.

          Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
09:00:00 md127            0.02      0.07     0.00   0.00   18.00     4.00    2.33     13.27     0.00   0.00 1268.36     5.69    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.96   1.12
09:00:00 sdb              0.10      3.80     0.85  89.47    2.50    38.00    0.85      1.57     0.00   0.00    3.00     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.83    2.82    0.01   0.35
09:00:00 sdc              0.13      3.93     0.85  86.44    6.00    29.50    1.00      4.17     0.50  33.33    4.38     4.17    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.72    0.01   0.53
09:00:00 sdd              0.13      3.93     0.85  86.44    7.12    29.50    1.05      3.70     0.33  24.10   13.78     3.52    0.00      0.00     0.00   0.00    0.00     0.00    0.83    2.90    0.02   1.54
09:00:00 sde              0.12      3.87     0.85  87.93    2.29    33.14    0.83      1.50     0.00   0.00   14.02     1.80    0.00      0.00     0.00   0.00    0.00     0.00    0.83    0.42    0.01   1.27
09:00:00 sdf              0.03      0.13     0.00   0.00   12.50     4.00    1.37     14.63     2.75  66.80   13.83    10.71    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.88    0.02   1.84
09:00:00 sdg              0.02      0.07     0.00   0.00   21.00     4.00    1.40     14.77     2.75  66.27   13.45    10.55    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.96    0.02   1.77
09:01:00 md127            0.05      0.20     0.00   0.00   26.00     4.00    7.85     83.00     0.00   0.00 2388.76    10.57    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   18.75   1.74
09:01:00 sdb              0.23     15.73     3.70  94.07   20.36    67.43    0.98     29.12     6.68  87.17    5.75    29.62    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.75    0.02   0.73
09:01:00 sdc              0.20     10.33     2.38  92.26   12.08    51.67    0.77      9.93     2.10  73.26    3.20    12.95    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.33    0.01   0.36
09:01:00 sdd              0.23     18.60     4.42  94.98   23.07    79.71    1.03     22.38     4.93  82.68   11.61    21.66    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.62    0.02   1.31
09:01:00 sde              0.22     18.20     4.33  95.24   19.62    84.00    0.98     26.18     5.93  85.78   11.61    26.63    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.40    0.02   1.35
09:01:00 sdf              0.18     18.67     4.48  96.07   42.64   101.82    1.47     49.72    11.33  88.54   12.03    33.90    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.67    0.03   1.62
09:01:00 sdg              0.18     19.27     4.63  96.19   39.00   105.09    1.48     46.98    10.63  87.76   12.37    31.67    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.35    0.03   1.68
09:02:00 md127            0.32      1.27     0.00   0.00   26.95     4.00   14.12    214.93     0.00   0.00  639.34    15.23    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    9.03   2.80
09:02:00 sdb              0.48     51.47    12.38  96.24   18.03   106.48    1.20     63.79    15.13  92.65    3.72    53.16    0.00      0.00     0.00   0.00    0.00     0.00    0.78    3.79    0.02   0.79
09:02:00 sdc              0.53     58.07    13.98  96.33   16.16   108.88    1.30     71.99    17.08  92.93    4.95    55.38    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.15    0.02   1.08
09:02:00 sdd              0.52     69.13    16.77  97.01   14.13   133.81    1.52     92.27    21.95  93.54   10.38    60.84    0.00      0.00     0.00   0.00    0.00     0.00    0.78    3.60    0.03   1.78
09:02:00 sde              0.50     64.60    15.65  96.90   16.83   129.20    1.87     96.40    22.63  92.38   11.50    51.64    0.00      0.00     0.00   0.00    0.00     0.00    0.78    6.96    0.04   2.20
09:02:00 sdf              0.57     74.80    18.13  96.97   15.65   132.00    1.77     77.60    18.03  91.08   11.51    43.92    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.23    0.03   2.25
09:02:00 sdg              0.68     64.13    15.35  95.74    9.49    93.85    1.72     68.73    15.87  90.24   11.74    40.04    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.43    0.03   2.26
09:03:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    6.97     51.00     0.00   0.00 7200.34     7.32    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   50.16   1.38
09:03:00 sdb              0.15      6.53     1.48  90.82   11.78    43.56    0.78      9.87     2.08  72.67    4.40    12.60    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.09    0.01   0.36
09:03:00 sdc              0.15      6.53     1.48  90.82    9.44    43.56    0.80      9.93     2.08  72.25    4.90    12.42    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.27    0.01   0.36
09:03:00 sdd              0.23     12.87     2.98  92.75    3.57    55.14    1.02     22.80     5.08  83.33   12.15    22.43    0.00      0.00     0.00   0.00    0.00     0.00    0.73    2.18    0.01   1.17
09:03:00 sde              0.45     17.73     3.98  89.85    1.52    39.41    1.37     29.27     6.35  82.29   13.65    21.41    0.00      0.00     0.00   0.00    0.00     0.00    0.73    5.34    0.02   1.60
09:03:00 sdf              0.37     11.40     2.48  87.13    5.59    31.09    1.13     16.33     3.35  74.72   13.82    14.41    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.98    0.02   1.60
09:03:00 sdg              0.47     21.67     4.95  91.38    3.11    46.43    1.17     16.60     3.38  74.36   14.50    14.23    0.00      0.00     0.00   0.00    0.00     0.00    0.73    5.23    0.02   1.58
09:04:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    1.88     11.93     0.00   0.00 1439.04     6.34    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.71   0.92
09:04:00 sdb              0.38      8.73     1.80  82.44    2.00    22.78    0.90      4.70     0.72  44.33    3.94     5.22    0.00      0.00     0.00   0.00    0.00     0.00    0.82    4.02    0.01   0.48
09:04:00 sdc              0.38      8.73     1.80  82.44    1.74    22.78    0.82      1.50     0.00   0.00    2.69     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.82    2.47    0.00   0.34
09:04:00 sdd              0.30      3.47     0.57  65.38    0.39    11.56    0.98      9.97     1.95  66.48   13.14    10.14    0.00      0.00     0.00   0.00    0.00     0.00    0.82    2.14    0.01   1.40
09:04:00 sde              0.00      0.00     0.00   0.00    0.00     0.00    1.28     13.43     2.52  66.23   13.51    10.47    0.00      0.00     0.00   0.00    0.00     0.00    0.82    4.53    0.02   1.73
09:04:00 sdf              0.08      5.27     1.23  93.67    3.40    63.20    1.12      4.97     0.57  33.66   14.93     4.45    0.00      0.00     0.00   0.00    0.00     0.00    0.82    3.55    0.02   1.74
09:04:00 sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.82      1.50     0.00   0.00   12.96     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.82    0.37    0.01   1.12
09:05:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    2.27     19.27     0.00   0.00 1670.65     8.50    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.79   1.40
09:05:00 sdb              0.32      4.27     0.75  70.31    0.42    13.47    0.85      8.00     1.55  64.58    3.61     9.41    0.00      0.00     0.00   0.00    0.00     0.00    0.70    3.98    0.01   0.40
09:05:00 sdc              0.42     13.33     2.92  87.50    4.12    32.00    0.87     14.27     3.10  78.15    3.88    16.46    0.00      0.00     0.00   0.00    0.00     0.00    0.70    4.33    0.01   0.52
09:05:00 sdd              0.35      5.67     1.07  75.29    0.76    16.19    0.87     11.73     2.47  74.00   12.62    13.54    0.00      0.00     0.00   0.00    0.00     0.00    0.70    2.86    0.01   1.19
09:05:00 sde              0.00      0.00     0.00   0.00    0.00     0.00    1.23     17.47     3.53  74.13   12.08    14.16    0.00      0.00     0.00   0.00    0.00     0.00    0.70    6.19    0.02   1.51
09:05:00 sdf              0.08      5.53     1.30  93.98   11.40    66.40    1.15     13.47     2.62  69.47   13.26    11.71    0.00      0.00     0.00   0.00    0.00     0.00    0.70    4.21    0.02   1.55
09:05:00 sdg              0.08      5.53     1.30  93.98   12.00    66.40    0.80      9.07     1.87  70.00   12.04    11.33    0.00      0.00     0.00   0.00    0.00     0.00    0.70    1.24    0.01   1.03
09:06:00 md127            0.20      0.80     0.00   0.00   23.17     4.00    3.03    113.80     0.00   0.00 1063.22    37.52    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.23   2.60
09:06:00 sdb              0.30     32.80     7.90  96.34   10.56   109.33    1.22     36.85     8.37  87.30    4.70    30.29    0.00      0.00     0.00   0.00    0.00     0.00    0.72    6.56    0.01   0.76
09:06:00 sdc              0.30     37.40     9.05  96.79    8.83   124.67    1.13     42.32     9.82  89.65    4.34    37.34    0.00      0.00     0.00   0.00    0.00     0.00    0.72    4.42    0.01   0.63
09:06:00 sdd              0.58     62.60    15.07  96.27   16.83   107.31    1.43     57.72    13.37  90.32   11.91    40.27    0.00      0.00     0.00   0.00    0.00     0.00    0.72    6.74    0.03   1.79
09:06:00 sde              0.47     64.73    15.72  97.12   10.14   138.71    1.48     45.18    10.18  87.29   11.97    30.46    0.00      0.00     0.00   0.00    0.00     0.00    0.72    7.09    0.03   1.78
09:06:00 sdf              0.35     42.53    10.28  96.71   15.48   121.52    1.48     46.72    10.57  87.69   14.58    31.49    0.00      0.00     0.00   0.00    0.00     0.00    0.72    7.40    0.03   1.97
09:06:00 sdg              0.27     31.93     7.72  96.66   16.38   119.75    1.18     52.58    12.33  91.25   11.63    44.44    0.00      0.00     0.00   0.00    0.00     0.00    0.72    3.70    0.02   1.40
09:07:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    2.10     13.93     0.00   0.00 1405.96     6.63    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.95   1.00
09:07:00 sdb              0.00      0.00     0.00   0.00    0.00     0.00    0.97      3.36     0.32  24.68    3.19     3.47    0.00      0.00     0.00   0.00    0.00     0.00    0.80    3.54    0.01   0.39
09:07:00 sdc              0.15      7.47     1.72  91.96    0.44    49.78    1.17     13.43     2.63  69.30    4.29    11.51    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.31    0.01   0.48
09:07:00 sdd              0.15      7.47     1.72  91.96    0.44    49.78    1.00     11.43     2.30  69.70   11.55    11.43    0.00      0.00     0.00   0.00    0.00     0.00    0.80    1.00    0.01   1.23
09:07:00 sde              0.15      7.47     1.72  91.96    9.44    49.78    1.28     12.83     2.37  64.84   14.22     9.99    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.27    0.02   1.69
09:07:00 sdf              0.05      2.53     0.58  92.11    4.67    50.67    1.18      7.89     1.23  51.03   13.87     6.67    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.29    0.02   1.63
09:07:00 sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.80      1.43     0.00   0.00   13.48     1.78    0.00      0.00     0.00   0.00    0.00     0.00    0.80    1.08    0.01   1.14
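
(For reference, the table above is from an extended iostat run over the
array and its members -- roughly the invocation below, with the per-row
timestamps added by my wrapper; a sketch, not the exact command:)

	$ iostat -x md127 sdb sdc sdd sde sdf sdg 60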

-- 
Eyal at Home (eyal@eyal.emu.id.au)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-04  0:48                         ` eyal
@ 2023-11-04  1:01                           ` Roger Heflin
  2023-11-04 10:04                             ` eyal
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Heflin @ 2023-11-04  1:01 UTC (permalink / raw)
  To: eyal; +Cc: linux-raid

The await being so high on md127, but normal-ish on the underlying
disks, would seem to point at some kernel issue inside md's raid6 code
when one disk is out and the data has to be reconstructed.
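
(If it helps narrow this down, a few things may be worth capturing while
the flush worker is spinning -- these are standard md/raid456 sysfs and
procfs files, and <kworker_pid> below is a placeholder for the busy
worker's pid:)

	$ cat /proc/mdstat
	$ cat /sys/block/md127/md/stripe_cache_size
	$ cat /sys/block/md127/md/stripe_cache_active
	$ sudo cat /proc/<kworker_pid>/stack   # kernel stack of the busy kworker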

On Fri, Nov 3, 2023 at 7:54 PM <eyal@eyal.emu.id.au> wrote:
>
> On 04/11/2023 09.38, eyal@eyal.emu.id.au wrote:
> > On 04/11/2023 02.57, Roger Heflin wrote:
> >> On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
> >>>
> >>> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> >>>> for the record, I do not think that any of the observations the OP made can be
> >>>> explained by non-pathological phenomena/patterns of behavior. Something is
> >>>> very clearly wrong with how this system behaves (the reported figures do not
> >>>> at all match the expected performance of even a degraded RAID6 array in my
> >>>> experience) and how data written to the filesystem apparently fails to make it
> >>>> into the backing devices in acceptable time.
> >>>>
> >>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> >>>> hardware failure", I think.
> >>>
> >>> Exactly. That's what I've been saying for months...
> >>>
> >>> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> >>> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> >>> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> >>> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> >>> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> >>> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
> >>>
> >>> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
> >>> blocked in D state and the tarball expansion proceeds so slowly that it'd take
> >>> days to complete (checked with quota).
> >>>
> >>> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
> >>> hundreds of rsync downloads every day, the same problem exists and I also get
> >>> frequent random rsync timeouts.
> >>>
> >>> This is all with raid6 and ext4. One of the machines has a journal disk in the
> >>> raid and the filesystem is mounted with nobarriers. Both show the same
> >>> behavior. It'd be interesting to try a different filesystem but these are
> >>> production machines with many disks and I cannot create another big array to
> >>> transfer the contents.
> >>
> >> My array is running 6.5 + xfs, and mine all seems to work normally
> >> (speed wise).  And in the perf top call he ran all of the busy
> >> kworkers were ext4* calls spending a lot of time doing various
> >> filesystem work.
> >>
> >> I did find/debug a situation where dumping the cache caused ext4
> >> performance to be a disaster (large directories, lots of files).  It
> >> was tracked back to ext4 relies on the Buffers:  data space in
> >> /proc/meminfo for at least directory entry caching, and that if there
> >> were a lot of directories and/or files in directories that Buffer:
> >> getting dropped and/or getting pruned for any some reason caused the
> >> fragmented directory entries to have to get reloaded from a spinning
> >> disk and require the disk to be seeking for  *MINUTES* to reload it
> >> (there were in this case several million files in a couple of
> >> directories with the directory entries being allocated over time so
> >> very likely heavily fragmented).
> >>
> >> I wonder if there was some change with how Buffers is
> >> used/sized/pruned in the recent kernels.   The same drop_cache on an
> >> XFS filesystem had no effect that I could identify and doing a ls -lR
> >> on a big xfs filesystem does not make Buffers grow, but doing the same
> >> ls -lR against an ext3/4 makes Buffers grow quite a bit (how much
> >> depends on how many files/directories are on the filesystem).
> >>
> >> He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
> >> see if the poor performance correlates with Buffers suddenly being
> >> smaller for some reason.
> >
> > As much as I hate this, I started another small test.
> >
> > $ uname -a
> > Linux e7.eyal.emu.id.au 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
> >
> > $ df -h /data1
> > Filesystem      Size  Used Avail Use% Mounted on
> > /dev/md127       55T   45T  9.8T  83% /data1
> >
> > $ sudo du -sm /data2/no-backup/old-backups/tapes/01
> > 2519    /data2/no-backup/old-backups/tapes/01
> >
> > $ sudo find /data2/no-backup/old-backups/tapes/01|wc -l
> > 92059
> >
> > $ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/01 /data1/no-backup/old-backups/
> >
> > It completed in about one minute and it was enough to trigger the problem.
> >
> >      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > 2781097 root      20   0       0      0      0 R  98.3   0.0  36:04.35 kworker/u16:0+flush-9:127
> >
> > iostat has nothing unusual, writes to md127 at 10-20KB/s and %util is in the low single digit for all members.
> >
> > Here is what meminfo showed in the first 10 minutes (it is still going but the trend is clear):
> >
> >                          Dirty   change   Buffers   change
> > 2023-11-04 08:56:35        44     -580   1170536        8
> > 2023-11-04 08:56:45        48        4   1170552       16
> > 2023-11-04 08:56:55    812456   812408   1171008      456
> > 2023-11-04 08:57:05    538436  -274020   1180820     9812
> > 2023-11-04 08:57:15    698708   160272   1189368     8548
> > 2023-11-04 08:57:25    874208   175500   1195620     6252
> > 2023-11-04 08:57:35    742300  -131908   1202124     6504
> > 2023-11-04 08:57:45    973528   231228   1209580     7456
> > 2023-11-04 08:57:55   1269320   295792   1214900     5320
> > 2023-11-04 08:58:05   1624428   355108   1219764     4864
> > 2023-11-04 08:58:15   1629484     5056   1219816       52
> > 2023-11-04 08:58:25   1629372     -112   1219832       16
> > 2023-11-04 08:58:35   1629028     -344   1219856       24
> > 2023-11-04 08:58:45   1628928     -100   1219880       24
> > 2023-11-04 08:58:55   1628552     -376   1219908       28
> > 2023-11-04 08:59:05   1629252      700   1220072      164
> > 2023-11-04 08:59:15   1628696     -556   1220132       60
> > 2023-11-04 08:59:25   1628304     -392   1220156       24
> > 2023-11-04 08:59:35   1628264      -40   1220188       32
> > 2023-11-04 08:59:45   1628184      -80   1220340      152
> > 2023-11-04 08:59:55   1628144      -40   1220364       24
> > 2023-11-04 09:00:05   1628124      -20   1219940     -424
> > 2023-11-04 09:00:15   1627908     -216   1219976       36
> > 2023-11-04 09:00:25   1627840      -68   1220000       24
> > 2023-11-04 09:00:35   1624276    -3564   1220024       24
> > 2023-11-04 09:00:45   1624100     -176   1220060       36
> > 2023-11-04 09:00:55   1623912     -188   1220092       32
> > 2023-11-04 09:01:05   1624076      164   1220112       20
> > 2023-11-04 09:01:15   1623368     -708   1220160       48
> > 2023-11-04 09:01:25   1623176     -192   1220196       36
> > 2023-11-04 09:01:35   1621872    -1304   1220232       36
> > 2023-11-04 09:01:45   1621732     -140   1220308       76
> > 2023-11-04 09:01:55   1612304    -9428   1220392       84
> > 2023-11-04 09:02:05   1612256      -48   1220420       28
> > 2023-11-04 09:02:15   1612040     -216   1220444       24
> > 2023-11-04 09:02:25   1611968      -72   1220476       32
> > 2023-11-04 09:02:35   1611872      -96   1220492       16
> > 2023-11-04 09:02:45   1609932    -1940   1220524       32
> > 2023-11-04 09:02:55   1609828     -104   1220556       32
> > 2023-11-04 09:03:05   1609916       88   1220572       16
> > 2023-11-04 09:03:15   1609496     -420   1220608       36
> > 2023-11-04 09:03:25   1609392     -104   1220632       24
> > 2023-11-04 09:03:35   1609320      -72   1220648       16
> > 2023-11-04 09:03:45   1609240      -80   1220672       24
> > 2023-11-04 09:03:55   1609152      -88   1220688       16
> > 2023-11-04 09:04:05   1609332      180   1220712       24
> > 2023-11-04 09:04:15   1608892     -440   1220748       36
> > 2023-11-04 09:04:25   1608848      -44   1220764       16
> > 2023-11-04 09:04:35   1608744     -104   1220796       32
> > 2023-11-04 09:04:45   1608436     -308   1220820       24
> > 2023-11-04 09:04:55   1607916     -520   1220836       16
> > 2023-11-04 09:05:05   1608624      708   1220876       40
> > 2023-11-04 09:05:15   1606556    -2068   1220928       52
> > 2023-11-04 09:05:25   1602692    -3864   1221016       88
> > 2023-11-04 09:05:35   1602080     -612   1221052       36
> > 2023-11-04 09:05:45   1602000      -80   1221080       28
> > 2023-11-04 09:05:55   1601928      -72   1221096       16
> > 2023-11-04 09:06:05   1602228      300   1221124       28
> > 2023-11-04 09:06:15   1601848     -380   1221156       32
> > 2023-11-04 09:06:25   1601656     -192   1221180       24
> > 2023-11-04 09:06:35   1601532     -124   1221212       32
> > 2023-11-04 09:06:45   1601476      -56   1221228       16
> > 2023-11-04 09:06:55   1601364     -112   1221252       24
>
> Another view of interest, is the fact that w_await, which I understand is the total time to service a write request,
> is high, at 1-2s from this point onward. The components are significantly quicker at the low teens.
>
>           Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
> 09:00:00 md127            0.02      0.07     0.00   0.00   18.00     4.00    2.33     13.27     0.00   0.00 1268.36     5.69    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.96   1.12
> 09:00:00 sdb              0.10      3.80     0.85  89.47    2.50    38.00    0.85      1.57     0.00   0.00    3.00     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.83    2.82    0.01   0.35
> 09:00:00 sdc              0.13      3.93     0.85  86.44    6.00    29.50    1.00      4.17     0.50  33.33    4.38     4.17    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.72    0.01   0.53
> 09:00:00 sdd              0.13      3.93     0.85  86.44    7.12    29.50    1.05      3.70     0.33  24.10   13.78     3.52    0.00      0.00     0.00   0.00    0.00     0.00    0.83    2.90    0.02   1.54
> 09:00:00 sde              0.12      3.87     0.85  87.93    2.29    33.14    0.83      1.50     0.00   0.00   14.02     1.80    0.00      0.00     0.00   0.00    0.00     0.00    0.83    0.42    0.01   1.27
> 09:00:00 sdf              0.03      0.13     0.00   0.00   12.50     4.00    1.37     14.63     2.75  66.80   13.83    10.71    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.88    0.02   1.84
> 09:00:00 sdg              0.02      0.07     0.00   0.00   21.00     4.00    1.40     14.77     2.75  66.27   13.45    10.55    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.96    0.02   1.77
> 09:01:00 md127            0.05      0.20     0.00   0.00   26.00     4.00    7.85     83.00     0.00   0.00 2388.76    10.57    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   18.75   1.74
> 09:01:00 sdb              0.23     15.73     3.70  94.07   20.36    67.43    0.98     29.12     6.68  87.17    5.75    29.62    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.75    0.02   0.73
> 09:01:00 sdc              0.20     10.33     2.38  92.26   12.08    51.67    0.77      9.93     2.10  73.26    3.20    12.95    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.33    0.01   0.36
> 09:01:00 sdd              0.23     18.60     4.42  94.98   23.07    79.71    1.03     22.38     4.93  82.68   11.61    21.66    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.62    0.02   1.31
> 09:01:00 sde              0.22     18.20     4.33  95.24   19.62    84.00    0.98     26.18     5.93  85.78   11.61    26.63    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.40    0.02   1.35
> 09:01:00 sdf              0.18     18.67     4.48  96.07   42.64   101.82    1.47     49.72    11.33  88.54   12.03    33.90    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.67    0.03   1.62
> 09:01:00 sdg              0.18     19.27     4.63  96.19   39.00   105.09    1.48     46.98    10.63  87.76   12.37    31.67    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.35    0.03   1.68
> 09:02:00 md127            0.32      1.27     0.00   0.00   26.95     4.00   14.12    214.93     0.00   0.00  639.34    15.23    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    9.03   2.80
> 09:02:00 sdb              0.48     51.47    12.38  96.24   18.03   106.48    1.20     63.79    15.13  92.65    3.72    53.16    0.00      0.00     0.00   0.00    0.00     0.00    0.78    3.79    0.02   0.79
> 09:02:00 sdc              0.53     58.07    13.98  96.33   16.16   108.88    1.30     71.99    17.08  92.93    4.95    55.38    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.15    0.02   1.08
> 09:02:00 sdd              0.52     69.13    16.77  97.01   14.13   133.81    1.52     92.27    21.95  93.54   10.38    60.84    0.00      0.00     0.00   0.00    0.00     0.00    0.78    3.60    0.03   1.78
> 09:02:00 sde              0.50     64.60    15.65  96.90   16.83   129.20    1.87     96.40    22.63  92.38   11.50    51.64    0.00      0.00     0.00   0.00    0.00     0.00    0.78    6.96    0.04   2.20
> 09:02:00 sdf              0.57     74.80    18.13  96.97   15.65   132.00    1.77     77.60    18.03  91.08   11.51    43.92    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.23    0.03   2.25
> 09:02:00 sdg              0.68     64.13    15.35  95.74    9.49    93.85    1.72     68.73    15.87  90.24   11.74    40.04    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.43    0.03   2.26
> 09:03:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    6.97     51.00     0.00   0.00 7200.34     7.32    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   50.16   1.38
> 09:03:00 sdb              0.15      6.53     1.48  90.82   11.78    43.56    0.78      9.87     2.08  72.67    4.40    12.60    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.09    0.01   0.36
> 09:03:00 sdc              0.15      6.53     1.48  90.82    9.44    43.56    0.80      9.93     2.08  72.25    4.90    12.42    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.27    0.01   0.36
> 09:03:00 sdd              0.23     12.87     2.98  92.75    3.57    55.14    1.02     22.80     5.08  83.33   12.15    22.43    0.00      0.00     0.00   0.00    0.00     0.00    0.73    2.18    0.01   1.17
> 09:03:00 sde              0.45     17.73     3.98  89.85    1.52    39.41    1.37     29.27     6.35  82.29   13.65    21.41    0.00      0.00     0.00   0.00    0.00     0.00    0.73    5.34    0.02   1.60
> 09:03:00 sdf              0.37     11.40     2.48  87.13    5.59    31.09    1.13     16.33     3.35  74.72   13.82    14.41    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.98    0.02   1.60
> 09:03:00 sdg              0.47     21.67     4.95  91.38    3.11    46.43    1.17     16.60     3.38  74.36   14.50    14.23    0.00      0.00     0.00   0.00    0.00     0.00    0.73    5.23    0.02   1.58
> 09:04:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    1.88     11.93     0.00   0.00 1439.04     6.34    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.71   0.92
> 09:04:00 sdb              0.38      8.73     1.80  82.44    2.00    22.78    0.90      4.70     0.72  44.33    3.94     5.22    0.00      0.00     0.00   0.00    0.00     0.00    0.82    4.02    0.01   0.48
> 09:04:00 sdc              0.38      8.73     1.80  82.44    1.74    22.78    0.82      1.50     0.00   0.00    2.69     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.82    2.47    0.00   0.34
> 09:04:00 sdd              0.30      3.47     0.57  65.38    0.39    11.56    0.98      9.97     1.95  66.48   13.14    10.14    0.00      0.00     0.00   0.00    0.00     0.00    0.82    2.14    0.01   1.40
> 09:04:00 sde              0.00      0.00     0.00   0.00    0.00     0.00    1.28     13.43     2.52  66.23   13.51    10.47    0.00      0.00     0.00   0.00    0.00     0.00    0.82    4.53    0.02   1.73
> 09:04:00 sdf              0.08      5.27     1.23  93.67    3.40    63.20    1.12      4.97     0.57  33.66   14.93     4.45    0.00      0.00     0.00   0.00    0.00     0.00    0.82    3.55    0.02   1.74
> 09:04:00 sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.82      1.50     0.00   0.00   12.96     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.82    0.37    0.01   1.12
> 09:05:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    2.27     19.27     0.00   0.00 1670.65     8.50    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.79   1.40
> 09:05:00 sdb              0.32      4.27     0.75  70.31    0.42    13.47    0.85      8.00     1.55  64.58    3.61     9.41    0.00      0.00     0.00   0.00    0.00     0.00    0.70    3.98    0.01   0.40
> 09:05:00 sdc              0.42     13.33     2.92  87.50    4.12    32.00    0.87     14.27     3.10  78.15    3.88    16.46    0.00      0.00     0.00   0.00    0.00     0.00    0.70    4.33    0.01   0.52
> 09:05:00 sdd              0.35      5.67     1.07  75.29    0.76    16.19    0.87     11.73     2.47  74.00   12.62    13.54    0.00      0.00     0.00   0.00    0.00     0.00    0.70    2.86    0.01   1.19
> 09:05:00 sde              0.00      0.00     0.00   0.00    0.00     0.00    1.23     17.47     3.53  74.13   12.08    14.16    0.00      0.00     0.00   0.00    0.00     0.00    0.70    6.19    0.02   1.51
> 09:05:00 sdf              0.08      5.53     1.30  93.98   11.40    66.40    1.15     13.47     2.62  69.47   13.26    11.71    0.00      0.00     0.00   0.00    0.00     0.00    0.70    4.21    0.02   1.55
> 09:05:00 sdg              0.08      5.53     1.30  93.98   12.00    66.40    0.80      9.07     1.87  70.00   12.04    11.33    0.00      0.00     0.00   0.00    0.00     0.00    0.70    1.24    0.01   1.03
> 09:06:00 md127            0.20      0.80     0.00   0.00   23.17     4.00    3.03    113.80     0.00   0.00 1063.22    37.52    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.23   2.60
> 09:06:00 sdb              0.30     32.80     7.90  96.34   10.56   109.33    1.22     36.85     8.37  87.30    4.70    30.29    0.00      0.00     0.00   0.00    0.00     0.00    0.72    6.56    0.01   0.76
> 09:06:00 sdc              0.30     37.40     9.05  96.79    8.83   124.67    1.13     42.32     9.82  89.65    4.34    37.34    0.00      0.00     0.00   0.00    0.00     0.00    0.72    4.42    0.01   0.63
> 09:06:00 sdd              0.58     62.60    15.07  96.27   16.83   107.31    1.43     57.72    13.37  90.32   11.91    40.27    0.00      0.00     0.00   0.00    0.00     0.00    0.72    6.74    0.03   1.79
> 09:06:00 sde              0.47     64.73    15.72  97.12   10.14   138.71    1.48     45.18    10.18  87.29   11.97    30.46    0.00      0.00     0.00   0.00    0.00     0.00    0.72    7.09    0.03   1.78
> 09:06:00 sdf              0.35     42.53    10.28  96.71   15.48   121.52    1.48     46.72    10.57  87.69   14.58    31.49    0.00      0.00     0.00   0.00    0.00     0.00    0.72    7.40    0.03   1.97
> 09:06:00 sdg              0.27     31.93     7.72  96.66   16.38   119.75    1.18     52.58    12.33  91.25   11.63    44.44    0.00      0.00     0.00   0.00    0.00     0.00    0.72    3.70    0.02   1.40
> 09:07:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    2.10     13.93     0.00   0.00 1405.96     6.63    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.95   1.00
> 09:07:00 sdb              0.00      0.00     0.00   0.00    0.00     0.00    0.97      3.36     0.32  24.68    3.19     3.47    0.00      0.00     0.00   0.00    0.00     0.00    0.80    3.54    0.01   0.39
> 09:07:00 sdc              0.15      7.47     1.72  91.96    0.44    49.78    1.17     13.43     2.63  69.30    4.29    11.51    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.31    0.01   0.48
> 09:07:00 sdd              0.15      7.47     1.72  91.96    0.44    49.78    1.00     11.43     2.30  69.70   11.55    11.43    0.00      0.00     0.00   0.00    0.00     0.00    0.80    1.00    0.01   1.23
> 09:07:00 sde              0.15      7.47     1.72  91.96    9.44    49.78    1.28     12.83     2.37  64.84   14.22     9.99    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.27    0.02   1.69
> 09:07:00 sdf              0.05      2.53     0.58  92.11    4.67    50.67    1.18      7.89     1.23  51.03   13.87     6.67    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.29    0.02   1.63
> 09:07:00 sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.80      1.43     0.00   0.00   13.48     1.78    0.00      0.00     0.00   0.00    0.00     0.00    0.80    1.08    0.01   1.14
>
> --
> Eyal at Home (eyal@eyal.emu.id.au)
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: problem with recovered array
  2023-11-04  1:01                           ` Roger Heflin
@ 2023-11-04 10:04                             ` eyal
  0 siblings, 0 replies; 35+ messages in thread
From: eyal @ 2023-11-04 10:04 UTC (permalink / raw)
  To: linux-raid

See update at the bottom.

On 04/11/2023 12.01, Roger Heflin wrote:
> The await being so high on md127, but normish on the underlying disks
> would seem to point at some kernel issue inside md using raid6 where
> one disk is out and the data has to be reconstructed.

Should I continue with this thread or should I log a kernel bug report?
Or maybe there is a more suitable list to post this to?

> On Fri, Nov 3, 2023 at 7:54 PM <eyal@eyal.emu.id.au> wrote:
>>
>> On 04/11/2023 09.38, eyal@eyal.emu.id.au wrote:
>>> On 04/11/2023 02.57, Roger Heflin wrote:
>>>> On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>>>>>
>>>>> Johannes Truschnigg (johannes@truschnigg.info) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
>>>>>> for the record, I do not think that any of the observations the OP made can be
>>>>>> explained by non-pathological phenomena/patterns of behavior. Something is
>>>>>> very clearly wrong with how this system behaves (the reported figures do not
>>>>>> at all match the expected performance of even a degraded RAID6 array in my
>>>>>> experience) and how data written to the filesystem apparently fails to make it
>>>>>> into the backing devices in acceptable time.
>>>>>>
>>>>>> The whole affair reeks either of "subtle kernel bug", or maybe "subtle
>>>>>> hardware failure", I think.
>>>>>
>>>>> Exactly. That's what I've been saying for months...
>>>>>
>>>>> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
>>>>> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
>>>>> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
>>>>> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
>>>>> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
>>>>> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>>>>>
>>>>> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
>>>>> blocked in D state and the tarball expansion proceeds so slowly that it'd take
>>>>> days to complete (checked with quota).
>>>>>
>>>>> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
>>>>> hundreds of rsync downloads every day, the same problem exists and I also get
>>>>> frequent random rsync timeouts.
>>>>>
>>>>> This is all with raid6 and ext4. One of the machines has a journal disk in the
>>>>> raid and the filesystem is mounted with nobarriers. Both show the same
>>>>> behavior. It'd be interesting to try a different filesystem but these are
>>>>> production machines with many disks and I cannot create another big array to
>>>>> transfer the contents.
>>>>
>>>> My array is running 6.5 + xfs, and mine all seems to work normally
>>>> (speed wise).  And in the perf top call he ran all of the busy
>>>> kworkers were ext4* calls spending a lot of time doing various
>>>> filesystem work.
>>>>
>>>> I did find/debug a situation where dumping the cache caused ext4
>>>> performance to be a disaster (large directories, lots of files).  It
>>>> was tracked back to ext4 relies on the Buffers:  data space in
>>>> /proc/meminfo for at least directory entry caching, and that if there
>>>> were a lot of directories and/or files in directories that Buffer:
>>>> getting dropped and/or getting pruned for any some reason caused the
>>>> fragmented directory entries to have to get reloaded from a spinning
>>>> disk and require the disk to be seeking for  *MINUTES* to reload it
>>>> (there were in this case several million files in a couple of
>>>> directories with the directory entries being allocated over time so
>>>> very likely heavily fragmented).
>>>>
>>>> I wonder if there was some change with how Buffers is
>>>> used/sized/pruned in the recent kernels.   The same drop_cache on an
>>>> XFS filesystem had no effect that I could identify and doing a ls -lR
>>>> on a big xfs filesystem does not make Buffers grow, but doing the same
>>>> ls -lR against an ext3/4 makes Buffers grow quite a bit (how much
>>>> depends on how many files/directories are on the filesystem).
>>>>
>>>> He may want to monitor buffers (cat /proc/meminfo | grep Buffers:) and
>>>> see if the poor performance correlates with Buffers suddenly being
>>>> smaller for some reason.
>>>
>>> As much as I hate this, I started another small test.
>>>
>>> $ uname -a
>>> Linux e7.eyal.emu.id.au 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux
>>>
>>> $ df -h /data1
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/md127       55T   45T  9.8T  83% /data1
>>>
>>> $ sudo du -sm /data2/no-backup/old-backups/tapes/01
>>> 2519    /data2/no-backup/old-backups/tapes/01
>>>
>>> $ sudo find /data2/no-backup/old-backups/tapes/01|wc -l
>>> 92059
>>>
>>> $ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/01 /data1/no-backup/old-backups/
>>>
>>> It completed in about one minute and it was enough to trigger the problem.
>>>
>>>       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>> 2781097 root      20   0       0      0      0 R  98.3   0.0  36:04.35 kworker/u16:0+flush-9:127
>>>
>>> iostat has nothing unusual, writes to md127 at 10-20KB/s and %util is in the low single digit for all members.
>>>
>>> Here is what meminfo showed in the first 10 minutes (it is still going but the trend is clear):
>>>
>>>                           Dirty   change   Buffers   change
>>> 2023-11-04 08:56:35        44     -580   1170536        8
>>> 2023-11-04 08:56:45        48        4   1170552       16
>>> 2023-11-04 08:56:55    812456   812408   1171008      456
>>> 2023-11-04 08:57:05    538436  -274020   1180820     9812
>>> 2023-11-04 08:57:15    698708   160272   1189368     8548
>>> 2023-11-04 08:57:25    874208   175500   1195620     6252
>>> 2023-11-04 08:57:35    742300  -131908   1202124     6504
>>> 2023-11-04 08:57:45    973528   231228   1209580     7456
>>> 2023-11-04 08:57:55   1269320   295792   1214900     5320
>>> 2023-11-04 08:58:05   1624428   355108   1219764     4864
>>> 2023-11-04 08:58:15   1629484     5056   1219816       52
>>> 2023-11-04 08:58:25   1629372     -112   1219832       16
>>> 2023-11-04 08:58:35   1629028     -344   1219856       24
>>> 2023-11-04 08:58:45   1628928     -100   1219880       24
>>> 2023-11-04 08:58:55   1628552     -376   1219908       28
>>> 2023-11-04 08:59:05   1629252      700   1220072      164
>>> 2023-11-04 08:59:15   1628696     -556   1220132       60
>>> 2023-11-04 08:59:25   1628304     -392   1220156       24
>>> 2023-11-04 08:59:35   1628264      -40   1220188       32
>>> 2023-11-04 08:59:45   1628184      -80   1220340      152
>>> 2023-11-04 08:59:55   1628144      -40   1220364       24
>>> 2023-11-04 09:00:05   1628124      -20   1219940     -424
>>> 2023-11-04 09:00:15   1627908     -216   1219976       36
>>> 2023-11-04 09:00:25   1627840      -68   1220000       24
>>> 2023-11-04 09:00:35   1624276    -3564   1220024       24
>>> 2023-11-04 09:00:45   1624100     -176   1220060       36
>>> 2023-11-04 09:00:55   1623912     -188   1220092       32
>>> 2023-11-04 09:01:05   1624076      164   1220112       20
>>> 2023-11-04 09:01:15   1623368     -708   1220160       48
>>> 2023-11-04 09:01:25   1623176     -192   1220196       36
>>> 2023-11-04 09:01:35   1621872    -1304   1220232       36
>>> 2023-11-04 09:01:45   1621732     -140   1220308       76
>>> 2023-11-04 09:01:55   1612304    -9428   1220392       84
>>> 2023-11-04 09:02:05   1612256      -48   1220420       28
>>> 2023-11-04 09:02:15   1612040     -216   1220444       24
>>> 2023-11-04 09:02:25   1611968      -72   1220476       32
>>> 2023-11-04 09:02:35   1611872      -96   1220492       16
>>> 2023-11-04 09:02:45   1609932    -1940   1220524       32
>>> 2023-11-04 09:02:55   1609828     -104   1220556       32
>>> 2023-11-04 09:03:05   1609916       88   1220572       16
>>> 2023-11-04 09:03:15   1609496     -420   1220608       36
>>> 2023-11-04 09:03:25   1609392     -104   1220632       24
>>> 2023-11-04 09:03:35   1609320      -72   1220648       16
>>> 2023-11-04 09:03:45   1609240      -80   1220672       24
>>> 2023-11-04 09:03:55   1609152      -88   1220688       16
>>> 2023-11-04 09:04:05   1609332      180   1220712       24
>>> 2023-11-04 09:04:15   1608892     -440   1220748       36
>>> 2023-11-04 09:04:25   1608848      -44   1220764       16
>>> 2023-11-04 09:04:35   1608744     -104   1220796       32
>>> 2023-11-04 09:04:45   1608436     -308   1220820       24
>>> 2023-11-04 09:04:55   1607916     -520   1220836       16
>>> 2023-11-04 09:05:05   1608624      708   1220876       40
>>> 2023-11-04 09:05:15   1606556    -2068   1220928       52
>>> 2023-11-04 09:05:25   1602692    -3864   1221016       88
>>> 2023-11-04 09:05:35   1602080     -612   1221052       36
>>> 2023-11-04 09:05:45   1602000      -80   1221080       28
>>> 2023-11-04 09:05:55   1601928      -72   1221096       16
>>> 2023-11-04 09:06:05   1602228      300   1221124       28
>>> 2023-11-04 09:06:15   1601848     -380   1221156       32
>>> 2023-11-04 09:06:25   1601656     -192   1221180       24
>>> 2023-11-04 09:06:35   1601532     -124   1221212       32
>>> 2023-11-04 09:06:45   1601476      -56   1221228       16
>>> 2023-11-04 09:06:55   1601364     -112   1221252       24
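
A table like the one above can be produced by sampling /proc/meminfo in a
loop. This is only a rough sketch of the idea, with the 10s interval and the
formatting assumed, not necessarily the exact script used:

  #!/bin/bash
  # print timestamp, Dirty (kB) and its change, Buffers (kB) and its change, every 10s
  # note: on the first iteration the change columns are relative to 0
  prev_d=0; prev_b=0
  while :; do
      d=$(awk '/^Dirty:/   {print $2}' /proc/meminfo)
      b=$(awk '/^Buffers:/ {print $2}' /proc/meminfo)
      printf '%s %9s %8s %9s %8s\n' "$(date '+%F %T')" \
          "$d" "$((d - prev_d))" "$b" "$((b - prev_b))"
      prev_d=$d; prev_b=$b
      sleep 10
  done
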
>>
>> Another view of interest is w_await, which I understand is the total time to service a write request
>> (queueing included). On md127 it is high, at 1-2s from this point onward, while the component disks are
>> significantly quicker, in the low teens of milliseconds.
>>
>>            Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
>> 09:00:00 md127            0.02      0.07     0.00   0.00   18.00     4.00    2.33     13.27     0.00   0.00 1268.36     5.69    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.96   1.12
>> 09:00:00 sdb              0.10      3.80     0.85  89.47    2.50    38.00    0.85      1.57     0.00   0.00    3.00     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.83    2.82    0.01   0.35
>> 09:00:00 sdc              0.13      3.93     0.85  86.44    6.00    29.50    1.00      4.17     0.50  33.33    4.38     4.17    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.72    0.01   0.53
>> 09:00:00 sdd              0.13      3.93     0.85  86.44    7.12    29.50    1.05      3.70     0.33  24.10   13.78     3.52    0.00      0.00     0.00   0.00    0.00     0.00    0.83    2.90    0.02   1.54
>> 09:00:00 sde              0.12      3.87     0.85  87.93    2.29    33.14    0.83      1.50     0.00   0.00   14.02     1.80    0.00      0.00     0.00   0.00    0.00     0.00    0.83    0.42    0.01   1.27
>> 09:00:00 sdf              0.03      0.13     0.00   0.00   12.50     4.00    1.37     14.63     2.75  66.80   13.83    10.71    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.88    0.02   1.84
>> 09:00:00 sdg              0.02      0.07     0.00   0.00   21.00     4.00    1.40     14.77     2.75  66.27   13.45    10.55    0.00      0.00     0.00   0.00    0.00     0.00    0.83    4.96    0.02   1.77
>> 09:01:00 md127            0.05      0.20     0.00   0.00   26.00     4.00    7.85     83.00     0.00   0.00 2388.76    10.57    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   18.75   1.74
>> 09:01:00 sdb              0.23     15.73     3.70  94.07   20.36    67.43    0.98     29.12     6.68  87.17    5.75    29.62    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.75    0.02   0.73
>> 09:01:00 sdc              0.20     10.33     2.38  92.26   12.08    51.67    0.77      9.93     2.10  73.26    3.20    12.95    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.33    0.01   0.36
>> 09:01:00 sdd              0.23     18.60     4.42  94.98   23.07    79.71    1.03     22.38     4.93  82.68   11.61    21.66    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.62    0.02   1.31
>> 09:01:00 sde              0.22     18.20     4.33  95.24   19.62    84.00    0.98     26.18     5.93  85.78   11.61    26.63    0.00      0.00     0.00   0.00    0.00     0.00    0.67    3.40    0.02   1.35
>> 09:01:00 sdf              0.18     18.67     4.48  96.07   42.64   101.82    1.47     49.72    11.33  88.54   12.03    33.90    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.67    0.03   1.62
>> 09:01:00 sdg              0.18     19.27     4.63  96.19   39.00   105.09    1.48     46.98    10.63  87.76   12.37    31.67    0.00      0.00     0.00   0.00    0.00     0.00    0.67    7.35    0.03   1.68
>> 09:02:00 md127            0.32      1.27     0.00   0.00   26.95     4.00   14.12    214.93     0.00   0.00  639.34    15.23    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    9.03   2.80
>> 09:02:00 sdb              0.48     51.47    12.38  96.24   18.03   106.48    1.20     63.79    15.13  92.65    3.72    53.16    0.00      0.00     0.00   0.00    0.00     0.00    0.78    3.79    0.02   0.79
>> 09:02:00 sdc              0.53     58.07    13.98  96.33   16.16   108.88    1.30     71.99    17.08  92.93    4.95    55.38    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.15    0.02   1.08
>> 09:02:00 sdd              0.52     69.13    16.77  97.01   14.13   133.81    1.52     92.27    21.95  93.54   10.38    60.84    0.00      0.00     0.00   0.00    0.00     0.00    0.78    3.60    0.03   1.78
>> 09:02:00 sde              0.50     64.60    15.65  96.90   16.83   129.20    1.87     96.40    22.63  92.38   11.50    51.64    0.00      0.00     0.00   0.00    0.00     0.00    0.78    6.96    0.04   2.20
>> 09:02:00 sdf              0.57     74.80    18.13  96.97   15.65   132.00    1.77     77.60    18.03  91.08   11.51    43.92    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.23    0.03   2.25
>> 09:02:00 sdg              0.68     64.13    15.35  95.74    9.49    93.85    1.72     68.73    15.87  90.24   11.74    40.04    0.00      0.00     0.00   0.00    0.00     0.00    0.78    7.43    0.03   2.26
>> 09:03:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    6.97     51.00     0.00   0.00 7200.34     7.32    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   50.16   1.38
>> 09:03:00 sdb              0.15      6.53     1.48  90.82   11.78    43.56    0.78      9.87     2.08  72.67    4.40    12.60    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.09    0.01   0.36
>> 09:03:00 sdc              0.15      6.53     1.48  90.82    9.44    43.56    0.80      9.93     2.08  72.25    4.90    12.42    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.27    0.01   0.36
>> 09:03:00 sdd              0.23     12.87     2.98  92.75    3.57    55.14    1.02     22.80     5.08  83.33   12.15    22.43    0.00      0.00     0.00   0.00    0.00     0.00    0.73    2.18    0.01   1.17
>> 09:03:00 sde              0.45     17.73     3.98  89.85    1.52    39.41    1.37     29.27     6.35  82.29   13.65    21.41    0.00      0.00     0.00   0.00    0.00     0.00    0.73    5.34    0.02   1.60
>> 09:03:00 sdf              0.37     11.40     2.48  87.13    5.59    31.09    1.13     16.33     3.35  74.72   13.82    14.41    0.00      0.00     0.00   0.00    0.00     0.00    0.73    3.98    0.02   1.60
>> 09:03:00 sdg              0.47     21.67     4.95  91.38    3.11    46.43    1.17     16.60     3.38  74.36   14.50    14.23    0.00      0.00     0.00   0.00    0.00     0.00    0.73    5.23    0.02   1.58
>> 09:04:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    1.88     11.93     0.00   0.00 1439.04     6.34    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.71   0.92
>> 09:04:00 sdb              0.38      8.73     1.80  82.44    2.00    22.78    0.90      4.70     0.72  44.33    3.94     5.22    0.00      0.00     0.00   0.00    0.00     0.00    0.82    4.02    0.01   0.48
>> 09:04:00 sdc              0.38      8.73     1.80  82.44    1.74    22.78    0.82      1.50     0.00   0.00    2.69     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.82    2.47    0.00   0.34
>> 09:04:00 sdd              0.30      3.47     0.57  65.38    0.39    11.56    0.98      9.97     1.95  66.48   13.14    10.14    0.00      0.00     0.00   0.00    0.00     0.00    0.82    2.14    0.01   1.40
>> 09:04:00 sde              0.00      0.00     0.00   0.00    0.00     0.00    1.28     13.43     2.52  66.23   13.51    10.47    0.00      0.00     0.00   0.00    0.00     0.00    0.82    4.53    0.02   1.73
>> 09:04:00 sdf              0.08      5.27     1.23  93.67    3.40    63.20    1.12      4.97     0.57  33.66   14.93     4.45    0.00      0.00     0.00   0.00    0.00     0.00    0.82    3.55    0.02   1.74
>> 09:04:00 sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.82      1.50     0.00   0.00   12.96     1.84    0.00      0.00     0.00   0.00    0.00     0.00    0.82    0.37    0.01   1.12
>> 09:05:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    2.27     19.27     0.00   0.00 1670.65     8.50    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.79   1.40
>> 09:05:00 sdb              0.32      4.27     0.75  70.31    0.42    13.47    0.85      8.00     1.55  64.58    3.61     9.41    0.00      0.00     0.00   0.00    0.00     0.00    0.70    3.98    0.01   0.40
>> 09:05:00 sdc              0.42     13.33     2.92  87.50    4.12    32.00    0.87     14.27     3.10  78.15    3.88    16.46    0.00      0.00     0.00   0.00    0.00     0.00    0.70    4.33    0.01   0.52
>> 09:05:00 sdd              0.35      5.67     1.07  75.29    0.76    16.19    0.87     11.73     2.47  74.00   12.62    13.54    0.00      0.00     0.00   0.00    0.00     0.00    0.70    2.86    0.01   1.19
>> 09:05:00 sde              0.00      0.00     0.00   0.00    0.00     0.00    1.23     17.47     3.53  74.13   12.08    14.16    0.00      0.00     0.00   0.00    0.00     0.00    0.70    6.19    0.02   1.51
>> 09:05:00 sdf              0.08      5.53     1.30  93.98   11.40    66.40    1.15     13.47     2.62  69.47   13.26    11.71    0.00      0.00     0.00   0.00    0.00     0.00    0.70    4.21    0.02   1.55
>> 09:05:00 sdg              0.08      5.53     1.30  93.98   12.00    66.40    0.80      9.07     1.87  70.00   12.04    11.33    0.00      0.00     0.00   0.00    0.00     0.00    0.70    1.24    0.01   1.03
>> 09:06:00 md127            0.20      0.80     0.00   0.00   23.17     4.00    3.03    113.80     0.00   0.00 1063.22    37.52    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.23   2.60
>> 09:06:00 sdb              0.30     32.80     7.90  96.34   10.56   109.33    1.22     36.85     8.37  87.30    4.70    30.29    0.00      0.00     0.00   0.00    0.00     0.00    0.72    6.56    0.01   0.76
>> 09:06:00 sdc              0.30     37.40     9.05  96.79    8.83   124.67    1.13     42.32     9.82  89.65    4.34    37.34    0.00      0.00     0.00   0.00    0.00     0.00    0.72    4.42    0.01   0.63
>> 09:06:00 sdd              0.58     62.60    15.07  96.27   16.83   107.31    1.43     57.72    13.37  90.32   11.91    40.27    0.00      0.00     0.00   0.00    0.00     0.00    0.72    6.74    0.03   1.79
>> 09:06:00 sde              0.47     64.73    15.72  97.12   10.14   138.71    1.48     45.18    10.18  87.29   11.97    30.46    0.00      0.00     0.00   0.00    0.00     0.00    0.72    7.09    0.03   1.78
>> 09:06:00 sdf              0.35     42.53    10.28  96.71   15.48   121.52    1.48     46.72    10.57  87.69   14.58    31.49    0.00      0.00     0.00   0.00    0.00     0.00    0.72    7.40    0.03   1.97
>> 09:06:00 sdg              0.27     31.93     7.72  96.66   16.38   119.75    1.18     52.58    12.33  91.25   11.63    44.44    0.00      0.00     0.00   0.00    0.00     0.00    0.72    3.70    0.02   1.40
>> 09:07:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    2.10     13.93     0.00   0.00 1405.96     6.63    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.95   1.00
>> 09:07:00 sdb              0.00      0.00     0.00   0.00    0.00     0.00    0.97      3.36     0.32  24.68    3.19     3.47    0.00      0.00     0.00   0.00    0.00     0.00    0.80    3.54    0.01   0.39
>> 09:07:00 sdc              0.15      7.47     1.72  91.96    0.44    49.78    1.17     13.43     2.63  69.30    4.29    11.51    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.31    0.01   0.48
>> 09:07:00 sdd              0.15      7.47     1.72  91.96    0.44    49.78    1.00     11.43     2.30  69.70   11.55    11.43    0.00      0.00     0.00   0.00    0.00     0.00    0.80    1.00    0.01   1.23
>> 09:07:00 sde              0.15      7.47     1.72  91.96    9.44    49.78    1.28     12.83     2.37  64.84   14.22     9.99    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.27    0.02   1.69
>> 09:07:00 sdf              0.05      2.53     0.58  92.11    4.67    50.67    1.18      7.89     1.23  51.03   13.87     6.67    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.29    0.02   1.63
>> 09:07:00 sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.80      1.43     0.00   0.00   13.48     1.78    0.00      0.00     0.00   0.00    0.00     0.00    0.80    1.08    0.01   1.14
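
With a per-minute iostat log in the layout above (timestamp first, so w_await
is column 13), the md127 w_await trend is easy to pull out. A minimal sketch,
assuming the samples were captured into a file called iostat.log:

  $ awk '$2 == "md127" { print $1, $13 }' iostat.log
  09:00:00 1268.36
  09:01:00 2388.76
  ...
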


It finally completed with a bang (see the 20:47:00 burst in the iostat below), and the flush kworker is gone.

                         Dirty   change   Buffers   change
2023-11-04 20:42:47    117128       -8   1407904       12
2023-11-04 20:42:57    117068      -60   1407944       40
2023-11-04 20:43:07    117096       28   1407972       28
2023-11-04 20:43:17    116972     -124   1408000       28
2023-11-04 20:43:27    116864     -108   1408032       32
2023-11-04 20:43:37    116664     -200   1408056       24
2023-11-04 20:43:47    116664        0   1408092       36
2023-11-04 20:43:57    116468     -196   1408132       40
2023-11-04 20:44:07    116480       12   1408168       36
2023-11-04 20:44:17    116364     -116   1408200       32
2023-11-04 20:44:27    116284      -80   1408228       28
2023-11-04 20:44:37    116284        0   1408268       40
2023-11-04 20:44:47    115956     -328   1408300       32
2023-11-04 20:44:57    115804     -152   1408324       24
2023-11-04 20:45:07    115680     -124   1408356       32
2023-11-04 20:45:17    116216      536   1408392       36
2023-11-04 20:45:27    115656     -560   1408432       40
2023-11-04 20:45:37    115744       88   1408460       28
2023-11-04 20:45:47    115348     -396   1408496       36
2023-11-04 20:45:57    115948      600   1408528       32
2023-11-04 20:46:07       428  -115520   1408580       52
2023-11-04 20:46:17       364      -64   1408592       12
2023-11-04 20:46:27       608      244   1408612       20
2023-11-04 20:46:37       808      200   1408636       24

So it had been running since around 9am, about 12 hours!
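
Next time, rather than keeping top open for hours, I could watch the flush
kworker from a simple loop. A rough sketch, one sample per minute, matching on
the worker name as it appears in top:

  $ while :; do date; ps -eo pid,pcpu,cputime,comm | grep 'flush-9:127'; sleep 60; done

That gives a one-line-per-minute record of the worker's PID, %CPU and
accumulated CPU time, which should make the start and end of such an episode
easy to pin down.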

The final iostat:

          Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
20:45:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    2.12     14.40     0.00   0.00 1402.68     6.80    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.97   0.69
20:45:00 sdb              0.00      0.00     0.00   0.00    0.00     0.00    0.80      1.43     0.00   0.00    2.65     1.78    0.00      0.00     0.00   0.00    0.00     0.00    0.80    2.42    0.00   0.28
20:45:00 sdc              0.00      0.00     0.00   0.00    0.00     0.00    0.80      1.43     0.00   0.00    2.35     1.78    0.00      0.00     0.00   0.00    0.00     0.00    0.80    2.10    0.00   0.26
20:45:00 sdd              0.00      0.00     0.00   0.00    0.00     0.00    1.15      5.76     0.73  38.94   14.84     5.01    0.00      0.00     0.00   0.00    0.00     0.00    0.80    4.08    0.02   1.79
20:45:00 sde              0.03      1.53     0.35  91.30    0.50    46.00    0.98     11.49     2.33  70.35   13.66    11.69    0.00      0.00     0.00   0.00    0.00     0.00    0.80    3.33    0.02   1.37
20:45:00 sdf              0.03      1.53     0.35  91.30    8.00    46.00    1.33     15.82     3.07  69.70   14.16    11.87    0.00      0.00     0.00   0.00    0.00     0.00    0.80    5.81    0.02   1.89
20:45:00 sdg              0.03      1.53     0.35  91.30    8.00    46.00    1.33     15.82     3.07  69.70   13.65    11.87    0.00      0.00     0.00   0.00    0.00     0.00    0.80    7.00    0.02   1.82
20:46:00 md127            0.02      0.07     0.00   0.00   17.00     4.00    2.35     25.20     0.00   0.00 2141.33    10.72    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    5.03   1.15
20:46:00 sdb              0.12      5.73     1.32  91.86   12.14    49.14    0.75     12.79     2.83  79.07    2.91    17.06    0.00      0.00     0.00   0.00    0.00     0.00    0.68    2.88    0.01   0.32
20:46:00 sdc              0.12      5.73     1.32  91.86   14.14    49.14    0.72      3.46     0.53  42.67    3.65     4.83    0.00      0.00     0.00   0.00    0.00     0.00    0.68    3.54    0.01   0.36
20:46:00 sdd              0.10      4.60     1.05  91.30   11.00    46.00    0.82      6.39     1.17  58.82   13.76     7.83    0.00      0.00     0.00   0.00    0.00     0.00    0.68    2.76    0.01   1.22
20:46:00 sde              0.07      5.47     1.30  95.12    6.00    82.00    1.08     15.26     3.12  74.21   11.97    14.08    0.00      0.00     0.00   0.00    0.00     0.00    0.68    3.29    0.02   1.26
20:46:00 sdf              0.07      5.47     1.30  95.12    0.50    82.00    1.18     21.86     4.67  79.77   12.14    18.47    0.00      0.00     0.00   0.00    0.00     0.00    0.68    4.27    0.02   1.34
20:46:00 sdg              0.18     10.13     2.35  92.76    5.09    55.27    1.12     19.53     4.15  78.80   12.21    17.49    0.00      0.00     0.00   0.00    0.00     0.00    0.68    4.66    0.02   1.39
20:47:00 md127            0.00      0.00     0.00   0.00    0.00     0.00  256.68   1918.53     0.00   0.00 1094.72     7.47    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00  281.00   7.49
20:47:00 sdb             14.75    382.00    80.75  84.55   89.68    25.90   16.43    557.67   123.10  88.22    7.65    33.94    0.00      0.00     0.00   0.00    0.00     0.00    0.33    5.25    1.45   4.17
20:47:00 sdc             20.15    596.60   129.00  86.49  144.70    29.61   15.37    558.88   124.47  89.01   18.56    36.37    0.00      0.00     0.00   0.00    0.00     0.00    0.33   14.50    3.21   6.11
20:47:00 sdd             20.92    613.27   132.40  86.36  330.56    29.32   15.65    544.01   120.47  88.50   37.67    34.76    0.00      0.00     0.00   0.00    0.00     0.00    0.32   41.16    7.52   8.31
20:47:00 sde             20.07    598.07   129.45  86.58  128.61    29.80   16.03    587.81   131.03  89.10   12.36    36.66    0.00      0.00     0.00   0.00    0.00     0.00    0.32    5.89    2.78   5.96
20:47:00 sdf             19.93    562.07   120.58  85.81  122.18    28.20   16.92    570.48   125.82  88.15   11.14    33.72    0.00      0.00     0.00   0.00    0.00     0.00    0.32   13.47    2.63   5.82
20:47:00 sdg             14.47    345.53    71.92  83.25   79.60    23.88   16.88    563.94   124.22  88.03    9.24    33.40    0.00      0.00     0.00   0.00    0.00     0.00    0.32    6.95    1.31   4.42
20:48:00 md127            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
20:48:00 sdb              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
20:48:00 sdc              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
20:48:00 sdd              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
20:48:00 sde              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
20:48:00 sdf              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
20:48:00 sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00


I expect to receive a new disk early next week and I hope the issue goes away when the array is not degraded.
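
When it arrives, the re-add should just be the usual partition-and-add
sequence, something like this (with /dev/sdh only a placeholder for whatever
name the new disk gets, and /dev/sdb an existing member to copy the GPT
layout from):

  $ sudo sgdisk -R /dev/sdh /dev/sdb    # replicate sdb's partition table onto the new disk
  $ sudo sgdisk -G /dev/sdh             # give the copy fresh random GUIDs
  $ sudo mdadm --manage /dev/md127 --add /dev/sdh1
  $ cat /proc/mdstat                    # watch the recovery progress
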

-- 
Eyal at Home (eyal@eyal.emu.id.au)




Thread overview: 35+ messages
2023-10-30 13:35 problem with recovered array eyal
2023-10-30 16:14 ` Roger Heflin
2023-10-31  2:35   ` eyal
2023-10-31  2:50     ` eyal
2023-10-31  3:21   ` Carlos Carvalho
2023-10-31  6:25     ` eyal
2023-10-31  9:29     ` eyal
2023-10-31 10:24       ` Roger Heflin
2023-10-31 21:40         ` eyal
2023-11-01 10:30           ` Roger Heflin
2023-11-01 13:08             ` eyal
2023-11-01 14:29               ` Roger Heflin
2023-11-02  8:34                 ` Johannes Truschnigg
2023-11-02 11:27                   ` eyal
2023-11-02 11:57                     ` Roger Heflin
2023-11-02 13:05                       ` eyal
2023-11-02 17:05                         ` Roger Heflin
2023-11-02 23:23                           ` eyal
2023-11-03 12:08                             ` Roger Heflin
2023-11-02 12:29                     ` eyal
2023-11-02 12:46                       ` Reindl Harald
2023-11-03 14:16                   ` Carlos Carvalho
2023-11-03 14:32                     ` Dirty page flushing regression in 6.5.x vs 6.1.x Roman Mamedov
2023-11-03 15:57                     ` problem with recovered array Roger Heflin
2023-11-03 22:38                       ` eyal
2023-11-04  0:48                         ` eyal
2023-11-04  1:01                           ` Roger Heflin
2023-11-04 10:04                             ` eyal
2023-10-31 12:39       ` Carlos Carvalho
2023-10-31 14:19         ` Roger Heflin
2023-10-31 19:20           ` Carlos Carvalho
2023-10-31 21:44 ` problem with recovered array [more details] eyal
2023-10-31 22:00   ` eyal
2023-11-01  4:31   ` [now urgent] problem with recovered array eyal
2023-11-01  6:44     ` eyal
