* mdxxx_raid6 kernel thread frozen
@ 2021-02-15 21:31 Michael D. O'Brien
From: Michael D. O'Brien @ 2021-02-15 21:31 UTC (permalink / raw)
To: linux-raid
Hi, I have one mdadm raid6 array in a 56-drive raid60 (7x8) with a
kernel thread stuck at 100% cpu. The hang typically occurs
during array checks, but the stuck thread is not the resync thread - md122_raid6 is at
100% cpu, whereas md122_resync is at ~0%. When this happens, the
reported sync speed drops until it reaches 4K/sec. Setting sync_action
to idle hangs as well.
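The sync state described above can also be read straight from sysfs. This is a minimal sketch, assuming the standard md attributes `sync_action`, `sync_speed` and `sync_completed` under `/sys/block/md122/md`; the helper name `read_sync_state` and its `root` parameter are made up for illustration.

```python
from pathlib import Path


def read_sync_state(md_name, root="/sys/block"):
    """Return the md sync attributes for one array as a dict of strings.

    Attributes that are missing or unreadable are reported as None
    instead of raising.
    """
    md_dir = Path(root) / md_name / "md"
    state = {}
    for attr in ("sync_action", "sync_speed", "sync_completed"):
        try:
            state[attr] = (md_dir / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state


if __name__ == "__main__":
    # "md122" is the array from this report; adjust for other systems.
    print(read_sync_state("md122"))
```

Logging this once a minute would record how quickly the speed decays before the thread wedges. Note that it only reads; it never writes to `sync_action`.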
iostat shows backing devices aren't doing anything i/o wise, SMART is
clean for all member drives, and dmesg doesn't say anything useful
(until the thread is hung for a long time, then it tells me as much -
I'll post that message when the current issue times out). A reboot
typically clears the issue, but takes quite a long time, as the raid
60 is the backing device for a bcache device (with an optane cache)
that has a large mounted xfs file system in place.
I figured I could strace the process, but I learned that's impossible
with kernel threads :)
Output of various things - please let me know what else I can run to
help track this down:
/proc/mdstat:
md118 : active raid0 md120[4] md119[5] md123[6] md125[3] md121[0] md124[1] md122[2]
      410183875584 blocks super 1.2 3072k chunks
md119 : active raid6 sdbh[1] sdbi[2] sdan[4] sdbc[0] sdar[7] sdaq[6] sdbe[8] sdao[5]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
md120 : active raid6 sdbd[7] sdat[1] sdaz[4] sday[3] sdau[2] sdba[5] sdbb[6] sdas[0]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
md121 : active raid6 sdaj[5] sdag[2] sdal[7] sdai[4] sdae[0] sdak[6] sdaf[1] sdah[3]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
md122 : active raid6 sdu[7] sdq[3] sdr[4] sdp[2] sdn[0] sdt[6] sds[5] sdo[1]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
      [================>....] check = 81.5% (7963280396/9766304768) finish=147106.8min speed=204K/sec
md123 : active raid6 sdax[7] sdaw[6] sdav[5] sdap[4] sdy[3] sdc[0] sdd[1] sdh[2]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
md124 : active raid6 sdab[5] sdaa[4] sdad[7] sdz[3] sdv[0] sdx[2] sdac[6] sdw[1]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
md125 : active raid6 sde[0] sdam[7] sdg[2] sdbg[8] sdf[1] sdi[3] sdk[5] sdj[4]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
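The md122 progress line in the mdstat output above can be parsed to track the check over time. This is a rough sketch, not a general mdstat parser: the regular expression is fitted to the one line shown in this report, and `parse_progress` is a name chosen for illustration.

```python
import re

# Progress line as it appears for md122 in the mdstat output above.
LINE = ("[================>....] check = 81.5% (7963280396/9766304768) "
        "finish=147106.8min speed=204K/sec")

PROGRESS_RE = re.compile(
    r"(?P<action>\w+)\s*=\s*[\d.]+%\s*\((?P<done>\d+)/(?P<total>\d+)\)"
    r"\s*finish=(?P<finish>[\d.]+)min\s*speed=(?P<speed>\d+)K/sec"
)


def parse_progress(line):
    """Extract action, completed/total blocks, ETA and speed from one line."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    done, total = int(m["done"]), int(m["total"])
    return {
        "action": m["action"],
        "done": done,
        "total": total,
        "percent": 100.0 * done / total,
        "finish_min": float(m["finish"]),
        "speed_k_per_sec": int(m["speed"]),
    }


if __name__ == "__main__":
    print(parse_progress(LINE))
```

With the figures above, the remaining 1803024372 blocks at 204K/sec work out to about 147,300 minutes, which is in line with the reported finish=147106.8min.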
/proc/{PID of md122_raid6}/stack alternates between nothing and:
[<0>] ops_run_io+0x3e/0xdb0 [raid456]
[<0>] handle_stripe+0x144/0x1260 [raid456]
[<0>] handle_active_stripes.isra.0+0x3c5/0x5a0 [raid456]
[<0>] raid5d+0x35c/0x550 [raid456]
[<0>] md_thread+0x97/0x160
[<0>] kthread+0x114/0x150
[<0>] ret_from_fork+0x22/0x30
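Since the stack file alternates between empty and the trace above, tallying the innermost frame over many reads gives a rough picture of where the thread spends its time. This is a sketch only: `top_frame` and `sample_stack` are illustrative names, the sample count and interval are arbitrary, and reading `/proc/<pid>/stack` normally requires root.

```python
import re
import time
from collections import Counter

# Matches lines such as "[<0>] ops_run_io+0x3e/0xdb0 [raid456]".
FRAME_RE = re.compile(r"^\[<[0-9a-fx]+>\]\s+(\S+?)\+0x", re.MULTILINE)


def top_frame(stack_text):
    """Return the innermost function name, or None for an empty stack."""
    m = FRAME_RE.search(stack_text)
    return m.group(1) if m else None


def sample_stack(pid, samples=100, interval=0.05):
    """Tally the innermost frame of /proc/<pid>/stack over several reads."""
    counts = Counter()
    for _ in range(samples):
        with open(f"/proc/{pid}/stack") as f:
            counts[top_frame(f.read())] += 1
        time.sleep(interval)
    return counts
```

For the thread in this report, `sample_stack(2167)` would return a counter keyed by function name, with `None` counting the reads that came back empty.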
/proc/{PID of md122_raid6}/status:
Name: md122_raid6
Umask: 0000
State: R (running)
Tgid: 2167
Ngid: 0
Pid: 2167
PPid: 2
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups:
NStgid: 2167
NSpid: 2167
NSpgid: 0
NSsid: 0
Threads: 1
SigQ: 0/1031010
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: fffffffffffffeff
SigCgt: 0000000000000100
CapInh: 0000000000000000
CapPrm: 000000ffffffffff
CapEff: 000000ffffffffff
CapBnd: 000000ffffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffff
Cpus_allowed_list: 0-23
Mems_allowed:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 73369830
nonvoluntary_ctxt_switches: 29419786
/proc/{PID of md122_raid6}/stat:
2167 (md122_raid6) R 2 0 0 0 -1 2129984 0 0 0 0 0 5079064 0 0 20 0 1 0 1724 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483391 256 0 0 0 17 21 0 0 390998 0 0 0 0 0 0 0 0 0 0
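The stat line above is easier to read once the fields are named. This is a minimal sketch that follows the field numbering in proc(5) and picks out only a few fields; `parse_stat` is an illustrative helper, and `PF_KTHREAD` is the kernel-thread flag value from the kernel's `include/linux/sched.h`.

```python
import os

PF_KTHREAD = 0x00200000  # "I am a kernel thread" bit in the flags field

STAT = ("2167 (md122_raid6) R 2 0 0 0 -1 2129984 0 0 0 0 0 5079064 0 0 20 0 1 0 "
        "1724 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483391 256 0 0 0 17 21 "
        "0 0 390998 0 0 0 0 0 0 0 0 0 0")


def parse_stat(line):
    """Name a handful of fields from one /proc/<pid>/stat line."""
    # comm may contain spaces, so split around the last ')' first.
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state) in proc(5) numbering
    return {
        "pid": int(pid),
        "comm": comm,
        "state": rest[0],
        "flags": int(rest[6]),       # field 9
        "utime": int(rest[11]),      # field 14, clock ticks in user mode
        "stime": int(rest[12]),      # field 15, clock ticks in kernel mode
        "processor": int(rest[36]),  # field 39, CPU last executed on
    }


if __name__ == "__main__":
    info = parse_stat(STAT)
    ticks_per_sec = os.sysconf("SC_CLK_TCK")
    print(info["comm"], "kernel thread:", bool(info["flags"] & PF_KTHREAD))
    print("kernel-mode CPU seconds:", info["stime"] / ticks_per_sec)
```

All of the accumulated CPU time is in `stime` (5079064 ticks) and none in `utime`, as expected for a kernel thread. If `SC_CLK_TCK` is 100, which is common but an assumption here, that is roughly 14 hours of CPU time.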
mdadm -D {raid_60_device}:
/dev/md118:
Version : 1.2
Creation Time : Sun Apr 5 13:43:11 2020
Raid Level : raid0
Array Size : 410183875584 (391181.83 GiB 420028.29 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent
Update Time : Sun Apr 5 13:43:11 2020
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : -unknown-
Chunk Size : 3072K
Consistency Policy : none
Name : host:all_spinners
UUID : 74727e9d:8d3cd62a:48369430:dea1e4eb
Events : 0
Number   Major   Minor   RaidDevice   State
   0       9      121        0        active sync   /dev/md/host:spinners_1
   1       9      124        1        active sync   /dev/md/host:spinners_2
   2       9      122        2        active sync   /dev/md/host:spinners_3
   3       9      125        3        active sync   /dev/md/host:spinners_4
   4       9      120        4        active sync   /dev/md/host:spinners_5
   5       9      119        5        active sync   /dev/md/host:spinners_6
   6       9      123        6        active sync   /dev/md/host:spinners_7
mdadm -D {md122, frozen device}:
/dev/md122:
Version : 1.2
Creation Time : Sat Apr 4 10:12:53 2020
Raid Level : raid6
Array Size : 58597828608 (55883.24 GiB 60004.18 GB)
Used Dev Size : 9766304768 (9313.87 GiB 10000.70 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Update Time : Mon Feb 15 12:02:41 2021
State : active, checking
Active Devices : 8
Working Devices : 8
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : resync
Check Status : 81% complete
Name : host:spinners_3
UUID : 331bc2af:3207b40c:983b923f:14fe1762
Events : 5869
Number   Major   Minor   RaidDevice   State
   0       8      208        0        active sync   /dev/sdn
   1       8      224        1        active sync   /dev/sdo
   2       8      240        2        active sync   /dev/sdp
   3      65        0        3        active sync   /dev/sdq
   4      65       16        4        active sync   /dev/sdr
   5      65       32        5        active sync   /dev/sds
   6      65       48        6        active sync   /dev/sdt
   7      65       64        7        active sync   /dev/sdu
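As a sanity check on the md122 figures above, the reported Array Size matches what RAID6 should give for 8 members of the reported Used Dev Size, since two members' worth of space goes to parity. A small sketch, with `raid6_capacity_kib` as an illustrative name:

```python
def raid6_capacity_kib(used_dev_size_kib, raid_devices):
    """Usable size of a RAID6 array: two members' worth of space holds parity."""
    if raid_devices < 4:
        raise ValueError("RAID6 needs at least 4 devices")
    return (raid_devices - 2) * used_dev_size_kib


# Values copied from the mdadm -D output for /dev/md122 (1 KiB blocks).
USED_DEV_SIZE_KIB = 9766304768
ARRAY_SIZE_KIB = 58597828608

if __name__ == "__main__":
    print(raid6_capacity_kib(USED_DEV_SIZE_KIB, 8) == ARRAY_SIZE_KIB)
```

The same check is not applied to the raid0 on top, whose reported size is slightly smaller than 7 times the member size.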
* Re: mdxxx_raid6 kernel thread frozen
From: Thomas Kreitler @ 2021-02-16 14:30 UTC (permalink / raw)
To: Michael D. O'Brien, linux-raid
On 2021-02-15 22:31, Michael D. O'Brien wrote:
> Hi, I have a single mdadm raid6 in a 56-drive raid60 (7x8) with a
> kernel thread stuck at 100% cpu. The stuck thread typically happens
> during array checks, but is not the resync thread - md122_raid6 is at
> 100% cpu, whereas md122_resync is at ~0%. When this happens, the
> reported sync speed drops until it reaches 4K/sec. Setting sync_action
> to idle gets stuck.
>
> iostat shows backing devices aren't doing anything i/o wise, SMART is
> clean for all member drives, and dmesg doesn't say anything useful
> (until the thread is hung for a long time, then it tells me as much -
> I'll post that message when the current issue times out). A reboot
> typically clears the issue, but takes quite a long time, as the raid
> 60 is the backing device for a bcache device (with an optane cache)
> that has a large mounted xfs file system in place.
>
> I figured I could strace the process, but I learned that's impossible
> with kernel threads :)
>
[...]
Hello Michael,
This sounds very much like what we have experienced while
checking raid6 assemblies.
The issue is being actively worked on at the moment, cf. the "[PATCH V2] md:
don't unregister sync_thread with reconfig_mutex held" thread.
More details are in this thread:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
Kind regards,
Thomas
--
Thomas Kreitler - Information Retrieval
kreitler@molgen.mpg.de
49/30/8413 1702
* Re: mdxxx_raid6 kernel thread frozen
From: Michael D. O'Brien @ 2021-02-16 17:20 UTC (permalink / raw)
To: Thomas Kreitler; +Cc: linux-raid
On Tue, Feb 16, 2021 at 6:30 AM Thomas Kreitler <kreitler@molgen.mpg.de> wrote:
>
> This sounds very much like what we have experienced while
> checking raid6 assemblies.
>
> The issue is being actively worked on at the moment, cf. the "[PATCH V2] md:
> don't unregister sync_thread with reconfig_mutex held" thread.
>
> More details are in this thread:
> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
Thank you for the pointer Thomas - I was reading through that thread
last night (I suppose I should have realized it was similar prior to
my e-mail :)), and the progress is quite encouraging.