From: "Michael D. O'Brien" <obrienmd@gmail.com>
To: linux-raid@vger.kernel.org
Subject: mdxxx_raid6 kernel thread frozen
Date: Mon, 15 Feb 2021 13:31:01 -0800 [thread overview]
Message-ID: <CACs3Z9oqWPRt4uT1pYKMHzH+7JHNtsk_stE_-OmQZSQsy4n46g@mail.gmail.com> (raw)
Hi, I have a single mdadm raid6 in a 56-drive raid60 (7x8) with a
kernel thread stuck at 100% cpu. The stuck thread typically happens
during array checks, but is not the resync thread - md122_raid6 is at
100% cpu, whereas md122_resync is at ~0%. When this happens, the
reported sync speed drops until it reaches 4K/sec. Setting sync_action
to idle gets stuck.
iostat shows backing devices aren't doing anything i/o wise, SMART is
clean for all member drives, and dmesg doesn't say anything useful
(until the thread is hung for a long time, then it tells me as much -
I'll post that message when the current issue times out). A reboot
typically clears the issue, but takes quite a long time, as the raid
60 is the backing device for a bcache device (with an optane cache)
that has a large mounted xfs file system in place.
I figured I could strace the process, but I learned that's impossible
with kernel threads :)
Output of various things - please let me know what else I can run to
help track this down:
/prod/mdstat:
md118 : active raid0 md120[4] md119[5] md123[6] md125[3] md121[0]
md124[1] md122[2]
410183875584 blocks super 1.2 3072k chunks
md119 : active raid6 sdbh[1] sdbi[2] sdan[4] sdbc[0] sdar[7] sdaq[6]
sdbe[8] sdao[5]
58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2
[8/8] [UUUUUUUU]
md120 : active raid6 sdbd[7] sdat[1] sdaz[4] sday[3] sdau[2] sdba[5]
sdbb[6] sdas[0]
58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2
[8/8] [UUUUUUUU]
md121 : active raid6 sdaj[5] sdag[2] sdal[7] sdai[4] sdae[0] sdak[6]
sdaf[1] sdah[3]
58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2
[8/8] [UUUUUUUU]
md122 : active raid6 sdu[7] sdq[3] sdr[4] sdp[2] sdn[0] sdt[6] sds[5] sdo[1]
58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2
[8/8] [UUUUUUUU]
[================>....] check = 81.5% (7963280396/9766304768)
finish=147106.8min speed=204K/sec
md123 : active raid6 sdax[7] sdaw[6] sdav[5] sdap[4] sdy[3] sdc[0] sdd[1] sdh[2]
58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2
[8/8] [UUUUUUUU]
md124 : active raid6 sdab[5] sdaa[4] sdad[7] sdz[3] sdv[0] sdx[2] sdac[6] sdw[1]
58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2
[8/8] [UUUUUUUU]
md125 : active raid6 sde[0] sdam[7] sdg[2] sdbg[8] sdf[1] sdi[3] sdk[5] sdj[4]
58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2
[8/8] [UUUUUUUU]
/proc/{PID of md122_raid6}/stack alternates between nothing and:
[<0>] ops_run_io+0x3e/0xdb0 [raid456]
[<0>] handle_stripe+0x144/0x1260 [raid456]
[<0>] handle_active_stripes.isra.0+0x3c5/0x5a0 [raid456]
[<0>] raid5d+0x35c/0x550 [raid456]
[<0>] md_thread+0x97/0x160
[<0>] kthread+0x114/0x150
[<0>] ret_from_fork+0x22/0x30
/proc/{PID of md122_raid6}/status:
Name: md122_raid6
Umask: 0000
State: R (running)
Tgid: 2167
Ngid: 0
Pid: 2167
PPid: 2
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups:
NStgid: 2167
NSpid: 2167
NSpgid: 0
NSsid: 0
Threads: 1
SigQ: 0/1031010
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: fffffffffffffeff
SigCgt: 0000000000000100
CapInh: 0000000000000000
CapPrm: 000000ffffffffff
CapEff: 000000ffffffffff
CapBnd: 000000ffffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffff
Cpus_allowed_list: 0-23
Mems_allowed:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 73369830
nonvoluntary_ctxt_switches: 29419786
/proc/{PID of md122_raid6}/stat:
2167 (md122_raid6) R 2 0 0 0 -1 2129984 0 0 0 0 0 5079064 0 0 20 0 1 0
1724 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483391 256 0 0 0 17 21
0 0 390998 0 0 0 0 0 0 0 0 0 0
mdadm -D {raid_60_device}:
/dev/md118:
Version : 1.2
Creation Time : Sun Apr 5 13:43:11 2020
Raid Level : raid0
Array Size : 410183875584 (391181.83 GiB 420028.29 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent
Update Time : Sun Apr 5 13:43:11 2020
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : -unknown-
Chunk Size : 3072K
Consistency Policy : none
Name : host:all_spinners
UUID : 74727e9d:8d3cd62a:48369430:dea1e4eb
Events : 0
Number Major Minor RaidDevice State
0 9 121 0 active sync /dev/md/host:spinners_1
1 9 124 1 active sync /dev/md/host:spinners_2
2 9 122 2 active sync /dev/md/host:spinners_3
3 9 125 3 active sync /dev/md/host:spinners_4
4 9 120 4 active sync /dev/md/host:spinners_5
5 9 119 5 active sync /dev/md/host:spinners_6
6 9 123 6 active sync /dev/md/host:spinners_7
mdadm -D {md122, frozen device}:
/dev/md122:
Version : 1.2
Creation Time : Sat Apr 4 10:12:53 2020
Raid Level : raid6
Array Size : 58597828608 (55883.24 GiB 60004.18 GB)
Used Dev Size : 9766304768 (9313.87 GiB 10000.70 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Update Time : Mon Feb 15 12:02:41 2021
State : active, checking
Active Devices : 8
Working Devices : 8
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : resync
Check Status : 81% complete
Name : host:spinners_3
UUID : 331bc2af:3207b40c:983b923f:14fe1762
Events : 5869
Number Major Minor RaidDevice State
0 8 208 0 active sync /dev/sdn
1 8 224 1 active sync /dev/sdo
2 8 240 2 active sync /dev/sdp
3 65 0 3 active sync /dev/sdq
4 65 16 4 active sync /dev/sdr
5 65 32 5 active sync /dev/sds
6 65 48 6 active sync /dev/sdt
7 65 64 7 active sync /dev/sdu
next reply other threads:[~2021-02-15 21:32 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-02-15 21:31 Michael D. O'Brien [this message]
2021-02-16 14:30 ` mdxxx_raid6 kernel thread frozen Thomas Kreitler
2021-02-16 17:20 ` Michael D. O'Brien
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CACs3Z9oqWPRt4uT1pYKMHzH+7JHNtsk_stE_-OmQZSQsy4n46g@mail.gmail.com \
--to=obrienmd@gmail.com \
--cc=linux-raid@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.