From: Joe Rayhawk <jrayhawk@fairlystable.org>
To: linux-raid@vger.kernel.org
Subject: raid6: badblocks-related bio panics
Date: Mon, 17 Jan 2022 11:24:32 -0800
Message-ID: <164244747275.86917.2623783912687807916@richardiv.omgwallhack.org>

A raid6 array of mine produces one of the following results upon reading
a particular stripe:
commit c82aa1b76787c34fd02374e519b6f52cdeb2f54b^: I/O error
commit c82aa1b76787c34fd02374e519b6f52cdeb2f54b: panic
commit 1a7e76e4f130332b5d3b0c72c4f664e59deb1239^: panic
commit 1a7e76e4f130332b5d3b0c72c4f664e59deb1239: I/O error, then panic
  on second read

(The sessions below run the kernels under test as User Mode Linux, with
each member of md0 attached read-only as a ubd device.)

jrayhawk@yuzz:~/src/linux$ sudo ./linux break=top con=tty:$(tty) mem=1G $(n=1; for i in $( ls /sys/class/block/md0/slaves/ ); do echo -n ubd$((n++))r=/dev/${i}" "; done)
[...]
/ # mdadm --assemble --scan
mdadm: failed to get exclusive lock on mapfile
mdadm: failed to get exclusive lock on mapfile - continue anyway...
md: md0 stopped.
md/raid:md0: device ubdb operational as raid disk 0
md/raid:md0: device ubdm operational as raid disk 12
md/raid:md0: device ubdl operational as raid disk 11
md/raid:md0: device ubdn operational as raid disk 10
md/raid:md0: device ubdk operational as raid disk 9
md/raid:md0: device ubdj operational as raid disk 8
md/raid:md0: device ubdi operational as raid disk 7
md/raid:md0: device ubdg operational as raid disk 6
md/raid:md0: device ubdh operational as raid disk 5
md/raid:md0: device ubdf operational as raid disk 4
md/raid:md0: device ubde operational as raid disk 3
md/raid:md0: device ubdd operational as raid disk 2
md/raid:md0: device ubdc operational as raid disk 1
md/raid:md0: raid level 6 active with 13 out of 13 devices, algorithm 2
md0: detected capacity change from 0 to 32223655552
mdadm: /dev/md0 has been started with 13 drives.
/ # dd bs=$((1024*64*11)) skip=19477214 count=1 if=/dev/md0 of=/dev/null

Pid: 1138, comm: md0_raid6 Not tainted 5.13.0-rc3-00041-gc82aa1b76787
RIP: 0033:[<00000000602d4922>]
RSP: 0000000063a5fc10  EFLAGS: 00010206
RAX: 0000000100000000 RBX: 000000006274b000 RCX: 00000000949d6f00
RDX: 00000000602a587a RSI: 000000066297ca18 RDI: 00000000619d1550
RBP: 0000000063a5fc40 R08: 0000078200080000 R09: 00000000ffffffff
R10: 0000000061136040 R11: 0000000000000001 R12: 00000000619d1550
R13: 00000000602d491a R14: 000000006283b298 R15: 000000006283a570
Kernel panic - not syncing: Segfault with no mm
CPU: 0 PID: 1138 Comm: md0_raid6 Not tainted 5.13.0-rc3-00041-gc82aa1b76787 #71
Stack:
 63a5fc40 602a5920 602a587a 6274b000
 6283b070 619d1550 63a5fda0 6041dfea
 00000000 00000001 63a5fcb0 60061b0c
Call Trace:
 [<602a5920>] ? bio_endio+0xa6/0x152
 [<602a587a>] ? bio_endio+0x0/0x152
 [<6041dfea>] handle_stripe+0xbcf/0x3096
 [<60061b0c>] ? try_to_wake_up+0x19b/0x1ad
 [<6041d41b>] ? handle_stripe+0x0/0x3096
 [<60420831>] handle_active_stripes.constprop.0+0x380/0x458
 [<604175c1>] ? list_del_init+0x0/0x16
 [<60420d6a>] raid5d+0x2f6/0x4aa
 [<6048eb70>] ? __schedule+0x0/0x43f
 [<6005b2df>] ? kthread_should_stop+0x0/0x2c
 [<60038a12>] ? set_signals+0x37/0x3f
 [<60038a12>] ? set_signals+0x37/0x3f
 [<6005b2df>] ? kthread_should_stop+0x0/0x2c
 [<6044b0e2>] md_thread+0x174/0x18a
 [<6006b426>] ? autoremove_wake_function+0x0/0x39
 [<60043b17>] ? do_exit+0x0/0x93a
 [<6044af6e>] ? md_thread+0x0/0x18a
 [<6005b1fe>] kthread+0x168/0x170
 [<600272dd>] new_thread_handler+0x81/0xb2
Aborted

jrayhawk@yuzz:~/src/linux$ sudo ./linux break=top con=tty:$(tty) mem=1G $(n=1; for i in $( ls /sys/class/block/md0/slaves/ ); do echo -n ubd$((n++))r=/dev/${i}" "; done)
[...]
mdadm: /dev/md0 has been started with 13 drives.
/ # dd bs=$((1024*64*11)) skip=19477214 count=1 if=/dev/md0 of=/dev/null
Buffer I/O error on dev md0, logical block 3427989666, async page read
0+1 records in
0+1 records out
/ # dd bs=$((1024*64*11)) skip=19477214 count=1 if=/dev/md0 of=/dev/null

Pid: 959, comm: dd Not tainted 5.15.0-rc6-00077-g1a7e76e4f130
RIP: 0033:[<00000000602ae55c>]
RSP: 00000000a2b8f600  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000626acc00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000010000 RDI: 00000000626acc00
RBP: 00000000a2b8f610 R08: 00000000ffffff00 R09: 00000000626acc00
R10: 00007fffe39db080 R11: 00007fffe39db090 R12: 0000000000010000
R13: 0000000000010000 R14: 00000000626acc00 R15: 0000000062772af0
Kernel panic - not syncing: Kernel mode fault at addr 0x8, ip 0x602ae55c
CPU: 0 PID: 959 Comm: dd Not tainted 5.15.0-rc6-00077-g1a7e76e4f130 #69
Stack:
 626acc00 626acc00 a2b8f640 602aeb43
 00000020 00000000 621e7000 6255e800
 a2b8f720 6043d5da a2b8f6a0 60798840
Call Trace:
 [<602aeb43>] bio_split+0x11b/0x134
 [<6043d5da>] raid5_make_request+0x17e/0xae9
 [<6003a2ec>] ? os_nsecs+0x1d/0x2b
 [<6006d523>] ? autoremove_wake_function+0x0/0x39
 [<602b62ef>] ? bio_advance_iter_single+0x1a/0x4b
 [<602b6ca6>] ? __blk_queue_split+0x2a6/0x33d
 [<6046aa2e>] ? is_suspended+0x0/0x3e
 [<6046be7f>] md_handle_request+0xcc/0x130
 [<602b0b57>] ? __submit_bio+0x0/0x191
 [<602b0b57>] ? __submit_bio+0x0/0x191
 [<6046c021>] md_submit_bio+0xa3/0xad
 [<602b0cac>] __submit_bio+0x155/0x191
 [<602aca00>] ? bio_next_segment+0x6/0x82
 [<602b03a0>] ? bio_list_pop+0x0/0x23
 [<602b037e>] ? bio_list_merge+0x0/0x22
 [<602b0b57>] ? __submit_bio+0x0/0x191
 [<602b16ee>] submit_bio_noacct+0x174/0x236
 [<60039be6>] ? set_signals+0x0/0x3f
 [<600e123e>] ? readahead_page+0x0/0x98
 [<602b1899>] submit_bio+0xe9/0xf2
 [<604b0567>] ? xa_load+0x0/0x5e
 [<60173cf9>] mpage_bio_submit+0x3b/0x42
 [<60173cbe>] ? mpage_bio_submit+0x0/0x42
 [<60174eae>] mpage_readahead+0x144/0x152
 [<602ac045>] ? blkdev_get_block+0x0/0x32
 [<602abb83>] blkdev_readahead+0x1a/0x1c
 [<600e143a>] read_pages+0x57/0x18b
 [<600e1ec6>] ? get_page+0x10/0x15
 [<600e1787>] page_cache_ra_unbounded+0xef/0x1df
 [<604aff9c>] ? __xas_prev+0x3f/0xe5
 [<600e18b3>] do_page_cache_ra+0x3c/0x3f
 [<600e1aab>] ondemand_readahead+0x1f5/0x204
 [<600e1bd5>] page_cache_sync_ra+0x77/0x7e
 [<600d7f7f>] ? filemap_get_read_batch+0x0/0x112
 [<600daaa4>] filemap_read+0x1ab/0x738
 [<60146d6c>] ? terminate_walk+0x59/0x83
 [<60148abf>] ? path_openat+0x843/0xbb0
 [<600db13f>] generic_file_read_iter+0x10e/0x11d
 [<602abe91>] blkdev_read_iter+0x4c/0x5c
 [<6013922f>] new_sync_read+0x73/0xda
 [<6014558b>] ? putname+0xa9/0xae
 [<60260000>] ? newseg+0x2a8/0x2f0
 [<6013a2be>] vfs_read+0xd0/0x106
 [<60156dd0>] ? __fdget+0x15/0x17
 [<60156df9>] ? __fdget_pos+0x13/0x4a
 [<6013a673>] ksys_read+0x6c/0xa6
 [<6013a6bd>] sys_read+0x10/0x12
 [<6002b5fe>] handle_syscall+0x81/0xb1
 [<6003ba8d>] userspace+0x4af/0x53d
 [<60028446>] fork_handler+0x94/0x96
Aborted

The underlying block devices themselves are fully readable without
error, a sync_action check/repair raises no objections, and raid6check
is entirely happy.
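
For reference, that verification amounted to roughly the following (a
sketch; device names and options here are illustrative rather than the
exact commands used):

  # read every member device end to end
  for slave in /sys/class/block/md0/slaves/*; do
    dd if=/dev/"$(basename "$slave")" of=/dev/null bs=1M
  done
  # scrub via the md sync_action interface, then check the mismatch count
  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt
  # (plus an offline raid6check pass; invocation omitted here)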

--examine-badblocks output is, however, rather inconsistent:

root@yuzz:~# IFS=$'\n'; slaves=($(ls /sys/class/block/md0/slaves/)); for slave in ${slaves[*]}; do mdadm --examine-badblocks /dev/$slave; done | sort -u | sed -n 's/^ \+//p' | while read badblock; do printf '%-25s:' "$badblock"; for slave in ${slaves[*]}; do mdadm --examine-badblocks /dev/$slave | grep -q " $badblock"; echo -n " $?"; done; echo; done
2174988544 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2174989080 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2174990608 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2174992144 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2174993680 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2174995208 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2174996744 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2175000120 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2175001656 for 8 sectors : 0 1 0 1 0 0 0 0 0 1 0 0 0
2493345552 for 16 sectors: 0 1 0 1 0 1 0 0 0 1 0 1 0
2493351832 for 8 sectors : 0 1 0 1 0 1 0 0 0 1 0 1 0
2493356936 for 8 sectors : 0 1 0 1 0 1 0 0 0 1 0 1 0
2493398344 for 16 sectors: 0 1 0 1 0 1 0 0 0 1 0 1 0
2493398584 for 8 sectors : 0 1 0 1 0 1 0 0 0 1 0 1 0

(where "0" is "present" and "1" is "missing")
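
Unrolled for readability, the cross-check above is essentially the
following (same logic, same output columns):

  slaves=( $(ls /sys/class/block/md0/slaves/) )
  # union of bad block entries recorded on any member
  for s in "${slaves[@]}"; do
    mdadm --examine-badblocks /dev/"$s"
  done | sort -u | sed -n 's/^ \+//p' |
  while read -r entry; do
    printf '%-25s:' "$entry"
    for s in "${slaves[@]}"; do
      # grep exit status: 0 = entry present on this member, 1 = missing
      mdadm --examine-badblocks /dev/"$s" | grep -q " $entry"
      printf ' %d' $?
    done
    echo
  done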

The read triggering the problem is

19477214*1024*64/512+262144 = 2493345536 through 2493411072

which neatly coincides with the 2493345552 through 2493398584 badblocks
entries.
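
For reference, the start of that range falls out of the dd parameters
like so (a sketch; the 64 KiB chunk size and the per-disk interpretation
are assumptions inferred from the dd block size, and the 262144-sector
data offset is taken from the formula above):

  # dd skip counts blocks of bs = 11 x 64 KiB, i.e. one chunk per data
  # disk, so skip * chunk gives the matching per-device data byte offset
  skip=19477214
  chunk_bytes=$((1024*64))            # assumed 64 KiB chunk size
  data_offset=262144                  # per-device data offset, in sectors
  echo $(( skip * chunk_bytes / 512 + data_offset ))   # -> 2493345536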

Reading the other (differently inconsistent) badblocks addresses does
not trigger I/O errors or panics.

I don't understand how the bad block lists got into this inconsistent
state, or how to get them out of it without corruption short of copying
the entire array; if anyone has pointers, I would be glad to hear them.

If further information is needed, let me know how to acquire it.

Please CC me in this thread; I am not on the list.

