From: "Paweł Wiejacha" <pawel.wiejacha@rtbhouse.com>
To: song@kernel.org
Cc: linux-raid@vger.kernel.org, artur.paszkiewicz@intel.com
Subject: PROBLEM: double fault in md_end_io
Date: Fri, 9 Apr 2021 23:40:52 +0200	[thread overview]
Message-ID: <CADLTsw2OJtc30HyAHCpQVbbUyoD7P9bK-ZfaH+nrdZc+Je4b6g@mail.gmail.com> (raw)

Hello,

Two of my machines constantly crash with a double fault like this:

<0>[33685.629591] traps: PANIC: double fault, error_code: 0x0
<4>[33685.629593] double fault: 0000 [#1] SMP NOPTI
<4>[33685.629594] CPU: 10 PID: 2118287 Comm: kworker/10:0 Tainted: P           OE     5.11.8-051108-generic #202103200636
<4>[33685.629595] Hardware name: ASUSTeK COMPUTER INC. KRPG-U8 Series/KRPG-U8 Series, BIOS 4201 09/25/2020
<4>[33685.629595] Workqueue: xfs-conv/md12 xfs_end_io [xfs]
<4>[33685.629596] RIP: 0010:__slab_free+0x23/0x340
<4>[33685.629597] Code: 4c fe ff ff 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 41 57 49 89 cf 41 56 49 89 fe 41 55 41 54 49 89 f4 53 48 83 e4 f0 48 83 ec 70 <48> 89 54 24 28 0f 1f 44 00 00 41 8b 46 28 4d 8b 6c 24 20 49 8b 5c
<4>[33685.629598] RSP: 0018:ffffa9bc00848fa0 EFLAGS: 00010086
<4>[33685.629599] RAX: ffff94c04d8b10a0 RBX: ffff94437a34a880 RCX: ffff94437a34a880
<4>[33685.629599] RDX: ffff94437a34a880 RSI: ffffcec745e8d280 RDI: ffff944300043b00
<4>[33685.629599] RBP: ffffa9bc00849040 R08: 0000000000000001 R09: ffffffff82a5d6de
<4>[33685.629600] R10: 0000000000000001 R11: 000000009c109000 R12: ffffcec745e8d280
<4>[33685.629600] R13: ffff944300043b00 R14: ffff944300043b00 R15: ffff94437a34a880
<4>[33685.629601] FS:  0000000000000000(0000) GS:ffff94c04d880000(0000) knlGS:0000000000000000
<4>[33685.629601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[33685.629602] CR2: ffffa9bc00848f98 CR3: 000000014d04e000 CR4: 0000000000350ee0
<4>[33685.629602] Call Trace:
<4>[33685.629603]  <IRQ>
<4>[33685.629603]  ? kfree+0x3bc/0x3e0
<4>[33685.629603]  ? mempool_kfree+0xe/0x10
<4>[33685.629603]  ? mempool_kfree+0xe/0x10
<4>[33685.629604]  ? mempool_free+0x2f/0x80
<4>[33685.629604]  ? md_end_io+0x4a/0x70
<4>[33685.629604]  ? bio_endio+0xdc/0x130
<4>[33685.629605]  ? bio_chain_endio+0x2d/0x40
<4>[33685.629605]  ? md_end_io+0x5c/0x70
<4>[33685.629605]  ? bio_endio+0xdc/0x130
<4>[33685.629605]  ? bio_chain_endio+0x2d/0x40
<4>[33685.629606]  ? md_end_io+0x5c/0x70
<4>[33685.629606]  ? bio_endio+0xdc/0x130
<4>[33685.629606]  ? bio_chain_endio+0x2d/0x40
<4>[33685.629607]  ? md_end_io+0x5c/0x70
... repeated (258 similar lines omitted) ...
<4>[33685.629677]  ? bio_endio+0xdc/0x130
<4>[33685.629677]  ? bio_chain_endio+0x2d/0x40
<4>[33685.629677]  ? md_end_io+0x5c/0x70
<4>[33685.629677]  ? bio_endio+0xdc/0x130
<4>[33685.629678]  ? bio_chain_endio+0x2d/0x40
<4>[33685.629678]  ? md_
<4>[33685.629679] Lost 357 message(s)!
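
To get a rough idea of how deep the repetition goes in a saved dump, the frames can be tallied per symbol. This is only a minimal sketch: the helper name count_frames is made up for illustration, and it assumes the dump contains unwrapped "? symbol+0xOFF/0xLEN" frames like the ones above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any kernel tooling): count how often each
# symbol appears in a call trace read from stdin.
# It matches frames of the form "? symbol+0x..." and ignores truncated ones.
count_frames() {
    grep -o '? [A-Za-z_][A-Za-z0-9_.]*+0x' \
        | sed -e 's/^? //' -e 's/+0x$//' \
        | sort | uniq -c | sort -rn
}

# Example use, assuming a pstore dump saved under this path:
#   count_frames < pstore/6948532816002/dmesg.txt
```

The most frequent symbol is printed first, so a trace dominated by md_end_io, bio_endio and bio_chain_endio shows up immediately.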

This happens on:
5.11.8-051108-generic #202103200636 SMP Sat Mar 20 11:17:32 UTC 2021
and on 5.8.0-44-generic #50~20.04.1-Ubuntu
(https://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_5.8.0-44.50/changelog)
which contains a backport of
https://github.com/torvalds/linux/commit/41d2d848e5c09209bdb57ff9c0ca34075e22783d
("md: improve io stats accounting").
Kernel 5.8.18-050818-generic #202011011237 SMP Sun Nov 1 12:40:15 UTC
2020, which does not contain the suspected change above, does not crash.

If there's a better way or place to report this bug, just let me know.
If not, here are the steps to reproduce:

1. Create a RAID 0 device using three Micron_9300_MTFDHAL7T6TDP disks:
mdadm --create --verbose /dev/md12 --level=stripe --raid-devices=3 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1

2. Set up XFS on it and mount it:
mkfs.xfs /dev/md12

3. Write to files on this filesystem:
while true; do rm -rf /mnt/md12/crash* ; for i in `seq 8`; do dd if=/dev/zero of=/mnt/md12/crash$i bs=32K count=50000000 & done; wait; done

Then wait for a crash (usually less than 20 minutes).
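
The steps above could be collected into one script along these lines. Treat it as a sketch only: the DRY_RUN switch, the function names, the mkdir call and the /mnt/md12 mount point for the mount step are my assumptions, not something the steps spell out. With DRY_RUN=1 (the default here) it only prints the commands, because mdadm --create and mkfs.xfs destroy whatever is on the partitions.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DESTRUCTIVE when DRY_RUN=0.
# DRY_RUN, run(), setup_array() and write_load() are illustrative names.
DRY_RUN=${DRY_RUN:-1}
MD=/dev/md12
MNT=/mnt/md12   # assumed mount point, taken from the dd paths in step 3

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Steps 1 and 2: create the RAID 0 array, make the filesystem, mount it.
setup_array() {
    run mdadm --create --verbose "$MD" --level=stripe --raid-devices=3 \
        /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1
    run mkfs.xfs "$MD"
    run mkdir -p "$MNT"
    run mount "$MD" "$MNT"
}

# Step 3, one pass: remove old files, start eight parallel writers, wait.
write_load() {
    run rm -rf "$MNT"/crash*
    for i in 1 2 3 4 5 6 7 8; do
        run dd if=/dev/zero of="$MNT/crash$i" bs=32K count=50000000 &
    done
    wait
}

# Example use (loops until the machine crashes or is interrupted):
#   DRY_RUN=0 sh -c '. ./repro.sh; setup_array; while true; do write_load; done'
```

Running it once with the default DRY_RUN=1 is a cheap way to check the device names before pointing it at real disks.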

I couldn't reproduce it with a single dd process (maybe I have to wait
a little longer), but a single cat
/very/large/file/on/cephfs/over100GbE > /mnt/md12/crash is enough for
this double fault to occur.

More info:
This long mempool_kfree - md_end_io - * - md_end_io stack trace always
looks the same, but the panic occurs in different places:

pstore/6948115143318/dmesg.txt-<4>[545649.087998] CPU: 88 PID: 0 Comm: swapper/88 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6948377398316/dmesg.txt-<4>[11275.914909] CPU: 14 PID: 0 Comm: swapper/14 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6948532816002/dmesg.txt-<4>[33685.629594] CPU: 10 PID: 2118287 Comm: kworker/10:0 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6948532816002/dmesg.txt-<4>[33685.629595] Workqueue: xfs-conv/md12 xfs_end_io [xfs]
pstore/6948855849083/dmesg.txt-<4>[42934.321129] CPU: 85 PID: 0 Comm: swapper/85 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6948876331782/dmesg.txt-<4>[ 3475.020672] CPU: 86 PID: 0 Comm: swapper/86 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6949083860307/dmesg.txt-<4>[43048.254375] CPU: 45 PID: 0 Comm: swapper/45 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6949091775931/dmesg.txt-<4>[ 1150.790240] CPU: 64 PID: 0 Comm: swapper/64 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6949123356826/dmesg.txt-<4>[ 6963.858253] CPU: 6 PID: 51 Comm: kworker/6:0 Tainted: P           OE     5.11.8-051108-generic #202103200636
pstore/6949123356826/dmesg.txt-<4>[ 6963.858255] Workqueue: ceph-msgr ceph_con_workfn [libceph]
pstore/6949152322085/dmesg.txt-<4>[ 6437.077874] CPU: 59 PID: 0 Comm: swapper/59 Tainted: P           OE     5.11.8-051108-generic #202103200636
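
A quick way to see which tasks were running at the time of each panic is to tally the Comm: field across the dumps. Again a minimal sketch: tally_comm is a hypothetical helper name, and it assumes each "CPU: ... Comm: ..." record sits on one unwrapped line, as it does in the pstore files themselves.

```shell
#!/bin/sh
# Hypothetical helper: tally task names from "CPU: ... Comm: name/N ..." lines
# read on stdin. The per-CPU suffix (e.g. "/88" or "/10:0") is dropped so that
# swapper/88 and swapper/14 count as the same task. Lines without a Comm:
# field (such as the Workqueue: lines) are skipped.
tally_comm() {
    sed -n 's/.*Comm: *\([^ ]*\).*/\1/p' \
        | sed 's|/.*||' \
        | sort | uniq -c | sort -rn
}

# Example use, assuming the dumps live under ./pstore as in the list above:
#   grep -h 'Comm:' pstore/*/dmesg.txt | tally_comm
```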

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.11.8-051108-generic
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro net.ifnames=0 biosdevname=0
strict-devmem=0 mitigations=off iommu=pt

cat /proc/cpuinfo
model name      : AMD EPYC 7552 48-Core Processor

cat /proc/mounts
/dev/md12 /mnt/ssd1 xfs rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=3072,prjquota 0 0

Let me know if you need more information.

Best regards,
Paweł Wiejacha

Thread overview: 13+ messages
2021-04-09 21:40 Paweł Wiejacha [this message]
2021-04-12  6:48 ` PROBLEM: double fault in md_end_io Song Liu
2021-04-13 12:05   ` Paweł Wiejacha
2021-04-15  0:36     ` Song Liu
2021-04-15  6:35       ` Song Liu
2021-04-15 15:35         ` Paweł Wiejacha
2021-04-22 15:40           ` Paweł Wiejacha
2021-04-23  2:36 ` Guoqing Jiang
2021-04-23  6:44   ` Song Liu
2021-05-04 21:17     ` Paweł Wiejacha
2021-05-06  5:48       ` Song Liu
2021-05-06 23:46         ` Guoqing Jiang
2021-05-08  1:17           ` Guoqing Jiang
