From: Song Liu <song@kernel.org>
To: "Paweł Wiejacha" <pawel.wiejacha@rtbhouse.com>
Cc: linux-raid <linux-raid@vger.kernel.org>,
	Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Subject: Re: PROBLEM: double fault in md_end_io
Date: Sun, 11 Apr 2021 23:48:50 -0700
Message-ID: <CAPhsuW7YqvSBDhhOxX4oa8-Z0v5DMxtEeEWz4hs5SiNPxWrVmg@mail.gmail.com>
In-Reply-To: <CADLTsw2OJtc30HyAHCpQVbbUyoD7P9bK-ZfaH+nrdZc+Je4b6g@mail.gmail.com>

On Fri, Apr 9, 2021 at 2:41 PM Paweł Wiejacha
<pawel.wiejacha@rtbhouse.com> wrote:
>
> Hello,
>
> Two of my machines constantly crash with a double fault like this:
>
> <0>[33685.629591] traps: PANIC: double fault, error_code: 0x0
> <4>[33685.629593] double fault: 0000 [#1] SMP NOPTI
> <4>[33685.629594] CPU: 10 PID: 2118287 Comm: kworker/10:0 Tainted: P           OE     5.11.8-051108-generic #202103200636
> <4>[33685.629595] Hardware name: ASUSTeK COMPUTER INC. KRPG-U8 Series/KRPG-U8 Series, BIOS 4201 09/25/2020
> <4>[33685.629595] Workqueue: xfs-conv/md12 xfs_end_io [xfs]
> <4>[33685.629596] RIP: 0010:__slab_free+0x23/0x340
> <4>[33685.629597] Code: 4c fe ff ff 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 41 57 49 89 cf 41 56 49 89 fe 41 55 41 54 49 89 f4 53 48 83 e4 f0 48 83 ec 70 <48> 89 54 24 28 0f 1f 44 00 00 41 8b 46 28 4d 8b 6c 24 20 49 8b 5c
> <4>[33685.629598] RSP: 0018:ffffa9bc00848fa0 EFLAGS: 00010086
> <4>[33685.629599] RAX: ffff94c04d8b10a0 RBX: ffff94437a34a880 RCX: ffff94437a34a880
> <4>[33685.629599] RDX: ffff94437a34a880 RSI: ffffcec745e8d280 RDI: ffff944300043b00
> <4>[33685.629599] RBP: ffffa9bc00849040 R08: 0000000000000001 R09: ffffffff82a5d6de
> <4>[33685.629600] R10: 0000000000000001 R11: 000000009c109000 R12: ffffcec745e8d280
> <4>[33685.629600] R13: ffff944300043b00 R14: ffff944300043b00 R15: ffff94437a34a880
> <4>[33685.629601] FS:  0000000000000000(0000) GS:ffff94c04d880000(0000) knlGS:0000000000000000
> <4>[33685.629601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4>[33685.629602] CR2: ffffa9bc00848f98 CR3: 000000014d04e000 CR4: 0000000000350ee0
> <4>[33685.629602] Call Trace:
> <4>[33685.629603]  <IRQ>
> <4>[33685.629603]  ? kfree+0x3bc/0x3e0
> <4>[33685.629603]  ? mempool_kfree+0xe/0x10
> <4>[33685.629603]  ? mempool_kfree+0xe/0x10
> <4>[33685.629604]  ? mempool_free+0x2f/0x80
> <4>[33685.629604]  ? md_end_io+0x4a/0x70
> <4>[33685.629604]  ? bio_endio+0xdc/0x130
> <4>[33685.629605]  ? bio_chain_endio+0x2d/0x40
> <4>[33685.629605]  ? md_end_io+0x5c/0x70
> <4>[33685.629605]  ? bio_endio+0xdc/0x130
> <4>[33685.629605]  ? bio_chain_endio+0x2d/0x40
> <4>[33685.629606]  ? md_end_io+0x5c/0x70
> <4>[33685.629606]  ? bio_endio+0xdc/0x130
> <4>[33685.629606]  ? bio_chain_endio+0x2d/0x40
> <4>[33685.629607]  ? md_end_io+0x5c/0x70
> ... repeated ...
> <4>[33685.629677]  ? bio_endio+0xdc/0x130
> <4>[33685.629677]  ? bio_chain_endio+0x2d/0x40
> <4>[33685.629677]  ? md_end_io+0x5c/0x70
> <4>[33685.629677]  ? bio_endio+0xdc/0x130
> <4>[33685.629678]  ? bio_chain_endio+0x2d/0x40
> <4>[33685.629678]  ? md_
> <4>[33685.629679] Lost 357 message(s)!
>
> This happens on:
> 5.11.8-051108-generic #202103200636 SMP Sat Mar 20 11:17:32 UTC 2021
> and on 5.8.0-44-generic #50~20.04.1-Ubuntu
> (https://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_5.8.0-44.50/changelog)
> which contains backported
> https://github.com/torvalds/linux/commit/41d2d848e5c09209bdb57ff9c0ca34075e22783d
> ("md: improve io stats accounting").
> Kernel 5.8.18-050818-generic #202011011237 SMP Sun Nov 1 12:40:15 UTC
> 2020, which does not contain the suspected change above, does not crash.
>
> If there's a better way/place to report this bug just let me know. If
> not, here are steps to reproduce:
>
> 1. Create a RAID 0 device using three Micron_9300_MTFDHAL7T6TDP disks.
> mdadm --create --verbose /dev/md12 --level=stripe --raid-devices=3 \
>     /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1
>
> 2. Setup xfs on it:
> mkfs.xfs /dev/md12 and mount it
>
> 3. Write to a file on this filesystem:
> while true; do
>     rm -rf /mnt/md12/crash*
>     for i in `seq 8`; do
>         dd if=/dev/zero of=/mnt/md12/crash$i bs=32K count=50000000 &
>     done
>     wait
> done
> Wait for a crash (usually less than 20 min).
>
> I couldn't reproduce it with a single dd process (maybe I would have
> to wait a little longer), but a single
> cat /very/large/file/on/cephfs/over100GbE > /mnt/md12/crash
> is enough to trigger the double fault.
>
> More info:
> This long mempool_kfree -> md_end_io -> ... -> md_end_io stack trace
> always looks the same, but the panic occurs in different places:
>
> pstore/6948115143318/dmesg.txt-<4>[545649.087998] CPU: 88 PID: 0 Comm: swapper/88 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6948377398316/dmesg.txt-<4>[11275.914909] CPU: 14 PID: 0 Comm: swapper/14 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6948532816002/dmesg.txt-<4>[33685.629594] CPU: 10 PID: 2118287 Comm: kworker/10:0 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6948532816002/dmesg.txt-<4>[33685.629595] Workqueue: xfs-conv/md12 xfs_end_io [xfs]
> pstore/6948855849083/dmesg.txt-<4>[42934.321129] CPU: 85 PID: 0 Comm: swapper/85 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6948876331782/dmesg.txt-<4>[ 3475.020672] CPU: 86 PID: 0 Comm: swapper/86 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6949083860307/dmesg.txt-<4>[43048.254375] CPU: 45 PID: 0 Comm: swapper/45 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6949091775931/dmesg.txt-<4>[ 1150.790240] CPU: 64 PID: 0 Comm: swapper/64 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6949123356826/dmesg.txt-<4>[ 6963.858253] CPU: 6 PID: 51 Comm: kworker/6:0 Tainted: P           OE     5.11.8-051108-generic #202103200636
> pstore/6949123356826/dmesg.txt-<4>[ 6963.858255] Workqueue: ceph-msgr ceph_con_workfn [libceph]
> pstore/6949152322085/dmesg.txt-<4>[ 6437.077874] CPU: 59 PID: 0 Comm: swapper/59 Tainted: P           OE     5.11.8-051108-generic #202103200636
>
> cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-5.11.8-051108-generic
> root=/dev/mapper/ubuntu--vg-ubuntu--lv ro net.ifnames=0 biosdevname=0
> strict-devmem=0 mitigations=off iommu=pt
>
> cat /proc/cpuinfo
> model name      : AMD EPYC 7552 48-Core Processor
>
> cat /proc/mounts
> /dev/md12 /mnt/ssd1 xfs rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=3072,prjquota 0 0
>
> Let me know if you need more information.

Hi Pawel,

Thanks for the report. Could you please check whether the following
change fixes the issue?

Song

diff --git i/drivers/md/md.c w/drivers/md/md.c
index bb09070a63bcf..889e9440a9f38 100644
--- i/drivers/md/md.c
+++ w/drivers/md/md.c
@@ -5648,7 +5648,7 @@ static int md_alloc(dev_t dev, char *name)
                 */
                mddev->hold_active = UNTIL_STOP;

-       error = mempool_init_kmalloc_pool(&mddev->md_io_pool, BIO_POOL_SIZE,
+       error = mempool_init_slab_pool(&mddev->md_io_pool, BIO_POOL_SIZE,
                                          sizeof(struct md_io));
        if (error)
                goto abort;
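A side note on the patch above, offered as a hedged sketch rather than a
tested change: in mainline, mempool_init_slab_pool() takes a
struct kmem_cache * as its third argument, not an element size, so the
one-line substitution would also need a dedicated cache along these lines
(md_io_cache and md_init_io_pool are illustrative names, not from the
posted patch):

```c
/* Hedged sketch only: mempool_init_slab_pool() expects a kmem_cache,
 * not a size, so a slab-backed md_io_pool would also need a cache.
 * md_io_cache / md_init_io_pool are illustrative names. */
#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *md_io_cache;	/* created once, e.g. at init */

static int md_init_io_pool(struct mddev *mddev)
{
	/* KMEM_CACHE() derives name, size and alignment from struct md_io */
	md_io_cache = KMEM_CACHE(md_io, 0);
	if (!md_io_cache)
		return -ENOMEM;
	return mempool_init_slab_pool(&mddev->md_io_pool, BIO_POOL_SIZE,
				      md_io_cache);
}
```

mempool_init_kmalloc_pool(), by contrast, does take a size and backs the
pool with kmalloc/kfree, which is what the original code used.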
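For anyone retrying the reproduction, one iteration of the parallel-dd
workload quoted above can be sketched as a small parameterized script.
This is a hedged sketch, not the reporter's exact command line: TARGET,
COUNT, and JOBS are illustrative names, and COUNT defaults to a tiny
value so the script can be dry-run against a scratch directory (the
report used count=50000000 against /mnt/md12, looped forever).

```shell
#!/bin/sh
# One iteration of the parallel-dd workload from the report.
# TARGET/COUNT/JOBS are illustrative; COUNT is tiny by default so the
# script can be dry-run against a scratch directory instead of the array.
TARGET="${TARGET:-$(mktemp -d)}"
COUNT="${COUNT:-4}"      # 32 KiB blocks per writer; the report used 50000000
JOBS="${JOBS:-8}"

rm -rf "$TARGET"/crash*
i=1
while [ "$i" -le "$JOBS" ]; do
    # Each writer streams zeros into its own file, all in parallel.
    dd if=/dev/zero of="$TARGET/crash$i" bs=32K count="$COUNT" 2>/dev/null &
    i=$((i + 1))
done
wait
```

To match the report, wrap the body in `while true; do ...; done`, point
TARGET at the md array's mount point, and raise COUNT.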

Thread overview: 13+ messages
2021-04-09 21:40 PROBLEM: double fault in md_end_io Paweł Wiejacha
2021-04-12  6:48 ` Song Liu [this message]
2021-04-13 12:05   ` Paweł Wiejacha
2021-04-15  0:36     ` Song Liu
2021-04-15  6:35       ` Song Liu
2021-04-15 15:35         ` Paweł Wiejacha
2021-04-22 15:40           ` Paweł Wiejacha
2021-04-23  2:36 ` Guoqing Jiang
2021-04-23  6:44   ` Song Liu
2021-05-04 21:17     ` Paweł Wiejacha
2021-05-06  5:48       ` Song Liu
2021-05-06 23:46         ` Guoqing Jiang
2021-05-08  1:17           ` Guoqing Jiang
