From: Donald Buczek <buczek@molgen.mpg.de>
To: linux-xfs@vger.kernel.org,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
it+linux-xfs@molgen.mpg.de
Subject: v5.10.1 xfs deadlock
Date: Thu, 17 Dec 2020 18:44:51 +0100 [thread overview]
Message-ID: <b8da4aed-ee44-5d9f-88dc-3d32f0298564@molgen.mpg.de> (raw)
Dear xfs developer,
I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
After about on hour, the processes to one filesystem (md1) blocked, 30 minutes later the process to the other filesystem (md0) did.
root 7322 2167 0 Dec16 pts/1 00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
root 7329 2169 0 Dec16 pts/1 00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
root 13856 2170 0 Dec16 pts/1 00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
root 13899 2168 0 Dec16 pts/1 00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
It stands out, that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads all with stacks like this
[<0>] xfs_log_commit_cil+0x6cc/0x7c0
[<0>] __xfs_trans_commit+0xab/0x320
[<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
[<0>] xfs_end_ioend+0xc6/0x110
[<0>] xfs_end_io+0xad/0xe0
[<0>] process_one_work+0x1dd/0x3e0
[<0>] worker_thread+0x2d/0x3b0
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30
xfs_log_commit_cil+0x6cc is
xfs_log_commit_cil()
xlog_cil_push_background(log)
xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
Some other threads, including the four "cp" commands are also blocking at xfs_log_commit_cil+0x6cc
There are also single "flush" process for each md device with this stack signature:
[<0>] xfs_map_blocks+0xbf/0x400
[<0>] iomap_do_writepage+0x15e/0x880
[<0>] write_cache_pages+0x175/0x3f0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x59/0x80
[<0>] do_writepages+0x4b/0xe0
[<0>] __writeback_single_inode+0x42/0x300
[<0>] writeback_sb_inodes+0x198/0x3f0
[<0>] __writeback_inodes_wb+0x5e/0xc0
[<0>] wb_writeback+0x246/0x2d0
[<0>] wb_workfn+0x26e/0x490
[<0>] process_one_work+0x1dd/0x3e0
[<0>] worker_thread+0x2d/0x3b0
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30
xfs_map_blocks+0xbf is the
xfs_ilock(ip, XFS_ILOCK_SHARED);
in xfs_map_blocks().
The system is low on free memory
MemTotal: 197587764 kB
MemFree: 2196496 kB
MemAvailable: 189895408 kB
but responsive.
I have an out of tree driver for the HBA ( smartpqi 2.1.6-005 pulled from linux-scsi) , but it is unlikely that this blocking is related to that, because the md block devices itself are responsive (`xxd /dev/md0` )
I can keep the system in the state for a while. Is there an idea what was going from or an idea what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
Best
Donald
next reply other threads:[~2020-12-17 17:45 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-17 17:44 Donald Buczek [this message]
2020-12-17 19:43 ` v5.10.1 xfs deadlock Brian Foster
2020-12-17 21:30 ` Donald Buczek
2020-12-18 15:35 ` Brian Foster
2020-12-18 18:35 ` Donald Buczek
2020-12-27 17:34 ` Donald Buczek
2020-12-28 23:13 ` Donald Buczek
2020-12-29 23:56 ` [PATCH] xfs: Wake CIL push waiters more reliably Donald Buczek
2020-12-30 22:16 ` Dave Chinner
2020-12-31 11:48 ` Donald Buczek
2020-12-31 21:59 ` Dave Chinner
2021-01-02 19:12 ` Donald Buczek
2021-01-02 22:44 ` Dave Chinner
2021-01-03 16:03 ` Donald Buczek
2021-01-07 22:19 ` Dave Chinner
2021-01-09 14:39 ` Donald Buczek
2021-01-04 16:23 ` Brian Foster
2021-01-07 21:54 ` Dave Chinner
2021-01-08 16:56 ` Brian Foster
2021-01-11 16:38 ` Brian Foster
2021-01-13 21:53 ` Dave Chinner
2021-02-15 13:36 ` Donald Buczek
2021-02-16 11:18 ` Brian Foster
2021-02-16 12:40 ` Donald Buczek
2021-01-13 21:44 ` Dave Chinner
[not found] ` <20201230024642.2171-1-hdanton@sina.com>
2020-12-30 16:54 ` Donald Buczek
2020-12-18 21:49 ` v5.10.1 xfs deadlock Dave Chinner
2020-12-21 12:22 ` Donald Buczek
2020-12-27 17:22 ` Donald Buczek
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b8da4aed-ee44-5d9f-88dc-3d32f0298564@molgen.mpg.de \
--to=buczek@molgen.mpg.de \
--cc=it+linux-xfs@molgen.mpg.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.