From: Donald Buczek <buczek@molgen.mpg.de>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	it+linux-xfs@molgen.mpg.de
Subject: Re: v5.10.1 xfs deadlock
Date: Fri, 18 Dec 2020 19:35:29 +0100	[thread overview]
Message-ID: <8e9a2939-220d-b12f-a24e-0fb48fa95215@molgen.mpg.de> (raw)
In-Reply-To: <20201218153533.GA2563439@bfoster>

On 18.12.20 16:35, Brian Foster wrote:
> On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
>> On 17.12.20 20:43, Brian Foster wrote:
>>> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>>>> Dear xfs developer,
>>>>
>>>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>>>
>>>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>>>
>>>> After about one hour, the processes writing to one filesystem (md1) blocked; 30 minutes later, the processes writing to the other filesystem (md0) did as well.
>>>>
>>>>       root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>>>       root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>>>       root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>>>       root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>>>
> 
> Do you have any indication of whether these workloads actually hung or
> just became incredibly slow?

There is zero progress. iostat doesn't show any I/O on any of the block devices (md or member devices).

>>>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>>>
>>>> It stands out that there are many "xfs-conv" threads (549 for md0, but only 10 for md1), all with stacks like this:
>>>>
>>>>       [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>>>       [<0>] __xfs_trans_commit+0xab/0x320
>>>>       [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>>>       [<0>] xfs_end_ioend+0xc6/0x110
>>>>       [<0>] xfs_end_io+0xad/0xe0
>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>       [<0>] kthread+0x118/0x130
>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> xfs_log_commit_cil+0x6cc is
>>>>
>>>>     xfs_log_commit_cil()
>>>>       xlog_cil_push_background(log)
>>>>         xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>>>
> 
> This looks like the transaction commit throttling code. That was
> introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
> Throttle commits on delayed background CIL push"). The purpose of that
> change was to prevent the CIL from growing too large. FWIW, I don't
> recall that being a functional problem so it should be possible to
> simply remove that blocking point and see if that avoids the problem or
> if we simply stall out somewhere else, if you wanted to give that a
> test.

Will do. Before trying with this commit reverted, I will repeat the test without any change to see if the problem is reproducible at all.
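
For reference, and so others can follow without digging through the source: if I read the v5.10 code correctly, the blocking point is the tail end of xlog_cil_push_background(). The sketch below is paraphrased and simplified from fs/xfs/xfs_log_cil.c and fs/xfs/xfs_log_priv.h (the xc_ctx_lock handling and asserts are omitted); it is not a verbatim copy.

     /* Throttle added by 0e7ab7efe7745: once the current CIL context
      * grows past the blocking space limit, every committer sleeps
      * until the background push worker wakes xc_push_wait. */
     static void
     xlog_cil_push_background(struct xlog *log)
     {
             struct xfs_cil  *cil = log->l_cilp;

             /* Below the background push limit there is nothing to do. */
             if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log))
                     return;

             spin_lock(&cil->xc_push_lock);
             if (cil->xc_push_seq < cil->xc_current_sequence) {
                     cil->xc_push_seq = cil->xc_current_sequence;
                     queue_work(log->l_mp->m_cil_workqueue,
                                &cil->xc_push_work);
             }

             /* This is where all our blocked tasks are sleeping. */
             if (cil->xc_ctx->space_used >=
                 XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
                     trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
                     xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
                     return;
             }
             spin_unlock(&cil->xc_push_lock);
     }

     /* xlog_wait() drops the lock and sleeps in TASK_UNINTERRUPTIBLE
      * with no timeout, which matches the D-state tasks we observe.
      * Note the exclusive wait: a plain wake_up() wakes one waiter. */
     static inline void
     xlog_wait(wait_queue_head_t *wq, spinlock_t *lock) __releases(lock)
     {
             DECLARE_WAITQUEUE(wait, current);

             add_wait_queue_exclusive(wq, &wait);
             __set_current_state(TASK_UNINTERRUPTIBLE);
             spin_unlock(lock);
             schedule();
             remove_wait_queue(wq, &wait);
     }

So if the push worker ever fails to wake xc_push_wait, everything that has crossed the blocking limit stays asleep indefinitely.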

>>>> Some other threads, including the four "cp" commands, are also blocked at xfs_log_commit_cil+0x6cc.
>>>>
>>>> There is also a single "flush" process for each md device with this stack signature:
>>>>
>>>>       [<0>] xfs_map_blocks+0xbf/0x400
>>>>       [<0>] iomap_do_writepage+0x15e/0x880
>>>>       [<0>] write_cache_pages+0x175/0x3f0
>>>>       [<0>] iomap_writepages+0x1c/0x40
>>>>       [<0>] xfs_vm_writepages+0x59/0x80
>>>>       [<0>] do_writepages+0x4b/0xe0
>>>>       [<0>] __writeback_single_inode+0x42/0x300
>>>>       [<0>] writeback_sb_inodes+0x198/0x3f0
>>>>       [<0>] __writeback_inodes_wb+0x5e/0xc0
>>>>       [<0>] wb_writeback+0x246/0x2d0
>>>>       [<0>] wb_workfn+0x26e/0x490
>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>       [<0>] kthread+0x118/0x130
>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>
> 
> Is writeback still blocked as such or was this just a transient stack?

This is frozen. I did another collection of stack dumps today, and all the stack traces (related to these two filesystems) are unchanged.

However, I discovered that my original report was not correct: the above stack trace is from PID 348 ("kworker/u81:8+flush-9:0"), while the flush worker for the other device (PID 20, "kworker/u82:0+flush-9:1") doesn't have an identical stack; it has three more frames and is blocked at xfs_log_commit_cil+0x6cc like most of the other processes:

     [<0>] xfs_log_commit_cil+0x6cc/0x7c0
     [<0>] __xfs_trans_commit+0xab/0x320
     [<0>] xfs_bmapi_convert_delalloc+0x437/0x4b0
     [<0>] xfs_map_blocks+0x1e3/0x400
     [<0>] iomap_do_writepage+0x15e/0x880
     [<0>] write_cache_pages+0x175/0x3f0
     [<0>] iomap_writepages+0x1c/0x40
     [<0>] xfs_vm_writepages+0x59/0x80
     [<0>] do_writepages+0x4b/0xe0
     [<0>] __writeback_single_inode+0x42/0x300
     [<0>] writeback_sb_inodes+0x198/0x3f0
     [<0>] __writeback_inodes_wb+0x5e/0xc0
     [<0>] wb_writeback+0x246/0x2d0
     [<0>] wb_workfn+0x26e/0x490
     [<0>] process_one_work+0x1dd/0x3e0
     [<0>] worker_thread+0x2d/0x3b0
     [<0>] kthread+0x118/0x130
     [<0>] ret_from_fork+0x22/0x30

> 
>>>> xfs_map_blocks+0xbf is the
>>>>
>>>>       xfs_ilock(ip, XFS_ILOCK_SHARED);
>>>>
>>>> in xfs_map_blocks().
>>>>
>>>> The system is low on free memory
>>>>
>>>>       MemTotal:       197587764 kB
>>>>       MemFree:          2196496 kB
>>>>       MemAvailable:   189895408 kB
>>>>
>>>> but responsive.
>>>>
>>>> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005, pulled from linux-scsi), but it is unlikely that this blocking is related to it, because the md block devices themselves are responsive (`xxd /dev/md0`).
>>>>
>>>> I can keep the system in this state for a while. Is there any idea of what is going on, or what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
>>>>
>>>
>>> It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for
>>> each fs to get an idea of the state of the logs as well...
>>
>>
>> root@deadbird:~# for f in /sys/fs/xfs/*/log/*; do echo $f : $(cat $f);done
>> /sys/fs/xfs/md0/log/log_head_lsn : 5:714808
>> /sys/fs/xfs/md0/log/log_tail_lsn : 5:581592
>> /sys/fs/xfs/md0/log/reserve_grant_head : 5:365981696
>> /sys/fs/xfs/md0/log/write_grant_head : 5:365981696
> 
> Hm, so it looks like the log is populated but not necessarily full. What
> looks more interesting is that the grant heads (365981696 bytes) line up
> with the physical log head (714808 512b sectors). That suggests there is
> no outstanding transaction reservation and thus perhaps all workload
> tasks are sitting at that throttling point just after the current
> transaction commits and releases unused reservation. That certainly
> shouldn't be such a longstanding blocking point as it only waits for the
> CIL push to start.
> 
> Out of curiosity, have any of the above values changed since the sample
> provided here was collected? As above, I'm curious if the filesystem
> happens to be moving along slowly or not at all, whether the AIL has
> been drained in the background, etc.

Zero I/O on the block devices.
All related threads still have the exact same stack traces after more than two days.
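
(Brian's unit math checks out, by the way, assuming the sysfs units are as he states: log_head_lsn is in 512-byte basic blocks while the grant heads are byte offsets, so for md0, 714808 * 512 = 365981696, and for md1 (values quoted further below), 2963880 * 512 = 1517506560. In both cases the grant heads sit exactly at the physical log head.)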

> Could you post the xfs_info for the affected filesystems?

buczek@deadbird:~/linux_problems/mdX_raid6_looping/tests_with_deadbird/2020-12-16-01$ xfs_info /dev/md0
meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=1
data     =                       bsize=4096   blocks=27348629504, imaxpct=1
          =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
buczek@deadbird:~/linux_problems/mdX_raid6_looping/tests_with_deadbird/2020-12-16-01$ xfs_info /dev/md1
meta-data=/dev/md1               isize=512    agcount=102, agsize=268435328 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=1
data     =                       bsize=4096   blocks=27348629504, imaxpct=1
          =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


> Also since it seems like you should have plenty of available log
> reservation, are you able to perform any writable operations on the fs
> (i.e., touch <file>)? If so, I wonder if you were able to start a new
> copy workload on one of the fs' capable of triggering the blocking threshold
> again, if that might eventually unstick the currently blocked tasks when
> the next CIL push occurs...

When I do something like

     touch /jbod/M8067/scratch/x.001

This doesn't block, and shortly thereafter I see momentary write activity on the md and member devices, but the blocked processes don't resume.

     root@deadbird:~# iostat|grep md0 ; sleep 30 ; iostat|grep md0 ; touch /jbod/M8067/scratch/x.008 ; sleep 30 ; iostat|grep md0 ; sleep 30 ; iostat|grep md0
     md0              25.81        52.45      2385.56         0.00   12806780  582460904          0
     md0              25.81        52.45      2385.27         0.00   12806780  582460904          0
     md0              25.81        52.44      2384.98         0.00   12806780  582460908          0
     md0              25.80        52.43      2384.68         0.00   12806780  582460908          0

The touched files are visible with `ls`. (Note the iostat counters above: the touch added only 4 kB written in total.)

So then I tried something longer:

     cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/linux.001

which also triggered a few I/Os in the beginning, but then came to a perceived halt (no I/O for several minutes).

Interestingly, this new "cp" is not blocking at xfs_log_commit_cil+0x6cc like the four "cp" commands from the stress test, but has this stack:

     [<0>] balance_dirty_pages+0x31c/0xd80
     [<0>] balance_dirty_pages_ratelimited+0x2f9/0x3c0
     [<0>] iomap_write_actor+0x11d/0x190
     [<0>] iomap_apply+0x117/0x2e0
     [<0>] iomap_file_buffered_write+0x62/0x90
     [<0>] xfs_file_buffered_aio_write+0xd3/0x320
     [<0>] new_sync_write+0x11f/0x1b0
     [<0>] vfs_write+0x1ea/0x250
     [<0>] ksys_write+0xa1/0xe0
     [<0>] do_syscall_64+0x33/0x40
     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Maybe this would change if I waited a few more hours. A plausible explanation: writeback for this filesystem is itself stuck behind the CIL throttle, so the number of dirty pages never drops and balance_dirty_pages() throttles the new writer indefinitely.

So I will reboot the system now and rerun the test with exactly the same configuration, but with 0e7ab7efe7745 ("xfs: Throttle commits on delayed background CIL push") reverted.
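
Concretely, that means building v5.10.1 with something like the following applied (assuming the revert still applies cleanly on top of 5.10):

     git revert 0e7ab7efe7745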

Donald

> Brian
> 
>> /sys/fs/xfs/md1/log/log_head_lsn : 3:2963880
>> /sys/fs/xfs/md1/log/log_tail_lsn : 3:2772656
>> /sys/fs/xfs/md1/log/reserve_grant_head : 3:1517506560
>> /sys/fs/xfs/md1/log/write_grant_head : 3:1517506560
>> /sys/fs/xfs/sda1/log/log_head_lsn : 233:106253
>> /sys/fs/xfs/sda1/log/log_tail_lsn : 233:106251
>> /sys/fs/xfs/sda1/log/reserve_grant_head : 233:54403812
>> /sys/fs/xfs/sda1/log/write_grant_head : 233:54403812
>> /sys/fs/xfs/sda2/log/log_head_lsn : 84:5653
>> /sys/fs/xfs/sda2/log/log_tail_lsn : 84:5651
>> /sys/fs/xfs/sda2/log/reserve_grant_head : 84:2894336
>> /sys/fs/xfs/sda2/log/write_grant_head : 84:2894336
>>
>>
>>
>>>
>>> Brian
>>>
>>>> Best
>>>>     Donald
>>>>
>>>
>>
>> -- 
>> Donald Buczek
>> buczek@molgen.mpg.de
>> Tel: +49 30 8413 1433
>>
> 

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

Thread overview: 29+ messages
2020-12-17 17:44 v5.10.1 xfs deadlock Donald Buczek
2020-12-17 19:43 ` Brian Foster
2020-12-17 21:30   ` Donald Buczek
2020-12-18 15:35     ` Brian Foster
2020-12-18 18:35       ` Donald Buczek [this message]
2020-12-27 17:34         ` Donald Buczek
2020-12-28 23:13           ` Donald Buczek
2020-12-29 23:56             ` [PATCH] xfs: Wake CIL push waiters more reliably Donald Buczek
2020-12-30 22:16               ` Dave Chinner
2020-12-31 11:48                 ` Donald Buczek
2020-12-31 21:59                   ` Dave Chinner
2021-01-02 19:12                     ` Donald Buczek
2021-01-02 22:44                       ` Dave Chinner
2021-01-03 16:03                         ` Donald Buczek
2021-01-07 22:19                           ` Dave Chinner
2021-01-09 14:39                             ` Donald Buczek
2021-01-04 16:23                 ` Brian Foster
2021-01-07 21:54                   ` Dave Chinner
2021-01-08 16:56                     ` Brian Foster
2021-01-11 16:38                       ` Brian Foster
2021-01-13 21:53                         ` Dave Chinner
2021-02-15 13:36                           ` Donald Buczek
2021-02-16 11:18                             ` Brian Foster
2021-02-16 12:40                               ` Donald Buczek
2021-01-13 21:44                       ` Dave Chinner
     [not found]             ` <20201230024642.2171-1-hdanton@sina.com>
2020-12-30 16:54               ` Donald Buczek
2020-12-18 21:49 ` v5.10.1 xfs deadlock Dave Chinner
2020-12-21 12:22   ` Donald Buczek
2020-12-27 17:22     ` Donald Buczek
