All of lore.kernel.org
 help / color / mirror / Atom feed
From: Donald Buczek <buczek@molgen.mpg.de>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	it+linux-xfs@molgen.mpg.de
Subject: Re: v5.10.1 xfs deadlock
Date: Fri, 18 Dec 2020 19:35:29 +0100	[thread overview]
Message-ID: <8e9a2939-220d-b12f-a24e-0fb48fa95215@molgen.mpg.de> (raw)
In-Reply-To: <20201218153533.GA2563439@bfoster>

On 18.12.20 16:35, Brian Foster wrote:
> On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
>> On 17.12.20 20:43, Brian Foster wrote:
>>> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>>>> Dear xfs developer,
>>>>
>>>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>>>
>>>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>>>
>>>> After about on hour, the processes to one filesystem (md1) blocked, 30 minutes later the process to the other filesystem (md0) did.
>>>>
>>>>       root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>>>       root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>>>       root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>>>       root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>>>
> 
> Do you have any indication of whether these workloads actually hung or
> just became incredibly slow?

There is zero progress. iostat doesn't show any I/O on any of the block devices (md or member)

>>>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>>>
>>>> It stands out, that there are many (549 for md0, but only 10 for md1)  "xfs-conv" threads all with stacks like this
>>>>
>>>>       [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>>>       [<0>] __xfs_trans_commit+0xab/0x320
>>>>       [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>>>       [<0>] xfs_end_ioend+0xc6/0x110
>>>>       [<0>] xfs_end_io+0xad/0xe0
>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>       [<0>] kthread+0x118/0x130
>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> xfs_log_commit_cil+0x6cc is
>>>>
>>>>     xfs_log_commit_cil()
>>>>       xlog_cil_push_background(log)
>>>>         xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>>>
> 
> This looks like the transaction commit throttling code. That was
> introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
> Throttle commits on delayed background CIL push"). The purpose of that
> change was to prevent the CIL from growing too large. FWIW, I don't
> recall that being a functional problem so it should be possible to
> simply remove that blocking point and see if that avoids the problem or
> if we simply stall out somewhere else, if you wanted to give that a
> test.

Will do. Before trying with this commit reverted, I will repeat the test without any change to see if the problem is reproducible at all.

>>>> Some other threads, including the four "cp" commands are also blocking at xfs_log_commit_cil+0x6cc
>>>>
>>>> There are also single "flush" process for each md device with this stack signature:
>>>>
>>>>       [<0>] xfs_map_blocks+0xbf/0x400
>>>>       [<0>] iomap_do_writepage+0x15e/0x880
>>>>       [<0>] write_cache_pages+0x175/0x3f0
>>>>       [<0>] iomap_writepages+0x1c/0x40
>>>>       [<0>] xfs_vm_writepages+0x59/0x80
>>>>       [<0>] do_writepages+0x4b/0xe0
>>>>       [<0>] __writeback_single_inode+0x42/0x300
>>>>       [<0>] writeback_sb_inodes+0x198/0x3f0
>>>>       [<0>] __writeback_inodes_wb+0x5e/0xc0
>>>>       [<0>] wb_writeback+0x246/0x2d0
>>>>       [<0>] wb_workfn+0x26e/0x490
>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>       [<0>] kthread+0x118/0x130
>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>
> 
> Is writeback still blocked as such or was this just a transient stack?

This is frozen. I did another collection of the stack dump today and all the stack trace (related to these two filesystems) are unchanged.

However, I discovered, that my original report was not correct. The above stack trace is from PID 348 ("kworker/u81:8+flush-9:0") while the other thread for the other device (PID 20, "kworker/u82:0+flush-9:1" doesn't have an identical thread, it has three more frames and is blocking at xfs_log_commit_cil+0x6cc as most other processes.

     [<0>] xfs_log_commit_cil+0x6cc/0x7c0
     [<0>] __xfs_trans_commit+0xab/0x320
     [<0>] xfs_bmapi_convert_delalloc+0x437/0x4b0
     [<0>] xfs_map_blocks+0x1e3/0x400
     [<0>] iomap_do_writepage+0x15e/0x880
     [<0>] write_cache_pages+0x175/0x3f0
     [<0>] iomap_writepages+0x1c/0x40
     [<0>] xfs_vm_writepages+0x59/0x80
     [<0>] do_writepages+0x4b/0xe0
     [<0>] __writeback_single_inode+0x42/0x300
     [<0>] writeback_sb_inodes+0x198/0x3f0
     [<0>] __writeback_inodes_wb+0x5e/0xc0
     [<0>] wb_writeback+0x246/0x2d0
     [<0>] wb_workfn+0x26e/0x490
     [<0>] process_one_work+0x1dd/0x3e0
     [<0>] worker_thread+0x2d/0x3b0
     [<0>] kthread+0x118/0x130
     [<0>] ret_from_fork+0x22/0x30

> 
>>>> xfs_map_blocks+0xbf is the
>>>>
>>>>       xfs_ilock(ip, XFS_ILOCK_SHARED);
>>>>
>>>> in xfs_map_blocks().
>>>>
>>>> The system is low on free memory
>>>>
>>>>       MemTotal:       197587764 kB
>>>>       MemFree:          2196496 kB
>>>>       MemAvailable:   189895408 kB
>>>>
>>>> but responsive.
>>>>
>>>> I have an out of tree driver for the HBA ( smartpqi 2.1.6-005 pulled from linux-scsi) , but it is unlikely that this blocking is related to that, because the md block devices itself are responsive (`xxd /dev/md0` )
>>>>
>>>> I can keep the system in the state for a while. Is there an idea what was going from or an idea what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
>>>>
>>>
>>> It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for
>>> each fs to get an idea of the state of the logs as well...
>>
>>
>> root@deadbird:~# for f in /sys/fs/xfs/*/log/*; do echo $f : $(cat $f);done
>> /sys/fs/xfs/md0/log/log_head_lsn : 5:714808
>> /sys/fs/xfs/md0/log/log_tail_lsn : 5:581592
>> /sys/fs/xfs/md0/log/reserve_grant_head : 5:365981696
>> /sys/fs/xfs/md0/log/write_grant_head : 5:365981696
> 
> Hm, so it looks like the log is populated but not necessarily full. What
> looks more interesting is that the grant heads (365981696 bytes) line up
> with the physical log head (714808 512b sectors). That suggests there is
> no outstanding transaction reservation and thus perhaps all workload
> tasks are sitting at that throttling point just after the current
> transaction commits and releases unused reservation. That certainly
> shouldn't be such a longstanding blocking point as it only waits for the
> CIL push to start.
> 
> Out of curiosity, have any of the above values changed since the sample
> provided here was collected? As above, I'm curious if the filesystem
> happens to be moving along slowly or not at all, whether the AIL has
> been drained in the background, etc.

Zero I/O on the block devices.
All related thread have exact same stack frame after >2days

> Could you post the xfs_info for the affected filesystems?

buczek@deadbird:~/linux_problems/mdX_raid6_looping/tests_with_deadbird/2020-12-16-01$ xfs_info /dev/md0
meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=1
data     =                       bsize=4096   blocks=27348629504, imaxpct=1
          =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
buczek@deadbird:~/linux_problems/mdX_raid6_looping/tests_with_deadbird/2020-12-16-01$ xfs_info /dev/md1
meta-data=/dev/md1               isize=512    agcount=102, agsize=268435328 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=1
data     =                       bsize=4096   blocks=27348629504, imaxpct=1
          =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


> Also since it seems like you should have plenty of available log
> reservation, are you able to perform any writable operations on the fs
> (i.e., touch <file>)? If so, I wonder if you were able to start a new
> copy workload on of the fs' capable of triggering the blocking threshold
> again, if that might eventually unstick the currently blocked tasks when
> the next CIL push occurs...

When I do something like

     touch /jbod/M8067/scratch/x.001

This doesn't block and shortly thereafter I see momentary write activity on the md and member devices. But the blocked processes don't resume.

     root@deadbird:~# iostat|grep md0 ; sleep 30 ; iostat|grep md0 ; touch /jbod/M8067/scratch/x.008 ; sleep 30 ; iostat|grep md0 ; sleep 30 ; iostat|grep md0
     md0              25.81        52.45      2385.56         0.00   12806780  582460904          0
     md0              25.81        52.45      2385.27         0.00   12806780  582460904          0
     md0              25.81        52.44      2384.98         0.00   12806780  582460908          0
     md0              25.80        52.43      2384.68         0.00   12806780  582460908          0

The touched files are visible with `ls`.

So then I tried something longer:

     cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/linux.001

which also triggerd a few I/Os in the beginning but then came to a perceived halt (no I/O in several minutes).

Interestingly, this new "cp" is not blocking on xfs_log_commit_cil+0x6cc as the four "cp" commands from the stress test but has this stack:

     [<0>] balance_dirty_pages+0x31c/0xd80
     [<0>] balance_dirty_pages_ratelimited+0x2f9/0x3c0
     [<0>] iomap_write_actor+0x11d/0x190
     [<0>] iomap_apply+0x117/0x2e0
     [<0>] iomap_file_buffered_write+0x62/0x90
     [<0>] xfs_file_buffered_aio_write+0xd3/0x320
     [<0>] new_sync_write+0x11f/0x1b0
     [<0>] vfs_write+0x1ea/0x250
     [<0>] ksys_write+0xa1/0xe0
     [<0>] do_syscall_64+0x33/0x40
     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Maybe this would change if I waited a few hours more.

So I will reboot the system now and retry test test exactly the same configuration as then again with 0e7ab7efe7745 ("xfs: Throttle commits on delayed background CIL push") reverted.

Donald

> Brian
> 
>> /sys/fs/xfs/md1/log/log_head_lsn : 3:2963880
>> /sys/fs/xfs/md1/log/log_tail_lsn : 3:2772656
>> /sys/fs/xfs/md1/log/reserve_grant_head : 3:1517506560
>> /sys/fs/xfs/md1/log/write_grant_head : 3:1517506560
>> /sys/fs/xfs/sda1/log/log_head_lsn : 233:106253
>> /sys/fs/xfs/sda1/log/log_tail_lsn : 233:106251
>> /sys/fs/xfs/sda1/log/reserve_grant_head : 233:54403812
>> /sys/fs/xfs/sda1/log/write_grant_head : 233:54403812
>> /sys/fs/xfs/sda2/log/log_head_lsn : 84:5653
>> /sys/fs/xfs/sda2/log/log_tail_lsn : 84:5651
>> /sys/fs/xfs/sda2/log/reserve_grant_head : 84:2894336
>> /sys/fs/xfs/sda2/log/write_grant_head : 84:2894336
>>
>>
>>
>>>
>>> Brian
>>>
>>>> Best
>>>>     Donald
>>>>
>>>
>>
>> -- 
>> Donald Buczek
>> buczek@molgen.mpg.de
>> Tel: +49 30 8413 1433
>>
> 

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

  reply	other threads:[~2020-12-18 18:36 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-17 17:44 v5.10.1 xfs deadlock Donald Buczek
2020-12-17 19:43 ` Brian Foster
2020-12-17 21:30   ` Donald Buczek
2020-12-18 15:35     ` Brian Foster
2020-12-18 18:35       ` Donald Buczek [this message]
2020-12-27 17:34         ` Donald Buczek
2020-12-28 23:13           ` Donald Buczek
2020-12-29 23:56             ` [PATCH] xfs: Wake CIL push waiters more reliably Donald Buczek
2020-12-30 22:16               ` Dave Chinner
2020-12-31 11:48                 ` Donald Buczek
2020-12-31 21:59                   ` Dave Chinner
2021-01-02 19:12                     ` Donald Buczek
2021-01-02 22:44                       ` Dave Chinner
2021-01-03 16:03                         ` Donald Buczek
2021-01-07 22:19                           ` Dave Chinner
2021-01-09 14:39                             ` Donald Buczek
2021-01-04 16:23                 ` Brian Foster
2021-01-07 21:54                   ` Dave Chinner
2021-01-08 16:56                     ` Brian Foster
2021-01-11 16:38                       ` Brian Foster
2021-01-13 21:53                         ` Dave Chinner
2021-02-15 13:36                           ` Donald Buczek
2021-02-16 11:18                             ` Brian Foster
2021-02-16 12:40                               ` Donald Buczek
2021-01-13 21:44                       ` Dave Chinner
     [not found]             ` <20201230024642.2171-1-hdanton@sina.com>
2020-12-30 16:54               ` Donald Buczek
2020-12-18 21:49 ` v5.10.1 xfs deadlock Dave Chinner
2020-12-21 12:22   ` Donald Buczek
2020-12-27 17:22     ` Donald Buczek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8e9a2939-220d-b12f-a24e-0fb48fa95215@molgen.mpg.de \
    --to=buczek@molgen.mpg.de \
    --cc=bfoster@redhat.com \
    --cc=it+linux-xfs@molgen.mpg.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.