linux-xfs.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Chris Dunlop <chris@onthe.net.au>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Amir Goldstein <amir73il@gmail.com>,
	linux-xfs <linux-xfs@vger.kernel.org>
Subject: Re: Highly reflinked and fragmented considered harmful?
Date: Tue, 10 May 2022 18:16:32 +1000	[thread overview]
Message-ID: <20220510081632.GS1098723@dread.disaster.area> (raw)
In-Reply-To: <20220510063051.GA215522@onthe.net.au>

On Tue, May 10, 2022 at 04:30:51PM +1000, Chris Dunlop wrote:
> On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> > On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> > > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > > data could take days, the entire time blocking other tasks like "rm" and
> > > > "df" on the same filesystem?
> ...
> > > From a product POV, I think what should have happened here is that
> > > freeing up the space would have taken 10 days in the background, but
> > > otherwise, filesystem should not have been blocking other processes
> > > for long periods of time.

Sure, that's obvious, and without looking at the code I know what
that is: statfs()

> > Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
> > curious if the stall warning backtraces all had xfs_inodegc_flush() in
> > them, or were there other parts of the system stalling elsewhere too?
> > 50 billion updates is a lot, but there shouldn't be stall warnings.
> 
> Sure: https://file.io/25za5BNBlnU8  (6.8M)
> 
> Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().

Yup, 3677 tasks in statfs(). The majority are rm, df, and check_disk
processes; there are a couple of veeamagent processes stuck, and an
(ostnamed) process, whatever that is...

No real surprises there, and it's not why the filesystem is taking
so long to remove all the reflink references.

There is just one background inodegc worker thread running, so
there's no excessive load being generated by inodegc, either. It's
no different to a single rm running xfs_inactive() directly on a
slightly older kernel and filling the journal:

May 06 09:49:01 d5 kernel: task:kworker/6:1     state:D stack:    0 pid:23258 ppid:     2 flags:0x00004000
May 06 09:49:01 d5 kernel: Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker [xfs]
May 06 09:49:01 d5 kernel: Call Trace:
May 06 09:49:01 d5 kernel:  <TASK>
May 06 09:49:01 d5 kernel:  __schedule+0x241/0x740
May 06 09:49:01 d5 kernel:  schedule+0x3a/0xa0
May 06 09:49:01 d5 kernel:  schedule_timeout+0x271/0x310
May 06 09:49:01 d5 kernel:  __down+0x6c/0xa0
May 06 09:49:01 d5 kernel:  down+0x3b/0x50
May 06 09:49:01 d5 kernel:  xfs_buf_lock+0x40/0xe0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_find.isra.32+0x3ee/0x730 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_get_map+0x3c/0x430 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_read_map+0x37/0x2c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_trans_read_buf_map+0x1cb/0x300 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_read_buf_block.constprop.40+0x75/0xb0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_lookup_get_block+0x85/0x150 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_overlapped_query_range+0x33c/0x3c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_query_range+0xd5/0x100 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_query_range+0x71/0x80 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_lookup_le_range+0x88/0x180 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_unmap_shared+0x89/0x560 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_finish_one+0x201/0x260 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_update_finish_item+0x33/0x60 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish_noroll+0x215/0x5a0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish+0x13/0x70 [xfs]
May 06 09:49:01 d5 kernel:  xfs_itruncate_extents_flags+0xc4/0x240 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive_truncate+0x7f/0xc0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive+0x10c/0x130 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inodegc_worker+0xb5/0x140 [xfs]
May 06 09:49:01 d5 kernel:  process_one_work+0x2a8/0x4c0
May 06 09:49:01 d5 kernel:  worker_thread+0x21b/0x3c0
May 06 09:49:01 d5 kernel:  kthread+0x121/0x140

There are 18 tasks in destroy_inode() blocked on a workqueue flush.
These are new unlinks that are getting throttled because the
per-cpu inodegc queue is full and work is ongoing. Not a huge
deal; maybe we should look at handing a full queue over to another
CPU if the neighbouring CPU has an empty queue. That would reduce
unnecessary throttling, though it may mean more long-running
inodegc processes in the background....

Really, though, I don't see anything deadlocked, just a system
backed up doing a really large amount of metadata modification.
Everything is sitting on throttles or waiting on IO, making slow
forward progress as metadata writeback allows log space to be
freed.

I think we should just accept that statfs() can never report
exactly accurate space usage to userspace and get rid of the flush.
The specific fstests dependency on rm immediately turning unlinked
inodes into free space should be worked around in fstests itself
rather than in the kernel code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 16+ messages
2022-05-09  2:46 Highly reflinked and fragmented considered harmful? Chris Dunlop
2022-05-09 23:09 ` Dave Chinner
2022-05-10  2:55   ` Chris Dunlop
2022-05-10  5:14     ` Darrick J. Wong
2022-05-10  4:07   ` Amir Goldstein
2022-05-10  5:10     ` Darrick J. Wong
2022-05-10  6:30       ` Chris Dunlop
2022-05-10  8:16         ` Dave Chinner [this message]
2022-05-10 19:19           ` Darrick J. Wong
2022-05-10 21:54             ` Dave Chinner
2022-05-11  0:37               ` Darrick J. Wong
2022-05-11  1:36                 ` Dave Chinner
2022-05-11  2:16                   ` Chris Dunlop
2022-05-11  2:52                     ` Dave Chinner
2022-05-11  3:58                       ` Chris Dunlop
2022-05-11  5:18                         ` Dave Chinner
