linux-xfs.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Chris Dunlop <chris@onthe.net.au>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Amir Goldstein <amir73il@gmail.com>,
	linux-xfs <linux-xfs@vger.kernel.org>
Subject: Re: Highly reflinked and fragmented considered harmful?
Date: Tue, 10 May 2022 18:16:32 +1000	[thread overview]
Message-ID: <20220510081632.GS1098723@dread.disaster.area> (raw)
In-Reply-To: <20220510063051.GA215522@onthe.net.au>

On Tue, May 10, 2022 at 04:30:51PM +1000, Chris Dunlop wrote:
> On Mon, May 09, 2022 at 10:10:57PM -0700, Darrick J. Wong wrote:
> > On Tue, May 10, 2022 at 07:07:35AM +0300, Amir Goldstein wrote:
> > > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > > data could take days, the entire time blocking other tasks like "rm" and
> > > > "df" on the same filesystem?
> ...
> > > From a product POV, I think what should have happened here is that
> > > freeing up the space would have taken 10 days in the background, but
> > > otherwise, filesystem should not have been blocking other processes
> > > for long periods of time.

Sure, that's obvious, and without looking at the code I know what
that is: statfs()

> > Indeed.  Chris, do you happen to have the sysrq-w output handy?  I'm
> > curious if the stall warning backtraces all had xfs_inodegc_flush() in
> > them, or were there other parts of the system stalling elsewhere too?
> > 50 billion updates is a lot, but there shouldn't be stall warnings.
> 
> Sure: https://file.io/25za5BNBlnU8  (6.8M)
> 
> Of the 3677 tasks in there, only 38 do NOT show xfs_inodegc_flush().

Yup, 3677 tasks in statfs(). The majority are rm, df, and check_disk
processes; there are a couple of veeamagent processes stuck, and an
(ostnamed) process, whatever that is...

No real surprises there, and it's not why the filesystem is taking
so long to remove all the reflink references.

There is just one background inodegc worker thread running, so
there's no excessive load being generated by inodegc, either. It's
no different to a single rm running xfs_inactive() directly on a
slightly older kernel and filling the journal:

May 06 09:49:01 d5 kernel: task:kworker/6:1     state:D stack:    0 pid:23258 ppid:     2 flags:0x00004000
May 06 09:49:01 d5 kernel: Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker [xfs]
May 06 09:49:01 d5 kernel: Call Trace:
May 06 09:49:01 d5 kernel:  <TASK>
May 06 09:49:01 d5 kernel:  __schedule+0x241/0x740
May 06 09:49:01 d5 kernel:  schedule+0x3a/0xa0
May 06 09:49:01 d5 kernel:  schedule_timeout+0x271/0x310
May 06 09:49:01 d5 kernel:  __down+0x6c/0xa0
May 06 09:49:01 d5 kernel:  down+0x3b/0x50
May 06 09:49:01 d5 kernel:  xfs_buf_lock+0x40/0xe0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_find.isra.32+0x3ee/0x730 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_get_map+0x3c/0x430 [xfs]
May 06 09:49:01 d5 kernel:  xfs_buf_read_map+0x37/0x2c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_trans_read_buf_map+0x1cb/0x300 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_read_buf_block.constprop.40+0x75/0xb0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_lookup_get_block+0x85/0x150 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_overlapped_query_range+0x33c/0x3c0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_btree_query_range+0xd5/0x100 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_query_range+0x71/0x80 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_lookup_le_range+0x88/0x180 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_unmap_shared+0x89/0x560 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_finish_one+0x201/0x260 [xfs]
May 06 09:49:01 d5 kernel:  xfs_rmap_update_finish_item+0x33/0x60 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish_noroll+0x215/0x5a0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_defer_finish+0x13/0x70 [xfs]
May 06 09:49:01 d5 kernel:  xfs_itruncate_extents_flags+0xc4/0x240 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive_truncate+0x7f/0xc0 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inactive+0x10c/0x130 [xfs]
May 06 09:49:01 d5 kernel:  xfs_inodegc_worker+0xb5/0x140 [xfs]
May 06 09:49:01 d5 kernel:  process_one_work+0x2a8/0x4c0
May 06 09:49:01 d5 kernel:  worker_thread+0x21b/0x3c0
May 06 09:49:01 d5 kernel:  kthread+0x121/0x140

There are 18 tasks in destroy_inode() blocked on a workqueue flush.
These are new unlinks that are getting throttled because the
per-cpu inodegc queue is full and work is ongoing. Not a huge
deal; maybe we should look at handing a full queue over to another
CPU if the neighbouring CPU has an empty queue. That would reduce
unnecessary throttling, though it may mean more long-running
inodegc processes in the background....

Really, though, I don't see anything deadlocked, just a system
backed up doing a really large amount of metadata modification.
Everything is sitting on throttles or waiting on IO, making slow
forward progress as metadata writeback allows log space to be
freed.

I think we should just accept that statfs() can never report
exactly accurate space usage to userspace and get rid of the flush.
The specific fstests dependency on rm immediately turning unlinked
inodes into free space should be worked around in fstests itself
rather than in the kernel code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 16+ messages
2022-05-09  2:46 Highly reflinked and fragmented considered harmful? Chris Dunlop
2022-05-09 23:09 ` Dave Chinner
2022-05-10  2:55   ` Chris Dunlop
2022-05-10  5:14     ` Darrick J. Wong
2022-05-10  4:07   ` Amir Goldstein
2022-05-10  5:10     ` Darrick J. Wong
2022-05-10  6:30       ` Chris Dunlop
2022-05-10  8:16         ` Dave Chinner [this message]
2022-05-10 19:19           ` Darrick J. Wong
2022-05-10 21:54             ` Dave Chinner
2022-05-11  0:37               ` Darrick J. Wong
2022-05-11  1:36                 ` Dave Chinner
2022-05-11  2:16                   ` Chris Dunlop
2022-05-11  2:52                     ` Dave Chinner
2022-05-11  3:58                       ` Chris Dunlop
2022-05-11  5:18                         ` Dave Chinner
