All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: akpm@linux-foundation.org, Matthew Wilcox <willy@infradead.org>,
	Jan Kara <jack@suse.cz>, "Darrick J. Wong" <djwong@kernel.org>,
	Jason Gunthorpe <jgg@nvidia.com>, Christoph Hellwig <hch@lst.de>,
	John Hubbard <jhubbard@nvidia.com>,
	linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org
Subject: Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
Date: Mon, 19 Sep 2022 08:57:31 +1000	[thread overview]
Message-ID: <20220918225731.GG3600936@dread.disaster.area> (raw)
In-Reply-To: <166329933874.2786261.18236541386474985669.stgit@dwillia2-xfh.jf.intel.com>

On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote:
> In preparation for moving DAX pages to be 0-based rather than 1-based
> for the idle refcount, the fsdax core wants to have all mappings in a
> "zapped" state before truncate. For typical pages this happens naturally
> via unmap_mapping_range(), for DAX pages some help is needed to record
> this state in the 'struct address_space' of the inode(s) where the page
> is mapped.
> 
> That "zapped" state is recorded in DAX entries as a side effect of
> xfs_break_layouts(). Arrange for it to be called before all truncation
> events which already happens for truncate() and PUNCH_HOLE, but not
> truncate_inode_pages_final(). Arrange for xfs_break_layouts() before
> truncate_inode_pages_final().

Ugh. That's nasty and awful.



> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_file.c  |   13 +++++++++----
>  fs/xfs/xfs_inode.c |    3 ++-
>  fs/xfs/xfs_inode.h |    6 ++++--
>  fs/xfs/xfs_super.c |   22 ++++++++++++++++++++++
>  4 files changed, 37 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 556e28d06788..d3ff692d5546 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -816,7 +816,8 @@ xfs_wait_dax_page(
>  int
>  xfs_break_dax_layouts(
>  	struct inode		*inode,
> -	bool			*retry)
> +	bool			*retry,
> +	int			state)
>  {
>  	struct page		*page;
>  
> @@ -827,8 +828,8 @@ xfs_break_dax_layouts(
>  		return 0;
>  
>  	*retry = true;
> -	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
> -				 0, 0, xfs_wait_dax_page(inode));
> +	return ___wait_var_event(page, dax_page_idle(page), state, 0, 0,
> +				 xfs_wait_dax_page(inode));
>  }
>  
>  int
> @@ -839,14 +840,18 @@ xfs_break_layouts(
>  {
>  	bool			retry;
>  	int			error;
> +	int			state = TASK_INTERRUPTIBLE;
>  
>  	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>  
>  	do {
>  		retry = false;
>  		switch (reason) {
> +		case BREAK_UNMAP_FINAL:
> +			state = TASK_UNINTERRUPTIBLE;
> +			fallthrough;
>  		case BREAK_UNMAP:
> -			error = xfs_break_dax_layouts(inode, &retry);
> +			error = xfs_break_dax_layouts(inode, &retry, state);
>  			if (error || retry)
>  				break;
>  			fallthrough;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 28493c8e9bb2..72ce1cb72736 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3452,6 +3452,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
>  	struct xfs_inode	*ip1,
>  	struct xfs_inode	*ip2)
>  {
> +	int			state = TASK_INTERRUPTIBLE;
>  	int			error;
>  	bool			retry;
>  	struct page		*page;
> @@ -3463,7 +3464,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
>  	retry = false;
>  	/* Lock the first inode */
>  	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> -	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> +	error = xfs_break_dax_layouts(VFS_I(ip1), &retry, state);
>  	if (error || retry) {
>  		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
>  		if (error == 0 && retry)
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index fa780f08dc89..e4994eb6e521 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -454,11 +454,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>   * layout-holder has a consistent view of the file's extent map. While
>   * BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases,
>   * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to
> - * go idle.
> + * go idle. BREAK_UNMAP_FINAL is an uninterruptible version of
> + * BREAK_UNMAP.
>   */
>  enum layout_break_reason {
>          BREAK_WRITE,
>          BREAK_UNMAP,
> +        BREAK_UNMAP_FINAL,
>  };
>  
>  /*
> @@ -531,7 +533,7 @@ xfs_itruncate_extents(
>  }
>  
>  /* from xfs_file.c */
> -int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
> +int	xfs_break_dax_layouts(struct inode *inode, bool *retry, int state);
>  int	xfs_break_layouts(struct inode *inode, uint *iolock,
>  		enum layout_break_reason reason);
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 9ac59814bbb6..ebb4a6eba3fc 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -725,6 +725,27 @@ xfs_fs_drop_inode(
>  	return generic_drop_inode(inode);
>  }
>  
> +STATIC void
> +xfs_fs_evict_inode(
> +	struct inode		*inode)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> +	long			error;
> +
> +	xfs_ilock(ip, iolock);

I'm guessing you never ran this through lockdep.

The general rule is that XFS should not take inode locks directly in
the inode eviction path because lockdep tends to throw all manner of
memory reclaim related false positives when we do this. We most
definitely don't want to be doing this for anything other than
regular files that are DAX enabled, yes?

We also don't want to arbitrarily block memory reclaim for long
periods of time waiting on an inode lock.  People seem to get very
upset when we introduce unbound latencies into the memory reclaim
path...

Indeed, what are you actually trying to serialise against here?
Nothing should have a reference to the inode, nor should anything be
able to find and take a new reference to the inode while it is being
evicted....

> +	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL);
> +
> +	/* The final layout break is uninterruptible */
> +	ASSERT_ALWAYS(!error);

We don't do error handling with BUG(). If xfs_break_layouts() truly
can't fail (what happens if the fs is shut down and some internal
call path now detects that and returns -EFSCORRUPTED?), theni
WARN_ON_ONCE() and continuing to tear down the inode so the system
is not immediately compromised is the appropriate action here.

> +
> +	truncate_inode_pages_final(&inode->i_data);
> +	clear_inode(inode);
> +
> +	xfs_iunlock(ip, iolock);
> +}

That all said, this really looks like a bit of a band-aid.

I can't work out why would we we ever have an actual layout lease
here that needs breaking given they are file based and active files
hold a reference to the inode. If we ever break that, then I suspect
this change will cause major problems for anyone using pNFS with XFS
as xfs_break_layouts() can end up waiting for NFS delegation
revocation. This is something we should never be doing in inode
eviction/memory reclaim.

Hence I have to ask why this lease break is being done
unconditionally for all inodes, instead of only calling
xfs_break_dax_layouts() directly on DAX enabled regular files?  I
also wonder what exciting new system deadlocks this will create
because BREAK_UNMAP_FINAL can essentially block forever waiting on
dax mappings going away. If that DAX mapping reclaim requires memory
allocations.....

/me looks deeper into the dax_layout_busy_page() stuff and realises
that both ext4 and XFS implementations of ext4_break_layouts() and
xfs_break_dax_layouts() are actually identical.

That is, filemap_invalidate_unlock() and xfs_iunlock(ip,
XFS_MMAPLOCK_EXCL) operate on exactly the same
inode->i_mapping->invalidate_lock. Hence the implementations in ext4
and XFS are both functionally identical. Further, when the inode is
in the eviction path there is no reason for needing to take that
mapping->invalidation_lock to invalidate remaining stale DAX
mappings before truncate blasts them away.

IOWs, I don't see why fixing this problem needs to add new code to
XFS or ext4 at all. The DAX mapping invalidation and waiting can be
done enitrely within truncate_inode_pages_final() (conditional on
IS_DAX()) after mapping_set_exiting() has been set with generic code
and it should not require locking at all. I also think that
ext4_break_layouts() and xfs_break_dax_layouts() should be merged
into a generic dax infrastructure function so the filesystems don't
need to care about the internal details of DAX mappings at all...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2022-09-18 23:29 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
2022-09-20 14:30   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
2022-09-20 14:31   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 03/18] fsdax: Include unmapped inodes for page-idle detection Dan Williams
2022-09-16  3:35 ` [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path Dan Williams
2022-09-16  3:35 ` [PATCH v2 05/18] xfs: Add xfs_break_layouts() " Dan Williams
2022-09-18 22:57   ` Dave Chinner [this message]
2022-09-19 16:11     ` Dan Williams
2022-09-19 21:29       ` Dave Chinner
2022-09-20 16:44         ` Dan Williams
2022-09-21 22:14           ` Dave Chinner
2022-09-21 22:28             ` Jason Gunthorpe
2022-09-23  0:18               ` Dave Chinner
2022-09-23  0:41                 ` Dan Williams
2022-09-23  2:10                   ` Dave Chinner
2022-09-23  9:38                     ` Jan Kara
2022-09-23 23:06                       ` Dan Williams
2022-09-25 23:54                       ` Dave Chinner
2022-09-26 14:10                         ` Jan Kara
2022-09-29 23:33                           ` Dan Williams
2022-09-30 13:41                             ` Jan Kara
2022-09-30 17:56                               ` Dan Williams
2022-09-30 18:06                                 ` Jason Gunthorpe
2022-09-30 18:46                                   ` Dan Williams
2022-10-03  7:55                                   ` Jan Kara
2022-09-23 12:39                     ` Jason Gunthorpe
2022-09-26  0:34                       ` Dave Chinner
2022-09-26 13:04                         ` Jason Gunthorpe
2022-09-22  0:02             ` Dan Williams
2022-09-22  0:10               ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings() Dan Williams
2022-09-16  3:35 ` [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
2022-09-16  3:35 ` [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry() Dan Williams
2022-09-16  3:36 ` [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention Dan Williams
2022-09-16  3:36 ` [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion Dan Williams
2022-09-21 14:03   ` Jason Gunthorpe
2022-09-21 15:18     ` Dan Williams
2022-09-21 21:38       ` Dan Williams
2022-09-21 22:07         ` Jason Gunthorpe
2022-09-22  0:14           ` Dan Williams
2022-09-22  0:25             ` Jason Gunthorpe
2022-09-22  2:17               ` Dan Williams
2022-09-22 17:55                 ` Jason Gunthorpe
2022-09-22 21:54                   ` Dan Williams
2022-09-23  1:36                     ` Dave Chinner
2022-09-23  2:01                       ` Dan Williams
2022-09-23 13:24                     ` Jason Gunthorpe
2022-09-23 16:29                       ` Dan Williams
2022-09-23 17:42                         ` Jason Gunthorpe
2022-09-23 19:03                           ` Dan Williams
2022-09-23 19:23                             ` Jason Gunthorpe
2022-09-27  6:07                             ` Alistair Popple
2022-09-27 12:56                               ` Jason Gunthorpe
2022-09-16  3:36 ` [PATCH v2 11/18] devdax: Minor warning fixups Dan Williams
2022-09-16  3:36 ` [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core Dan Williams
2022-09-27  6:20   ` Alistair Popple
2022-09-29 22:38     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 13/18] dax: Prep mapping helpers for compound pages Dan Williams
2022-09-21 14:06   ` Jason Gunthorpe
2022-09-21 15:19     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
2022-09-16  3:36 ` [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
2022-09-21 14:10   ` Jason Gunthorpe
2022-09-21 15:48     ` Dan Williams
2022-09-21 22:23       ` Jason Gunthorpe
2022-09-22  0:15         ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count Dan Williams
2022-09-21 15:24   ` Jason Gunthorpe
2022-09-21 23:45     ` Dan Williams
2022-09-22  0:03       ` Alistair Popple
2022-09-22  0:04       ` Jason Gunthorpe
2022-09-22  0:34         ` Dan Williams
2022-09-22  1:36           ` Alistair Popple
2022-09-22  2:34             ` Dan Williams
2022-09-26  6:17               ` Alistair Popple
2022-09-22  0:13       ` John Hubbard
2022-09-16  3:36 ` [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs() Dan Williams
2022-09-16  3:36 ` [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting Dan Williams
2022-09-20 14:29 ` [PATCH v2 00/18] Fix the DAX-gup mistake Jason Gunthorpe
2022-09-20 16:50   ` Dan Williams
2022-11-09  0:20 ` Andrew Morton
2022-11-09 11:38   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220918225731.GG3600936@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=nvdimm@lists.linux.dev \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.