All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Dave Chinner <david@fromorbit.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>,
	"Jan Kara" <jack@suse.cz>, "Darrick J. Wong" <djwong@kernel.org>,
	Jason Gunthorpe <jgg@nvidia.com>, Christoph Hellwig <hch@lst.de>,
	John Hubbard <jhubbard@nvidia.com>,
	<linux-fsdevel@vger.kernel.org>, <nvdimm@lists.linux.dev>,
	<linux-xfs@vger.kernel.org>, <linux-mm@kvack.org>,
	<linux-ext4@vger.kernel.org>
Subject: Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
Date: Tue, 20 Sep 2022 09:44:52 -0700	[thread overview]
Message-ID: <6329ee04c9272_2a6ded294bf@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <20220919212959.GL3600936@dread.disaster.area>

Dave Chinner wrote:
> On Mon, Sep 19, 2022 at 09:11:48AM -0700, Dan Williams wrote:
> > Dave Chinner wrote:
> > > On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote:
> > > > In preparation for moving DAX pages to be 0-based rather than 1-based
> > > > for the idle refcount, the fsdax core wants to have all mappings in a
> > > > "zapped" state before truncate. For typical pages this happens naturally
> > > > via unmap_mapping_range(), for DAX pages some help is needed to record
> > > > this state in the 'struct address_space' of the inode(s) where the page
> > > > is mapped.
> > > > 
> > > > That "zapped" state is recorded in DAX entries as a side effect of
> > > > xfs_break_layouts(). Arrange for it to be called before all truncation
> > > > events which already happens for truncate() and PUNCH_HOLE, but not
> > > > truncate_inode_pages_final(). Arrange for xfs_break_layouts() before
> > > > truncate_inode_pages_final().
> ....
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index 9ac59814bbb6..ebb4a6eba3fc 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -725,6 +725,27 @@ xfs_fs_drop_inode(
> > > >  	return generic_drop_inode(inode);
> > > >  }
> > > >  
> > > > +STATIC void
> > > > +xfs_fs_evict_inode(
> > > > +	struct inode		*inode)
> > > > +{
> > > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > > +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > > +	long			error;
> > > > +
> > > > +	xfs_ilock(ip, iolock);
> > > 
> > > I'm guessing you never ran this through lockdep.
> > 
> > I always run with lockdep enabled in my development kernels, but maybe my
> > testing was insufficient? Somewhat moot with your concerns below...
> 
> I'm guessing your testing doesn't generate inode cache pressure and
> then have direct memory reclaim inodes. e.g. on a directory inode
> this will trigger lockdep immediately because readdir locks with
> XFS_IOLOCK_SHARED and then does GFP_KERNEL memory reclaim. If we try
> to take XFS_IOLOCK_EXCL from memory reclaim of directory inodes,
> lockdep will then shout from the rooftops...

Got it.

> 
> > > > +
> > > > +	truncate_inode_pages_final(&inode->i_data);
> > > > +	clear_inode(inode);
> > > > +
> > > > +	xfs_iunlock(ip, iolock);
> > > > +}
> > > 
> > > That all said, this really looks like a bit of a band-aid.
> > 
> > It definitely is since DAX is in this transitory state between doing
> > some activities page-less and others with page metadata. If DAX was
> > fully committed to behaving like a typical page then
> > unmap_mapping_range() would have already satisfied this reference
> > counting situation.
> > 
> > > I can't work out why would we we ever have an actual layout lease
> > > here that needs breaking given they are file based and active files
> > > hold a reference to the inode. If we ever break that, then I suspect
> > > this change will cause major problems for anyone using pNFS with XFS
> > > as xfs_break_layouts() can end up waiting for NFS delegation
> > > revocation. This is something we should never be doing in inode
> > > eviction/memory reclaim.
> > > 
> > > Hence I have to ask why this lease break is being done
> > > unconditionally for all inodes, instead of only calling
> > > xfs_break_dax_layouts() directly on DAX enabled regular files?  I
> > > also wonder what exciting new system deadlocks this will create
> > > because BREAK_UNMAP_FINAL can essentially block forever waiting on
> > > dax mappings going away. If that DAX mapping reclaim requires memory
> > > allocations.....
> > 
> > There should be no memory allocations in the DAX mapping reclaim path.
> > Also, the page pins it waits for are precluded from being GUP_LONGTERM.
> 
> So if the task that holds the pin needs memory allocation before it
> can unpin the page to allow direct inode reclaim to make progress?

No, it couldn't, and I realize now that GUP_LONGTERM has nothing to do
with this hang since any GFP_KERNEL in a path that took a DAX page pin
path could run afoul of this need to wait.

So, this has me looking at invalidate_inodes() and iput_final(), where I
did not see the reclaim entanglement, and thinking DAX has the unique
requirement to make sure that no access to a page outlives the hosting
inode.

Not that I need to tell you, but to get my own thinking straight,
compare that to typical page cache as the pinner can keep a pinned
page-cache page as long as it wants even after it has been truncated.
DAX needs to make sure that truncate_inode_pages() ceases all access to
the page synchronous with the truncate. The typical page-cache will
ensure that the next mapping of the file will get a new page if the page
previously pinned for that offset is still in use, DAX can not offer
that as the same page that was previously pinned is always used.

So I think this means something like this:

diff --git a/fs/inode.c b/fs/inode.c
index 6462276dfdf0..ab16772b9a8d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -784,6 +784,11 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
                        continue;
                }
 
+               if (dax_inode_busy(inode)) {
+                       busy = 1;
+                       continue;
+               }
+
                inode->i_state |= I_FREEING;
                inode_lru_list_del(inode);
                spin_unlock(&inode->i_lock);
@@ -1733,6 +1738,8 @@ static void iput_final(struct inode *inode)
                spin_unlock(&inode->i_lock);
 
                write_inode_now(inode, 1);
+               if (IS_DAX(inode))
+                       dax_break_layouts(inode);
 
                spin_lock(&inode->i_lock);
                state = inode->i_state;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..e4a74ab310b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3028,8 +3028,20 @@ extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
 extern int inode_needs_sync(struct inode *inode);
 extern int generic_delete_inode(struct inode *inode);
+
+static inline bool dax_inode_busy(struct inode *inode)
+{
+       if (!IS_DAX(inode))
+               return false;
+
+       return dax_zap_pages(inode) != NULL;
+}
+
 static inline int generic_drop_inode(struct inode *inode)
 {
+       if (dax_inode_busy(inode))
+               return 0;
+
        return !inode->i_nlink || inode_unhashed(inode);
 }
 extern void d_mark_dontcache(struct inode *inode);

...where generic code skips over dax-inodes with pinned pages.

> 
> > > /me looks deeper into the dax_layout_busy_page() stuff and realises
> > > that both ext4 and XFS implementations of ext4_break_layouts() and
> > > xfs_break_dax_layouts() are actually identical.
> > > 
> > > That is, filemap_invalidate_unlock() and xfs_iunlock(ip,
> > > XFS_MMAPLOCK_EXCL) operate on exactly the same
> > > inode->i_mapping->invalidate_lock. Hence the implementations in ext4
> > > and XFS are both functionally identical.
> > 
> > I assume you mean for the purposes of this "final" break since
> > xfs_file_allocate() holds XFS_IOLOCK_EXCL over xfs_break_layouts().
> 
> No, I'm just looking at the two *dax* functions - we don't care what
> locks xfs_break_layouts() requires - dax mapping manipulation is
> covered by the mapping->invalidate_lock and not the inode->i_rwsem.
> This is explicitly documented in the code by the the asserts in both
> ext4_break_layouts() and xfs_break_dax_layouts().
> 
> XFS holds the inode->i_rwsem over xfs_break_layouts() because we
> have to break *file layout leases* from there, too. These are
> serialised by the inode->i_rwsem, not the mapping->invalidate_lock.

Got it, will make generic helpers for the scenario where only
dax-break-layouts needs to be performed.

  reply	other threads:[~2022-09-20 16:44 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
2022-09-20 14:30   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
2022-09-20 14:31   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 03/18] fsdax: Include unmapped inodes for page-idle detection Dan Williams
2022-09-16  3:35 ` [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path Dan Williams
2022-09-16  3:35 ` [PATCH v2 05/18] xfs: Add xfs_break_layouts() " Dan Williams
2022-09-18 22:57   ` Dave Chinner
2022-09-19 16:11     ` Dan Williams
2022-09-19 21:29       ` Dave Chinner
2022-09-20 16:44         ` Dan Williams [this message]
2022-09-21 22:14           ` Dave Chinner
2022-09-21 22:28             ` Jason Gunthorpe
2022-09-23  0:18               ` Dave Chinner
2022-09-23  0:41                 ` Dan Williams
2022-09-23  2:10                   ` Dave Chinner
2022-09-23  9:38                     ` Jan Kara
2022-09-23 23:06                       ` Dan Williams
2022-09-25 23:54                       ` Dave Chinner
2022-09-26 14:10                         ` Jan Kara
2022-09-29 23:33                           ` Dan Williams
2022-09-30 13:41                             ` Jan Kara
2022-09-30 17:56                               ` Dan Williams
2022-09-30 18:06                                 ` Jason Gunthorpe
2022-09-30 18:46                                   ` Dan Williams
2022-10-03  7:55                                   ` Jan Kara
2022-09-23 12:39                     ` Jason Gunthorpe
2022-09-26  0:34                       ` Dave Chinner
2022-09-26 13:04                         ` Jason Gunthorpe
2022-09-22  0:02             ` Dan Williams
2022-09-22  0:10               ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings() Dan Williams
2022-09-16  3:35 ` [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
2022-09-16  3:35 ` [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry() Dan Williams
2022-09-16  3:36 ` [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention Dan Williams
2022-09-16  3:36 ` [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion Dan Williams
2022-09-21 14:03   ` Jason Gunthorpe
2022-09-21 15:18     ` Dan Williams
2022-09-21 21:38       ` Dan Williams
2022-09-21 22:07         ` Jason Gunthorpe
2022-09-22  0:14           ` Dan Williams
2022-09-22  0:25             ` Jason Gunthorpe
2022-09-22  2:17               ` Dan Williams
2022-09-22 17:55                 ` Jason Gunthorpe
2022-09-22 21:54                   ` Dan Williams
2022-09-23  1:36                     ` Dave Chinner
2022-09-23  2:01                       ` Dan Williams
2022-09-23 13:24                     ` Jason Gunthorpe
2022-09-23 16:29                       ` Dan Williams
2022-09-23 17:42                         ` Jason Gunthorpe
2022-09-23 19:03                           ` Dan Williams
2022-09-23 19:23                             ` Jason Gunthorpe
2022-09-27  6:07                             ` Alistair Popple
2022-09-27 12:56                               ` Jason Gunthorpe
2022-09-16  3:36 ` [PATCH v2 11/18] devdax: Minor warning fixups Dan Williams
2022-09-16  3:36 ` [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core Dan Williams
2022-09-27  6:20   ` Alistair Popple
2022-09-29 22:38     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 13/18] dax: Prep mapping helpers for compound pages Dan Williams
2022-09-21 14:06   ` Jason Gunthorpe
2022-09-21 15:19     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
2022-09-16  3:36 ` [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
2022-09-21 14:10   ` Jason Gunthorpe
2022-09-21 15:48     ` Dan Williams
2022-09-21 22:23       ` Jason Gunthorpe
2022-09-22  0:15         ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count Dan Williams
2022-09-21 15:24   ` Jason Gunthorpe
2022-09-21 23:45     ` Dan Williams
2022-09-22  0:03       ` Alistair Popple
2022-09-22  0:04       ` Jason Gunthorpe
2022-09-22  0:34         ` Dan Williams
2022-09-22  1:36           ` Alistair Popple
2022-09-22  2:34             ` Dan Williams
2022-09-26  6:17               ` Alistair Popple
2022-09-22  0:13       ` John Hubbard
2022-09-16  3:36 ` [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs() Dan Williams
2022-09-16  3:36 ` [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting Dan Williams
2022-09-20 14:29 ` [PATCH v2 00/18] Fix the DAX-gup mistake Jason Gunthorpe
2022-09-20 16:50   ` Dan Williams
2022-11-09  0:20 ` Andrew Morton
2022-11-09 11:38   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6329ee04c9272_2a6ded294bf@dwillia2-xfh.jf.intel.com.notmuch \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=nvdimm@lists.linux.dev \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.