nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: akpm@linux-foundation.org, Matthew Wilcox <willy@infradead.org>,
	Jan Kara <jack@suse.cz>, "Darrick J. Wong" <djwong@kernel.org>,
	Jason Gunthorpe <jgg@nvidia.com>, Christoph Hellwig <hch@lst.de>,
	John Hubbard <jhubbard@nvidia.com>,
	linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev,
	linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org
Subject: Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
Date: Thu, 22 Sep 2022 08:14:16 +1000	[thread overview]
Message-ID: <20220921221416.GT3600936@dread.disaster.area> (raw)
In-Reply-To: <6329ee04c9272_2a6ded294bf@dwillia2-xfh.jf.intel.com.notmuch>

On Tue, Sep 20, 2022 at 09:44:52AM -0700, Dan Williams wrote:
> Dave Chinner wrote:
> > On Mon, Sep 19, 2022 at 09:11:48AM -0700, Dan Williams wrote:
> > > Dave Chinner wrote:
> > > > That all said, this really looks like a bit of a band-aid.
> > > 
> > > It definitely is since DAX is in this transitory state between doing
> > > some activities page-less and others with page metadata. If DAX was
> > > fully committed to behaving like a typical page then
> > > unmap_mapping_range() would have already satisfied this reference
> > > counting situation.
> > > 
> > > > I can't work out why would we we ever have an actual layout lease
> > > > here that needs breaking given they are file based and active files
> > > > hold a reference to the inode. If we ever break that, then I suspect
> > > > this change will cause major problems for anyone using pNFS with XFS
> > > > as xfs_break_layouts() can end up waiting for NFS delegation
> > > > revocation. This is something we should never be doing in inode
> > > > eviction/memory reclaim.
> > > > 
> > > > Hence I have to ask why this lease break is being done
> > > > unconditionally for all inodes, instead of only calling
> > > > xfs_break_dax_layouts() directly on DAX enabled regular files?  I
> > > > also wonder what exciting new system deadlocks this will create
> > > > because BREAK_UNMAP_FINAL can essentially block forever waiting on
> > > > dax mappings going away. If that DAX mapping reclaim requires memory
> > > > allocations.....
> > > 
> > > There should be no memory allocations in the DAX mapping reclaim path.
> > > Also, the page pins it waits for are precluded from being GUP_LONGTERM.
> > 
> > So if the task that holds the pin needs memory allocation before it
> > can unpin the page to allow direct inode reclaim to make progress?
> 
> No, it couldn't, and I realize now that GUP_LONGTERM has nothing to do
> with this hang since any GFP_KERNEL in a path that took a DAX page pin
> path could run afoul of this need to wait.
> 
> So, this has me looking at invalidate_inodes() and iput_final(), where I
> did not see the reclaim entanglement, and thinking DAX has the unique
> requirement to make sure that no access to a page outlives the hosting
> inode.
> 
> Not that I need to tell you, but to get my own thinking straight,
> compare that to typical page cache as the pinner can keep a pinned
> page-cache page as long as it wants even after it has been truncated.

Right, because the page pin prevents the page from being freed
after the page references the page cache keeps have been released.

But page cache page != DAX page. The DAX page is a direct reference
to the storage media, not a generic reference counted kernel page
that the kernel will keep alive as long as there is a reference to
it.

Hence for a DAX page, we have to revoke all access to the page
before the controlling owner context is torn down, otherwise we have
a use-after-free scenario at the storage media level. For a FSDAX
file data page, that owner context is the inode...

> DAX needs to make sure that truncate_inode_pages() ceases all access to
> the page synchronous with the truncate.

Yes, exactly.

>
> The typical page-cache will
> ensure that the next mapping of the file will get a new page if the page
> previously pinned for that offset is still in use, DAX can not offer
> that as the same page that was previously pinned is always used.

Yes, because the new DAX ipage lookup will return the original page
in the storage media, not a newly instantiated page cache page.

> So I think this means something like this:
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 6462276dfdf0..ab16772b9a8d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -784,6 +784,11 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>                         continue;
>                 }
>  
> +               if (dax_inode_busy(inode)) {
> +                       busy = 1;
> +                       continue;
> +               }

That this does more than a check (i.e. it runs whatever
dax_zap_pages() does) means it cannot be run under the inode
spinlock.

As this is called from the block device code when a bdev is being
removed (i.e. will only find a superblock and inodes to invalidate
on hot-unplug), shouldn't this DAX mapping invalidation actually be
handled by the pmem failure notification infrastructure we've just
added for reflink?

> +
>                 inode->i_state |= I_FREEING;
>                 inode_lru_list_del(inode);
>                 spin_unlock(&inode->i_lock);
> @@ -1733,6 +1738,8 @@ static void iput_final(struct inode *inode)
>                 spin_unlock(&inode->i_lock);
>  
>                 write_inode_now(inode, 1);
> +               if (IS_DAX(inode))
> +                       dax_break_layouts(inode);
>  
>                 spin_lock(&inode->i_lock);
>                 state = inode->i_state;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9eced4cc286e..e4a74ab310b5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3028,8 +3028,20 @@ extern struct inode * igrab(struct inode *);
>  extern ino_t iunique(struct super_block *, ino_t);
>  extern int inode_needs_sync(struct inode *inode);
>  extern int generic_delete_inode(struct inode *inode);
> +
> +static inline bool dax_inode_busy(struct inode *inode)
> +{
> +       if (!IS_DAX(inode))
> +               return false;
> +
> +       return dax_zap_pages(inode) != NULL;
> +}
> +
>  static inline int generic_drop_inode(struct inode *inode)
>  {
> +       if (dax_inode_busy(inode))
> +               return 0;
> +
>         return !inode->i_nlink || inode_unhashed(inode);
>  }

I don't think that's valid. This can result in unreferenced unlinked
inodes that should be torn down immediately being placed in the LRU
and cached in memory and potentially not processed until there is
future memory pressure or an unmount....

i.e. dropping the final reference on an unlinked inode needs to
reclaim the inode immediately and allow the filesystem to free the
inode, regardless of any other factor. Nothing should have an active
reference to the inode or inode related data/metadata at this point
in time.

Honestly, this still seems like a band-aid because it doesn't appear
to address that something has pinned the storage media without
having an active reference to the object that arbitrates access to
that storage media (i.e. the inode and, by proxy, then filesystem).
Where are these DAX page pins that don't require the pin holder to
also hold active references to the filesystem objects coming from?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2022-09-21 22:14 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
2022-09-20 14:30   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
2022-09-20 14:31   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 03/18] fsdax: Include unmapped inodes for page-idle detection Dan Williams
2022-09-16  3:35 ` [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path Dan Williams
2022-09-16  3:35 ` [PATCH v2 05/18] xfs: Add xfs_break_layouts() " Dan Williams
2022-09-18 22:57   ` Dave Chinner
2022-09-19 16:11     ` Dan Williams
2022-09-19 21:29       ` Dave Chinner
2022-09-20 16:44         ` Dan Williams
2022-09-21 22:14           ` Dave Chinner [this message]
2022-09-21 22:28             ` Jason Gunthorpe
2022-09-23  0:18               ` Dave Chinner
2022-09-23  0:41                 ` Dan Williams
2022-09-23  2:10                   ` Dave Chinner
2022-09-23  9:38                     ` Jan Kara
2022-09-23 23:06                       ` Dan Williams
2022-09-25 23:54                       ` Dave Chinner
2022-09-26 14:10                         ` Jan Kara
2022-09-29 23:33                           ` Dan Williams
2022-09-30 13:41                             ` Jan Kara
2022-09-30 17:56                               ` Dan Williams
2022-09-30 18:06                                 ` Jason Gunthorpe
2022-09-30 18:46                                   ` Dan Williams
2022-10-03  7:55                                   ` Jan Kara
2022-09-23 12:39                     ` Jason Gunthorpe
2022-09-26  0:34                       ` Dave Chinner
2022-09-26 13:04                         ` Jason Gunthorpe
2022-09-22  0:02             ` Dan Williams
2022-09-22  0:10               ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings() Dan Williams
2022-09-16  3:35 ` [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
2022-09-16  3:35 ` [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry() Dan Williams
2022-09-16  3:36 ` [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention Dan Williams
2022-09-16  3:36 ` [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion Dan Williams
2022-09-21 14:03   ` Jason Gunthorpe
2022-09-21 15:18     ` Dan Williams
2022-09-21 21:38       ` Dan Williams
2022-09-21 22:07         ` Jason Gunthorpe
2022-09-22  0:14           ` Dan Williams
2022-09-22  0:25             ` Jason Gunthorpe
2022-09-22  2:17               ` Dan Williams
2022-09-22 17:55                 ` Jason Gunthorpe
2022-09-22 21:54                   ` Dan Williams
2022-09-23  1:36                     ` Dave Chinner
2022-09-23  2:01                       ` Dan Williams
2022-09-23 13:24                     ` Jason Gunthorpe
2022-09-23 16:29                       ` Dan Williams
2022-09-23 17:42                         ` Jason Gunthorpe
2022-09-23 19:03                           ` Dan Williams
2022-09-23 19:23                             ` Jason Gunthorpe
2022-09-27  6:07                             ` Alistair Popple
2022-09-27 12:56                               ` Jason Gunthorpe
2022-09-16  3:36 ` [PATCH v2 11/18] devdax: Minor warning fixups Dan Williams
2022-09-16  3:36 ` [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core Dan Williams
2022-09-27  6:20   ` Alistair Popple
2022-09-29 22:38     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 13/18] dax: Prep mapping helpers for compound pages Dan Williams
2022-09-21 14:06   ` Jason Gunthorpe
2022-09-21 15:19     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
2022-09-16  3:36 ` [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
2022-09-21 14:10   ` Jason Gunthorpe
2022-09-21 15:48     ` Dan Williams
2022-09-21 22:23       ` Jason Gunthorpe
2022-09-22  0:15         ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count Dan Williams
2022-09-21 15:24   ` Jason Gunthorpe
2022-09-21 23:45     ` Dan Williams
2022-09-22  0:03       ` Alistair Popple
2022-09-22  0:04       ` Jason Gunthorpe
2022-09-22  0:34         ` Dan Williams
2022-09-22  1:36           ` Alistair Popple
2022-09-22  2:34             ` Dan Williams
2022-09-26  6:17               ` Alistair Popple
2022-09-22  0:13       ` John Hubbard
2022-09-16  3:36 ` [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs() Dan Williams
2022-09-16  3:36 ` [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting Dan Williams
2022-09-20 14:29 ` [PATCH v2 00/18] Fix the DAX-gup mistake Jason Gunthorpe
2022-09-20 16:50   ` Dan Williams
2022-11-09  0:20 ` Andrew Morton
2022-11-09 11:38   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220921221416.GT3600936@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=nvdimm@lists.linux.dev \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).