All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: "Darrick J. Wong" <djwong@kernel.org>,
	Shiyang Ruan <ruansy.fnst@fujitsu.com>
Cc: <linux-fsdevel@vger.kernel.org>, <nvdimm@lists.linux.dev>,
	<linux-xfs@vger.kernel.org>, <linux-mm@kvack.org>,
	<dan.j.williams@intel.com>, <willy@infradead.org>, <jack@suse.cz>,
	<akpm@linux-foundation.org>
Subject: Re: [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
Date: Tue, 4 Apr 2023 21:36:25 -0700	[thread overview]
Message-ID: <642cfac98978d_21a8294cd@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <20230404174517.GF109974@frogsfrogsfrogs>

Darrick J. Wong wrote:
> On Tue, Mar 28, 2023 at 09:41:46AM +0000, Shiyang Ruan wrote:
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > (or mapped device) on it to unmap all files in use and notify processes
> > who are using those files.
> > 
> > Call trace:
> > trigger unbind
> >  -> unbind_store()
> >   -> ... (skip)
> >    -> devres_release_all()
> >     -> kill_dax()
> >      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >       -> xfs_dax_notify_failure()
> >       `-> freeze_super()
> >       `-> do xfs rmap
> >       ` -> mf_dax_kill_procs()
> >       `  -> collect_procs_fsdax()    // all associated
> >       `  -> unmap_and_kill()
> >       ` -> invalidate_inode_pages2() // drop file's cache
> >       `-> thaw_super()
> > 
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event.  Freeze the filesystem to prevent new dax mapping being created.
> > And do not shutdown filesystem directly if something not supported, or
> > if failure range includes metadata area.  Make sure all files and
> > processes are handled correctly.  Also drop the cache of associated
> > files before pmem is removed.
> > 
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > 
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> >  drivers/dax/super.c         |  3 +-
> >  fs/xfs/xfs_notify_failure.c | 56 +++++++++++++++++++++++++++++++++----
> >  include/linux/mm.h          |  1 +
> >  mm/memory-failure.c         | 17 ++++++++---
> >  4 files changed, 67 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c4c4728a36e4..2e1a35e82fce 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> >  		return;
> >  
> >  	if (dax_dev->holder_data != NULL)
> > -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > +				MF_MEM_PRE_REMOVE);
> >  
> >  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> >  	synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 1e2eddb8f90f..1b4eff43f9b5 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -22,6 +22,7 @@
> >  
> >  #include <linux/mm.h>
> >  #include <linux/dax.h>
> > +#include <linux/fs.h>
> >  
> >  struct xfs_failure_info {
> >  	xfs_agblock_t		startblock;
> > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> >  	struct xfs_mount		*mp = cur->bc_mp;
> >  	struct xfs_inode		*ip;
> >  	struct xfs_failure_info		*notify = data;
> > +	struct address_space		*mapping;
> > +	pgoff_t				pgoff;
> > +	unsigned long			pgcnt;
> >  	int				error = 0;
> >  
> >  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> >  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > +		/* The device is about to be removed.  Not a really failure. */
> > +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > +			return 0;
> >  		notify->want_shutdown = true;
> >  		return 0;
> >  	}
> > @@ -92,10 +99,18 @@ xfs_dax_failure_fn(
> >  		return 0;
> >  	}
> >  
> > -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > -				  xfs_failure_pgoff(mp, rec, notify),
> > -				  xfs_failure_pgcnt(mp, rec, notify),
> > -				  notify->mf_flags);
> > +	mapping = VFS_I(ip)->i_mapping;
> > +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> > +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > +
> > +	/* Continue the rmap query if the inode isn't a dax file. */
> > +	if (dax_mapping(mapping))
> > +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > +				notify->mf_flags);
> > +
> > +	/* Invalidate the cache anyway. */
> > +	invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
> > +
> >  	xfs_irele(ip);
> >  	return error;
> >  }
> > @@ -164,11 +179,25 @@ xfs_dax_notify_ddev_failure(
> >  	}
> >  
> >  	xfs_trans_cancel(tp);
> > +
> > +	/* Unfreeze filesystem anyway if it is freezed before. */
> > +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> > +		error = thaw_super(mp->m_super);
> > +		if (error)
> > +			return error;
> 
> If someone *else* wanders in and thaws the fs, you'll get EINVAL here.
> 
> I guess that's useful for knowing if someone's screwed up the freeze
> state on us, but ... really, don't you want to make sure you've gotten
> the freeze and nobody else can take it away?
> 
> I think you want the kernel-initiated freeze proposed by Luis here:
> https://lore.kernel.org/linux-fsdevel/20230114003409.1168311-4-mcgrof@kernel.org/
> 
> Also: Is Fujitsu still pursuing pmem products?  Even though Optane is
> dead?  I'm no longer sure of what the roadmap is for all this fsdax code
> and whatnot.

First, I need to spend more time on DAX patches, I have let CXL
monopolize too much of my time and you've relied reviewed these when you
have other XFS things to worry about.

As for the future of fsdax / pmem, I had written this up earlier:

https://lore.kernel.org/all/62ef05515b085_1b3c29434@dwillia2-xfh.jf.intel.com.notmuch/

That said, I would feel better if there were examples to point at doing
"PMEM over CXL" in the market. Maybe there are and I have missed them.

There are examples of vendors doing combined CXL memory with NVME flash,
but the CXL in that case seems to just be a way to move DMA buffers
closer to the device, not something more interesting like a
hardware-accelerated page-cache / pmem device.

For now the kernel is ready for PMEM CXL devices per the specification
(drivers/cxl/pmem.c).

Now, all that said the motivation for this patch is a bit different in
that it solves an architectural problem with the way current pmem
devices are shutdown and the missing step to properly evacuate usage of
'struct page' metadata that may be active on that device. So while CXL
hotplug is a practical trigger for this, it can also be achieved without
hardware via:

echo "namespaceX.Y" > /sys/bus/nd/drivers/nd_pmem/unbind

...to hot remove an in use namespace / volume. The observation is that
before the pmem driver can assume that it can rip active pages away from
the system it needs to tell anyone that cares about those page
disappearing to abandon their interest and shutdown. So the idea is for
surprise shutdown failures to tell dax-filesystems that all of the pmem
is going away at once.

  parent reply	other threads:[~2023-04-05  4:36 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-28  9:41 [PATCH v11 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-03-28  9:41 ` [PATCH v11 1/2] xfs: fix the calculation of length and end Shiyang Ruan
2023-03-28  9:41 ` [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-04-04 17:45   ` Darrick J. Wong
2023-04-05  1:28     ` Yasunori Gotou (Fujitsu)
2023-04-05  4:36     ` Dan Williams [this message]
2023-04-06 10:50     ` Shiyang Ruan
2023-04-06 14:54       ` Darrick J. Wong
2023-04-07  2:07         ` Shiyang Ruan
2023-04-12 10:52   ` [RFC PATCH v11.1 " Shiyang Ruan
2023-04-20  2:07     ` Shiyang Ruan
2023-04-20 12:09       ` Jan Kara
2023-04-25 12:47         ` Shiyang Ruan
2023-04-25 13:23           ` Jan Kara
2023-04-25 15:18             ` Darrick J. Wong
2023-04-26  2:27               ` Shiyang Ruan
2023-04-26  2:37                 ` Darrick J. Wong
2023-04-04  4:33 ` [PATCH v11 0/2] " Shiyang Ruan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=642cfac98978d_21a8294cd@dwillia2-xfh.jf.intel.com.notmuch \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=djwong@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=nvdimm@lists.linux.dev \
    --cc=ruansy.fnst@fujitsu.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.