ocfs2-devel.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: "y-goto@fujitsu.com" <y-goto@fujitsu.com>,
	"jack@suse.cz" <jack@suse.cz>,
	"fnstml-iaas@cn.fujitsu.com" <fnstml-iaas@cn.fujitsu.com>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
	"darrick.wong@oracle.com" <darrick.wong@oracle.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"ruansy.fnst@fujitsu.com" <ruansy.fnst@fujitsu.com>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"ocfs2-devel@oss.oracle.com" <ocfs2-devel@oss.oracle.com>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"qi.fuli@fujitsu.com" <qi.fuli@fujitsu.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [Ocfs2-devel] Question about the "EXPERIMENTAL" tag for dax in XFS
Date: Sun, 28 Feb 2021 09:36:11 +1100	[thread overview]
Message-ID: <20210227223611.GZ4662@dread.disaster.area> (raw)
In-Reply-To: <CAPcyv4jryJ32R5vOwwEdoU3V8C0B7zu_pCt=7f6A3Gk-9h6Dfg@mail.gmail.com>

On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > My immediate concern is the issue Jason recently highlighted [1] with
> > > > > respect to invalidating all dax mappings when / if the device is
> > > > > ripped out from underneath the fs. I don't think that will collide
> > > > > with Ruan's implementation, but it does need new communication from
> > > > > driver to fs about removal events.
> > > > >
> > > > > [1]: http://lore.kernel.org/r/CAPcyv4i+PZhYZiePf2PaH0dT5jDfkmkDX-3usQy1fAhf6LPyfw@mail.gmail.com
> > > >
> > > > Oh, yay.
> > > >
> > > > The XFS shutdown code is centred around preventing new IO from being
> > > > issued - we don't actually do anything about DAX mappings because,
> > > > well, I don't think anyone on the filesystem side thought they had
> > > > to do anything special if pmem went away from under it.
> > > >
> > > > My understanding -was- that the pmem removal invalidates
> > > > all the ptes currently mapped into CPU page tables that point at
> > > > the dax device across the system. THe vmas that manage these
> > > > mappings are not really something the filesystem really manages,
> > > > but a function of the mm subsystem. What the filesystem cares about
> > > > is that it gets page faults triggered when a change of state occurs
> > > > so that it can remap the page to it's backing store correctly.
> > > >
> > > > IOWs, all the mm subsystem needs to when pmem goes away is clear the
> > > > CPU ptes, because then when then when userspace tries to access the
> > > > mapped DAX pages we get a new page fault. In processing the fault, the
> > > > filesystem will try to get direct access to the pmem from the block
> > > > device. This will get an ENODEV error from the block device because
> > > > because the backing store (pmem) has been unplugged and is no longer
> > > > there...
> > > >
> > > > AFAICT, as long as pmem removal invalidates all the active ptes that
> > > > point at the pmem being removed, the filesystem doesn't need to
> > > > care about device removal at all, DAX or no DAX...
> > >
> > > How would the pmem removal do that without walking all the active
> > > inodes in the fs at the time of shutdown and call
> > > unmap_mapping_range(inode->i_mapping, 0, 0, 1)?
> >
> > Which then immediately ends up back at the vmas that manage the ptes
> > to unmap them.
> >
> > Isn't finding the vma(s) that map a specific memory range exactly
> > what the rmap code in the mm subsystem is supposed to address?
> 
> rmap can lookup only vmas from a virt address relative to a given
> mm_struct. The driver has neither the list of mm_struct objects nor
> virt addresses to do a lookup. All it knows is that someone might have
> mapped pages through the fsdax interface.

So there's no physical addr to vma translation in the mm subsystem
at all?

That doesn't make sense. We do exactly this for hwpoison for DAX
mappings. While we don't look at ptes, we get a pfn, grab the page
it points to, check if it points to the PMEM that is being removed,
grab the page it points to, map that to the relevant struct page,
run collect_procs() on that page, then kill the user processes that
map that page.

So why can't we walk the ptes, check the physical pages that they
map to and if they map to a pmem page we go poison that
page and that kills any user process that maps it.

i.e. I can't see how unexpected pmem device unplug is any different
to an MCE delivering a hwpoison event to a DAX mapped page.  Both
indicate a physical address range now contains invalid data and the
filesystem has to take the same action...

IOWs, we could just call ->corrupted_range(0, EOD) here to tell the
filesystem the entire device went away. Then the filesystem deal
with this however it needs to. However, it would be more efficient
from an invalidation POV to just call it on the pages that have
currently active ptes because once the block device is dead
new page faults on DAX mappings will get a SIGBUS naturally.

> To me this looks like a notifier that fires from memunmap_pages()
> after dev_pagemap_kill() to notify any block_device associated with
> that dev_pagemap() to say that any dax mappings arranged through this
> block_device are now invalid. The reason to do this after
> dev_pagemap_kill() is so that any new mapping attempts that are racing
> the removal will be blocked.

I don't see why this needs a unique notifier. At the filesystem
level, we want a single interface that tells us "something bad
happened to the block device", not a proliferation of similar but
subtly different "bad thing X happened to block device" interfaces
that are unique to specific physical device drivers...

> The receiver of that notification needs to go from a block_device to a
> superblock that has mapped inodes and walk ->sb_inodes triggering the
> unmap/invalidation.

Not necessarily.

What if the filesystem is managing mirrored data across multiple
devices and this device is only one leg of the mirror? Or that the
pmem was used by the RT device in XFS and the data/log devices are
still fine? What if the pmem is just being used as a cache tier, and
no data was actually lost?

IOWs, what needs to happen at this point is very filesystem
specific. Assuming that "device unplug == filesystem dead" is not
correct, nor is specifying a generic action that assumes the
filesystem is dead because a device it is using went away.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

  reply	other threads:[~2021-02-27 22:40 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-26  0:20 [Ocfs2-devel] [PATCH v2 00/10] fsdax, xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
2021-02-26  0:20 ` [Ocfs2-devel] [PATCH v2 01/10] fsdax: Factor helpers to simplify dax fault code Shiyang Ruan
2021-03-03  9:13   ` Christoph Hellwig
2021-02-26  0:20 ` [Ocfs2-devel] [PATCH v2 02/10] fsdax: Factor helper: dax_fault_actor() Shiyang Ruan
2021-03-03  9:28   ` Christoph Hellwig
2021-03-12  9:01     ` ruansy.fnst
2021-02-26  0:20 ` [Ocfs2-devel] [PATCH v2 03/10] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
2021-02-26  0:20 ` [Ocfs2-devel] [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
2021-03-03  9:30   ` Christoph Hellwig
2021-03-03  9:41     ` ruansy.fnst
2021-03-03  9:44       ` Christoph Hellwig
2021-03-03  9:48   ` Christoph Hellwig
2021-02-26  0:20 ` [Ocfs2-devel] [PATCH v2 08/10] fsdax: Dedup file range to use a compare function Shiyang Ruan
2021-02-26  8:28   ` Shiyang Ruan
2021-03-03  8:20   ` Joe Perches
2021-03-03  8:45     ` ruansy.fnst
2021-03-03  9:04       ` Joe Perches
2021-03-03  9:39       ` hch
2021-03-03  9:46         ` ruansy.fnst
2021-03-04  5:42   ` [Ocfs2-devel] [RESEND PATCH v2.1 " Shiyang Ruan
2021-02-26  0:20 ` [Ocfs2-devel] [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path Shiyang Ruan
2021-03-03  9:43   ` Christoph Hellwig
2021-03-03  9:57     ` ruansy.fnst
2021-03-03 10:43       ` Christoph Hellwig
2021-03-04  1:35         ` ruansy.fnst
2021-02-26  0:20 ` [Ocfs2-devel] [PATCH v2 10/10] fs/xfs: Add dedupe support for fsdax Shiyang Ruan
2021-02-26  9:45 ` [Ocfs2-devel] Question about the "EXPERIMENTAL" tag for dax in XFS ruansy.fnst
2021-02-26 19:04   ` Darrick J. Wong
2021-02-26 19:24     ` Dan Williams
2021-02-26 20:51       ` Dave Chinner
2021-02-26 20:59         ` Dan Williams
2021-02-26 21:27           ` Dave Chinner
2021-02-26 22:41             ` Dan Williams
2021-02-27 22:36               ` Dave Chinner [this message]
2021-02-27 23:40                 ` Dan Williams
2021-02-28 22:38                   ` Dave Chinner
2021-03-01 20:55                     ` Dan Williams
2021-03-01 22:46                       ` Dave Chinner
2021-03-02  0:32                         ` Dan Williams
2021-03-02  2:42                           ` Dave Chinner
2021-03-02  3:33                             ` Dan Williams
2021-03-02  5:38                               ` Dave Chinner
2021-03-02  5:50                                 ` Dan Williams
2021-03-02  3:28                       ` Darrick J. Wong
2021-03-02  5:41                         ` Dan Williams
2021-03-02  7:57                           ` Dave Chinner
2021-03-02 17:49                             ` Dan Williams
2021-03-04 23:40                               ` Darrick J. Wong
2021-03-01  7:26       ` Yasunori Goto
2021-03-01 21:34         ` Dan Williams
     [not found] ` <20210226002030.653855-5-ruansy.fnst@fujitsu.com>
2021-03-03  9:29   ` [Ocfs2-devel] [PATCH v2 04/10] fsdax: Introduce dax_iomap_cow_copy() Christoph Hellwig
     [not found] ` <20210226002030.653855-7-ruansy.fnst@fujitsu.com>
2021-03-03  9:31   ` [Ocfs2-devel] [PATCH v2 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Christoph Hellwig
     [not found] ` <20210226002030.653855-8-ruansy.fnst@fujitsu.com>
2021-02-26  4:14   ` [Ocfs2-devel] [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files Darrick J. Wong
2021-02-26  8:11     ` ruansy.fnst
2021-02-26  8:25   ` Shiyang Ruan
2021-03-04  5:41   ` [Ocfs2-devel] [RESEND PATCH v2.1 " Shiyang Ruan
2021-03-11 12:30     ` Christoph Hellwig
2021-03-09  6:36 ` [Ocfs2-devel] [PATCH v2 00/10] fsdax, xfs: Add reflink&dedupe support for fsdax Xiaoguang Wang
2021-03-10  1:32   ` ruansy.fnst
2021-03-09 16:19 ` Goldwyn Rodrigues
2021-03-10  1:26   ` ruansy.fnst
2021-03-10 12:30 ` Neal Gompa
2021-03-10 13:02   ` Matthew Wilcox
2021-03-10 13:36     ` Neal Gompa
2021-03-10 13:55       ` Matthew Wilcox
2021-03-10 14:21     ` Goldwyn Rodrigues
2021-03-10 14:26       ` Matthew Wilcox
2021-03-10 17:04         ` Goldwyn Rodrigues
2021-03-11  0:53         ` Dan Williams
2021-03-11  8:26           ` Neal Gompa
2021-03-13 13:07         ` Adam Borowski
2021-03-13 16:24           ` Neal Gompa
2021-03-13 22:00             ` Adam Borowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210227223611.GZ4662@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=dan.j.williams@intel.com \
    --cc=darrick.wong@oracle.com \
    --cc=fnstml-iaas@cn.fujitsu.com \
    --cc=jack@suse.cz \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ocfs2-devel@oss.oracle.com \
    --cc=qi.fuli@fujitsu.com \
    --cc=ruansy.fnst@fujitsu.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=y-goto@fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).