archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <>
To: Dan Williams <>
Cc: "Darrick J. Wong" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>,
	"" <>, "" <>,
	"" <>,
	"" <>,
	"" <>
Subject: Re: Question about the "EXPERIMENTAL" tag for dax in XFS
Date: Tue, 2 Mar 2021 18:57:36 +1100	[thread overview]
Message-ID: <20210302075736.GJ4662@dread.disaster.area> (raw)
In-Reply-To: <>

On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong <> wrote:
> > > > I really don't see you seem to be telling us that invalidation is an
> > > > either/or choice. There's more ways to convert physical block
> > > > address -> inode file offset and mapping index than brute force
> > > > inode cache walks....
> > >
> > > Yes, but I was trying to map it to an existing mechanism and the
> > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > needs to happen here.
> >
> > Yes.  XFS (with rmap enabled) can do all the iteration and walking in
> > that function except for the invalidate_mapping_* call itself.  The goal
> > of this series is first to wire up a callback within both the block and
> > pmem subsystems so that they can take notifications and reverse-map them
> > through the storage stack until they reach an fs superblock.
> I'm chuckling because this "reverse map all the way up the block
> layer" is the opposite of what Dave said at the first reaction to my
> proposal, "can't the mm map pfns to fs inode  address_spaces?".

Ah, no, I never said that the filesystem can't do reverse maps. I
was asking if the mm could directly (brute-force) invalidate PTEs
pointing at physical pmem ranges without needing walk the inode
mappings. That would be far more efficient if it could be done....

> Today whenever the pmem driver receives new corrupted range
> notification from the lower level nvdimm
> infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
> associated with the pmem gendisk and then notifies userspace that
> there are new badblocks. This seems a perfect place to signal an upper
> level stacked block device that may also be watching disk->bb. Then
> each gendisk in a stacked topology is responsible for watching the
> badblock notifications of the next level and storing a remapped
> instance of those blocks until ultimately the filesystem mounted on
> the top-level block device is responsible for registering for those
> top-level disk->bb events.
> The device gone notification does not map cleanly onto 'struct badblocks'.

Filesystems are not allowed to interact with the gendisk
infrastructure - that's for supporting the device side of a block
device. It's a layering violation, and many a filesytem developer
has been shouted at for trying to do this. At most we can peek
through it to query functionality support from the request queue,
but otherwise filesystems do not interact with anything under

As it is, badblocks are used by devices to manage internal state.
e.g. md for recording stripes that need recovery if the system
crashes while they are being written out.

> If an upper level agent really cared about knowing about ->remove()
> events before they happened it could maybe do something like:
> dev = disk_to_dev(bdev->bd_disk)->parent;
> bus_register_notifier(dev->bus. &disk_host_device_notifier_block)

Yeah, that's exactly the sort of thing that filesystems have been
aggressively discouraged from doing for years.

Part of the reason for this is that gendisk based mechanisms are not
very good for stacked device error reporting. Part of the problem
here is that every layer of the stacked device has to hook the
notifier of the block devices underneath it, then translate the
event to match the upper block device map, then regenerate the
notification for the next layer up. This isn't an efficient way to
pass a notification through a series of stacked devices and it is
messy and cumbersome to maintain.

It can be effective for getting notifications to userspace about
something that happens to a specific block device. But The userspace
still ends up having to solve the "what does this error resolve to"
problem. i.e. Userspace still needs to map that notification to a
filesystem, and for data loss events map it to objects within the
filesystem, which can be extremely expensive to do from userspace.

This is exactly the sort of userspace error reporting mess that
various projects have asked us to try to fix. Plumbing errors
internally through the kernel up to the filesystem where the
filesytem can point directly to the user data that is affected is a
simple, effective solution to the problem. Especially if we then
have a generic error notification mechanism for filesystems to emit
errors to registered userspace watchers...

> I still don't think that solves the need for a separate mechanism for
> global dax_device pte invalidation.

It's just another type of media error because.....

> I think that global dax_device invalidation needs new kernel
> infrastructure to allow internal users, like dm-writecache and future
> filesystems using dax for metadata, to take a fault when pmem is
> offlined.

.... if userspace has directly mapped into the cache, and the cache
storage goes away, the userspace app has to be killed because we
have no idea if the device going away has caused data loss or not.
IOWs, if userspace writes direct to the cache device and it hasn't
been written back to other storage when it gets yanked, we have just
caused data corruption to occur.

At minimum, we now have to tell the filesystem that the dirty data
in the cache is now bad, and direct map applications that map those
dirty ranges need to be killed because their backing store is no
longer valid nor does the backup copy contain the data they last
wrote. Nor is it acessible by direct access, which is going to be
interesting because dynamically changing dax to non-dax access can't
be done without forcibly kicking the inode out of the cache. That
requires all references to the inode to go away. And that means the
event really has to go up to the filesystem.

But I think the biggest piece of the puzzle that you haven't grokked
here is that the dm cache device isn't a linear map - it's made up of
random ranges from the underlying devices. Hence the "remove" of a dm
cache device turns into a huge number of small, sparse corrupt
ranges, not a single linear device remove event.

IOWs, device unplug/remove events are not just simple "pass it on"
events in a stacked storage setup. There can be non-trivial mappings
through the layers, and device disappearance may in fact manifest to
the user as data corruption rather than causing data to be

Hence "remove" notifications just don't work in the storage stack.
They need to be translated to block ranges going bad (i.e.  media
errors), and reported to higher layers as bad ranges, not as device

The same goes for DAX devices. The moment they can be placed in
storage stacks in non-trivial configurations and/or used as cache
devices that can be directly accessed over tranditional block
devices, we end up with error conditions that can only be mapped as
ranges of blocks that have gone bad.


Dave Chinner
Linux-nvdimm mailing list --
To unsubscribe send an email to

  reply	other threads:[~2021-03-02  7:57 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-26  0:20 [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax Shiyang Ruan
2021-02-26  0:20 ` [PATCH v2 01/10] fsdax: Factor helpers to simplify dax fault code Shiyang Ruan
2021-03-03  9:13   ` Christoph Hellwig
2021-02-26  0:20 ` [PATCH v2 02/10] fsdax: Factor helper: dax_fault_actor() Shiyang Ruan
2021-03-03  9:28   ` Christoph Hellwig
2021-03-12  9:01     ` ruansy.fnst
2021-02-26  0:20 ` [PATCH v2 03/10] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
2021-02-26  0:20 ` [PATCH v2 04/10] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
2021-03-03  9:29   ` Christoph Hellwig
2021-02-26  0:20 ` [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
2021-03-03  9:30   ` Christoph Hellwig
2021-03-03  9:41     ` ruansy.fnst
2021-03-03  9:44       ` Christoph Hellwig
2021-03-03  9:48   ` Christoph Hellwig
2021-02-26  0:20 ` [PATCH v2 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero Shiyang Ruan
2021-03-03  9:31   ` Christoph Hellwig
2021-02-26  0:20 ` [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files Shiyang Ruan
2021-02-26  4:14   ` Darrick J. Wong
2021-02-26  8:11     ` ruansy.fnst
2021-02-26  8:25   ` Shiyang Ruan
2021-03-04  5:41   ` [RESEND PATCH v2.1 " Shiyang Ruan
2021-03-11 12:30     ` Christoph Hellwig
2021-02-26  0:20 ` [PATCH v2 08/10] fsdax: Dedup file range to use a compare function Shiyang Ruan
2021-02-26  8:28   ` Shiyang Ruan
2021-03-03  8:20   ` Joe Perches
2021-03-04  5:42   ` [RESEND PATCH v2.1 " Shiyang Ruan
2021-02-26  0:20 ` [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path Shiyang Ruan
2021-03-03  9:43   ` Christoph Hellwig
2021-03-03  9:57     ` ruansy.fnst
2021-03-03 10:43       ` Christoph Hellwig
2021-03-04  1:35         ` ruansy.fnst
2021-02-26  0:20 ` [PATCH v2 10/10] fs/xfs: Add dedupe support for fsdax Shiyang Ruan
2021-02-26  9:45 ` Question about the "EXPERIMENTAL" tag for dax in XFS ruansy.fnst
2021-02-26 19:04   ` Darrick J. Wong
2021-02-26 19:24     ` Dan Williams
2021-02-26 20:51       ` Dave Chinner
2021-02-26 20:59         ` Dan Williams
2021-02-26 21:27           ` Dave Chinner
2021-02-26 22:41             ` Dan Williams
2021-02-27 22:36               ` Dave Chinner
2021-02-27 23:40                 ` Dan Williams
2021-02-28 22:38                   ` Dave Chinner
2021-03-01 20:55                     ` Dan Williams
2021-03-01 22:46                       ` Dave Chinner
2021-03-02  0:32                         ` Dan Williams
2021-03-02  2:42                           ` Dave Chinner
2021-03-02  3:33                             ` Dan Williams
2021-03-02  5:38                               ` Dave Chinner
2021-03-02  5:50                                 ` Dan Williams
2021-03-02  3:28                       ` Darrick J. Wong
2021-03-02  5:41                         ` Dan Williams
2021-03-02  7:57                           ` Dave Chinner [this message]
2021-03-02 17:49                             ` Dan Williams
2021-03-04 23:40                               ` Darrick J. Wong
2021-03-01  7:26       ` Yasunori Goto
2021-03-01 21:34         ` Dan Williams
2021-03-09  6:36 ` [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax Xiaoguang Wang
2021-03-09 16:19 ` Goldwyn Rodrigues
2021-03-10  1:26   ` ruansy.fnst
2021-03-10 12:30 ` Neal Gompa
2021-03-10 13:02   ` Matthew Wilcox
2021-03-10 13:36     ` Neal Gompa
2021-03-10 13:55       ` Matthew Wilcox
2021-03-10 14:21     ` Goldwyn Rodrigues
2021-03-10 14:26       ` Matthew Wilcox
2021-03-10 17:04         ` Goldwyn Rodrigues
2021-03-11  0:53         ` Dan Williams
2021-03-11  8:26           ` Neal Gompa
2021-03-13 13:07         ` Adam Borowski
2021-03-13 16:24           ` Neal Gompa
2021-03-13 22:00             ` Adam Borowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210302075736.GJ4662@dread.disaster.area \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).