nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org, linux-nvdimm@lists.01.org,
	Jason Gunthorpe <jgg@ziepe.ca>, Christoph Hellwig <hch@lst.de>,
	Shiyang Ruan <ruansy.fnst@fujitsu.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()
Date: Thu, 18 Mar 2021 15:57:45 +1100	[thread overview]
Message-ID: <20210318045745.GC349301@dread.disaster.area> (raw)
In-Reply-To: <161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com>

On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare.

This is a highly suboptimal design. Filesystems only need a single
callout to trigger a shutdown that unmaps every active mapping in
the filesystem - we do not need a page-by-page error notification
which results in 250 million hwposion callouts per TB of pmem to do

Indeed, the moment we get the first hwpoison from this patch, we'll
map it to the primary XFS superblock and we'd almost certainly
consider losing the storage behind that block to be a shut down
trigger. During the shutdown, the filesystem should unmap all the
active mappings (we already need to add this to shutdown on DAX
regardless of this device remove issue) and so we really don't need
a page-by-page notification of badness.

AFAICT, it's going to take minutes, maybe hours for do the page-by-page
iteration to hwposion every page. It's going to take a few seconds
for the filesystem shutdown to run a device wide invalidation.

SO, yeah, I think this should simply be a single ranged call to the
filesystem like:

	->memory_failure(dev, 0, -1ULL)

to tell the filesystem that the entire backing device has gone away,
and leave the filesystem to handle failure entirely at the
filesystem level.

Dave Chinner
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

  parent reply	other threads:[~2021-03-18  4:57 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-18  4:08 [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove Dan Williams
2021-03-18  4:08 ` [PATCH 1/3] mm/memory-failure: Prepare for mass memory_failure() Dan Williams
2021-03-18  4:08 ` [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure() Dan Williams
2021-03-18  4:45   ` Darrick J. Wong
2021-03-18 19:26     ` Dan Williams
2021-03-18  4:57   ` Dave Chinner [this message]
2021-03-18 19:20     ` Dan Williams
2021-03-20  1:46       ` Dave Chinner
2021-03-20  2:39         ` Dan Williams
2021-03-18  4:08 ` [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path Dan Williams
2021-03-18 10:00   ` Joao Martins
2021-03-18 17:03     ` Dan Williams
     [not found]       ` <20210325143419.GK2710221@ziepe.ca>
2021-03-29 23:24         ` Dan Williams
2021-03-24 17:45     ` Dan Williams
2021-03-24 19:00       ` Joao Martins
2021-04-01 19:54         ` Joao Martins
2021-03-25 13:02 ` [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210318045745.GC349301@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=djwong@kernel.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=jgg@ziepe.ca \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=ruansy.fnst@fujitsu.com \
    --cc=willy@infradead.org \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).