From: Dan Williams <dan.j.williams@intel.com>
To: linux-mm@kvack.org, linux-nvdimm@lists.01.org
Cc: Jason Gunthorpe <jgg@ziepe.ca>, Jan Kara <jack@suse.cz>,
Christoph Hellwig <hch@lst.de>,
"Darrick J. Wong" <djwong@kernel.org>,
Dave Chinner <david@fromorbit.com>,
Matthew Wilcox <willy@infradead.org>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
Shiyang Ruan <ruansy.fnst@fujitsu.com>,
akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove
Date: Wed, 17 Mar 2021 21:08:02 -0700 [thread overview]
Message-ID: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com> (raw)
Summary:
A dax_dev can be unbound from its driver at any time. Unbind can not
fail. The driver-core will always trigger ->remove() and the result from
->remove() is ignored. After ->remove() the driver-core proceeds to tear
down context. The filesystem-dax implementation can leave pfns mapped
after ->remove() if it is triggered while the filesystem is mounted.
Security and data-integrity is forfeit if the dax_dev is repurposed for
another security domain (new filesystem or change device modes), or if
the dax_dev is physically replaced. CXL is a hotplug bus that makes
dax_dev physical replace a real world prospect.
All dax_dev pfns must be unmapped at remove. Detect the "remove while
mounted" case and trigger memory_failure() over the entire dax_dev
range.
Details:
The get_user_pages_fast() path expects all synchronization to be handled
by the pattern of checking for pte presence, taking a page reference,
and then validating that the pte was stable over that event. The
gup-fast path for devmap / DAX pages additionally attempts to take/hold
a live reference against the hosting pgmap over the page pin. The
rational for the pgmap reference is to synchronize against a dax-device
unbind / ->remove() event, but that is unnecessary if pte invalidation
is guaranteed in the ->remove() path.
Global dax-device pte invalidation *does* happen when the device is in
raw "device-dax" mode where there is a single shared inode to unmap at
remove, but the filesystem-dax path has a large number of actively
mapped inodes unknown to the driver at ->remove() time. So, that unmap
does not happen today for filesystem-dax. However, as Jason points out,
that unmap / invalidation *needs* to happen not only to cleanup
get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
->remove() is correlated with actual physical removal / replacement the
implications of allowing a physical pfn to be exchanged without tearing
down old mappings are severe (security and data-integrity).
What is not in this patch set is coordination with the dax_kmem driver
to trigger memory_failure() when the dax_dev is onlined as "System
RAM". The remove_memory() API was built with the assumption that
platform firmware negotiates all removal requests and the OS has a
chance to say "no". This is why dax_kmem today simply leaks
request_region() to burn that physical address space for any other
usage until the next reboot on a manual unbind event if the memory can't
be offlined. However a future to make sure that remove_memory() succeeds
after memory_failure() of the same range seems a better semantic than
permanently burning physical address space.
The topic of remove_memory() failures gets to the question of what
happens to active page references when the inopportune ->remove() event
happens. For transient pins the ->remove() event will wait for for all
pins to be dropped before allowing ->remove() to complete. Since
fileystem-dax forbids longterm pins all those pins are transient.
Device-dax, on the other hand, does allow longterm pins which means that
->remove() will hang unless / until the longterm pin is dropped.
Hopefully an unmap_mapping_range() event is sufficient to get the pin
dropped, but I suspect device-dax might need to trigger memory_failure()
as well to get the longterm pin holder to wake up and get out of the
way (TBD).
Lest we repeat the "longterm-pin-revoke" debate, which highlighted that
RDMA devices do not respond well to having context torn down, keep in
mind that this proposal is to do a best effort recovery of an event that
should not happen (surprise removal) under nominal operation.
---
Dan Williams (3):
mm/memory-failure: Prepare for mass memory_failure()
mm, dax, pmem: Introduce dev_pagemap_failure()
mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path
drivers/dax/super.c | 15 +++++++++++++++
drivers/nvdimm/pmem.c | 10 +++++++++-
drivers/nvdimm/pmem.h | 1 +
include/linux/dax.h | 5 +++++
include/linux/memremap.h | 5 +++++
include/linux/mm.h | 3 +++
mm/gup.c | 38 ++++++++++++++++----------------------
mm/memory-failure.c | 36 +++++++++++++++++++++++-------------
mm/memremap.c | 11 +++++++++++
9 files changed, 88 insertions(+), 36 deletions(-)
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
next reply other threads:[~2021-03-18 4:08 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-18 4:08 Dan Williams [this message]
2021-03-18 4:08 ` [PATCH 1/3] mm/memory-failure: Prepare for mass memory_failure() Dan Williams
2021-03-18 4:08 ` [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure() Dan Williams
2021-03-18 4:45 ` Darrick J. Wong
2021-03-18 19:26 ` Dan Williams
2021-03-18 4:57 ` Dave Chinner
2021-03-18 19:20 ` Dan Williams
2021-03-20 1:46 ` Dave Chinner
2021-03-20 2:39 ` Dan Williams
2021-03-18 4:08 ` [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path Dan Williams
2021-03-18 10:00 ` Joao Martins
2021-03-18 17:03 ` Dan Williams
[not found] ` <20210325143419.GK2710221@ziepe.ca>
2021-03-29 23:24 ` Dan Williams
2021-03-24 17:45 ` Dan Williams
2021-03-24 19:00 ` Joao Martins
2021-04-01 19:54 ` Joao Martins
2021-03-25 13:02 ` [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com \
--to=dan.j.williams@intel.com \
--cc=akpm@linux-foundation.org \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=jgg@ziepe.ca \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nvdimm@lists.01.org \
--cc=naoya.horiguchi@nec.com \
--cc=ruansy.fnst@fujitsu.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).