From: Dan Williams <email@example.com> To: firstname.lastname@example.org, email@example.com Cc: Ira Weiny <firstname.lastname@example.org>, Jason Gunthorpe <email@example.com>, Dave Jiang <firstname.lastname@example.org>, Vishal Verma <email@example.com>, Jan Kara <firstname.lastname@example.org>, Christoph Hellwig <email@example.com>, "Darrick J. Wong" <firstname.lastname@example.org>, Dave Chinner <email@example.com>, Matthew Wilcox <firstname.lastname@example.org>, Naoya Horiguchi <email@example.com>, Shiyang Ruan <firstname.lastname@example.org>, Andrew Morton <email@example.com>, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org Subject: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove Date: Wed, 17 Mar 2021 21:08:02 -0700 [thread overview] Message-ID: <email@example.com> (raw) Summary: A dax_dev can be unbound from its driver at any time. Unbind can not fail. The driver-core will always trigger ->remove() and the result from ->remove() is ignored. After ->remove() the driver-core proceeds to tear down context. The filesystem-dax implementation can leave pfns mapped after ->remove() if it is triggered while the filesystem is mounted. Security and data-integrity is forfeit if the dax_dev is repurposed for another security domain (new filesystem or change device modes), or if the dax_dev is physically replaced. CXL is a hotplug bus that makes dax_dev physical replace a real world prospect. All dax_dev pfns must be unmapped at remove. Detect the "remove while mounted" case and trigger memory_failure() over the entire dax_dev range. Details: The get_user_pages_fast() path expects all synchronization to be handled by the pattern of checking for pte presence, taking a page reference, and then validating that the pte was stable over that event. The gup-fast path for devmap / DAX pages additionally attempts to take/hold a live reference against the hosting pgmap over the page pin. The rational for the pgmap reference is to synchronize against a dax-device unbind / ->remove() event, but that is unnecessary if pte invalidation is guaranteed in the ->remove() path. Global dax-device pte invalidation *does* happen when the device is in raw "device-dax" mode where there is a single shared inode to unmap at remove, but the filesystem-dax path has a large number of actively mapped inodes unknown to the driver at ->remove() time. So, that unmap does not happen today for filesystem-dax. However, as Jason points out, that unmap / invalidation *needs* to happen not only to cleanup get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev ->remove() is correlated with actual physical removal / replacement the implications of allowing a physical pfn to be exchanged without tearing down old mappings are severe (security and data-integrity). What is not in this patch set is coordination with the dax_kmem driver to trigger memory_failure() when the dax_dev is onlined as "System RAM". The remove_memory() API was built with the assumption that platform firmware negotiates all removal requests and the OS has a chance to say "no". This is why dax_kmem today simply leaks request_region() to burn that physical address space for any other usage until the next reboot on a manual unbind event if the memory can't be offlined. However a future to make sure that remove_memory() succeeds after memory_failure() of the same range seems a better semantic than permanently burning physical address space. The topic of remove_memory() failures gets to the question of what happens to active page references when the inopportune ->remove() event happens. For transient pins the ->remove() event will wait for for all pins to be dropped before allowing ->remove() to complete. Since fileystem-dax forbids longterm pins all those pins are transient. Device-dax, on the other hand, does allow longterm pins which means that ->remove() will hang unless / until the longterm pin is dropped. Hopefully an unmap_mapping_range() event is sufficient to get the pin dropped, but I suspect device-dax might need to trigger memory_failure() as well to get the longterm pin holder to wake up and get out of the way (TBD). Lest we repeat the "longterm-pin-revoke" debate, which highlighted that RDMA devices do not respond well to having context torn down, keep in mind that this proposal is to do a best effort recovery of an event that should not happen (surprise removal) under nominal operation. --- Dan Williams (3): mm/memory-failure: Prepare for mass memory_failure() mm, dax, pmem: Introduce dev_pagemap_failure() mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path drivers/dax/super.c | 15 +++++++++++++++ drivers/nvdimm/pmem.c | 10 +++++++++- drivers/nvdimm/pmem.h | 1 + include/linux/dax.h | 5 +++++ include/linux/memremap.h | 5 +++++ include/linux/mm.h | 3 +++ mm/gup.c | 38 ++++++++++++++++---------------------- mm/memory-failure.c | 36 +++++++++++++++++++++++------------- mm/memremap.c | 11 +++++++++++ 9 files changed, 88 insertions(+), 36 deletions(-)
next reply other threads:[~2021-03-18 4:09 UTC|newest] Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-03-18 4:08 Dan Williams [this message] 2021-03-18 4:08 ` [PATCH 1/3] mm/memory-failure: Prepare for mass memory_failure() Dan Williams 2021-03-18 4:08 ` [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure() Dan Williams 2021-03-18 4:45 ` Darrick J. Wong 2021-03-18 19:26 ` Dan Williams 2021-03-18 4:57 ` Dave Chinner 2021-03-18 19:20 ` Dan Williams 2021-03-20 1:46 ` Dave Chinner 2021-03-20 2:39 ` Dan Williams 2021-03-18 4:08 ` [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path Dan Williams 2021-03-18 10:00 ` Joao Martins 2021-03-18 17:03 ` Dan Williams 2021-03-25 14:34 ` Jason Gunthorpe 2021-03-29 23:24 ` Dan Williams 2021-03-30 13:49 ` Jason Gunthorpe 2021-03-24 17:45 ` Dan Williams 2021-03-24 19:00 ` Joao Martins 2021-04-01 19:54 ` Joao Martins 2021-03-25 14:37 ` Jason Gunthorpe 2021-03-25 13:02 ` [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove David Hildenbrand
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --subject='Re: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).