From: Dan Williams <dan.j.williams@intel.com> To: linux-mm@kvack.org, linux-nvdimm@lists.01.org Cc: Jason Gunthorpe <jgg@ziepe.ca>, Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>, "Darrick J. Wong" <djwong@kernel.org>, Dave Chinner <david@fromorbit.com>, Matthew Wilcox <willy@infradead.org>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Shiyang Ruan <ruansy.fnst@fujitsu.com>, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove Date: Wed, 17 Mar 2021 21:08:02 -0700 [thread overview] Message-ID: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com> (raw) Summary: A dax_dev can be unbound from its driver at any time. Unbind can not fail. The driver-core will always trigger ->remove() and the result from ->remove() is ignored. After ->remove() the driver-core proceeds to tear down context. The filesystem-dax implementation can leave pfns mapped after ->remove() if it is triggered while the filesystem is mounted. Security and data-integrity is forfeit if the dax_dev is repurposed for another security domain (new filesystem or change device modes), or if the dax_dev is physically replaced. CXL is a hotplug bus that makes dax_dev physical replace a real world prospect. All dax_dev pfns must be unmapped at remove. Detect the "remove while mounted" case and trigger memory_failure() over the entire dax_dev range. Details: The get_user_pages_fast() path expects all synchronization to be handled by the pattern of checking for pte presence, taking a page reference, and then validating that the pte was stable over that event. The gup-fast path for devmap / DAX pages additionally attempts to take/hold a live reference against the hosting pgmap over the page pin. The rational for the pgmap reference is to synchronize against a dax-device unbind / ->remove() event, but that is unnecessary if pte invalidation is guaranteed in the ->remove() path. Global dax-device pte invalidation *does* happen when the device is in raw "device-dax" mode where there is a single shared inode to unmap at remove, but the filesystem-dax path has a large number of actively mapped inodes unknown to the driver at ->remove() time. So, that unmap does not happen today for filesystem-dax. However, as Jason points out, that unmap / invalidation *needs* to happen not only to cleanup get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev ->remove() is correlated with actual physical removal / replacement the implications of allowing a physical pfn to be exchanged without tearing down old mappings are severe (security and data-integrity). What is not in this patch set is coordination with the dax_kmem driver to trigger memory_failure() when the dax_dev is onlined as "System RAM". The remove_memory() API was built with the assumption that platform firmware negotiates all removal requests and the OS has a chance to say "no". This is why dax_kmem today simply leaks request_region() to burn that physical address space for any other usage until the next reboot on a manual unbind event if the memory can't be offlined. However a future to make sure that remove_memory() succeeds after memory_failure() of the same range seems a better semantic than permanently burning physical address space. The topic of remove_memory() failures gets to the question of what happens to active page references when the inopportune ->remove() event happens. For transient pins the ->remove() event will wait for for all pins to be dropped before allowing ->remove() to complete. Since fileystem-dax forbids longterm pins all those pins are transient. Device-dax, on the other hand, does allow longterm pins which means that ->remove() will hang unless / until the longterm pin is dropped. Hopefully an unmap_mapping_range() event is sufficient to get the pin dropped, but I suspect device-dax might need to trigger memory_failure() as well to get the longterm pin holder to wake up and get out of the way (TBD). Lest we repeat the "longterm-pin-revoke" debate, which highlighted that RDMA devices do not respond well to having context torn down, keep in mind that this proposal is to do a best effort recovery of an event that should not happen (surprise removal) under nominal operation. --- Dan Williams (3): mm/memory-failure: Prepare for mass memory_failure() mm, dax, pmem: Introduce dev_pagemap_failure() mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path drivers/dax/super.c | 15 +++++++++++++++ drivers/nvdimm/pmem.c | 10 +++++++++- drivers/nvdimm/pmem.h | 1 + include/linux/dax.h | 5 +++++ include/linux/memremap.h | 5 +++++ include/linux/mm.h | 3 +++ mm/gup.c | 38 ++++++++++++++++---------------------- mm/memory-failure.c | 36 +++++++++++++++++++++++------------- mm/memremap.c | 11 +++++++++++ 9 files changed, 88 insertions(+), 36 deletions(-) _______________________________________________ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
WARNING: multiple messages have this Message-ID (diff)
From: Dan Williams <dan.j.williams@intel.com> To: linux-mm@kvack.org, linux-nvdimm@lists.01.org Cc: Ira Weiny <ira.weiny@intel.com>, Jason Gunthorpe <jgg@ziepe.ca>, Dave Jiang <dave.jiang@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>, "Darrick J. Wong" <djwong@kernel.org>, Dave Chinner <david@fromorbit.com>, Matthew Wilcox <willy@infradead.org>, Naoya Horiguchi <naoya.horiguchi@nec.com>, Shiyang Ruan <ruansy.fnst@fujitsu.com>, Andrew Morton <akpm@linux-foundation.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org Subject: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove Date: Wed, 17 Mar 2021 21:08:02 -0700 [thread overview] Message-ID: <161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com> (raw) Summary: A dax_dev can be unbound from its driver at any time. Unbind can not fail. The driver-core will always trigger ->remove() and the result from ->remove() is ignored. After ->remove() the driver-core proceeds to tear down context. The filesystem-dax implementation can leave pfns mapped after ->remove() if it is triggered while the filesystem is mounted. Security and data-integrity is forfeit if the dax_dev is repurposed for another security domain (new filesystem or change device modes), or if the dax_dev is physically replaced. CXL is a hotplug bus that makes dax_dev physical replace a real world prospect. All dax_dev pfns must be unmapped at remove. Detect the "remove while mounted" case and trigger memory_failure() over the entire dax_dev range. Details: The get_user_pages_fast() path expects all synchronization to be handled by the pattern of checking for pte presence, taking a page reference, and then validating that the pte was stable over that event. The gup-fast path for devmap / DAX pages additionally attempts to take/hold a live reference against the hosting pgmap over the page pin. The rational for the pgmap reference is to synchronize against a dax-device unbind / ->remove() event, but that is unnecessary if pte invalidation is guaranteed in the ->remove() path. Global dax-device pte invalidation *does* happen when the device is in raw "device-dax" mode where there is a single shared inode to unmap at remove, but the filesystem-dax path has a large number of actively mapped inodes unknown to the driver at ->remove() time. So, that unmap does not happen today for filesystem-dax. However, as Jason points out, that unmap / invalidation *needs* to happen not only to cleanup get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev ->remove() is correlated with actual physical removal / replacement the implications of allowing a physical pfn to be exchanged without tearing down old mappings are severe (security and data-integrity). What is not in this patch set is coordination with the dax_kmem driver to trigger memory_failure() when the dax_dev is onlined as "System RAM". The remove_memory() API was built with the assumption that platform firmware negotiates all removal requests and the OS has a chance to say "no". This is why dax_kmem today simply leaks request_region() to burn that physical address space for any other usage until the next reboot on a manual unbind event if the memory can't be offlined. However a future to make sure that remove_memory() succeeds after memory_failure() of the same range seems a better semantic than permanently burning physical address space. The topic of remove_memory() failures gets to the question of what happens to active page references when the inopportune ->remove() event happens. For transient pins the ->remove() event will wait for for all pins to be dropped before allowing ->remove() to complete. Since fileystem-dax forbids longterm pins all those pins are transient. Device-dax, on the other hand, does allow longterm pins which means that ->remove() will hang unless / until the longterm pin is dropped. Hopefully an unmap_mapping_range() event is sufficient to get the pin dropped, but I suspect device-dax might need to trigger memory_failure() as well to get the longterm pin holder to wake up and get out of the way (TBD). Lest we repeat the "longterm-pin-revoke" debate, which highlighted that RDMA devices do not respond well to having context torn down, keep in mind that this proposal is to do a best effort recovery of an event that should not happen (surprise removal) under nominal operation. --- Dan Williams (3): mm/memory-failure: Prepare for mass memory_failure() mm, dax, pmem: Introduce dev_pagemap_failure() mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path drivers/dax/super.c | 15 +++++++++++++++ drivers/nvdimm/pmem.c | 10 +++++++++- drivers/nvdimm/pmem.h | 1 + include/linux/dax.h | 5 +++++ include/linux/memremap.h | 5 +++++ include/linux/mm.h | 3 +++ mm/gup.c | 38 ++++++++++++++++---------------------- mm/memory-failure.c | 36 +++++++++++++++++++++++------------- mm/memremap.c | 11 +++++++++++ 9 files changed, 88 insertions(+), 36 deletions(-)
next reply other threads:[~2021-03-18 4:08 UTC|newest] Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-03-18 4:08 Dan Williams [this message] 2021-03-18 4:08 ` [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove Dan Williams 2021-03-18 4:08 ` [PATCH 1/3] mm/memory-failure: Prepare for mass memory_failure() Dan Williams 2021-03-18 4:08 ` Dan Williams 2021-03-18 4:08 ` [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure() Dan Williams 2021-03-18 4:08 ` Dan Williams 2021-03-18 4:45 ` Darrick J. Wong 2021-03-18 4:45 ` Darrick J. Wong 2021-03-18 19:26 ` Dan Williams 2021-03-18 19:26 ` Dan Williams 2021-03-18 19:26 ` Dan Williams 2021-03-18 4:57 ` Dave Chinner 2021-03-18 4:57 ` Dave Chinner 2021-03-18 19:20 ` Dan Williams 2021-03-18 19:20 ` Dan Williams 2021-03-18 19:20 ` Dan Williams 2021-03-20 1:46 ` Dave Chinner 2021-03-20 1:46 ` Dave Chinner 2021-03-20 2:39 ` Dan Williams 2021-03-20 2:39 ` Dan Williams 2021-03-20 2:39 ` Dan Williams 2021-03-18 4:08 ` [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path Dan Williams 2021-03-18 4:08 ` Dan Williams 2021-03-18 10:00 ` Joao Martins 2021-03-18 10:00 ` Joao Martins 2021-03-18 17:03 ` Dan Williams 2021-03-18 17:03 ` Dan Williams 2021-03-18 17:03 ` Dan Williams 2021-03-25 14:34 ` Jason Gunthorpe 2021-03-29 23:24 ` Dan Williams 2021-03-29 23:24 ` Dan Williams 2021-03-29 23:24 ` Dan Williams 2021-03-30 13:49 ` Jason Gunthorpe 2021-03-24 17:45 ` Dan Williams 2021-03-24 17:45 ` Dan Williams 2021-03-24 17:45 ` Dan Williams 2021-03-24 19:00 ` Joao Martins 2021-03-24 19:00 ` Joao Martins 2021-04-01 19:54 ` Joao Martins 2021-04-01 19:54 ` Joao Martins 2021-03-25 14:37 ` Jason Gunthorpe 2021-03-25 13:02 ` [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove David Hildenbrand 2021-03-25 13:02 ` David Hildenbrand
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=161604048257.1463742.1374527716381197629.stgit@dwillia2-desk3.amr.corp.intel.com \ --to=dan.j.williams@intel.com \ --cc=akpm@linux-foundation.org \ --cc=david@fromorbit.com \ --cc=djwong@kernel.org \ --cc=hch@lst.de \ --cc=jack@suse.cz \ --cc=jgg@ziepe.ca \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=linux-nvdimm@lists.01.org \ --cc=naoya.horiguchi@nec.com \ --cc=ruansy.fnst@fujitsu.com \ --cc=willy@infradead.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.