* [PATCH v3 00/25] Fix the DAX-gup mistake
@ 2022-10-14 23:56 Dan Williams
  2022-10-14 23:57 ` [PATCH v3 01/25] fsdax: Wait on @page not @page->_refcount Dan Williams
                   ` (24 more replies)
  0 siblings, 25 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Alex Deucher, Ben Skeggs, Jason Gunthorpe, Alistair Popple,
	Jan Kara, Lyude Paul, Karol Herbst, Christoph Hellwig,
	Jérôme Glisse, David Airlie, Darrick J. Wong,
	Matthew Wilcox, Dave Chinner, John Hubbard, kernel test robot,
	Pan, Xinhui, kernel test robot, Christian König,
	Felix Kuehling, Daniel Vetter, nvdimm, akpm, linux-fsdevel

Changes since v2 [1]:
- All of the band-aids in the VFS and filesystems have been dropped. Dave
  made the observation that at iput_final() time the inode is idle, so
  a common dax_break_layouts() in the fsdax core is suitable to replace
  the new {ext4,xfs}_break_layouts() calls in ->drop_inode(). Additionally,
  the wait for page pins can be handled in the existing
  dax_delete_mapping_entry() called by truncate_inode_pages_final()
  (Dave).

- With the code movement the kbuild robot noticed some long-standing
  sparse issues; those and some other fixups are addressed.

- Updated the cover letter to try to make the story of how we got here
  easier to follow.

- Rebased on mm-stable, see base-commit. That base includes Alistair's
  recent refcount fixups for MEMORY_DEVICE_PRIVATE, on top of which this
  series reworks zone_device_page_init() into a new
  pgmap_request_folios() interface.

[1]: http://lore.kernel.org/r/166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com

My expectation is that this goes through -mm since it depends on changes
in mm-stable.

---

ZONE_DEVICE was created to allow for get_user_pages() on DAX mappings.
It has since grown other users, but the main motivation was gup and
synchronizing device shutdown events with pinned pages. Memory
device shutdown triggers driver ->remove(), and ->remove() always
succeeds, so ZONE_DEVICE needed a mechanism to stop new page references
from being taken and to await existing references draining before
allowing device removal to proceed. This is the origin of 'struct
dev_pagemap' and its percpu_ref.
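
A rough sketch of that shutdown pattern as it exists in mm/memremap.c
today (condensed, not the literal code):

	/* stop new 0 -> 1 page references from succeeding */
	percpu_ref_kill(&pgmap->ref);

	/* block until the last outstanding page reference is dropped */
	wait_for_completion(&pgmap->done);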

The original ZONE_DEVICE implementation took advantage of the fact that
'struct page' initialization, for typical page allocator pages, starts
pages at a refcount of 1. Later those pages are 'onlined' by freeing
them to the page allocator via put_page() to drop that initial reference
and populate the free page lists. ZONE_DEVICE abused that "initialized
but never freed" state both to avoid leaking ZONE_DEVICE pages into
places that were not ready for them, and to stash some metadata in the
otherwise unused (because the refcount never reached 0) page->lru space.
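
Roughly, and only as an illustration of that historical model (not
literal kernel code):

	/* typical page: memmap init hands it out with one reference... */
	init_page_count(page);		/* refcount == 1 */
	/* ...and "onlining" frees that reference to the page allocator */
	put_page(page);			/* refcount == 0, page is allocatable */

	/* ZONE_DEVICE, old model: keep that initial reference forever */
	init_page_count(page);		/* refcount == 1, never put */
	/* page->lru never sees an lru, so it was repurposed for metadata */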

As more users of ZONE_DEVICE arrived, that special casing became
increasingly unnecessary and increasingly expensive. The 'struct page'
modernization eliminated the need for the ->lru hack. The folio work had
to remember to sprinkle special-case ZONE_DEVICE accounting in the right
places. The MEMORY_DEVICE_PRIVATE use case spearheaded much of the work
to support typical reference counting for ZONE_DEVICE pages and allow
them to be used in more kernel code paths. All the while the DAX case
kept its tech debt in place, until now.

However, while fixing the DAX page refcount semantics and arranging for
free_zone_device_page() to be the common end-of-life path for all
ZONE_DEVICE pages, the mitigation for truncate() vs pinned DAX pages was
found to be incomplete. Unlike typical pages, which can surprisingly
remain pinned for DMA after they have been truncated from a file, the
DAX core must enforce that nobody has access to a page after truncate()
has disconnected it from inode->i_pages. I.e. a DAX page is identical to
a file block, and that block remains an extent that the filesystem is
free to reuse once truncated. The existing mitigation handled explicit
truncate while the inode was alive, but not the implicit truncate right
before the inode is freed.

So, in addition to moving DAX pages to be refcount-0 at idle, and adding
'break_layouts' wakeups to free_zone_device_page(), this series also
introduces another occurrence of 'break_layouts' in the inode-freeing
path. Recall that 'break_layouts' for DAX is the mechanism to wait for
code paths that were previously granted page access to drop their
interest / page pins. This new synchronization point is implemented by
special casing dax_delete_mapping_entry(), called by
truncate_inode_pages_final(), to await page pins when mapping_exiting()
is true.
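
Condensed, that new synchronization point amounts to (see patch 5 for
the real change):

	int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
	{
		/* inode is unreferenced, wait indefinitely for pins to drop */
		if (mapping_exiting(mapping))
			dax_break_layout(mapping, index);

		return __dax_invalidate_entry(mapping, index, true);
	}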

Thanks to Jason for the nudge to get this fixed up properly and the
review on v1, Dave, Jan, and Jason for the discussion on what to do
about the inode end-of-life-truncate problem, and Alistair for cleaning
up the last of the refcount-1 assumptions in the MEMORY_DEVICE_PRIVATE
users.

---

Dan Williams (25):
      fsdax: Wait on @page not @page->_refcount
      fsdax: Use dax_page_idle() to document DAX busy page checking
      fsdax: Include unmapped inodes for page-idle detection
      fsdax: Introduce dax_zap_mappings()
      fsdax: Wait for pinned pages during truncate_inode_pages_final()
      fsdax: Validate DAX layouts broken before truncate
      fsdax: Hold dax lock over mapping insertion
      fsdax: Update dax_insert_entry() calling convention to return an error
      fsdax: Rework for_each_mapped_pfn() to dax_for_each_folio()
      fsdax: Introduce pgmap_request_folios()
      fsdax: Rework dax_insert_entry() calling convention
      fsdax: Cleanup dax_associate_entry()
      devdax: Minor warning fixups
      devdax: Fix sparse lock imbalance warning
      libnvdimm/pmem: Support pmem block devices without dax
      devdax: Move address_space helpers to the DAX core
      devdax: Sparse fixes for xarray locking
      devdax: Sparse fixes for vmfault_t / dax-entry conversions
      devdax: Sparse fixes for vm_fault_t in tracepoints
      devdax: add PUD support to the DAX mapping infrastructure
      devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
      mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios()
      mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0
      mm/memremap_pages: Delete put_devmap_managed_page_refs()
      mm/gup: Drop DAX pgmap accounting


 .clang-format                            |    1 
 arch/powerpc/kvm/book3s_hv_uvmem.c       |    3 
 drivers/Makefile                         |    2 
 drivers/dax/Kconfig                      |    5 
 drivers/dax/Makefile                     |    1 
 drivers/dax/bus.c                        |    9 
 drivers/dax/dax-private.h                |    2 
 drivers/dax/device.c                     |   73 +-
 drivers/dax/mapping.c                    | 1083 ++++++++++++++++++++++++++++++
 drivers/dax/super.c                      |   10 
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |    3 
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |    3 
 drivers/nvdimm/Kconfig                   |    3 
 drivers/nvdimm/pmem.c                    |   47 +
 fs/dax.c                                 | 1069 ++----------------------------
 fs/ext4/inode.c                          |    9 
 fs/fuse/dax.c                            |    9 
 fs/xfs/xfs_file.c                        |    7 
 fs/xfs/xfs_inode.c                       |    4 
 include/linux/dax.h                      |  149 +++-
 include/linux/huge_mm.h                  |   26 -
 include/linux/memremap.h                 |   22 +
 include/linux/mm.h                       |   30 -
 include/linux/mm_types.h                 |   26 -
 include/trace/events/fs_dax.h            |   16 
 lib/test_hmm.c                           |    3 
 mm/gup.c                                 |   89 +-
 mm/huge_memory.c                         |   48 -
 mm/memremap.c                            |  123 ++-
 mm/page_alloc.c                          |    9 
 mm/swap.c                                |    2 
 31 files changed, 1535 insertions(+), 1351 deletions(-)
 create mode 100644 drivers/dax/mapping.c

base-commit: ef6e06b2ef87077104d1145a0fd452ff8dbbc4b7


* [PATCH v3 01/25] fsdax: Wait on @page not @page->_refcount
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-14 23:57 ` [PATCH v3 02/25] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Jason Gunthorpe, david, nvdimm, akpm,
	linux-fsdevel

The __wait_var_event facility calculates a wait queue from a hash of the
address of the variable being passed. Use the @page argument directly as
it is less to type and is the object that is being waited upon.
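
The only requirement is that the waiter and the waker hash the same key,
e.g. (condensed from the hunks below):

	/* waiter, fs/ext4/inode.c et al */
	error = ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
				  TASK_INTERRUPTIBLE, 0, 0,
				  ext4_wait_dax_page(inode));

	/* waker, mm/memremap.c */
	wake_up_var(page);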

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/inode.c   |    8 ++++----
 fs/fuse/dax.c     |    6 +++---
 fs/xfs/xfs_file.c |    6 +++---
 mm/memremap.c     |    2 +-
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 601214453c3a..b028a4413bea 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3961,10 +3961,10 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(inode));
+		error = ___wait_var_event(page,
+					  atomic_read(&page->_refcount) == 1,
+					  TASK_INTERRUPTIBLE, 0, 0,
+					  ext4_wait_dax_page(inode));
 	} while (error == 0);
 
 	return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index e23e802a8013..4e12108c68af 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,9 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, fuse_wait_dax_page(inode));
+	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
+				 TASK_INTERRUPTIBLE, 0, 0,
+				 fuse_wait_dax_page(inode));
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6c80265c0b2..73e7b7ec0a4c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -827,9 +827,9 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
+				 TASK_INTERRUPTIBLE, 0, 0,
+				 xfs_wait_dax_page(inode));
 }
 
 int
diff --git a/mm/memremap.c b/mm/memremap.c
index 421bec3a29ee..f9287babb3ce 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -542,7 +542,7 @@ bool __put_devmap_managed_page_refs(struct page *page, int refs)
 	 * stable because nobody holds a reference on the page.
 	 */
 	if (page_ref_sub_return(page, refs) == 1)
-		wake_up_var(&page->_refcount);
+		wake_up_var(page);
 	return true;
 }
 EXPORT_SYMBOL(__put_devmap_managed_page_refs);



* [PATCH v3 02/25] fsdax: Use dax_page_idle() to document DAX busy page checking
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
  2022-10-14 23:57 ` [PATCH v3 01/25] fsdax: Wait on @page not @page->_refcount Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-14 23:57 ` [PATCH v3 03/25] fsdax: Include unmapped inodes for page-idle detection Dan Williams
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Jason Gunthorpe, david, nvdimm, akpm,
	linux-fsdevel

In advance of converting DAX pages to be 0-based, use a new
dax_page_idle() helper to both simplify that future conversion and
document all the kernel locations that are watching for DAX page idle
events.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |    4 ++--
 fs/ext4/inode.c     |    3 +--
 fs/fuse/dax.c       |    5 ++---
 fs/xfs/xfs_file.c   |    5 ++---
 include/linux/dax.h |    9 +++++++++
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c440dcef4b1b..e762b9c04fb4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -395,7 +395,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_page_idle(page));
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
 			if (page->index-- > 0)
@@ -414,7 +414,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (page_ref_count(page) > 1)
+		if (!dax_page_idle(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b028a4413bea..478ec6bc0935 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3961,8 +3961,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(page,
-					  atomic_read(&page->_refcount) == 1,
+		error = ___wait_var_event(page, dax_page_idle(page),
 					  TASK_INTERRUPTIBLE, 0, 0,
 					  ext4_wait_dax_page(inode));
 	} while (error == 0);
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 4e12108c68af..ae52ef7dbabe 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,8 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
-				 TASK_INTERRUPTIBLE, 0, 0,
-				 fuse_wait_dax_page(inode));
+	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
+				 0, 0, fuse_wait_dax_page(inode));
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 73e7b7ec0a4c..556e28d06788 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -827,9 +827,8 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
-				 TASK_INTERRUPTIBLE, 0, 0,
-				 xfs_wait_dax_page(inode));
+	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
+				 0, 0, xfs_wait_dax_page(inode));
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ba985333e26b..04987d14d7e0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -210,6 +210,15 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops);
 
+/*
+ * Document all the code locations that want to know when a dax page is
+ * unreferenced.
+ */
+static inline bool dax_page_idle(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
 void dax_read_unlock(int id);



* [PATCH v3 03/25] fsdax: Include unmapped inodes for page-idle detection
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
  2022-10-14 23:57 ` [PATCH v3 01/25] fsdax: Wait on @page not @page->_refcount Dan Williams
  2022-10-14 23:57 ` [PATCH v3 02/25] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-14 23:57 ` [PATCH v3 04/25] fsdax: Introduce dax_zap_mappings() Dan Williams
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

A page can remain pinned even after it has been unmapped from userspace
/ removed from the rmap. In advance of requiring that all
dax_insert_entry() events are followed up by 'break layouts' before a
truncate event, make sure that 'break layouts' can find unmapped
entries.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index e762b9c04fb4..76bad1c095c0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -698,7 +698,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return NULL;
 
-	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+	if (!dax_mapping(mapping))
 		return NULL;
 
 	/* If end == LLONG_MAX, all pages from start to till end of file */



* [PATCH v3 04/25] fsdax: Introduce dax_zap_mappings()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (2 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 03/25] fsdax: Include unmapped inodes for page-idle detection Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-11-02 13:04   ` Aneesh Kumar K.V
  2022-10-14 23:57 ` [PATCH v3 05/25] fsdax: Wait for pinned pages during truncate_inode_pages_final() Dan Williams
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

Typical pages take a reference at pte insertion and drop it at
zap_pte_range() time. That reference management is missing for DAX,
leading to a situation where DAX pages are mapped in user page tables,
but are not referenced.

Once fsdax decides it wants to unmap the page it can drop its reference,
but unlike typical pages it needs to maintain the association of the
page to the inode that arbitrated the access in the first instance. It
maintains that association until explicit truncate(), or the implicit
truncate() that occurs at inode death, truncate_inode_pages_final().

The zapped state tracks whether fsdax has dropped its interest in a
page, but still allows the associated i_pages entry to live until
truncate. This facilitates inode lookup while awaiting any page pin
users to drop their pins. For example, if memory_failure() is triggered
on the page after it has been unmapped, but before it has been truncated
from the inode, memory_failure() can still associate the event with the
inode.

Once truncate begins fsdax unmaps the page to prevent any new references
from being taken without calling back into fsdax core to reestablish
the mapping.

This approach relies on all paths that call truncate_inode_pages() to
first call dax_zap_mappings(). For that another bandaid is needed to add
this 'zap' step to the truncate_inode_pages_final() path, but that is
saved for a follow-on patch.
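
Roughly, the intended ordering for any truncate-like path becomes (a
sketch of the flow, not literal code):

	/* 1/ zap: unmap, drop fsdax's references, mark entries DAX_ZAP */
	page = dax_zap_mappings(inode->i_mapping);

	/* 2/ wait for any remaining page pins (e.g. gup users) to drain */
	if (page)
		wait_var_event(page, dax_page_idle(page));

	/* 3/ only then delete mapping entries and free the file blocks */
	truncate_inode_pages(inode->i_mapping, 0);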

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |   72 +++++++++++++++++++++++++++++++++++----------------
 fs/ext4/inode.c     |    2 +
 fs/fuse/dax.c       |    4 +--
 fs/xfs/xfs_file.c   |    2 +
 fs/xfs/xfs_inode.c  |    4 +--
 include/linux/dax.h |   11 +++++---
 6 files changed, 63 insertions(+), 32 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 76bad1c095c0..a75d4bf541b4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -74,11 +74,12 @@ fs_initcall(init_dax_wait_table);
  * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
  * block allocation.
  */
-#define DAX_SHIFT	(4)
+#define DAX_SHIFT	(5)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
 #define DAX_ZERO_PAGE	(1UL << 2)
 #define DAX_EMPTY	(1UL << 3)
+#define DAX_ZAP		(1UL << 4)
 
 static unsigned long dax_to_pfn(void *entry)
 {
@@ -95,6 +96,11 @@ static bool dax_is_locked(void *entry)
 	return xa_to_value(entry) & DAX_LOCKED;
 }
 
+static bool dax_is_zapped(void *entry)
+{
+	return xa_to_value(entry) & DAX_ZAP;
+}
+
 static unsigned int dax_entry_order(void *entry)
 {
 	if (xa_to_value(entry) & DAX_PMD)
@@ -407,19 +413,6 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
-static struct page *dax_busy_page(void *entry)
-{
-	unsigned long pfn;
-
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		if (!dax_page_idle(page))
-			return page;
-	}
-	return NULL;
-}
-
 /*
  * dax_lock_page - Lock the DAX entry corresponding to a page
  * @page: The page whose entry we want to lock
@@ -664,8 +657,43 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
 
+static void *dax_zap_entry(struct xa_state *xas, void *entry)
+{
+	unsigned long v = xa_to_value(entry);
+
+	return xas_store(xas, xa_mk_value(v | DAX_ZAP));
+}
+
+/**
+ * Return NULL if the entry is zapped and all pages in the entry are
+ * idle, otherwise return the non-idle page in the entry
+ */
+static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
+{
+	struct page *ret = NULL;
+	unsigned long pfn;
+	bool zap;
+
+	if (!dax_entry_size(entry))
+		return NULL;
+
+	zap = !dax_is_zapped(entry);
+
+	for_each_mapped_pfn(entry, pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (!ret && !dax_page_idle(page))
+			ret = page;
+	}
+
+	if (zap)
+		dax_zap_entry(xas, entry);
+
+	return ret;
+}
+
 /**
- * dax_layout_busy_page_range - find first pinned page in @mapping
+ * dax_zap_mappings_range - find first pinned page in @mapping
  * @mapping: address space to scan for a page with ref count > 1
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
@@ -682,8 +710,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
  * to be able to run unmap_mapping_range() and subsequently not race
  * mapping_mapped() becoming true.
  */
-struct page *dax_layout_busy_page_range(struct address_space *mapping,
-					loff_t start, loff_t end)
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end)
 {
 	void *entry;
 	unsigned int scanned = 0;
@@ -727,7 +755,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
 		if (unlikely(dax_is_locked(entry)))
 			entry = get_unlocked_entry(&xas, 0);
 		if (entry)
-			page = dax_busy_page(entry);
+			page = dax_zap_pages(&xas, entry);
 		put_unlocked_entry(&xas, entry, WAKE_NEXT);
 		if (page)
 			break;
@@ -742,13 +770,13 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
 	xas_unlock_irq(&xas);
 	return page;
 }
-EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
 
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_zap_mappings(struct address_space *mapping)
 {
-	return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
+	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
 }
-EXPORT_SYMBOL_GPL(dax_layout_busy_page);
+EXPORT_SYMBOL_GPL(dax_zap_mappings);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 478ec6bc0935..3935af49df8b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3957,7 +3957,7 @@ int ext4_break_layouts(struct inode *inode)
 		return -EINVAL;
 
 	do {
-		page = dax_layout_busy_page(inode->i_mapping);
+		page = dax_zap_mappings(inode->i_mapping);
 		if (!page)
 			return 0;
 
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index ae52ef7dbabe..8cdc9402e8f7 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -443,7 +443,7 @@ static int fuse_setup_new_dax_mapping(struct inode *inode, loff_t pos,
 
 	/*
 	 * Can't do inline reclaim in fault path. We call
-	 * dax_layout_busy_page() before we free a range. And
+	 * dax_zap_mappings() before we free a range. And
 	 * fuse_wait_dax_page() drops mapping->invalidate_lock and requires it.
 	 * In fault path we enter with mapping->invalidate_lock held and can't
 	 * drop it. Also in fault path we hold mapping->invalidate_lock shared
@@ -671,7 +671,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 {
 	struct page *page;
 
-	page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+	page = dax_zap_mappings_range(inode->i_mapping, start, end);
 	if (!page)
 		return 0;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 556e28d06788..ca0afcdd98c0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -822,7 +822,7 @@ xfs_break_dax_layouts(
 
 	ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
 
-	page = dax_layout_busy_page(inode->i_mapping);
+	page = dax_zap_mappings(inode->i_mapping);
 	if (!page)
 		return 0;
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 28493c8e9bb2..d48dfee01008 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3481,8 +3481,8 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
 	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
 	 * for this nested lock case.
 	 */
-	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
-	if (page && page_ref_count(page) != 1) {
+	page = dax_zap_mappings(VFS_I(ip2)->i_mapping);
+	if (page) {
 		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
 		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
 		goto again;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 04987d14d7e0..f6acb4ed73cb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,8 +157,9 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
-struct page *dax_layout_busy_page(struct address_space *mapping);
-struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
+struct page *dax_zap_mappings(struct address_space *mapping);
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
 dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
@@ -166,12 +167,14 @@ dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
 void dax_unlock_mapping_entry(struct address_space *mapping,
 		unsigned long index, dax_entry_t cookie);
 #else
-static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+static inline struct page *dax_zap_mappings(struct address_space *mapping)
 {
 	return NULL;
 }
 
-static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t start, pgoff_t nr_pages)
+static inline struct page *dax_zap_mappings_range(struct address_space *mapping,
+						  pgoff_t start,
+						  pgoff_t nr_pages)
 {
 	return NULL;
 }



* [PATCH v3 05/25] fsdax: Wait for pinned pages during truncate_inode_pages_final()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (3 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 04/25] fsdax: Introduce dax_zap_mappings() Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-14 23:57 ` [PATCH v3 06/25] fsdax: Validate DAX layouts broken before truncate Dan Williams
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, Dave Chinner, nvdimm, akpm,
	linux-fsdevel

The fsdax truncate vs page pinning solution is incomplete. The initial
solution landed in v4.17 and covered typical truncate invoked through
truncate(2) and fallocate(2), i.e. the truncate_inode_pages() called on
open files. However, that enabling left truncate_inode_pages_final(),
called after iput_final() to free the inode, unprotected. Thankfully
that v4.17 enabling also left a warning in place to fire if any truncate
is attempted while a DAX page is still pinned:

commit d2c997c0f145 ("fs, dax: use page->mapping to warn if truncate collides with a busy page")

While a lore search indicates no reports of that firing, the hole is
there nonetheless. The concern is that if/when that warning fires it
indicates a use-after-free condition whereby the filesystem has lost the
ability to arbitrate access to its storage blocks. For example, in the
worst case, DMA may be ongoing while the filesystem thinks the block is
free to be reallocated to another inode.

This patch is based on an observation from Dave that during iput_final()
there is no need to hold filesystem locks the way the explicit truncate
path does. The wait can occur from within dax_delete_mapping_entry(),
called by truncate_folio_batch_exceptionals().

This solution trades off fixing the use-after-free with a theoretical
deadlock scenario. If the agent holding the page pin triggers inode
reclaim and that reclaim waits for the pin to drop it will deadlock. Two
observations make this approach still worth pursuing:

1/ Any existing scenarios where that happens would have triggered the
   warning referenced above which has shipped upstream for ~5 years
   without a bug report on lore.

2/ Most I/O drivers only hold page pins in their fast paths and new
   __GFP_FS allocations are unlikely in a driver fast path. I.e. if the
   deadlock triggers the likely fix would be in the offending driver, not
   new band-aids in fsdax.

So, update the DAX core to notice that the inode->i_mapping is in the
exiting state and use that as a signal that the inode is unreferenced,
making it safe to await page pins to drain.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index a75d4bf541b4..e3deb60a792f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -803,13 +803,37 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	return ret;
 }
 
+/*
+ * wait indefinitely for all pins to drop, the alternative to waiting is
+ * a potential use-after-free scenario
+ */
+static void dax_break_layout(struct address_space *mapping, pgoff_t index)
+{
+	/* To do this without locks, the inode needs to be unreferenced */
+	WARN_ON(atomic_read(&mapping->host->i_count));
+	do {
+		struct page *page;
+
+		page = dax_zap_mappings_range(mapping, index << PAGE_SHIFT,
+					      (index + 1) << PAGE_SHIFT);
+		if (!page)
+			return;
+		wait_var_event(page, dax_page_idle(page));
+	} while (true);
+}
+
 /*
  * Delete DAX entry at @index from @mapping.  Wait for it
  * to be unlocked before deleting it.
  */
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 {
-	int ret = __dax_invalidate_entry(mapping, index, true);
+	int ret;
+
+	if (mapping_exiting(mapping))
+		dax_break_layout(mapping, index);
+
+	ret = __dax_invalidate_entry(mapping, index, true);
 
 	/*
 	 * This gets called from truncate / punch_hole path. As such, the caller



* [PATCH v3 06/25] fsdax: Validate DAX layouts broken before truncate
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (4 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 05/25] fsdax: Wait for pinned pages during truncate_inode_pages_final() Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-14 23:57 ` [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion Dan Williams
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, Dave Chinner, nvdimm, akpm,
	linux-fsdevel

Now that iput_final() arranges for dax_break_layouts(), all
truncate_inode_pages() paths in the kernel ensure that no DAX pages
hosted by that inode are in use. Add warnings to assert the new entry
state transitions.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index e3deb60a792f..1d4f0072e58d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -401,13 +401,15 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && !dax_page_idle(page));
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
 			if (page->index-- > 0)
 				continue;
-		} else
+		} else {
+			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
+			WARN_ON_ONCE(trunc && !dax_page_idle(page));
 			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		}
 		page->mapping = NULL;
 		page->index = 0;
 	}



* [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (5 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 06/25] fsdax: Validate DAX layouts broken before truncate Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-17 19:31   ` Jason Gunthorpe
  2022-10-14 23:57 ` [PATCH v3 08/25] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

In preparation for dax_insert_entry() to start taking page and pgmap
references, ensure that page->pgmap is valid by holding the
dax_read_lock() over both dax_direct_access() and dax_insert_entry().

I.e. the code that wants to elevate the reference count of a pgmap page
from 0 -> 1 must ensure that the pgmap is not exiting and will not start
exiting until the proper references have been taken.
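
The resulting ordering in the fault path (condensed from the hunk
below):

	id = dax_read_lock();
	err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
	if (err) {
		dax_read_unlock(id);
		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
	}

	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
	dax_read_unlock(id);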

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1d4f0072e58d..6990a6e7df9f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1107,10 +1107,9 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
 		size_t size, void **kaddr, pfn_t *pfnp)
 {
 	pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
-	int id, rc = 0;
 	long length;
+	int rc = 0;
 
-	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
 				   DAX_ACCESS, kaddr, pfnp);
 	if (length < 0) {
@@ -1135,7 +1134,6 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
 	if (!*kaddr)
 		rc = -EFAULT;
 out:
-	dax_read_unlock(id);
 	return rc;
 }
 
@@ -1588,7 +1586,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
 	bool write = iter->flags & IOMAP_WRITE;
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
-	int err = 0;
+	int err = 0, id;
 	pfn_t pfn;
 	void *kaddr;
 
@@ -1608,11 +1606,15 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
 	}
 
+	id = dax_read_lock();
 	err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
-	if (err)
+	if (err) {
+		dax_read_unlock(id);
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
+	}
 
 	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
+	dax_read_unlock(id);
 
 	if (write &&
 	    srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {



* [PATCH v3 08/25] fsdax: Update dax_insert_entry() calling convention to return an error
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (6 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-14 23:57 ` [PATCH v3 09/25] fsdax: Rework for_each_mapped_pfn() to dax_for_each_folio() Dan Williams
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

In preparation for teaching dax_insert_entry() to take live @pgmap
references, enable it to return errors. Given the observation that all
callers overwrite the passed in entry with the return value, just update
@entry in place and convert the return code to a vm_fault_t status.
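
Call sites move from overwriting the returned entry to checking a
status, e.g. (condensed from the hunks below):

	/* before */
	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);

	/* after */
	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags);
	if (ret)
		return ret;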

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6990a6e7df9f..1f6c1abfe0c9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -907,14 +907,15 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-		const struct iomap_iter *iter, void *entry, pfn_t pfn,
-		unsigned long flags)
+static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+				   const struct iomap_iter *iter, void **pentry,
+				   pfn_t pfn, unsigned long flags)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
 	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
 	bool cow = dax_fault_is_cow(iter);
+	void *entry = *pentry;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -960,7 +961,9 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
 
 	xas_unlock_irq(xas);
-	return entry;
+
+	*pentry = entry;
+	return 0;
 }
 
 static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
@@ -1206,9 +1209,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE);
+	if (ret)
+		goto out;
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
 }
@@ -1225,6 +1231,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	struct page *zero_page;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	vm_fault_t ret;
 	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
@@ -1233,8 +1240,10 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
-				  DAX_PMD | DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn,
+			       DAX_PMD | DAX_ZERO_PAGE);
+	if (ret)
+		return ret;
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1587,6 +1596,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	bool write = iter->flags & IOMAP_WRITE;
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
 	int err = 0, id;
+	vm_fault_t ret;
 	pfn_t pfn;
 	void *kaddr;
 
@@ -1613,8 +1623,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 	}
 
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags);
 	dax_read_unlock(id);
+	if (ret)
+		return ret;
 
 	if (write &&
 	    srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {



* [PATCH v3 09/25] fsdax: Rework for_each_mapped_pfn() to dax_for_each_folio()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (7 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 08/25] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-14 23:57 ` [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios() Dan Williams
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

In preparation for requesting folios from a pgmap, rework
for_each_mapped_pfn() to operate in terms of folios.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 .clang-format       |    1 +
 fs/dax.c            |  102 ++++++++++++++++++++++++++++++---------------------
 include/linux/dax.h |    5 +++
 3 files changed, 66 insertions(+), 42 deletions(-)

diff --git a/.clang-format b/.clang-format
index 1247d54f9e49..767651ddc50c 100644
--- a/.clang-format
+++ b/.clang-format
@@ -136,6 +136,7 @@ ForEachMacros:
   - 'data__for_each_file'
   - 'data__for_each_file_new'
   - 'data__for_each_file_start'
+  - 'dax_for_each_folio'
   - 'device_for_each_child_node'
   - 'displayid_iter_for_each'
   - 'dma_fence_array_for_each'
diff --git a/fs/dax.c b/fs/dax.c
index 1f6c1abfe0c9..d03c7a952d02 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -327,18 +327,41 @@ static unsigned long dax_entry_size(void *entry)
 		return PAGE_SIZE;
 }
 
-static unsigned long dax_end_pfn(void *entry)
+/*
+ * Until fsdax constructs compound folios it needs to be prepared to
+ * support multiple folios per entry where each folio is a single page
+ */
+static struct folio *dax_entry_to_folio(void *entry, int idx)
 {
-	return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
+	unsigned long pfn, size = dax_entry_size(entry);
+	struct page *page;
+	struct folio *folio;
+
+	if (!size)
+		return NULL;
+
+	pfn = dax_to_pfn(entry);
+	page = pfn_to_page(pfn);
+	folio = page_folio(page);
+
+	/*
+	 * Are there multiple folios per entry, and has the iterator
+	 * passed the end of that set?
+	 */
+	if (idx >= size / folio_size(folio))
+		return NULL;
+
+	VM_WARN_ON_ONCE(!IS_ALIGNED(size, folio_size(folio)));
+
+	return page_folio(page + idx);
 }
 
 /*
- * Iterate through all mapped pfns represented by an entry, i.e. skip
- * 'empty' and 'zero' entries.
+ * Iterate through all folios associated with a given entry
  */
-#define for_each_mapped_pfn(entry, pfn) \
-	for (pfn = dax_to_pfn(entry); \
-			pfn < dax_end_pfn(entry); pfn++)
+#define dax_for_each_folio(entry, folio, i)                      \
+	for (i = 0, folio = dax_entry_to_folio(entry, i); folio; \
+	     folio = dax_entry_to_folio(entry, ++i))
 
 static inline bool dax_mapping_is_cow(struct address_space *mapping)
 {
@@ -348,18 +371,18 @@ static inline bool dax_mapping_is_cow(struct address_space *mapping)
 /*
  * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
  */
-static inline void dax_mapping_set_cow(struct page *page)
+static inline void dax_mapping_set_cow(struct folio *folio)
 {
-	if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+	if ((uintptr_t)folio->mapping != PAGE_MAPPING_DAX_COW) {
 		/*
-		 * Reset the index if the page was already mapped
+		 * Reset the index if the folio was already mapped
 		 * regularly before.
 		 */
-		if (page->mapping)
-			page->index = 1;
-		page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+		if (folio->mapping)
+			folio->index = 1;
+		folio->mapping = (void *)PAGE_MAPPING_DAX_COW;
 	}
-	page->index++;
+	folio->index++;
 }
 
 /*
@@ -370,48 +393,45 @@ static inline void dax_mapping_set_cow(struct page *page)
 static void dax_associate_entry(void *entry, struct address_space *mapping,
 		struct vm_area_struct *vma, unsigned long address, bool cow)
 {
-	unsigned long size = dax_entry_size(entry), pfn, index;
-	int i = 0;
+	unsigned long size = dax_entry_size(entry), index;
+	struct folio *folio;
+	int i;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
 	index = linear_page_index(vma, address & ~(size - 1));
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
+	dax_for_each_folio(entry, folio, i)
 		if (cow) {
-			dax_mapping_set_cow(page);
+			dax_mapping_set_cow(folio);
 		} else {
-			WARN_ON_ONCE(page->mapping);
-			page->mapping = mapping;
-			page->index = index + i++;
+			WARN_ON_ONCE(folio->mapping);
+			folio->mapping = mapping;
+			folio->index = index + i;
 		}
-	}
 }
 
 static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		bool trunc)
 {
-	unsigned long pfn;
+	struct folio *folio;
+	int i;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		if (dax_mapping_is_cow(page->mapping)) {
-			/* keep the CoW flag if this page is still shared */
-			if (page->index-- > 0)
+	dax_for_each_folio(entry, folio, i) {
+		if (dax_mapping_is_cow(folio->mapping)) {
+			/* keep the CoW flag if this folio is still shared */
+			if (folio->index-- > 0)
 				continue;
 		} else {
 			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
-			WARN_ON_ONCE(trunc && !dax_page_idle(page));
-			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+			WARN_ON_ONCE(trunc && !dax_folio_idle(folio));
+			WARN_ON_ONCE(folio->mapping && folio->mapping != mapping);
 		}
-		page->mapping = NULL;
-		page->index = 0;
+		folio->mapping = NULL;
+		folio->index = 0;
 	}
 }
 
@@ -673,20 +693,18 @@ static void *dax_zap_entry(struct xa_state *xas, void *entry)
 static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
 {
 	struct page *ret = NULL;
-	unsigned long pfn;
+	struct folio *folio;
 	bool zap;
+	int i;
 
 	if (!dax_entry_size(entry))
 		return NULL;
 
 	zap = !dax_is_zapped(entry);
 
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		if (!ret && !dax_page_idle(page))
-			ret = page;
-	}
+	dax_for_each_folio(entry, folio, i)
+		if (!ret && !dax_folio_idle(folio))
+			ret = folio_page(folio, 0);
 
 	if (zap)
 		dax_zap_entry(xas, entry);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f6acb4ed73cb..12e15ca11bff 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -222,6 +222,11 @@ static inline bool dax_page_idle(struct page *page)
 	return page_ref_count(page) == 1;
 }
 
+static inline bool dax_folio_idle(struct folio *folio)
+{
+	return dax_page_idle(folio_page(folio, 0));
+}
+
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
 void dax_read_unlock(int id);



* [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (8 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 09/25] fsdax: Rework for_each_mapped_pfn() to dax_for_each_folio() Dan Williams
@ 2022-10-14 23:57 ` Dan Williams
  2022-10-17  6:31   ` Alistair Popple
  2022-10-17 19:41   ` Jason Gunthorpe
  2022-10-14 23:58 ` [PATCH v3 11/25] fsdax: Rework dax_insert_entry() calling convention Dan Williams
                   ` (14 subsequent siblings)
  24 siblings, 2 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Alistair Popple, Jason Gunthorpe, david, nvdimm,
	akpm, linux-fsdevel

The next step in sanitizing DAX page and pgmap lifetime is to take page
references when a pgmap user maps a page or otherwise puts it into use.
Unlike the page allocator, where it picks the page/folio, ZONE_DEVICE
users know in advance which folio they want to access.  Additionally,
ZONE_DEVICE implementations know when the pgmap is alive. Introduce
pgmap_request_folios(), which pins @nr_folios folios at a time provided
they are contiguous and of the same folio_order().

Some WARN assertions are added to document expectations and catch bugs
in future kernel work, like a potential conversion of fsdax to use
multi-page folios, but they otherwise are not expected to fire.

Note that the paired pgmap_release_folios() implementation temporarily,
in this patch, takes an @pgmap argument to drop pgmap references. A
follow-on patch arranges for free_zone_device_page() to drop pgmap
references in all cases. In other words, the intent is that only
folio_put() (on each folio requested via pgmap_request_folios()) is
needed to undo pgmap_request_folios().

The intent is that this also replaces zone_device_page_init(), but that
too requires some more preparatory reworks to unify the various
MEMORY_DEVICE_* types.
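
For example, the fsdax hunks below end up doing roughly:

	/* at mapping insertion, while dax_read_lock() keeps the pgmap live */
	if (!pgmap_request_folios(folio_pgmap(folio), folio, 1))
		return VM_FAULT_SIGBUS;	/* pgmap is being torn down */

	/* at zap time, drop the request again */
	pgmap_release_folios(folio_pgmap(folio), folio, 1);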

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   32 ++++++++++++++++-----
 include/linux/memremap.h |   17 +++++++++++
 mm/memremap.c            |   70 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index d03c7a952d02..095c9d7b4a1d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -385,20 +385,27 @@ static inline void dax_mapping_set_cow(struct folio *folio)
 	folio->index++;
 }
 
+static struct dev_pagemap *folio_pgmap(struct folio *folio)
+{
+	return folio_page(folio, 0)->pgmap;
+}
+
 /*
  * When it is called in dax_insert_entry(), the cow flag will indicate that
  * whether this entry is shared by multiple files.  If so, set the page->mapping
  * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address, bool cow)
+static vm_fault_t dax_associate_entry(void *entry,
+				      struct address_space *mapping,
+				      struct vm_area_struct *vma,
+				      unsigned long address, bool cow)
 {
 	unsigned long size = dax_entry_size(entry), index;
 	struct folio *folio;
 	int i;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
+		return 0;
 
 	index = linear_page_index(vma, address & ~(size - 1));
 	dax_for_each_folio(entry, folio, i)
@@ -406,9 +413,13 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 			dax_mapping_set_cow(folio);
 		} else {
 			WARN_ON_ONCE(folio->mapping);
+			if (!pgmap_request_folios(folio_pgmap(folio), folio, 1))
+				return VM_FAULT_SIGBUS;
 			folio->mapping = mapping;
 			folio->index = index + i;
 		}
+
+	return 0;
 }
 
 static void dax_disassociate_entry(void *entry, struct address_space *mapping,
@@ -702,9 +713,12 @@ static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
 
 	zap = !dax_is_zapped(entry);
 
-	dax_for_each_folio(entry, folio, i)
+	dax_for_each_folio(entry, folio, i) {
+		if (zap)
+			pgmap_release_folios(folio_pgmap(folio), folio, 1);
 		if (!ret && !dax_folio_idle(folio))
 			ret = folio_page(folio, 0);
+	}
 
 	if (zap)
 		dax_zap_entry(xas, entry);
@@ -934,6 +948,7 @@ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
 	bool cow = dax_fault_is_cow(iter);
 	void *entry = *pentry;
+	vm_fault_t ret = 0;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -954,8 +969,10 @@ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
+		ret = dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
 				cow);
+		if (ret)
+			goto out;
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -978,10 +995,11 @@ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	if (cow)
 		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
 
+	*pentry = entry;
+out:
 	xas_unlock_irq(xas);
 
-	*pentry = entry;
-	return 0;
+	return ret;
 }
 
 static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7fcaf3180a5b..b87c16577af1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -193,7 +193,11 @@ void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap);
+				    struct dev_pagemap *pgmap);
+bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
+			  int nr_folios);
+void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio,
+			  int nr_folios);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -223,6 +227,17 @@ static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 	return NULL;
 }
 
+static inline bool pgmap_request_folios(struct dev_pagemap *pgmap,
+					struct folio *folio, int nr_folios)
+{
+	return false;
+}
+
+static inline void pgmap_release_folios(struct dev_pagemap *pgmap,
+					struct folio *folio, int nr_folios)
+{
+}
+
 static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
 {
 	return false;
diff --git a/mm/memremap.c b/mm/memremap.c
index f9287babb3ce..87a649ecdc54 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -530,6 +530,76 @@ void zone_device_page_init(struct page *page)
 }
 EXPORT_SYMBOL_GPL(zone_device_page_init);
 
+static bool folio_span_valid(struct dev_pagemap *pgmap, struct folio *folio,
+			     int nr_folios)
+{
+	unsigned long pfn_start, pfn_end;
+
+	pfn_start = page_to_pfn(folio_page(folio, 0));
+	pfn_end = pfn_start + (1 << folio_order(folio)) * nr_folios - 1;
+
+	if (pgmap != xa_load(&pgmap_array, pfn_start))
+		return false;
+
+	if (pfn_end > pfn_start && pgmap != xa_load(&pgmap_array, pfn_end))
+		return false;
+
+	return true;
+}
+
+/**
+ * pgmap_request_folios - activate a contiguous span of folios in @pgmap
+ * @pgmap: host page map for the folio array
+ * @folio: start of the folio list, all subsequent folios have same folio_size()
+ *
+ * Caller is responsible for @pgmap remaining live for the duration of
+ * this call. Caller is also responsible for not racing requests for the
+ * same folios.
+ */
+bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
+			  int nr_folios)
+{
+	struct folio *iter;
+	int i;
+
+	/*
+	 * All of the WARNs below are for catching bugs in future
+	 * development that changes the assumptions of:
+	 * 1/ uniform folios in @pgmap
+	 * 2/ @pgmap death does not race this routine.
+	 */
+	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
+
+	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
+		return false;
+
+	for (iter = folio_next(folio), i = 1; i < nr_folios;
+	     iter = folio_next(iter), i++)
+		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
+			return false;
+
+	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
+		folio_ref_inc(iter);
+		if (folio_ref_count(iter) == 1)
+			percpu_ref_tryget(&pgmap->ref);
+	}
+
+	return true;
+}
+
+void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
+{
+	struct folio *iter;
+	int i;
+
+	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
+		if (!put_devmap_managed_page(&iter->page))
+			folio_put(iter);
+		if (!folio_ref_count(iter))
+			put_dev_pagemap(pgmap);
+	}
+}
+
 #ifdef CONFIG_FS_DAX
 bool __put_devmap_managed_page_refs(struct page *page, int refs)
 {


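For reference, a minimal sketch of how a ZONE_DEVICE user might pair
these helpers, based on the hunks above (the example_* names, include
and error handling are illustrative, not part of this patch):

  #include <linux/memremap.h>

  /* take a mapping reference on one folio before exposing it */
  static int example_map_folio(struct dev_pagemap *pgmap, struct folio *folio)
  {
  	/* pins pgmap->ref when the folio refcount goes 0 -> 1 */
  	if (!pgmap_request_folios(pgmap, folio, 1))
  		return -EBUSY;
  	/* ... install the folio in page tables / an address_space ... */
  	return 0;
  }

  /* drop the reference when the mapping is torn down */
  static void example_unmap_folio(struct dev_pagemap *pgmap, struct folio *folio)
  {
  	/* the final put also drops the pgmap reference taken above */
  	pgmap_release_folios(pgmap, folio, 1);
  }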

* [PATCH v3 11/25] fsdax: Rework dax_insert_entry() calling convention
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (9 preceding siblings ...)
  2022-10-14 23:57 ` [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios() Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 12/25] fsdax: Cleanup dax_associate_entry() Dan Williams
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

Move the determination of @dirty and @cow in dax_insert_entry() to flags
(DAX_DIRTY and DAX_COW) that are passed in. This allows dax_insert_entry()
to not require a 'struct iomap', which is a prerequisite for reusing
dax_insert_entry() for device-dax.
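
For illustration, a hypothetical caller under the new convention
collapses the iomap-derived state into flags up front (a sketch only;
example_fault() is an illustrative name, the helpers and flags are the
ones added below):

  static vm_fault_t example_fault(struct xa_state *xas, struct vm_fault *vmf,
  				  const struct iomap_iter *iter, void **entry,
  				  pfn_t pfn)
  {
  	/* @iter is consulted once; dax_insert_entry() only sees flags */
  	return dax_insert_entry(xas, vmf, entry, pfn,
  				DAX_ZERO_PAGE | dax_iter_flags(iter, vmf));
  }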

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   44 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 095c9d7b4a1d..73e510ca5a70 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -75,12 +75,20 @@ fs_initcall(init_dax_wait_table);
  * block allocation.
  */
 #define DAX_SHIFT	(5)
+#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
 #define DAX_ZERO_PAGE	(1UL << 2)
 #define DAX_EMPTY	(1UL << 3)
 #define DAX_ZAP		(1UL << 4)
 
+/*
+ * These flags are not conveyed in Xarray value entries, they are just
+ * modifiers to dax_insert_entry().
+ */
+#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
+#define DAX_COW   (1UL << (DAX_SHIFT + 1))
+
 static unsigned long dax_to_pfn(void *entry)
 {
 	return xa_to_value(entry) >> DAX_SHIFT;
@@ -88,7 +96,8 @@ static unsigned long dax_to_pfn(void *entry)
 
 static void *dax_make_entry(pfn_t pfn, unsigned long flags)
 {
-	return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
+	return xa_mk_value((flags & DAX_MASK) |
+			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
 }
 
 static bool dax_is_locked(void *entry)
@@ -932,6 +941,20 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
 		(iter->iomap.flags & IOMAP_F_SHARED);
 }
 
+static unsigned long dax_iter_flags(const struct iomap_iter *iter,
+				    struct vm_fault *vmf)
+{
+	unsigned long flags = 0;
+
+	if (!dax_fault_is_synchronous(iter, vmf->vma))
+		flags |= DAX_DIRTY;
+
+	if (dax_fault_is_cow(iter))
+		flags |= DAX_COW;
+
+	return flags;
+}
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -940,13 +963,13 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
  * appropriate.
  */
 static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-				   const struct iomap_iter *iter, void **pentry,
-				   pfn_t pfn, unsigned long flags)
+				   void **pentry, pfn_t pfn,
+				   unsigned long flags)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
-	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
-	bool cow = dax_fault_is_cow(iter);
+	bool dirty = flags & DAX_DIRTY;
+	bool cow = flags & DAX_COW;
 	void *entry = *pentry;
 	vm_fault_t ret = 0;
 
@@ -1245,7 +1268,8 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       DAX_ZERO_PAGE | dax_iter_flags(iter, vmf));
 	if (ret)
 		goto out;
 
@@ -1276,8 +1300,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn,
-			       DAX_PMD | DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       DAX_PMD | DAX_ZERO_PAGE |
+				       dax_iter_flags(iter, vmf));
 	if (ret)
 		return ret;
 
@@ -1659,7 +1684,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 	}
 
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       entry_flags | dax_iter_flags(iter, vmf));
 	dax_read_unlock(id);
 	if (ret)
 		return ret;



* [PATCH v3 12/25] fsdax: Cleanup dax_associate_entry()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (10 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 11/25] fsdax: Rework dax_insert_entry() calling convention Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 13/25] devdax: Minor warning fixups Dan Williams
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

Pass @vmf to drop the separate @vma and @address arguments to
dax_associate_entry(), use the existing DAX flags to convey the @cow
argument, and replace the open-coded ALIGN().
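
For illustration, the before/after shape of the affected lines,
condensed from the hunks below (surrounding fault-handling code elided):

  /* before: open-coded alignment, separate vma/address/cow arguments */
  index = linear_page_index(vma, address & ~(size - 1));
  ret = dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address, cow);

  /* after: @vmf carries vma/address, the cow decision rides in @flags */
  index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
  ret = dax_associate_entry(new_entry, mapping, vmf, flags);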

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 73e510ca5a70..48bc43c0c03c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -406,8 +406,7 @@ static struct dev_pagemap *folio_pgmap(struct folio *folio)
  */
 static vm_fault_t dax_associate_entry(void *entry,
 				      struct address_space *mapping,
-				      struct vm_area_struct *vma,
-				      unsigned long address, bool cow)
+				      struct vm_fault *vmf, unsigned long flags)
 {
 	unsigned long size = dax_entry_size(entry), index;
 	struct folio *folio;
@@ -416,9 +415,9 @@ static vm_fault_t dax_associate_entry(void *entry,
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return 0;
 
-	index = linear_page_index(vma, address & ~(size - 1));
+	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
 	dax_for_each_folio(entry, folio, i)
-		if (cow) {
+		if (flags & DAX_COW) {
 			dax_mapping_set_cow(folio);
 		} else {
 			WARN_ON_ONCE(folio->mapping);
@@ -992,8 +991,7 @@ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
-		ret = dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-				cow);
+		ret = dax_associate_entry(new_entry, mapping, vmf, flags);
 		if (ret)
 			goto out;
 		/*



* [PATCH v3 13/25] devdax: Minor warning fixups
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (11 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 12/25] fsdax: Cleanup dax_associate_entry() Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 14/25] devdax: Fix sparse lock imbalance warning Dan Williams
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm; +Cc: david, hch, nvdimm, akpm, linux-fsdevel

Fix a missing prototype warning for dev_dax_probe(), and fix
dax_holder() comment block format. These are holdover fixes that are now
being addressed as some fsdax infrastructure is being moved into the dax
core.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax-private.h |    1 +
 drivers/dax/super.c       |    2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 1c974b7caae6..202cafd836e8 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -87,6 +87,7 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev)
 }
 
 phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size);
+int dev_dax_probe(struct dev_dax *dev_dax);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline bool dax_align_valid(unsigned long align)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..4909ad945a49 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(put_dax);
 /**
  * dax_holder() - obtain the holder of a dax device
  * @dax_dev: a dax_device instance
-
+ *
  * Return: the holder's data which represents the holder if registered,
  * otherwize NULL.
  */



* [PATCH v3 14/25] devdax: Fix sparse lock imbalance warning
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (12 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 13/25] devdax: Minor warning fixups Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 15/25] libnvdimm/pmem: Support pmem block devices without dax Dan Williams
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm; +Cc: kernel test robot, david, hch, nvdimm, akpm, linux-fsdevel

Annotate dax_read_{lock,unlock} with their locking expectations to fix
this sparse report:

drivers/dax/super.c:45:5: sparse: warning: context imbalance in 'dax_read_lock' - wrong count at exit
drivers/dax/super.c: note: in included file (through include/linux/notifier.h, include/linux/memory_hotplug.h, include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h, include/linux/pagemap.h):
./include/linux/srcu.h:189:9: sparse: warning: context imbalance in 'dax_read_unlock' - unexpected unlock
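
For reference, the general shape of such annotations (a generic sketch,
not code from this patch; the example_* names are illustrative):

  #include <linux/srcu.h>

  DEFINE_SRCU(example_srcu);

  /* tell sparse the context change across these wrappers is intentional */
  int example_read_lock(void) __acquires(&example_srcu)
  {
  	return srcu_read_lock(&example_srcu);
  }

  void example_read_unlock(int id) __releases(&example_srcu)
  {
  	srcu_read_unlock(&example_srcu, id);
  }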

Reported-by: kernel test robot <lkp@intel.com>
Link: http://lore.kernel.org/r/202210091141.cHaQEuCs-lkp@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4909ad945a49..41342e47662d 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -42,13 +42,13 @@ static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 
-int dax_read_lock(void)
+int dax_read_lock(void) __acquires(&dax_srcu)
 {
 	return srcu_read_lock(&dax_srcu);
 }
 EXPORT_SYMBOL_GPL(dax_read_lock);
 
-void dax_read_unlock(int id)
+void dax_read_unlock(int id) __releases(&dax_srcu)
 {
 	srcu_read_unlock(&dax_srcu, id);
 }



* [PATCH v3 15/25] libnvdimm/pmem: Support pmem block devices without dax
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (13 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 14/25] devdax: Fix sparse lock imbalance warning Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 16/25] devdax: Move address_space helpers to the DAX core Dan Williams
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm; +Cc: david, hch, nvdimm, akpm, linux-fsdevel

In preparation for CONFIG_DAX growing a CONFIG_MMU dependency, add
support for pmem to skip dax-device registration in the CONFIG_DAX=n
case.

alloc_dax() returns NULL in the CONFIG_DAX=n case, ERR_PTR() in the
failure case, and a dax-device in the success case.

dax_remove_host(), kill_dax() and put_dax() are safe to call whether
setup_dax() returned 0 because it succeeded, or 0 because CONFIG_DAX=n.
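
For illustration, the three-way result handling this implies (a sketch
of the pattern setup_dax() follows below; example_attach() is an
illustrative name):

  static int example_attach(struct pmem_device *pmem)
  {
  	struct dax_device *dax_dev = alloc_dax(pmem, &pmem_dax_ops);

  	if (IS_ERR(dax_dev))	/* CONFIG_DAX=y, allocation failed */
  		return PTR_ERR(dax_dev);
  	if (!dax_dev)		/* CONFIG_DAX=n, nothing to register */
  		return 0;
  	/* CONFIG_DAX=y, success: dax_add_host() etc. follow */
  	pmem->dax_dev = dax_dev;
  	return 0;
  }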

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/Kconfig |    2 +-
 drivers/nvdimm/pmem.c  |   47 ++++++++++++++++++++++++++++++-----------------
 2 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5a29046e3319..027acca1bac4 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -19,7 +19,7 @@ if LIBNVDIMM
 config BLK_DEV_PMEM
 	tristate "PMEM: Persistent memory block device support"
 	default LIBNVDIMM
-	select DAX
+	select DAX if MMU
 	select ND_BTT if BTT
 	select ND_PFN if NVDIMM_PFN
 	help
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 7e88cd242380..068183ee9bf6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -468,6 +468,32 @@ static const struct dev_pagemap_ops fsdax_pagemap_ops = {
 	.memory_failure		= pmem_pagemap_memory_failure,
 };
 
+static int setup_dax(struct pmem_device *pmem, struct gendisk *disk,
+		     struct nd_region *nd_region)
+{
+	struct dax_device *dax_dev;
+	int rc;
+
+	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
+	if (IS_ERR(dax_dev))
+		return PTR_ERR(dax_dev);
+	if (!dax_dev)
+		return 0;
+	set_dax_nocache(dax_dev);
+	set_dax_nomc(dax_dev);
+	if (is_nvdimm_sync(nd_region))
+		set_dax_synchronous(dax_dev);
+	rc = dax_add_host(dax_dev, disk);
+	if (rc) {
+		kill_dax(dax_dev);
+		put_dax(dax_dev);
+		return rc;
+	}
+	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
+	pmem->dax_dev = dax_dev;
+	return 0;
+}
+
 static int pmem_attach_disk(struct device *dev,
 		struct nd_namespace_common *ndns)
 {
@@ -477,7 +503,6 @@ static int pmem_attach_disk(struct device *dev,
 	struct resource *res = &nsio->res;
 	struct range bb_range;
 	struct nd_pfn *nd_pfn = NULL;
-	struct dax_device *dax_dev;
 	struct nd_pfn_sb *pfn_sb;
 	struct pmem_device *pmem;
 	struct request_queue *q;
@@ -578,24 +603,13 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
-	if (IS_ERR(dax_dev)) {
-		rc = PTR_ERR(dax_dev);
-		goto out;
-	}
-	set_dax_nocache(dax_dev);
-	set_dax_nomc(dax_dev);
-	if (is_nvdimm_sync(nd_region))
-		set_dax_synchronous(dax_dev);
-	rc = dax_add_host(dax_dev, disk);
+	rc = setup_dax(pmem, disk, nd_region);
 	if (rc)
-		goto out_cleanup_dax;
-	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-	pmem->dax_dev = dax_dev;
+		goto out;
 
 	rc = device_add_disk(dev, disk, pmem_attribute_groups);
 	if (rc)
-		goto out_remove_host;
+		goto out_dax;
 	if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
 		return -ENOMEM;
 
@@ -607,9 +621,8 @@ static int pmem_attach_disk(struct device *dev,
 		dev_warn(dev, "'badblocks' notification disabled\n");
 	return 0;
 
-out_remove_host:
+out_dax:
 	dax_remove_host(pmem->disk);
-out_cleanup_dax:
 	kill_dax(pmem->dax_dev);
 	put_dax(pmem->dax_dev);
 out:



* [PATCH v3 16/25] devdax: Move address_space helpers to the DAX core
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (14 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 15/25] libnvdimm/pmem: Support pmem block devices without dax Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 17/25] devdax: Sparse fixes for xarray locking Dan Williams
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

In preparation for removing get_dev_pagemap() and
put_devmap_managed_page() from code paths outside of DAX, device-dax
needs to track mapping references in the same way fsdax does. Reuse the
same infrastructure as fsdax (dax_insert_entry() and
dax_delete_mapping_entry()). For now, just move that infrastructure into
a common location with no other code changes.

The move splits the iomap paths and their supporting helpers into
fs/dax.c, and all 'struct address_space' and DAX-entry manipulation into
drivers/dax/mapping.c. grab_mapping_entry() is renamed
dax_grab_mapping_entry(), and some common definitions and declarations
are moved to include/linux/dax.h.

No functional change is intended, just code movement.

The interaction between drivers/dax/mapping.o and mm/memory-failure.o
results in drivers/dax/mapping.o, and the rest of the dax core, losing
the option to be compiled as a module. That can be addressed later,
given that CONFIG_FS_DAX has always forced the dax core to be built-in.
I.e. this is only a potential vmlinux size regression for
CONFIG_FS_DAX=n and CONFIG_DEV_DAX=m builds, which are not common.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/Makefile          |    2 
 drivers/dax/Kconfig       |    4 
 drivers/dax/Makefile      |    1 
 drivers/dax/dax-private.h |    1 
 drivers/dax/mapping.c     | 1048 +++++++++++++++++++++++++++++++++++++++++
 drivers/dax/super.c       |    4 
 drivers/nvdimm/Kconfig    |    1 
 fs/dax.c                  | 1143 +--------------------------------------------
 include/linux/dax.h       |  113 +++-
 include/linux/memremap.h  |    6 
 10 files changed, 1183 insertions(+), 1140 deletions(-)
 create mode 100644 drivers/dax/mapping.c

diff --git a/drivers/Makefile b/drivers/Makefile
index 057857258bfd..ec6c4146b966 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_FB_INTEL)          += video/fbdev/intelfb/
 obj-$(CONFIG_PARPORT)		+= parport/
 obj-y				+= base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)		+= nvdimm/
-obj-$(CONFIG_DAX)		+= dax/
+obj-y				+= dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)		+= nubus/
 obj-y				+= cxl/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 5fdf269a822e..205e9dda8928 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,8 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0-only
 menuconfig DAX
-	tristate "DAX: direct access to differentiated memory"
+	bool "DAX: direct access to differentiated memory"
+	depends on MMU
 	select SRCU
-	default m if NVDIMM_DAX
 
 if DAX
 
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 90a56ca3b345..3546bca7adbf 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -6,6 +6,7 @@ obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
 dax-y := super.o
 dax-y += bus.o
+dax-y += mapping.o
 device_dax-y := device.o
 dax_pmem-y := pmem.o
 
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 202cafd836e8..19076f9d5c51 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -15,6 +15,7 @@ struct dax_device *inode_dax(struct inode *inode);
 struct inode *dax_inode(struct dax_device *dax_dev);
 int dax_bus_init(void);
 void dax_bus_exit(void);
+void dax_mapping_init(void);
 
 /**
  * struct dax_region - mapping infrastructure for dax devices
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
new file mode 100644
index 000000000000..19121b7421fb
--- /dev/null
+++ b/drivers/dax/mapping.c
@@ -0,0 +1,1048 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Direct Access mapping infrastructure split from fs/dax.c
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ * Author: Ross Zwisler <ross.zwisler@linux.intel.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/rmap.h>
+#include <linux/pfn_t.h>
+#include <linux/sizes.h>
+#include <linux/pagemap.h>
+
+#include "dax-private.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs_dax.h>
+
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+void __init dax_mapping_init(void)
+{
+	int i;
+
+	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+		init_waitqueue_head(wait_table + i);
+}
+
+static unsigned long dax_to_pfn(void *entry)
+{
+	return xa_to_value(entry) >> DAX_SHIFT;
+}
+
+static void *dax_make_entry(pfn_t pfn, unsigned long flags)
+{
+	return xa_mk_value((flags & DAX_MASK) |
+			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
+}
+
+static bool dax_is_locked(void *entry)
+{
+	return xa_to_value(entry) & DAX_LOCKED;
+}
+
+static bool dax_is_zapped(void *entry)
+{
+	return xa_to_value(entry) & DAX_ZAP;
+}
+
+static unsigned int dax_entry_order(void *entry)
+{
+	if (xa_to_value(entry) & DAX_PMD)
+		return PMD_ORDER;
+	return 0;
+}
+
+static unsigned long dax_is_pmd_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_PMD;
+}
+
+static bool dax_is_pte_entry(void *entry)
+{
+	return !(xa_to_value(entry) & DAX_PMD);
+}
+
+static int dax_is_zero_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_ZERO_PAGE;
+}
+
+static int dax_is_empty_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_EMPTY;
+}
+
+/*
+ * true if the entry that was found is of a smaller order than the entry
+ * we were looking for
+ */
+static bool dax_is_conflict(void *entry)
+{
+	return entry == XA_RETRY_ENTRY;
+}
+
+/*
+ * DAX page cache entry locking
+ */
+struct exceptional_entry_key {
+	struct xarray *xa;
+	pgoff_t entry_start;
+};
+
+struct wait_exceptional_entry_queue {
+	wait_queue_entry_t wait;
+	struct exceptional_entry_key key;
+};
+
+/**
+ * enum dax_wake_mode: waitqueue wakeup behaviour
+ * @WAKE_ALL: wake all waiters in the waitqueue
+ * @WAKE_NEXT: wake only the first waiter in the waitqueue
+ */
+enum dax_wake_mode {
+	WAKE_ALL,
+	WAKE_NEXT,
+};
+
+static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, void *entry,
+					      struct exceptional_entry_key *key)
+{
+	unsigned long hash;
+	unsigned long index = xas->xa_index;
+
+	/*
+	 * If 'entry' is a PMD, align the 'index' that we use for the wait
+	 * queue to the start of that PMD.  This ensures that all offsets in
+	 * the range covered by the PMD map to the same bit lock.
+	 */
+	if (dax_is_pmd_entry(entry))
+		index &= ~PG_PMD_COLOUR;
+	key->xa = xas->xa;
+	key->entry_start = index;
+
+	hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
+static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
+				       unsigned int mode, int sync, void *keyp)
+{
+	struct exceptional_entry_key *key = keyp;
+	struct wait_exceptional_entry_queue *ewait =
+		container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+	if (key->xa != ewait->key.xa ||
+	    key->entry_start != ewait->key.entry_start)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+/*
+ * @entry may no longer be the entry at the index in the mapping.
+ * The important information it's conveying is whether the entry at
+ * this index used to be a PMD entry.
+ */
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+			   enum dax_wake_mode mode)
+{
+	struct exceptional_entry_key key;
+	wait_queue_head_t *wq;
+
+	wq = dax_entry_waitqueue(xas, entry, &key);
+
+	/*
+	 * Checking for locked entry and prepare_to_wait_exclusive() happens
+	 * under the i_pages lock, ditto for entry handling in our callers.
+	 * So at this point all tasks that could have seen our entry locked
+	 * must be in the waitqueue and the following check will see them.
+	 */
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
+}
+
+/*
+ * Look up entry in page cache, wait for it to become unlocked if it
+ * is a DAX entry and return it.  The caller must subsequently call
+ * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry()
+ * if it did.  The entry returned may have a larger order than @order.
+ * If @order is larger than the order of the entry found in i_pages, this
+ * function returns a dax_is_conflict entry.
+ *
+ * Must be called with the i_pages lock held.
+ */
+static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
+{
+	void *entry;
+	struct wait_exceptional_entry_queue ewait;
+	wait_queue_head_t *wq;
+
+	init_wait(&ewait.wait);
+	ewait.wait.func = wake_exceptional_entry_func;
+
+	for (;;) {
+		entry = xas_find_conflict(xas);
+		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
+			return entry;
+		if (dax_entry_order(entry) < order)
+			return XA_RETRY_ENTRY;
+		if (!dax_is_locked(entry))
+			return entry;
+
+		wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+		prepare_to_wait_exclusive(wq, &ewait.wait,
+					  TASK_UNINTERRUPTIBLE);
+		xas_unlock_irq(xas);
+		xas_reset(xas);
+		schedule();
+		finish_wait(wq, &ewait.wait);
+		xas_lock_irq(xas);
+	}
+}
+
+/*
+ * The only thing keeping the address space around is the i_pages lock
+ * (it's cycled in clear_inode() after removing the entries from i_pages)
+ * After we call xas_unlock_irq(), we cannot touch xas->xa.
+ */
+static void wait_entry_unlocked(struct xa_state *xas, void *entry)
+{
+	struct wait_exceptional_entry_queue ewait;
+	wait_queue_head_t *wq;
+
+	init_wait(&ewait.wait);
+	ewait.wait.func = wake_exceptional_entry_func;
+
+	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+	/*
+	 * Unlike get_unlocked_entry() there is no guarantee that this
+	 * path ever successfully retrieves an unlocked entry before an
+	 * inode dies. Perform a non-exclusive wait in case this path
+	 * never successfully performs its own wake up.
+	 */
+	prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
+	xas_unlock_irq(xas);
+	schedule();
+	finish_wait(wq, &ewait.wait);
+}
+
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+			       enum dax_wake_mode mode)
+{
+	if (entry && !dax_is_conflict(entry))
+		dax_wake_entry(xas, entry, mode);
+}
+
+/*
+ * We used the xa_state to get the entry, but then we locked the entry and
+ * dropped the xa_lock, so we know the xa_state is stale and must be reset
+ * before use.
+ */
+void dax_unlock_entry(struct xa_state *xas, void *entry)
+{
+	void *old;
+
+	WARN_ON(dax_is_locked(entry));
+	xas_reset(xas);
+	xas_lock_irq(xas);
+	old = xas_store(xas, entry);
+	xas_unlock_irq(xas);
+	WARN_ON(!dax_is_locked(old));
+	dax_wake_entry(xas, entry, WAKE_NEXT);
+}
+
+/*
+ * Return: The entry stored at this location before it was locked.
+ */
+static void *dax_lock_entry(struct xa_state *xas, void *entry)
+{
+	unsigned long v = xa_to_value(entry);
+
+	return xas_store(xas, xa_mk_value(v | DAX_LOCKED));
+}
+
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return PMD_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+/*
+ * Until fsdax constructs compound folios it needs to be prepared to
+ * support multiple folios per entry where each folio is a single page
+ */
+static struct folio *dax_entry_to_folio(void *entry, int idx)
+{
+	unsigned long pfn, size = dax_entry_size(entry);
+	struct page *page;
+	struct folio *folio;
+
+	if (!size)
+		return NULL;
+
+	pfn = dax_to_pfn(entry);
+	page = pfn_to_page(pfn);
+	folio = page_folio(page);
+
+	/*
+	 * Are there multiple folios per entry, and has the iterator
+	 * passed the end of that set?
+	 */
+	if (idx >= size / folio_size(folio))
+		return NULL;
+
+	VM_WARN_ON_ONCE(!IS_ALIGNED(size, folio_size(folio)));
+
+	return page_folio(page + idx);
+}
+
+/*
+ * Iterate through all folios associated with a given entry
+ */
+#define dax_for_each_folio(entry, folio, i)                      \
+	for (i = 0, folio = dax_entry_to_folio(entry, i); folio; \
+	     folio = dax_entry_to_folio(entry, ++i))
+
+static bool dax_mapping_is_cow(struct address_space *mapping)
+{
+	return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+}
+
+/*
+ * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ */
+static void dax_mapping_set_cow(struct folio *folio)
+{
+	if ((uintptr_t)folio->mapping != PAGE_MAPPING_DAX_COW) {
+		/*
+		 * Reset the index if the folio was already mapped
+		 * regularly before.
+		 */
+		if (folio->mapping)
+			folio->index = 1;
+		folio->mapping = (void *)PAGE_MAPPING_DAX_COW;
+	}
+	folio->index++;
+}
+
+static struct dev_pagemap *folio_pgmap(struct folio *folio)
+{
+	return folio_page(folio, 0)->pgmap;
+}
+
+/*
+ * When it is called in dax_insert_entry(), the cow flag will indicate that
+ * whether this entry is shared by multiple files.  If so, set the page->mapping
+ * FS_DAX_MAPPING_COW, and use page->index as refcount.
+ */
+static vm_fault_t dax_associate_entry(void *entry,
+				      struct address_space *mapping,
+				      struct vm_fault *vmf, unsigned long flags)
+{
+	unsigned long size = dax_entry_size(entry), index;
+	struct folio *folio;
+	int i;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return 0;
+
+	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
+	dax_for_each_folio(entry, folio, i)
+		if (flags & DAX_COW) {
+			dax_mapping_set_cow(folio);
+		} else {
+			WARN_ON_ONCE(folio->mapping);
+			if (!pgmap_request_folios(folio_pgmap(folio), folio, 1))
+				return VM_FAULT_SIGBUS;
+			folio->mapping = mapping;
+			folio->index = index + i;
+		}
+
+	return 0;
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+		bool trunc)
+{
+	struct folio *folio;
+	int i;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	dax_for_each_folio(entry, folio, i) {
+		if (dax_mapping_is_cow(folio->mapping)) {
+			/* keep the CoW flag if this folio is still shared */
+			if (folio->index-- > 0)
+				continue;
+		} else {
+			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
+			WARN_ON_ONCE(trunc && !dax_folio_idle(folio));
+			WARN_ON_ONCE(folio->mapping && folio->mapping != mapping);
+		}
+		folio->mapping = NULL;
+		folio->index = 0;
+	}
+}
+
+/*
+ * dax_lock_page - Lock the DAX entry corresponding to a page
+ * @page: The page whose entry we want to lock
+ *
+ * Context: Process context.
+ * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
+ * not be locked.
+ */
+dax_entry_t dax_lock_page(struct page *page)
+{
+	XA_STATE(xas, NULL, 0);
+	void *entry;
+
+	/* Ensure page->mapping isn't freed while we look at it */
+	rcu_read_lock();
+	for (;;) {
+		struct address_space *mapping = READ_ONCE(page->mapping);
+
+		entry = NULL;
+		if (!mapping || !dax_mapping(mapping))
+			break;
+
+		/*
+		 * In the device-dax case there's no need to lock, a
+		 * struct dev_pagemap pin is sufficient to keep the
+		 * inode alive, and we assume we have dev_pagemap pin
+		 * otherwise we would not have a valid pfn_to_page()
+		 * translation.
+		 */
+		entry = (void *)~0UL;
+		if (S_ISCHR(mapping->host->i_mode))
+			break;
+
+		xas.xa = &mapping->i_pages;
+		xas_lock_irq(&xas);
+		if (mapping != page->mapping) {
+			xas_unlock_irq(&xas);
+			continue;
+		}
+		xas_set(&xas, page->index);
+		entry = xas_load(&xas);
+		if (dax_is_locked(entry)) {
+			rcu_read_unlock();
+			wait_entry_unlocked(&xas, entry);
+			rcu_read_lock();
+			continue;
+		}
+		dax_lock_entry(&xas, entry);
+		xas_unlock_irq(&xas);
+		break;
+	}
+	rcu_read_unlock();
+	return (dax_entry_t)entry;
+}
+
+void dax_unlock_page(struct page *page, dax_entry_t cookie)
+{
+	struct address_space *mapping = page->mapping;
+	XA_STATE(xas, &mapping->i_pages, page->index);
+
+	if (S_ISCHR(mapping->host->i_mode))
+		return;
+
+	dax_unlock_entry(&xas, (void *)cookie);
+}
+
+/*
+ * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
+ * @mapping: the file's mapping whose entry we want to lock
+ * @index: the offset within this file
+ * @page: output the dax page corresponding to this dax entry
+ *
+ * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
+ * could not be locked.
+ */
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
+				   struct page **page)
+{
+	XA_STATE(xas, NULL, 0);
+	void *entry;
+
+	rcu_read_lock();
+	for (;;) {
+		entry = NULL;
+		if (!dax_mapping(mapping))
+			break;
+
+		xas.xa = &mapping->i_pages;
+		xas_lock_irq(&xas);
+		xas_set(&xas, index);
+		entry = xas_load(&xas);
+		if (dax_is_locked(entry)) {
+			rcu_read_unlock();
+			wait_entry_unlocked(&xas, entry);
+			rcu_read_lock();
+			continue;
+		}
+		if (!entry || dax_is_zero_entry(entry) ||
+		    dax_is_empty_entry(entry)) {
+			/*
+			 * Because we are looking for entry from file's mapping
+			 * and index, so the entry may not be inserted for now,
+			 * or even a zero/empty entry.  We don't think this is
+			 * an error case.  So, return a special value and do
+			 * not output @page.
+			 */
+			entry = (void *)~0UL;
+		} else {
+			*page = pfn_to_page(dax_to_pfn(entry));
+			dax_lock_entry(&xas, entry);
+		}
+		xas_unlock_irq(&xas);
+		break;
+	}
+	rcu_read_unlock();
+	return (dax_entry_t)entry;
+}
+
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
+			      dax_entry_t cookie)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+
+	if (cookie == ~0UL)
+		return;
+
+	dax_unlock_entry(&xas, (void *)cookie);
+}
+
+/*
+ * Find page cache entry at given index. If it is a DAX entry, return it
+ * with the entry locked. If the page cache doesn't contain an entry at
+ * that index, add a locked empty entry.
+ *
+ * When requesting an entry with size DAX_PMD, dax_grab_mapping_entry() will
+ * either return that locked entry or will return VM_FAULT_FALLBACK.
+ * This will happen if there are any PTE entries within the PMD range
+ * that we are requesting.
+ *
+ * We always favor PTE entries over PMD entries. There isn't a flow where we
+ * evict PTE entries in order to 'upgrade' them to a PMD entry.  A PMD
+ * insertion will fail if it finds any PTE entries already in the tree, and a
+ * PTE insertion will cause an existing PMD entry to be unmapped and
+ * downgraded to PTE entries.  This happens for both PMD zero pages as
+ * well as PMD empty entries.
+ *
+ * The exception to this downgrade path is for PMD entries that have
+ * real storage backing them.  We will leave these real PMD entries in
+ * the tree, and PTE writes will simply dirty the entire PMD entry.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ *
+ * On error, this function does not return an ERR_PTR.  Instead it returns
+ * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
+ * overlap with xarray value entries.
+ */
+void *dax_grab_mapping_entry(struct xa_state *xas,
+			     struct address_space *mapping, unsigned int order)
+{
+	unsigned long index = xas->xa_index;
+	bool pmd_downgrade; /* splitting PMD entry into PTE entries? */
+	void *entry;
+
+retry:
+	pmd_downgrade = false;
+	xas_lock_irq(xas);
+	entry = get_unlocked_entry(xas, order);
+
+	if (entry) {
+		if (dax_is_conflict(entry))
+			goto fallback;
+		if (!xa_is_value(entry)) {
+			xas_set_err(xas, -EIO);
+			goto out_unlock;
+		}
+
+		if (order == 0) {
+			if (dax_is_pmd_entry(entry) &&
+			    (dax_is_zero_entry(entry) ||
+			     dax_is_empty_entry(entry))) {
+				pmd_downgrade = true;
+			}
+		}
+	}
+
+	if (pmd_downgrade) {
+		/*
+		 * Make sure 'entry' remains valid while we drop
+		 * the i_pages lock.
+		 */
+		dax_lock_entry(xas, entry);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (dax_is_zero_entry(entry)) {
+			xas_unlock_irq(xas);
+			unmap_mapping_pages(mapping,
+					    xas->xa_index & ~PG_PMD_COLOUR,
+					    PG_PMD_NR, false);
+			xas_reset(xas);
+			xas_lock_irq(xas);
+		}
+
+		dax_disassociate_entry(entry, mapping, false);
+		xas_store(xas, NULL); /* undo the PMD join */
+		dax_wake_entry(xas, entry, WAKE_ALL);
+		mapping->nrpages -= PG_PMD_NR;
+		entry = NULL;
+		xas_set(xas, index);
+	}
+
+	if (entry) {
+		dax_lock_entry(xas, entry);
+	} else {
+		unsigned long flags = DAX_EMPTY;
+
+		if (order > 0)
+			flags |= DAX_PMD;
+		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
+		dax_lock_entry(xas, entry);
+		if (xas_error(xas))
+			goto out_unlock;
+		mapping->nrpages += 1UL << order;
+	}
+
+out_unlock:
+	xas_unlock_irq(xas);
+	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
+		goto retry;
+	if (xas->xa_node == XA_ERROR(-ENOMEM))
+		return xa_mk_internal(VM_FAULT_OOM);
+	if (xas_error(xas))
+		return xa_mk_internal(VM_FAULT_SIGBUS);
+	return entry;
+fallback:
+	xas_unlock_irq(xas);
+	return xa_mk_internal(VM_FAULT_FALLBACK);
+}
+
+static void *dax_zap_entry(struct xa_state *xas, void *entry)
+{
+	unsigned long v = xa_to_value(entry);
+
+	return xas_store(xas, xa_mk_value(v | DAX_ZAP));
+}
+
+/*
+ * Return NULL if the entry is zapped and all pages in the entry are
+ * idle, otherwise return the non-idle page in the entry
+ */
+static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
+{
+	struct page *ret = NULL;
+	struct folio *folio;
+	bool zap;
+	int i;
+
+	if (!dax_entry_size(entry))
+		return NULL;
+
+	zap = !dax_is_zapped(entry);
+
+	dax_for_each_folio(entry, folio, i) {
+		if (zap)
+			pgmap_release_folios(folio_pgmap(folio), folio, 1);
+		if (!ret && !dax_folio_idle(folio))
+			ret = folio_page(folio, 0);
+	}
+
+	if (zap)
+		dax_zap_entry(xas, entry);
+
+	return ret;
+}
+
+/**
+ * dax_zap_mappings_range - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ * @start: Starting offset. Page containing 'start' is included.
+ * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
+ *       pages from 'start' till the end of file are included.
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true.
+ */
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end)
+{
+	void *entry;
+	unsigned int scanned = 0;
+	struct page *page = NULL;
+	pgoff_t start_idx = start >> PAGE_SHIFT;
+	pgoff_t end_idx;
+	XA_STATE(xas, &mapping->i_pages, start_idx);
+
+	/*
+	 * In the 'limited' case get_user_pages() for dax is disabled.
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return NULL;
+
+	if (!dax_mapping(mapping))
+		return NULL;
+
+	/* If end == LLONG_MAX, all pages from start to till end of file */
+	if (end == LLONG_MAX)
+		end_idx = ULONG_MAX;
+	else
+		end_idx = end >> PAGE_SHIFT;
+	/*
+	 * If we race get_user_pages_fast() here either we'll see the
+	 * elevated page count in the iteration and wait, or
+	 * get_user_pages_fast() will see that the page it took a reference
+	 * against is no longer mapped in the page tables and bail to the
+	 * get_user_pages() slow path.  The slow path is protected by
+	 * pte_lock() and pmd_lock(). New references are not taken without
+	 * holding those locks, and unmap_mapping_pages() will not zero the
+	 * pte or pmd without holding the respective lock, so we are
+	 * guaranteed to either see new references or prevent new
+	 * references from being established.
+	 */
+	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
+
+	xas_lock_irq(&xas);
+	xas_for_each(&xas, entry, end_idx) {
+		if (WARN_ON_ONCE(!xa_is_value(entry)))
+			continue;
+		if (unlikely(dax_is_locked(entry)))
+			entry = get_unlocked_entry(&xas, 0);
+		if (entry)
+			page = dax_zap_pages(&xas, entry);
+		put_unlocked_entry(&xas, entry, WAKE_NEXT);
+		if (page)
+			break;
+		if (++scanned % XA_CHECK_SCHED)
+			continue;
+
+		xas_pause(&xas);
+		xas_unlock_irq(&xas);
+		cond_resched();
+		xas_lock_irq(&xas);
+	}
+	xas_unlock_irq(&xas);
+	return page;
+}
+EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
+
+struct page *dax_zap_mappings(struct address_space *mapping)
+{
+	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
+}
+EXPORT_SYMBOL_GPL(dax_zap_mappings);
+
+static int __dax_invalidate_entry(struct address_space *mapping,
+					  pgoff_t index, bool trunc)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+	int ret = 0;
+	void *entry;
+
+	xas_lock_irq(&xas);
+	entry = get_unlocked_entry(&xas, 0);
+	if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
+		goto out;
+	if (!trunc && (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
+		       xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
+		goto out;
+	dax_disassociate_entry(entry, mapping, trunc);
+	xas_store(&xas, NULL);
+	mapping->nrpages -= 1UL << dax_entry_order(entry);
+	ret = 1;
+out:
+	put_unlocked_entry(&xas, entry, WAKE_ALL);
+	xas_unlock_irq(&xas);
+	return ret;
+}
+
+/*
+ * wait indefinitely for all pins to drop, the alternative to waiting is
+ * a potential use-after-free scenario
+ */
+static void dax_break_layout(struct address_space *mapping, pgoff_t index)
+{
+	/* To do this without locks, the inode needs to be unreferenced */
+	WARN_ON(atomic_read(&mapping->host->i_count));
+	do {
+		struct page *page;
+
+		page = dax_zap_mappings_range(mapping, index << PAGE_SHIFT,
+					      (index + 1) << PAGE_SHIFT);
+		if (!page)
+			return;
+		wait_var_event(page, dax_page_idle(page));
+	} while (true);
+}
+
+/*
+ * Delete DAX entry at @index from @mapping.  Wait for it
+ * to be unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	int ret;
+
+	if (mapping_exiting(mapping))
+		dax_break_layout(mapping, index);
+
+	ret = __dax_invalidate_entry(mapping, index, true);
+
+	/*
+	 * This gets called from truncate / punch_hole path. As such, the caller
+	 * must hold locks protecting against concurrent modifications of the
+	 * page cache (usually fs-private i_mmap_sem for writing). Since the
+	 * caller has seen a DAX entry for this index, we better find it
+	 * at that index as well...
+	 */
+	WARN_ON_ONCE(!ret);
+	return ret;
+}
+
+/*
+ * Invalidate DAX entry if it is clean.
+ */
+int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
+				      pgoff_t index)
+{
+	return __dax_invalidate_entry(mapping, index, false);
+}
+
+/*
+ * By this point grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
+vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+			    void **pentry, pfn_t pfn, unsigned long flags)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = flags & DAX_DIRTY;
+	bool cow = flags & DAX_COW;
+	void *entry = *pentry;
+	vm_fault_t ret = 0;
+
+	if (dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
+		unsigned long index = xas->xa_index;
+		/* we are replacing a zero page with block mapping */
+		if (dax_is_pmd_entry(entry))
+			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
+					PG_PMD_NR, false);
+		else /* pte entry */
+			unmap_mapping_pages(mapping, index, 1, false);
+	}
+
+	xas_reset(xas);
+	xas_lock_irq(xas);
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+		void *old;
+
+		dax_disassociate_entry(entry, mapping, false);
+		ret = dax_associate_entry(new_entry, mapping, vmf, flags);
+		if (ret)
+			goto out;
+		/*
+		 * Only swap our new entry into the page cache if the current
+		 * entry is a zero page or an empty entry.  If a normal PTE or
+		 * PMD entry is already in the cache, we leave it alone.  This
+		 * means that if we are trying to insert a PTE and the
+		 * existing entry is a PMD, we will just leave the PMD in the
+		 * tree and dirty it if necessary.
+		 */
+		old = dax_lock_entry(xas, new_entry);
+		WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
+					DAX_LOCKED));
+		entry = new_entry;
+	} else {
+		xas_load(xas);	/* Walk the xa_state */
+	}
+
+	if (dirty)
+		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
+
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
+	*pentry = entry;
+out:
+	xas_unlock_irq(xas);
+
+	return ret;
+}
+
+int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
+		      struct address_space *mapping, void *entry)
+{
+	unsigned long pfn, index, count, end;
+	long ret = 0;
+	struct vm_area_struct *vma;
+
+	/*
+	 * A page got tagged dirty in DAX mapping? Something is seriously
+	 * wrong.
+	 */
+	if (WARN_ON(!xa_is_value(entry)))
+		return -EIO;
+
+	if (unlikely(dax_is_locked(entry))) {
+		void *old_entry = entry;
+
+		entry = get_unlocked_entry(xas, 0);
+
+		/* Entry got punched out / reallocated? */
+		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
+			goto put_unlocked;
+		/*
+		 * Entry got reallocated elsewhere? No need to writeback.
+		 * We have to compare pfns as we must not bail out due to
+		 * difference in lockbit or entry type.
+		 */
+		if (dax_to_pfn(old_entry) != dax_to_pfn(entry))
+			goto put_unlocked;
+		if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
+					dax_is_zero_entry(entry))) {
+			ret = -EIO;
+			goto put_unlocked;
+		}
+
+		/* Another fsync thread may have already done this entry */
+		if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE))
+			goto put_unlocked;
+	}
+
+	/* Lock the entry to serialize with page faults */
+	dax_lock_entry(xas, entry);
+
+	/*
+	 * We can clear the tag now but we have to be careful so that concurrent
+	 * dax_writeback_one() calls for the same index cannot finish before we
+	 * actually flush the caches. This is achieved as the calls will look
+	 * at the entry only under the i_pages lock and once they do that
+	 * they will see the entry locked and wait for it to unlock.
+	 */
+	xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
+	xas_unlock_irq(xas);
+
+	/*
+	 * If dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we use needs to be
+	 * aligned to the start of the PMD.
+	 * This allows us to flush for PMD_SIZE and not have to worry about
+	 * partial PMD writebacks.
+	 */
+	pfn = dax_to_pfn(entry);
+	count = 1UL << dax_entry_order(entry);
+	index = xas->xa_index & ~(count - 1);
+	end = index + count - 1;
+
+	/* Walk all mappings of a given index of a file and writeprotect them */
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
+		pfn_mkclean_range(pfn, count, index, vma);
+		cond_resched();
+	}
+	i_mmap_unlock_read(mapping);
+
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
+	/*
+	 * After we have flushed the cache, we can clear the dirty tag. There
+	 * cannot be new dirty data in the pfn after the flush has completed as
+	 * the pfn mappings are writeprotected and fault waits for mapping
+	 * entry lock.
+	 */
+	xas_reset(xas);
+	xas_lock_irq(xas);
+	xas_store(xas, entry);
+	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
+	dax_wake_entry(xas, entry, WAKE_NEXT);
+
+	trace_dax_writeback_one(mapping->host, index, count);
+	return ret;
+
+ put_unlocked:
+	put_unlocked_entry(xas, entry, WAKE_NEXT);
+	return ret;
+}
+
+/*
+ * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
+ * @vmf: The description of the fault
+ * @pfn: PFN to insert
+ * @order: Order of entry to insert.
+ *
+ * This function inserts a writeable PTE or PMD entry into the page tables
+ * for an mmaped DAX file.  It also marks the page cache entry as dirty.
+ */
+vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
+				  unsigned int order)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
+	void *entry;
+	vm_fault_t ret;
+
+	xas_lock_irq(&xas);
+	entry = get_unlocked_entry(&xas, order);
+	/* Did we race with someone splitting entry or so? */
+	if (!entry || dax_is_conflict(entry) ||
+	    (order == 0 && !dax_is_pte_entry(entry))) {
+		put_unlocked_entry(&xas, entry, WAKE_NEXT);
+		xas_unlock_irq(&xas);
+		trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
+						      VM_FAULT_NOPAGE);
+		return VM_FAULT_NOPAGE;
+	}
+	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
+	dax_lock_entry(&xas, entry);
+	xas_unlock_irq(&xas);
+	if (order == 0)
+		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+#ifdef CONFIG_FS_DAX_PMD
+	else if (order == PMD_ORDER)
+		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
+#endif
+	else
+		ret = VM_FAULT_FALLBACK;
+	dax_unlock_entry(&xas, entry);
+	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
+	return ret;
+}
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 41342e47662d..423543358fd2 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -564,6 +564,8 @@ static int __init dax_core_init(void)
 	if (rc)
 		return rc;
 
+	dax_mapping_init();
+
 	rc = alloc_chrdev_region(&dax_devt, 0, MINORMASK+1, "dax");
 	if (rc)
 		goto err_chrdev;
@@ -590,5 +592,5 @@ static void __exit dax_core_exit(void)
 
 MODULE_AUTHOR("Intel Corporation");
 MODULE_LICENSE("GPL v2");
-subsys_initcall(dax_core_init);
+fs_initcall(dax_core_init);
 module_exit(dax_core_exit);
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 027acca1bac4..24bdc87a4b99 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -78,6 +78,7 @@ config NVDIMM_DAX
 	bool "NVDIMM DAX: Raw access to persistent memory"
 	default LIBNVDIMM
 	depends on NVDIMM_PFN
+	depends on DAX
 	help
 	  Support raw device dax access to a persistent memory
 	  namespace.  For environments that want to hard partition
diff --git a/fs/dax.c b/fs/dax.c
index 48bc43c0c03c..de79dd132e22 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -27,876 +27,8 @@
 #include <linux/rmap.h>
 #include <asm/pgalloc.h>
 
-#define CREATE_TRACE_POINTS
 #include <trace/events/fs_dax.h>
 
-static inline unsigned int pe_order(enum page_entry_size pe_size)
-{
-	if (pe_size == PE_SIZE_PTE)
-		return PAGE_SHIFT - PAGE_SHIFT;
-	if (pe_size == PE_SIZE_PMD)
-		return PMD_SHIFT - PAGE_SHIFT;
-	if (pe_size == PE_SIZE_PUD)
-		return PUD_SHIFT - PAGE_SHIFT;
-	return ~0;
-}
-
-/* We choose 4096 entries - same as per-zone page wait tables */
-#define DAX_WAIT_TABLE_BITS 12
-#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
-
-/* The 'colour' (ie low bits) within a PMD of a page offset.  */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-#define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
-
-/* The order of a PMD entry */
-#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
-
-static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
-
-static int __init init_dax_wait_table(void)
-{
-	int i;
-
-	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
-		init_waitqueue_head(wait_table + i);
-	return 0;
-}
-fs_initcall(init_dax_wait_table);
-
-/*
- * DAX pagecache entries use XArray value entries so they can't be mistaken
- * for pages.  We use one bit for locking, one bit for the entry size (PMD)
- * and two more to tell us if the entry is a zero page or an empty entry that
- * is just used for locking.  In total four special bits.
- *
- * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
- * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
- * block allocation.
- */
-#define DAX_SHIFT	(5)
-#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
-#define DAX_LOCKED	(1UL << 0)
-#define DAX_PMD		(1UL << 1)
-#define DAX_ZERO_PAGE	(1UL << 2)
-#define DAX_EMPTY	(1UL << 3)
-#define DAX_ZAP		(1UL << 4)
-
-/*
- * These flags are not conveyed in Xarray value entries, they are just
- * modifiers to dax_insert_entry().
- */
-#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
-#define DAX_COW   (1UL << (DAX_SHIFT + 1))
-
-static unsigned long dax_to_pfn(void *entry)
-{
-	return xa_to_value(entry) >> DAX_SHIFT;
-}
-
-static void *dax_make_entry(pfn_t pfn, unsigned long flags)
-{
-	return xa_mk_value((flags & DAX_MASK) |
-			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
-}
-
-static bool dax_is_locked(void *entry)
-{
-	return xa_to_value(entry) & DAX_LOCKED;
-}
-
-static bool dax_is_zapped(void *entry)
-{
-	return xa_to_value(entry) & DAX_ZAP;
-}
-
-static unsigned int dax_entry_order(void *entry)
-{
-	if (xa_to_value(entry) & DAX_PMD)
-		return PMD_ORDER;
-	return 0;
-}
-
-static unsigned long dax_is_pmd_entry(void *entry)
-{
-	return xa_to_value(entry) & DAX_PMD;
-}
-
-static bool dax_is_pte_entry(void *entry)
-{
-	return !(xa_to_value(entry) & DAX_PMD);
-}
-
-static int dax_is_zero_entry(void *entry)
-{
-	return xa_to_value(entry) & DAX_ZERO_PAGE;
-}
-
-static int dax_is_empty_entry(void *entry)
-{
-	return xa_to_value(entry) & DAX_EMPTY;
-}
-
-/*
- * true if the entry that was found is of a smaller order than the entry
- * we were looking for
- */
-static bool dax_is_conflict(void *entry)
-{
-	return entry == XA_RETRY_ENTRY;
-}
-
-/*
- * DAX page cache entry locking
- */
-struct exceptional_entry_key {
-	struct xarray *xa;
-	pgoff_t entry_start;
-};
-
-struct wait_exceptional_entry_queue {
-	wait_queue_entry_t wait;
-	struct exceptional_entry_key key;
-};
-
-/**
- * enum dax_wake_mode: waitqueue wakeup behaviour
- * @WAKE_ALL: wake all waiters in the waitqueue
- * @WAKE_NEXT: wake only the first waiter in the waitqueue
- */
-enum dax_wake_mode {
-	WAKE_ALL,
-	WAKE_NEXT,
-};
-
-static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
-		void *entry, struct exceptional_entry_key *key)
-{
-	unsigned long hash;
-	unsigned long index = xas->xa_index;
-
-	/*
-	 * If 'entry' is a PMD, align the 'index' that we use for the wait
-	 * queue to the start of that PMD.  This ensures that all offsets in
-	 * the range covered by the PMD map to the same bit lock.
-	 */
-	if (dax_is_pmd_entry(entry))
-		index &= ~PG_PMD_COLOUR;
-	key->xa = xas->xa;
-	key->entry_start = index;
-
-	hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
-	return wait_table + hash;
-}
-
-static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
-		unsigned int mode, int sync, void *keyp)
-{
-	struct exceptional_entry_key *key = keyp;
-	struct wait_exceptional_entry_queue *ewait =
-		container_of(wait, struct wait_exceptional_entry_queue, wait);
-
-	if (key->xa != ewait->key.xa ||
-	    key->entry_start != ewait->key.entry_start)
-		return 0;
-	return autoremove_wake_function(wait, mode, sync, NULL);
-}
-
-/*
- * @entry may no longer be the entry at the index in the mapping.
- * The important information it's conveying is whether the entry at
- * this index used to be a PMD entry.
- */
-static void dax_wake_entry(struct xa_state *xas, void *entry,
-			   enum dax_wake_mode mode)
-{
-	struct exceptional_entry_key key;
-	wait_queue_head_t *wq;
-
-	wq = dax_entry_waitqueue(xas, entry, &key);
-
-	/*
-	 * Checking for locked entry and prepare_to_wait_exclusive() happens
-	 * under the i_pages lock, ditto for entry handling in our callers.
-	 * So at this point all tasks that could have seen our entry locked
-	 * must be in the waitqueue and the following check will see them.
-	 */
-	if (waitqueue_active(wq))
-		__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
-}
-
-/*
- * Look up entry in page cache, wait for it to become unlocked if it
- * is a DAX entry and return it.  The caller must subsequently call
- * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry()
- * if it did.  The entry returned may have a larger order than @order.
- * If @order is larger than the order of the entry found in i_pages, this
- * function returns a dax_is_conflict entry.
- *
- * Must be called with the i_pages lock held.
- */
-static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
-{
-	void *entry;
-	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq;
-
-	init_wait(&ewait.wait);
-	ewait.wait.func = wake_exceptional_entry_func;
-
-	for (;;) {
-		entry = xas_find_conflict(xas);
-		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
-			return entry;
-		if (dax_entry_order(entry) < order)
-			return XA_RETRY_ENTRY;
-		if (!dax_is_locked(entry))
-			return entry;
-
-		wq = dax_entry_waitqueue(xas, entry, &ewait.key);
-		prepare_to_wait_exclusive(wq, &ewait.wait,
-					  TASK_UNINTERRUPTIBLE);
-		xas_unlock_irq(xas);
-		xas_reset(xas);
-		schedule();
-		finish_wait(wq, &ewait.wait);
-		xas_lock_irq(xas);
-	}
-}
-
-/*
- * The only thing keeping the address space around is the i_pages lock
- * (it's cycled in clear_inode() after removing the entries from i_pages)
- * After we call xas_unlock_irq(), we cannot touch xas->xa.
- */
-static void wait_entry_unlocked(struct xa_state *xas, void *entry)
-{
-	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq;
-
-	init_wait(&ewait.wait);
-	ewait.wait.func = wake_exceptional_entry_func;
-
-	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
-	/*
-	 * Unlike get_unlocked_entry() there is no guarantee that this
-	 * path ever successfully retrieves an unlocked entry before an
-	 * inode dies. Perform a non-exclusive wait in case this path
-	 * never successfully performs its own wake up.
-	 */
-	prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
-	xas_unlock_irq(xas);
-	schedule();
-	finish_wait(wq, &ewait.wait);
-}
-
-static void put_unlocked_entry(struct xa_state *xas, void *entry,
-			       enum dax_wake_mode mode)
-{
-	if (entry && !dax_is_conflict(entry))
-		dax_wake_entry(xas, entry, mode);
-}
-
-/*
- * We used the xa_state to get the entry, but then we locked the entry and
- * dropped the xa_lock, so we know the xa_state is stale and must be reset
- * before use.
- */
-static void dax_unlock_entry(struct xa_state *xas, void *entry)
-{
-	void *old;
-
-	BUG_ON(dax_is_locked(entry));
-	xas_reset(xas);
-	xas_lock_irq(xas);
-	old = xas_store(xas, entry);
-	xas_unlock_irq(xas);
-	BUG_ON(!dax_is_locked(old));
-	dax_wake_entry(xas, entry, WAKE_NEXT);
-}
-
-/*
- * Return: The entry stored at this location before it was locked.
- */
-static void *dax_lock_entry(struct xa_state *xas, void *entry)
-{
-	unsigned long v = xa_to_value(entry);
-	return xas_store(xas, xa_mk_value(v | DAX_LOCKED));
-}
-
-static unsigned long dax_entry_size(void *entry)
-{
-	if (dax_is_zero_entry(entry))
-		return 0;
-	else if (dax_is_empty_entry(entry))
-		return 0;
-	else if (dax_is_pmd_entry(entry))
-		return PMD_SIZE;
-	else
-		return PAGE_SIZE;
-}
-
-/*
- * Until fsdax constructs compound folios it needs to be prepared to
- * support multiple folios per entry where each folio is a single page
- */
-static struct folio *dax_entry_to_folio(void *entry, int idx)
-{
-	unsigned long pfn, size = dax_entry_size(entry);
-	struct page *page;
-	struct folio *folio;
-
-	if (!size)
-		return NULL;
-
-	pfn = dax_to_pfn(entry);
-	page = pfn_to_page(pfn);
-	folio = page_folio(page);
-
-	/*
-	 * Are there multiple folios per entry, and has the iterator
-	 * passed the end of that set?
-	 */
-	if (idx >= size / folio_size(folio))
-		return NULL;
-
-	VM_WARN_ON_ONCE(!IS_ALIGNED(size, folio_size(folio)));
-
-	return page_folio(page + idx);
-}
-
-/*
- * Iterate through all folios associated with a given entry
- */
-#define dax_for_each_folio(entry, folio, i)                      \
-	for (i = 0, folio = dax_entry_to_folio(entry, i); folio; \
-	     folio = dax_entry_to_folio(entry, ++i))
-
-static inline bool dax_mapping_is_cow(struct address_space *mapping)
-{
-	return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
-}
-
-/*
- * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
- */
-static inline void dax_mapping_set_cow(struct folio *folio)
-{
-	if ((uintptr_t)folio->mapping != PAGE_MAPPING_DAX_COW) {
-		/*
-		 * Reset the index if the folio was already mapped
-		 * regularly before.
-		 */
-		if (folio->mapping)
-			folio->index = 1;
-		folio->mapping = (void *)PAGE_MAPPING_DAX_COW;
-	}
-	folio->index++;
-}
-
-static struct dev_pagemap *folio_pgmap(struct folio *folio)
-{
-	return folio_page(folio, 0)->pgmap;
-}
-
-/*
- * When it is called in dax_insert_entry(), the cow flag will indicate that
- * whether this entry is shared by multiple files.  If so, set the page->mapping
- * FS_DAX_MAPPING_COW, and use page->index as refcount.
- */
-static vm_fault_t dax_associate_entry(void *entry,
-				      struct address_space *mapping,
-				      struct vm_fault *vmf, unsigned long flags)
-{
-	unsigned long size = dax_entry_size(entry), index;
-	struct folio *folio;
-	int i;
-
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return 0;
-
-	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
-	dax_for_each_folio(entry, folio, i)
-		if (flags & DAX_COW) {
-			dax_mapping_set_cow(folio);
-		} else {
-			WARN_ON_ONCE(folio->mapping);
-			if (!pgmap_request_folios(folio_pgmap(folio), folio, 1))
-				return VM_FAULT_SIGBUS;
-			folio->mapping = mapping;
-			folio->index = index + i;
-		}
-
-	return 0;
-}
-
-static void dax_disassociate_entry(void *entry, struct address_space *mapping,
-		bool trunc)
-{
-	struct folio *folio;
-	int i;
-
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
-
-	dax_for_each_folio(entry, folio, i) {
-		if (dax_mapping_is_cow(folio->mapping)) {
-			/* keep the CoW flag if this folio is still shared */
-			if (folio->index-- > 0)
-				continue;
-		} else {
-			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
-			WARN_ON_ONCE(trunc && !dax_folio_idle(folio));
-			WARN_ON_ONCE(folio->mapping && folio->mapping != mapping);
-		}
-		folio->mapping = NULL;
-		folio->index = 0;
-	}
-}
-
-/*
- * dax_lock_page - Lock the DAX entry corresponding to a page
- * @page: The page whose entry we want to lock
- *
- * Context: Process context.
- * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
- * not be locked.
- */
-dax_entry_t dax_lock_page(struct page *page)
-{
-	XA_STATE(xas, NULL, 0);
-	void *entry;
-
-	/* Ensure page->mapping isn't freed while we look at it */
-	rcu_read_lock();
-	for (;;) {
-		struct address_space *mapping = READ_ONCE(page->mapping);
-
-		entry = NULL;
-		if (!mapping || !dax_mapping(mapping))
-			break;
-
-		/*
-		 * In the device-dax case there's no need to lock, a
-		 * struct dev_pagemap pin is sufficient to keep the
-		 * inode alive, and we assume we have dev_pagemap pin
-		 * otherwise we would not have a valid pfn_to_page()
-		 * translation.
-		 */
-		entry = (void *)~0UL;
-		if (S_ISCHR(mapping->host->i_mode))
-			break;
-
-		xas.xa = &mapping->i_pages;
-		xas_lock_irq(&xas);
-		if (mapping != page->mapping) {
-			xas_unlock_irq(&xas);
-			continue;
-		}
-		xas_set(&xas, page->index);
-		entry = xas_load(&xas);
-		if (dax_is_locked(entry)) {
-			rcu_read_unlock();
-			wait_entry_unlocked(&xas, entry);
-			rcu_read_lock();
-			continue;
-		}
-		dax_lock_entry(&xas, entry);
-		xas_unlock_irq(&xas);
-		break;
-	}
-	rcu_read_unlock();
-	return (dax_entry_t)entry;
-}
-
-void dax_unlock_page(struct page *page, dax_entry_t cookie)
-{
-	struct address_space *mapping = page->mapping;
-	XA_STATE(xas, &mapping->i_pages, page->index);
-
-	if (S_ISCHR(mapping->host->i_mode))
-		return;
-
-	dax_unlock_entry(&xas, (void *)cookie);
-}
-
-/*
- * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
- * @mapping: the file's mapping whose entry we want to lock
- * @index: the offset within this file
- * @page: output the dax page corresponding to this dax entry
- *
- * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
- * could not be locked.
- */
-dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
-		struct page **page)
-{
-	XA_STATE(xas, NULL, 0);
-	void *entry;
-
-	rcu_read_lock();
-	for (;;) {
-		entry = NULL;
-		if (!dax_mapping(mapping))
-			break;
-
-		xas.xa = &mapping->i_pages;
-		xas_lock_irq(&xas);
-		xas_set(&xas, index);
-		entry = xas_load(&xas);
-		if (dax_is_locked(entry)) {
-			rcu_read_unlock();
-			wait_entry_unlocked(&xas, entry);
-			rcu_read_lock();
-			continue;
-		}
-		if (!entry ||
-		    dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
-			/*
-			 * Because we are looking for entry from file's mapping
-			 * and index, so the entry may not be inserted for now,
-			 * or even a zero/empty entry.  We don't think this is
-			 * an error case.  So, return a special value and do
-			 * not output @page.
-			 */
-			entry = (void *)~0UL;
-		} else {
-			*page = pfn_to_page(dax_to_pfn(entry));
-			dax_lock_entry(&xas, entry);
-		}
-		xas_unlock_irq(&xas);
-		break;
-	}
-	rcu_read_unlock();
-	return (dax_entry_t)entry;
-}
-
-void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
-		dax_entry_t cookie)
-{
-	XA_STATE(xas, &mapping->i_pages, index);
-
-	if (cookie == ~0UL)
-		return;
-
-	dax_unlock_entry(&xas, (void *)cookie);
-}
-
-/*
- * Find page cache entry at given index. If it is a DAX entry, return it
- * with the entry locked. If the page cache doesn't contain an entry at
- * that index, add a locked empty entry.
- *
- * When requesting an entry with size DAX_PMD, grab_mapping_entry() will
- * either return that locked entry or will return VM_FAULT_FALLBACK.
- * This will happen if there are any PTE entries within the PMD range
- * that we are requesting.
- *
- * We always favor PTE entries over PMD entries. There isn't a flow where we
- * evict PTE entries in order to 'upgrade' them to a PMD entry.  A PMD
- * insertion will fail if it finds any PTE entries already in the tree, and a
- * PTE insertion will cause an existing PMD entry to be unmapped and
- * downgraded to PTE entries.  This happens for both PMD zero pages as
- * well as PMD empty entries.
- *
- * The exception to this downgrade path is for PMD entries that have
- * real storage backing them.  We will leave these real PMD entries in
- * the tree, and PTE writes will simply dirty the entire PMD entry.
- *
- * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
- * persistent memory the benefit is doubtful. We can add that later if we can
- * show it helps.
- *
- * On error, this function does not return an ERR_PTR.  Instead it returns
- * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
- * overlap with xarray value entries.
- */
-static void *grab_mapping_entry(struct xa_state *xas,
-		struct address_space *mapping, unsigned int order)
-{
-	unsigned long index = xas->xa_index;
-	bool pmd_downgrade;	/* splitting PMD entry into PTE entries? */
-	void *entry;
-
-retry:
-	pmd_downgrade = false;
-	xas_lock_irq(xas);
-	entry = get_unlocked_entry(xas, order);
-
-	if (entry) {
-		if (dax_is_conflict(entry))
-			goto fallback;
-		if (!xa_is_value(entry)) {
-			xas_set_err(xas, -EIO);
-			goto out_unlock;
-		}
-
-		if (order == 0) {
-			if (dax_is_pmd_entry(entry) &&
-			    (dax_is_zero_entry(entry) ||
-			     dax_is_empty_entry(entry))) {
-				pmd_downgrade = true;
-			}
-		}
-	}
-
-	if (pmd_downgrade) {
-		/*
-		 * Make sure 'entry' remains valid while we drop
-		 * the i_pages lock.
-		 */
-		dax_lock_entry(xas, entry);
-
-		/*
-		 * Besides huge zero pages the only other thing that gets
-		 * downgraded are empty entries which don't need to be
-		 * unmapped.
-		 */
-		if (dax_is_zero_entry(entry)) {
-			xas_unlock_irq(xas);
-			unmap_mapping_pages(mapping,
-					xas->xa_index & ~PG_PMD_COLOUR,
-					PG_PMD_NR, false);
-			xas_reset(xas);
-			xas_lock_irq(xas);
-		}
-
-		dax_disassociate_entry(entry, mapping, false);
-		xas_store(xas, NULL);	/* undo the PMD join */
-		dax_wake_entry(xas, entry, WAKE_ALL);
-		mapping->nrpages -= PG_PMD_NR;
-		entry = NULL;
-		xas_set(xas, index);
-	}
-
-	if (entry) {
-		dax_lock_entry(xas, entry);
-	} else {
-		unsigned long flags = DAX_EMPTY;
-
-		if (order > 0)
-			flags |= DAX_PMD;
-		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
-		dax_lock_entry(xas, entry);
-		if (xas_error(xas))
-			goto out_unlock;
-		mapping->nrpages += 1UL << order;
-	}
-
-out_unlock:
-	xas_unlock_irq(xas);
-	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
-		goto retry;
-	if (xas->xa_node == XA_ERROR(-ENOMEM))
-		return xa_mk_internal(VM_FAULT_OOM);
-	if (xas_error(xas))
-		return xa_mk_internal(VM_FAULT_SIGBUS);
-	return entry;
-fallback:
-	xas_unlock_irq(xas);
-	return xa_mk_internal(VM_FAULT_FALLBACK);
-}
-
-static void *dax_zap_entry(struct xa_state *xas, void *entry)
-{
-	unsigned long v = xa_to_value(entry);
-
-	return xas_store(xas, xa_mk_value(v | DAX_ZAP));
-}
-
-/**
- * Return NULL if the entry is zapped and all pages in the entry are
- * idle, otherwise return the non-idle page in the entry
- */
-static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
-{
-	struct page *ret = NULL;
-	struct folio *folio;
-	bool zap;
-	int i;
-
-	if (!dax_entry_size(entry))
-		return NULL;
-
-	zap = !dax_is_zapped(entry);
-
-	dax_for_each_folio(entry, folio, i) {
-		if (zap)
-			pgmap_release_folios(folio_pgmap(folio), folio, 1);
-		if (!ret && !dax_folio_idle(folio))
-			ret = folio_page(folio, 0);
-	}
-
-	if (zap)
-		dax_zap_entry(xas, entry);
-
-	return ret;
-}
-
-/**
- * dax_zap_mappings_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
- * @start: Starting offset. Page containing 'start' is included.
- * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
- *       pages from 'start' till the end of file are included.
- *
- * DAX requires ZONE_DEVICE mapped pages. These pages are never
- * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
- * any page in the mapping is busy, i.e. for DMA, or other
- * get_user_pages() usages.
- *
- * It is expected that the filesystem is holding locks to block the
- * establishment of new mappings in this address_space. I.e. it expects
- * to be able to run unmap_mapping_range() and subsequently not race
- * mapping_mapped() becoming true.
- */
-struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
-				    loff_t end)
-{
-	void *entry;
-	unsigned int scanned = 0;
-	struct page *page = NULL;
-	pgoff_t start_idx = start >> PAGE_SHIFT;
-	pgoff_t end_idx;
-	XA_STATE(xas, &mapping->i_pages, start_idx);
-
-	/*
-	 * In the 'limited' case get_user_pages() for dax is disabled.
-	 */
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return NULL;
-
-	if (!dax_mapping(mapping))
-		return NULL;
-
-	/* If end == LLONG_MAX, all pages from start to till end of file */
-	if (end == LLONG_MAX)
-		end_idx = ULONG_MAX;
-	else
-		end_idx = end >> PAGE_SHIFT;
-	/*
-	 * If we race get_user_pages_fast() here either we'll see the
-	 * elevated page count in the iteration and wait, or
-	 * get_user_pages_fast() will see that the page it took a reference
-	 * against is no longer mapped in the page tables and bail to the
-	 * get_user_pages() slow path.  The slow path is protected by
-	 * pte_lock() and pmd_lock(). New references are not taken without
-	 * holding those locks, and unmap_mapping_pages() will not zero the
-	 * pte or pmd without holding the respective lock, so we are
-	 * guaranteed to either see new references or prevent new
-	 * references from being established.
-	 */
-	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
-
-	xas_lock_irq(&xas);
-	xas_for_each(&xas, entry, end_idx) {
-		if (WARN_ON_ONCE(!xa_is_value(entry)))
-			continue;
-		if (unlikely(dax_is_locked(entry)))
-			entry = get_unlocked_entry(&xas, 0);
-		if (entry)
-			page = dax_zap_pages(&xas, entry);
-		put_unlocked_entry(&xas, entry, WAKE_NEXT);
-		if (page)
-			break;
-		if (++scanned % XA_CHECK_SCHED)
-			continue;
-
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
-		cond_resched();
-		xas_lock_irq(&xas);
-	}
-	xas_unlock_irq(&xas);
-	return page;
-}
-EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
-
-struct page *dax_zap_mappings(struct address_space *mapping)
-{
-	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
-}
-EXPORT_SYMBOL_GPL(dax_zap_mappings);
-
-static int __dax_invalidate_entry(struct address_space *mapping,
-					  pgoff_t index, bool trunc)
-{
-	XA_STATE(xas, &mapping->i_pages, index);
-	int ret = 0;
-	void *entry;
-
-	xas_lock_irq(&xas);
-	entry = get_unlocked_entry(&xas, 0);
-	if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
-		goto out;
-	if (!trunc &&
-	    (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
-	     xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
-		goto out;
-	dax_disassociate_entry(entry, mapping, trunc);
-	xas_store(&xas, NULL);
-	mapping->nrpages -= 1UL << dax_entry_order(entry);
-	ret = 1;
-out:
-	put_unlocked_entry(&xas, entry, WAKE_ALL);
-	xas_unlock_irq(&xas);
-	return ret;
-}
-
-/*
- * wait indefinitely for all pins to drop, the alternative to waiting is
- * a potential use-after-free scenario
- */
-static void dax_break_layout(struct address_space *mapping, pgoff_t index)
-{
-	/* To do this without locks, the inode needs to be unreferenced */
-	WARN_ON(atomic_read(&mapping->host->i_count));
-	do {
-		struct page *page;
-
-		page = dax_zap_mappings_range(mapping, index << PAGE_SHIFT,
-					      (index + 1) << PAGE_SHIFT);
-		if (!page)
-			return;
-		wait_var_event(page, dax_page_idle(page));
-	} while (true);
-}
-
-/*
- * Delete DAX entry at @index from @mapping.  Wait for it
- * to be unlocked before deleting it.
- */
-int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret;
-
-	if (mapping_exiting(mapping))
-		dax_break_layout(mapping, index);
-
-	ret = __dax_invalidate_entry(mapping, index, true);
-
-	/*
-	 * This gets called from truncate / punch_hole path. As such, the caller
-	 * must hold locks protecting against concurrent modifications of the
-	 * page cache (usually fs-private i_mmap_sem for writing). Since the
-	 * caller has seen a DAX entry for this index, we better find it
-	 * at that index as well...
-	 */
-	WARN_ON_ONCE(!ret);
-	return ret;
-}
-
-/*
- * Invalidate DAX entry if it is clean.
- */
-int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
-				      pgoff_t index)
-{
-	return __dax_invalidate_entry(mapping, index, false);
-}
-
 static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos)
 {
 	return PHYS_PFN(iomap->addr + (pos & PAGE_MASK) - iomap->offset);
@@ -923,200 +55,6 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
 	return 0;
 }
 
-/*
- * MAP_SYNC on a dax mapping guarantees dirty metadata is
- * flushed on write-faults (non-cow), but not read-faults.
- */
-static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
-		struct vm_area_struct *vma)
-{
-	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
-		(iter->iomap.flags & IOMAP_F_DIRTY);
-}
-
-static bool dax_fault_is_cow(const struct iomap_iter *iter)
-{
-	return (iter->flags & IOMAP_WRITE) &&
-		(iter->iomap.flags & IOMAP_F_SHARED);
-}
-
-static unsigned long dax_iter_flags(const struct iomap_iter *iter,
-				    struct vm_fault *vmf)
-{
-	unsigned long flags = 0;
-
-	if (!dax_fault_is_synchronous(iter, vmf->vma))
-		flags |= DAX_DIRTY;
-
-	if (dax_fault_is_cow(iter))
-		flags |= DAX_COW;
-
-	return flags;
-}
-
-/*
- * By this point grab_mapping_entry() has ensured that we have a locked entry
- * of the appropriate size so we don't have to worry about downgrading PMDs to
- * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
- * already in the tree, we will skip the insertion and just dirty the PMD as
- * appropriate.
- */
-static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-				   void **pentry, pfn_t pfn,
-				   unsigned long flags)
-{
-	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
-	void *new_entry = dax_make_entry(pfn, flags);
-	bool dirty = flags & DAX_DIRTY;
-	bool cow = flags & DAX_COW;
-	void *entry = *pentry;
-	vm_fault_t ret = 0;
-
-	if (dirty)
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
-	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
-		unsigned long index = xas->xa_index;
-		/* we are replacing a zero page with block mapping */
-		if (dax_is_pmd_entry(entry))
-			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
-					PG_PMD_NR, false);
-		else /* pte entry */
-			unmap_mapping_pages(mapping, index, 1, false);
-	}
-
-	xas_reset(xas);
-	xas_lock_irq(xas);
-	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
-		void *old;
-
-		dax_disassociate_entry(entry, mapping, false);
-		ret = dax_associate_entry(new_entry, mapping, vmf, flags);
-		if (ret)
-			goto out;
-		/*
-		 * Only swap our new entry into the page cache if the current
-		 * entry is a zero page or an empty entry.  If a normal PTE or
-		 * PMD entry is already in the cache, we leave it alone.  This
-		 * means that if we are trying to insert a PTE and the
-		 * existing entry is a PMD, we will just leave the PMD in the
-		 * tree and dirty it if necessary.
-		 */
-		old = dax_lock_entry(xas, new_entry);
-		WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
-					DAX_LOCKED));
-		entry = new_entry;
-	} else {
-		xas_load(xas);	/* Walk the xa_state */
-	}
-
-	if (dirty)
-		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
-
-	if (cow)
-		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
-
-	*pentry = entry;
-out:
-	xas_unlock_irq(xas);
-
-	return ret;
-}
-
-static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
-		struct address_space *mapping, void *entry)
-{
-	unsigned long pfn, index, count, end;
-	long ret = 0;
-	struct vm_area_struct *vma;
-
-	/*
-	 * A page got tagged dirty in DAX mapping? Something is seriously
-	 * wrong.
-	 */
-	if (WARN_ON(!xa_is_value(entry)))
-		return -EIO;
-
-	if (unlikely(dax_is_locked(entry))) {
-		void *old_entry = entry;
-
-		entry = get_unlocked_entry(xas, 0);
-
-		/* Entry got punched out / reallocated? */
-		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
-			goto put_unlocked;
-		/*
-		 * Entry got reallocated elsewhere? No need to writeback.
-		 * We have to compare pfns as we must not bail out due to
-		 * difference in lockbit or entry type.
-		 */
-		if (dax_to_pfn(old_entry) != dax_to_pfn(entry))
-			goto put_unlocked;
-		if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
-					dax_is_zero_entry(entry))) {
-			ret = -EIO;
-			goto put_unlocked;
-		}
-
-		/* Another fsync thread may have already done this entry */
-		if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE))
-			goto put_unlocked;
-	}
-
-	/* Lock the entry to serialize with page faults */
-	dax_lock_entry(xas, entry);
-
-	/*
-	 * We can clear the tag now but we have to be careful so that concurrent
-	 * dax_writeback_one() calls for the same index cannot finish before we
-	 * actually flush the caches. This is achieved as the calls will look
-	 * at the entry only under the i_pages lock and once they do that
-	 * they will see the entry locked and wait for it to unlock.
-	 */
-	xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
-	xas_unlock_irq(xas);
-
-	/*
-	 * If dax_writeback_mapping_range() was given a wbc->range_start
-	 * in the middle of a PMD, the 'index' we use needs to be
-	 * aligned to the start of the PMD.
-	 * This allows us to flush for PMD_SIZE and not have to worry about
-	 * partial PMD writebacks.
-	 */
-	pfn = dax_to_pfn(entry);
-	count = 1UL << dax_entry_order(entry);
-	index = xas->xa_index & ~(count - 1);
-	end = index + count - 1;
-
-	/* Walk all mappings of a given index of a file and writeprotect them */
-	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
-		pfn_mkclean_range(pfn, count, index, vma);
-		cond_resched();
-	}
-	i_mmap_unlock_read(mapping);
-
-	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
-	/*
-	 * After we have flushed the cache, we can clear the dirty tag. There
-	 * cannot be new dirty data in the pfn after the flush has completed as
-	 * the pfn mappings are writeprotected and fault waits for mapping
-	 * entry lock.
-	 */
-	xas_reset(xas);
-	xas_lock_irq(xas);
-	xas_store(xas, entry);
-	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-	dax_wake_entry(xas, entry, WAKE_NEXT);
-
-	trace_dax_writeback_one(mapping->host, index, count);
-	return ret;
-
- put_unlocked:
-	put_unlocked_entry(xas, entry, WAKE_NEXT);
-	return ret;
-}
-
 /*
  * Flush the mapping to the persistent domain within the byte range of [start,
  * end]. This is required by data integrity operations to ensure file data is
@@ -1251,6 +189,37 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
 	return 0;
 }
 
+/*
+ * MAP_SYNC on a dax mapping guarantees dirty metadata is
+ * flushed on write-faults (non-cow), but not read-faults.
+ */
+static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
+				     struct vm_area_struct *vma)
+{
+	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
+	       (iter->iomap.flags & IOMAP_F_DIRTY);
+}
+
+static bool dax_fault_is_cow(const struct iomap_iter *iter)
+{
+	return (iter->flags & IOMAP_WRITE) &&
+	       (iter->iomap.flags & IOMAP_F_SHARED);
+}
+
+static unsigned long dax_iter_flags(const struct iomap_iter *iter,
+				    struct vm_fault *vmf)
+{
+	unsigned long flags = 0;
+
+	if (!dax_fault_is_synchronous(iter, vmf->vma))
+		flags |= DAX_DIRTY;
+
+	if (dax_fault_is_cow(iter))
+		flags |= DAX_COW;
+
+	return flags;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1737,7 +706,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		iter.flags |= IOMAP_WRITE;
 
-	entry = grab_mapping_entry(&xas, mapping, 0);
+	entry = dax_grab_mapping_entry(&xas, mapping, 0);
 	if (xa_is_internal(entry)) {
 		ret = xa_to_internal(entry);
 		goto out;
@@ -1854,12 +823,12 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		goto fallback;
 
 	/*
-	 * grab_mapping_entry() will make sure we get an empty PMD entry,
+	 * dax_grab_mapping_entry() will make sure we get an empty PMD entry,
 	 * a zero PMD entry or a DAX PMD.  If it can't (because a PTE
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
 	if (xa_is_internal(entry)) {
 		ret = xa_to_internal(entry);
 		goto fallback;
@@ -1933,50 +902,6 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
 
-/*
- * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
- * @vmf: The description of the fault
- * @pfn: PFN to insert
- * @order: Order of entry to insert.
- *
- * This function inserts a writeable PTE or PMD entry into the page tables
- * for an mmaped DAX file.  It also marks the page cache entry as dirty.
- */
-static vm_fault_t
-dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
-{
-	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
-	void *entry;
-	vm_fault_t ret;
-
-	xas_lock_irq(&xas);
-	entry = get_unlocked_entry(&xas, order);
-	/* Did we race with someone splitting entry or so? */
-	if (!entry || dax_is_conflict(entry) ||
-	    (order == 0 && !dax_is_pte_entry(entry))) {
-		put_unlocked_entry(&xas, entry, WAKE_NEXT);
-		xas_unlock_irq(&xas);
-		trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
-						      VM_FAULT_NOPAGE);
-		return VM_FAULT_NOPAGE;
-	}
-	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
-	dax_lock_entry(&xas, entry);
-	xas_unlock_irq(&xas);
-	if (order == 0)
-		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
-#ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
-		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
-#endif
-	else
-		ret = VM_FAULT_FALLBACK;
-	dax_unlock_entry(&xas, entry);
-	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
-	return ret;
-}
-
 /**
  * dax_finish_sync_fault - finish synchronous page fault
  * @vmf: The description of the fault
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 12e15ca11bff..1fc3d79b6aec 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,15 +157,34 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
-struct page *dax_zap_mappings(struct address_space *mapping);
-struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
-				    loff_t end);
+#else
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+		struct dax_device *dax_dev, struct writeback_control *wbc)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif
+
+int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
+		const struct iomap_ops *ops);
+int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+		const struct iomap_ops *ops);
+
+#if IS_ENABLED(CONFIG_DAX)
+int dax_read_lock(void);
+void dax_read_unlock(int id);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+void run_dax(struct dax_device *dax_dev);
 dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
 		unsigned long index, struct page **page);
 void dax_unlock_mapping_entry(struct address_space *mapping,
 		unsigned long index, dax_entry_t cookie);
+void dax_break_layouts(struct inode *inode);
+struct page *dax_zap_mappings(struct address_space *mapping);
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end);
 #else
 static inline struct page *dax_zap_mappings(struct address_space *mapping)
 {
@@ -179,12 +198,6 @@ static inline struct page *dax_zap_mappings_range(struct address_space *mapping,
 	return NULL;
 }
 
-static inline int dax_writeback_mapping_range(struct address_space *mapping,
-		struct dax_device *dax_dev, struct writeback_control *wbc)
-{
-	return -EOPNOTSUPP;
-}
-
 static inline dax_entry_t dax_lock_page(struct page *page)
 {
 	if (IS_DAX(page->mapping->host))
@@ -196,6 +209,15 @@ static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
 
+static inline int dax_read_lock(void)
+{
+	return 0;
+}
+
+static inline void dax_read_unlock(int id)
+{
+}
+
 static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
 		unsigned long index, struct page **page)
 {
@@ -208,11 +230,6 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
 }
 #endif
 
-int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-		const struct iomap_ops *ops);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
-
 /*
  * Document all the code locations that want know when a dax page is
  * unreferenced.
@@ -227,19 +244,6 @@ static inline bool dax_folio_idle(struct folio *folio)
 	return dax_page_idle(folio_page(folio, 0));
 }
 
-#if IS_ENABLED(CONFIG_DAX)
-int dax_read_lock(void);
-void dax_read_unlock(int id);
-#else
-static inline int dax_read_lock(void)
-{
-	return 0;
-}
-
-static inline void dax_read_unlock(int id)
-{
-}
-#endif /* CONFIG_DAX */
 bool dax_alive(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
@@ -260,6 +264,9 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    pfn_t *pfnp, int *errp, const struct iomap_ops *ops);
 vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size, pfn_t pfn);
+void *dax_grab_mapping_entry(struct xa_state *xas,
+			     struct address_space *mapping, unsigned int order);
+void dax_unlock_entry(struct xa_state *xas, void *entry);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
@@ -276,6 +283,57 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+/*
+ * DAX pagecache entries use XArray value entries so they can't be mistaken
+ * for pages.  We use one bit for locking, one bit for the entry size (PMD)
+ * and two more to tell us if the entry is a zero page or an empty entry that
+ * is just used for locking.  In total four special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
+ * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
+ */
+#define DAX_SHIFT	(5)
+#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
+#define DAX_LOCKED	(1UL << 0)
+#define DAX_PMD		(1UL << 1)
+#define DAX_ZERO_PAGE	(1UL << 2)
+#define DAX_EMPTY	(1UL << 3)
+#define DAX_ZAP		(1UL << 4)
+
+/*
+ * These flags are not conveyed in Xarray value entries, they are just
+ * modifiers to dax_insert_entry().
+ */
+#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
+#define DAX_COW   (1UL << (DAX_SHIFT + 1))
+
+vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+			    void **pentry, pfn_t pfn, unsigned long flags);
+vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
+				  unsigned int order);
+int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
+		      struct address_space *mapping, void *entry);
+
+#ifdef CONFIG_MMU
+/* The 'colour' (ie low bits) within a PMD of a page offset.  */
+#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
+#define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT)
+
+/* The order of a PMD entry */
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
+
+static inline unsigned int pe_order(enum page_entry_size pe_size)
+{
+	if (pe_size == PE_SIZE_PTE)
+		return PAGE_SHIFT - PAGE_SHIFT;
+	if (pe_size == PE_SIZE_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	if (pe_size == PE_SIZE_PUD)
+		return PUD_SHIFT - PAGE_SHIFT;
+	return ~0;
+}
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
@@ -283,5 +341,6 @@ static inline void hmem_register_device(int target_nid, struct resource *r)
 {
 }
 #endif
+#endif /* CONFIG_MMU */
 
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b87c16577af1..98196b8d3172 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -221,6 +221,12 @@ static inline void devm_memunmap_pages(struct device *dev,
 {
 }
 
+static inline struct dev_pagemap *
+get_dev_pagemap_many(unsigned long pfn, struct dev_pagemap *pgmap, int refs)
+{
+	return NULL;
+}
+
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap)
 {


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 17/25] devdax: Sparse fixes for xarray locking
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (15 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 16/25] devdax: Move address_space helpers to the DAX core Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 18/25] devdax: Sparse fixes for vmfault_t / dax-entry conversions Dan Williams
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm; +Cc: kernel test robot, david, hch, nvdimm, akpm, linux-fsdevel

Now that the dax-mapping-entry code has moved to a common location, take
the opportunity to fix up some long-standing sparse warnings. In this
case, annotate the manipulations of the Xarray lock:

Fixes:
drivers/dax/mapping.c:216:13: sparse: warning: context imbalance in 'wait_entry_unlocked' - unexpected unlock
drivers/dax/mapping.c:953:9: sparse: warning: context imbalance in 'dax_writeback_one' - unexpected unlock

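As an aside (not part of the patch; the lock and functions below are
hypothetical), a minimal sketch of how sparse consumes these lock-context
annotations: __releases() declares that a function returns with the named
lock dropped, while __must_hold() declares that it is entered and exited
with the lock held. Assuming kernel context:

  #include <linux/spinlock.h>
  #include <linux/sched.h>

  /* returns with @lock dropped, e.g. to sleep outside the critical section */
  static void drop_lock_and_wait(spinlock_t *lock) __releases(lock)
  {
  	spin_unlock(lock);
  	schedule();
  }

  /* caller must hold @lock for the duration of the call */
  static void bump_under_lock(spinlock_t *lock, int *val) __must_hold(lock)
  {
  	(*val)++;
  }

This is what the two annotations below assert: wait_entry_unlocked()
legitimately returns with the Xarray lock dropped, and dax_writeback_one()
expects to be entered and exited with it held.
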
Reported-by: kernel test robot <lkp@intel.com>
Link: http://lore.kernel.org/r/202210091141.cHaQEuCs-lkp@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/mapping.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index 19121b7421fb..803ae64c13d4 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -213,7 +213,7 @@ static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
  * (it's cycled in clear_inode() after removing the entries from i_pages)
  * After we call xas_unlock_irq(), we cannot touch xas->xa.
  */
-static void wait_entry_unlocked(struct xa_state *xas, void *entry)
+static void wait_entry_unlocked(struct xa_state *xas, void *entry) __releases(xas)
 {
 	struct wait_exceptional_entry_queue ewait;
 	wait_queue_head_t *wq;
@@ -910,7 +910,7 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 }
 
 int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
-		      struct address_space *mapping, void *entry)
+		      struct address_space *mapping, void *entry) __must_hold(xas)
 {
 	unsigned long pfn, index, count, end;
 	long ret = 0;


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 18/25] devdax: Sparse fixes for vmfault_t / dax-entry conversions
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (16 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 17/25] devdax: Sparse fixes for xarray locking Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 19/25] devdax: Sparse fixes for vm_fault_t in tracepoints Dan Williams
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm; +Cc: kernel test robot, david, hch, nvdimm, akpm, linux-fsdevel

Now that the dax-mapping-entry code has moved to a common location, take
the opportunity to fix up some long-standing sparse warnings. In this
case, conveying vm_fault_t codes through Xarray internal values was
missing some forced casts. Add the helpers is_dax_err(),
dax_err_to_vmfault(), and vmfault_to_dax_err() to handle the
conversions.

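For background, a sketch (the function name here is hypothetical and not
part of the patch) of the round-trip the new helpers wrap: vm_fault_t is
a sparse-restricted (__bitwise) type, so squeezing it into the plain
unsigned long that Xarray internal values carry needs the conversion
acknowledged with __force in both directions:

  #include <linux/xarray.h>
  #include <linux/mm_types.h>

  static vm_fault_t example_roundtrip(void)
  {
  	/* encode: restricted vm_fault_t -> plain unsigned long */
  	void *err = xa_mk_internal((unsigned long __force)VM_FAULT_SIGBUS);

  	/* decode: plain unsigned long -> restricted vm_fault_t */
  	if (xa_is_internal(err))
  		return (vm_fault_t __force)xa_to_internal(err);
  	return VM_FAULT_NOPAGE;
  }

Centralizing these casts in is_dax_err(), dax_err_to_vmfault() and
vmfault_to_dax_err() keeps callers such as dax_iomap_pte_fault() free of
open-coded __force conversions.
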
Fixes:
drivers/dax/mapping.c:637:39: sparse: warning: incorrect type in argument 1 (different base types)
drivers/dax/mapping.c:637:39: sparse:    expected unsigned long v
drivers/dax/mapping.c:637:39: sparse:    got restricted vm_fault_t
drivers/dax/mapping.c:639:39: sparse: warning: incorrect type in argument 1 (different base types)
drivers/dax/mapping.c:639:39: sparse:    expected unsigned long v
drivers/dax/mapping.c:639:39: sparse:    got restricted vm_fault_t
drivers/dax/mapping.c:643:31: sparse: warning: incorrect type in argument 1 (different base types)
drivers/dax/mapping.c:643:31: sparse:    expected unsigned long v
drivers/dax/mapping.c:643:31: sparse:    got restricted vm_fault_t

Reported-by: kernel test robot <lkp@intel.com>
Link: http://lore.kernel.org/r/202210091141.cHaQEuCs-lkp@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/mapping.c |    6 +++---
 fs/dax.c              |    8 ++++----
 include/linux/dax.h   |   16 ++++++++++++++++
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index 803ae64c13d4..b452bfa98f5e 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -634,13 +634,13 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
 		goto retry;
 	if (xas->xa_node == XA_ERROR(-ENOMEM))
-		return xa_mk_internal(VM_FAULT_OOM);
+		return vmfault_to_dax_err(VM_FAULT_OOM);
 	if (xas_error(xas))
-		return xa_mk_internal(VM_FAULT_SIGBUS);
+		return vmfault_to_dax_err(VM_FAULT_SIGBUS);
 	return entry;
 fallback:
 	xas_unlock_irq(xas);
-	return xa_mk_internal(VM_FAULT_FALLBACK);
+	return vmfault_to_dax_err(VM_FAULT_FALLBACK);
 }
 
 static void *dax_zap_entry(struct xa_state *xas, void *entry)
diff --git a/fs/dax.c b/fs/dax.c
index de79dd132e22..dc1dcbaeba05 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -707,8 +707,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		iter.flags |= IOMAP_WRITE;
 
 	entry = dax_grab_mapping_entry(&xas, mapping, 0);
-	if (xa_is_internal(entry)) {
-		ret = xa_to_internal(entry);
+	if (is_dax_err(entry)) {
+		ret = dax_err_to_vmfault(entry);
 		goto out;
 	}
 
@@ -829,8 +829,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * VM_FAULT_FALLBACK.
 	 */
 	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
-	if (xa_is_internal(entry)) {
-		ret = xa_to_internal(entry);
+	if (is_dax_err(entry)) {
+		ret = dax_err_to_vmfault(entry);
 		goto fallback;
 	}
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 1fc3d79b6aec..553bc819a6a4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -264,6 +264,22 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    pfn_t *pfnp, int *errp, const struct iomap_ops *ops);
 vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size, pfn_t pfn);
+
+static inline bool is_dax_err(void *entry)
+{
+	return xa_is_internal(entry);
+}
+
+static inline vm_fault_t dax_err_to_vmfault(void *entry)
+{
+	return (vm_fault_t __force)(xa_to_internal(entry));
+}
+
+static inline void *vmfault_to_dax_err(vm_fault_t error)
+{
+	return xa_mk_internal((unsigned long __force)error);
+}
+
 void *dax_grab_mapping_entry(struct xa_state *xas,
 			     struct address_space *mapping, unsigned int order);
 void dax_unlock_entry(struct xa_state *xas, void *entry);


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 19/25] devdax: Sparse fixes for vm_fault_t in tracepoints
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (17 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 18/25] devdax: Sparse fixes for vmfault_t / dax-entry conversions Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:58 ` [PATCH v3 20/25] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm; +Cc: kernel test robot, david, hch, nvdimm, akpm, linux-fsdevel

Now that the dax-mapping-entry code has moved to a common location, take
the opportunity to fix up some long-standing sparse warnings. In this
case, the tracepoints have long specified the wrong type for the traced
return code. Pass the correct type, but cast it back to 'unsigned int'
inside the trace helpers, as the helpers are not prepared to handle
restricted types.

Fixes:
drivers/dax/mapping.c:1031:55: sparse: warning: incorrect type in argument 3 (different base types)
drivers/dax/mapping.c:1031:55: sparse:    expected int result
drivers/dax/mapping.c:1031:55: sparse:    got restricted vm_fault_t
drivers/dax/mapping.c:1046:58: sparse: warning: incorrect type in argument 3 (different base types)
drivers/dax/mapping.c:1046:58: sparse:    expected int result
drivers/dax/mapping.c:1046:58: sparse:    got restricted vm_fault_t [assigned] [usertype] ret

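The reason the cast lives inside the trace machinery rather than at the
call sites (a sketch; the __print_flags() decode at the end is an
assumption about the TP_printk() format string and is not visible in
this hunk): the ring-buffer field and the { flag, "name" } table in
VM_FAULT_RESULT_TRACE both operate on plain integers, so the restricted
vm_fault_t is stripped exactly once, at assignment time:

  __field(unsigned int, result)                      /* plain storage       */
  __entry->result = (__force unsigned int) result;   /* single __force cast */
  __print_flags(__entry->result, "|", VM_FAULT_RESULT_TRACE)
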
Reported-by: kernel test robot <lkp@intel.com>
Link: http://lore.kernel.org/r/202210091141.cHaQEuCs-lkp@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm_types.h      |   26 +++++++++++++-------------
 include/trace/events/fs_dax.h |   16 ++++++++--------
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..910d880e67eb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -891,19 +891,19 @@ enum vm_fault_reason {
 			VM_FAULT_HWPOISON_LARGE | VM_FAULT_FALLBACK)
 
 #define VM_FAULT_RESULT_TRACE \
-	{ VM_FAULT_OOM,                 "OOM" },	\
-	{ VM_FAULT_SIGBUS,              "SIGBUS" },	\
-	{ VM_FAULT_MAJOR,               "MAJOR" },	\
-	{ VM_FAULT_WRITE,               "WRITE" },	\
-	{ VM_FAULT_HWPOISON,            "HWPOISON" },	\
-	{ VM_FAULT_HWPOISON_LARGE,      "HWPOISON_LARGE" },	\
-	{ VM_FAULT_SIGSEGV,             "SIGSEGV" },	\
-	{ VM_FAULT_NOPAGE,              "NOPAGE" },	\
-	{ VM_FAULT_LOCKED,              "LOCKED" },	\
-	{ VM_FAULT_RETRY,               "RETRY" },	\
-	{ VM_FAULT_FALLBACK,            "FALLBACK" },	\
-	{ VM_FAULT_DONE_COW,            "DONE_COW" },	\
-	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
+	{ (__force unsigned int) VM_FAULT_OOM,                 "OOM" },	\
+	{ (__force unsigned int) VM_FAULT_SIGBUS,              "SIGBUS" },	\
+	{ (__force unsigned int) VM_FAULT_MAJOR,               "MAJOR" },	\
+	{ (__force unsigned int) VM_FAULT_WRITE,               "WRITE" },	\
+	{ (__force unsigned int) VM_FAULT_HWPOISON,            "HWPOISON" },	\
+	{ (__force unsigned int) VM_FAULT_HWPOISON_LARGE,      "HWPOISON_LARGE" },	\
+	{ (__force unsigned int) VM_FAULT_SIGSEGV,             "SIGSEGV" },	\
+	{ (__force unsigned int) VM_FAULT_NOPAGE,              "NOPAGE" },	\
+	{ (__force unsigned int) VM_FAULT_LOCKED,              "LOCKED" },	\
+	{ (__force unsigned int) VM_FAULT_RETRY,               "RETRY" },	\
+	{ (__force unsigned int) VM_FAULT_FALLBACK,            "FALLBACK" },	\
+	{ (__force unsigned int) VM_FAULT_DONE_COW,            "DONE_COW" },	\
+	{ (__force unsigned int) VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" }
 
 struct vm_special_mapping {
 	const char *name;	/* The name, e.g. "[vdso]". */
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index 97b09fcf7e52..adc50cf7b969 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -9,7 +9,7 @@
 
 DECLARE_EVENT_CLASS(dax_pmd_fault_class,
 	TP_PROTO(struct inode *inode, struct vm_fault *vmf,
-		pgoff_t max_pgoff, int result),
+		pgoff_t max_pgoff, vm_fault_t result),
 	TP_ARGS(inode, vmf, max_pgoff, result),
 	TP_STRUCT__entry(
 		__field(unsigned long, ino)
@@ -21,7 +21,7 @@ DECLARE_EVENT_CLASS(dax_pmd_fault_class,
 		__field(pgoff_t, max_pgoff)
 		__field(dev_t, dev)
 		__field(unsigned int, flags)
-		__field(int, result)
+		__field(unsigned int, result)
 	),
 	TP_fast_assign(
 		__entry->dev = inode->i_sb->s_dev;
@@ -33,7 +33,7 @@ DECLARE_EVENT_CLASS(dax_pmd_fault_class,
 		__entry->flags = vmf->flags;
 		__entry->pgoff = vmf->pgoff;
 		__entry->max_pgoff = max_pgoff;
-		__entry->result = result;
+		__entry->result = (__force unsigned int) result;
 	),
 	TP_printk("dev %d:%d ino %#lx %s %s address %#lx vm_start "
 			"%#lx vm_end %#lx pgoff %#lx max_pgoff %#lx %s",
@@ -54,7 +54,7 @@ DECLARE_EVENT_CLASS(dax_pmd_fault_class,
 #define DEFINE_PMD_FAULT_EVENT(name) \
 DEFINE_EVENT(dax_pmd_fault_class, name, \
 	TP_PROTO(struct inode *inode, struct vm_fault *vmf, \
-		pgoff_t max_pgoff, int result), \
+		pgoff_t max_pgoff, vm_fault_t result), \
 	TP_ARGS(inode, vmf, max_pgoff, result))
 
 DEFINE_PMD_FAULT_EVENT(dax_pmd_fault);
@@ -151,7 +151,7 @@ DEFINE_EVENT(dax_pmd_insert_mapping_class, name, \
 DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping);
 
 DECLARE_EVENT_CLASS(dax_pte_fault_class,
-	TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
+	TP_PROTO(struct inode *inode, struct vm_fault *vmf, vm_fault_t result),
 	TP_ARGS(inode, vmf, result),
 	TP_STRUCT__entry(
 		__field(unsigned long, ino)
@@ -160,7 +160,7 @@ DECLARE_EVENT_CLASS(dax_pte_fault_class,
 		__field(pgoff_t, pgoff)
 		__field(dev_t, dev)
 		__field(unsigned int, flags)
-		__field(int, result)
+		__field(unsigned int, result)
 	),
 	TP_fast_assign(
 		__entry->dev = inode->i_sb->s_dev;
@@ -169,7 +169,7 @@ DECLARE_EVENT_CLASS(dax_pte_fault_class,
 		__entry->address = vmf->address;
 		__entry->flags = vmf->flags;
 		__entry->pgoff = vmf->pgoff;
-		__entry->result = result;
+		__entry->result = (__force unsigned int) result;
 	),
 	TP_printk("dev %d:%d ino %#lx %s %s address %#lx pgoff %#lx %s",
 		MAJOR(__entry->dev),
@@ -185,7 +185,7 @@ DECLARE_EVENT_CLASS(dax_pte_fault_class,
 
 #define DEFINE_PTE_FAULT_EVENT(name) \
 DEFINE_EVENT(dax_pte_fault_class, name, \
-	TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result), \
+	TP_PROTO(struct inode *inode, struct vm_fault *vmf, vm_fault_t result), \
 	TP_ARGS(inode, vmf, result))
 
 DEFINE_PTE_FAULT_EVENT(dax_pte_fault);


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 20/25] devdax: add PUD support to the DAX mapping infrastructure
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (18 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 19/25] devdax: Sparse fixes for vm_fault_t in tracepoints Dan Williams
@ 2022-10-14 23:58 ` Dan Williams
  2022-10-14 23:59 ` [PATCH v3 21/25] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

In preparation for using the DAX mapping infrastructure for device-dax,
update the helpers to handle PUD entries.

In practice the code related to @size_downgrade will go unused for PUD
entries since only devdax creates DAX PUD entries and devdax enforces
aligned mappings. The conversion is included for completeness.

The addition of PUD support to the common dax_insert_pfn_mkwrite()
requires a new stub for vmf_insert_pfn_pud() in the case where huge page
support and/or PUD support is not available.

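For scale, a worked example (assuming x86-64 with 4KiB base pages, i.e.
PAGE_SHIFT == 12, PMD_SHIFT == 21, PUD_SHIFT == 30):

  PTE entry: order 0              ->   4KiB,      1 base page
  PMD entry: order 21 - 12 =  9   ->   2MiB,    512 base pages (PG_PMD_NR)
  PUD entry: order 30 - 12 = 18   ->   1GiB, 262144 base pages (PG_PUD_NR)

so on such a configuration the new PUD_ORDER and PG_PUD_NR below resolve
to 18 and 262144, and PG_PUD_COLOUR masks the low 18 bits of a page
offset.
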
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/mapping.c   |   50 ++++++++++++++++++++++++++++++++++++-----------
 include/linux/dax.h     |   32 ++++++++++++++++++++----------
 include/linux/huge_mm.h |   14 ++++++++++---
 3 files changed, 70 insertions(+), 26 deletions(-)

diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index b452bfa98f5e..ba01c1cf4b51 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -13,6 +13,7 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 #include <linux/pagemap.h>
+#include <linux/huge_mm.h>
 
 #include "dax-private.h"
 
@@ -56,6 +57,8 @@ static bool dax_is_zapped(void *entry)
 
 static unsigned int dax_entry_order(void *entry)
 {
+	if (xa_to_value(entry) & DAX_PUD)
+		return PUD_ORDER;
 	if (xa_to_value(entry) & DAX_PMD)
 		return PMD_ORDER;
 	return 0;
@@ -66,9 +69,14 @@ static unsigned long dax_is_pmd_entry(void *entry)
 	return xa_to_value(entry) & DAX_PMD;
 }
 
+static unsigned long dax_is_pud_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_PUD;
+}
+
 static bool dax_is_pte_entry(void *entry)
 {
-	return !(xa_to_value(entry) & DAX_PMD);
+	return !(xa_to_value(entry) & (DAX_PMD|DAX_PUD));
 }
 
 static int dax_is_zero_entry(void *entry)
@@ -277,6 +285,8 @@ static unsigned long dax_entry_size(void *entry)
 		return 0;
 	else if (dax_is_pmd_entry(entry))
 		return PMD_SIZE;
+	else if (dax_is_pud_entry(entry))
+		return PUD_SIZE;
 	else
 		return PAGE_SIZE;
 }
@@ -561,11 +571,11 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 			     struct address_space *mapping, unsigned int order)
 {
 	unsigned long index = xas->xa_index;
-	bool pmd_downgrade; /* splitting PMD entry into PTE entries? */
+	bool size_downgrade; /* splitting entry into PTE entries? */
 	void *entry;
 
 retry:
-	pmd_downgrade = false;
+	size_downgrade = false;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);
 
@@ -578,15 +588,25 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		}
 
 		if (order == 0) {
-			if (dax_is_pmd_entry(entry) &&
+			if (!dax_is_pte_entry(entry) &&
 			    (dax_is_zero_entry(entry) ||
 			     dax_is_empty_entry(entry))) {
-				pmd_downgrade = true;
+				size_downgrade = true;
 			}
 		}
 	}
 
-	if (pmd_downgrade) {
+	if (size_downgrade) {
+		unsigned long colour, nr;
+
+		if (dax_is_pmd_entry(entry)) {
+			colour = PG_PMD_COLOUR;
+			nr = PG_PMD_NR;
+		} else {
+			colour = PG_PUD_COLOUR;
+			nr = PG_PUD_NR;
+		}
+
 		/*
 		 * Make sure 'entry' remains valid while we drop
 		 * the i_pages lock.
@@ -600,9 +620,8 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		 */
 		if (dax_is_zero_entry(entry)) {
 			xas_unlock_irq(xas);
-			unmap_mapping_pages(mapping,
-					    xas->xa_index & ~PG_PMD_COLOUR,
-					    PG_PMD_NR, false);
+			unmap_mapping_pages(mapping, xas->xa_index & ~colour,
+					    nr, false);
 			xas_reset(xas);
 			xas_lock_irq(xas);
 		}
@@ -610,7 +629,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		dax_disassociate_entry(entry, mapping, false);
 		xas_store(xas, NULL); /* undo the PMD join */
 		dax_wake_entry(xas, entry, WAKE_ALL);
-		mapping->nrpages -= PG_PMD_NR;
+		mapping->nrpages -= nr;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -620,7 +639,9 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 	} else {
 		unsigned long flags = DAX_EMPTY;
 
-		if (order > 0)
+		if (order == PUD_SHIFT - PAGE_SHIFT)
+			flags |= DAX_PUD;
+		else if (order == PMD_SHIFT - PAGE_SHIFT)
 			flags |= DAX_PMD;
 		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
 		dax_lock_entry(xas, entry);
@@ -864,7 +885,10 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
-		if (dax_is_pmd_entry(entry))
+		if (dax_is_pud_entry(entry))
+			unmap_mapping_pages(mapping, index & ~PG_PUD_COLOUR,
+					    PG_PUD_NR, false);
+		else if (dax_is_pmd_entry(entry))
 			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
 					PG_PMD_NR, false);
 		else /* pte entry */
@@ -1040,6 +1064,8 @@ vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
 	else if (order == PMD_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
+	else if (order == PUD_ORDER)
+		ret = vmf_insert_pfn_pud(vmf, pfn, FAULT_FLAG_WRITE);
 	else
 		ret = VM_FAULT_FALLBACK;
 	dax_unlock_entry(&xas, entry);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 553bc819a6a4..a61df43921a3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -300,22 +300,25 @@ static inline bool dax_mapping(struct address_space *mapping)
 }
 
 /*
- * DAX pagecache entries use XArray value entries so they can't be mistaken
- * for pages.  We use one bit for locking, one bit for the entry size (PMD)
- * and two more to tell us if the entry is a zero page or an empty entry that
- * is just used for locking.  In total four special bits.
+ * DAX pagecache entries use XArray value entries so they can't be
+ * mistaken for pages.  We use one bit for locking, two bits for the
+ * entry size (PMD, PUD), two to tell us if the entry is a zero page or
+ * an empty entry that is just used for locking, and one to mark a
+ * zapped entry.  In total six special bits, which limits the max pfn
+ * that can be stored to 57 bits (64 - DAX_SHIFT - 1 for xa_mk_value()).
  *
- * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
- * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
- * block allocation.
+ * If the P{M,U}D bits are not set the entry has size PAGE_SIZE, and if
+ * the ZERO_PAGE and EMPTY bits aren't set the entry is a normal DAX
+ * entry with a filesystem block allocation.
  */
-#define DAX_SHIFT	(5)
+#define DAX_SHIFT	(6)
 #define DAX_MASK	((1UL << DAX_SHIFT) - 1)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
-#define DAX_ZERO_PAGE	(1UL << 2)
-#define DAX_EMPTY	(1UL << 3)
-#define DAX_ZAP		(1UL << 4)
+#define DAX_PUD		(1UL << 2)
+#define DAX_ZERO_PAGE	(1UL << 3)
+#define DAX_EMPTY	(1UL << 4)
+#define DAX_ZAP		(1UL << 5)
 
 /*
  * These flags are not conveyed in Xarray value entries, they are just
@@ -339,6 +342,13 @@ int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 /* The order of a PMD entry */
 #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
 
+/* The 'colour' (ie low bits) within a PUD of a page offset.  */
+#define PG_PUD_COLOUR ((PUD_SIZE >> PAGE_SHIFT) - 1)
+#define PG_PUD_NR (PUD_SIZE >> PAGE_SHIFT)
+
+/* The order of a PUD entry */
+#define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)
+
 static inline unsigned int pe_order(enum page_entry_size pe_size)
 {
 	if (pe_size == PE_SIZE_PTE)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a1341fdcf666..aab708996fb0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -16,12 +16,22 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma);
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
+				   pgprot_t pgprot, bool write);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
 }
+
+static inline vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf,
+						 pfn_t pfn, pgprot_t pgprot,
+						 bool write)
+{
+	return VM_FAULT_SIGBUS;
+}
 #endif
 
 vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
@@ -58,8 +68,6 @@ static inline vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn,
 {
 	return vmf_insert_pfn_pmd_prot(vmf, pfn, vmf->vma->vm_page_prot, write);
 }
-vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
-				   pgprot_t pgprot, bool write);
 
 /**
  * vmf_insert_pfn_pud - insert a pud size pfn


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 21/25] devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (19 preceding siblings ...)
  2022-10-14 23:58 ` [PATCH v3 20/25] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
@ 2022-10-14 23:59 ` Dan Williams
  2022-10-14 23:59 ` [PATCH v3 22/25] mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios() Dan Williams
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

Track entries and take pgmap references at mapping insertion time.
Revoke mappings (dax_zap_mappings()) and drop the associated pgmap
references at device destruction or inode eviction time. With this in
place, and the fsdax equivalent already in place, the gup code no longer
needs to consider PTE_DEVMAP as an indicator to get a pgmap reference
before taking a page reference.

In other words, GUP takes additional references on mapped pages. Until
now, DAX in all its forms has been failing to take references at mapping
time. With that fixed, there is no longer a requirement for gup to
manage @pgmap references; that cleanup is saved for a follow-on patch.
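
For illustration, a minimal sketch (not part of the patch) of the
insertion-time reference pattern the device-dax PTE fault path adopts
below; the wrapper name is made up, the dax_* helpers are the ones this
series exports:

static vm_fault_t dev_dax_pte_fault_sketch(struct vm_fault *vmf, pfn_t pfn)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
	vm_fault_t ret;
	void *entry;

	/* track the entry and take page references at insertion time */
	entry = dax_grab_mapping_entry(&xas, mapping, 0);
	if (is_dax_err(entry))
		return dax_err_to_vmfault(entry);

	ret = dax_insert_entry(&xas, vmf, &entry, pfn, 0);
	dax_unlock_entry(&xas, entry);
	if (ret)
		return ret;

	/*
	 * References are dropped again at device destruction / inode
	 * eviction via dax_break_layouts() + dax_delete_mapping_entry().
	 */
	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
}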

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig   |    1 +
 drivers/dax/bus.c     |    9 +++++-
 drivers/dax/device.c  |   73 +++++++++++++++++++++++++++++--------------------
 drivers/dax/mapping.c |   19 +++++++++----
 include/linux/dax.h   |    3 +-
 5 files changed, 68 insertions(+), 37 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 205e9dda8928..2eddd32c51f4 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -9,6 +9,7 @@ if DAX
 config DEV_DAX
 	tristate "Device DAX: direct access mapping device"
 	depends on TRANSPARENT_HUGEPAGE
+	depends on !FS_DAX_LIMITED
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..f2a8b8c3776f 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -382,9 +382,16 @@ void kill_dev_dax(struct dev_dax *dev_dax)
 {
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct inode *inode = dax_inode(dax_dev);
+	struct address_space *mapping = inode->i_mapping;
 
 	kill_dax(dax_dev);
-	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+
+	/*
+	 * The dax device inode can outlive the next reuse of the memory
+	 * fronted by this device; force it idle now.
+	 */
+	dax_break_layouts(mapping, 0, ULONG_MAX >> PAGE_SHIFT);
+	truncate_inode_pages(mapping, 0);
 
 	/*
 	 * Dynamic dax region have the pgmap allocated via dev_kzalloc()
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5494d745ced5..022d4ba9c336 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -73,38 +73,15 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
-			      unsigned long fault_size)
-{
-	unsigned long i, nr_pages = fault_size / PAGE_SIZE;
-	struct file *filp = vmf->vma->vm_file;
-	struct dev_dax *dev_dax = filp->private_data;
-	pgoff_t pgoff;
-
-	/* mapping is only set on the head */
-	if (dev_dax->pgmap->vmemmap_shift)
-		nr_pages = 1;
-
-	pgoff = linear_page_index(vmf->vma,
-			ALIGN(vmf->address, fault_size));
-
-	for (i = 0; i < nr_pages; i++) {
-		struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-
-		page = compound_head(page);
-		if (page->mapping)
-			continue;
-
-		page->mapping = filp->f_mapping;
-		page->index = pgoff + i;
-	}
-}
-
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
@@ -128,7 +105,16 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, 0);
+	if (is_dax_err(entry))
+		return dax_err_to_vmfault(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, 0);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
@@ -136,10 +122,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
 	pgoff_t pgoff;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
@@ -171,7 +161,16 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	if (is_dax_err(entry))
+		return dax_err_to_vmfault(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PMD);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
@@ -180,10 +179,14 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pud_addr = vmf->address & PUD_MASK;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
 	pgoff_t pgoff;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
@@ -216,7 +219,16 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, PUD_ORDER);
+	if (is_dax_err(entry))
+		return dax_err_to_vmfault(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PUD);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
@@ -494,3 +506,4 @@ MODULE_LICENSE("GPL v2");
 module_init(dax_init);
 module_exit(dax_exit);
 MODULE_ALIAS_DAX_DEVICE(0);
+MODULE_IMPORT_NS(DAX);
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index ba01c1cf4b51..07caaa23d476 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -266,6 +266,7 @@ void dax_unlock_entry(struct xa_state *xas, void *entry)
 	WARN_ON(!dax_is_locked(old));
 	dax_wake_entry(xas, entry, WAKE_NEXT);
 }
+EXPORT_SYMBOL_NS_GPL(dax_unlock_entry, DAX);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -663,6 +664,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 	xas_unlock_irq(xas);
 	return vmfault_to_dax_err(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL_NS_GPL(dax_grab_mapping_entry, DAX);
 
 static void *dax_zap_entry(struct xa_state *xas, void *entry)
 {
@@ -814,15 +816,21 @@ static int __dax_invalidate_entry(struct address_space *mapping,
  * wait indefinitely for all pins to drop, the alternative to waiting is
  * a potential use-after-free scenario
  */
-static void dax_break_layout(struct address_space *mapping, pgoff_t index)
+void dax_break_layouts(struct address_space *mapping, pgoff_t index,
+		       pgoff_t end)
 {
-	/* To do this without locks, the inode needs to be unreferenced */
-	WARN_ON(atomic_read(&mapping->host->i_count));
+	struct inode *inode = mapping->host;
+
+	/*
+	 * To do this without filesystem locks, the inode needs to be
+	 * unreferenced, or device-dax.
+	 */
+	WARN_ON(atomic_read(&inode->i_count) && !S_ISCHR(inode->i_mode));
 	do {
 		struct page *page;
 
 		page = dax_zap_mappings_range(mapping, index << PAGE_SHIFT,
-					      (index + 1) << PAGE_SHIFT);
+					      end << PAGE_SHIFT);
 		if (!page)
 			return;
 		wait_var_event(page, dax_page_idle(page));
@@ -838,7 +846,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 	int ret;
 
 	if (mapping_exiting(mapping))
-		dax_break_layout(mapping, index);
+		dax_break_layouts(mapping, index, index + 1);
 
 	ret = __dax_invalidate_entry(mapping, index, true);
 
@@ -932,6 +940,7 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 
 	return ret;
 }
+EXPORT_SYMBOL_NS_GPL(dax_insert_entry, DAX);
 
 int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 		      struct address_space *mapping, void *entry) __must_hold(xas)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a61df43921a3..f2fbb5746ffa 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -181,7 +181,6 @@ dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
 		unsigned long index, struct page **page);
 void dax_unlock_mapping_entry(struct address_space *mapping,
 		unsigned long index, dax_entry_t cookie);
-void dax_break_layouts(struct inode *inode);
 struct page *dax_zap_mappings(struct address_space *mapping);
 struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
 				    loff_t end);
@@ -286,6 +285,8 @@ void dax_unlock_entry(struct xa_state *xas, void *entry);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
+void dax_break_layouts(struct address_space *mapping, pgoff_t index,
+		       pgoff_t end);
 int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 				  struct inode *dest, loff_t destoff,
 				  loff_t len, bool *is_same,


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 22/25] mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (20 preceding siblings ...)
  2022-10-14 23:59 ` [PATCH v3 21/25] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
@ 2022-10-14 23:59 ` Dan Williams
  2022-10-17 19:17   ` Lyude Paul
  2022-10-14 23:59 ` [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0 Dan Williams
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Felix Kuehling,
	Alex Deucher, Christian König, Pan, Xinhui, David Airlie,
	Daniel Vetter, Ben Skeggs, Karol Herbst, Lyude Paul,
	Jérôme Glisse, david, nvdimm, akpm, linux-fsdevel

Switch to the common method, shared across all MEMORY_DEVICE_* types,
for requesting access to a ZONE_DEVICE page. The
MEMORY_DEVICE_{PRIVATE,COHERENT} specific expectation that newly
requested pages are locked is moved to the callers.
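
As a hedged sketch of the conversion pattern applied to the drivers
below (not a complete allocator), a MEMORY_DEVICE_PRIVATE page
allocation that used to call zone_device_page_init() becomes:

static struct page *devmem_alloc_page_sketch(unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);

	/* was: zone_device_page_init(page) */
	pgmap_request_folios(page->pgmap, page_folio(page), 1);

	/* the "newly requested pages are locked" expectation moves here */
	lock_page(page);

	return page;
}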

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c       |    3 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |    3 ++-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |    3 ++-
 include/linux/memremap.h                 |    1 -
 lib/test_hmm.c                           |    3 ++-
 mm/memremap.c                            |   13 +------------
 6 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e2f11f9c3f2a..884ec112ad43 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -718,7 +718,8 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	zone_device_page_init(dpage);
+	pgmap_request_folios(dpage->pgmap, page_folio(dpage), 1);
+	lock_page(dpage);
 	return dpage;
 out_clear:
 	spin_lock(&kvmppc_uvmem_bitmap_lock);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 97a684568ae0..8cf97060122b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -223,7 +223,8 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
 	page = pfn_to_page(pfn);
 	svm_range_bo_ref(prange->svm_bo);
 	page->zone_device_data = prange->svm_bo;
-	zone_device_page_init(page);
+	pgmap_request_folios(page->pgmap, page_folio(page), 1);
+	lock_page(page);
 }
 
 static void
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 5fe209107246..1482533c7ca0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,8 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 			return NULL;
 	}
 
-	zone_device_page_init(page);
+	pgmap_request_folios(page->pgmap, page_folio(page), 1);
+	lock_page(page);
 	return page;
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 98196b8d3172..3fb3809d71f3 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -187,7 +187,6 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 67e6f83fe0f8..e4f7219ae3bb 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -632,7 +632,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 			goto error;
 	}
 
-	zone_device_page_init(dpage);
+	pgmap_request_folios(dpage->pgmap, page_folio(dpage), 1);
+	lock_page(dpage);
 	dpage->zone_device_data = rpage;
 	return dpage;
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 87a649ecdc54..c46e700f5245 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -518,18 +518,6 @@ void free_zone_device_page(struct page *page)
 		put_dev_pagemap(page->pgmap);
 }
 
-void zone_device_page_init(struct page *page)
-{
-	/*
-	 * Drivers shouldn't be allocating pages after calling
-	 * memunmap_pages().
-	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
-	set_page_count(page, 1);
-	lock_page(page);
-}
-EXPORT_SYMBOL_GPL(zone_device_page_init);
-
 static bool folio_span_valid(struct dev_pagemap *pgmap, struct folio *folio,
 			     int nr_folios)
 {
@@ -586,6 +574,7 @@ bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(pgmap_request_folios);
 
 void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
 {


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (21 preceding siblings ...)
  2022-10-14 23:59 ` [PATCH v3 22/25] mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios() Dan Williams
@ 2022-10-14 23:59 ` Dan Williams
  2022-10-17  7:04   ` Alistair Popple
  2022-10-17 19:48   ` Jason Gunthorpe
  2022-10-14 23:59 ` [PATCH v3 24/25] mm/memremap_pages: Delete put_devmap_managed_page_refs() Dan Williams
  2022-10-14 23:59 ` [PATCH v3 25/25] mm/gup: Drop DAX pgmap accounting Dan Williams
  24 siblings, 2 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Alistair Popple, Jason Gunthorpe, david, nvdimm,
	akpm, linux-fsdevel

The initial memremap_pages() implementation inherited the
__init_single_page() default of pages starting life with an elevated
reference count. This originally allowed for the page->pgmap pointer to
alias with the storage for page->lru since a page was only allowed to be
on an lru list when its reference count was zero.

Since then, 'struct page' definition cleanups have arranged for
dedicated space for the ZONE_DEVICE page metadata, the
MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
page->_refcount transition to route the page to free_zone_device_page()
and not the core-mm page-free, and MEMORY_DEVICE_{PRIVATE,COHERENT} now
arranges for its ZONE_DEVICE pages to start at _refcount 0. With those
cleanups in place and with filesystem-dax and device-dax now converted
to take and drop references at map and truncate time, it is possible to
start MEMORY_DEVICE_FS_DAX and MEMORY_DEVICE_GENERIC reference counts at
0 as well.

This conversion also unifies all @pgmap accounting to be relative to
pgmap_request_folios() and the paired folio_put() calls for those
requested folios. This allows pgmap_release_folios() to be simplified to
just a folio_put() helper.
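
A rough lifecycle sketch (illustrative function name, helpers as
modified by this patch), assuming the caller keeps the pgmap alive
across the request:

static void zone_device_folio_lifecycle_sketch(struct dev_pagemap *pgmap,
					       struct folio *folio)
{
	/* ZONE_DEVICE folios now start life at _refcount == 0 */
	WARN_ON_ONCE(folio_ref_count(folio) != 0);

	/* the 0 -> 1 transition takes the @pgmap reference */
	pgmap_request_folios(pgmap, folio, 1);

	/*
	 * ... the folio is mapped, gup may take and drop further
	 * references ...
	 *
	 * The final 1 -> 0 folio_put() routes to free_zone_device_page(),
	 * which wakes 'break dax layouts' waiters, calls ->page_free() if
	 * present, and drops the @pgmap reference.
	 */
	folio_put(folio);
}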

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/mapping.c    |    2 +-
 include/linux/dax.h      |    2 +-
 include/linux/memremap.h |    6 ++----
 mm/memremap.c            |   36 ++++++++++++++++--------------------
 mm/page_alloc.c          |    9 +--------
 5 files changed, 21 insertions(+), 34 deletions(-)

diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index 07caaa23d476..ca06f2515644 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -691,7 +691,7 @@ static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
 
 	dax_for_each_folio(entry, folio, i) {
 		if (zap)
-			pgmap_release_folios(folio_pgmap(folio), folio, 1);
+			pgmap_release_folios(folio, 1);
 		if (!ret && !dax_folio_idle(folio))
 			ret = folio_page(folio, 0);
 	}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f2fbb5746ffa..f4fc37933fc2 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -235,7 +235,7 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
  */
 static inline bool dax_page_idle(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }
 
 static inline bool dax_folio_idle(struct folio *folio)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3fb3809d71f3..ddb196ae0696 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -195,8 +195,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 				    struct dev_pagemap *pgmap);
 bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
 			  int nr_folios);
-void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio,
-			  int nr_folios);
+void pgmap_release_folios(struct folio *folio, int nr_folios);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -238,8 +237,7 @@ static inline bool pgmap_request_folios(struct dev_pagemap *pgmap,
 	return false;
 }
 
-static inline void pgmap_release_folios(struct dev_pagemap *pgmap,
-					struct folio *folio, int nr_folios)
+static inline void pgmap_release_folios(struct folio *folio, int nr_folios)
 {
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index c46e700f5245..368ff41c560b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -469,8 +469,10 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 void free_zone_device_page(struct page *page)
 {
-	if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
-		return;
+	struct dev_pagemap *pgmap = page->pgmap;
+
+	/* wake filesystem 'break dax layouts' waiters */
+	wake_up_var(page);
 
 	mem_cgroup_uncharge(page_folio(page));
 
@@ -505,17 +507,9 @@ void free_zone_device_page(struct page *page)
 	 * to clear page->mapping.
 	 */
 	page->mapping = NULL;
-	page->pgmap->ops->page_free(page);
-
-	if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
-	    page->pgmap->type != MEMORY_DEVICE_COHERENT)
-		/*
-		 * Reset the page count to 1 to prepare for handing out the page
-		 * again.
-		 */
-		set_page_count(page, 1);
-	else
-		put_dev_pagemap(page->pgmap);
+	if (pgmap->ops && pgmap->ops->page_free)
+		pgmap->ops->page_free(page);
+	put_dev_pagemap(page->pgmap);
 }
 
 static bool folio_span_valid(struct dev_pagemap *pgmap, struct folio *folio,
@@ -576,17 +570,19 @@ bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
 }
 EXPORT_SYMBOL_GPL(pgmap_request_folios);
 
-void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
+/*
+ * A symmetric helper to undo the page references acquired by
+ * pgmap_request_folios(), but the caller can also just arrange
+ * folio_put() on all the folios it acquired previously for the same
+ * effect.
+ */
+void pgmap_release_folios(struct folio *folio, int nr_folios)
 {
 	struct folio *iter;
 	int i;
 
-	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
-		if (!put_devmap_managed_page(&iter->page))
-			folio_put(iter);
-		if (!folio_ref_count(iter))
-			put_dev_pagemap(pgmap);
-	}
+	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++)
+		folio_put(iter);
 }
 
 #ifdef CONFIG_FS_DAX
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e9b7f08a32c..e35d1eb3308d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6787,6 +6787,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 {
 
 	__init_single_page(page, pfn, zone_idx, nid);
+	set_page_count(page, 0);
 
 	/*
 	 * Mark page reserved as it will need to wait for onlining
@@ -6819,14 +6820,6 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 		cond_resched();
 	}
-
-	/*
-	 * ZONE_DEVICE pages are released directly to the driver page allocator
-	 * which will set the page count to 1 when allocating the page.
-	 */
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_COHERENT)
-		set_page_count(page, 0);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 24/25] mm/memremap_pages: Delete put_devmap_managed_page_refs()
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (22 preceding siblings ...)
  2022-10-14 23:59 ` [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0 Dan Williams
@ 2022-10-14 23:59 ` Dan Williams
  2022-10-17  7:08   ` Alistair Popple
  2022-10-14 23:59 ` [PATCH v3 25/25] mm/gup: Drop DAX pgmap accounting Dan Williams
  24 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Alistair Popple, Jason Gunthorpe, david, nvdimm,
	akpm, linux-fsdevel

Now that fsdax DMA-idle detection no longer depends on catching
transitions of page->_refcount to 1, and all users of pgmap pages get
access to them via pgmap_request_folios(), remove
put_devmap_managed_page_refs() and associated infrastructure. This
includes the pgmap references taken at the beginning of time for each
page because those @pgmap references are now arbitrated via
pgmap_request_folios().
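
For instance, the generic put path reduces to a plain folio_put()
(sketch only; the real put_page() in the include/linux/mm.h hunk below
keeps its local folio variable):

static inline void put_page_sketch(struct page *page)
{
	/*
	 * No put_devmap_managed_page() special case anymore: ZONE_DEVICE
	 * pages are plain refcounted and the final put lands in
	 * free_zone_device_page() via folio_put().
	 */
	folio_put(page_folio(page));
}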

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h |   30 ------------------------------
 mm/gup.c           |    6 ++----
 mm/memremap.c      |   38 --------------------------------------
 mm/swap.c          |    2 --
 4 files changed, 2 insertions(+), 74 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..c63dfc804f1e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1082,30 +1082,6 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
  *   back into memory.
  */
 
-#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-bool __put_devmap_managed_page_refs(struct page *page, int refs);
-static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
-{
-	if (!static_branch_unlikely(&devmap_managed_key))
-		return false;
-	if (!is_zone_device_page(page))
-		return false;
-	return __put_devmap_managed_page_refs(page, refs);
-}
-#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
-{
-	return false;
-}
-#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-
-static inline bool put_devmap_managed_page(struct page *page)
-{
-	return put_devmap_managed_page_refs(page, 1);
-}
-
 /* 127: arbitrary random number, small enough to assemble well */
 #define folio_ref_zero_or_close_to_overflow(folio) \
 	((unsigned int) folio_ref_count(folio) + 127u <= 127u)
@@ -1202,12 +1178,6 @@ static inline void put_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
 
-	/*
-	 * For some devmap managed pages we need to catch refcount transition
-	 * from 2 to 1:
-	 */
-	if (put_devmap_managed_page(&folio->page))
-		return;
 	folio_put(folio);
 }
 
diff --git a/mm/gup.c b/mm/gup.c
index ce00a4c40da8..e49b1f46faa5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -87,8 +87,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	 * belongs to this folio.
 	 */
 	if (unlikely(page_folio(page) != folio)) {
-		if (!put_devmap_managed_page_refs(&folio->page, refs))
-			folio_put_refs(folio, refs);
+		folio_put_refs(folio, refs);
 		goto retry;
 	}
 
@@ -184,8 +183,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 			refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
-	if (!put_devmap_managed_page_refs(&folio->page, refs))
-		folio_put_refs(folio, refs);
+	folio_put_refs(folio, refs);
 }
 
 /**
diff --git a/mm/memremap.c b/mm/memremap.c
index 368ff41c560b..53fe30bb79bb 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -94,19 +94,6 @@ bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
 	return false;
 }
 
-static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
-{
-	const struct range *range = &pgmap->ranges[range_id];
-
-	return (range->start + range_len(range)) >> PAGE_SHIFT;
-}
-
-static unsigned long pfn_len(struct dev_pagemap *pgmap, unsigned long range_id)
-{
-	return (pfn_end(pgmap, range_id) -
-		pfn_first(pgmap, range_id)) >> pgmap->vmemmap_shift;
-}
-
 static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 {
 	struct range *range = &pgmap->ranges[range_id];
@@ -138,10 +125,6 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	int i;
 
 	percpu_ref_kill(&pgmap->ref);
-	if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
-	    pgmap->type != MEMORY_DEVICE_COHERENT)
-		for (i = 0; i < pgmap->nr_range; i++)
-			percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
 
 	wait_for_completion(&pgmap->done);
 
@@ -267,9 +250,6 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
-	    pgmap->type != MEMORY_DEVICE_COHERENT)
-		percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
 	return 0;
 
 err_add_memory:
@@ -584,21 +564,3 @@ void pgmap_release_folios(struct folio *folio, int nr_folios)
 	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++)
 		folio_put(iter);
 }
-
-#ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_page_refs(struct page *page, int refs)
-{
-	if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
-		return false;
-
-	/*
-	 * fsdax page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (page_ref_sub_return(page, refs) == 1)
-		wake_up_var(page);
-	return true;
-}
-EXPORT_SYMBOL(__put_devmap_managed_page_refs);
-#endif /* CONFIG_FS_DAX */
diff --git a/mm/swap.c b/mm/swap.c
index 955930f41d20..0742b84fbf17 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1003,8 +1003,6 @@ void release_pages(struct page **pages, int nr)
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			if (put_devmap_managed_page(&folio->page))
-				continue;
 			if (folio_put_testzero(folio))
 				free_zone_device_page(&folio->page);
 			continue;


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 25/25] mm/gup: Drop DAX pgmap accounting
  2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
                   ` (23 preceding siblings ...)
  2022-10-14 23:59 ` [PATCH v3 24/25] mm/memremap_pages: Delete put_devmap_managed_page_refs() Dan Williams
@ 2022-10-14 23:59 ` Dan Williams
  24 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-14 23:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Jason Gunthorpe, david, nvdimm, akpm,
	linux-fsdevel

Now that pgmap accounting is handled at pgmap_request_folios() time, it
can be dropped from gup time.

One hurdle remains: filesystem-DAX huge pages are not compound pages,
which requires infrastructure like __gup_device_huge_p{m,u}d() to stick
around.

Additionally, ZONE_DEVICE pages with this change are still not suitable
to be returned from vm_normal_page(), so this cleanup is limited to
deleting pgmap reference manipulation. This is an incremental step on
the path to removing pte_devmap() altogether.

Note that follow_devmap_pmd() can be deleted entirely since a few added
pmd_devmap() checks allow the transparent huge page path to be reused.
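
A condensed sketch of the resulting follow_pmd_mask() flow (the pmd
lock, retry, and migration handling in the real hunk below are
omitted):

static struct page *follow_pmd_sketch(struct vm_area_struct *vma,
				      unsigned long addr, pmd_t *pmd,
				      unsigned int flags)
{
	pmd_t pmdval = READ_ONCE(*pmd);

	if (!(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
		return follow_page_pte(vma, addr, pmd, flags);

	/* devmap PMDs ride the transparent-huge-page path, no pgmap refs */
	return follow_trans_huge_pmd(vma, addr, pmd, flags);
}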

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/huge_mm.h |   12 +------
 mm/gup.c                |   83 +++++++++++------------------------------------
 mm/huge_memory.c        |   48 +--------------------------
 3 files changed, 22 insertions(+), 121 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aab708996fb0..5d861905df46 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -266,10 +266,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 	return folio_order(folio) >= HPAGE_PMD_ORDER;
 }
 
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap);
+		pud_t *pud, int flags);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
@@ -428,14 +426,8 @@ static inline void mm_put_huge_zero_page(struct mm_struct *mm)
 	return;
 }
 
-static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
-	unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
-	return NULL;
-}
-
 static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
-	unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
+	unsigned long addr, pud_t *pud, int flags)
 {
 	return NULL;
 }
diff --git a/mm/gup.c b/mm/gup.c
index e49b1f46faa5..dc5284df3f6c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -25,7 +25,6 @@
 #include "internal.h"
 
 struct follow_page_context {
-	struct dev_pagemap *pgmap;
 	unsigned int page_mask;
 };
 
@@ -522,8 +521,7 @@ static inline bool can_follow_write_pte(pte_t pte, struct page *page,
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd, unsigned int flags,
-		struct dev_pagemap **pgmap)
+		unsigned long address, pmd_t *pmd, unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -574,17 +572,13 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
+	if (!page && pte_devmap(pte)) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
-		 * case since they are only valid while holding the pgmap
-		 * reference.
+		 * ZONE_DEVICE pages are not yet treated as vm_normal_page()
+		 * instances, with respect to mapcount and compound-page
+		 * metadata
 		 */
-		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
-		if (*pgmap)
-			page = pte_page(pte);
-		else
-			goto no_page;
+		page = pte_page(pte);
 	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
@@ -702,15 +696,8 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			return no_page_table(vma, flags);
 		goto retry;
 	}
-	if (pmd_devmap(pmdval)) {
-		ptl = pmd_lock(mm, pmd);
-		page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
-		spin_unlock(ptl);
-		if (page)
-			return page;
-	}
-	if (likely(!pmd_trans_huge(pmdval)))
-		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+	if (likely(!(pmd_trans_huge(pmdval) || pmd_devmap(pmdval))))
+		return follow_page_pte(vma, address, pmd, flags);
 
 	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
 		return no_page_table(vma, flags);
@@ -728,9 +715,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		pmd_migration_entry_wait(mm, pmd);
 		goto retry_locked;
 	}
-	if (unlikely(!pmd_trans_huge(*pmd))) {
+	if (unlikely(!(pmd_trans_huge(*pmd) || pmd_devmap(pmdval)))) {
 		spin_unlock(ptl);
-		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		return follow_page_pte(vma, address, pmd, flags);
 	}
 	if (flags & FOLL_SPLIT_PMD) {
 		int ret;
@@ -748,7 +735,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		}
 
 		return ret ? ERR_PTR(ret) :
-			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+			follow_page_pte(vma, address, pmd, flags);
 	}
 	page = follow_trans_huge_pmd(vma, address, pmd, flags);
 	spin_unlock(ptl);
@@ -785,7 +772,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 	}
 	if (pud_devmap(*pud)) {
 		ptl = pud_lock(mm, pud);
-		page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
+		page = follow_devmap_pud(vma, address, pud, flags);
 		spin_unlock(ptl);
 		if (page)
 			return page;
@@ -832,9 +819,6 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
  *
  * @flags can have FOLL_ flags set, defined in <linux/mm.h>
  *
- * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches
- * the device's dev_pagemap metadata to avoid repeating expensive lookups.
- *
  * When getting an anonymous page and the caller has to trigger unsharing
  * of a shared anonymous page first, -EMLINK is returned. The caller should
  * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only
@@ -889,7 +873,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			 unsigned int foll_flags)
 {
-	struct follow_page_context ctx = { NULL };
+	struct follow_page_context ctx = { 0 };
 	struct page *page;
 
 	if (vma_is_secretmem(vma))
@@ -899,8 +883,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		return NULL;
 
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
-	if (ctx.pgmap)
-		put_dev_pagemap(ctx.pgmap);
 	return page;
 }
 
@@ -1149,7 +1131,7 @@ static long __get_user_pages(struct mm_struct *mm,
 {
 	long ret = 0, i = 0;
 	struct vm_area_struct *vma = NULL;
-	struct follow_page_context ctx = { NULL };
+	struct follow_page_context ctx = { 0 };
 
 	if (!nr_pages)
 		return 0;
@@ -1264,8 +1246,6 @@ static long __get_user_pages(struct mm_struct *mm,
 		nr_pages -= page_increm;
 	} while (nr_pages);
 out:
-	if (ctx.pgmap)
-		put_dev_pagemap(ctx.pgmap);
 	return i ? i : ret;
 }
 
@@ -2408,9 +2388,8 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 			 unsigned long end, unsigned int flags,
 			 struct page **pages, int *nr)
 {
-	struct dev_pagemap *pgmap = NULL;
-	int nr_start = *nr, ret = 0;
 	pte_t *ptep, *ptem;
+	int ret = 0;
 
 	ptem = ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -2427,12 +2406,6 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		if (pte_devmap(pte)) {
 			if (unlikely(flags & FOLL_LONGTERM))
 				goto pte_unmap;
-
-			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, flags, pages);
-				goto pte_unmap;
-			}
 		} else if (pte_special(pte))
 			goto pte_unmap;
 
@@ -2480,8 +2453,6 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 	ret = 1;
 
 pte_unmap:
-	if (pgmap)
-		put_dev_pagemap(pgmap);
 	pte_unmap(ptem);
 	return ret;
 }
@@ -2509,28 +2480,17 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
-	struct dev_pagemap *pgmap = NULL;
-
 	do {
 		struct page *page = pfn_to_page(pfn);
 
-		pgmap = get_dev_pagemap(pfn, pgmap);
-		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
-			break;
-		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+		if (unlikely(!try_grab_page(page, flags)))
 			break;
-		}
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
 
-	put_dev_pagemap(pgmap);
 	return addr == end;
 }
 
@@ -2539,16 +2499,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
-	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
-	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		undo_dev_pagemap(nr, nr_start, flags, pages);
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp)))
 		return 0;
-	}
+
 	return 1;
 }
 
@@ -2557,16 +2515,13 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
-	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
-	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		undo_dev_pagemap(nr, nr_start, flags, pages);
+	if (unlikely(pud_val(orig) != pud_val(*pudp)))
 		return 0;
-	}
 	return 1;
 }
 #else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cc4a5f4791e..065c0dc03491 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1029,49 +1029,6 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
 		update_mmu_cache_pmd(vma, addr, pmd);
 }
 
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
-	unsigned long pfn = pmd_pfn(*pmd);
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-
-	assert_spin_locked(pmd_lockptr(mm, pmd));
-
-	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
-	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
-			 (FOLL_PIN | FOLL_GET)))
-		return NULL;
-
-	if (flags & FOLL_WRITE && !pmd_write(*pmd))
-		return NULL;
-
-	if (pmd_present(*pmd) && pmd_devmap(*pmd))
-		/* pass */;
-	else
-		return NULL;
-
-	if (flags & FOLL_TOUCH)
-		touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
-	/*
-	 * device mapped pages can only be returned if the
-	 * caller will manage the page reference count.
-	 */
-	if (!(flags & (FOLL_GET | FOLL_PIN)))
-		return ERR_PTR(-EEXIST);
-
-	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
-	page = pfn_to_page(pfn);
-	if (!try_grab_page(page, flags))
-		page = ERR_PTR(-ENOMEM);
-
-	return page;
-}
-
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
@@ -1188,7 +1145,7 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 }
 
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap)
+			       pud_t *pud, int flags)
 {
 	unsigned long pfn = pud_pfn(*pud);
 	struct mm_struct *mm = vma->vm_mm;
@@ -1222,9 +1179,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
 	if (!try_grab_page(page, flags))
 		page = ERR_PTR(-ENOMEM);


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-14 23:57 ` [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios() Dan Williams
@ 2022-10-17  6:31   ` Alistair Popple
  2022-10-17 20:06     ` Dan Williams
  2022-10-17 19:41   ` Jason Gunthorpe
  1 sibling, 1 reply; 42+ messages in thread
From: Alistair Popple @ 2022-10-17  6:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, Jason Gunthorpe, david, nvdimm,
	akpm, linux-fsdevel


Dan Williams <dan.j.williams@intel.com> writes:

[...]

> +/**
> + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> + * @pgmap: host page map for the folio array
> + * @folio: start of the folio list, all subsequent folios have same folio_size()
> + *
> + * Caller is responsible for @pgmap remaining live for the duration of
> + * this call. Caller is also responsible for not racing requests for the
> + * same folios.
> + */
> +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> +			  int nr_folios)

All callers of pgmap_request_folios() I could see call this with
nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
redundant arguments and simplify it to
pgmap_request_folios(struct folio *folio)?
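
A rough sketch of that simplification, assuming every caller really
does request a single folio whose pgmap is reachable via folio_pgmap():

static bool pgmap_request_folios(struct folio *folio)
{
	struct dev_pagemap *pgmap = folio_pgmap(folio);

	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
		return false;

	/* the 0 -> 1 transition takes the pgmap reference */
	folio_ref_inc(folio);
	if (folio_ref_count(folio) == 1)
		percpu_ref_tryget(&pgmap->ref);

	return true;
}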

> +{
> +	struct folio *iter;
> +	int i;
> +
> +	/*
> +	 * All of the WARNs below are for catching bugs in future
> +	 * development that changes the assumptions of:
> +	 * 1/ uniform folios in @pgmap
> +	 * 2/ @pgmap death does not race this routine.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> +
> +	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> +		return false;
> +
> +	for (iter = folio_next(folio), i = 1; i < nr_folios;
> +	     iter = folio_next(iter), i++)
> +		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> +			return false;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		folio_ref_inc(iter);
> +		if (folio_ref_count(iter) == 1)
> +			percpu_ref_tryget(&pgmap->ref);
> +	}
> +
> +	return true;
> +}
> +
> +void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
> +{
> +	struct folio *iter;
> +	int i;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		if (!put_devmap_managed_page(&iter->page))
> +			folio_put(iter);
> +		if (!folio_ref_count(iter))
> +			put_dev_pagemap(pgmap);
> +	}
> +}
> +
>  #ifdef CONFIG_FS_DAX
>  bool __put_devmap_managed_page_refs(struct page *page, int refs)
>  {

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0
  2022-10-14 23:59 ` [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0 Dan Williams
@ 2022-10-17  7:04   ` Alistair Popple
  2022-10-17 19:48   ` Jason Gunthorpe
  1 sibling, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2022-10-17  7:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, Jason Gunthorpe, david, nvdimm,
	akpm, linux-fsdevel


The DEVICE_PRIVATE/COHERENT changes look good to me so feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Dan Williams <dan.j.williams@intel.com> writes:

> The initial memremap_pages() implementation inherited the
> __init_single_page() default of pages starting life with an elevated
> reference count. This originally allowed for the page->pgmap pointer to
> alias with the storage for page->lru since a page was only allowed to be
> on an lru list when its reference count was zero.
>
> Since then, 'struct page' definition cleanups have arranged for
> dedicated space for the ZONE_DEVICE page metadata, the
> MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> page->_refcount transition to route the page to free_zone_device_page()
> and not the core-mm page-free, and MEMORY_DEVICE_{PRIVATE,COHERENT} now
> arranges for its ZONE_DEVICE pages to start at _refcount 0. With those
> cleanups in place and with filesystem-dax and device-dax now converted
> to take and drop references at map and truncate time, it is possible to
> start MEMORY_DEVICE_FS_DAX and MEMORY_DEVICE_GENERIC reference counts at
> 0 as well.
>
> This conversion also unifies all @pgmap accounting to be relative to
> pgmap_request_folios() and the paired folio_put() calls for those
> requested folios. This allows pgmap_release_folios() to be simplified to
> just a folio_put() helper.
>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/mapping.c    |    2 +-
>  include/linux/dax.h      |    2 +-
>  include/linux/memremap.h |    6 ++----
>  mm/memremap.c            |   36 ++++++++++++++++--------------------
>  mm/page_alloc.c          |    9 +--------
>  5 files changed, 21 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
> index 07caaa23d476..ca06f2515644 100644
> --- a/drivers/dax/mapping.c
> +++ b/drivers/dax/mapping.c
> @@ -691,7 +691,7 @@ static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
>
>  	dax_for_each_folio(entry, folio, i) {
>  		if (zap)
> -			pgmap_release_folios(folio_pgmap(folio), folio, 1);
> +			pgmap_release_folios(folio, 1);
>  		if (!ret && !dax_folio_idle(folio))
>  			ret = folio_page(folio, 0);
>  	}
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index f2fbb5746ffa..f4fc37933fc2 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -235,7 +235,7 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
>   */
>  static inline bool dax_page_idle(struct page *page)
>  {
> -	return page_ref_count(page) == 1;
> +	return page_ref_count(page) == 0;
>  }
>
>  static inline bool dax_folio_idle(struct folio *folio)
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 3fb3809d71f3..ddb196ae0696 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -195,8 +195,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  				    struct dev_pagemap *pgmap);
>  bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
>  			  int nr_folios);
> -void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio,
> -			  int nr_folios);
> +void pgmap_release_folios(struct folio *folio, int nr_folios);
>  bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
> @@ -238,8 +237,7 @@ static inline bool pgmap_request_folios(struct dev_pagemap *pgmap,
>  	return false;
>  }
>
> -static inline void pgmap_release_folios(struct dev_pagemap *pgmap,
> -					struct folio *folio, int nr_folios)
> +static inline void pgmap_release_folios(struct folio *folio, int nr_folios)
>  {
>  }
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index c46e700f5245..368ff41c560b 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -469,8 +469,10 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
>  void free_zone_device_page(struct page *page)
>  {
> -	if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
> -		return;
> +	struct dev_pagemap *pgmap = page->pgmap;
> +
> +	/* wake filesystem 'break dax layouts' waiters */
> +	wake_up_var(page);
>
>  	mem_cgroup_uncharge(page_folio(page));
>
> @@ -505,17 +507,9 @@ void free_zone_device_page(struct page *page)
>  	 * to clear page->mapping.
>  	 */
>  	page->mapping = NULL;
> -	page->pgmap->ops->page_free(page);
> -
> -	if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
> -	    page->pgmap->type != MEMORY_DEVICE_COHERENT)
> -		/*
> -		 * Reset the page count to 1 to prepare for handing out the page
> -		 * again.
> -		 */
> -		set_page_count(page, 1);
> -	else
> -		put_dev_pagemap(page->pgmap);
> +	if (pgmap->ops && pgmap->ops->page_free)
> +		pgmap->ops->page_free(page);
> +	put_dev_pagemap(page->pgmap);
>  }
>
>  static bool folio_span_valid(struct dev_pagemap *pgmap, struct folio *folio,
> @@ -576,17 +570,19 @@ bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
>  }
>  EXPORT_SYMBOL_GPL(pgmap_request_folios);
>
> -void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
> +/*
> + * A symmetric helper to undo the page references acquired by
> + * pgmap_request_folios(), but the caller can also just arrange
> + * folio_put() on all the folios it acquired previously for the same
> + * effect.
> + */
> +void pgmap_release_folios(struct folio *folio, int nr_folios)
>  {
>  	struct folio *iter;
>  	int i;
>
> -	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> -		if (!put_devmap_managed_page(&iter->page))
> -			folio_put(iter);
> -		if (!folio_ref_count(iter))
> -			put_dev_pagemap(pgmap);
> -	}
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++)
> +		folio_put(iter);
>  }
>
>  #ifdef CONFIG_FS_DAX
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8e9b7f08a32c..e35d1eb3308d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6787,6 +6787,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  {
>
>  	__init_single_page(page, pfn, zone_idx, nid);
> +	set_page_count(page, 0);
>
>  	/*
>  	 * Mark page reserved as it will need to wait for onlining
> @@ -6819,14 +6820,6 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>  		cond_resched();
>  	}
> -
> -	/*
> -	 * ZONE_DEVICE pages are released directly to the driver page allocator
> -	 * which will set the page count to 1 when allocating the page.
> -	 */
> -	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
> -	    pgmap->type == MEMORY_DEVICE_COHERENT)
> -		set_page_count(page, 0);
>  }
>
>  /*

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 24/25] mm/meremap_pages: Delete put_devmap_managed_page_refs()
  2022-10-14 23:59 ` [PATCH v3 24/25] mm/memremap_pages: Delete put_devmap_managed_page_refs() Dan Williams
@ 2022-10-17  7:08   ` Alistair Popple
  0 siblings, 0 replies; 42+ messages in thread
From: Alistair Popple @ 2022-10-17  7:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, Jason Gunthorpe, david, nvdimm,
	akpm, linux-fsdevel


I didn't spot any issues with this nice cleanup, so:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Dan Williams <dan.j.williams@intel.com> writes:

> Now that fsdax DMA-idle detection no longer depends on catching
> transitions of page->_refcount to 1, and all users of pgmap pages get
> access to them via pgmap_request_folios(), remove
> put_devmap_managed_page_refs() and associated infrastructure. This
> includes the pgmap references taken at the beginning of time for each
> page because those @pgmap references are now arbitrated via
> pgmap_request_folios().
>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/mm.h |   30 ------------------------------
>  mm/gup.c           |    6 ++----
>  mm/memremap.c      |   38 --------------------------------------
>  mm/swap.c          |    2 --
>  4 files changed, 2 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bbcccbc5565..c63dfc804f1e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1082,30 +1082,6 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>   *   back into memory.
>   */
>
> -#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
> -DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -
> -bool __put_devmap_managed_page_refs(struct page *page, int refs);
> -static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
> -{
> -	if (!static_branch_unlikely(&devmap_managed_key))
> -		return false;
> -	if (!is_zone_device_page(page))
> -		return false;
> -	return __put_devmap_managed_page_refs(page, refs);
> -}
> -#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> -static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
> -{
> -	return false;
> -}
> -#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> -
> -static inline bool put_devmap_managed_page(struct page *page)
> -{
> -	return put_devmap_managed_page_refs(page, 1);
> -}
> -
>  /* 127: arbitrary random number, small enough to assemble well */
>  #define folio_ref_zero_or_close_to_overflow(folio) \
>  	((unsigned int) folio_ref_count(folio) + 127u <= 127u)
> @@ -1202,12 +1178,6 @@ static inline void put_page(struct page *page)
>  {
>  	struct folio *folio = page_folio(page);
>
> -	/*
> -	 * For some devmap managed pages we need to catch refcount transition
> -	 * from 2 to 1:
> -	 */
> -	if (put_devmap_managed_page(&folio->page))
> -		return;
>  	folio_put(folio);
>  }
>
> diff --git a/mm/gup.c b/mm/gup.c
> index ce00a4c40da8..e49b1f46faa5 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -87,8 +87,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>  	 * belongs to this folio.
>  	 */
>  	if (unlikely(page_folio(page) != folio)) {
> -		if (!put_devmap_managed_page_refs(&folio->page, refs))
> -			folio_put_refs(folio, refs);
> +		folio_put_refs(folio, refs);
>  		goto retry;
>  	}
>
> @@ -184,8 +183,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>  			refs *= GUP_PIN_COUNTING_BIAS;
>  	}
>
> -	if (!put_devmap_managed_page_refs(&folio->page, refs))
> -		folio_put_refs(folio, refs);
> +	folio_put_refs(folio, refs);
>  }
>
>  /**
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 368ff41c560b..53fe30bb79bb 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -94,19 +94,6 @@ bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
>  	return false;
>  }
>
> -static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
> -{
> -	const struct range *range = &pgmap->ranges[range_id];
> -
> -	return (range->start + range_len(range)) >> PAGE_SHIFT;
> -}
> -
> -static unsigned long pfn_len(struct dev_pagemap *pgmap, unsigned long range_id)
> -{
> -	return (pfn_end(pgmap, range_id) -
> -		pfn_first(pgmap, range_id)) >> pgmap->vmemmap_shift;
> -}
> -
>  static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
>  {
>  	struct range *range = &pgmap->ranges[range_id];
> @@ -138,10 +125,6 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>  	int i;
>
>  	percpu_ref_kill(&pgmap->ref);
> -	if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
> -	    pgmap->type != MEMORY_DEVICE_COHERENT)
> -		for (i = 0; i < pgmap->nr_range; i++)
> -			percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i));
>
>  	wait_for_completion(&pgmap->done);
>
> @@ -267,9 +250,6 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>  	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>  				PHYS_PFN(range->start),
>  				PHYS_PFN(range_len(range)), pgmap);
> -	if (pgmap->type != MEMORY_DEVICE_PRIVATE &&
> -	    pgmap->type != MEMORY_DEVICE_COHERENT)
> -		percpu_ref_get_many(&pgmap->ref, pfn_len(pgmap, range_id));
>  	return 0;
>
>  err_add_memory:
> @@ -584,21 +564,3 @@ void pgmap_release_folios(struct folio *folio, int nr_folios)
>  	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++)
>  		folio_put(iter);
>  }
> -
> -#ifdef CONFIG_FS_DAX
> -bool __put_devmap_managed_page_refs(struct page *page, int refs)
> -{
> -	if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
> -		return false;
> -
> -	/*
> -	 * fsdax page refcounts are 1-based, rather than 0-based: if
> -	 * refcount is 1, then the page is free and the refcount is
> -	 * stable because nobody holds a reference on the page.
> -	 */
> -	if (page_ref_sub_return(page, refs) == 1)
> -		wake_up_var(page);
> -	return true;
> -}
> -EXPORT_SYMBOL(__put_devmap_managed_page_refs);
> -#endif /* CONFIG_FS_DAX */
> diff --git a/mm/swap.c b/mm/swap.c
> index 955930f41d20..0742b84fbf17 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1003,8 +1003,6 @@ void release_pages(struct page **pages, int nr)
>  				unlock_page_lruvec_irqrestore(lruvec, flags);
>  				lruvec = NULL;
>  			}
> -			if (put_devmap_managed_page(&folio->page))
> -				continue;
>  			if (folio_put_testzero(folio))
>  				free_zone_device_page(&folio->page);
>  			continue;


* Re: [PATCH v3 22/25] mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios()
  2022-10-14 23:59 ` [PATCH v3 22/25] mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios() Dan Williams
@ 2022-10-17 19:17   ` Lyude Paul
  0 siblings, 0 replies; 42+ messages in thread
From: Lyude Paul @ 2022-10-17 19:17 UTC (permalink / raw)
  To: Dan Williams, linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Felix Kuehling,
	Alex Deucher, Christian König, Pan, Xinhui, David Airlie,
	Daniel Vetter, Ben Skeggs, Karol Herbst, Jérôme Glisse,
	david, nvdimm, akpm, linux-fsdevel

For the nouveau bits:

Reviewed-by: Lyude Paul <lyude@redhat.com>

On Fri, 2022-10-14 at 16:59 -0700, Dan Williams wrote:
> Switch to the common method, shared across all MEMORY_DEVICE_* types,
> for requesting access to a ZONE_DEVICE page. The
> MEMORY_DEVICE_{PRIVATE,COHERENT} specific expectation that newly
> requested pages are locked is moved to the callers.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c       |    3 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |    3 ++-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |    3 ++-
>  include/linux/memremap.h                 |    1 -
>  lib/test_hmm.c                           |    3 ++-
>  mm/memremap.c                            |   13 +------------
>  6 files changed, 9 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index e2f11f9c3f2a..884ec112ad43 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -718,7 +718,8 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>  
>  	dpage = pfn_to_page(uvmem_pfn);
>  	dpage->zone_device_data = pvt;
> -	zone_device_page_init(dpage);
> +	pgmap_request_folios(dpage->pgmap, page_folio(dpage), 1);
> +	lock_page(dpage);
>  	return dpage;
>  out_clear:
>  	spin_lock(&kvmppc_uvmem_bitmap_lock);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index 97a684568ae0..8cf97060122b 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -223,7 +223,8 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
>  	page = pfn_to_page(pfn);
>  	svm_range_bo_ref(prange->svm_bo);
>  	page->zone_device_data = prange->svm_bo;
> -	zone_device_page_init(page);
> +	pgmap_request_folios(page->pgmap, page_folio(page), 1);
> +	lock_page(page);
>  }
>  
>  static void
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 5fe209107246..1482533c7ca0 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -324,7 +324,8 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
>  			return NULL;
>  	}
>  
> -	zone_device_page_init(page);
> +	pgmap_request_folios(page->pgmap, page_folio(page), 1);
> +	lock_page(page);
>  	return page;
>  }
>  
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 98196b8d3172..3fb3809d71f3 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -187,7 +187,6 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void memunmap_pages(struct dev_pagemap *pgmap);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 67e6f83fe0f8..e4f7219ae3bb 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -632,7 +632,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
>  			goto error;
>  	}
>  
> -	zone_device_page_init(dpage);
> +	pgmap_request_folios(dpage->pgmap, page_folio(dpage), 1);
> +	lock_page(dpage);
>  	dpage->zone_device_data = rpage;
>  	return dpage;
>  
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 87a649ecdc54..c46e700f5245 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -518,18 +518,6 @@ void free_zone_device_page(struct page *page)
>  		put_dev_pagemap(page->pgmap);
>  }
>  
> -void zone_device_page_init(struct page *page)
> -{
> -	/*
> -	 * Drivers shouldn't be allocating pages after calling
> -	 * memunmap_pages().
> -	 */
> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
> -	set_page_count(page, 1);
> -	lock_page(page);
> -}
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> -
>  static bool folio_span_valid(struct dev_pagemap *pgmap, struct folio *folio,
>  			     int nr_folios)
>  {
> @@ -586,6 +574,7 @@ bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
>  
>  	return true;
>  }
> +EXPORT_SYMBOL_GPL(pgmap_request_folios);
>  
>  void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
>  {
> 

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat



* Re: [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion
  2022-10-14 23:57 ` [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion Dan Williams
@ 2022-10-17 19:31   ` Jason Gunthorpe
  2022-10-17 20:17     ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2022-10-17 19:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

On Fri, Oct 14, 2022 at 04:57:37PM -0700, Dan Williams wrote:
> In preparation for dax_insert_entry() to start taking page and pgmap
> references ensure that page->pgmap is valid by holding the
> dax_read_lock() over both dax_direct_access() and dax_insert_entry().
> 
> I.e. the code that wants to elevate the reference count of a pgmap page
> from 0 -> 1 must ensure that the pgmap is not exiting and will not start
> exiting until the proper references have been taken.

I'm surprised we can have a vmfault while the pgmap is exiting?

Shouldn't the FS have torn down all the inodes before it starts
killing the pgmap?

And since tearing down all the inodes now ensures all the pages have 0
reference, why do we need to do anything with the pgmap?

Jason


* Re: [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-14 23:57 ` [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios() Dan Williams
  2022-10-17  6:31   ` Alistair Popple
@ 2022-10-17 19:41   ` Jason Gunthorpe
  1 sibling, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2022-10-17 19:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, Alistair Popple, david, nvdimm,
	akpm, linux-fsdevel

On Fri, Oct 14, 2022 at 04:57:55PM -0700, Dan Williams wrote:

> +/**
> + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> + * @pgmap: host page map for the folio array
> + * @folio: start of the folio list, all subsequent folios have same folio_size()
> + *
> + * Caller is responsible for @pgmap remaining live for the duration of
> + * this call. Caller is also responsible for not racing requests for the
> + * same folios.
> + */
> +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> +			  int nr_folios)
> +{
> +	struct folio *iter;
> +	int i;
> +
> +	/*
> +	 * All of the WARNs below are for catching bugs in future
> +	 * development that changes the assumptions of:
> +	 * 1/ uniform folios in @pgmap
> +	 * 2/ @pgmap death does not race this routine.
> +	 */
> +	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> +
> +	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> +		return false;
> +
> +	for (iter = folio_next(folio), i = 1; i < nr_folios;
> +	     iter = folio_next(iter), i++)
> +		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> +			return false;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		folio_ref_inc(iter);
> +		if (folio_ref_count(iter) == 1)

This seems strange; I would say that if the folio doesn't have a 0
refcount at this point it is a caller bug. The caller should only be
able to "request" folios once.

I.e. the caller should not attempt to request a folio until it has been
released back via the free page path.

From an FS viewpoint this makes sense: if we establish a new inode
then the pages under that inode had better be currently exclusive to
the inode, or we have some rogue entity touching them.

As is, this looks racy: if transient refs are now allowed then there
could be a 2->1 transition immediately after the ref_inc.

Having a release operation seems unworkable for a refcounted structure:

> +void pgmap_release_folios(struct dev_pagemap *pgmap, struct folio *folio, int nr_folios)
> +{
> +	struct folio *iter;
> +	int i;
> +
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> +		if (!put_devmap_managed_page(&iter->page))
> +			folio_put(iter);
> +		if (!folio_ref_count(iter))
> +			put_dev_pagemap(pgmap);

So if we don't have a 0 refcount here (e.g. due to a transient external
page ref) we have now lost the put_dev_pagemap().

Jason
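
For concreteness, the stricter acquisition rule described above could
look roughly like this (a sketch only, reusing the series' helper name,
with the folio_order() uniformity checks omitted):

	bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
				  int nr_folios)
	{
		struct folio *iter;
		int i;

		if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
			return false;

		/* a non-zero refcount at request time would be a caller bug */
		for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++)
			if (WARN_ON_ONCE(folio_ref_count(iter) != 0))
				return false;

		for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++)
			folio_ref_inc(iter);

		/* one pgmap reference per activated folio */
		percpu_ref_get_many(&pgmap->ref, nr_folios);
		return true;
	}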


* Re: [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0
  2022-10-14 23:59 ` [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0 Dan Williams
  2022-10-17  7:04   ` Alistair Popple
@ 2022-10-17 19:48   ` Jason Gunthorpe
  1 sibling, 0 replies; 42+ messages in thread
From: Jason Gunthorpe @ 2022-10-17 19:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, Alistair Popple, david, nvdimm,
	akpm, linux-fsdevel

On Fri, Oct 14, 2022 at 04:59:12PM -0700, Dan Williams wrote:
> diff --git a/mm/memremap.c b/mm/memremap.c
> index c46e700f5245..368ff41c560b 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -469,8 +469,10 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  void free_zone_device_page(struct page *page)
>  {
> -	if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
> -		return;
> +	struct dev_pagemap *pgmap = page->pgmap;
> +
> +	/* wake filesystem 'break dax layouts' waiters */
> +	wake_up_var(page);

Shouldn't this be in the DAX page_free() op?

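Something like the following, i.e. a hypothetical fsdax-owned
page_free() op that carries the wakeup instead of open coding it in the
common path (sketch only, not code from this series):

	static void fsdax_page_free(struct page *page)
	{
		/* wake filesystem 'break dax layouts' waiters */
		wake_up_var(page);
	}

	static const struct dev_pagemap_ops fsdax_pagemap_ops = {
		.page_free	= fsdax_page_free,
	};
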
> +/*
> + * A symmetric helper to undo the page references acquired by
> + * pgmap_request_folios(), but the caller can also just arrange
> + * folio_put() on all the folios it acquired previously for the same
> + * effect.
> + */
> +void pgmap_release_folios(struct folio *folio, int nr_folios)
>  {
>  	struct folio *iter;
>  	int i;
>  
> -	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++) {
> -		if (!put_devmap_managed_page(&iter->page))
> -			folio_put(iter);
> -		if (!folio_ref_count(iter))
> -			put_dev_pagemap(pgmap);
> -	}
> +	for (iter = folio, i = 0; i < nr_folios; iter = folio_next(iter), i++)
> +		folio_put(iter);
>  }

Oh, so now this half makes more sense as an API, but it seems like it
is not named right; if a folio multi-put is useful shouldn't it be a
folio_put_many() or something?

Jason
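
For reference, the rename being suggested might look like this (a
sketch; folio_put_many() does not currently exist in mainline):

	/* put one reference on each of @nr_folios contiguous folios */
	static inline void folio_put_many(struct folio *folio, int nr_folios)
	{
		struct folio *iter = folio;
		int i;

		for (i = 0; i < nr_folios; i++, iter = folio_next(iter))
			folio_put(iter);
	}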


* Re: [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-17  6:31   ` Alistair Popple
@ 2022-10-17 20:06     ` Dan Williams
  2022-10-17 20:11       ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-10-17 20:06 UTC (permalink / raw)
  To: Alistair Popple, Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, Jason Gunthorpe, david, nvdimm,
	akpm, linux-fsdevel

Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> [...]
> 
> > +/**
> > + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> > + * @pgmap: host page map for the folio array
> > + * @folio: start of the folio list, all subsequent folios have same folio_size()
> > + *
> > + * Caller is responsible for @pgmap remaining live for the duration of
> > + * this call. Caller is also responsible for not racing requests for the
> > + * same folios.
> > + */
> > +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > +			  int nr_folios)
> 
> All callers of pgmap_request_folios() I could see call this with
> nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
> redundant arguments and simplify it to
> pgmap_request_folios(struct folio *folio)?

The rationale for the @pgmap argument being separate is to make clear
that the caller must assert that pgmap is alive before passing it to
pgmap_request_folios(), and that @pgmap is constant for the span of
folios requested.

However, I see your point that these arguments are redundant so I would
not be opposed to a helper like:

pgmap_request_folio(struct folio *folio)

...that documents that it pulls the pgmap from the folio and because of
that does not support requesting more than one folio.
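
Roughly (a sketch, assuming the folio_pgmap() accessor referenced
above):

	/* single-folio convenience wrapper; the pgmap liveness rules still apply */
	static inline bool pgmap_request_folio(struct folio *folio)
	{
		return pgmap_request_folios(folio_pgmap(folio), folio, 1);
	}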


* Re: [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-17 20:06     ` Dan Williams
@ 2022-10-17 20:11       ` Jason Gunthorpe
  2022-10-17 20:51         ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2022-10-17 20:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alistair Popple, linux-mm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, david, nvdimm,
	akpm, linux-fsdevel

On Mon, Oct 17, 2022 at 01:06:25PM -0700, Dan Williams wrote:
> Alistair Popple wrote:
> > 
> > Dan Williams <dan.j.williams@intel.com> writes:
> > 
> > [...]
> > 
> > > +/**
> > > + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> > > + * @pgmap: host page map for the folio array
> > > + * @folio: start of the folio list, all subsequent folios have same folio_size()
> > > + *
> > > + * Caller is responsible for @pgmap remaining live for the duration of
> > > + * this call. Caller is also responsible for not racing requests for the
> > > + * same folios.
> > > + */
> > > +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > > +			  int nr_folios)
> > 
> > All callers of pgmap_request_folios() I could see call this with
> > nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
> > redundant arguments and simplify it to
> > pgmap_request_folios(struct folio *folio)?
> 
> The rationale for the @pgmap argument being separate is to make clear
> that the caller must assert that pgmap is alive before passing it to
> pgmap_request_folios(), and that @pgmap is constant for the span of
> folios requested.

The API is kind of weird to take in a folio - it should take in the
offset in bytes from the start of the pgmap and "create" usable
non-free folios.

A good design is that nothing has a struct folio * unless it can also
prove it has a non-zero refcount on it, and that is simply impossible
for any caller at this point.

Jason


* Re: [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion
  2022-10-17 19:31   ` Jason Gunthorpe
@ 2022-10-17 20:17     ` Dan Williams
  2022-10-18  5:26       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-10-17 20:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: linux-mm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

Jason Gunthorpe wrote:
> On Fri, Oct 14, 2022 at 04:57:37PM -0700, Dan Williams wrote:
> > In preparation for dax_insert_entry() to start taking page and pgmap
> > references ensure that page->pgmap is valid by holding the
> > dax_read_lock() over both dax_direct_access() and dax_insert_entry().
> > 
> > I.e. the code that wants to elevate the reference count of a pgmap page
> > from 0 -> 1 must ensure that the pgmap is not exiting and will not start
> > exiting until the proper references have been taken.
> 
> I'm surprised we can have a vmfault while the pgmap is exiting?
> 
> Shouldn't the FS have torn down all the inodes before it starts
> killing the pgmap?

Historically, no. The block-device is allowed to disappear while inodes
are still live. For example, the filesystem's calls to blk_queue_enter()
will start failing, but otherwise the filesystem tries to hobble along
after the device-driver has finished ->remove(). In the typical
page-cache case this makes sense since there is still some residual
usability of cached data even after the backing device is gone.

Recently Ruan plumbed support for failure-notification callbacks into
the filesystem, or at least XFS. With that in place the driver can
theoretically notify failures like "device gone" and the FS can take
actions like tearing down inodes. However, that is FS-specific enabling /
behaviour, not something the pgmap code can rely upon. At least, not
without some layering violations.


* Re: [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-17 20:11       ` Jason Gunthorpe
@ 2022-10-17 20:51         ` Dan Williams
  2022-10-17 23:57           ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2022-10-17 20:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Alistair Popple, linux-mm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, david, nvdimm,
	akpm, linux-fsdevel

Jason Gunthorpe wrote:
> On Mon, Oct 17, 2022 at 01:06:25PM -0700, Dan Williams wrote:
> > Alistair Popple wrote:
> > > 
> > > Dan Williams <dan.j.williams@intel.com> writes:
> > > 
> > > [...]
> > > 
> > > > +/**
> > > > + * pgmap_request_folios - activate a contiguous span of folios in @pgmap
> > > > + * @pgmap: host page map for the folio array
> > > > + * @folio: start of the folio list, all subsequent folios have same folio_size()
> > > > + *
> > > > + * Caller is responsible for @pgmap remaining live for the duration of
> > > > + * this call. Caller is also responsible for not racing requests for the
> > > > + * same folios.
> > > > + */
> > > > +bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > > > +			  int nr_folios)
> > > 
> > > All callers of pgmap_request_folios() I could see call this with
> > > nr_folios == 1 and pgmap == folio_pgmap(folio). Could we remove the
> > > redundant arguments and simplify it to
> > > pgmap_request_folios(struct folio *folio)?
> > 
> > The rationale for the @pgmap argument being separate is to make clear
> > that the caller must assert that pgmap is alive before passing it to
> > pgmap_request_folios(), and that @pgmap is constant for the span of
> > folios requested.
> 
> The api is kind of weird to take in a folio - it should take in the
> offset in bytes from the start of the pgmap and "create" usable
> non-free folios.

Plumbing that is non-trivial. Recall that in the DAX case the filesystem
can be sitting on top of a stack of device-mapper-dax devices routing to
more than one pgmap, and the pgmap can be made up of discontiguous
ranges. The filesystem knows the offset into the dax-device and then
asks that device to translate that to a pfn. From that pfn the optional (see
CONFIG_FS_DAX_LIMITED) pgmap can be retrieved. Once DAX knows what pfn
it wants, while holding a lock that prevents the pgmap from entering its
dying state, it can request to pin that pfn.

I arrived at the interface being folio based because the order of the
pin(s) also needs to be conveyed and the order should be consistent.

> A good design is that nothing has a struct folio * unless it can also
> prove it has a non-zero refcount on it, and that is simply impossible
> for any caller at this point.

I agree that's a good design, and I think it is a bug to make this
request on anything but a zero-refcount folio, but plumbing pgmap
offsets to replace pfn_to_page() would require dax_direct_access() to
shift from a pfn lookup to a pgmap + offset lookup.

CONFIG_FS_DAX_LIMITED is something that needs to be deleted, but for now
it supports DAX without a pgmap, where "limited" is just meant for XIP
(eXecute-In-Place), no GUP support. Once that goes then
dax_direct_access() could evolve into an interface that assumes a pgmap
is present. Until then I think pgmap_request_folios() at least puts out
the immediate fire of making people contend with oddly refcounted DAX
pages.


* Re: [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-17 20:51         ` Dan Williams
@ 2022-10-17 23:57           ` Jason Gunthorpe
  2022-10-18  0:19             ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2022-10-17 23:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alistair Popple, linux-mm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, david, nvdimm,
	akpm, linux-fsdevel

On Mon, Oct 17, 2022 at 01:51:18PM -0700, Dan Williams wrote:

> > A good design is that nothing has a struct folio * unless it can also
> > prove it has a non-zero refcount on it, and that is simply impossible
> > for any caller at this point.
> 
> I agree that's a good design, and I think it is a bug to make this
> request on anything but a zero-refcount folio, but plumbing pgmap
> offsets to replace pfn_to_page() would require dax_direct_access() to
> shift from a pfn lookup to a pgmap + offset lookup.

That is fair for this series, but maybe the other pgmap users shouldn't be
converted to use this style of API?

Jason


* Re: [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios()
  2022-10-17 23:57           ` Jason Gunthorpe
@ 2022-10-18  0:19             ` Dan Williams
  0 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-18  0:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Alistair Popple, linux-mm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, david, nvdimm,
	akpm, linux-fsdevel

Jason Gunthorpe wrote:
> On Mon, Oct 17, 2022 at 01:51:18PM -0700, Dan Williams wrote:
> 
> > > A good design is that nothing has a struct folio * unless it can also
> > > prove it has a non-zero refcount on it, and that is simply impossible
> > > for any caller at this point.
> > 
> > I agree that's a good design, and I think it is a bug to make this
> > request on anything but a zero-refcount folio, but plumbing pgmap
> > offsets to replace pfn_to_page() would require dax_direct_access() to
> > shift from a pfn lookup to a pgmap + offset lookup.
> 
> That is fair for this series, but maybe the other pgmap users shouldn't be
> converted to use this style of API?

That is also fair...

I think my main stumbling block is getting back to the pgmap from the
DAX code. However, I could move that to a dirty-hack that looks it up
via dax_direct_access()+pfn_to_page(), but keep the core request API
based on pgmap-offset.

I.e. it needs some access to a 0-referenced page, but it moves that wart
out-of-line of the main flow. Some of the non-DAX users might need this
hack as well unless I can figure out how to route back to the pgmap
without interrogating the pfn.

I'll try it as an incremental change on top of the current series to
make things easier for Andrew.
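
The pgmap-offset flavor of the request side could end up with a shape
like this (hypothetical signature, not something posted in this series):

	/*
	 * Request @nr_folios folios of order @order starting @offset bytes into
	 * @pgmap; returns the first folio with an elevated refcount on success,
	 * so callers never see a struct folio they do not hold a reference on.
	 */
	struct folio *pgmap_request_folios_at(struct dev_pagemap *pgmap,
					      unsigned long offset,
					      unsigned int order, int nr_folios);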


* Re: [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion
  2022-10-17 20:17     ` Dan Williams
@ 2022-10-18  5:26       ` Christoph Hellwig
  2022-10-18 17:30         ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2022-10-18  5:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, linux-mm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, david, nvdimm,
	akpm, linux-fsdevel

On Mon, Oct 17, 2022 at 01:17:23PM -0700, Dan Williams wrote:
> Historically, no. The block-device is allowed to disappear while inodes
> are still live.

Btw, while I agree with what you wrote below, this sentence is at least
a bit confusing.  Struct block_device/gendisk/request_queue will always
be valid as long as a file system is mounted and inodes are live, due
to refcounting.  It's just that, as you correctly pointed out,
del_gendisk might have already been called and they are dead.


* Re: [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion
  2022-10-18  5:26       ` Christoph Hellwig
@ 2022-10-18 17:30         ` Dan Williams
  0 siblings, 0 replies; 42+ messages in thread
From: Dan Williams @ 2022-10-18 17:30 UTC (permalink / raw)
  To: Christoph Hellwig, Dan Williams
  Cc: Jason Gunthorpe, linux-mm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, david, nvdimm,
	akpm, linux-fsdevel

Christoph Hellwig wrote:
> On Mon, Oct 17, 2022 at 01:17:23PM -0700, Dan Williams wrote:
> > Historically, no. The block-device is allowed to disappear while inodes
> > are still live.
> 
> Btw, while I agree with what you wrote below this sentence is at least
> a bit confusing.  Struct block_device/gendisk/request_queue will always
> be valid as long as a file system is mounted and inodes are live due
> to refcounting.  It's just as you correctly pointed out del_gendisk
> might have aready been called and they are dead.

Yes, when I said "allowed to disappear" I should have said "allowed to
die".


* Re: [PATCH v3 04/25] fsdax: Introduce dax_zap_mappings()
  2022-10-14 23:57 ` [PATCH v3 04/25] fsdax: Introduce dax_zap_mappings() Dan Williams
@ 2022-11-02 13:04   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2022-11-02 13:04 UTC (permalink / raw)
  To: Dan Williams, linux-mm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, david, nvdimm, akpm,
	linux-fsdevel

Dan Williams <dan.j.williams@intel.com> writes:

> Typical pages take a reference at pte insertion and drop it at
> zap_pte_range() time. That reference management is missing for DAX
> leading to a situation where DAX pages are mapped in user page tables,
> but are not referenced.
>
> Once fsdax decides it wants to unmap the page it can drop its reference,
> but unlike typical pages it needs to maintain the association of the
> page to the inode that arbitrated the access in the first instance. It
> maintains that association until explicit truncate(), or the implicit
> truncate() that occurs at inode death, truncate_inode_pages_final().
>
> The zapped state tracks whether the fsdax has dropped its interest in a
> page, but still allows the associated i_pages entry to live until
> truncate. This facilitates inode lookup while awaiting any page pin
> users to drop their pins. For example, if memory_failure() is triggered
> on the page after it has been unmapped, but before it has been truncated
> from the inode, memory_failure() can still associate the event with the
> inode.
>
> Once truncate begins fsdax unmaps the page to prevent any new references
> from being taken without calling back into fsdax core to reestablish
> the mapping.

The gup path should now check dax_is_zap()?  Where do we
prevent new references from being taken?

-aneesh


Thread overview: 42+ messages
2022-10-14 23:56 [PATCH v3 00/25] Fix the DAX-gup mistake Dan Williams
2022-10-14 23:57 ` [PATCH v3 01/25] fsdax: Wait on @page not @page->_refcount Dan Williams
2022-10-14 23:57 ` [PATCH v3 02/25] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
2022-10-14 23:57 ` [PATCH v3 03/25] fsdax: Include unmapped inodes for page-idle detection Dan Williams
2022-10-14 23:57 ` [PATCH v3 04/25] fsdax: Introduce dax_zap_mappings() Dan Williams
2022-11-02 13:04   ` Aneesh Kumar K.V
2022-10-14 23:57 ` [PATCH v3 05/25] fsdax: Wait for pinned pages during truncate_inode_pages_final() Dan Williams
2022-10-14 23:57 ` [PATCH v3 06/25] fsdax: Validate DAX layouts broken before truncate Dan Williams
2022-10-14 23:57 ` [PATCH v3 07/25] fsdax: Hold dax lock over mapping insertion Dan Williams
2022-10-17 19:31   ` Jason Gunthorpe
2022-10-17 20:17     ` Dan Williams
2022-10-18  5:26       ` Christoph Hellwig
2022-10-18 17:30         ` Dan Williams
2022-10-14 23:57 ` [PATCH v3 08/25] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
2022-10-14 23:57 ` [PATCH v3 09/25] fsdax: Rework for_each_mapped_pfn() to dax_for_each_folio() Dan Williams
2022-10-14 23:57 ` [PATCH v3 10/25] fsdax: Introduce pgmap_request_folios() Dan Williams
2022-10-17  6:31   ` Alistair Popple
2022-10-17 20:06     ` Dan Williams
2022-10-17 20:11       ` Jason Gunthorpe
2022-10-17 20:51         ` Dan Williams
2022-10-17 23:57           ` Jason Gunthorpe
2022-10-18  0:19             ` Dan Williams
2022-10-17 19:41   ` Jason Gunthorpe
2022-10-14 23:58 ` [PATCH v3 11/25] fsdax: Rework dax_insert_entry() calling convention Dan Williams
2022-10-14 23:58 ` [PATCH v3 12/25] fsdax: Cleanup dax_associate_entry() Dan Williams
2022-10-14 23:58 ` [PATCH v3 13/25] devdax: Minor warning fixups Dan Williams
2022-10-14 23:58 ` [PATCH v3 14/25] devdax: Fix sparse lock imbalance warning Dan Williams
2022-10-14 23:58 ` [PATCH v3 15/25] libnvdimm/pmem: Support pmem block devices without dax Dan Williams
2022-10-14 23:58 ` [PATCH v3 16/25] devdax: Move address_space helpers to the DAX core Dan Williams
2022-10-14 23:58 ` [PATCH v3 17/25] devdax: Sparse fixes for xarray locking Dan Williams
2022-10-14 23:58 ` [PATCH v3 18/25] devdax: Sparse fixes for vmfault_t / dax-entry conversions Dan Williams
2022-10-14 23:58 ` [PATCH v3 19/25] devdax: Sparse fixes for vm_fault_t in tracepoints Dan Williams
2022-10-14 23:58 ` [PATCH v3 20/25] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
2022-10-14 23:59 ` [PATCH v3 21/25] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
2022-10-14 23:59 ` [PATCH v3 22/25] mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios() Dan Williams
2022-10-17 19:17   ` Lyude Paul
2022-10-14 23:59 ` [PATCH v3 23/25] mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0 Dan Williams
2022-10-17  7:04   ` Alistair Popple
2022-10-17 19:48   ` Jason Gunthorpe
2022-10-14 23:59 ` [PATCH v3 24/25] mm/meremap_pages: Delete put_devmap_managed_page_refs() Dan Williams
2022-10-17  7:08   ` Alistair Popple
2022-10-14 23:59 ` [PATCH v3 25/25] mm/gup: Drop DAX pgmap accounting Dan Williams
