From: Dan Williams <dan.j.williams@intel.com>
To: linux-mm@kvack.org
Cc: "Alex Deucher" <alexander.deucher@amd.com>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	"Alistair Popple" <apopple@nvidia.com>, "Jan Kara" <jack@suse.cz>,
	"Lyude Paul" <lyude@redhat.com>,
	"Karol Herbst" <kherbst@redhat.com>,
	"Christoph Hellwig" <hch@lst.de>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"David Airlie" <airlied@linux.ie>,
	"Darrick J. Wong" <djwong@kernel.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Dave Chinner" <david@fromorbit.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"kernel test robot" <lkp@intel.com>,
	"Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"kernel test robot" <lkp@intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Felix Kuehling" <Felix.Kuehling@amd.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	nvdimm@lists.linux.dev, akpm@linux-foundation.org,
	linux-fsdevel@vger.kernel.org
Subject: [PATCH v3 00/25] Fix the DAX-gup mistake
Date: Fri, 14 Oct 2022 16:56:56 -0700
Message-ID: <166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com>

Changes since v2 [1]:
- All of the bandaids in the VFS and filesystems have been dropped. Dave
  made the observation that at iput_final() time the inode is idle so
  a common dax_break_layouts() in the fsdax core is suitable to replace
  new {ext4,xfs}_break_layouts() calls in ->drop_inode(). Additionally,
  the wait for page pins can be handled in the existing
  dax_delete_mapping_entry() called by truncate_inode_pages_final()
  (Dave).

- With the code movement the kbuild robot noticed some long-standing
  sparse issues; fixes for those and other minor fixups are included.

- Updated the cover letter to try to make the story of how we got here
  easier to follow.

- Rebased on mm-stable, see base-commit. This includes Alistair's recent
  refcount fixups for MEMORY_DEVICE_PRIVATE, and I reworked
  zone_device_page_init() into a new pgmap_request_folios() interface.

[1]: http://lore.kernel.org/r/166329930818.2786261.6086109734008025807.stgit@dwillia2-xfh.jf.intel.com

My expectation is that this goes through -mm since it depends on changes
in mm-stable.

---

ZONE_DEVICE was created to allow for get_user_pages() on DAX mappings.
It has since grown other users, but the main motivation was gup and
synchronizing device shutdown events with pinned pages. Memory
device shutdown triggers driver ->remove(), and ->remove() always
succeeds, so ZONE_DEVICE needed a mechanism to stop new page references
from being taken, and await existing references to drain before allowing
device removal to proceed. This is the origin of 'struct dev_pagemap'
and its percpu_ref.
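
As a rough illustration of that "stop new references, then drain the
existing ones" pattern, here is a minimal single-threaded user-space
model. The model_ref_* names are hypothetical and only loosely mirror
the idea behind a live-reference check and a kill operation; this is
not the kernel's percpu_ref implementation, which is per-cpu and
concurrent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model, not the kernel's percpu_ref API. */
struct model_ref {
	long count;	/* outstanding references, including the base one */
	bool dying;	/* set once shutdown has begun */
};

void model_ref_init(struct model_ref *ref)
{
	ref->count = 1;	/* base reference held by the owner */
	ref->dying = false;
}

/* Take a reference unless shutdown has already started. */
bool model_ref_tryget_live(struct model_ref *ref)
{
	if (ref->dying)
		return false;
	ref->count++;
	return true;
}

/* Drop a reference; returns true when the last one is gone. */
bool model_ref_put(struct model_ref *ref)
{
	assert(ref->count > 0);
	return --ref->count == 0;
}

/* Begin shutdown: refuse new references and drop the base one. */
bool model_ref_kill(struct model_ref *ref)
{
	ref->dying = true;
	return model_ref_put(ref);
}
```

In this model a ->remove()-style caller would invoke model_ref_kill()
and then wait until a later model_ref_put() returns true before tearing
down the device. Any model_ref_tryget_live() after the kill fails, which
is what keeps new pins from appearing during shutdown.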

The original ZONE_DEVICE implementation started from the observation
that 'struct page' initialization, for typical page allocator pages,
starts pages at a refcount of 1. Later those pages are 'onlined' by
freeing them to the page allocator via put_page() to drop that initial
reference and populate the free page lists. ZONE_DEVICE abused that
"initialized but never freed" state both to avoid leaking ZONE_DEVICE
pages into places that were not ready for them, and to add some metadata
to the unused (because refcount was never 0) page->lru space.

As more users of ZONE_DEVICE arrived that special casing became more and
more unnecessary, and more and more expensive. The 'struct page'
modernization eliminated the need for the ->lru hack. The folio work had
to remember to sprinkle special case ZONE_DEVICE accounting in the right
places. The MEMORY_DEVICE_PRIVATE use case spearheaded much of the work
to support typical reference counting for ZONE_DEVICE pages and allow
them to be used in more kernel code paths. All the while the DAX case
kept its tech debt in place, until now.

However, while fixing the DAX page refcount semantics and arranging for
free_zone_device_page() to be the common end of life of all ZONE_DEVICE
pages, the mitigation for truncate() vs pinned DAX pages was found to be
incomplete. Typical pages can, surprisingly, remain pinned for DMA after
they have been truncated from a file. DAX pages cannot: the DAX core
must enforce that nobody has access to a page after truncate() has
disconnected it from inode->i_pages. I.e. the file block that is
identical to the page still remains an extent of the file. The existing
mitigation handled explicit truncate while the inode was alive, but not
the implicit truncate right before the inode is freed.

So, in addition to moving DAX pages to be refcount-0 at idle, and adding
'break_layouts' wakeups to free_zone_device_page(), this series also
introduces another occurrence of 'break_layouts' to the inode freeing
path. Recall that 'break_layouts' for DAX is the mechanism to await code
paths that previously arbitrated page access to drop their interest /
page-pins. This new synchronization point is implemented by special
casing dax_delete_mapping_entry(), called by
truncate_inode_pages_final(), to await page pins when mapping_exiting()
is true.
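
To make the shape of that special case concrete, here is a toy
single-threaded model. Every name in it (model_page, model_mapping,
model_wait_for_unpin(), model_delete_mapping_entry()) is hypothetical,
and the -EBUSY return for a pinned page on a live inode is an assumption
of the model rather than a description of what fs/dax.c does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; this models the idea, not fs/dax.c. */
struct model_page {
	int pins;			/* outstanding page pins, 0 means idle */
};

struct model_mapping {
	bool exiting;			/* models mapping_exiting() being true */
	struct model_page *page;	/* the single entry in this toy mapping */
};

/*
 * Stand-in for sleeping until a wakeup arrives from the page-free path.
 * In this single-threaded model each "wakeup" simply drops one pin.
 */
void model_wait_for_unpin(struct model_page *page)
{
	assert(page->pins > 0);
	page->pins--;
}

/*
 * Remove the entry from the mapping. Returns 1 if an entry was removed,
 * 0 if there was none, and -EBUSY if the page is pinned while the inode
 * is still live (model assumption: the caller should have broken
 * layouts before an explicit truncate).
 */
int model_delete_mapping_entry(struct model_mapping *mapping)
{
	struct model_page *page = mapping->page;

	if (!page)
		return 0;

	while (page->pins > 0) {
		if (!mapping->exiting)
			return -EBUSY;
		/* inode end of life: wait for the pins to drain */
		model_wait_for_unpin(page);
	}

	mapping->page = NULL;
	return 1;
}
```

The point of the model is only the asymmetry: on a live inode a pinned
page is reported back to the caller, while on an exiting mapping the
delete path itself waits until the page is idle before disconnecting it.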

Thanks to Jason for the nudge to get this fixed up properly and the
review on v1, Dave, Jan, and Jason for the discussion on what to do
about the inode end-of-life-truncate problem, and Alistair for cleaning
up the last of the refcount-1 assumptions in the MEMORY_DEVICE_PRIVATE
users.

---

Dan Williams (25):
      fsdax: Wait on @page not @page->_refcount
      fsdax: Use dax_page_idle() to document DAX busy page checking
      fsdax: Include unmapped inodes for page-idle detection
      fsdax: Introduce dax_zap_mappings()
      fsdax: Wait for pinned pages during truncate_inode_pages_final()
      fsdax: Validate DAX layouts broken before truncate
      fsdax: Hold dax lock over mapping insertion
      fsdax: Update dax_insert_entry() calling convention to return an error
      fsdax: Rework for_each_mapped_pfn() to dax_for_each_folio()
      fsdax: Introduce pgmap_request_folios()
      fsdax: Rework dax_insert_entry() calling convention
      fsdax: Cleanup dax_associate_entry()
      devdax: Minor warning fixups
      devdax: Fix sparse lock imbalance warning
      libnvdimm/pmem: Support pmem block devices without dax
      devdax: Move address_space helpers to the DAX core
      devdax: Sparse fixes for xarray locking
      devdax: Sparse fixes for vmfault_t / dax-entry conversions
      devdax: Sparse fixes for vm_fault_t in tracepoints
      devdax: add PUD support to the DAX mapping infrastructure
      devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
      mm/memremap_pages: Replace zone_device_page_init() with pgmap_request_folios()
      mm/memremap_pages: Initialize all ZONE_DEVICE pages to start at refcount 0
      mm/meremap_pages: Delete put_devmap_managed_page_refs()
      mm/gup: Drop DAX pgmap accounting


 .clang-format                            |    1 
 arch/powerpc/kvm/book3s_hv_uvmem.c       |    3 
 drivers/Makefile                         |    2 
 drivers/dax/Kconfig                      |    5 
 drivers/dax/Makefile                     |    1 
 drivers/dax/bus.c                        |    9 
 drivers/dax/dax-private.h                |    2 
 drivers/dax/device.c                     |   73 +-
 drivers/dax/mapping.c                    | 1083 ++++++++++++++++++++++++++++++
 drivers/dax/super.c                      |   10 
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |    3 
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |    3 
 drivers/nvdimm/Kconfig                   |    3 
 drivers/nvdimm/pmem.c                    |   47 +
 fs/dax.c                                 | 1069 ++----------------------------
 fs/ext4/inode.c                          |    9 
 fs/fuse/dax.c                            |    9 
 fs/xfs/xfs_file.c                        |    7 
 fs/xfs/xfs_inode.c                       |    4 
 include/linux/dax.h                      |  149 +++-
 include/linux/huge_mm.h                  |   26 -
 include/linux/memremap.h                 |   22 +
 include/linux/mm.h                       |   30 -
 include/linux/mm_types.h                 |   26 -
 include/trace/events/fs_dax.h            |   16 
 lib/test_hmm.c                           |    3 
 mm/gup.c                                 |   89 +-
 mm/huge_memory.c                         |   48 -
 mm/memremap.c                            |  123 ++-
 mm/page_alloc.c                          |    9 
 mm/swap.c                                |    2 
 31 files changed, 1535 insertions(+), 1351 deletions(-)
 create mode 100644 drivers/dax/mapping.c

base-commit: ef6e06b2ef87077104d1145a0fd452ff8dbbc4b7