linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
@ 2023-05-04 21:27 Lorenzo Stoakes
  2023-05-04 21:27 ` [PATCH v9 1/3] mm/mmap: separate writenotify and dirty tracking logic Lorenzo Stoakes
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-04 21:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	Jason Gunthorpe, John Hubbard, Jan Kara, Kirill A . Shutemov,
	Pavel Begunkov, Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger, Lorenzo Stoakes

Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

Because the write goes through this secondary, direct mapping of the
folio, no write notify occurs, and if the caller does mark the folio dirty,
it does so at a point the file system does not expect.

For example, consider the following scenario:-

1. A folio is written to via GUP which write-faults the memory, notifying
   the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
   the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
   direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
   (though it does not have to).

This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.
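
The scenario above can be sketched as a toy userspace model. This is an
illustration only, not kernel code: struct toy_folio and the toy_*()
helpers are hypothetical names, and the "direct mapping" is reduced to a
plain store that skips the PTE check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model of the scenario; every name here is hypothetical. */
struct toy_folio {
	bool dirty;        /* dirty flag the file system relies on */
	bool pte_writable; /* whether the userspace PTE allows writes */
	int notifications; /* how often the file system was told of a write */
	int data;          /* current folio contents */
	int on_disk;       /* contents as of the last writeback */
};

/* Step 1: a write that faults through the PTE, so write notify runs. */
static void toy_faulting_write(struct toy_folio *f, int value)
{
	assert(f != NULL);
	if (!f->pte_writable) {
		f->notifications++;
		f->pte_writable = true;
	}
	f->dirty = true;
	f->data = value;
}

/* Step 2: writeback cleans the folio and write-protects the PTE. */
static void toy_writeback(struct toy_folio *f)
{
	assert(f != NULL);
	f->on_disk = f->data;
	f->dirty = false;
	f->pte_writable = false;
}

/* Step 3: a write through the direct mapping bypasses the PTE entirely. */
static void toy_direct_write(struct toy_folio *f, int value)
{
	assert(f != NULL);
	f->data = value;
}

/* True when the folio holds data the file system does not know about. */
static bool toy_has_untracked_data(const struct toy_folio *f)
{
	assert(f != NULL);
	return !f->dirty && f->data != f->on_disk;
}
```

Running steps 1 to 3 in order leaves the notification count at one while
toy_has_untracked_data() reports true. Step 4 would then correspond to
setting the dirty flag by hand, with no notification preceding it.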

v9:
- Refactored vma_needs_dirty_tracking() and vma_wants_writenotify() to avoid
  duplicate check of shared writable/needs writenotify.
- Removed redundant comments.
- Improved vma_needs_dirty_tracking() commit message.
- Moved folio_fast_pin_allowed() into CONFIG_HAVE_FAST_GUP block as used by
  both the CONFIG_ARCH_HAS_PTE_SPECIAL and huge page cases, both of which
  are invoked under any CONFIG_HAVE_FAST_GUP configuration. Should fix
  mips/arm builds.
- Permit pins of swap cache anon pages.
- Permit KSM anon pages.

v8:
- Fixed typo writeable -> writable.
- Fixed bug in writable_file_mapping_allowed() - must check combination of
  FOLL_PIN AND FOLL_LONGTERM not either/or.
- Updated vma_needs_dirty_tracking() to include write/shared to account for
  MAP_PRIVATE mappings.
- Move to open-coding the checks in folio_pin_allowed() so we can
  READ_ONCE() the mapping and avoid unexpected compiler loads. Rename to
  account for fact we now check flags here.
- Disallow mapping == NULL or mapping & PAGE_MAPPING_FLAGS other than
  anon. Defer to slow path.
- Perform GUP-fast check _after_ the lowest page table level is confirmed to
  be stable.
- Updated comments and commit message for final patch as per Jason's
  suggestions.
https://lore.kernel.org/all/cover.1683067198.git.lstoakes@gmail.com/

v7:
- Fixed very silly bug in writeable_file_mapping_allowed() inverting the
  logic.
- Removed unnecessary RCU lock code and replaced with adaptation of Peter's
  idea.
- Removed unnecessary open-coded folio_test_anon() in
  folio_longterm_write_pin_allowed() and restructured to generally permit
  NULL folio_mapping().
https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com/

v6:
- Rebased on latest mm-unstable as of 28th April 2023.
- Add PUP-fast check with handling for rcu-locked TLB shootdown to synchronise
  correctly.
- Split patch series into 3 to make it more digestible.
https://lore.kernel.org/all/cover.1682981880.git.lstoakes@gmail.com/

v5:
- Rebased on latest mm-unstable as of 25th April 2023.
- Some small refactorings suggested by John.
- Added an extended description of the problem in the comment around
  writeable_file_mapping_allowed() for clarity.
- Updated commit message as suggested by Mika and John.
https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com/

v4:
- Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to
  reduce duplication and update to use this in the GUP check. Note that
  both separately check vm_ops_needs_writenotify() as the latter needs to
  test this before the vm_pgprot_modify() test, resulting in
  vma_wants_writenotify() checking this twice, however it is such a small
  check this should not be egregious.
https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/

v3:
- Rebased on latest mm-unstable as of 24th April 2023.
- Explicitly check whether file system requires folio dirtying. Note that
  vma_wants_writenotify() could not be used directly as it is very much focused
  on determining if the PTE r/w should be set (e.g. assuming private mapping
  does not require it as already set, soft dirty considerations).
- Tested code against shmem and hugetlb mappings - confirmed that these are not
  disallowed by the check.
- Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only
  for FOLL_LONGTERM pins.
- As a result, limit check to internal GUP code.
 https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/

v2:
- Add accidentally excluded ptrace_access_vm() use of
  FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.
https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/

v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/

Lorenzo Stoakes (3):
  mm/mmap: separate writenotify and dirty tracking logic
  mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed
    mappings
  mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed
    mappings

 include/linux/mm.h |   1 +
 mm/gup.c           | 145 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/mmap.c          |  58 ++++++++++++++----
 3 files changed, 191 insertions(+), 13 deletions(-)

--
2.40.1


* [PATCH v9 1/3] mm/mmap: separate writenotify and dirty tracking logic
  2023-05-04 21:27 [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default Lorenzo Stoakes
@ 2023-05-04 21:27 ` Lorenzo Stoakes
  2023-05-04 21:27 ` [PATCH v9 2/3] mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings Lorenzo Stoakes
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-04 21:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	Jason Gunthorpe, John Hubbard, Jan Kara, Kirill A . Shutemov,
	Pavel Begunkov, Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger, Lorenzo Stoakes

vma_wants_writenotify() is specifically intended for setting PTE page table
flags, accounting for existing page table flag state and whether the
underlying filesystem performs dirty tracking for a file-backed mapping.

Everything is predicated firstly on whether the mapping is shared writable,
as this is the only instance where dirty tracking is pertinent -
MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only
file-backed shared mappings cannot be written to, even with FOLL_FORCE.

All other checks are in line with existing logic, though now separated into
checks explicitly for dirty tracking and those for determining how to set
page table flags.

We make this change so we can perform checks in the GUP logic to determine
which mappings might be problematic when written to.
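
The ordering of the dirty tracking checks can be sketched in userspace as
below. This is a model under stated assumptions, not the kernel
implementation: the TOY_VM_* values are illustrative rather than the real
flag values, and struct toy_vma collapses the vm_ops and
mapping_can_writeback() lookups into two booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for VM_* flags; the values are illustrative. */
#define TOY_VM_WRITE  0x2u
#define TOY_VM_SHARED 0x8u
#define TOY_VM_PFNMAP 0x400u

struct toy_vma {
	unsigned int flags;
	bool has_mkwrite;      /* models page_mkwrite or pfn_mkwrite being set */
	bool fs_can_writeback; /* models mapping_can_writeback() returning true */
};

static bool toy_is_shared_writable(const struct toy_vma *vma)
{
	const unsigned int mask = TOY_VM_WRITE | TOY_VM_SHARED;

	assert(vma != NULL);
	return (vma->flags & mask) == mask;
}

static bool toy_needs_dirty_tracking(const struct toy_vma *vma)
{
	/* Private or read-only mappings never need dirty tracking. */
	if (!toy_is_shared_writable(vma))
		return false;

	/* The file system asked to be notified of writes. */
	if (vma->has_mkwrite)
		return true;

	/* PFN mappings have no managed pages to write back. */
	if (vma->flags & TOY_VM_PFNMAP)
		return false;

	return vma->fs_can_writeback;
}
```

In this model a MAP_PRIVATE writable mapping of a regular file is never
tracked, while a shared writable mapping is tracked as soon as either the
mkwrite hooks or writeback capability is present.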

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  1 +
 mm/mmap.c          | 58 ++++++++++++++++++++++++++++++++++++----------
 2 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..7b1d4e7393ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2422,6 +2422,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma);
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 13678edaa22c..8ef5929057fc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1475,6 +1475,48 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
 }
 #endif /* __ARCH_WANT_SYS_OLD_MMAP */
 
+static bool vm_ops_needs_writenotify(const struct vm_operations_struct *vm_ops)
+{
+	return vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite);
+}
+
+static bool vma_is_shared_writable(struct vm_area_struct *vma)
+{
+	return (vma->vm_flags & (VM_WRITE | VM_SHARED)) ==
+		(VM_WRITE | VM_SHARED);
+}
+
+static bool vma_fs_can_writeback(struct vm_area_struct *vma)
+{
+	/* No managed pages to writeback. */
+	if (vma->vm_flags & VM_PFNMAP)
+		return false;
+
+	return vma->vm_file && vma->vm_file->f_mapping &&
+		mapping_can_writeback(vma->vm_file->f_mapping);
+}
+
+/*
+ * Does this VMA require the underlying folios to have their dirty state
+ * tracked?
+ */
+bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
+{
+	/* Only shared, writable VMAs require dirty tracking. */
+	if (!vma_is_shared_writable(vma))
+		return false;
+
+	/* Does the filesystem need to be notified? */
+	if (vm_ops_needs_writenotify(vma->vm_ops))
+		return true;
+
+	/*
+	 * Even if the filesystem doesn't indicate a need for writenotify, if it
+	 * can writeback, dirty tracking is still required.
+	 */
+	return vma_fs_can_writeback(vma);
+}
+
 /*
  * Some shared mappings will want the pages marked read-only
  * to track write events. If so, we'll downgrade vm_page_prot
@@ -1483,21 +1525,18 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
  */
 int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 {
-	vm_flags_t vm_flags = vma->vm_flags;
-	const struct vm_operations_struct *vm_ops = vma->vm_ops;
-
 	/* If it was private or non-writable, the write bit is already clear */
-	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
+	if (!vma_is_shared_writable(vma))
 		return 0;
 
 	/* The backer wishes to know when pages are first written to? */
-	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
+	if (vm_ops_needs_writenotify(vma->vm_ops))
 		return 1;
 
 	/* The open routine did something to the protections that pgprot_modify
 	 * won't preserve? */
 	if (pgprot_val(vm_page_prot) !=
-	    pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
+	    pgprot_val(vm_pgprot_modify(vm_page_prot, vma->vm_flags)))
 		return 0;
 
 	/*
@@ -1511,13 +1550,8 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 	if (userfaultfd_wp(vma))
 		return 1;
 
-	/* Specialty mapping? */
-	if (vm_flags & VM_PFNMAP)
-		return 0;
-
 	/* Can the mapping track the dirty pages? */
-	return vma->vm_file && vma->vm_file->f_mapping &&
-		mapping_can_writeback(vma->vm_file->f_mapping);
+	return vma_fs_can_writeback(vma);
 }
 
 /*
-- 
2.40.1



* [PATCH v9 2/3] mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings
  2023-05-04 21:27 [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default Lorenzo Stoakes
  2023-05-04 21:27 ` [PATCH v9 1/3] mm/mmap: separate writenotify and dirty tracking logic Lorenzo Stoakes
@ 2023-05-04 21:27 ` Lorenzo Stoakes
  2023-05-04 21:27 ` [PATCH v9 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast " Lorenzo Stoakes
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-04 21:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	Jason Gunthorpe, John Hubbard, Jan Kara, Kirill A . Shutemov,
	Pavel Begunkov, Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger, Lorenzo Stoakes

Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

Because the write goes through this secondary, direct mapping of the
folio, no write notify occurs, and if the caller does mark the folio dirty,
it does so at a point the file system does not expect.

For example, consider the following scenario:-

1. A folio is written to via GUP which write-faults the memory, notifying
   the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
   the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
   direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
   (though it does not have to).

This results in both data being written to a folio without writenotify, and
the folio being dirtied unexpectedly (if the caller decides to do so).

This issue was first reported by Jan Kara [1] in 2018, where the problem
resulted in file system crashes.

This is only relevant when the mappings are file-backed and the underlying
file system requires folio dirty tracking. File systems which do not, such
as shmem or hugetlb, are not at risk and therefore can be written to
without issue.

Unfortunately this limitation of GUP has been present for some time and
requires future rework of the GUP API in order to provide correct write
access to such mappings.

However, for the time being we introduce this check to prevent the most
egregious case of this occurring, use of the FOLL_LONGTERM pin.

These mappings are considerably more likely to be written to after
folios are cleaned and thus simply must not be permitted to do so.

This patch changes only the slow-path GUP functions; a following patch
adapts the GUP-fast path along similar lines.
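
The slow-path decision can be sketched as a userspace truth function. The
TOY_FOLL_* values and both helper names are hypothetical, and the VMA is
reduced to two booleans; the point is only that the check fires when
FOLL_PIN and FOLL_LONGTERM are both set on a write to a non-anonymous,
dirty-tracked mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for FOLL_* flags; the values are illustrative. */
#define TOY_FOLL_WRITE    0x1u
#define TOY_FOLL_PIN      0x2u
#define TOY_FOLL_LONGTERM 0x4u
#define TOY_FOLL_ALL      (TOY_FOLL_WRITE | TOY_FOLL_PIN | TOY_FOLL_LONGTERM)

static bool toy_writable_file_mapping_allowed(bool needs_dirty_tracking,
					      unsigned int gup_flags)
{
	const unsigned int longterm_pin = TOY_FOLL_PIN | TOY_FOLL_LONGTERM;

	/* Only the combination of both flags is treated as problematic. */
	if ((gup_flags & longterm_pin) != longterm_pin)
		return true;

	return !needs_dirty_tracking;
}

/* Returns 0 if the toy model permits the access, -EFAULT otherwise. */
static int toy_check_write(bool vma_anon, bool needs_dirty_tracking,
			   unsigned int gup_flags)
{
	/* The toy model only understands the three flags defined above. */
	assert((gup_flags & ~TOY_FOLL_ALL) == 0);

	if (!(gup_flags & TOY_FOLL_WRITE))
		return 0;

	if (!vma_anon &&
	    !toy_writable_file_mapping_allowed(needs_dirty_tracking, gup_flags))
		return -EFAULT;

	return 0;
}
```

With these definitions, a long-term write pin of a dirty-tracked file
mapping yields -EFAULT, whereas dropping any one of the three flags, or
targeting an anonymous or untracked mapping, yields 0.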

[1]:https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index ff689c88a357..0ea9ebec9547 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,16 +959,54 @@ static int faultin_page(struct vm_area_struct *vma,
 	return 0;
 }
 
+/*
+ * Writing to file-backed mappings which require folio dirty tracking using GUP
+ * is a fundamentally broken operation, as kernel write access to GUP mappings
+ * do not adhere to the semantics expected by a file system.
+ *
+ * Consider the following scenario:-
+ *
+ * 1. A folio is written to via GUP which write-faults the memory, notifying
+ *    the file system and dirtying the folio.
+ * 2. Later, writeback is triggered, resulting in the folio being cleaned and
+ *    the PTE being marked read-only.
+ * 3. The GUP caller writes to the folio, as it is mapped read/write via the
+ *    direct mapping.
+ * 4. The GUP caller, now done with the page, unpins it and sets it dirty
+ *    (though it does not have to).
+ *
+ * This results in both data being written to a folio without writenotify, and
+ * the folio being dirtied unexpectedly (if the caller decides to do so).
+ */
+static bool writable_file_mapping_allowed(struct vm_area_struct *vma,
+					  unsigned long gup_flags)
+{
+	/*
+	 * If we aren't pinning then no problematic write can occur. A long term
+	 * pin is the most egregious case so this is the case we disallow.
+	 */
+	if ((gup_flags & (FOLL_PIN | FOLL_LONGTERM)) !=
+	    (FOLL_PIN | FOLL_LONGTERM))
+		return true;
+
+	/*
+	 * If the VMA does not require dirty tracking then no problematic write
+	 * can occur either.
+	 */
+	return !vma_needs_dirty_tracking(vma);
+}
+
 static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 {
 	vm_flags_t vm_flags = vma->vm_flags;
 	int write = (gup_flags & FOLL_WRITE);
 	int foreign = (gup_flags & FOLL_REMOTE);
+	bool vma_anon = vma_is_anonymous(vma);
 
 	if (vm_flags & (VM_IO | VM_PFNMAP))
 		return -EFAULT;
 
-	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+	if ((gup_flags & FOLL_ANON) && !vma_anon)
 		return -EFAULT;
 
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1016,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;
 
 	if (write) {
+		if (!vma_anon &&
+		    !writable_file_mapping_allowed(vma, gup_flags))
+			return -EFAULT;
+
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
-- 
2.40.1



* [PATCH v9 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings
  2023-05-04 21:27 [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default Lorenzo Stoakes
  2023-05-04 21:27 ` [PATCH v9 1/3] mm/mmap: separate writenotify and dirty tracking logic Lorenzo Stoakes
  2023-05-04 21:27 ` [PATCH v9 2/3] mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings Lorenzo Stoakes
@ 2023-05-04 21:27 ` Lorenzo Stoakes
  2023-05-05 20:21 ` [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default David Hildenbrand
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-04 21:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	Jason Gunthorpe, John Hubbard, Jan Kara, Kirill A . Shutemov,
	Pavel Begunkov, Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger, Lorenzo Stoakes

Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.

This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.

We have access to less information in the fast path as we cannot examine
the VMA containing the mapping. However, we can determine whether the folio
is anonymous or belongs to a whitelisted filesystem - specifically
hugetlb and shmem mappings.

We take special care to ensure that both the folio and mapping are safe to
access when performing these checks and document folio_fast_pin_allowed()
accordingly.

It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
always rely on the fact that if we fail to pin on the fast path, the code
will fall back to the slow path which can perform the more thorough check.
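
The whitelisting on the fast path can be sketched in userspace as a
decision over a tagged mapping pointer. This is a model, not the kernel
code: struct toy_fast_folio and the TOY_MAPPING_* names are hypothetical,
the tag values 0x1 and 0x2 are assumed to mirror PAGE_MAPPING_ANON and
PAGE_MAPPING_MOVABLE, and the sketch assumes FOLL_PIN, FOLL_LONGTERM and
FOLL_WRITE are all set (otherwise the pin is always permitted).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tag bits stored in the low bits of the mapping pointer. */
#define TOY_MAPPING_ANON    0x1u
#define TOY_MAPPING_MOVABLE 0x2u
#define TOY_MAPPING_FLAGS   (TOY_MAPPING_ANON | TOY_MAPPING_MOVABLE)

struct toy_fast_folio {
	uintptr_t mapping; /* tagged pointer value, 0 when there is none */
	bool hugetlb;      /* models folio_test_hugetlb() */
	bool shmem;        /* models shmem_mapping() on the untagged pointer */
};

/* Returns false when the fast path should give up and use the slow path. */
static bool toy_fast_pin_allowed(const struct toy_fast_folio *folio)
{
	uintptr_t tags;

	assert(folio != NULL);

	/* hugetlb does not require dirty tracking. */
	if (folio->hugetlb)
		return true;

	/* No mapping (e.g. truncated): cannot tell, so fall back. */
	if (folio->mapping == 0)
		return false;

	/* Any tag bit means this is not a plain address_space pointer. */
	tags = folio->mapping & TOY_MAPPING_FLAGS;
	if (tags)
		return (tags & TOY_MAPPING_ANON) != 0;

	/* Plain file mapping: only shmem is whitelisted. */
	return folio->shmem;
}
```

Note that a value with both tag bits set is accepted because the anon bit
is present, which is consistent with the v9 changelog entry about
permitting KSM anon pages; a movable-only tag falls back to the slow path.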

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 0ea9ebec9547..ef43ffb3d1fe 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -2379,6 +2380,82 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  */
 #ifdef CONFIG_HAVE_FAST_GUP
 
+/*
+ * Used in the GUP-fast path to determine whether a pin is permitted for a
+ * specific folio.
+ *
+ * This call assumes the caller has pinned the folio, that the lowest page table
+ * level still points to this folio, and that interrupts have been disabled.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases and if in doubt,
+ * fall back to the slow path.
+ */
+static bool folio_fast_pin_allowed(struct folio *folio, unsigned int flags)
+{
+	struct address_space *mapping;
+	unsigned long mapping_flags;
+
+	/*
+	 * If we aren't pinning then no problematic write can occur. A long term
+	 * pin is the most egregious case so this is the one we disallow.
+	 */
+	if ((flags & (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE)) !=
+	    (FOLL_PIN | FOLL_LONGTERM | FOLL_WRITE))
+		return true;
+
+	/* The folio is pinned, so we can safely access folio fields. */
+
+	if (WARN_ON_ONCE(folio_test_slab(folio)))
+		return false;
+
+	/* hugetlb mappings do not require dirty-tracking. */
+	if (folio_test_hugetlb(folio))
+		return true;
+
+	/*
+	 * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
+	 * cannot proceed, which means no actions performed under RCU can
+	 * proceed either.
+	 *
+	 * inodes and thus their mappings are freed under RCU, which means the
+	 * mapping cannot be freed beneath us and thus we can safely dereference
+	 * it.
+	 */
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * However, there may be operations which _alter_ the mapping, so ensure
+	 * we read it once and only once.
+	 */
+	mapping = READ_ONCE(folio->mapping);
+
+	/*
+	 * The mapping may have been truncated, in any case we cannot determine
+	 * if this mapping is safe - fall back to slow path to determine how to
+	 * proceed.
+	 */
+	if (!mapping)
+		return false;
+
+	/* Anonymous folios pose no problem. */
+	mapping_flags = (unsigned long)mapping & PAGE_MAPPING_FLAGS;
+	if (mapping_flags)
+		return mapping_flags & PAGE_MAPPING_ANON;
+
+	/*
+	 * At this point, we know the mapping is non-null and points to an
+	 * address_space object. The only remaining whitelisted file system is
+	 * shmem.
+	 */
+	return shmem_mapping(mapping);
+}
+
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 					    unsigned int flags,
 					    struct page **pages)
@@ -2464,6 +2541,11 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 			goto pte_unmap;
 		}
 
+		if (!folio_fast_pin_allowed(folio, flags)) {
+			gup_put_folio(folio, 1, flags);
+			goto pte_unmap;
+		}
+
 		if (!pte_write(pte) && gup_must_unshare(NULL, flags, page)) {
 			gup_put_folio(folio, 1, flags);
 			goto pte_unmap;
@@ -2656,6 +2738,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
+	if (!folio_fast_pin_allowed(folio, flags)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
 	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
 		gup_put_folio(folio, refs, flags);
 		return 0;
@@ -2722,6 +2809,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
+	if (!folio_fast_pin_allowed(folio, flags)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
 	if (!pmd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
 		gup_put_folio(folio, refs, flags);
 		return 0;
@@ -2762,6 +2853,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
+	if (!folio_fast_pin_allowed(folio, flags)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
 	if (!pud_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
 		gup_put_folio(folio, refs, flags);
 		return 0;
@@ -2797,6 +2893,11 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 	}
 
+	if (!folio_fast_pin_allowed(folio, flags)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	folio_set_referenced(folio);
 	return 1;
-- 
2.40.1



* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-04 21:27 [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2023-05-04 21:27 ` [PATCH v9 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast " Lorenzo Stoakes
@ 2023-05-05 20:21 ` David Hildenbrand
  2023-05-05 21:12   ` Lorenzo Stoakes
  2023-05-14 19:20 ` Lorenzo Stoakes
  2023-05-15 11:03 ` Kirill A . Shutemov
  5 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2023-05-05 20:21 UTC (permalink / raw)
  To: Lorenzo Stoakes, linux-mm, linux-kernel, Andrew Morton, Jens Axboe
  Cc: Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	Jason Gunthorpe, John Hubbard, Jan Kara, Kirill A . Shutemov,
	Pavel Begunkov, Mika Penttila, Dave Chinner, Theodore Ts'o,
	Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger

On 04.05.23 23:27, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> 
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
> 
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
> 
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
> 
> For example, consider the following scenario:-
> 
> 1. A folio is written to via GUP which write-faults the memory, notifying
>     the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>     the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>     direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>     (though it does not have to).
> 
> This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
> pin_user_pages_fast_only() does not exist, we can rely on a slightly
> imperfect whitelisting in the PUP-fast case and fall back to the slow case
> should this fail.
> 
>

Thanks a lot, this looks pretty good to me!

I started writing some selftests (assuming none would be in the works) using
iouring and the gup_tests interface. So far, no real surprises for the general
GUP interaction [1].


There are two things I noticed when registering an iouring fixed buffer (that differ
now from generic gup_test usage):


(1) Registering a fixed buffer targeting an unsupported MAP_SHARED FS file now fails with
     EFAULT (from pin_user_pages()) instead of EOPNOTSUPP (from io_pin_pages()).

The man page for io_uring_register documents:

        EOPNOTSUPP
               User buffers point to file-backed memory.

... we'd have to do some kind of errno translation in io_pin_pages(). But the
translation is not simple (sometimes we want to forward EOPNOTSUPP). That also
applies once we remove that special-casing in io_uring code.

... maybe we can simply update the manpage (stating that older kernels returned
EOPNOTSUPP) and start returning EFAULT?


(2) Registering a fixed buffer targeting a MAP_PRIVATE FS file fails with EOPNOTSUPP
     (from io_pin_pages()). As discussed, there is nothing wrong with pinning all-anon
     pages (resulting from breaking COW).

That could easily be handled (allow any !VM_MAYSHARE), and would automatically be
handled once the iouring special-casing is removed.
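
Roughly, the relaxed check could look like the following userspace model. This
is only a sketch: struct sketch_vma and its fields are invented stand-ins for
the real VMA state, and the flag value is meant to mirror VM_MAYSHARE purely
for illustration.

```c
#include <stdbool.h>

/* Illustrative value; the real definition lives in the kernel headers. */
#define SKETCH_VM_MAYSHARE	0x00000080UL

/* Invented stand-in for the parts of a VMA the check cares about. */
struct sketch_vma {
	unsigned long vm_flags;
	bool has_file;		/* stands in for vma->vm_file != NULL */
	bool shmem_or_hugetlb;	/* stands in for the shmem/hugetlb special cases */
};

/* Returns true if registering a fixed buffer in this mapping is acceptable. */
static bool sketch_fixed_buffer_ok(const struct sketch_vma *vma)
{
	/* Anonymous memory is always fine. */
	if (!vma->has_file)
		return true;

	/* MAP_PRIVATE file mapping: a write pin breaks COW, yielding anon pages. */
	if (!(vma->vm_flags & SKETCH_VM_MAYSHARE))
		return true;

	/* MAP_SHARED: only mappings that need no dirty tracking are allowed. */
	return vma->shmem_or_hugetlb;
}
```

With such a check, test 48 in the output below would be expected to succeed
rather than fail.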


[1]

# ./pin_longterm
# [INFO] detected hugetlb size: 2048 KiB
# [INFO] detected hugetlb size: 1048576 KiB
TAP version 13
1..50
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 1 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 2 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 3 Pinning failed as expected
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 4 # SKIP need more free huge pages
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 5 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
ok 6 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
ok 7 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
ok 8 Pinning failed as expected
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 9 # SKIP need more free huge pages
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 10 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 11 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 12 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 13 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 14 # SKIP need more free huge pages
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 15 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
ok 16 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
ok 17 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
ok 18 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 19 # SKIP need more free huge pages
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 20 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
ok 21 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
ok 22 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 23 Pinning succeeded as expected
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 24 # SKIP need more free huge pages
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 25 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
ok 26 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
ok 27 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 28 Pinning succeeded as expected
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 29 # SKIP need more free huge pages
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 30 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
ok 31 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
ok 32 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 33 Pinning succeeded as expected
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 34 # SKIP need more free huge pages
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 35 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
ok 36 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
ok 37 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 38 Pinning succeeded as expected
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 39 # SKIP need more free huge pages
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 40 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd
ok 41 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with tmpfile
ok 42 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with local tmpfile
ok 43 Pinning failed as expected
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 44 # SKIP need more free huge pages
# [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 45 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd
ok 46 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with tmpfile
ok 47 Pinning succeeded as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with local tmpfile
not ok 48 Pinning failed as expected
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 49 # SKIP need more free huge pages
# [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 50 Pinning succeeded as expected
Bail out! 1 out of 50 tests failed
# Totals: pass:39 fail:1 xfail:0 xpass:0 skip:10 error:0


-- 
Thanks,

David / dhildenb



* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-05 20:21 ` [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default David Hildenbrand
@ 2023-05-05 21:12   ` Lorenzo Stoakes
  0 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-05 21:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Andrew Morton, Jens Axboe,
	Jason Gunthorpe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	Jason Gunthorpe, John Hubbard, Jan Kara, Kirill A . Shutemov,
	Pavel Begunkov, Mika Penttila, Dave Chinner, Theodore Ts'o,
	Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger

On Fri, May 05, 2023 at 10:21:21PM +0200, David Hildenbrand wrote:
> On 04.05.23 23:27, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> >
> > A GUP caller uses the direct mapping to access the folio, which does not
> > cause write notify to trigger, nor does it enforce that the caller marks
> > the folio dirty.
> >
> > The problem arises when, after an initial write to the folio, writeback
> > results in the folio being cleaned and then the caller, via the GUP
> > interface, writes to the folio again.
> >
> > As a result of the use of this secondary, direct, mapping to the folio no
> > write notify will occur, and if the caller does mark the folio dirty, this
> > will be done so unexpectedly.
> >
> > For example, consider the following scenario:-
> >
> > 1. A folio is written to via GUP which write-faults the memory, notifying
> >     the file system and dirtying the folio.
> > 2. Later, writeback is triggered, resulting in the folio being cleaned and
> >     the PTE being marked read-only.
> > 3. The GUP caller writes to the folio, as it is mapped read/write via the
> >     direct mapping.
> > 4. The GUP caller, now done with the page, unpins it and sets it dirty
> >     (though it does not have to).
> >
> > This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
> > pin_user_pages_fast_only() does not exist, we can rely on a slightly
> > imperfect whitelisting in the PUP-fast case and fall back to the slow case
> > should this fail.
> >
> >
>
> Thanks a lot, this looks pretty good to me!

Thanks!

>
> I started writing some selftests (assuming none would be in the works) using
> iouring and the gup_test interface. So far, no real surprises for the general
> GUP interaction [1].
>

Nice! I was using the cow selftests, as I was just looking for something that
touches FOLL_LONGTERM with PUP-fast. I hacked them so they always write, just
to test the patches, but clearly we need something more thorough.

>
> There are two things I noticed when registering an iouring fixed buffer (that differ
> now from generic gup_test usage):
>
>
> (1) Registering a fixed buffer targeting an unsupported MAP_SHARED FS file now fails with
>     EFAULT (from pin_user_pages()) instead of EOPNOTSUPP (from io_pin_pages()).
>
> The man page for io_uring_register documents:
>
>        EOPNOTSUPP
>               User buffers point to file-backed memory.
>
> ... we'd have to do some kind of errno translation in io_pin_pages(). But the
> translation is not simple (sometimes we want to forward EOPNOTSUPP). That also
> applies once we remove that special-casing in io_uring code.
>
> ... maybe we can simply update the manpage (stating that older kernels returned
> EOPNOTSUPP) and start returning EFAULT?

Yeah, I noticed this discrepancy when going through my initial attempts to
refactor in the vmas patch series. I wonder how important it is to
differentiate? I have a feeling it probably doesn't matter too much, but
obviously we need input from Jens and Pavel.

>
>
> (2) Registering a fixed buffer targeting a MAP_PRIVATE FS file fails with EOPNOTSUPP
>     (from io_pin_pages()). As discussed, there is nothing wrong with pinning all-anon
>     pages (resulting from breaking COW).
>
> That could easily be handled (allow any !VM_MAYSHARE), and would automatically be
> handled once the iouring special-casing is removed.

The entire intent of this series (for me :)) was to allow io_uring to just
drop this code altogether, so we can unblock my series dropping the 'vmas'
parameter from GUP [1].

I always intended to respin that after this settled down. Jens and Pavel
seemed on board with this (and really they shouldn't need to be doing that
check; that was always a failing in GUP).

I will do a v5 of this soon.

[1]: https://lore.kernel.org/all/cover.1681831798.git.lstoakes@gmail.com/

>
>
> [1]
>
> # ./pin_longterm
> # [INFO] detected hugetlb size: 2048 KiB
> # [INFO] detected hugetlb size: 1048576 KiB
> TAP version 13
> 1..50
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
> ok 1 Pinning succeeded as expected
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> ok 2 Pinning succeeded as expected
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> ok 3 Pinning failed as expected
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> ok 4 # SKIP need more free huge pages
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
> ok 5 Pinning succeeded as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
> ok 6 Pinning succeeded as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
> ok 7 Pinning succeeded as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
> ok 8 Pinning failed as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> ok 9 # SKIP need more free huge pages
> # [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
> ok 10 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd
> ok 11 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> ok 12 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> ok 13 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> ok 14 # SKIP need more free huge pages
> # [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
> ok 15 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
> ok 16 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
> ok 17 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
> ok 18 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> ok 19 # SKIP need more free huge pages
> # [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
> ok 20 Pinning succeeded as expected
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
> ok 21 Pinning succeeded as expected
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
> ok 22 Pinning succeeded as expected
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
> ok 23 Pinning succeeded as expected
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
> ok 24 # SKIP need more free huge pages
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
> ok 25 Pinning succeeded as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
> ok 26 Pinning succeeded as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
> ok 27 Pinning succeeded as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
> ok 28 Pinning succeeded as expected
> # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
> ok 29 # SKIP need more free huge pages
> # [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
> ok 30 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
> ok 31 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
> ok 32 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
> ok 33 Pinning succeeded as expected
> # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
> ok 34 # SKIP need more free huge pages
> # [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
> ok 35 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
> ok 36 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
> ok 37 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
> ok 38 Pinning succeeded as expected
> # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
> ok 39 # SKIP need more free huge pages
> # [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
> ok 40 Pinning succeeded as expected
> # [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd
> ok 41 Pinning succeeded as expected
> # [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with tmpfile
> ok 42 Pinning succeeded as expected
> # [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with local tmpfile
> ok 43 Pinning failed as expected
> # [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> ok 44 # SKIP need more free huge pages
> # [RUN] iouring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
> ok 45 Pinning succeeded as expected
> # [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd
> ok 46 Pinning succeeded as expected
> # [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with tmpfile
> ok 47 Pinning succeeded as expected
> # [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with local tmpfile
> not ok 48 Pinning failed as expected
> # [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
> ok 49 # SKIP need more free huge pages
> # [RUN] iouring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
> ok 50 Pinning succeeded as expected
> Bail out! 1 out of 50 tests failed
> # Totals: pass:39 fail:1 xfail:0 xpass:0 skip:10 error:0
>
>
> --
> Thanks,
>
> David / dhildenb
>


* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-04 21:27 [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2023-05-05 20:21 ` [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default David Hildenbrand
@ 2023-05-14 19:20 ` Lorenzo Stoakes
  2023-05-15  5:14   ` Christoph Hellwig
  2023-05-15 11:03 ` Kirill A . Shutemov
  5 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-14 19:20 UTC (permalink / raw)
  To: linux-mm, linux-kernel, Andrew Morton
  Cc: Jason Gunthorpe, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	Jason Gunthorpe, John Hubbard, Jan Kara, Kirill A . Shutemov,
	Pavel Begunkov, Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger

On Thu, May 04, 2023 at 10:27:50PM +0100, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
>
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.
>
> The problem arises when, after an initial write to the folio, writeback
> results in the folio being cleaned and then the caller, via the GUP
> interface, writes to the folio again.
>
> As a result of the use of this secondary, direct, mapping to the folio no
> write notify will occur, and if the caller does mark the folio dirty, this
> will be done so unexpectedly.
>
> For example, consider the following scenario:-
>
> 1. A folio is written to via GUP which write-faults the memory, notifying
>    the file system and dirtying the folio.
> 2. Later, writeback is triggered, resulting in the folio being cleaned and
>    the PTE being marked read-only.
> 3. The GUP caller writes to the folio, as it is mapped read/write via the
>    direct mapping.
> 4. The GUP caller, now done with the page, unpins it and sets it dirty
>    (though it does not have to).
>
> This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
> pin_user_pages_fast_only() does not exist, we can rely on a slightly
> imperfect whitelisting in the PUP-fast case and fall back to the slow case
> should this fail.
[snip]

As discussed at LSF/MM, on the flight over I wrote a little repro [0] which
reliably triggers the ext4 warning by recreating the scenario described
above, using a small userland program and kernel module.
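
For anyone who just wants the shape of the problem without running anything,
the four steps can be modelled as a toy state machine in plain C. To be clear,
this is not the code from [0]; every name below is invented for illustration
and nothing here touches real kernel state.

```c
#include <stdbool.h>

/* Toy model of the state the scenario cares about. */
struct toy_folio {
	bool dirty;		/* folio dirty flag */
	bool pte_writable;	/* userspace PTE permits writes */
	bool fs_expects_dirty;	/* file system was notified of a pending write */
};

/* Step 1: a write fault through the user mapping triggers write notify. */
static void toy_write_fault(struct toy_folio *f)
{
	f->fs_expects_dirty = true;
	f->pte_writable = true;
	f->dirty = true;
}

/* Step 2: writeback cleans the folio and write-protects the PTE. */
static void toy_writeback(struct toy_folio *f)
{
	f->dirty = false;
	f->pte_writable = false;
	f->fs_expects_dirty = false;
}

/*
 * Steps 3 and 4: the pinner writes via the direct mapping (no fault, so no
 * write notify), then unpins and optionally marks the folio dirty.
 */
static void toy_gup_write_and_unpin(struct toy_folio *f, bool set_dirty)
{
	if (set_dirty)
		f->dirty = true;
}

/* The state a file system does not expect to see. */
static bool toy_unexpectedly_dirty(const struct toy_folio *f)
{
	return f->dirty && !f->fs_expects_dirty;
}
```

Running steps 1, 2 and then 3/4 with set_dirty true leaves the model in the
"unexpectedly dirty" state while the PTE is still read-only.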

This code is not perfect (plane code :) but does seem to do the job
adequately, also obviously this should only be run in a VM environment
where data loss is acceptable (in my case a small qemu instance).

Hopefully this is useful in some way. Note that I explicitly use
pin_user_pages() without FOLL_LONGTERM here in order to not run into the
mitigation this very patch series provides! Obviously if you revert this
series you can see the same happening with FOLL_LONGTERM set.

I have licensed the code as GPLv2 so anybody's free to do with it as they
will if it's useful in any way!

[0]: https://github.com/lorenzo-stoakes/gup-repro


* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-14 19:20 ` Lorenzo Stoakes
@ 2023-05-15  5:14   ` Christoph Hellwig
  2023-05-15 11:31     ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-15  5:14 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, Andrew Morton, Jason Gunthorpe,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, Jason Gunthorpe,
	John Hubbard, Jan Kara, Kirill A . Shutemov, Pavel Begunkov,
	Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger

On Sun, May 14, 2023 at 08:20:04PM +0100, Lorenzo Stoakes wrote:
> As discussed at LSF/MM, on the flight over I wrote a little repro [0] which
> reliably triggers the ext4 warning by recreating the scenario described
> above, using a small userland program and kernel module.
> 
> This code is not perfect (plane code :) but does seem to do the job
> adequately, also obviously this should only be run in a VM environment
> where data loss is acceptable (in my case a small qemu instance).

It would be really awesome if you could wire it up and submit it
to xfstests.


* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-04 21:27 [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2023-05-14 19:20 ` Lorenzo Stoakes
@ 2023-05-15 11:03 ` Kirill A . Shutemov
  2023-05-15 11:16   ` Lorenzo Stoakes
  5 siblings, 1 reply; 20+ messages in thread
From: Kirill A . Shutemov @ 2023-05-15 11:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, Andrew Morton, Jason Gunthorpe,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, Jason Gunthorpe,
	John Hubbard, Jan Kara, Pavel Begunkov, Mika Penttila,
	David Hildenbrand, Dave Chinner, Theodore Ts'o, Peter Xu,
	Matthew Rosato, Paul E . McKenney, Christian Borntraeger

On Thu, May 04, 2023 at 10:27:50PM +0100, Lorenzo Stoakes wrote:
> Writing to file-backed mappings which require folio dirty tracking using
> GUP is a fundamentally broken operation, as kernel write access to GUP
> mappings do not adhere to the semantics expected by a file system.
> 
> A GUP caller uses the direct mapping to access the folio, which does not
> cause write notify to trigger, nor does it enforce that the caller marks
> the folio dirty.

Okay, the problem is clear and the patchset looks good to me. But I'm worried
about breaking existing users.

Do we expect the change to be visible to real world users? If yes, are we
okay to break them?

One thing that came to mind is KVM with "qemu -object memory-backend-file,share=on..."
It is mostly used for pmem emulation.

Do we have plan B?

Just a random/crazy/broken idea:

 - Allow folio_mkclean() (and folio_clear_dirty_for_io()) to fail,
   indicating that the page cannot be cleared because it is pinned;

 - Introduce a new vm_operations_struct::mkclean() that would be called by
   page_vma_mkclean_one() before clearing the range and can fail;

 - On GUP, create an in-kernel fake VMA that represents the file, but with
   custom vm_ops. The VMA is registered in rmap so that it gets notified on
   folio_mkclean() and can fail it because of GUP.

 - folio_clear_dirty_for_io() callers will handle the new failure as
   indication that the page can be written back but will stay dirty and
   fs-specific data that is associated with the page writeback cannot be
   freed.
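
To make the last bullet a bit more concrete, here is a toy userspace model of
a "clear dirty" step that can fail. It is only a sketch of the idea: the
struct, the function names and the choice of -EBUSY are all assumptions, and
none of this reflects the real writeback path.

```c
#include <errno.h>
#include <stdbool.h>

/* Invented stand-in for the folio state relevant to the idea. */
struct idea_folio {
	int pincount;	/* outstanding GUP pins */
	bool dirty;
};

/* A clear-dirty step that refuses to clean a pinned folio. */
static int idea_clear_dirty_for_io(struct idea_folio *f)
{
	if (f->pincount > 0)
		return -EBUSY;	/* assumed error code */

	f->dirty = false;
	return 0;
}

/*
 * Writeback caller: the data would be written out either way, but only a
 * successful clean means fs-specific writeback data may be freed.
 * Returns true if the folio ended up clean.
 */
static bool idea_writeback(struct idea_folio *f)
{
	/* Submitting the folio contents for I/O would happen here. */
	return idea_clear_dirty_for_io(f) == 0;
}
```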

I'm sure the idea is broken on many levels (I have never looked closely at
the writeback path). But maybe it is good enough as a conversation starter?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-15 11:03 ` Kirill A . Shutemov
@ 2023-05-15 11:16   ` Lorenzo Stoakes
  2023-05-15 12:12     ` Jason Gunthorpe
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-15 11:16 UTC (permalink / raw)
  To: Kirill A . Shutemov
  Cc: linux-mm, linux-kernel, Andrew Morton, Jason Gunthorpe,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, Jason Gunthorpe,
	John Hubbard, Jan Kara, Pavel Begunkov, Mika Penttila,
	David Hildenbrand, Dave Chinner, Theodore Ts'o, Peter Xu,
	Matthew Rosato, Paul E . McKenney, Christian Borntraeger

On Mon, May 15, 2023 at 02:03:15PM +0300, Kirill A . Shutemov wrote:
> On Thu, May 04, 2023 at 10:27:50PM +0100, Lorenzo Stoakes wrote:
> > Writing to file-backed mappings which require folio dirty tracking using
> > GUP is a fundamentally broken operation, as kernel write access to GUP
> > mappings do not adhere to the semantics expected by a file system.
> >
> > A GUP caller uses the direct mapping to access the folio, which does not
> > cause write notify to trigger, nor does it enforce that the caller marks
> > the folio dirty.
>
> Okay, the problem is clear and the patchset looks good to me. But I'm worried
> about breaking existing users.
>
> Do we expect the change to be visible to real world users? If yes, are we
> okay to break them?

The general consensus at the moment is that there is no entirely reasonable
usage of this case and you're already running the risk of a kernel oops if
you do this, so it's already broken.

>
> One thing that came to mind is KVM with "qemu -object memory-backend-file,share=on..."
> It is mostly used for pmem emulation.
>
> Do we have plan B?

Yes, we can make it opt-in or opt-out via a FOLL_FLAG. This would be easy
to implement in the event of any issues arising.
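
As a rough illustration of the opt-out variant, the decision could be modelled
like this. The flag values are made up, and SKETCH_FOLL_ALLOW_FILE_WRITE is a
purely hypothetical name; no such flag exists today.

```c
#include <stdbool.h>

/* Made-up values; only the concepts of FOLL_WRITE/FOLL_LONGTERM are real. */
#define SKETCH_FOLL_WRITE		(1U << 0)
#define SKETCH_FOLL_LONGTERM		(1U << 1)
/* Hypothetical opt-out flag, invented for this sketch. */
#define SKETCH_FOLL_ALLOW_FILE_WRITE	(1U << 2)

/*
 * Returns true if the pin may proceed. needs_dirty_tracking stands in for
 * "file-backed mapping whose folios require dirty tracking".
 */
static bool sketch_pin_allowed(unsigned int gup_flags, bool needs_dirty_tracking)
{
	const unsigned int write_longterm =
		SKETCH_FOLL_WRITE | SKETCH_FOLL_LONGTERM;

	/* Mappings without dirty tracking are unaffected. */
	if (!needs_dirty_tracking)
		return true;

	/* Read-only or short-term pins are unaffected. */
	if ((gup_flags & write_longterm) != write_longterm)
		return true;

	/* Writable long-term pin: refused unless the caller opted out. */
	return (gup_flags & SKETCH_FOLL_ALLOW_FILE_WRITE) != 0;
}
```

An opt-in variant would simply invert the default, refusing only when a
caller explicitly asks for the stricter behaviour.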

>
> Just a random/crazy/broken idea:
>
>  - Allow folio_mkclean() (and folio_clear_dirty_for_io()) to fail,
>    indicating that the page cannot be cleared because it is pinned;
>
>  - Introduce a new vm_operations_struct::mkclean() that would be called by
>    page_vma_mkclean_one() before clearing the range and can fail;
>
>  - On GUP, create an in-kernel fake VMA that represents the file, but with
>    custom vm_ops. The VMA is registered in rmap so that it gets notified on
>    folio_mkclean() and can fail it because of GUP.
>
>  - folio_clear_dirty_for_io() callers will handle the new failure as
>    indication that the page can be written back but will stay dirty and
>    fs-specific data that is associated with the page writeback cannot be
>    freed.
>
> I'm sure the idea is broken on many levels (I have never looked closely at
> the writeback path). But maybe it is good enough as a conversation starter?
>

Yeah, there are definitely a few ideas down this road that might be
possible. I am not sure how a filesystem can be expected to cope, or how this
could reasonably be used, without dirty tracking/writeback though, because
you'd just not track anything. Or I guess you mean the mapping would be
read-only but somehow stay dirty?

I also had ideas along these lines of e.g. having a special vmalloc mode
which mimics the correct wrprotect settings + does the right thing, but of
course that does nothing to help DMA writing to a GUP-pinned page.

Though if the issue is at the point of the kernel marking the page dirty
unexpectedly, perhaps we can just invoke the mkwrite() _there_ before
marking dirty?

There are probably some synchronisation issues there too.

Jason will have some thoughts on this I'm sure. I guess the key question
here is - is it actually feasible for this to work at all? Once we
establish that, the rest are details :)

> --
>   Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-15  5:14   ` Christoph Hellwig
@ 2023-05-15 11:31     ` Lorenzo Stoakes
  2023-05-17  8:26       ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-15 11:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, linux-kernel, Andrew Morton, Jason Gunthorpe,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, Jason Gunthorpe,
	John Hubbard, Jan Kara, Kirill A . Shutemov, Pavel Begunkov,
	Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger

On Sun, May 14, 2023 at 10:14:46PM -0700, Christoph Hellwig wrote:
> On Sun, May 14, 2023 at 08:20:04PM +0100, Lorenzo Stoakes wrote:
> > As discussed at LSF/MM, on the flight over I wrote a little repro [0] which
> > reliably triggers the ext4 warning by recreating the scenario described
> > above, using a small userland program and kernel module.
> >
> > This code is not perfect (plane code :) but does seem to do the job
> > adequately, also obviously this should only be run in a VM environment
> > where data loss is acceptable (in my case a small qemu instance).
>
> It would be really awesome if you could wire it up with and submit it
> to xfstests.

Sure, am happy to take a look at that! Also happy if David finds it useful in any
way for his unit tests.

The kernel module interface is a bit sketchy (it takes a user address which it
blindly pins for you) so it's not something that should be run in any unsafe
environment but as long as we are ok with that :)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-15 11:16   ` Lorenzo Stoakes
@ 2023-05-15 12:12     ` Jason Gunthorpe
  2023-05-15 13:07       ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2023-05-15 12:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kirill A . Shutemov, linux-mm, linux-kernel, Andrew Morton,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, John Hubbard,
	Jan Kara, Pavel Begunkov, Mika Penttila, David Hildenbrand,
	Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato,
	Paul E . McKenney, Christian Borntraeger

On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > One thing that came to mind is KVM with "qemu -object memory-backend-file,share=on..."
> > It is mostly used for pmem emulation.
> >
> > Do we have plan B?
> 
> Yes, we can make it opt-in or opt-out via a FOLL_FLAG. This would be easy
> to implement in the event of any issues arising.

I'm becoming less keen on the idea of a per-subsystem opt out. I think
we should make a kernel wide opt out. I like the idea of using lower
lockdown levels. Lots of things become unavailable in the uAPI when the
lockdown level increases already.

> Jason will have some thoughts on this I'm sure. I guess the key question
> here is - is it actually feasible for this to work at all? Once we
> establish that, the rest are details :)

Surely it is, but like Ted said, the FS folks are not interested and
they are at least half the solution..

The FS also has to actively not write out the page while it cannot be
write protected unless it copies the data to a stable page. The block
stack needs the source data to be stable to do checksum/parity/etc
stuff. It is a complicated subject.
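
As a rough user-space illustration of the stable-page point (this is not
kernel code; checksum(), writeback(), struct io_result and PAGE_SZ are all
hypothetical names made up for the sketch), the following models a checksum
taken at submission time while something else, such as DMA into a pinned
page, modifies the source before the transfer completes. Copying to a bounce
buffer first keeps the checksum and the transferred data in agreement:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 64 /* tiny stand-in for a page, for illustration only */

/* Simple polynomial checksum; stands in for a real CRC/parity computation. */
static uint32_t checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

struct io_result {
	uint32_t csum_submitted; /* checksum computed when the I/O is submitted */
	uint32_t csum_on_disk;   /* checksum of what actually reached "disk" */
};

/*
 * Model one writeback of 'page'. A concurrent writer is modelled by flipping
 * a byte of the page between the checksum and the transfer. With use_bounce
 * set, the I/O source is a private copy, so the flip cannot affect it.
 */
static struct io_result writeback(uint8_t *page, int use_bounce)
{
	uint8_t bounce[PAGE_SZ], disk[PAGE_SZ];
	const uint8_t *src = page;
	struct io_result res;

	if (use_bounce) {
		memcpy(bounce, page, PAGE_SZ); /* take a stable copy */
		src = bounce;
	}

	res.csum_submitted = checksum(src, PAGE_SZ);
	page[0] ^= 0xff;                /* write lands mid-I/O */
	memcpy(disk, src, PAGE_SZ);     /* the "device" transfer */
	res.csum_on_disk = checksum(disk, PAGE_SZ);
	return res;
}
```

Calling writeback(page, 0) yields mismatching checksums, while
writeback(page, 1) yields matching ones, at the cost of an extra copy per
page written.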

Jason 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-15 12:12     ` Jason Gunthorpe
@ 2023-05-15 13:07       ` Lorenzo Stoakes
  2023-05-17  7:29         ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-15 13:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Kirill A . Shutemov, linux-mm, linux-kernel, Andrew Morton,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, John Hubbard,
	Jan Kara, Pavel Begunkov, Mika Penttila, David Hildenbrand,
	Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato,
	Paul E . McKenney, Christian Borntraeger

On Mon, May 15, 2023 at 09:12:49AM -0300, Jason Gunthorpe wrote:
> On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > > One thing that came to mind is KVM with "qemu -object memory-backend-file,share=on..."
> > > It is mostly used for pmem emulation.
> > >
> > > Do we have plan B?
> >
> > Yes, we can make it opt-in or opt-out via a FOLL_FLAG. This would be easy
> > to implement in the event of any issues arising.
>
> I'm becoming less keen on the idea of a per-subsystem opt out. I think
> we should make a kernel wide opt out. I like the idea of using lower
> lockdown levels. Lots of things become unavailable in the uAPI when the
> lockdown level increases already.

This would be the 'safest' in the sense that a user can't be surprised by
higher lockdown = access modes disallowed, however we'd _definitely_ need
to have an opt-in in that instance so io_uring can make use of this
regardless. That's easy to add however.

If we do go down that road, we can be even stricter/vary what we do at
different levels right?

>
> > Jason will have some thoughts on this I'm sure. I guess the key question
> > here is - is it actually feasible for this to work at all? Once we
> > establish that, the rest are details :)
>
> Surely it is, but like Ted said, the FS folks are not interested and
> they are at least half the solution..

:'(

>
> The FS also has to actively not write out the page while it cannot be
> write protected unless it copies the data to a stable page. The block
> stack needs the source data to be stable to do checksum/parity/etc
> stuff. It is a complicated subject.

Yes my sense was that being able to write arbitrarily to these pages _at
all_ was a big issue, not only the dirty tracking aspect.

I guess at some level letting filesystems have such total flexibility as to
how they implement things leaves us in a difficult position.

>
> Jason

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-15 13:07       ` Lorenzo Stoakes
@ 2023-05-17  7:29         ` Jan Kara
  2023-05-17  7:40           ` Lorenzo Stoakes
  2023-05-17  7:42           ` Christoph Hellwig
  0 siblings, 2 replies; 20+ messages in thread
From: Jan Kara @ 2023-05-17  7:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jason Gunthorpe, Kirill A . Shutemov, linux-mm, linux-kernel,
	Andrew Morton, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	John Hubbard, Jan Kara, Pavel Begunkov, Mika Penttila,
	David Hildenbrand, Dave Chinner, Theodore Ts'o, Peter Xu,
	Matthew Rosato, Paul E . McKenney, Christian Borntraeger

On Mon 15-05-23 14:07:57, Lorenzo Stoakes wrote:
> On Mon, May 15, 2023 at 09:12:49AM -0300, Jason Gunthorpe wrote:
> > On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > > Jason will have some thoughts on this I'm sure. I guess the key question
> > > here is - is it actually feasible for this to work at all? Once we
> > > establish that, the rest are details :)
> >
> > Surely it is, but like Ted said, the FS folks are not interested and
> > they are at least half the solution..
> 
> :'(

Well, I'd phrase this a bit differently - it is a difficult sell to fs
maintainers that they should significantly complicate writeback code / VFS
with bounce page handling etc. for a thing that is a rarely used corner
case. So if we can get away with forbidding long-term pins, then that's the
easiest solution. Dealing with short-term pins is easier as we can just
wait for unpinning which is implementable in a localized manner.

> > The FS also has to actively not write out the page while it cannot be
> > write protected unless it copies the data to a stable page. The block
> > stack needs the source data to be stable to do checksum/parity/etc
> > stuff. It is a complicated subject.
> 
> Yes my sense was that being able to write arbitrarily to these pages _at
> all_ was a big issue, not only the dirty tracking aspect.

Yes.

> I guess at some level letting filesystems have such total flexibility as to
> how they implement things leaves us in a difficult position.

I'm not sure what you mean by "total flexibility" here. In my opinion it is
also about how HW performs checksumming etc.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-17  7:29         ` Jan Kara
@ 2023-05-17  7:40           ` Lorenzo Stoakes
  2023-05-17  7:43             ` Christoph Hellwig
  2023-05-17  7:42           ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-17  7:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jason Gunthorpe, Kirill A . Shutemov, linux-mm, linux-kernel,
	Andrew Morton, Jens Axboe, Matthew Wilcox, Dennis Dalessandro,
	Leon Romanovsky, Christian Benvenuti, Nelson Escobar,
	Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	John Hubbard, Pavel Begunkov, Mika Penttila, David Hildenbrand,
	Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato,
	Paul E . McKenney, Christian Borntraeger

On Wed, May 17, 2023 at 09:29:20AM +0200, Jan Kara wrote:
> On Mon 15-05-23 14:07:57, Lorenzo Stoakes wrote:
> > On Mon, May 15, 2023 at 09:12:49AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 15, 2023 at 12:16:21PM +0100, Lorenzo Stoakes wrote:
> > > > Jason will have some thoughts on this I'm sure. I guess the key question
> > > > here is - is it actually feasible for this to work at all? Once we
> > > > establish that, the rest are details :)
> > >
> > > Surely it is, but like Ted said, the FS folks are not interested and
> > > they are at least half the solution..
> >
> > :'(
>
> Well, I'd phrase this a bit differently - it is a difficult sell to fs
> maintainers that they should significantly complicate writeback code / VFS
> with bounce page handling etc. for a thing that is a rarely used corner
> case. So if we can get away with forbidding long-term pins, then that's the
> easiest solution. Dealing with short-term pins is easier as we can just
> wait for unpinning which is implementable in a localized manner.
>

Totally understandable. Unfortunately, I feel it's a case of something we
should simply never have allowed.

> > > The FS also has to actively not write out the page while it cannot be
> > > write protected unless it copies the data to a stable page. The block
> > > stack needs the source data to be stable to do checksum/parity/etc
> > > stuff. It is a complicated subject.
> >
> > Yes my sense was that being able to write arbitrarily to these pages _at
> > all_ was a big issue, not only the dirty tracking aspect.
>
> Yes.
>
> > I guess at some level letting filesystems have such total flexibility as to
> > how they implement things leaves us in a difficult position.
>
> I'm not sure what you mean by "total flexibility" here. In my opinion it is
> also about how HW performs checksumming etc.

I mean to say *_ops allow a lot of flexibility in how things are
handled. Certainly checksumming is a great example but in theory an
arbitrary filesystem could be doing, well, anything and always assuming
that only userland mappings should be modifying the underlying data.

>
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-17  7:29         ` Jan Kara
  2023-05-17  7:40           ` Lorenzo Stoakes
@ 2023-05-17  7:42           ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-17  7:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Lorenzo Stoakes, Jason Gunthorpe, Kirill A . Shutemov, linux-mm,
	linux-kernel, Andrew Morton, Jens Axboe, Matthew Wilcox,
	Dennis Dalessandro, Leon Romanovsky, Christian Benvenuti,
	Nelson Escobar, Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	John Hubbard, Pavel Begunkov, Mika Penttila, David Hildenbrand,
	Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato,
	Paul E . McKenney, Christian Borntraeger

On Wed, May 17, 2023 at 09:29:20AM +0200, Jan Kara wrote:
> > > Surely it is, but like Ted said, the FS folks are not interested and
> > > they are at least half the solution..
> > 
> > :'(
> 
> Well, I'd phrase this a bit differently - it is a difficult sell to fs
> maintainers that they should significantly complicate writeback code / VFS
> with bounce page handling etc. for a thing that is a rarely used corner
> case. So if we can get away with forbidding long-term pins, then that's the
> easiest solution. Dealing with short-term pins is easier as we can just
> wait for unpinning which is implementable in a localized manner.

Full agreement here.  The whole concept of supporting writeback for
long term mappings does not make much sense.

> > > The FS also has to actively not write out the page while it cannot be
> > > write protected unless it copies the data to a stable page. The block
> > > stack needs the source data to be stable to do checksum/parity/etc
> > > stuff. It is a complicated subject.
> > 
> > Yes my sense was that being able to write arbitrarily to these pages _at
> > all_ was a big issue, not only the dirty tracking aspect.
> 
> Yes.
> 
> > I guess at some level letting filesystems have such total flexibility as to
> > how they implement things leaves us in a difficult position.
> 
> I'm not sure what you mean by "total flexibility" here. In my opinion it is
> also about how HW performs checksumming etc.

I have no idea what total flexibility is even supposed to be.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-17  7:40           ` Lorenzo Stoakes
@ 2023-05-17  7:43             ` Christoph Hellwig
  2023-05-17  7:55               ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-17  7:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jan Kara, Jason Gunthorpe, Kirill A . Shutemov, linux-mm,
	linux-kernel, Andrew Morton, Jens Axboe, Matthew Wilcox,
	Dennis Dalessandro, Leon Romanovsky, Christian Benvenuti,
	Nelson Escobar, Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	John Hubbard, Pavel Begunkov, Mika Penttila, David Hildenbrand,
	Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato,
	Paul E . McKenney, Christian Borntraeger

On Wed, May 17, 2023 at 08:40:26AM +0100, Lorenzo Stoakes wrote:
> > I'm not sure what you mean by "total flexibility" here. In my opinion it is
> > also about how HW performs checksumming etc.
> 
> I mean to say *_ops allow a lot of flexibility in how things are
> handled. Certainly checksumming is a great example but in theory an
> arbitrary filesystem could be doing, well, anything and always assuming
> that only userland mappings should be modifying the underlying data.

File systems need a way to track when a page is dirtied so that it can
be written back.  Not much to do with flexibility.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-17  7:43             ` Christoph Hellwig
@ 2023-05-17  7:55               ` Lorenzo Stoakes
  2023-05-17  8:10                 ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-17  7:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Jason Gunthorpe, Kirill A . Shutemov, linux-mm,
	linux-kernel, Andrew Morton, Jens Axboe, Matthew Wilcox,
	Dennis Dalessandro, Leon Romanovsky, Christian Benvenuti,
	Nelson Escobar, Bernard Metzler, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter, Bjorn Topel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Christian Brauner, Richard Cochran, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	linux-fsdevel, linux-perf-users, netdev, bpf, Oleg Nesterov,
	John Hubbard, Pavel Begunkov, Mika Penttila, David Hildenbrand,
	Dave Chinner, Theodore Ts'o, Peter Xu, Matthew Rosato,
	Paul E . McKenney, Christian Borntraeger

On Wed, May 17, 2023 at 12:43:34AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 08:40:26AM +0100, Lorenzo Stoakes wrote:
> > > I'm not sure what you mean by "total flexibility" here. In my opinion it is
> > > also about how HW performs checksumming etc.
> >
> > I mean to say *_ops allow a lot of flexibility in how things are
> > handled. Certainly checksumming is a great example but in theory an
> > arbitrary filesystem could be doing, well, anything and always assuming
> > that only userland mappings should be modifying the underlying data.
>
> File systems need a way to track when a page is dirtied so that it can
> be written back.  Not much to do with flexibility.

I'll try to take this in good faith because... yeah. I do get that, I mean
I literally created a repro for this situation and referenced in the commit
msg and comments this precise problem in my patch series that
addresses... this problem :P

Perhaps I'm not being clear but it was simply my intent to highlight that
yes this is the primary problem but ALSO GUP writing to ostensibly 'clean'
pages 'behind the back' of a fs is _also_ a problem.

Not least for checksumming (e.g. assume hw-reported checksum for a block ==
checksum derived from page cache) but, because VFS allows a great deal of
flexibility in how filesystems are implemented, perhaps in other respects
we haven't considered.

So I just wanted to highlight (happy to be corrected if I'm wrong) that the
PRIMARY problem is the dirty tracking breaking, but it also strikes me that
arbitrary writes to 'clean' pages in the background is one too.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-17  7:55               ` Lorenzo Stoakes
@ 2023-05-17  8:10                 ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-17  8:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Christoph Hellwig, Jan Kara, Jason Gunthorpe,
	Kirill A . Shutemov, linux-mm, linux-kernel, Andrew Morton,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, John Hubbard,
	Pavel Begunkov, Mika Penttila, David Hildenbrand, Dave Chinner,
	Theodore Ts'o, Peter Xu, Matthew Rosato, Paul E . McKenney,
	Christian Borntraeger

On Wed, May 17, 2023 at 08:55:27AM +0100, Lorenzo Stoakes wrote:
> I'll try to take this in good faith because... yeah. I do get that, I mean
> I literally created a repro for this situation and referenced in the commit
> msg and comments this precise problem in my patch series that
> addresses... this problem :P
> 
> Perhaps I'm not being clear but it was simply my intent to highlight that
> yes this is the primary problem but ALSO GUP writing to ostensibly 'clean'
> pages 'behind the back' of a fs is _also_ a problem.

Yes, it absolutely is a problem if that happens.  But we can just
fix it in the kernel using the:

   lock_page()
   copy data
   set_page_dirty_locked()
   unlock_page();

pattern, and we should have covered every place that did so in tree.
But there's no good way to verify it except for regular code audits.
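
A minimal user-space sketch of why that pattern matters, assuming a
single-threaded toy model (struct model_page and every model_*/write_*
function below are hypothetical stand-ins, not the kernel API): writeback
only flushes pages it believes are dirty, so a write that skips the dirty
marking never reaches the backing store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SZ 16 /* tiny stand-in for a page, for illustration only */

struct model_page {
	bool locked;
	bool dirty;
	char data[PAGE_SZ]; /* page cache contents */
	char disk[PAGE_SZ]; /* backing store contents */
};

static void model_lock_page(struct model_page *p)
{
	assert(!p->locked); /* single-threaded model: lock is never contended */
	p->locked = true;
}

static void model_unlock_page(struct model_page *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Writeback only flushes pages the filesystem believes are dirty. */
static void model_writeback(struct model_page *p)
{
	model_lock_page(p);
	if (p->dirty) {
		memcpy(p->disk, p->data, PAGE_SZ);
		p->dirty = false;
	}
	model_unlock_page(p);
}

/* Broken: modifies the page contents without any dirty tracking. */
static void write_behind_back(struct model_page *p, const char *src)
{
	snprintf(p->data, PAGE_SZ, "%s", src);
}

/* The lock / copy / mark dirty / unlock pattern. */
static void write_with_pattern(struct model_page *p, const char *src)
{
	model_lock_page(p);
	snprintf(p->data, PAGE_SZ, "%s", src); /* copy data */
	p->dirty = true;                       /* mark dirty while locked */
	model_unlock_page(p);
}
```

In this model, data written via write_behind_back() is skipped by
model_writeback(), whereas data written via write_with_pattern() is flushed
and the page ends up clean again.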

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default
  2023-05-15 11:31     ` Lorenzo Stoakes
@ 2023-05-17  8:26       ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2023-05-17  8:26 UTC (permalink / raw)
  To: Lorenzo Stoakes, Christoph Hellwig
  Cc: linux-mm, linux-kernel, Andrew Morton, Jason Gunthorpe,
	Jens Axboe, Matthew Wilcox, Dennis Dalessandro, Leon Romanovsky,
	Christian Benvenuti, Nelson Escobar, Bernard Metzler,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Bjorn Topel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Brauner,
	Richard Cochran, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, linux-fsdevel,
	linux-perf-users, netdev, bpf, Oleg Nesterov, Jason Gunthorpe,
	John Hubbard, Jan Kara, Kirill A . Shutemov, Pavel Begunkov,
	Mika Penttila, Dave Chinner, Theodore Ts'o, Peter Xu,
	Matthew Rosato, Paul E . McKenney, Christian Borntraeger

On 15.05.23 13:31, Lorenzo Stoakes wrote:
> On Sun, May 14, 2023 at 10:14:46PM -0700, Christoph Hellwig wrote:
>> On Sun, May 14, 2023 at 08:20:04PM +0100, Lorenzo Stoakes wrote:
>>> As discussed at LSF/MM, on the flight over I wrote a little repro [0] which
>>> reliably triggers the ext4 warning by recreating the scenario described
>>> above, using a small userland program and kernel module.
>>>
>>> This code is not perfect (plane code :) but does seem to do the job
>>> adequately, also obviously this should only be run in a VM environment
>>> where data loss is acceptable (in my case a small qemu instance).
>>
>> It would be really awesome if you could wire it up with and submit it
>> to xfstests.
> 
> Sure, am happy to take a look at that! Also happy if David finds it useful in any
> way for his unit tests.

I played with a simple selftest that would reuse the existing gup_test 
infrastructure (adding PIN_LONGTERM_TEST_WRITE), and try reproducing an 
actual data corruption.

So far, I was not able to reproduce any corruption easily without your 
patches, because d824ec2a1546 ("mm: do not reclaim private data from 
pinned page") seems to mitigate most of it.

So ... before my patches (adding PIN_LONGTERM_TEST_WRITE) I cannot test 
it from a selftest, with d824ec2a1546 ("mm: do not reclaim private data 
from pinned page") I cannot reproduce and with your patches long-term 
pinning just fails.

Long story short: I'll most probably not add such a test but instead 
keep testing that long-term pinning works/fails now as expected, based 
on the FS type.

> 
> The kernel module interface is a bit sketchy (it takes a user address which it
> blindly pins for you) so it's not something that should be run in any unsafe
> environment but as long as we are ok with that :)

I can submit the PIN_LONGTERM_TEST_WRITE extension, which would allow 
testing with a stock kernel that has the module compiled in. It won't allow 
!longterm, though (it would be kind-of hacky to have !longterm 
controlled by user space, even if it's a GUP test module).

Finding an actual reproducer using existing pinning functionality would 
be preferred. For example, using O_DIRECT (should be possible even 
before it starts using FOLL_PIN instead of FOLL_GET). That would be 
highly racy then, but most probably not impossible.

Such (racy) tests are not a good fit for selftests.

Maybe I'll have a try later to reproduce with O_DIRECT.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-05-17  8:28 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-04 21:27 [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default Lorenzo Stoakes
2023-05-04 21:27 ` [PATCH v9 1/3] mm/mmap: separate writenotify and dirty tracking logic Lorenzo Stoakes
2023-05-04 21:27 ` [PATCH v9 2/3] mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings Lorenzo Stoakes
2023-05-04 21:27 ` [PATCH v9 3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast " Lorenzo Stoakes
2023-05-05 20:21 ` [PATCH v9 0/3] mm/gup: disallow GUP writing to file-backed mappings by default David Hildenbrand
2023-05-05 21:12   ` Lorenzo Stoakes
2023-05-14 19:20 ` Lorenzo Stoakes
2023-05-15  5:14   ` Christoph Hellwig
2023-05-15 11:31     ` Lorenzo Stoakes
2023-05-17  8:26       ` David Hildenbrand
2023-05-15 11:03 ` Kirill A . Shutemov
2023-05-15 11:16   ` Lorenzo Stoakes
2023-05-15 12:12     ` Jason Gunthorpe
2023-05-15 13:07       ` Lorenzo Stoakes
2023-05-17  7:29         ` Jan Kara
2023-05-17  7:40           ` Lorenzo Stoakes
2023-05-17  7:43             ` Christoph Hellwig
2023-05-17  7:55               ` Lorenzo Stoakes
2023-05-17  8:10                 ` Christoph Hellwig
2023-05-17  7:42           ` Christoph Hellwig
