linux-mm.kvack.org archive mirror
* [PATCH 0/4] RFC: userfaultfd remap
@ 2019-01-12  0:36 Blake Caldwell
  2019-01-12  0:36 ` [PATCH 1/4] userfaultfd: UFFDIO_REMAP: rmap preparation Blake Caldwell
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Blake Caldwell @ 2019-01-12  0:36 UTC (permalink / raw)
  To: blake.caldwell
  Cc: rppt, xemul, akpm, mike.kravetz, kirill.shutemov, linux-mm, aarcange

Hello,

Since userfaultfd remap functionality was first proposed by Andrea
Arcangeli [1], a new use case has been demonstrated for removing pages
from the userfaultfd registered region. FluidMem [2] is a system for
expanding or limiting the resident size of a VM using a remote key-value
store as backing storage instead of swap space. It runs on the hypervisor
and uses userfaultfd to manage the memory regions malloc'd by qemu.
Since FluidMem maintains a constant resident size using an LRU list, it
must evict pages to the remote key-value store to make room for pages that
were just faulted in. This requires UFFDIO_REMAP so that the
non-cooperative userspace page fault handler can remove pages from the
registered region.
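
As a rough sketch of how this eviction path could be driven from
userspace (hypothetical code: it assumes a kernel with this series
applied, so that struct uffdio_remap and UFFDIO_REMAP from patch 2
exist, and it uses the out direction introduced by patch 4; the
descriptor, addresses and buffer management are illustrative):

    /*
     * Move one page out of the userfaultfd-registered guest region (src)
     * into a scratch buffer in the monitor's address space (dst), so its
     * contents can then be pushed to the remote key-value store.
     */
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    static int evict_page(int uffd, uint64_t guest_addr, void *scratch,
                          uint64_t page_size)
    {
            struct uffdio_remap remap;

            memset(&remap, 0, sizeof(remap));
            remap.src = guest_addr;            /* page inside the registered region */
            remap.dst = (uintptr_t)scratch;    /* local anonymous buffer */
            remap.len = page_size;
            /* removing memory, not resolving a fault: no wakeup needed */
            remap.mode = UFFDIO_REMAP_MODE_DONTWAKE;

            if (ioctl(uffd, UFFDIO_REMAP, &remap))
                    return -1;
            /* scratch now holds the evicted page: write it to the store */
            return 0;
    }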

The VM shadow page tables must be kept in sync after a remapping, so
mmu_notifier_invalidate_range_(start/end) calls are made as necessary.

FluidMem achieves page fault latencies to a remote key-value store that are
as fast as swap backed by DRAM (/dev/pmem0) and 77% faster than swap on an
SSD. pmbench [3] was used to measure page fault latencies with a 4 GB
working set size, within a VM using 1 GB of DRAM (20% local):

  FluidMem (RAMCloud): 24.87 microseconds
  Swap (pmem DRAM): 26.34 microseconds
  Swap (NVMe over Fabrics): 41.73 microseconds
  Swap (SSD): 106.56 microseconds

For real applications, FluidMem has the additional benefit of allowing
unused kernel pages to be removed from DRAM and stored in the backend
storage, making room for additional application pages in local DRAM.
This increases the useful memory capacity of the VM.

The main complexity of this code is in the rmap handling: page->index is
overwritten when the page is moved to a different vma with a different
vma->vm_pgoff. Overwriting page->index requires the rmap change, and it
is only possible when the page_mapcount is 1.

Changes since [1]:
 - Changed the direction supported by UFFDIO_REMAP to the OUT direction 
   needed by FluidMem. The IN direction is not necessary, as UFFDIO_COPY
   should be used instead because it doesn't require a TLB flush.
 - The code has been kept up to date by Andrea in the userfault branch of [4].

[1] https://lkml.org/lkml/2015/3/5/576
[2] Caldwell, Blake, Youngbin Im, Sangtae Ha, Richard Han, and
    Eric Keller. "FluidMem: Memory as a Service for the Datacenter."
    arXiv preprint arXiv:1707.07780 (2017).
    https://github.com/blakecaldwell/fluidmem
[3] Yang, Jisoo, and Julian Seymour. "Pmbench: A Micro-Benchmark for
    Profiling Paging Performance on a System with Low-Latency SSDs."
    Information Technology-New Generations. Springer, Cham, 2018. 627-633.
    https://bitbucket.org/jisooy/pmbench/src
[4] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Andrea Arcangeli (3):
  userfaultfd: UFFDIO_REMAP: rmap preparation
  userfaultfd: UFFDIO_REMAP uABI
  userfaultfd: UFFDIO_REMAP

Blake Caldwell (1):
  userfaultfd: change the direction for UFFDIO_REMAP to out

 Documentation/admin-guide/mm/userfaultfd.rst |  10 +
 fs/userfaultfd.c                             |  49 +++
 include/linux/userfaultfd_k.h                |  17 +
 include/uapi/linux/userfaultfd.h             |  25 +-
 mm/huge_memory.c                             | 117 ++++++
 mm/khugepaged.c                              |   3 +
 mm/rmap.c                                    |  13 +
 mm/userfaultfd.c                             | 536 +++++++++++++++++++++++++++
 8 files changed, 769 insertions(+), 1 deletion(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/4] userfaultfd: UFFDIO_REMAP: rmap preparation
  2019-01-12  0:36 [PATCH 0/4] RFC: userfaultfd remap Blake Caldwell
@ 2019-01-12  0:36 ` Blake Caldwell
  2019-01-12  0:36 ` [PATCH 2/4] userfaultfd: UFFDIO_REMAP uABI Blake Caldwell
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Blake Caldwell @ 2019-01-12  0:36 UTC (permalink / raw)
  To: blake.caldwell
  Cc: rppt, xemul, akpm, mike.kravetz, kirill.shutemov,
	Andrea Arcangeli, linux-mm

From: Andrea Arcangeli <aarcange@redhat.com>

As far as the rmap code is concerned, UFFDIO_REMAP only alters
page->mapping and page->index, and it does so while holding the page
lock. However, page_referenced() does rmap walks without taking the
page lock first, so page_lock_anon_vma_read() must be updated to
re-check that page->mapping didn't change after we obtained the
anon_vma read lock.

UFFDIO_REMAP takes the anon_vma lock for writing before altering the
page->mapping, so if the page->mapping is still the same after
obtaining the anon_vma read lock (without the page lock), the rmap
walks can go ahead safely (and UFFDIO_REMAP will wait for the rmap
walk to complete before proceeding).

UFFDIO_REMAP serializes against itself with the page lock.

All other places that take the anon_vma lock while holding the
mmap_sem for writing don't need to check whether page->mapping has
changed after taking the anon_vma lock, regardless of the page lock,
because UFFDIO_REMAP holds the mmap_sem for reading.

There's one constraint enforced to allow this simplification: the
source pages passed to UFFDIO_REMAP must be mapped only in one vma,
but this constraint is an acceptable tradeoff for UFFDIO_REMAP
users.

The source addresses passed to UFFDIO_REMAP should be set as
VM_DONTCOPY with MADV_DONTFORK to avoid any risk of the mapcount of
the pages increasing if some thread of the process forks before
UFFDIO_REMAP runs.

Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/rmap.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc2..d8f228d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -510,6 +510,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
+repeat:
 	rcu_read_lock();
 	anon_mapping = (unsigned long)READ_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -548,6 +549,18 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
+	/*
+	 * Check if UFFDIO_REMAP changed the anon_vma. This is needed
+	 * because we don't assume the page was locked.
+	 */
+	if (unlikely((unsigned long) READ_ONCE(page->mapping) !=
+		     anon_mapping)) {
+		anon_vma_unlock_read(anon_vma);
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto repeat;
+	}
+
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/4] userfaultfd: UFFDIO_REMAP uABI
  2019-01-12  0:36 [PATCH 0/4] RFC: userfaultfd remap Blake Caldwell
  2019-01-12  0:36 ` [PATCH 1/4] userfaultfd: UFFDIO_REMAP: rmap preparation Blake Caldwell
@ 2019-01-12  0:36 ` Blake Caldwell
  2019-01-12  0:36 ` [PATCH 3/4] userfaultfd: UFFDIO_REMAP Blake Caldwell
  2019-01-12  0:36 ` [PATCH 4/4] userfaultfd: change the direction for UFFDIO_REMAP to out Blake Caldwell
  3 siblings, 0 replies; 7+ messages in thread
From: Blake Caldwell @ 2019-01-12  0:36 UTC (permalink / raw)
  To: blake.caldwell
  Cc: rppt, xemul, akpm, mike.kravetz, kirill.shutemov,
	Andrea Arcangeli, linux-mm

From: Andrea Arcangeli <aarcange@redhat.com>

This implements the uABI of UFFDIO_REMAP.

Notably, one mode bitflag is also forwarded to (and in turn known by)
the low-level remap_pages method.
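
For illustration, userspace would drive the new ioctl roughly as
follows (a hedged sketch: it assumes the uffdio_remap definitions added
by this patch on a kernel with the series applied, and it follows the
retry-on-short-remap convention documented with remap_pages() in the
next patch):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* Remap [src, src+len) to [dst, dst+len), retrying after a short remap. */
    static int remap_range(int uffd, uint64_t dst, uint64_t src, uint64_t len)
    {
            while (len) {
                    struct uffdio_remap req = {
                            .dst = dst, .src = src, .len = len, .mode = 0,
                    };

                    if (!ioctl(uffd, UFFDIO_REMAP, &req))
                            return 0;          /* whole range moved */
                    if (req.remap <= 0)
                            return -1;         /* nothing moved: hard error */
                    /* short remap: skip the bytes already moved and retry */
                    dst += req.remap;
                    src += req.remap;
                    len -= req.remap;
            }
            return 0;
    }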

Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/userfaultfd.h | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c..a0d6106 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -34,7 +34,8 @@
 #define UFFD_API_RANGE_IOCTLS			\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
-	 (__u64)1 << _UFFDIO_ZEROPAGE)
+	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
+	 (__u64)1 << _UFFDIO_REMAP)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY)
@@ -52,6 +53,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_REMAP			(0x05)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -68,6 +70,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_REMAP		_IOWR(UFFDIO, _UFFDIO_REMAP,	\
+				      struct uffdio_remap)
 
 /* read() structure */
 struct uffd_msg {
@@ -231,4 +235,23 @@ struct uffdio_zeropage {
 	__s64 zeropage;
 };
 
+struct uffdio_remap {
+	__u64 dst;
+	__u64 src;
+	__u64 len;
+	/*
+	 * Especially when used to atomically remove memory from the
+	 * address space, the wake on the dst range is not needed.
+	 */
+#define UFFDIO_REMAP_MODE_DONTWAKE		((__u64)1<<0)
+#define UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES	((__u64)1<<1)
+	__u64 mode;
+
+	/*
+	 * "remap" is written by the ioctl and must be at the end: the
+	 * copy_from_user will not read the last 8 bytes.
+	 */
+	__s64 remap;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 3/4] userfaultfd: UFFDIO_REMAP
  2019-01-12  0:36 [PATCH 0/4] RFC: userfaultfd remap Blake Caldwell
  2019-01-12  0:36 ` [PATCH 1/4] userfaultfd: UFFDIO_REMAP: rmap preparation Blake Caldwell
  2019-01-12  0:36 ` [PATCH 2/4] userfaultfd: UFFDIO_REMAP uABI Blake Caldwell
@ 2019-01-12  0:36 ` Blake Caldwell
  2019-01-12  0:36 ` [PATCH 4/4] userfaultfd: change the direction for UFFDIO_REMAP to out Blake Caldwell
  3 siblings, 0 replies; 7+ messages in thread
From: Blake Caldwell @ 2019-01-12  0:36 UTC (permalink / raw)
  To: blake.caldwell
  Cc: rppt, xemul, akpm, mike.kravetz, kirill.shutemov,
	Andrea Arcangeli, linux-mm

From: Andrea Arcangeli <aarcange@redhat.com>

This remap ioctl allows a page to be moved atomically into or out of a
userfaultfd address space. It's more expensive than "copy" (and of
course more expensive than "zerofill") as it requires a TLB flush on
the source range for each ioctl, which is an expensive operation on
SMP. Especially when copying only a few pages at a time, copying
without a TLB flush is faster.

Co-Developed-by: Blake Caldwell <blake.caldwell@colorado.edu>
Signed-off-by: Blake Caldwell <blake.caldwell@colorado.edu>
---
 fs/userfaultfd.c              |  49 ++++
 include/linux/userfaultfd_k.h |  17 ++
 mm/huge_memory.c              | 117 +++++++++
 mm/khugepaged.c               |   3 +
 mm/userfaultfd.c              | 536 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 722 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 89800fc..cf68cdb 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1792,6 +1792,52 @@ static inline unsigned int uffd_ctx_features(__u64 user_features)
 	return (unsigned int)user_features;
 }
 
+static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
+			     unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_remap uffdio_remap;
+	struct uffdio_remap __user *user_uffdio_remap;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_remap = (struct uffdio_remap __user *) arg;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_remap, user_uffdio_remap,
+			   /* don't copy "remap" last field */
+			   sizeof(uffdio_remap)-sizeof(__s64)))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (uffdio_remap.mode & ~(UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES|
+				  UFFDIO_REMAP_MODE_DONTWAKE))
+		goto out;
+
+	ret = remap_pages(ctx->mm, current->mm,
+			  uffdio_remap.dst, uffdio_remap.src,
+			  uffdio_remap.len, uffdio_remap.mode);
+	if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_remap.mode & UFFDIO_REMAP_MODE_DONTWAKE)) {
+		range.start = uffdio_remap.dst;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_remap.len ? 0 : -EAGAIN;
+out:
+	return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -1861,6 +1907,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_ZEROPAGE:
 		ret = userfaultfd_zeropage(ctx, arg);
 		break;
+	case UFFDIO_REMAP:
+		ret = userfaultfd_remap(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 37c9eba..56fb0e6 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -38,6 +38,23 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
 			      unsigned long len,
 			      bool *mmap_changing);
 
+/* remap_pages */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern ssize_t remap_pages(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   unsigned long dst_start,
+			   unsigned long src_start,
+			   unsigned long len, __u64 flags);
+extern int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+				struct mm_struct *src_mm,
+				pmd_t *dst_pmd, pmd_t *src_pmd,
+				pmd_t dst_pmdval,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faf357e..05f00d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1941,6 +1941,123 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+/*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+			 struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd,
+			 pmd_t dst_pmdval,
+			 struct vm_area_struct *dst_vma,
+			 struct vm_area_struct *src_vma,
+			 unsigned long dst_addr,
+			 unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	struct mmu_notifier_range range;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	mmu_notifier_range_init(&range, src_mm, src_addr,
+				src_addr + HPAGE_PMD_SIZE);
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(src_mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&src_mm->mmap_sem));
+	BUG_ON(!rwsem_is_locked(&dst_mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(&range);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(&range);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(&range);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	WRITE_ONCE(src_page->mapping, (struct address_space *) dst_anon_vma);
+	WRITE_ONCE(src_page->index, linear_page_index(dst_vma, dst_addr));
+
+	if (!pmd_same(pmdp_huge_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
+	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+	if (dst_mm != src_mm) {
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	}
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(&range);
+	return 0;
+}
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * Returns page table lock pointer if a given pmd maps a thp, NULL otherwise.
  *
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4f01733..c91a748 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1002,6 +1002,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 *
+	 * UFFDIO_REMAP is prevented to race as well thanks to the
+	 * mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	result = hugepage_vma_revalidate(mm, address, &vma);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a7..bbe7189 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -615,3 +615,539 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
 {
 	return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
 }
+
+
+void double_pt_lock(spinlock_t *ptl1,
+		    spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+	else
+		__acquire(ptl2);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+		      spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+	else
+		__release(ptl2);
+}
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from the src pte to the dst pte if possible, and return 0 if the
+ * page was moved, or a negative error otherwise.
+ */
+static int remap_pages_pte(struct mm_struct *dst_mm,
+			   struct mm_struct *src_mm,
+			   pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+			   struct vm_area_struct *dst_vma,
+			   struct vm_area_struct *src_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   spinlock_t *dst_ptl,
+			   spinlock_t *src_ptl,
+			   __u64 mode)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	struct mmu_notifier_range range;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap an hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		mmu_notifier_range_init(&range, src_mm, src_addr,
+					src_addr + PAGE_SIZE);
+
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		mmu_notifier_invalidate_range_start(&range);
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		WRITE_ONCE(src_page->mapping,
+			   (struct address_space *) dst_anon_vma);
+		WRITE_ONCE(src_page->index, linear_page_index(dst_vma,
+							      dst_addr));
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_dst_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		mmu_notifier_invalidate_range_end(&range);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+	} else {
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(src_mm, src_pmd,
+						     src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		/*
+		 * It is fine if COUNT_CONTINUED is returned here; there
+		 * is no need to follow all the swap continuations to
+		 * check against the exact value 1.
+		 */
+		if (__swp_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(src_mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(dst_mm, dst_addr, dst_pte, orig_src_pte);
+
+		if (dst_mm != src_mm) {
+			inc_mm_counter(dst_mm, MM_ANONPAGES);
+			dec_mm_counter(src_mm, MM_ANONPAGES);
+		}
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+/**
+ * remap_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * remap_pages() remaps arbitrary anonymous pages atomically in zero
+ * copy. It only works on non shared anonymous pages because those can
+ * be relocated without generating non linear anon_vmas in the rmap
+ * code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will be registered with userfaultfd
+ * (UFFDIO_REGISTER) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread resolving the userland page fault will receive the
+ * faulting page in the source vma through the network, storage or
+ * any other I/O device (MADV_DONTFORK in the source vma prevents
+ * remap_pages() from failing with -EBUSY if the process forks
+ * before remap_pages() is called), then it will call remap_pages()
+ * to map the page at the faulting address in the destination
+ * vma.
+ *
+ * This userfaultfd command works purely via pagetables, so it's the
+ * most efficient way to move physical non shared anonymous pages
+ * across different virtual addresses. Unlike mremap()/mmap()/munmap()
+ * it does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical between the
+ * source and destination vmas.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_pages() will fail respectively with -ENOENT or -EEXIST. This
+ * provides a very strict behavior to avoid any chance of memory
+ * corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_pages() on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this command.
+ *
+ * The command retval will be "len" if successful. The command
+ * however can be interrupted by fatal signals or errors. If
+ * interrupted it will return the number of bytes successfully
+ * remapped before the interruption if any, or the negative error if
+ * none. It will never return zero. Either it will return an error or
+ * an amount of bytes successfully moved. If the retval reports a
+ * "short" remap, the remap_pages() command should be repeated by
+ * userland with src+retval, dst+retval, len-retval if it wants to
+ * know about the error that interrupted it.
+ *
+ * The UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES flag can be specified to
+ * prevent -ENOENT errors from materializing if there are holes in
+ * the source virtual range that is being remapped. The holes will
+ * be accounted as successfully remapped in the retval of the
+ * command. This is mostly useful to remap hugepage-aligned virtual
+ * regions without knowing whether there are transparent hugepages
+ * in the regions or not, while avoiding the risk of having to split
+ * the hugepmd during the remap.
+ *
+ * Any rmap walk that takes the anon_vma locks without first
+ * obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon) will have to verify whether the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed, it should release the lock and retry obtaining a new
+ * anon_vma, because that means the anon_vma was changed by
+ * remap_pages() before the lock could be obtained. This is the only
+ * additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		    unsigned long dst_start, unsigned long src_start,
+		    unsigned long len, __u64 mode)
+{
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	ssize_t moved = 0;
+
+	/*
+	 * Sanitize the command parameters:
+	 */
+	BUG_ON(src_start & ~PAGE_MASK);
+	BUG_ON(dst_start & ~PAGE_MASK);
+	BUG_ON(len & ~PAGE_MASK);
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	BUG_ON(src_start + len <= src_start);
+	BUG_ON(dst_start + len <= dst_start);
+
+	/*
+	 * Because these are read semaphores there's no risk of lock
+	 * inversion.
+	 */
+	down_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		down_read(&src_mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(src_mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(dst_mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Be strict and only allow remap_pages if either the src or
+	 * dst range is registered in the userfaultfd to prevent
+	 * userland errors going unnoticed. As far as the VM
+	 * consistency is concerned, it would be perfectly safe to
+	 * remove this check, but there's no useful usage for
+	 * remap_pages outside of userfaultfd registered ranges. This
+	 * is after all why it is an ioctl belonging to the
+	 * userfaultfd and not a syscall.
+	 *
+	 * Allow both vmas to be registered in the userfaultfd, just
+	 * in case somebody finds a way to make such a case useful.
+	 * Normally only one of the two vmas would be registered in
+	 * the userfaultfd.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx &&
+	    !src_vma->vm_userfaultfd_ctx.ctx)
+		goto out;
+
+	/*
+	 * FIXME: only allow remapping across anonymous vmas,
+	 * tmpfs should be added.
+	 */
+	if (src_vma->vm_ops || dst_vma->vm_ops)
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has an anon_vma, or this page
+	 * would get a NULL anon_vma when moved into the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len;) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(src_mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(src_mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		ptl = pmd_trans_huge_lock(src_pmd, src_vma);
+		if (ptl) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_pmd(src_vma, src_pmd, src_addr);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_pages_huge_pmd(dst_mm,
+							   src_mm,
+							   dst_pmd,
+							   src_pmd,
+							   dst_pmdval,
+							   dst_vma,
+							   src_vma,
+							   dst_addr,
+							   src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(src_mm,
+							 src_pmd))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We held the mmap_sem for reading so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If an huge pmd materialized from under us fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(dst_mm, dst_pmd);
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+
+		err = remap_pages_pte(dst_mm, src_mm,
+				      dst_pte, src_pte, src_pmd,
+				      dst_vma, src_vma,
+				      dst_addr, src_addr,
+				      dst_ptl, src_ptl, mode);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		up_read(&src_mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 4/4] userfaultfd: change the direction for UFFDIO_REMAP to out
  2019-01-12  0:36 [PATCH 0/4] RFC: userfaultfd remap Blake Caldwell
                   ` (2 preceding siblings ...)
  2019-01-12  0:36 ` [PATCH 3/4] userfaultfd: UFFDIO_REMAP Blake Caldwell
@ 2019-01-12  0:36 ` Blake Caldwell
  2019-01-20 21:07   ` Mike Rapoport
  3 siblings, 1 reply; 7+ messages in thread
From: Blake Caldwell @ 2019-01-12  0:36 UTC (permalink / raw)
  To: blake.caldwell
  Cc: rppt, xemul, akpm, mike.kravetz, kirill.shutemov, linux-mm, aarcange

Moving a page out of a userfaultfd registered region and into a userland
anonymous vma is needed by the use case of uncooperatively limiting the
resident size of the userfaultfd region. Reverse the direction of the
original userfaultfd_remap() to the out direction. Now after memory has
been removed, subsequent accesses will generate uffdio page fault events.
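
For context, here is a minimal sketch of the resulting non-cooperative
flow (hypothetical code with error handling trimmed; fetch_from_store()
is a stand-in for the remote key-value lookup). The region is
registered with UFFDIO_REGISTER_MODE_MISSING, pages are evicted with
UFFDIO_REMAP in the out direction, and later accesses fault and are
resolved with UFFDIO_COPY:

    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* hypothetical helper: returns a page-sized buffer for this address */
    extern void *fetch_from_store(unsigned long addr);

    static void service_faults(int uffd, unsigned long page_size)
    {
            struct uffd_msg msg;

            while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
                    if (msg.event != UFFD_EVENT_PAGEFAULT)
                            continue;

                    unsigned long addr =
                            msg.arg.pagefault.address & ~(page_size - 1);
                    struct uffdio_copy copy = {
                            .dst = addr,
                            .src = (unsigned long)fetch_from_store(addr),
                            .len = page_size,
                            .mode = 0,
                    };
                    ioctl(uffd, UFFDIO_COPY, &copy);
            }
    }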

Signed-off-by: Blake Caldwell <blake.caldwell@colorado.edu>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 10 ++++++++++
 fs/userfaultfd.c                             |  6 +++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 5048cf6..714af49 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -108,6 +108,16 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
 half copied page since it'll keep userfaulting until the copy has
 finished.
 
+To move pages out of a userfault registered region and into a user vma
+the UFFDIO_REMAP ioctl can be used. This is only possible for the
+"OUT" direction. For the "IN" direction, UFFDIO_COPY is preferred
+since UFFDIO_REMAP requires a TLB flush on the source range at a
+greater penalty than copying the page. With
+UFFDIO_REGISTER_MODE_MISSING set, subsequent accesses to the same
+region will generate a page fault event. This allows non-cooperative
+removal of memory in a userfaultfd registered vma, effectively
+limiting the amount of resident memory in such a region.
+
 QEMU/KVM
 ========
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cf68cdb..8099da2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1808,10 +1808,10 @@ static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
 			   sizeof(uffdio_remap)-sizeof(__s64)))
 		goto out;
 
-	ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
+	ret = validate_range(current->mm, uffdio_remap.dst, uffdio_remap.len);
 	if (ret)
 		goto out;
-	ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
+	ret = validate_range(ctx->mm, uffdio_remap.src, uffdio_remap.len);
 	if (ret)
 		goto out;
 	ret = -EINVAL;
@@ -1819,7 +1819,7 @@ static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
 				  UFFDIO_REMAP_MODE_DONTWAKE))
 		goto out;
 
-	ret = remap_pages(ctx->mm, current->mm,
+	ret = remap_pages(current->mm, ctx->mm,
 			  uffdio_remap.dst, uffdio_remap.src,
 			  uffdio_remap.len, uffdio_remap.mode);
 	if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 4/4] userfaultfd: change the direction for UFFDIO_REMAP to out
  2019-01-12  0:36 ` [PATCH 4/4] userfaultfd: change the direction for UFFDIO_REMAP to out Blake Caldwell
@ 2019-01-20 21:07   ` Mike Rapoport
  2019-01-24 23:36     ` Blake Caldwell
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Rapoport @ 2019-01-20 21:07 UTC (permalink / raw)
  To: Blake Caldwell
  Cc: rppt, xemul, akpm, mike.kravetz, kirill.shutemov, linux-mm, aarcange

Hi,

On Sat, Jan 12, 2019 at 12:36:29AM +0000, Blake Caldwell wrote:
> Moving a page out of a userfaultfd registered region and into a userland
> anonymous vma is needed by the use case of uncooperatively limiting the
> resident size of the userfaultfd region. Reverse the direction of the
> original userfaultfd_remap() to the out direction. Now after memory has
> been removed, subsequent accesses will generate uffdio page fault events.

It took me a while but better late than never :)

Why did you keep this as a separate patch? If the primary use case for
UFFDIO_REMAP is to move pages out of the userfaultfd region, why not make
it so from the beginning?

> Signed-off-by: Blake Caldwell <blake.caldwell@colorado.edu>
> ---
>  Documentation/admin-guide/mm/userfaultfd.rst | 10 ++++++++++
>  fs/userfaultfd.c                             |  6 +++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 5048cf6..714af49 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -108,6 +108,16 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
>  half copied page since it'll keep userfaulting until the copy has
>  finished.
> 
> +To move pages out of a userfault registered region and into a user vma
> +the UFFDIO_REMAP ioctl can be used. This is only possible for the
> +"OUT" direction. For the "IN" direction, UFFDIO_COPY is preferred
> +since UFFDIO_REMAP requires a TLB flush on the source range at a
> +greater penalty than copying the page. With
> +UFFDIO_REGISTER_MODE_MISSING set, subsequent accesses to the same
> +region will generate a page fault event. This allows non-cooperative
> +removal of memory in a userfaultfd registered vma, effectively
> +limiting the amount of resident memory in such a region.
> +
>  QEMU/KVM
>  ========
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index cf68cdb..8099da2 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1808,10 +1808,10 @@ static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
>  			   sizeof(uffdio_remap)-sizeof(__s64)))
>  		goto out;
> 
> -	ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
> +	ret = validate_range(current->mm, uffdio_remap.dst, uffdio_remap.len);
>  	if (ret)
>  		goto out;
> -	ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
> +	ret = validate_range(ctx->mm, uffdio_remap.src, uffdio_remap.len);
>  	if (ret)
>  		goto out;
>  	ret = -EINVAL;
> @@ -1819,7 +1819,7 @@ static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
>  				  UFFDIO_REMAP_MODE_DONTWAKE))
>  		goto out;
> 
> -	ret = remap_pages(ctx->mm, current->mm,
> +	ret = remap_pages(current->mm, ctx->mm,
>  			  uffdio_remap.dst, uffdio_remap.src,
>  			  uffdio_remap.len, uffdio_remap.mode);
>  	if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
> -- 
> 1.8.3.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 4/4] userfaultfd: change the direction for UFFDIO_REMAP to out
  2019-01-20 21:07   ` Mike Rapoport
@ 2019-01-24 23:36     ` Blake Caldwell
  0 siblings, 0 replies; 7+ messages in thread
From: Blake Caldwell @ 2019-01-24 23:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: rppt, xemul, akpm, mike.kravetz, kirill.shutemov, linux-mm, aarcange



> On Jan 20, 2019, at 4:07 PM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> Hi,
> 
> On Sat, Jan 12, 2019 at 12:36:29AM +0000, Blake Caldwell wrote:
>> Moving a page out of a userfaultfd registered region and into a userland
>> anonymous vma is needed by the use case of uncooperatively limiting the
>> resident size of the userfaultfd region. Reverse the direction of the
>> original userfaultfd_remap() to the out direction. Now after memory has
>> been removed, subsequent accesses will generate uffdio page fault events.
> 
> It took me a while but better late than never :)
> 
> Why did you keep this as a separate patch? If the primary use case for
> UFFDIO_REMAP is to move pages out of the userfaultfd region, why not
> make it so from the beginning?

Only to show what has changed since this was last proposed, but yes, that
change to fs/userfaultfd.c should be squashed into patch 3. The purpose of
patch 4 will then only be to document UFFDIO_REMAP.

I will make those changes for the next revision. Thanks for looking this over.

> 
>> Signed-off-by: Blake Caldwell <blake.caldwell@colorado.edu>
>> ---
>> Documentation/admin-guide/mm/userfaultfd.rst | 10 ++++++++++
>> fs/userfaultfd.c                             |  6 +++---
>> 2 files changed, 13 insertions(+), 3 deletions(-)
>> 
>> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
>> index 5048cf6..714af49 100644
>> --- a/Documentation/admin-guide/mm/userfaultfd.rst
>> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
>> @@ -108,6 +108,16 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
>> half copied page since it'll keep userfaulting until the copy has
>> finished.
>> 
>> +To move pages out of a userfault registered region and into a user vma
>> +the UFFDIO_REMAP ioctl can be used. This is only possible for the
>> +"OUT" direction. For the "IN" direction, UFFDIO_COPY is preferred
>> +since UFFDIO_REMAP requires a TLB flush on the source range at a
>> +greater penalty than copying the page. With
>> +UFFDIO_REGISTER_MODE_MISSING set, subsequent accesses to the same
>> +region will generate a page fault event. This allows non-cooperative
>> +removal of memory in a userfaultfd registered vma, effectively
>> +limiting the amount of resident memory in such a region.
>> +
>> QEMU/KVM
>> ========
>> 
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index cf68cdb..8099da2 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1808,10 +1808,10 @@ static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
>> 			   sizeof(uffdio_remap)-sizeof(__s64)))
>> 		goto out;
>> 
>> -	ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
>> +	ret = validate_range(current->mm, uffdio_remap.dst, uffdio_remap.len);
>> 	if (ret)
>> 		goto out;
>> -	ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
>> +	ret = validate_range(ctx->mm, uffdio_remap.src, uffdio_remap.len);
>> 	if (ret)
>> 		goto out;
>> 	ret = -EINVAL;
>> @@ -1819,7 +1819,7 @@ static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
>> 				  UFFDIO_REMAP_MODE_DONTWAKE))
>> 		goto out;
>> 
>> -	ret = remap_pages(ctx->mm, current->mm,
>> +	ret = remap_pages(current->mm, ctx->mm,
>> 			  uffdio_remap.dst, uffdio_remap.src,
>> 			  uffdio_remap.len, uffdio_remap.mode);
>> 	if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
>> -- 
>> 1.8.3.1
>> 
> 
> -- 
> Sincerely yours,
> Mike.



^ permalink raw reply	[flat|nested] 7+ messages in thread

