* [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping
From: Kirill Tkhai @ 2019-05-15 15:11 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, pasha.tatashin, alexander.h.duyck, ira.weiny,
	andreyknvl, arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, linux-kernel, linux-mm

This patchset adds a new syscall, which makes it possible
to clone a mapping from one process to another process.
The syscall supplements the functionality provided
by the process_vm_writev() and process_vm_readv() syscalls,
and it may be useful in many situations.

For example, it allows making a zero-copy transfer of data
in cases where process_vm_writev() was previously used:

	struct iovec local_iov, remote_iov;
	void *buf;

	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
	recv(sock, buf, n * PAGE_SIZE, 0);

	local_iov.iov_base = buf;
	local_iov.iov_len = n * PAGE_SIZE;
	remote_iov = ...;

	process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
	munmap(buf, n * PAGE_SIZE);

	(Note that the above completely ignores error handling.)

There are several problems with process_vm_writev() in this example:

1) it causes a page fault on the remote process memory, which forces
  allocation of a new page (if one was not preallocated);

2) the amount of memory for this example is momentarily doubled --
  n pages in the current task and n pages in the remote task are
  occupied at the same time;

3) the received data has no chance of staying properly swapped out
  for a long time.

The third problem is the most critical in case the remote process
touches the data pages some time after process_vm_writev() was made.
Imagine the node is under memory pressure:

a) the kernel moves @buf pages into swap right after recv();
b) process_vm_writev() reads the data back from swap into pages;
c) process_vm_writev() allocates duplicate pages in the remote
  process and populates them;
d) munmap() unmaps @buf;
e) 5 minutes later the remote task touches the data.

In stages "a" and "b" the kernel submits unneeded IO and makes
system IO throughput worse. To perform "b" and "c", the kernel
reclaims memory and moves pages of some other processes
into swap, so those processes later have to read the pages back.
Also, unneeded copying of pages occurs, while zero-copy would be
preferable.

We observe a similar problem during online migration of big enough
containers: after the container's memory size doubles, the migration
time increases 100 times. The system sits under heavy IO and throws
out useful caches.

The proposed syscall aims to introduce an interface, which
supplements the currently existing process_vm_writev() and
process_vm_readv(), and solves the problem for anonymous
memory transfers. The above example may be rewritten as:

	void *buf;

	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
	recv(sock, buf, n * PAGE_SIZE, 0);

	/* The sign of @pid gives the direction: "from @pid task to current" or vice versa. */
	process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
	munmap(buf, n * PAGE_SIZE);

It is swap-friendly: in case the memory was swapped out right after
recv(), the syscall just copies page table entries like we do on
fork(), so no real access to the pages occurs, and no IO is needed.
No excess pages are reclaimed, and the number of pages is not doubled.
Also, the transfer is zero-copy, which further reduces overhead.

The patchset does not introduce much new code, since we simply
reuse the existing copy_page_range() and copy_vma() functions.
We extend copy_vma() to be able to merge VMAs in a remote task [2/5],
and teach copy_page_range() to work with different local and
remote addresses [3/5]. Patch [5/5] introduces the syscall logic,
which mostly consists of sanity checks. The rest of the patches
are preparations.

This syscall may be used for page servers like in the example
above, for migration (I assume even virtual machines may
want something like this), for users of process_vm_writev()
and process_vm_readv() who desire zero-copy, for debugging
purposes, etc. It requires the same permissions as the
existing process_vm_xxx() syscalls.
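
Since glibc has no wrapper for the new syscall yet, a minimal
userspace sketch for invoking it may look like this (428 is the
x86_64 syscall number added in patch [1/5]; error handling is
again omitted):

	#include <sys/types.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_process_vm_mmap
	#define __NR_process_vm_mmap 428
	#endif

	static long process_vm_mmap(pid_t pid, unsigned long src_addr,
				    unsigned long len, unsigned long dst_addr,
				    unsigned long flags)
	{
		return syscall(__NR_process_vm_mmap, pid, src_addr,
			       len, dst_addr, flags);
	}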

The tests I used may be obtained here:

[1]https://gist.github.com/tkhai/198d32fdc001ec7812a5e1ccf091f275
[2]https://gist.github.com/tkhai/f52dbaeedad5a699f3fb386fda676562

---

Kirill Tkhai (5):
      mm: Add process_vm_mmap() syscall declaration
      mm: Extend copy_vma()
      mm: Extend copy_page_range()
      mm: Export round_hint_to_min()
      mm: Add process_vm_mmap()


 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    2 
 include/linux/huge_mm.h                |    6 +
 include/linux/mm.h                     |   11 ++
 include/linux/mm_types.h               |    2 
 include/linux/mman.h                   |   14 +++
 include/linux/syscalls.h               |    5 +
 include/uapi/asm-generic/mman-common.h |    5 +
 include/uapi/asm-generic/unistd.h      |    5 +
 init/Kconfig                           |    9 +-
 kernel/fork.c                          |    5 +
 kernel/sys_ni.c                        |    2 
 mm/huge_memory.c                       |   30 ++++--
 mm/memory.c                            |  165 +++++++++++++++++++++-----------
 mm/mmap.c                              |  154 ++++++++++++++++++++++++++----
 mm/mremap.c                            |    4 -
 mm/process_vm_access.c                 |   71 ++++++++++++++
 17 files changed, 392 insertions(+), 99 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>


* [PATCH RFC 1/5] mm: Add process_vm_mmap() syscall declaration
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
From: Kirill Tkhai @ 2019-05-15 15:11 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, pasha.tatashin, alexander.h.duyck, ira.weiny,
	andreyknvl, arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, linux-kernel, linux-mm

Similar to process_vm_readv() and process_vm_writev(),
add declarations of a new syscall, which will allow
mapping memory from or to another process.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 ++
 include/linux/syscalls.h               |    5 +++++
 include/uapi/asm-generic/unistd.h      |    5 ++++-
 init/Kconfig                           |    9 +++++----
 kernel/sys_ni.c                        |    2 ++
 6 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..bf8cc5de918f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	process_vm_mmap		sys_process_vm_mmap		__ia32_compat_sys_process_vm_mmap
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..5af619c2d512 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	process_vm_mmap		__x64_sys_process_vm_mmap
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -398,3 +399,4 @@
 545	x32	execveat		__x32_compat_sys_execveat/ptregs
 546	x32	preadv2			__x32_compat_sys_preadv64v2
 547	x32	pwritev2		__x32_compat_sys_pwritev64v2
+548	x32	process_vm_mmap		__x32_compat_sys_process_vm_mmap
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..7d8ae36589cf 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -997,6 +997,11 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
 asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
+asmlinkage long sys_process_vm_mmap(pid_t pid,
+				    unsigned long src_addr,
+				    unsigned long len,
+				    unsigned long dst_addr,
+				    unsigned long flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..1273d86bf546 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,12 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 #define __NR_io_uring_register 427
 __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_process_vm_mmap 428
+__SC_COMP(__NR_process_vm_mmap, sys_process_vm_mmap, \
+          compat_sys_process_vm_mmap)
 
 #undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
 
 /*
  * 32 bit systems traditionally used different
diff --git a/init/Kconfig b/init/Kconfig
index 8b9ffe236e4f..604db5f14718 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -320,13 +320,14 @@ config POSIX_MQUEUE_SYSCTL
 	default y
 
 config CROSS_MEMORY_ATTACH
-	bool "Enable process_vm_readv/writev syscalls"
+	bool "Enable process_vm_readv/writev/mmap syscalls"
 	depends on MMU
 	default y
 	help
-	  Enabling this option adds the system calls process_vm_readv and
-	  process_vm_writev which allow a process with the correct privileges
-	  to directly read from or write to another process' address space.
+	  Enabling this option adds the system calls process_vm_readv,
+	  process_vm_writev and process_vm_mmap, which allow a process
+	  with the correct privileges to directly read from, write to,
+	  or mmap another process' address space.
 	  See the man page for more details.
 
 config USELIB
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..6f51634f4f7e 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -316,6 +316,8 @@ COND_SYSCALL(process_vm_readv);
 COND_SYSCALL_COMPAT(process_vm_readv);
 COND_SYSCALL(process_vm_writev);
 COND_SYSCALL_COMPAT(process_vm_writev);
+COND_SYSCALL(process_vm_mmap);
+COND_SYSCALL_COMPAT(process_vm_mmap);
 
 /* compare kernel pointers */
 COND_SYSCALL(kcmp);



* [PATCH RFC 2/5] mm: Extend copy_vma()
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
From: Kirill Tkhai @ 2019-05-15 15:11 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, pasha.tatashin, alexander.h.duyck, ira.weiny,
	andreyknvl, arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, linux-kernel, linux-mm

This prepares the function for copying a VMA between
two processes. Two new arguments are introduced: the
destination @mm, and @clear_flags_ctx, which requests
clearing the inherited VMA flags and the userfaultfd
context when copying into a foreign mm.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/mm.h |    4 ++--
 mm/mmap.c          |   33 ++++++++++++++++++++++++---------
 mm/mremap.c        |    4 ++--
 3 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..afe07e4a76f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2329,8 +2329,8 @@ extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
 	struct rb_node **, struct rb_node *);
 extern void unlink_file_vma(struct vm_area_struct *);
 extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
-	unsigned long addr, unsigned long len, pgoff_t pgoff,
-	bool *need_rmap_locks);
+	struct mm_struct *, unsigned long addr, unsigned long len,
+	pgoff_t pgoff, bool *need_rmap_locks, bool clear_flags_ctx);
 extern void exit_mmap(struct mm_struct *);
 
 static inline int check_data_rlimit(unsigned long rlim,
diff --git a/mm/mmap.c b/mm/mmap.c
index 9cf52bdb22a8..46266f6825ae 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3194,19 +3194,21 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 }
 
 /*
- * Copy the vma structure to a new location in the same mm,
- * prior to moving page table entries, to effect an mremap move.
+ * Copy the vma structure to a new location, possibly in another
+ * mm, prior to moving page table entries, to effect an mremap move.
  */
 struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
-	unsigned long addr, unsigned long len, pgoff_t pgoff,
-	bool *need_rmap_locks)
+				struct mm_struct *mm, unsigned long addr,
+				unsigned long len, pgoff_t pgoff,
+				bool *need_rmap_locks, bool clear_flags_ctx)
 {
 	struct vm_area_struct *vma = *vmap;
 	unsigned long vma_start = vma->vm_start;
-	struct mm_struct *mm = vma->vm_mm;
+	struct vm_userfaultfd_ctx uctx;
 	struct vm_area_struct *new_vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
 	bool faulted_in_anon_vma = true;
+	unsigned long flags;
 
 	/*
 	 * If anonymous vma has not yet been faulted, update new pgoff
@@ -3219,15 +3221,25 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
-	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			    vma->vm_userfaultfd_ctx);
+
+	uctx = vma->vm_userfaultfd_ctx;
+	flags = vma->vm_flags;
+	if (clear_flags_ctx) {
+		uctx = NULL_VM_UFFD_CTX;
+		flags &= ~(VM_UFFD_MISSING | VM_UFFD_WP | VM_MERGEABLE |
+			   VM_LOCKED | VM_LOCKONFAULT | VM_WIPEONFORK |
+			   VM_DONTCOPY);
+	}
+
+	new_vma = vma_merge(mm, prev, addr, addr + len, flags, vma->anon_vma,
+			    vma->vm_file, pgoff, vma_policy(vma), uctx);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
 		 */
 		if (unlikely(vma_start >= new_vma->vm_start &&
-			     vma_start < new_vma->vm_end)) {
+			     vma_start < new_vma->vm_end) &&
+			     vma->vm_mm == mm) {
 			/*
 			 * The only way we can get a vma_merge with
 			 * self during an mremap is if the vma hasn't
@@ -3248,6 +3260,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		new_vma = vm_area_dup(vma);
 		if (!new_vma)
 			goto out;
+		new_vma->vm_mm = mm;
+		new_vma->vm_flags = flags;
+		new_vma->vm_userfaultfd_ctx = uctx;
 		new_vma->vm_start = addr;
 		new_vma->vm_end = addr + len;
 		new_vma->vm_pgoff = pgoff;
diff --git a/mm/mremap.c b/mm/mremap.c
index 37b5b2ad91be..9a96cfc28675 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -352,8 +352,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		return err;
 
 	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
-	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
-			   &need_rmap_locks);
+	new_vma = copy_vma(&vma, mm, new_addr, new_len, new_pgoff,
+			   &need_rmap_locks, false);
 	if (!new_vma)
 		return -ENOMEM;
 



* [PATCH RFC 3/5] mm: Extend copy_page_range()
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
From: Kirill Tkhai @ 2019-05-15 15:11 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, pasha.tatashin, alexander.h.duyck, ira.weiny,
	andreyknvl, arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, linux-kernel, linux-mm

This allows copying pages not only to the same addresses
in another process, but also to a specified address.
Huge page and unaligned address cases are handled by
splitting: if a huge PMD/PUD cannot be copied as a whole,
the source entry is split and copying falls back to the
lower level.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/huge_mm.h |    6 +-
 include/linux/mm.h      |    3 +
 kernel/fork.c           |    5 +
 mm/huge_memory.c        |   30 ++++++---
 mm/memory.c             |  165 +++++++++++++++++++++++++++++++----------------
 5 files changed, 141 insertions(+), 68 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7cd5c150c21d..1e6002ee7c44 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -9,11 +9,13 @@
 
 extern vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long dst_addr,
+			 unsigned long src_addr, unsigned long len,
 			 struct vm_area_struct *vma);
 extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
 extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-			 pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+			 pud_t *dst_pud, pud_t *src_pud, unsigned long dst_addr,
+			 unsigned long src_addr, unsigned long len,
 			 struct vm_area_struct *vma);
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
diff --git a/include/linux/mm.h b/include/linux/mm.h
index afe07e4a76f8..54328d08dbdd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1485,7 +1485,8 @@ int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+			struct vm_area_struct *vma, unsigned long dst_addr,
+			unsigned long src_addr, unsigned long src_end);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		   struct mmu_notifier_range *range,
 		   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d4b5227630..2cce9bb78c1d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -584,7 +584,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(mm, oldmm, mpnt);
+			retval = copy_page_range(mm, oldmm, mpnt,
+						 mpnt->vm_start,
+						 mpnt->vm_start,
+						 mpnt->vm_end);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9a6b32..f338b06f42c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -956,7 +956,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long dst_addr,
+		  unsigned long src_addr, unsigned long len,
 		  struct vm_area_struct *vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
@@ -969,6 +970,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(vma))
 		return 0;
 
+	if (len != HPAGE_PMD_SIZE) {
+		split_huge_pmd(vma, src_pmd, src_addr);
+		return -EAGAIN;
+	}
+
 	pgtable = pte_alloc_one(dst_mm);
 	if (unlikely(!pgtable))
 		goto out;
@@ -990,12 +996,12 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			pmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
-			set_pmd_at(src_mm, addr, src_pmd, pmd);
+			set_pmd_at(src_mm, src_addr, src_pmd, pmd);
 		}
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+		set_pmd_at(dst_mm, dst_addr, dst_pmd, pmd);
 		ret = 0;
 		goto out_unlock;
 	}
@@ -1018,7 +1024,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 * reference.
 		 */
 		zero_page = mm_get_huge_zero_page(dst_mm);
-		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, dst_addr, dst_pmd,
 				zero_page);
 		ret = 0;
 		goto out_unlock;
@@ -1032,9 +1038,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
-	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmdp_set_wrprotect(src_mm, src_addr, src_pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
-	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	set_pmd_at(dst_mm, dst_addr, dst_pmd, pmd);
 
 	ret = 0;
 out_unlock:
@@ -1096,13 +1102,19 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 }
 
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+		  pud_t *dst_pud, pud_t *src_pud, unsigned long dst_addr,
+		  unsigned long src_addr, unsigned long len,
 		  struct vm_area_struct *vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
 	pud_t pud;
 	int ret;
 
+	if (len != HPAGE_PUD_SIZE) {
+		split_huge_pud(vma, src_pud, src_addr);
+		return -EAGAIN;
+	}
+
 	dst_ptl = pud_lock(dst_mm, dst_pud);
 	src_ptl = pud_lockptr(src_mm, src_pud);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -1121,9 +1133,9 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		/* No huge zero pud yet */
 	}
 
-	pudp_set_wrprotect(src_mm, addr, src_pud);
+	pudp_set_wrprotect(src_mm, src_addr, src_pud);
 	pud = pud_mkold(pud_wrprotect(pud));
-	set_pud_at(dst_mm, addr, dst_pud, pud);
+	set_pud_at(dst_mm, dst_addr, dst_pud, pud);
 
 	ret = 0;
 out_unlock:
diff --git a/mm/memory.c b/mm/memory.c
index 0d0711a912de..9d0fe2aee5f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -699,7 +699,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 static inline unsigned long
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		unsigned long src_addr, int *rss, unsigned long dst_addr)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
@@ -737,7 +737,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(*src_pte))
 					pte = pte_swp_mksoft_dirty(pte);
-				set_pte_at(src_mm, addr, src_pte, pte);
+				set_pte_at(src_mm, src_addr, src_pte, pte);
 			}
 		} else if (is_device_private_entry(entry)) {
 			page = device_private_entry_to_page(entry);
@@ -766,7 +766,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			    is_cow_mapping(vm_flags)) {
 				make_device_private_entry_read(&entry);
 				pte = swp_entry_to_pte(entry);
-				set_pte_at(src_mm, addr, src_pte, pte);
+				set_pte_at(src_mm, src_addr, src_pte, pte);
 			}
 		}
 		goto out_set_pte;
@@ -777,7 +777,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		ptep_set_wrprotect(src_mm, src_addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 
@@ -789,7 +789,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(vma, src_addr, pte);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
@@ -810,13 +810,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 out_set_pte:
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	set_pte_at(dst_mm, dst_addr, dst_pte, pte);
 	return 0;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		   unsigned long addr, unsigned long end)
+		   unsigned long src_addr, unsigned long src_end,
+		   unsigned long dst_addr)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -828,10 +829,10 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 again:
 	init_rss_vec(rss);
 
-	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
+	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, dst_addr, &dst_ptl);
 	if (!dst_pte)
 		return -ENOMEM;
-	src_pte = pte_offset_map(src_pmd, addr);
+	src_pte = pte_offset_map(src_pmd, src_addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	orig_src_pte = src_pte;
@@ -854,11 +855,12 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			continue;
 		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
-							vma, addr, rss);
+					 vma, src_addr, rss, dst_addr);
 		if (entry.val)
 			break;
 		progress += 8;
-	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+	} while (dst_pte++, src_pte++, dst_addr += PAGE_SIZE,
+		 src_addr += PAGE_SIZE, src_addr != src_end);
 
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(src_ptl);
@@ -872,108 +874,147 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			return -ENOMEM;
 		progress = 0;
 	}
-	if (addr != end)
+	if (src_addr != src_end)
 		goto again;
 	return 0;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long src_addr, unsigned long src_end, unsigned long dst_addr)
 {
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	pmd_t *src_pmd, *dst_pmd;
-	unsigned long next;
 
-	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
+	dst_pmd = pmd_alloc(dst_mm, dst_pud, dst_addr);
 	if (!dst_pmd)
 		return -ENOMEM;
-	src_pmd = pmd_offset(src_pud, addr);
+	src_pmd = pmd_offset(src_pud, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		src_next = pmd_addr_end(src_addr, src_end);
+		dst_next = pmd_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
-			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
-			err = copy_huge_pmd(dst_mm, src_mm,
-					    dst_pmd, src_pmd, addr, vma);
+			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
+					    dst_addr, src_addr, len, vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)
-				continue;
+				goto next;
 			/* fall through */
 		}
 		if (pmd_none_or_clear_bad(src_pmd))
-			continue;
+			goto next;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+				   vma, src_addr, src_next, dst_addr))
 			return -ENOMEM;
-	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_pmd++;
+		if (dst_len == len)
+			dst_pmd++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 	return 0;
 }
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long src_addr, unsigned long src_end, unsigned long dst_addr)
 {
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	pud_t *src_pud, *dst_pud;
-	unsigned long next;
 
-	dst_pud = pud_alloc(dst_mm, dst_p4d, addr);
+	dst_pud = pud_alloc(dst_mm, dst_p4d, dst_addr);
 	if (!dst_pud)
 		return -ENOMEM;
-	src_pud = pud_offset(src_p4d, addr);
+	src_pud = pud_offset(src_p4d, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = pud_addr_end(addr, end);
+		src_next = pud_addr_end(src_addr, src_end);
+		dst_next = pud_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
+
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
-			VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, vma);
-			err = copy_huge_pud(dst_mm, src_mm,
-					    dst_pud, src_pud, addr, vma);
+			err = copy_huge_pud(dst_mm, src_mm, dst_pud, src_pud,
+					    dst_addr, src_addr, len, vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)
-				continue;
+				goto next;
 			/* fall through */
 		}
 		if (pud_none_or_clear_bad(src_pud))
-			continue;
+			goto next;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+				   vma, src_addr, src_next, dst_addr))
 			return -ENOMEM;
-	} while (dst_pud++, src_pud++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_pud++;
+		if (dst_len == len)
+			dst_pud++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 	return 0;
 }
 
 static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long src_addr, unsigned long src_end, unsigned long dst_addr)
 {
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	p4d_t *src_p4d, *dst_p4d;
-	unsigned long next;
 
-	dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr);
+	dst_p4d = p4d_alloc(dst_mm, dst_pgd, dst_addr);
 	if (!dst_p4d)
 		return -ENOMEM;
-	src_p4d = p4d_offset(src_pgd, addr);
+
+	src_p4d = p4d_offset(src_pgd, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		src_next = p4d_addr_end(src_addr, src_end);
+		dst_next = p4d_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
+
 		if (p4d_none_or_clear_bad(src_p4d))
-			continue;
+			goto next;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
-						vma, addr, next))
+				   vma, src_addr, src_next, dst_addr))
 			return -ENOMEM;
-	} while (dst_p4d++, src_p4d++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_p4d++;
+		if (dst_len == len)
+			dst_p4d++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *vma, unsigned long dst_addr,
+		unsigned long src_addr, unsigned long src_end)
 {
 	pgd_t *src_pgd, *dst_pgd;
-	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
@@ -1011,23 +1052,37 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	if (is_cow) {
 		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
-					0, vma, src_mm, addr, end);
+					0, vma, src_mm, src_addr, src_end);
 		mmu_notifier_invalidate_range_start(&range);
 	}
 
 	ret = 0;
-	dst_pgd = pgd_offset(dst_mm, addr);
-	src_pgd = pgd_offset(src_mm, addr);
+	dst_pgd = pgd_offset(dst_mm, dst_addr);
+	src_pgd = pgd_offset(src_mm, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		src_next = pgd_addr_end(src_addr, src_end);
+		dst_next = pgd_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
+
 		if (pgd_none_or_clear_bad(src_pgd))
-			continue;
+			goto next;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					vma, src_addr, src_next, dst_addr))) {
 			ret = -ENOMEM;
 			break;
 		}
-	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_pgd++;
+		if (dst_len == len)
+			dst_pgd++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 
 	if (is_cow)
 		mmu_notifier_invalidate_range_end(&range);



* [PATCH RFC 4/5] mm: Export round_hint_to_min()
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
From: Kirill Tkhai @ 2019-05-15 15:11 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, pasha.tatashin, alexander.h.duyck, ira.weiny,
	andreyknvl, arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, linux-kernel, linux-mm

Move round_hint_to_min() into include/linux/mman.h, so that it
can be reused outside mm/mmap.c by the following patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/mman.h |   14 ++++++++++++++
 mm/mmap.c            |   13 -------------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c9c538..69feb3144c12 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -4,6 +4,7 @@
 
 #include <linux/mm.h>
 #include <linux/percpu_counter.h>
+#include <linux/security.h>
 
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
@@ -73,6 +74,19 @@ static inline void vm_unacct_memory(long pages)
 	vm_acct_memory(-pages);
 }
 
+/*
+ * If a hint addr is less than mmap_min_addr change hint to be as
+ * low as possible but still greater than mmap_min_addr
+ */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	hint &= PAGE_MASK;
+	if (((void *)hint != NULL) &&
+	    (hint < mmap_min_addr))
+		return PAGE_ALIGN(mmap_min_addr);
+	return hint;
+}
+
 /*
  * Allow architectures to handle additional protection bits
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 46266f6825ae..b2a1f77643cd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1318,19 +1318,6 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 	return NULL;
 }
 
-/*
- * If a hint addr is less than mmap_min_addr change hint to be as
- * low as possible but still greater than mmap_min_addr
- */
-static inline unsigned long round_hint_to_min(unsigned long hint)
-{
-	hint &= PAGE_MASK;
-	if (((void *)hint != NULL) &&
-	    (hint < mmap_min_addr))
-		return PAGE_ALIGN(mmap_min_addr);
-	return hint;
-}
-
 static inline int mlock_future_check(struct mm_struct *mm,
 				     unsigned long flags,
 				     unsigned long len)



* [PATCH RFC 5/5] mm: Add process_vm_mmap()
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
From: Kirill Tkhai @ 2019-05-15 15:11 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, pasha.tatashin, alexander.h.duyck, ira.weiny,
	andreyknvl, arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, linux-kernel, linux-mm

This adds a new syscall to map a VMA from or to another
process. The flag PVMMAP_FIXED may be specified;
its meaning is similar to mmap()'s MAP_FIXED.

@pid > 0 means to map from the process of @pid to current,
@pid < 0 means to map from current to the @pid process.

VMAs are merged on the destination, i.e. if the source task
has a VMA covering [start, end), and we map it sequentially
twice:

process_vm_mmap(@pid, start, (end - start)/2, ...);
process_vm_mmap(@pid, start + (end - start)/2, (end - start)/2, ...);

the destination task will have a single VMA covering [start, end).
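
A short usage sketch of both directions (assuming a thin syscall(2)
wrapper like the one in the cover letter; addresses are illustrative):

	/* Pull: map [src, src + len) of task @pid into the current
	 * address space; the kernel picks the destination address
	 * and returns it. */
	dst = process_vm_mmap(pid, src, len, 0, 0);

	/* Push: map the local [buf, buf + len) into task @pid
	 * at remote_addr. */
	process_vm_mmap(-pid, (unsigned long)buf, len, remote_addr,
			PVMMAP_FIXED);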

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/mm.h                     |    4 +
 include/linux/mm_types.h               |    2 +
 include/uapi/asm-generic/mman-common.h |    5 +
 mm/mmap.c                              |  108 ++++++++++++++++++++++++++++++++
 mm/process_vm_access.c                 |   71 +++++++++++++++++++++
 5 files changed, 190 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 54328d08dbdd..c49bcfac593c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2382,6 +2382,10 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
+extern unsigned long mmap_process_vm(struct mm_struct *, unsigned long,
+				     struct mm_struct *, unsigned long,
+				     unsigned long, unsigned long,
+				     struct list_head *);
 
 static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1815fbc40926..885f256f2fb7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -261,11 +261,13 @@ struct vm_region {
 
 #ifdef CONFIG_USERFAULTFD
 #define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+#define IS_NULL_VM_UFFD_CTX(uctx) ((uctx)->ctx == NULL)
 struct vm_userfaultfd_ctx {
 	struct userfaultfd_ctx *ctx;
 };
 #else /* CONFIG_USERFAULTFD */
 #define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+#define IS_NULL_VM_UFFD_CTX(uctx) (true)
 struct vm_userfaultfd_ctx {};
 #endif /* CONFIG_USERFAULTFD */
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..44cb6cf77e93 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -28,6 +28,11 @@
 /* 0x0100 - 0x80000 flags are defined in asm-generic/mman.h */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 
+/*
+ * Flags for process_vm_mmap
+ */
+#define PVMMAP_FIXED	0x01
+
 /*
  * Flags for mlock
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index b2a1f77643cd..3dbf280e9f8e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3274,6 +3274,114 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	return NULL;
 }
 
+static int do_mmap_process_vm(struct vm_area_struct *src_vma,
+			      unsigned long src_addr,
+			      struct mm_struct *dst_mm,
+			      unsigned long dst_addr,
+			      unsigned long len,
+			      struct list_head *uf)
+{
+	struct vm_area_struct *dst_vma;
+	unsigned long pgoff, ret;
+	bool unused;
+
+	if (do_munmap(dst_mm, dst_addr, len, uf))
+		return -ENOMEM;
+
+	if (src_vma->vm_flags & VM_ACCOUNT) {
+		if (security_vm_enough_memory_mm(dst_mm, len >> PAGE_SHIFT))
+			return -ENOMEM;
+	}
+
+	pgoff = src_vma->vm_pgoff +
+			((src_addr - src_vma->vm_start) >> PAGE_SHIFT);
+	dst_vma = copy_vma(&src_vma, dst_mm, dst_addr,
+			   len, pgoff, &unused, false);
+	if (!dst_vma) {
+		ret = -ENOMEM;
+		goto unacct;
+	}
+
+	ret = copy_page_range(dst_mm, src_vma->vm_mm, src_vma,
+			      dst_addr, src_addr, src_addr + len);
+	if (ret) {
+		do_munmap(dst_mm, dst_addr, len, uf);
+		return -ENOMEM;
+	}
+
+	if (dst_vma->vm_file)
+		uprobe_mmap(dst_vma);
+	perf_event_mmap(dst_vma);
+
+	dst_vma->vm_flags |= VM_SOFTDIRTY;
+	vma_set_page_prot(dst_vma);
+
+	vm_stat_account(dst_mm, dst_vma->vm_flags, len >> PAGE_SHIFT);
+	return 0;
+
+unacct:
+	vm_unacct_memory(len >> PAGE_SHIFT);
+	return ret;
+}
+
+unsigned long mmap_process_vm(struct mm_struct *src_mm,
+			      unsigned long src_addr,
+			      struct mm_struct *dst_mm,
+			      unsigned long dst_addr,
+			      unsigned long len,
+			      unsigned long flags,
+			      struct list_head *uf)
+{
+	struct vm_area_struct *src_vma = find_vma(src_mm, src_addr);
+	unsigned long gua_flags = 0;
+	unsigned long ret;
+
+	if (!src_vma || src_vma->vm_start > src_addr)
+		return -EFAULT;
+	if (len > src_vma->vm_end - src_addr)
+		return -EFAULT;
+	if (src_vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
+		return -EFAULT;
+	if (is_vm_hugetlb_page(src_vma) || (src_vma->vm_flags & VM_IO))
+		return -EINVAL;
+        if (dst_mm->map_count + 2 > sysctl_max_map_count)
+                return -ENOMEM;
+	if (!IS_NULL_VM_UFFD_CTX(&src_vma->vm_userfaultfd_ctx))
+		return -ENOTSUPP;
+
+	if (src_vma->vm_flags & VM_SHARED)
+		gua_flags |= MAP_SHARED;
+	else
+		gua_flags |= MAP_PRIVATE;
+	if (vma_is_anonymous(src_vma) || vma_is_shmem(src_vma))
+		gua_flags |= MAP_ANONYMOUS;
+	if (flags & PVMMAP_FIXED)
+		gua_flags |= MAP_FIXED;
+	ret = get_unmapped_area(src_vma->vm_file, dst_addr, len,
+				src_vma->vm_pgoff +
+				((src_addr - src_vma->vm_start) >> PAGE_SHIFT),
+				gua_flags);
+	if (offset_in_page(ret))
+                return ret;
+	dst_addr = ret;
+
+	/* Check against address space limit. */
+	if (!may_expand_vm(dst_mm, src_vma->vm_flags, len >> PAGE_SHIFT)) {
+		unsigned long nr_pages;
+
+		nr_pages = count_vma_pages_range(dst_mm, dst_addr, dst_addr + len);
+		if (!may_expand_vm(dst_mm, src_vma->vm_flags,
+					(len >> PAGE_SHIFT) - nr_pages))
+			return -ENOMEM;
+	}
+
+	ret = do_mmap_process_vm(src_vma, src_addr, dst_mm, dst_addr, len, uf);
+	if (ret)
+                return ret;
+
+	return dst_addr;
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index a447092d4635..7fca2c5c7edd 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -17,6 +17,8 @@
 #include <linux/ptrace.h>
 #include <linux/slab.h>
 #include <linux/syscalls.h>
+#include <linux/mman.h>
+#include <linux/userfaultfd_k.h>
 
 #ifdef CONFIG_COMPAT
 #include <linux/compat.h>
@@ -295,6 +297,68 @@ static ssize_t process_vm_rw(pid_t pid,
 	return rc;
 }
 
+static unsigned long process_vm_mmap(pid_t pid, unsigned long src_addr,
+				     unsigned long len, unsigned long dst_addr,
+				     unsigned long flags)
+{
+	struct mm_struct *src_mm, *dst_mm;
+	struct task_struct *task;
+	unsigned long ret;
+	int depth = 0;
+	LIST_HEAD(uf);
+
+	len = PAGE_ALIGN(len);
+	src_addr = round_down(src_addr, PAGE_SIZE);
+	if (flags & PVMMAP_FIXED)
+		dst_addr = round_down(dst_addr, PAGE_SIZE);
+	else
+		dst_addr = round_hint_to_min(dst_addr);
+
+	if ((flags & ~PVMMAP_FIXED) || len == 0 || len > TASK_SIZE ||
+	    src_addr == 0 || dst_addr > TASK_SIZE - len)
+		return -EINVAL;
+	task = find_get_task_by_vpid(pid > 0 ? pid : -pid);
+	if (!task)
+		return -ESRCH;
+	if (unlikely(task->flags & PF_KTHREAD)) {
+		ret = -EINVAL;
+		goto out_put_task;
+	}
+
+	src_mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+	if (!src_mm || IS_ERR(src_mm)) {
+		ret = IS_ERR(src_mm) ? PTR_ERR(src_mm) : -ESRCH;
+		goto out_put_task;
+	}
+	dst_mm = current->mm;
+	mmget(dst_mm);
+
+	if (pid < 0)
+		swap(src_mm, dst_mm);
+
+	/* Double lock mm in address order: smallest is the first */
+	if (src_mm < dst_mm) {
+		down_write(&src_mm->mmap_sem);
+		depth = SINGLE_DEPTH_NESTING;
+	}
+	down_write_nested(&dst_mm->mmap_sem, depth);
+	if (src_mm > dst_mm)
+		down_write_nested(&src_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+
+	ret = mmap_process_vm(src_mm, src_addr, dst_mm, dst_addr, len, flags, &uf);
+
+	up_write(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		up_write(&src_mm->mmap_sem);
+
+	userfaultfd_unmap_complete(dst_mm, &uf);
+	mmput(src_mm);
+	mmput(dst_mm);
+out_put_task:
+	put_task_struct(task);
+	return ret;
+}
+
 SYSCALL_DEFINE6(process_vm_readv, pid_t, pid, const struct iovec __user *, lvec,
 		unsigned long, liovcnt, const struct iovec __user *, rvec,
 		unsigned long, riovcnt,	unsigned long, flags)
@@ -310,6 +374,13 @@ SYSCALL_DEFINE6(process_vm_writev, pid_t, pid,
 	return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, 1);
 }
 
+SYSCALL_DEFINE5(process_vm_mmap, pid_t, pid,
+		unsigned long, src_addr, unsigned long, len,
+		unsigned long, dst_addr, unsigned long, flags)
+{
+	return process_vm_mmap(pid, src_addr, len, dst_addr, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t



* Re: [PATCH RFC 5/5] mm: Add process_vm_mmap()
  2019-05-15 15:11 ` [PATCH RFC 5/5] mm: Add process_vm_mmap() Kirill Tkhai
From: Kees Cook @ 2019-05-15 18:29 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan,
	linux-kernel, linux-mm, Jann Horn, Matthew Wilcox

On Wed, May 15, 2019 at 06:11:44PM +0300, Kirill Tkhai wrote:
> This adds a new syscall to map a VMA from or to another
> process. The flag PVMMAP_FIXED may be specified;
> its meaning is similar to mmap()'s MAP_FIXED.
> 
> @pid > 0 means to map from the process of @pid to current,
> @pid < 0 means to map from current to the @pid process.
> 
> VMAs are merged on the destination, i.e. if the source task
> has a VMA covering [start, end), and we map it sequentially
> twice:
> 
> process_vm_mmap(@pid, start, (end - start)/2, ...);
> process_vm_mmap(@pid, start + (end - start)/2, (end - start)/2, ...);
> 
> the destination task will have a single VMA covering [start, end).
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
> [...]
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index abd238d0f7a4..44cb6cf77e93 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -28,6 +28,11 @@
>  /* 0x0100 - 0x80000 flags are defined in asm-generic/mman.h */
>  #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
>  
> +/*
> + * Flags for process_vm_mmap
> + */
> +#define PVMMAP_FIXED	0x01

I think PVMMAP_FIXED_NOREPLACE should be included from the start too. It
seems like providing the "do not overwrite existing remote mapping"
from the start would be good. :)
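
Something like this, say (the value is hypothetical, simply mirroring
the existing MAP_FIXED/MAP_FIXED_NOREPLACE pairing):

	#define PVMMAP_FIXED		0x01
	#define PVMMAP_FIXED_NOREPLACE	0x02	/* don't unmap what's already there */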

> [...]
> +unsigned long mmap_process_vm(struct mm_struct *src_mm,
> +			      unsigned long src_addr,
> +			      struct mm_struct *dst_mm,
> +			      unsigned long dst_addr,
> +			      unsigned long len,
> +			      unsigned long flags,
> +			      struct list_head *uf)
> +{
> +	struct vm_area_struct *src_vma = find_vma(src_mm, src_addr);
> +	unsigned long gua_flags = 0;
> +	unsigned long ret;
> +
> +	if (!src_vma || src_vma->vm_start > src_addr)
> +		return -EFAULT;
> +	if (len > src_vma->vm_end - src_addr)
> +		return -EFAULT;
> +	if (src_vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
> +		return -EFAULT;
> +	if (is_vm_hugetlb_page(src_vma) || (src_vma->vm_flags & VM_IO))
> +		return -EINVAL;
> +        if (dst_mm->map_count + 2 > sysctl_max_map_count)
> +                return -ENOMEM;

whitespace damage? Also, I think this should be:

	if (dst_mm->map_count >= sysctl_max_map_count - 2) ...

> +	if (!IS_NULL_VM_UFFD_CTX(&src_vma->vm_userfaultfd_ctx))
> +		return -ENOTSUPP;

Are these various checks from other places? I see similar things in
vma_to_resize(). Should these be collected in a single helper for common
checks in various places?
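
Something like this, perhaps (just a sketch; the helper name and the
exact set of checks are open questions):

	static int check_vma_remappable(struct vm_area_struct *vma,
					unsigned long addr, unsigned long len)
	{
		if (!vma || vma->vm_start > addr)
			return -EFAULT;
		if (len > vma->vm_end - addr)
			return -EFAULT;
		if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
			return -EFAULT;
		if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_IO))
			return -EINVAL;
		return 0;
	}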

> +
> +	if (src_vma->vm_flags & VM_SHARED)
> +		gua_flags |= MAP_SHARED;
> +	else
> +		gua_flags |= MAP_PRIVATE;
> +	if (vma_is_anonymous(src_vma) || vma_is_shmem(src_vma))
> +		gua_flags |= MAP_ANONYMOUS;
> +	if (flags & PVMMAP_FIXED)
> +		gua_flags |= MAP_FIXED;

And obviously add MAP_FIXED_NOREPLACE here too...

> +	ret = get_unmapped_area(src_vma->vm_file, dst_addr, len,
> +				src_vma->vm_pgoff +
> +				((src_addr - src_vma->vm_start) >> PAGE_SHIFT),
> +				gua_flags);
> +	if (offset_in_page(ret))
> +                return ret;
> +	dst_addr = ret;
> +
> +	/* Check against address space limit. */
> +	if (!may_expand_vm(dst_mm, src_vma->vm_flags, len >> PAGE_SHIFT)) {
> +		unsigned long nr_pages;
> +
> +		nr_pages = count_vma_pages_range(dst_mm, dst_addr, dst_addr + len);
> +		if (!may_expand_vm(dst_mm, src_vma->vm_flags,
> +					(len >> PAGE_SHIFT) - nr_pages))
> +			return -ENOMEM;
> +	}
> +
> +	ret = do_mmap_process_vm(src_vma, src_addr, dst_mm, dst_addr, len, uf);
> +	if (ret)
> +                return ret;
> +
> +	return dst_addr;
> +}
> +
>  /*
>   * Return true if the calling process may expand its vm space by the passed
>   * number of pages
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index a447092d4635..7fca2c5c7edd 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -17,6 +17,8 @@
>  #include <linux/ptrace.h>
>  #include <linux/slab.h>
>  #include <linux/syscalls.h>
> +#include <linux/mman.h>
> +#include <linux/userfaultfd_k.h>
>  
>  #ifdef CONFIG_COMPAT
>  #include <linux/compat.h>
> @@ -295,6 +297,68 @@ static ssize_t process_vm_rw(pid_t pid,
>  	return rc;
>  }
>  
> +static unsigned long process_vm_mmap(pid_t pid, unsigned long src_addr,
> +				     unsigned long len, unsigned long dst_addr,
> +				     unsigned long flags)
> +{
> +	struct mm_struct *src_mm, *dst_mm;
> +	struct task_struct *task;
> +	unsigned long ret;
> +	int depth = 0;
> +	LIST_HEAD(uf);
> +
> +	len = PAGE_ALIGN(len);
> +	src_addr = round_down(src_addr, PAGE_SIZE);
> +	if (flags & PVMMAP_FIXED)
> +		dst_addr = round_down(dst_addr, PAGE_SIZE);
> +	else
> +		dst_addr = round_hint_to_min(dst_addr);
> +
> +	if ((flags & ~PVMMAP_FIXED) || len == 0 || len > TASK_SIZE ||
> +	    src_addr == 0 || dst_addr > TASK_SIZE - len)
> +		return -EINVAL;

And PVMMAP_FIXED_NOREPLACE here...

> +	task = find_get_task_by_vpid(pid > 0 ? pid : -pid);
> +	if (!task)
> +		return -ESRCH;
> +	if (unlikely(task->flags & PF_KTHREAD)) {
> +		ret = -EINVAL;
> +		goto out_put_task;
> +	}
> +
> +	src_mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> +	if (!src_mm || IS_ERR(src_mm)) {
> +		ret = IS_ERR(src_mm) ? PTR_ERR(src_mm) : -ESRCH;
> +		goto out_put_task;
> +	}
> +	dst_mm = current->mm;
> +	mmget(dst_mm);
> +
> +	if (pid < 0)
> +		swap(src_mm, dst_mm);
> +
> +	/* Double lock mm in address order: smallest is the first */
> +	if (src_mm < dst_mm) {
> +		down_write(&src_mm->mmap_sem);
> +		depth = SINGLE_DEPTH_NESTING;
> +	}
> +	down_write_nested(&dst_mm->mmap_sem, depth);
> +	if (src_mm > dst_mm)
> +		down_write_nested(&src_mm->mmap_sem, SINGLE_DEPTH_NESTING);
> +
> +	ret = mmap_process_vm(src_mm, src_addr, dst_mm, dst_addr, len, flags, &uf);
> +
> +	up_write(&dst_mm->mmap_sem);
> +	if (dst_mm != src_mm)
> +		up_write(&src_mm->mmap_sem);
> +
> +	userfaultfd_unmap_complete(dst_mm, &uf);
> +	mmput(src_mm);
> +	mmput(dst_mm);
> +out_put_task:
> +	put_task_struct(task);
> +	return ret;
> +}
> +
>  SYSCALL_DEFINE6(process_vm_readv, pid_t, pid, const struct iovec __user *, lvec,
>  		unsigned long, liovcnt, const struct iovec __user *, rvec,
>  		unsigned long, riovcnt,	unsigned long, flags)
> @@ -310,6 +374,13 @@ SYSCALL_DEFINE6(process_vm_writev, pid_t, pid,
>  	return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, 1);
>  }
>  
> +SYSCALL_DEFINE5(process_vm_mmap, pid_t, pid,
> +		unsigned long, src_addr, unsigned long, len,
> +		unsigned long, dst_addr, unsigned long, flags)
> +{
> +	return process_vm_mmap(pid, src_addr, len, dst_addr, flags);
> +}
> +
>  #ifdef CONFIG_COMPAT
>  
>  static ssize_t
> 

Looks pretty interesting. I do wonder about "ATTACH" being a sufficient
description of this feature. "Give me your VMA" and "take this VMA"
is quite a bit stronger than "give me a copy of that memory" and "I
will write to your memory" in the sense that memory content changes are
now happening directly instead of through syscalls. But it's not much
different from regular shared memory, so, I guess it's fine? :)

-- 
Kees Cook


* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
From: Jann Horn @ 2019-05-15 18:46 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
	Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
	shakeelb, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse, Mel Gorman, daniel.m.jordan, kernel list,
	Linux-MM

On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
>
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
[...]
> This syscall may be used for page servers like in example
> above, for migration (I assume, even virtual machines may
> want something like this), for zero-copy desiring users
> of process_vm_writev() and process_vm_readv(), for debug
> purposes, etc. It requires the same permittions like
> existing proc_vm_xxx() syscalls have.

Have you considered using userfaultfd instead? userfaultfd has
interfaces (UFFDIO_COPY and UFFDIO_ZEROPAGE) for directly shoving pages
into the VMAs of other processes. This works without the churn of
creating and merging VMAs all the time. userfaultfd is the interface
that was written to support virtual machine migration (and it supports
live migration, too).
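
For reference, a minimal sketch of the UFFDIO_COPY side (struct and
ioctl as declared in <linux/userfaultfd.h>; uffd is assumed to be a
userfaultfd descriptor already registered over the destination range):

	struct uffdio_copy copy = {
		.dst  = dst_addr,		/* page-aligned destination */
		.src  = (unsigned long)buf,	/* local buffer holding the data */
		.len  = n * PAGE_SIZE,
		.mode = 0,
	};
	if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
		/* on failure, copy.copy holds the bytes copied so far */
		perror("UFFDIO_COPY");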


* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
From: Adam Borowski @ 2019-05-15 19:38 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm

On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
> 
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:

I wonder, why not optimize the existing interfaces to do zero copy if
properly aligned?  No need for a new syscall, and old code would immediately
benefit.

> There are several problems with process_vm_writev() in this example:
> 
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);
> 
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;
> 
> 3)received data has no a chance to be properly swapped for
>   a long time.

That'll handle all of your above problems, except for making pages
subject to CoW if written to.  But if making pages writeably shared is
desired, the old functions have a "flags" argument that doesn't yet have a
single bit defined.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄⠀⠀⠀⠀ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 5/5] mm: Add process_vm_mmap()
  2019-05-15 18:29   ` Kees Cook
@ 2019-05-16 12:54     ` Kirill Tkhai
  0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2019-05-16 12:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan,
	linux-kernel, linux-mm, Jann Horn, Matthew Wilcox

Hi, Kees,

On 15.05.2019 21:29, Kees Cook wrote:
> On Wed, May 15, 2019 at 06:11:44PM +0300, Kirill Tkhai wrote:
>> This adds a new syscall to map from or to another
>> process vma. Flag PVMMAP_FIXED may be specified,
>> its meaning is similar to mmap()'s MAP_FIXED.
>>
>> @pid > 0 means to map from process of @pid to current,
>> @pid < 0 means to map from current to @pid process.
>>
>> VMA are merged on destination, i.e. if source task
>> has VMA with address [start; end], and we map it sequentially
>> twice:
>>
>> process_vm_mmap(@pid, start, start + (end - start)/2, ...);
>> process_vm_mmap(@pid, start + (end - start)/2, end,   ...);
>>
>> the destination task will have single vma [start, end].
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>> [...]
>> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
>> index abd238d0f7a4..44cb6cf77e93 100644
>> --- a/include/uapi/asm-generic/mman-common.h
>> +++ b/include/uapi/asm-generic/mman-common.h
>> @@ -28,6 +28,11 @@
>>  /* 0x0100 - 0x80000 flags are defined in asm-generic/mman.h */
>>  #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
>>  
>> +/*
>> + * Flags for process_vm_mmap
>> + */
>> +#define PVMMAP_FIXED	0x01
> 
> I think PVMMAP_FIXED_NOREPLACE should be included from the start too. It
> seems like providing the "do not overwrite existing remote mapping"
> from the start would be good. :)

Good idea :)
>> [...]
>> +unsigned long mmap_process_vm(struct mm_struct *src_mm,
>> +			      unsigned long src_addr,
>> +			      struct mm_struct *dst_mm,
>> +			      unsigned long dst_addr,
>> +			      unsigned long len,
>> +			      unsigned long flags,
>> +			      struct list_head *uf)
>> +{
>> +	struct vm_area_struct *src_vma = find_vma(src_mm, src_addr);
>> +	unsigned long gua_flags = 0;
>> +	unsigned long ret;
>> +
>> +	if (!src_vma || src_vma->vm_start > src_addr)
>> +		return -EFAULT;
>> +	if (len > src_vma->vm_end - src_addr)
>> +		return -EFAULT;
>> +	if (src_vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
>> +		return -EFAULT;
>> +	if (is_vm_hugetlb_page(src_vma) || (src_vma->vm_flags & VM_IO))
>> +		return -EINVAL;
>> +        if (dst_mm->map_count + 2 > sysctl_max_map_count)
>> +                return -ENOMEM;
> 
> whitespace damage? Also, I think this should be:
> 
> 	if (dst_mm->map_count >= sysctl_max_map_count - 2) ...

Sure, thanks.

>> +	if (!IS_NULL_VM_UFFD_CTX(&src_vma->vm_userfaultfd_ctx))
>> +		return -ENOTSUPP;
> 
> Are these various checks from other places? I see simliar things in
> vma_to_resize(). Should these be collected in a single helper for common
> checks in various places?

Yes, some of them are from other places. The VM_DONTEXPAND check is there
because we want to filter out mappings like AIO rings. Cloning a VM_IO vma
may require additional permissions (and more work), so it's not supported
for now.

It looks like we may move find_vma() and the first four checks into a
helper common with the function you pointed at.
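
Something like the following rough sketch (hypothetical helper name)
could serve both callers:

	/* Look up the VMA covering [addr, addr + len) and apply the
	 * shared sanity checks. */
	static struct vm_area_struct *vma_lookup_checked(struct mm_struct *mm,
							 unsigned long addr,
							 unsigned long len)
	{
		struct vm_area_struct *vma = find_vma(mm, addr);

		if (!vma || vma->vm_start > addr)
			return ERR_PTR(-EFAULT);
		if (len > vma->vm_end - addr)
			return ERR_PTR(-EFAULT);
		/* Filter out mappings like AIO rings. */
		if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
			return ERR_PTR(-EFAULT);
		return vma;
	}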

>> +
>> +	if (src_vma->vm_flags & VM_SHARED)
>> +		gua_flags |= MAP_SHARED;
>> +	else
>> +		gua_flags |= MAP_PRIVATE;
>> +	if (vma_is_anonymous(src_vma) || vma_is_shmem(src_vma))
>> +		gua_flags |= MAP_ANONYMOUS;
>> +	if (flags & PVMMAP_FIXED)
>> +		gua_flags |= MAP_FIXED;
> 
> And obviously add MAP_FIXED_NOREPLACE here too...
>> +	ret = get_unmapped_area(src_vma->vm_file, dst_addr, len,
>> +				src_vma->vm_pgoff +
>> +				((src_addr - src_vma->vm_start) >> PAGE_SHIFT),
>> +				gua_flags);
>> +	if (offset_in_page(ret))
>> +                return ret;
>> +	dst_addr = ret;
>> +
>> +	/* Check against address space limit. */
>> +	if (!may_expand_vm(dst_mm, src_vma->vm_flags, len >> PAGE_SHIFT)) {
>> +		unsigned long nr_pages;
>> +
>> +		nr_pages = count_vma_pages_range(dst_mm, dst_addr, dst_addr + len);
>> +		if (!may_expand_vm(dst_mm, src_vma->vm_flags,
>> +					(len >> PAGE_SHIFT) - nr_pages))
>> +			return -ENOMEM;
>> +	}
>> +
>> +	ret = do_mmap_process_vm(src_vma, src_addr, dst_mm, dst_addr, len, uf);
>> +	if (ret)
>> +                return ret;
>> +
>> +	return dst_addr;
>> +}
>> +
>>  /*
>>   * Return true if the calling process may expand its vm space by the passed
>>   * number of pages
>> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
>> index a447092d4635..7fca2c5c7edd 100644
>> --- a/mm/process_vm_access.c
>> +++ b/mm/process_vm_access.c
>> @@ -17,6 +17,8 @@
>>  #include <linux/ptrace.h>
>>  #include <linux/slab.h>
>>  #include <linux/syscalls.h>
>> +#include <linux/mman.h>
>> +#include <linux/userfaultfd_k.h>
>>  
>>  #ifdef CONFIG_COMPAT
>>  #include <linux/compat.h>
>> @@ -295,6 +297,68 @@ static ssize_t process_vm_rw(pid_t pid,
>>  	return rc;
>>  }
>>  
>> +static unsigned long process_vm_mmap(pid_t pid, unsigned long src_addr,
>> +				     unsigned long len, unsigned long dst_addr,
>> +				     unsigned long flags)
>> +{
>> +	struct mm_struct *src_mm, *dst_mm;
>> +	struct task_struct *task;
>> +	unsigned long ret;
>> +	int depth = 0;
>> +	LIST_HEAD(uf);
>> +
>> +	len = PAGE_ALIGN(len);
>> +	src_addr = round_down(src_addr, PAGE_SIZE);
>> +	if (flags & PVMMAP_FIXED)
>> +		dst_addr = round_down(dst_addr, PAGE_SIZE);
>> +	else
>> +		dst_addr = round_hint_to_min(dst_addr);
>> +
>> +	if ((flags & ~PVMMAP_FIXED) || len == 0 || len > TASK_SIZE ||
>> +	    src_addr == 0 || dst_addr > TASK_SIZE - len)
>> +		return -EINVAL;
> 
> And PVMMAP_FIXED_NOREPLACE here...
> 
>> +	task = find_get_task_by_vpid(pid > 0 ? pid : -pid);
>> +	if (!task)
>> +		return -ESRCH;
>> +	if (unlikely(task->flags & PF_KTHREAD)) {
>> +		ret = -EINVAL;
>> +		goto out_put_task;
>> +	}
>> +
>> +	src_mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
>> +	if (!src_mm || IS_ERR(src_mm)) {
>> +		ret = IS_ERR(src_mm) ? PTR_ERR(src_mm) : -ESRCH;
>> +		goto out_put_task;
>> +	}
>> +	dst_mm = current->mm;
>> +	mmget(dst_mm);
>> +
>> +	if (pid < 0)
>> +		swap(src_mm, dst_mm);
>> +
>> +	/* Double lock mm in address order: smallest is the first */
>> +	if (src_mm < dst_mm) {
>> +		down_write(&src_mm->mmap_sem);
>> +		depth = SINGLE_DEPTH_NESTING;
>> +	}
>> +	down_write_nested(&dst_mm->mmap_sem, depth);
>> +	if (src_mm > dst_mm)
>> +		down_write_nested(&src_mm->mmap_sem, SINGLE_DEPTH_NESTING);
>> +
>> +	ret = mmap_process_vm(src_mm, src_addr, dst_mm, dst_addr, len, flags, &uf);
>> +
>> +	up_write(&dst_mm->mmap_sem);
>> +	if (dst_mm != src_mm)
>> +		up_write(&src_mm->mmap_sem);
>> +
>> +	userfaultfd_unmap_complete(dst_mm, &uf);
>> +	mmput(src_mm);
>> +	mmput(dst_mm);
>> +out_put_task:
>> +	put_task_struct(task);
>> +	return ret;
>> +}
>> +
>>  SYSCALL_DEFINE6(process_vm_readv, pid_t, pid, const struct iovec __user *, lvec,
>>  		unsigned long, liovcnt, const struct iovec __user *, rvec,
>>  		unsigned long, riovcnt,	unsigned long, flags)
>> @@ -310,6 +374,13 @@ SYSCALL_DEFINE6(process_vm_writev, pid_t, pid,
>>  	return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, 1);
>>  }
>>  
>> +SYSCALL_DEFINE5(process_vm_mmap, pid_t, pid,
>> +		unsigned long, src_addr, unsigned long, len,
>> +		unsigned long, dst_addr, unsigned long, flags)
>> +{
>> +	return process_vm_mmap(pid, src_addr, len, dst_addr, flags);
>> +}
>> +
>>  #ifdef CONFIG_COMPAT
>>  
>>  static ssize_t
>>
> 
> Looks pretty interesting. I do wonder about "ATTACH" being a sufficient
> description of this feature. "Give me your VMA" and "take this VMA"
> is quite a bit stronger than "give me a copy of that memory" and "I
> will write to your memory" in the sense that memory content changes are
> now happening directly instead of through syscalls. But it's not much
> different from regular shared memory, so, I guess it's fine? :)

Yeah, conceptually it's similar to regular shared memory and to the
duplication of VMAs and PTEs we do during fork(). There are not many
differences. Next time I'll improve the description to focus more on the
VMA than on the memory.

Thanks for your comments.

Kirill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-15 18:46 ` [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping Jann Horn
@ 2019-05-16 13:02   ` Kirill Tkhai
  2019-05-16 13:14     ` Jann Horn
  0 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2019-05-16 13:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
	Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
	shakeelb, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse, Mel Gorman, daniel.m.jordan, kernel list,
	Linux-MM

Hi, Jann,

On 15.05.2019 21:46, Jann Horn wrote:
> On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a mapping from a process to another process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
>>
>> For example, it allows to make a zero copy of data,
>> when process_vm_writev() was previously used:
> [...]
>> This syscall may be used for page servers like in example
>> above, for migration (I assume, even virtual machines may
>> want something like this), for zero-copy desiring users
>> of process_vm_writev() and process_vm_readv(), for debug
>> purposes, etc. It requires the same permittions like
>> existing proc_vm_xxx() syscalls have.
> 
> Have you considered using userfaultfd instead? userfaultfd has
> interfaces (UFFDIO_COPY and UFFDIO_ZERO) for directly shoving pages
> into the VMAs of other processes. This works without the churn of
> creating and merging VMAs all the time. userfaultfd is the interface
> that was written to support virtual machine migration (and it supports
> live migration, too).

I know about userfaultfd, but it does not solve the discussed problem.
It allocates new pages to make UFFDIO_COPY (see mcopy_atomic_pte()),
so it accumulates all the disadvantages the example from the [0/5]
message has.

Kirill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-15 19:38 ` Adam Borowski
@ 2019-05-16 13:10   ` Kirill Tkhai
  2019-05-16 13:42     ` Adam Borowski
  0 siblings, 1 reply; 20+ messages in thread
From: Kirill Tkhai @ 2019-05-16 13:10 UTC (permalink / raw)
  To: Adam Borowski
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm

Hi, Adam,

On 15.05.2019 22:38, Adam Borowski wrote:
> On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a mapping from a process to another process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
>>
>> For example, it allows to make a zero copy of data,
>> when process_vm_writev() was previously used:
> 
> I wonder, why not optimize the existing interfaces to do zero copy if
> properly aligned?  No need for a new syscall, and old code would immediately
> benefit.

Because this is just not possible. You can't zero-copy anonymous pages
of a process into pages of a remote process when they are different pages.

>> There are several problems with process_vm_writev() in this example:
>>
>> 1)it causes pagefault on remote process memory, and it forces
>>   allocation of a new page (if was not preallocated);
>>
>> 2)amount of memory for this example is doubled in a moment --
>>   n pages in current and n pages in remote tasks are occupied
>>   at the same time;
>>
>> 3)received data has no a chance to be properly swapped for
>>   a long time.
> 
> That'll handle all of your above problems, except for making pages
> subject to CoW if written to.  But if making pages writeably shared is
> desired, the old functions have a "flags" argument that doesn't yet have a
> single bit defined.

Kirill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-16 13:02   ` Kirill Tkhai
@ 2019-05-16 13:14     ` Jann Horn
  0 siblings, 0 replies; 20+ messages in thread
From: Jann Horn @ 2019-05-16 13:14 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
	Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
	Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse, Mel Gorman, daniel.m.jordan, kernel list,
	Linux-MM

On Thu, May 16, 2019 at 3:03 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 15.05.2019 21:46, Jann Horn wrote:
> > On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> This patchset adds a new syscall, which makes possible
> >> to clone a mapping from a process to another process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> >>
> >> For example, it allows to make a zero copy of data,
> >> when process_vm_writev() was previously used:
> > [...]
> >> This syscall may be used for page servers like in example
> >> above, for migration (I assume, even virtual machines may
> >> want something like this), for zero-copy desiring users
> >> of process_vm_writev() and process_vm_readv(), for debug
> >> purposes, etc. It requires the same permittions like
> >> existing proc_vm_xxx() syscalls have.
> >
> > Have you considered using userfaultfd instead? userfaultfd has
> > interfaces (UFFDIO_COPY and UFFDIO_ZERO) for directly shoving pages
> > into the VMAs of other processes. This works without the churn of
> > creating and merging VMAs all the time. userfaultfd is the interface
> > that was written to support virtual machine migration (and it supports
> > live migration, too).
>
> I know about userfaultfd, but it does not solve the discussed problem.
> It allocates new pages to make UFFDIO_COPY (see mcopy_atomic_pte()),
> so it accumulates all the disadvantages the example from the [0/5]
> message has.

Sorry, right, I misremembered that.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping Kirill Tkhai
                   ` (6 preceding siblings ...)
  2019-05-15 19:38 ` Adam Borowski
@ 2019-05-16 13:30 ` Michal Hocko
  2019-05-16 13:52   ` Michal Hocko
  2019-05-16 13:32 ` Jann Horn
  8 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2019-05-16 13:30 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm, linux-api

[You are defining a new user visible API, please always add linux-api
 mailing list - now done]

On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
> 
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
> 
> 	struct iovec local_iov, remote_iov;
> 	void *buf;
> 
> 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
> 	recv(sock, buf, n * PAGE_SIZE, 0);
> 
> 	local_iov->iov_base = buf;
> 	local_iov->iov_len = n * PAGE_SIZE;
> 	remove_iov = ...;
> 
> 	process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
> 	munmap(buf, n * PAGE_SIZE);
> 
> 	(Note, that above completely ignores error handling)
> 
> There are several problems with process_vm_writev() in this example:
> 
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);
> 
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;
> 
> 3)received data has no a chance to be properly swapped for
>   a long time.
> 
> The third is the most critical in case of remote process touches
> the data pages some time after process_vm_writev() was made.
> Imagine, node is under memory pressure:
> 
> a)kernel moves @buf pages into swap right after recv();
> b)process_vm_writev() reads the data back from swap to pages;
> c)process_vm_writev() allocates duplicate pages in remote
>   process and populates them;
> d)munmap() unmaps @buf;
> e)5 minutes later remote task touches data.
> 
> In stages "a" and "b" kernel submits unneeded IO and makes
> system IO throughput worse. To make "b" and "c", kernel
> reclaims memory, and moves pages of some other processes
> to swap, so they have to read pages from swap back. Also,
> unneeded copying of pages is occured, while zero-copy is
> more preferred.
> 
> We observe similar problem during online migration of big enough
> containers, when after doubling of container's size, the time
> increases 100 times. The system resides under high IO and
> throwing out of useful cashes.
> 
> The proposed syscall aims to introduce an interface, which
> supplements currently existing process_vm_writev() and
> process_vm_readv(), and allows to solve the problem with
> anonymous memory transfer. The above example may be rewritten as:
> 
> 	void *buf;
> 
> 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
> 	recv(sock, buf, n * PAGE_SIZE, 0);
> 
> 	/* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> 	process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> 	munmap(buf, n * PAGE_SIZE);
> 
> It is swap-friendly: in case of memory is swapped right after recv(),
> the syscall just copies pagetable entries like we do on fork(),
> so real access to pages does not occurs, and no IO is needed.
> No excess pages are reclaimed, and number of pages is not doubled.
> Also, zero-copy takes a place, and this also reduces overhead.
> 
> The patchset does not introduce much new code, since we simply
> reuse existing copy_page_range() and copy_vma() functions.
> We extend copy_vma() to be able merge VMAs in remote task [2/5],
> and teach copy_page_range() to work with different local and
> remote addresses [3/5]. Patch [5/5] introduces the syscall logic,
> which mostly consists of sanity checks. The rest of patches
> are preparations.
> 
> This syscall may be used for page servers like in example
> above, for migration (I assume, even virtual machines may
> want something like this), for zero-copy desiring users
> of process_vm_writev() and process_vm_readv(), for debug
> purposes, etc. It requires the same permittions like
> existing proc_vm_xxx() syscalls have.
> 
> The tests I used may be obtained here:
> 
> [1]https://gist.github.com/tkhai/198d32fdc001ec7812a5e1ccf091f275
> [2]https://gist.github.com/tkhai/f52dbaeedad5a699f3fb386fda676562
> 
> ---
> 
> Kirill Tkhai (5):
>       mm: Add process_vm_mmap() syscall declaration
>       mm: Extend copy_vma()
>       mm: Extend copy_page_range()
>       mm: Export round_hint_to_min()
>       mm: Add process_vm_mmap()
> 
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    2 
>  include/linux/huge_mm.h                |    6 +
>  include/linux/mm.h                     |   11 ++
>  include/linux/mm_types.h               |    2 
>  include/linux/mman.h                   |   14 +++
>  include/linux/syscalls.h               |    5 +
>  include/uapi/asm-generic/mman-common.h |    5 +
>  include/uapi/asm-generic/unistd.h      |    5 +
>  init/Kconfig                           |    9 +-
>  kernel/fork.c                          |    5 +
>  kernel/sys_ni.c                        |    2 
>  mm/huge_memory.c                       |   30 ++++--
>  mm/memory.c                            |  165 +++++++++++++++++++++-----------
>  mm/mmap.c                              |  154 ++++++++++++++++++++++++++----
>  mm/mremap.c                            |    4 -
>  mm/process_vm_access.c                 |   71 ++++++++++++++
>  17 files changed, 392 insertions(+), 99 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping Kirill Tkhai
                   ` (7 preceding siblings ...)
  2019-05-16 13:30 ` Michal Hocko
@ 2019-05-16 13:32 ` Jann Horn
  2019-05-16 13:56   ` Kirill Tkhai
  8 siblings, 1 reply; 20+ messages in thread
From: Jann Horn @ 2019-05-16 13:32 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
	Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
	Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse, Mel Gorman, daniel.m.jordan, kernel list,
	Linux-MM, Linux API

On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> This patchset adds a new syscall, which makes possible
> to clone a mapping from a process to another process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
[...]
> The proposed syscall aims to introduce an interface, which
> supplements currently existing process_vm_writev() and
> process_vm_readv(), and allows to solve the problem with
> anonymous memory transfer. The above example may be rewritten as:
>
>         void *buf;
>
>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
>         /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>         process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>         munmap(buf, n * PAGE_SIZE);

In this specific example, an alternative would be to splice() from the
socket into /proc/$pid/mem, or something like that, right?
proc_mem_operations has no ->splice_read() at the moment, and it'd
need that to be more efficient, but that could be built without
creating new UAPI, right?
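
For illustration, the userspace side of that idea might look roughly
like this; it does not work on current kernels for exactly the reason
stated above, and since splice() requires a pipe on one side, an
intermediate pipe is assumed (error handling abbreviated):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* Hypothetical: move data from a socket into another task's
	 * memory via /proc/$pid/mem without a pass through userspace. */
	static int splice_to_remote(int sock, int mem_fd, off64_t remote_addr,
				    size_t len)
	{
		int p[2];

		if (pipe(p))
			return -1;
		while (len) {
			/* socket -> pipe */
			ssize_t n = splice(sock, NULL, p[1], NULL, len, 0);
			if (n <= 0)
				return -1;
			/* pipe -> remote address space */
			if (splice(p[0], NULL, mem_fd, &remote_addr, n, 0) != n)
				return -1;
			len -= n;
		}
		return 0;
	}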

But I guess maybe your workload is not that simple? What do you
actually do with the received data between receiving it and shoving it
over into the other process?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-16 13:10   ` Kirill Tkhai
@ 2019-05-16 13:42     ` Adam Borowski
  2019-05-16 14:25       ` Kirill Tkhai
  0 siblings, 1 reply; 20+ messages in thread
From: Adam Borowski @ 2019-05-16 13:42 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm

On Thu, May 16, 2019 at 04:10:07PM +0300, Kirill Tkhai wrote:
> On 15.05.2019 22:38, Adam Borowski wrote:
> > On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
> >> This patchset adds a new syscall, which makes possible
> >> to clone a mapping from a process to another process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> >>
> >> For example, it allows to make a zero copy of data,
> >> when process_vm_writev() was previously used:
> > 
> > I wonder, why not optimize the existing interfaces to do zero copy if
> > properly aligned?  No need for a new syscall, and old code would immediately
> > benefit.
> 
> Because this is just not possible. You can't zero-copy anonymous pages
> of a process into pages of a remote process when they are different pages.

fork() manages that, and so does KSM.  Like KSM, you want to make a page
shared -- you just skip the comparison step as you want to overwrite the old
contents.

And there's no need to touch the page, as fork() manages that fine no matter
if the page is resident, anonymous in swap, or file-backed, all without
reading from swap.

> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >>
> >> 2)amount of memory for this example is doubled in a moment --
> >>   n pages in current and n pages in remote tasks are occupied
> >>   at the same time;
> >>
> >> 3)received data has no a chance to be properly swapped for
> >>   a long time.
> > 
> > That'll handle all of your above problems, except for making pages
> > subject to CoW if written to.  But if making pages writeably shared is
> > desired, the old functions have a "flags" argument that doesn't yet have a
> > single bit defined.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄⠀⠀⠀⠀ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-16 13:30 ` Michal Hocko
@ 2019-05-16 13:52   ` Michal Hocko
  2019-05-16 14:22     ` Kirill Tkhai
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2019-05-16 13:52 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm, linux-api

On Thu 16-05-19 15:30:34, Michal Hocko wrote:
> [You are defining a new user visible API, please always add linux-api
>  mailing list - now done]
> 
> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
[...]
> > The proposed syscall aims to introduce an interface, which
> > supplements currently existing process_vm_writev() and
> > process_vm_readv(), and allows to solve the problem with
> > anonymous memory transfer. The above example may be rewritten as:
> > 
> > 	void *buf;
> > 
> > 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> > 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
> > 	recv(sock, buf, n * PAGE_SIZE, 0);
> > 
> > 	/* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> > 	process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> > 	munmap(buf, n * PAGE_SIZE);

AFAIU this means that you actually want to do an mmap of anonymous
memory with COW semantics to the remote process, right? How does the
remote process find out where and what has been mmapped? What if the
range collides? This sounds quite scary to me TBH. Why cannot you simply
use shared memory for that?
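
For reference, the shared-memory route suggested here would look
roughly like this on the sending side (error handling omitted); the fd
would then be passed to the remote task over a unix socket with
SCM_RIGHTS, and the remote task would mmap() it to see the same pages:

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	/* Receive into a memfd-backed shared mapping instead of private
	 * anonymous memory. */
	static void *recv_into_memfd(int sock, size_t len, int *fd_out)
	{
		int fd = memfd_create("xfer", 0);
		void *buf;

		ftruncate(fd, len);
		buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			   fd, 0);
		read(sock, buf, len);
		*fd_out = fd;
		return buf;
	}

The resulting remote mapping is VM_SHARED rather than private anonymous,
which is the limitation raised in the reply below.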
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-16 13:32 ` Jann Horn
@ 2019-05-16 13:56   ` Kirill Tkhai
  0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2019-05-16 13:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
	Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
	Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse, Mel Gorman, daniel.m.jordan, kernel list,
	Linux-MM, Linux API

On 16.05.2019 16:32, Jann Horn wrote:
> On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a mapping from a process to another process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
> [...]
>> The proposed syscall aims to introduce an interface, which
>> supplements currently existing process_vm_writev() and
>> process_vm_readv(), and allows to solve the problem with
>> anonymous memory transfer. The above example may be rewritten as:
>>
>>         void *buf;
>>
>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>>         /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>>         process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>>         munmap(buf, n * PAGE_SIZE);
> 
> In this specific example, an alternative would be to splice() from the
> socket into /proc/$pid/mem, or something like that, right?
> proc_mem_operations has no ->splice_read() at the moment, and it'd
> need that to be more efficient, but that could be built without
> creating new UAPI, right?

I have just never seen that socket memory may be preempted into swap.
If so, there is a fundamental problem.
But anyway, as you guessed below:
 
> But I guess maybe your workload is not that simple? What do you
> actually do with the received data between receiving it and shoving it
> over into the other process?

Data are usually sent over the socket encrypted and compressed, so there is
no possibility to go this way. You may want to do anything with the data
before passing it to another process.

Kirill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-16 13:52   ` Michal Hocko
@ 2019-05-16 14:22     ` Kirill Tkhai
  0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2019-05-16 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm, linux-api

On 16.05.2019 16:52, Michal Hocko wrote:
> On Thu 16-05-19 15:30:34, Michal Hocko wrote:
>> [You are defining a new user visible API, please always add linux-api
>>  mailing list - now done]
>>
>> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> [...]
>>> The proposed syscall aims to introduce an interface, which
>>> supplements currently existing process_vm_writev() and
>>> process_vm_readv(), and allows to solve the problem with
>>> anonymous memory transfer. The above example may be rewritten as:
>>>
>>> 	void *buf;
>>>
>>> 	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>> 		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>> 	recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> 	/* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>>> 	process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>>> 	munmap(buf, n * PAGE_SIZE);
> 
> AFAIU this means that you actually want to do an mmap of anonymous
> memory with COW semantics to the remote process, right?

Yes.

> How does the remote process find out where and what has been mmapped?

Any way it likes. Isn't this a trivial task? :) You may use a socket or any
of the appropriate Linux features to communicate between them.

> What if the range collides? This sounds quite scary to me TBH.

In case the ranges collide, the overlapping part of the old VMA becomes
unmapped, the same way we behave on ordinary mmap(). You may intersect
a range that another thread mapped, so you need synchronization between
the threads. There is no principal difference.
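
That ordinary-mmap behaviour can be seen with plain MAP_FIXED (a
minimal, self-contained illustration):

	#include <string.h>
	#include <sys/mman.h>

	void fixed_clobbers(void)
	{
		size_t sz = 4096;
		char *a = mmap(NULL, 2 * sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(a, 'x', 2 * sz);
		/* Silently replaces the second page; its old contents
		 * are gone. */
		mmap(a + sz, sz, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
		/* a[sz] now reads back as 0, not 'x'. */
	}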

Also, I'm going to add a flag to prevent unmapping, as Kees suggested.
Please see his message.

> Why cannot you simply use shared memory for that?

Because the remote task may want a specific type of VMA. It may not want
to share a VMA with its children.

Speaking of online migration, a task wants its anonymous private VMAs
to remain the same after the migration. Otherwise, imagine the situation
when the task's stack becomes a shared VMA after the migration.
Also, a task wants an anonymous mapping to remain anonymous.

In general, if shared memory were enough for everything, we would never
have had the process_vm_writev() and process_vm_readv() syscalls.

Kirill

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-16 13:42     ` Adam Borowski
@ 2019-05-16 14:25       ` Kirill Tkhai
  0 siblings, 0 replies; 20+ messages in thread
From: Kirill Tkhai @ 2019-05-16 14:25 UTC (permalink / raw)
  To: Adam Borowski
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
	vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
	shakeelb, guro, aarcange, hughd, jglisse, mgorman,
	daniel.m.jordan, linux-kernel, linux-mm

On 16.05.2019 16:42, Adam Borowski wrote:
> On Thu, May 16, 2019 at 04:10:07PM +0300, Kirill Tkhai wrote:
>> On 15.05.2019 22:38, Adam Borowski wrote:
>>> On Wed, May 15, 2019 at 06:11:15PM +0300, Kirill Tkhai wrote:
>>>> This patchset adds a new syscall, which makes possible
>>>> to clone a mapping from a process to another process.
>>>> The syscall supplements the functionality provided
>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>> and it may be useful in many situation.
>>>>
>>>> For example, it allows to make a zero copy of data,
>>>> when process_vm_writev() was previously used:
>>>
>>> I wonder, why not optimize the existing interfaces to do zero copy if
>>> properly aligned?  No need for a new syscall, and old code would immediately
>>> benefit.
>>
>> Because this is just not possible. You can't zero-copy anonymous pages
>> of a process into pages of a remote process when they are different pages.
> 
> fork() manages that, and so does KSM.  Like KSM, you want to make a page
> shared -- you just skip the comparison step as you want to overwrite the old
> contents.
> 
> And there's no need to touch the page, as fork() manages that fine no matter
> if the page is resident, anonymous in swap, or file-backed, all without
> reading from swap.

Yes, and if you dive into the patchset, you will find that the new syscall
manages page table entries in the same way fork() does.
 
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>>
>>>> 2)amount of memory for this example is doubled in a moment --
>>>>   n pages in current and n pages in remote tasks are occupied
>>>>   at the same time;
>>>>
>>>> 3)received data has no a chance to be properly swapped for
>>>>   a long time.
>>>
>>> That'll handle all of your above problems, except for making pages
>>> subject to CoW if written to.  But if making pages writeably shared is
>>> desired, the old functions have a "flags" argument that doesn't yet have a
>>> single bit defined.
> 
> 
> Meow!
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-05-16 14:25 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-15 15:11 [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping Kirill Tkhai
2019-05-15 15:11 ` [PATCH RFC 1/5] mm: Add process_vm_mmap() syscall declaration Kirill Tkhai
2019-05-15 15:11 ` [PATCH RFC 2/5] mm: Extend copy_vma() Kirill Tkhai
2019-05-15 15:11 ` [PATCH RFC 3/5] mm: Extend copy_page_range() Kirill Tkhai
2019-05-15 15:11 ` [PATCH RFC 4/5] mm: Export round_hint_to_min() Kirill Tkhai
2019-05-15 15:11 ` [PATCH RFC 5/5] mm: Add process_vm_mmap() Kirill Tkhai
2019-05-15 18:29   ` Kees Cook
2019-05-16 12:54     ` Kirill Tkhai
2019-05-15 18:46 ` [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping Jann Horn
2019-05-16 13:02   ` Kirill Tkhai
2019-05-16 13:14     ` Jann Horn
2019-05-15 19:38 ` Adam Borowski
2019-05-16 13:10   ` Kirill Tkhai
2019-05-16 13:42     ` Adam Borowski
2019-05-16 14:25       ` Kirill Tkhai
2019-05-16 13:30 ` Michal Hocko
2019-05-16 13:52   ` Michal Hocko
2019-05-16 14:22     ` Kirill Tkhai
2019-05-16 13:32 ` Jann Horn
2019-05-16 13:56   ` Kirill Tkhai
