linux-mm.kvack.org archive mirror
* [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping
@ 2019-05-20 14:00 Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration Kirill Tkhai
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

v2: Add PVMMAP_FIXED_NOREPLACE flag.
    Use the find_vma_without_flags() and may_mmap_overlapped_region()
    helpers, so even more code is reused.
    The syscall number is changed.
    Fix whitespace.

    Prohibited cloning from the local process into a remote process.
    Only mapping into the local process's mm is allowed, since I
    initially missed that get_unmapped_area() can't be used for a
    remote process. This may be solved quite simply by passing an @mm
    argument to all .get_unmapped_area handlers. In this patchset I
    don't do that, since it would add a lot of cleanup patches, which
    would hide the main logic away. I'm going to send them later, as
    another series, after we finish with this one.

[Summary]

New syscall, which allows cloning a remote process's VMA
into the local process's VM. The remote process's page table
entries related to the VMA are cloned into the local process's
page table (at any desired address, which makes this different
from what happens during fork()). Huge pages are handled
appropriately.

This allows a significant performance improvement, as shown
in the example below.

[Description]

This patchset adds a new syscall, which makes it possible
to clone a VMA from another process into the current process.
The syscall supplements the functionality provided
by the process_vm_writev() and process_vm_readv() syscalls,
and it may be useful in many situations.

For example, it allows a zero-copy transfer of data
where process_vm_writev() was previously used:

	struct iovec local_iov, remote_iov;
	void *buf;

	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
	recv(sock, buf, n * PAGE_SIZE, 0);

	local_iov.iov_base = buf;
	local_iov.iov_len = n * PAGE_SIZE;
	remote_iov = ...;

	process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
	munmap(buf, n * PAGE_SIZE);

	(Note that the above completely ignores error handling.)

There are several problems with process_vm_writev() in this example:

1) it causes a page fault on the remote process's memory, and it
   forces allocation of a new page (if one was not preallocated);

2) the amount of memory for this example is momentarily doubled --
   n pages in the current task and n pages in the remote task are
   occupied at the same time;

3) the received data has no chance of being properly swapped out
   for a long time.

The third problem is the most critical in case the remote process
touches the data pages some time after process_vm_writev() was made.
Imagine the node is under memory pressure:

a) the kernel moves @buf pages into swap right after recv();
b) process_vm_writev() reads the data back from swap to pages;
c) process_vm_writev() allocates duplicate pages in the remote
   process and populates them;
d) munmap() unmaps @buf;
e) 5 minutes later, the remote task touches the data.

In stages "a" and "b" the kernel submits unneeded IO and makes
system IO throughput worse. To perform "b" and "c", the kernel
reclaims memory and moves pages of some other processes to swap,
so those processes later have to read their pages back from swap.
Also, unneeded copying of pages occurs, while zero-copy would be
preferable.

We observe a similar problem during online migration of large
containers: after the container's size doubles, the migration time
increases 100 times. The system comes under high IO load and
throws out useful caches.

The proposed syscall aims to introduce an interface, which
supplements the currently existing process_vm_writev() and
process_vm_readv(), and solves the problem of anonymous
memory transfer. The above example may be rewritten as:

[Task 1]
	void *buf;

	buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, ...);
	recv(sock, buf, n * PAGE_SIZE, 0);

[Task 2]
	buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);

This creates a copy of the VMA related to buf from task1 in task2's VM.
Task1's page table entries are copied into the corresponding page table
entries of task2's VM.
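
Since there is no glibc wrapper yet, a user space call goes through
syscall(2). Below is a minimal sketch of the task2 side with error
handling; the __NR_process_vm_mmap value is the x86-64 number from
this series and may still change, and pid_of_task1, buf and n are
assumed from the example above:

	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdio.h>

	#ifndef __NR_process_vm_mmap
	#define __NR_process_vm_mmap 434	/* x86-64 number in this series */
	#endif

	/* Clone task1's pages backing [buf, buf + n * PAGE_SIZE) into
	 * our address space; the kernel chooses the destination address. */
	void *buf2 = (void *)syscall(__NR_process_vm_mmap, pid_of_task1,
				     (unsigned long)buf, n * PAGE_SIZE,
				     0UL /* dst_addr hint */, 0UL /* flags */);
	if (buf2 == (void *)-1)
		perror("process_vm_mmap");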

It is swap-friendly: in case the memory was swapped out right after
recv(), the syscall just copies page table entries like we do on
fork(), so no real access to the pages occurs, and no IO is needed.
No excess pages are reclaimed, and the number of pages is not doubled.
Also, the transfer is zero-copy, which further reduces the overhead.

The patchset does not introduce much new code, since we simply
reuse the existing copy_page_range() and copy_vma() functions.
We extend copy_vma() to be able to merge VMAs in a remote task [2/7],
and teach copy_page_range() to work with different local and
remote addresses [3/7]. Patch [7/7] introduces the syscall logic,
which mostly consists of sanity checks. The rest of the patches
are preparations.

This syscall may be used by page servers like in the example
above, for migration (I assume even virtual machines may
want something like this), by users of process_vm_writev()
and process_vm_readv() who want zero-copy, for debugging
purposes, etc. It requires the same permissions as the
existing process_vm_xxx() syscalls.

The tests I used may be obtained here (UPDATED):

[1]https://gist.github.com/tkhai/ce46502fc53580372da35e8c3b7818b9
[2]https://gist.github.com/tkhai/40bda78e304d2fe0d90863214b9ac5b5

Previous version (RFC):
[3]https://lore.kernel.org/lkml/CAG48ez0itiEE1x=SXeMbjKvMGkrj7wxjM6c+ZB00LpXAAhqmiw@mail.gmail.com/T/

---

Kirill Tkhai (7):
      mm: Add process_vm_mmap() syscall declaration
      mm: Extend copy_vma()
      mm: Extend copy_page_range()
      mm: Export round_hint_to_min()
      mm: Introduce may_mmap_overlapped_region() helper
      mm: Introduce find_vma_without_flags() helper
      mm: Add process_vm_mmap()


 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    2 
 include/linux/huge_mm.h                |    6 +
 include/linux/mm.h                     |   14 ++
 include/linux/mm_types.h               |    2 
 include/linux/mman.h                   |   14 ++
 include/linux/syscalls.h               |    5 +
 include/uapi/asm-generic/mman-common.h |    6 +
 include/uapi/asm-generic/unistd.h      |    5 +
 init/Kconfig                           |    9 +-
 kernel/fork.c                          |    5 +
 kernel/sys_ni.c                        |    2 
 mm/huge_memory.c                       |   30 ++++-
 mm/memory.c                            |  165 +++++++++++++++++++---------
 mm/mmap.c                              |  186 ++++++++++++++++++++++++++------
 mm/mremap.c                            |   43 +++++--
 mm/process_vm_access.c                 |   69 ++++++++++++
 17 files changed, 439 insertions(+), 125 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>



* [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
@ 2019-05-20 14:00 ` Kirill Tkhai
  2019-05-21  0:28   ` Ira Weiny
  2019-05-20 14:00 ` [PATCH v2 2/7] mm: Extend copy_vma() Kirill Tkhai
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

Similar to process_vm_readv() and process_vm_writev(),
add declarations for a new syscall, which will allow
mapping memory from or to another process.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 ++
 include/linux/syscalls.h               |    5 +++++
 include/uapi/asm-generic/unistd.h      |    5 ++++-
 init/Kconfig                           |    9 +++++----
 kernel/sys_ni.c                        |    2 ++
 6 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..99d6e0085576 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+434	i386	process_vm_mmap		sys_process_vm_mmap		__ia32_compat_sys_process_vm_mmap
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..46d7d2898f7a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+434	common	process_vm_mmap		__x64_sys_process_vm_mmap
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -398,3 +399,4 @@
 545	x32	execveat		__x32_compat_sys_execveat/ptregs
 546	x32	preadv2			__x32_compat_sys_preadv64v2
 547	x32	pwritev2		__x32_compat_sys_pwritev64v2
+548	x32	process_vm_mmap		__x32_compat_sys_process_vm_mmap
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..7d8ae36589cf 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -997,6 +997,11 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
 asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
+asmlinkage long sys_process_vm_mmap(pid_t pid,
+				    unsigned long src_addr,
+				    unsigned long len,
+				    unsigned long dst_addr,
+				    unsigned long flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..b7aaa5ae02da 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,12 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_process_vm_mmap 424
+__SC_COMP(__NR_process_vm_mmap, sys_process_vm_mmap, \
+          compat_sys_process_vm_mmap)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435
 
 /*
  * 32 bit systems traditionally used different
diff --git a/init/Kconfig b/init/Kconfig
index 8b9ffe236e4f..604db5f14718 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -320,13 +320,14 @@ config POSIX_MQUEUE_SYSCTL
 	default y
 
 config CROSS_MEMORY_ATTACH
-	bool "Enable process_vm_readv/writev syscalls"
+	bool "Enable process_vm_readv/writev/mmap syscalls"
 	depends on MMU
 	default y
 	help
-	  Enabling this option adds the system calls process_vm_readv and
-	  process_vm_writev which allow a process with the correct privileges
-	  to directly read from or write to another process' address space.
+	  Enabling this option adds the system calls process_vm_readv,
+	  process_vm_writev and process_vm_mmap, which allow a process
+	  with the correct privileges to directly read from or write to
+	  or mmap another process' address space.
 	  See the man page for more details.
 
 config USELIB
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..6f51634f4f7e 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -316,6 +316,8 @@ COND_SYSCALL(process_vm_readv);
 COND_SYSCALL_COMPAT(process_vm_readv);
 COND_SYSCALL(process_vm_writev);
 COND_SYSCALL_COMPAT(process_vm_writev);
+COND_SYSCALL(process_vm_mmap);
+COND_SYSCALL_COMPAT(process_vm_mmap);
 
 /* compare kernel pointers */
 COND_SYSCALL(kcmp);



* [PATCH v2 2/7] mm: Extend copy_vma()
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration Kirill Tkhai
@ 2019-05-20 14:00 ` Kirill Tkhai
  2019-05-21  8:18   ` Kirill A. Shutemov
  2019-05-20 14:00 ` [PATCH v2 3/7] mm: Extend copy_page_range() Kirill Tkhai
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

This prepares the function for copying a VMA between two
processes. Two new arguments are introduced: the destination
mm, and a flag requesting that flags and the userfaultfd
context unsuitable for a foreign mm be cleared.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/mm.h |    4 ++--
 mm/mmap.c          |   33 ++++++++++++++++++++++++---------
 mm/mremap.c        |    4 ++--
 3 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..afe07e4a76f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2329,8 +2329,8 @@ extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
 	struct rb_node **, struct rb_node *);
 extern void unlink_file_vma(struct vm_area_struct *);
 extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
-	unsigned long addr, unsigned long len, pgoff_t pgoff,
-	bool *need_rmap_locks);
+	struct mm_struct *, unsigned long addr, unsigned long len,
+	pgoff_t pgoff, bool *need_rmap_locks, bool clear_flags_ctx);
 extern void exit_mmap(struct mm_struct *);
 
 static inline int check_data_rlimit(unsigned long rlim,
diff --git a/mm/mmap.c b/mm/mmap.c
index 57803a0a3a5c..99778e724ad1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3195,19 +3195,21 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 }
 
 /*
- * Copy the vma structure to a new location in the same mm,
- * prior to moving page table entries, to effect an mremap move.
+ * Copy the vma structure to new location in the same vma
+ * prior to moving page table entries, to effect an mremap move;
  */
 struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
-	unsigned long addr, unsigned long len, pgoff_t pgoff,
-	bool *need_rmap_locks)
+				struct mm_struct *mm, unsigned long addr,
+				unsigned long len, pgoff_t pgoff,
+				bool *need_rmap_locks, bool clear_flags_ctx)
 {
 	struct vm_area_struct *vma = *vmap;
 	unsigned long vma_start = vma->vm_start;
-	struct mm_struct *mm = vma->vm_mm;
+	struct vm_userfaultfd_ctx uctx;
 	struct vm_area_struct *new_vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
 	bool faulted_in_anon_vma = true;
+	unsigned long flags;
 
 	/*
 	 * If anonymous vma has not yet been faulted, update new pgoff
@@ -3220,15 +3222,25 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
-	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			    vma->vm_userfaultfd_ctx);
+
+	uctx = vma->vm_userfaultfd_ctx;
+	flags = vma->vm_flags;
+	if (clear_flags_ctx) {
+		uctx = NULL_VM_UFFD_CTX;
+		flags &= ~(VM_UFFD_MISSING | VM_UFFD_WP | VM_MERGEABLE |
+			   VM_LOCKED | VM_LOCKONFAULT | VM_WIPEONFORK |
+			   VM_DONTCOPY);
+	}
+
+	new_vma = vma_merge(mm, prev, addr, addr + len, flags, vma->anon_vma,
+			    vma->vm_file, pgoff, vma_policy(vma), uctx);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
 		 */
 		if (unlikely(vma_start >= new_vma->vm_start &&
-			     vma_start < new_vma->vm_end)) {
+			     vma_start < new_vma->vm_end) &&
+			     vma->vm_mm == mm) {
 			/*
 			 * The only way we can get a vma_merge with
 			 * self during an mremap is if the vma hasn't
@@ -3249,6 +3261,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		new_vma = vm_area_dup(vma);
 		if (!new_vma)
 			goto out;
+		new_vma->vm_mm = mm;
+		new_vma->vm_flags = flags;
+		new_vma->vm_userfaultfd_ctx = uctx;
 		new_vma->vm_start = addr;
 		new_vma->vm_end = addr + len;
 		new_vma->vm_pgoff = pgoff;
diff --git a/mm/mremap.c b/mm/mremap.c
index 37b5b2ad91be..9a96cfc28675 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -352,8 +352,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		return err;
 
 	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
-	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
-			   &need_rmap_locks);
+	new_vma = copy_vma(&vma, mm, new_addr, new_len, new_pgoff,
+			   &need_rmap_locks, false);
 	if (!new_vma)
 		return -ENOMEM;
 



* [PATCH v2 3/7] mm: Extend copy_page_range()
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 2/7] mm: Extend copy_vma() Kirill Tkhai
@ 2019-05-20 14:00 ` Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 4/7] mm: Export round_hint_to_min() Kirill Tkhai
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

This allows copying pages not only to the same addresses
in another process, but also to a specified address.
Huge page and unaligned address cases are handled
by splitting.
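
For example, the huge-PMD case now behaves like this (a condensed
sketch of the copy_huge_pmd() hunk below, not additional code):

	if (len != HPAGE_PMD_SIZE) {
		/* The copy window does not cover the whole huge page:
		 * split the source PMD, return -EAGAIN, and let the
		 * caller fall back to copy_pte_range() for the range. */
		split_huge_pmd(vma, src_pmd, src_addr);
		return -EAGAIN;
	}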

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/huge_mm.h |    6 +-
 include/linux/mm.h      |    3 +
 kernel/fork.c           |    5 +
 mm/huge_memory.c        |   30 ++++++---
 mm/memory.c             |  165 +++++++++++++++++++++++++++++++----------------
 5 files changed, 141 insertions(+), 68 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7cd5c150c21d..1e6002ee7c44 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -9,11 +9,13 @@
 
 extern vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long dst_addr,
+			 unsigned long src_addr, unsigned long len,
 			 struct vm_area_struct *vma);
 extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
 extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-			 pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+			 pud_t *dst_pud, pud_t *src_pud, unsigned long dst_addr,
+			 unsigned long src_addr, unsigned long len,
 			 struct vm_area_struct *vma);
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
diff --git a/include/linux/mm.h b/include/linux/mm.h
index afe07e4a76f8..54328d08dbdd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1485,7 +1485,8 @@ int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+			struct vm_area_struct *vma, unsigned long dst_addr,
+			unsigned long src_addr, unsigned long src_end);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		   struct mmu_notifier_range *range,
 		   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
diff --git a/kernel/fork.c b/kernel/fork.c
index 45fde571c5dd..35f7240ed5c9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -584,7 +584,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(mm, oldmm, mpnt);
+			retval = copy_page_range(mm, oldmm, mpnt,
+						 mpnt->vm_start,
+						 mpnt->vm_start,
+						 mpnt->vm_end);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9a6b32..f338b06f42c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -956,7 +956,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long dst_addr,
+		  unsigned long src_addr, unsigned long len,
 		  struct vm_area_struct *vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
@@ -969,6 +970,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!vma_is_anonymous(vma))
 		return 0;
 
+	if (len != HPAGE_PMD_SIZE) {
+		split_huge_pmd(vma, src_pmd, src_addr);
+		return -EAGAIN;
+	}
+
 	pgtable = pte_alloc_one(dst_mm);
 	if (unlikely(!pgtable))
 		goto out;
@@ -990,12 +996,12 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			pmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
-			set_pmd_at(src_mm, addr, src_pmd, pmd);
+			set_pmd_at(src_mm, src_addr, src_pmd, pmd);
 		}
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+		set_pmd_at(dst_mm, dst_addr, dst_pmd, pmd);
 		ret = 0;
 		goto out_unlock;
 	}
@@ -1018,7 +1024,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 * reference.
 		 */
 		zero_page = mm_get_huge_zero_page(dst_mm);
-		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, dst_addr, dst_pmd,
 				zero_page);
 		ret = 0;
 		goto out_unlock;
@@ -1032,9 +1038,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
-	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmdp_set_wrprotect(src_mm, src_addr, src_pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
-	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	set_pmd_at(dst_mm, dst_addr, dst_pmd, pmd);
 
 	ret = 0;
 out_unlock:
@@ -1096,13 +1102,19 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 }
 
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+		  pud_t *dst_pud, pud_t *src_pud, unsigned long dst_addr,
+		  unsigned long src_addr, unsigned long len,
 		  struct vm_area_struct *vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
 	pud_t pud;
 	int ret;
 
+	if (len != HPAGE_PUD_SIZE) {
+		split_huge_pud(vma, src_pud, src_addr);
+		return -EAGAIN;
+	}
+
 	dst_ptl = pud_lock(dst_mm, dst_pud);
 	src_ptl = pud_lockptr(src_mm, src_pud);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -1121,9 +1133,9 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		/* No huge zero pud yet */
 	}
 
-	pudp_set_wrprotect(src_mm, addr, src_pud);
+	pudp_set_wrprotect(src_mm, src_addr, src_pud);
 	pud = pud_mkold(pud_wrprotect(pud));
-	set_pud_at(dst_mm, addr, dst_pud, pud);
+	set_pud_at(dst_mm, dst_addr, dst_pud, pud);
 
 	ret = 0;
 out_unlock:
diff --git a/mm/memory.c b/mm/memory.c
index 0d0711a912de..9d0fe2aee5f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -699,7 +699,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 static inline unsigned long
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		unsigned long src_addr, int *rss, unsigned long dst_addr)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
@@ -737,7 +737,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(*src_pte))
 					pte = pte_swp_mksoft_dirty(pte);
-				set_pte_at(src_mm, addr, src_pte, pte);
+				set_pte_at(src_mm, src_addr, src_pte, pte);
 			}
 		} else if (is_device_private_entry(entry)) {
 			page = device_private_entry_to_page(entry);
@@ -766,7 +766,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			    is_cow_mapping(vm_flags)) {
 				make_device_private_entry_read(&entry);
 				pte = swp_entry_to_pte(entry);
-				set_pte_at(src_mm, addr, src_pte, pte);
+				set_pte_at(src_mm, src_addr, src_pte, pte);
 			}
 		}
 		goto out_set_pte;
@@ -777,7 +777,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		ptep_set_wrprotect(src_mm, src_addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 
@@ -789,7 +789,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(vma, src_addr, pte);
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
@@ -810,13 +810,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 out_set_pte:
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	set_pte_at(dst_mm, dst_addr, dst_pte, pte);
 	return 0;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		   unsigned long addr, unsigned long end)
+		   unsigned long src_addr, unsigned long src_end,
+		   unsigned long dst_addr)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -828,10 +829,10 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 again:
 	init_rss_vec(rss);
 
-	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
+	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, dst_addr, &dst_ptl);
 	if (!dst_pte)
 		return -ENOMEM;
-	src_pte = pte_offset_map(src_pmd, addr);
+	src_pte = pte_offset_map(src_pmd, src_addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	orig_src_pte = src_pte;
@@ -854,11 +855,12 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			continue;
 		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
-							vma, addr, rss);
+					 vma, src_addr, rss, dst_addr);
 		if (entry.val)
 			break;
 		progress += 8;
-	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+	} while (dst_pte++, src_pte++, dst_addr += PAGE_SIZE,
+		 src_addr += PAGE_SIZE, src_addr != src_end);
 
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(src_ptl);
@@ -872,108 +874,147 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			return -ENOMEM;
 		progress = 0;
 	}
-	if (addr != end)
+	if (src_addr != src_end)
 		goto again;
 	return 0;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long src_addr, unsigned long src_end, unsigned long dst_addr)
 {
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	pmd_t *src_pmd, *dst_pmd;
-	unsigned long next;
 
-	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
+	dst_pmd = pmd_alloc(dst_mm, dst_pud, dst_addr);
 	if (!dst_pmd)
 		return -ENOMEM;
-	src_pmd = pmd_offset(src_pud, addr);
+	src_pmd = pmd_offset(src_pud, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = pmd_addr_end(addr, end);
+		src_next = pmd_addr_end(src_addr, src_end);
+		dst_next = pmd_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
-			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
-			err = copy_huge_pmd(dst_mm, src_mm,
-					    dst_pmd, src_pmd, addr, vma);
+			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
+					    dst_addr, src_addr, len, vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)
-				continue;
+				goto next;
 			/* fall through */
 		}
 		if (pmd_none_or_clear_bad(src_pmd))
-			continue;
+			goto next;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+				   vma, src_addr, src_next, dst_addr))
 			return -ENOMEM;
-	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_pmd++;
+		if (dst_len == len)
+			dst_pmd++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 	return 0;
 }
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long src_addr, unsigned long src_end, unsigned long dst_addr)
 {
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	pud_t *src_pud, *dst_pud;
-	unsigned long next;
 
-	dst_pud = pud_alloc(dst_mm, dst_p4d, addr);
+	dst_pud = pud_alloc(dst_mm, dst_p4d, dst_addr);
 	if (!dst_pud)
 		return -ENOMEM;
-	src_pud = pud_offset(src_p4d, addr);
+	src_pud = pud_offset(src_p4d, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = pud_addr_end(addr, end);
+		src_next = pud_addr_end(src_addr, src_end);
+		dst_next = pud_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
+
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
-			VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, vma);
-			err = copy_huge_pud(dst_mm, src_mm,
-					    dst_pud, src_pud, addr, vma);
+			err = copy_huge_pud(dst_mm, src_mm, dst_pud, src_pud,
+					    dst_addr, src_addr, len, vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)
-				continue;
+				goto next;
 			/* fall through */
 		}
 		if (pud_none_or_clear_bad(src_pud))
-			continue;
+			goto next;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+				   vma, src_addr, src_next, dst_addr))
 			return -ENOMEM;
-	} while (dst_pud++, src_pud++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_pud++;
+		if (dst_len == len)
+			dst_pud++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 	return 0;
 }
 
 static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+		unsigned long src_addr, unsigned long src_end, unsigned long dst_addr)
 {
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	p4d_t *src_p4d, *dst_p4d;
-	unsigned long next;
 
-	dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr);
+	dst_p4d = p4d_alloc(dst_mm, dst_pgd, dst_addr);
 	if (!dst_p4d)
 		return -ENOMEM;
-	src_p4d = p4d_offset(src_pgd, addr);
+
+	src_p4d = p4d_offset(src_pgd, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = p4d_addr_end(addr, end);
+		src_next = p4d_addr_end(src_addr, src_end);
+		dst_next = p4d_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
+
 		if (p4d_none_or_clear_bad(src_p4d))
-			continue;
+			goto next;
 		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
-						vma, addr, next))
+				   vma, src_addr, src_next, dst_addr))
 			return -ENOMEM;
-	} while (dst_p4d++, src_p4d++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_p4d++;
+		if (dst_len == len)
+			dst_p4d++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *vma, unsigned long dst_addr,
+		unsigned long src_addr, unsigned long src_end)
 {
 	pgd_t *src_pgd, *dst_pgd;
-	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long src_next, dst_next, src_len, dst_len, dst_end, len;
 	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
@@ -1011,23 +1052,37 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	if (is_cow) {
 		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
-					0, vma, src_mm, addr, end);
+					0, vma, src_mm, src_addr, src_end);
 		mmu_notifier_invalidate_range_start(&range);
 	}
 
 	ret = 0;
-	dst_pgd = pgd_offset(dst_mm, addr);
-	src_pgd = pgd_offset(src_mm, addr);
+	dst_pgd = pgd_offset(dst_mm, dst_addr);
+	src_pgd = pgd_offset(src_mm, src_addr);
+	dst_end = dst_addr + (src_end - src_addr);
 	do {
-		next = pgd_addr_end(addr, end);
+		src_next = pgd_addr_end(src_addr, src_end);
+		dst_next = pgd_addr_end(dst_addr, dst_end);
+		src_len = src_next - src_addr;
+		dst_len = dst_next - dst_addr;
+
+		len = min(src_len, dst_len);
+		src_next = src_addr + len;
+		dst_next = dst_addr + len;
+
 		if (pgd_none_or_clear_bad(src_pgd))
-			continue;
+			goto next;
 		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					vma, src_addr, src_next, dst_addr))) {
 			ret = -ENOMEM;
 			break;
 		}
-	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+next:
+		if (src_len == len)
+			src_pgd++;
+		if (dst_len == len)
+			dst_pgd++;
+	} while (src_addr = src_next, dst_addr = dst_next, src_addr != src_end);
 
 	if (is_cow)
 		mmu_notifier_invalidate_range_end(&range);



* [PATCH v2 4/7] mm: Export round_hint_to_min()
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
                   ` (2 preceding siblings ...)
  2019-05-20 14:00 ` [PATCH v2 3/7] mm: Extend copy_page_range() Kirill Tkhai
@ 2019-05-20 14:00 ` Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 5/7] mm: Introduce may_mmap_overlapped_region() helper Kirill Tkhai
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/mman.h |   14 ++++++++++++++
 mm/mmap.c            |   13 -------------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c9c538..69feb3144c12 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -4,6 +4,7 @@
 
 #include <linux/mm.h>
 #include <linux/percpu_counter.h>
+#include <linux/security.h>
 
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
@@ -73,6 +74,19 @@ static inline void vm_unacct_memory(long pages)
 	vm_acct_memory(-pages);
 }
 
+/*
+ * If a hint addr is less than mmap_min_addr change hint to be as
+ * low as possible but still greater than mmap_min_addr
+ */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	hint &= PAGE_MASK;
+	if (((void *)hint != NULL) &&
+	    (hint < mmap_min_addr))
+		return PAGE_ALIGN(mmap_min_addr);
+	return hint;
+}
+
 /*
  * Allow architectures to handle additional protection bits
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 99778e724ad1..e4ced5366643 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1318,19 +1318,6 @@ struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 	return NULL;
 }
 
-/*
- * If a hint addr is less than mmap_min_addr change hint to be as
- * low as possible but still greater than mmap_min_addr
- */
-static inline unsigned long round_hint_to_min(unsigned long hint)
-{
-	hint &= PAGE_MASK;
-	if (((void *)hint != NULL) &&
-	    (hint < mmap_min_addr))
-		return PAGE_ALIGN(mmap_min_addr);
-	return hint;
-}
-
 static inline int mlock_future_check(struct mm_struct *mm,
 				     unsigned long flags,
 				     unsigned long len)



* [PATCH v2 5/7] mm: Introduce may_mmap_overlapped_region() helper
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
                   ` (3 preceding siblings ...)
  2019-05-20 14:00 ` [PATCH v2 4/7] mm: Export round_hint_to_min() Kirill Tkhai
@ 2019-05-20 14:00 ` Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 6/7] mm: Introduce find_vma_without_flags() helper Kirill Tkhai
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

Extract the address space limit check for overlapped regions
into a separate helper.

v2: New

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 mm/mmap.c |   33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index e4ced5366643..260e47e917e6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -583,6 +583,24 @@ static unsigned long count_vma_pages_range(struct mm_struct *mm,
 	return nr_pages;
 }
 
+/*
+ * Check against address space limit, whether we may expand mm
+ * with a new mapping. Currently mapped in the given range pages
+ * are not accounted in the limit.
+ */
+static bool may_mmap_overlapped_region(struct mm_struct *mm,
+		unsigned long vm_flags, unsigned long addr, unsigned long len)
+{
+	unsigned long nr_pages = len >> PAGE_SHIFT;
+
+	if (!may_expand_vm(mm, vm_flags, nr_pages)) {
+		nr_pages -= count_vma_pages_range(mm, addr, addr + len);
+		if (!may_expand_vm(mm, vm_flags, nr_pages))
+			return false;
+	}
+	return true;
+}
+
 void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct rb_node **rb_link, struct rb_node *rb_parent)
 {
@@ -1697,19 +1715,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long charged = 0;
 
 	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
-		unsigned long nr_pages;
-
-		/*
-		 * MAP_FIXED may remove pages of mappings that intersects with
-		 * requested mapping. Account for the pages it would unmap.
-		 */
-		nr_pages = count_vma_pages_range(mm, addr, addr + len);
-
-		if (!may_expand_vm(mm, vm_flags,
-					(len >> PAGE_SHIFT) - nr_pages))
-			return -ENOMEM;
-	}
+	if (!may_mmap_overlapped_region(mm, vm_flags, addr, len))
+		return -ENOMEM;
 
 	/* Clear old maps */
 	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,



* [PATCH v2 6/7] mm: Introduce find_vma_without_flags() helper
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
                   ` (4 preceding siblings ...)
  2019-05-20 14:00 ` [PATCH v2 5/7] mm: Introduce may_mmap_overlapped_region() helper Kirill Tkhai
@ 2019-05-20 14:00 ` Kirill Tkhai
  2019-05-20 14:00 ` [PATCH v2 7/7] mm: Add process_vm_mmap() Kirill Tkhai
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

This patch introduces a new helper, which returns the VMA
covering a sufficient length at the given address, but only
if the VMA does not contain any of the passed flags.
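
A hedged usage sketch (the prohibited flag set is illustrative,
taken from the callers in this series):

	struct vm_area_struct *vma;

	vma = find_vma_without_flags(mm, addr, len,
				     VM_HUGETLB | VM_DONTEXPAND | VM_PFNMAP);
	if (IS_ERR(vma))
		/* no VMA at addr, the range crosses a VMA boundary,
		 * or a prohibited flag is set */
		return PTR_ERR(vma);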

v2: New

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/mm.h |    3 +++
 mm/mremap.c        |   39 ++++++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 54328d08dbdd..65ceb56acd44 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1515,6 +1515,9 @@ void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+struct vm_area_struct *find_vma_without_flags(struct mm_struct *mm,
+		unsigned long addr, unsigned long len,
+		unsigned long prohibited_flags);
 #else
 static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 		unsigned long address, unsigned int flags)
diff --git a/mm/mremap.c b/mm/mremap.c
index 9a96cfc28675..dabae6a70287 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -430,14 +430,37 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	return new_addr;
 }
 
+struct vm_area_struct *find_vma_without_flags(struct mm_struct *mm,
+		unsigned long addr, unsigned long len,
+		unsigned long prohibited_flags)
+{
+	struct vm_area_struct *vma = find_vma(mm, addr);
+
+	if (!vma || vma->vm_start > addr)
+		return ERR_PTR(-EFAULT);
+
+	/* vm area boundaries crossing */
+	if (len > vma->vm_end - addr)
+		return ERR_PTR(-EFAULT);
+
+	if (vma->vm_flags & prohibited_flags)
+		return ERR_PTR(-EFAULT);
+
+	return vma;
+}
+
 static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	unsigned long old_len, unsigned long new_len, unsigned long *p)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = find_vma(mm, addr);
-	unsigned long pgoff;
+	struct vm_area_struct *vma;
+	unsigned long pgoff, prohibited_flags = VM_HUGETLB;
 
-	if (!vma || vma->vm_start > addr)
+	if (old_len != new_len)
+		prohibited_flags |= VM_DONTEXPAND | VM_PFNMAP;
+
+	vma = find_vma_without_flags(mm, addr, old_len, prohibited_flags);
+	if (IS_ERR(vma))
 		return ERR_PTR(-EFAULT);
 
 	/*
@@ -453,13 +476,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return ERR_PTR(-EINVAL);
-
-	/* We can't remap across vm area boundaries */
-	if (old_len > vma->vm_end - addr)
-		return ERR_PTR(-EFAULT);
-
 	if (new_len == old_len)
 		return vma;
 
@@ -469,9 +485,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
 		return ERR_PTR(-EINVAL);
 
-	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
-		return ERR_PTR(-EFAULT);
-
 	if (vma->vm_flags & VM_LOCKED) {
 		unsigned long locked, lock_limit;
 		locked = atomic64_read(&mm->locked_vm) << PAGE_SHIFT;



* [PATCH v2 7/7] mm: Add process_vm_mmap()
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Kirill Tkhai
                   ` (5 preceding siblings ...)
  2019-05-20 14:00 ` [PATCH v2 6/7] mm: Introduce find_vma_without_flags() helper Kirill Tkhai
@ 2019-05-20 14:00 ` Kirill Tkhai
  2019-05-21 14:43 ` [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplicating a process mapping Andy Lutomirski
  2019-05-22 15:22 ` Kirill A. Shutemov
  8 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-20 14:00 UTC (permalink / raw)
  To: akpm, dan.j.williams, ktkhai, mhocko, keith.busch,
	kirill.shutemov, alexander.h.duyck, ira.weiny, andreyknvl,
	arunks, vbabka, cl, riel, keescook, hannes, npiggin,
	mathieu.desnoyers, shakeelb, guro, aarcange, hughd, jglisse,
	mgorman, daniel.m.jordan, jannh, kilobyte, linux-api,
	linux-kernel, linux-mm

This adds a new syscall to duplicate a VMA of another
process into the current one. The flag PVMMAP_FIXED may be
specified; its meaning is similar to mmap()'s MAP_FIXED.

VMAs are merged on the destination side, i.e. if the source
task has a VMA with addresses [start, end], and we map its
two halves sequentially:

process_vm_mmap(@pid, start, (end - start)/2, ...);
process_vm_mmap(@pid, start + (end - start)/2, (end - start)/2, ...);

the destination task will have a single VMA covering [start, end].
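
A hedged user-space sketch of the new flags (pid, src, len and
fixed_addr are assumed to exist; error handling is omitted):

	/* Map at a chosen address, unmapping whatever is mapped
	 * there, like mmap()'s MAP_FIXED: */
	dst = process_vm_mmap(pid, src, len, fixed_addr, PVMMAP_FIXED);

	/* The same, but fail with EEXIST instead of unmapping an
	 * existing mapping, like MAP_FIXED_NOREPLACE: */
	dst = process_vm_mmap(pid, src, len, fixed_addr,
			      PVMMAP_FIXED_NOREPLACE);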

v2:
    Add PVMMAP_FIXED_NOREPLACE flag.
    Use the find_vma_without_flags() and may_mmap_overlapped_region()
    helpers.
    Fix whitespace.

    The previous version also made it possible to duplicate a VMA
    from the current process into a remote process, but there was
    an error, so I removed that. get_unmapped_area() needs to be
    extended to work with a remote VMA (which I missed initially).
    This requires a lot of refactoring, which might hide the main
    logic away from a reader, so let me do that later in a separate
    series.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/linux/mm.h                     |    4 +
 include/linux/mm_types.h               |    2 +
 include/uapi/asm-generic/mman-common.h |    6 ++
 mm/mmap.c                              |  107 ++++++++++++++++++++++++++++++++
 mm/process_vm_access.c                 |   69 +++++++++++++++++++++
 5 files changed, 188 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 65ceb56acd44..9d1c79a44128 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2385,6 +2385,10 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
+extern unsigned long mmap_process_vm(struct mm_struct *, unsigned long,
+				     struct mm_struct *, unsigned long,
+				     unsigned long, unsigned long,
+				     struct list_head *);
 
 static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1815fbc40926..885f256f2fb7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -261,11 +261,13 @@ struct vm_region {
 
 #ifdef CONFIG_USERFAULTFD
 #define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+#define IS_NULL_VM_UFFD_CTX(uctx) ((uctx)->ctx == NULL)
 struct vm_userfaultfd_ctx {
 	struct userfaultfd_ctx *ctx;
 };
 #else /* CONFIG_USERFAULTFD */
 #define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+#define IS_NULL_VM_UFFD_CTX(uctx) (true)
 struct vm_userfaultfd_ctx {};
 #endif /* CONFIG_USERFAULTFD */
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..91803e6ada7c 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -28,6 +28,12 @@
 /* 0x0100 - 0x80000 flags are defined in asm-generic/mman.h */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 
+/*
+ * Flags for process_vm_mmap
+ */
+#define PVMMAP_FIXED		0x01
+#define PVMMAP_FIXED_NOREPLACE	0x02		/* PVMAP_FIXED which doesn't unmap underlying mapping */
+
 /*
  * Flags for mlock
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 260e47e917e6..3123ecee5fb8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3282,6 +3282,113 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	return NULL;
 }
 
+static int do_mmap_process_vm(struct vm_area_struct *src_vma,
+			      unsigned long src_addr,
+			      struct mm_struct *dst_mm,
+			      unsigned long dst_addr,
+			      unsigned long len,
+			      struct list_head *uf)
+{
+	struct vm_area_struct *dst_vma;
+	unsigned long pgoff, ret;
+	bool unused;
+
+	if (do_munmap(dst_mm, dst_addr, len, uf))
+		return -ENOMEM;
+
+	if (src_vma->vm_flags & VM_ACCOUNT) {
+		if (security_vm_enough_memory_mm(dst_mm, len >> PAGE_SHIFT))
+			return -ENOMEM;
+	}
+
+	pgoff = src_vma->vm_pgoff +
+			((src_addr - src_vma->vm_start) >> PAGE_SHIFT);
+	dst_vma = copy_vma(&src_vma, dst_mm, dst_addr,
+			   len, pgoff, &unused, false);
+	if (!dst_vma) {
+		ret = -ENOMEM;
+		goto unacct;
+	}
+
+	ret = copy_page_range(dst_mm, src_vma->vm_mm, src_vma,
+			      dst_addr, src_addr, src_addr + len);
+	if (ret) {
+		do_munmap(dst_mm, dst_addr, len, uf);
+		return -ENOMEM;
+	}
+
+	if (dst_vma->vm_file)
+		uprobe_mmap(dst_vma);
+	perf_event_mmap(dst_vma);
+
+	dst_vma->vm_flags |= VM_SOFTDIRTY;
+	vma_set_page_prot(dst_vma);
+
+	vm_stat_account(dst_mm, dst_vma->vm_flags, len >> PAGE_SHIFT);
+	return 0;
+
+unacct:
+	vm_unacct_memory(len >> PAGE_SHIFT);
+	return ret;
+}
+
+unsigned long mmap_process_vm(struct mm_struct *src_mm,
+			      unsigned long src_addr,
+			      struct mm_struct *dst_mm,
+			      unsigned long dst_addr,
+			      unsigned long len,
+			      unsigned long flags,
+			      struct list_head *uf)
+{
+	struct vm_area_struct *src_vma, *dst_vma;
+	unsigned long gua_flags = 0;
+	unsigned long ret;
+
+	src_vma = find_vma_without_flags(src_mm, src_addr, len,
+				VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO);
+	if (IS_ERR(src_vma))
+		return -EFAULT;
+	if (dst_mm->map_count > sysctl_max_map_count - 2)
+		return -ENOMEM;
+	if (!IS_NULL_VM_UFFD_CTX(&src_vma->vm_userfaultfd_ctx))
+		return -ENOTSUPP;
+
+	if (src_vma->vm_flags & VM_SHARED)
+		gua_flags |= MAP_SHARED;
+	else
+		gua_flags |= MAP_PRIVATE;
+	if (vma_is_anonymous(src_vma) || vma_is_shmem(src_vma))
+		gua_flags |= MAP_ANONYMOUS;
+	if (flags & PVMMAP_FIXED)
+		gua_flags |= MAP_FIXED;
+	if (flags & PVMMAP_FIXED_NOREPLACE)
+		gua_flags |= MAP_FIXED | MAP_FIXED_NOREPLACE;
+
+	ret = get_unmapped_area(src_vma->vm_file, dst_addr, len,
+				src_vma->vm_pgoff +
+				((src_addr - src_vma->vm_start) >> PAGE_SHIFT),
+				gua_flags);
+	if (offset_in_page(ret))
+                return ret;
+	if (flags & PVMMAP_FIXED_NOREPLACE) {
+		dst_vma = find_vma(dst_mm, dst_addr);
+		if (dst_vma && dst_vma->vm_start < dst_addr + len)
+			return -EEXIST;
+	}
+
+	dst_addr = ret;
+
+	/* Check against address space limit. */
+	if (!may_mmap_overlapped_region(dst_mm, src_vma->vm_flags, dst_addr, len))
+		return -ENOMEM;
+
+	ret = do_mmap_process_vm(src_vma, src_addr, dst_mm, dst_addr, len, uf);
+	if (ret)
+                return ret;
+
+	return dst_addr;
+}
+
 /*
  * Return true if the calling process may expand its vm space by the passed
  * number of pages
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index a447092d4635..e2073f52415b 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -17,6 +17,8 @@
 #include <linux/ptrace.h>
 #include <linux/slab.h>
 #include <linux/syscalls.h>
+#include <linux/mman.h>
+#include <linux/userfaultfd_k.h>
 
 #ifdef CONFIG_COMPAT
 #include <linux/compat.h>
@@ -295,6 +297,66 @@ static ssize_t process_vm_rw(pid_t pid,
 	return rc;
 }
 
+static unsigned long process_vm_mmap(pid_t pid, unsigned long src_addr,
+				     unsigned long len, unsigned long dst_addr,
+				     unsigned long flags)
+{
+	struct mm_struct *src_mm, *dst_mm;
+	struct task_struct *task;
+	unsigned long ret;
+	int depth = 0;
+	LIST_HEAD(uf);
+
+	len = PAGE_ALIGN(len);
+	src_addr = round_down(src_addr, PAGE_SIZE);
+	if (flags & PVMMAP_FIXED)
+		dst_addr = round_down(dst_addr, PAGE_SIZE);
+	else
+		dst_addr = round_hint_to_min(dst_addr);
+
+	if ((flags & ~(PVMMAP_FIXED|PVMMAP_FIXED_NOREPLACE)) ||
+	    len == 0 || len > TASK_SIZE || src_addr == 0 ||
+	    dst_addr > TASK_SIZE - len || pid <= 0)
+		return -EINVAL;
+	task = find_get_task_by_vpid(pid);
+	if (!task)
+		return -ESRCH;
+	if (unlikely(task->flags & PF_KTHREAD)) {
+		ret = -EINVAL;
+		goto out_put_task;
+	}
+
+	src_mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+	if (!src_mm || IS_ERR(src_mm)) {
+		ret = IS_ERR(src_mm) ? PTR_ERR(src_mm) : -ESRCH;
+		goto out_put_task;
+	}
+	dst_mm = current->mm;
+	mmget(dst_mm);
+
+	/* Double lock mm in address order: smallest is the first */
+	if (src_mm < dst_mm) {
+		down_write(&src_mm->mmap_sem);
+		depth = SINGLE_DEPTH_NESTING;
+	}
+	down_write_nested(&dst_mm->mmap_sem, depth);
+	if (src_mm > dst_mm)
+		down_write_nested(&src_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+
+	ret = mmap_process_vm(src_mm, src_addr, dst_mm, dst_addr, len, flags, &uf);
+
+	up_write(&dst_mm->mmap_sem);
+	if (dst_mm != src_mm)
+		up_write(&src_mm->mmap_sem);
+
+	userfaultfd_unmap_complete(dst_mm, &uf);
+	mmput(src_mm);
+	mmput(dst_mm);
+out_put_task:
+	put_task_struct(task);
+	return ret;
+}
+
 SYSCALL_DEFINE6(process_vm_readv, pid_t, pid, const struct iovec __user *, lvec,
 		unsigned long, liovcnt, const struct iovec __user *, rvec,
 		unsigned long, riovcnt,	unsigned long, flags)
@@ -310,6 +372,13 @@ SYSCALL_DEFINE6(process_vm_writev, pid_t, pid,
 	return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, 1);
 }
 
+SYSCALL_DEFINE5(process_vm_mmap, pid_t, pid,
+		unsigned long, src_addr, unsigned long, len,
+		unsigned long, dst_addr, unsigned long, flags)
+{
+	return process_vm_mmap(pid, src_addr, len, dst_addr, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t



* Re: [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration
  2019-05-20 14:00 ` [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration Kirill Tkhai
@ 2019-05-21  0:28   ` Ira Weiny
  2019-05-21  8:29     ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Ira Weiny @ 2019-05-21  0:28 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, andreyknvl, arunks, vbabka, cl, riel,
	keescook, hannes, npiggin, mathieu.desnoyers, shakeelb, guro,
	aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On Mon, May 20, 2019 at 05:00:07PM +0300, Kirill Tkhai wrote:
> Similar to process_vm_readv() and process_vm_writev(),
> add declarations of a new syscall, which will allow
> to map memory from or to another process.

Shouldn't this be the last patch in the series so that the syscall is actually
implemented first?

Ira

> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |    2 ++
>  include/linux/syscalls.h               |    5 +++++
>  include/uapi/asm-generic/unistd.h      |    5 ++++-
>  init/Kconfig                           |    9 +++++----
>  kernel/sys_ni.c                        |    2 ++
>  6 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index ad968b7bac72..99d6e0085576 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
>  432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
>  433	i386	fspick			sys_fspick			__ia32_sys_fspick
> +434	i386	process_vm_mmap		sys_process_vm_mmap		__ia32_compat_sys_process_vm_mmap
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index b4e6f9e6204a..46d7d2898f7a 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  431	common	fsconfig		__x64_sys_fsconfig
>  432	common	fsmount			__x64_sys_fsmount
>  433	common	fspick			__x64_sys_fspick
> +434	common	process_vm_mmap		__x64_sys_process_vm_mmap
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -398,3 +399,4 @@
>  545	x32	execveat		__x32_compat_sys_execveat/ptregs
>  546	x32	preadv2			__x32_compat_sys_preadv64v2
>  547	x32	pwritev2		__x32_compat_sys_pwritev64v2
> +548	x32	process_vm_mmap		__x32_compat_sys_process_vm_mmap
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..7d8ae36589cf 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -997,6 +997,11 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
>  asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
>  				       siginfo_t __user *info,
>  				       unsigned int flags);
> +asmlinkage long sys_process_vm_mmap(pid_t pid,
> +				    unsigned long src_addr,
> +				    unsigned long len,
> +				    unsigned long dst_addr,
> +				    unsigned long flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index a87904daf103..b7aaa5ae02da 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -844,9 +844,12 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
>  __SYSCALL(__NR_fsmount, sys_fsmount)
>  #define __NR_fspick 433
>  __SYSCALL(__NR_fspick, sys_fspick)
> +#define __NR_process_vm_mmap 434
> +__SC_COMP(__NR_process_vm_mmap, sys_process_vm_mmap, \
> +          compat_sys_process_vm_mmap)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 434
> +#define __NR_syscalls 435
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/init/Kconfig b/init/Kconfig
> index 8b9ffe236e4f..604db5f14718 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -320,13 +320,14 @@ config POSIX_MQUEUE_SYSCTL
>  	default y
>  
>  config CROSS_MEMORY_ATTACH
> -	bool "Enable process_vm_readv/writev syscalls"
> +	bool "Enable process_vm_readv/writev/mmap syscalls"
>  	depends on MMU
>  	default y
>  	help
> -	  Enabling this option adds the system calls process_vm_readv and
> -	  process_vm_writev which allow a process with the correct privileges
> -	  to directly read from or write to another process' address space.
> +	  Enabling this option adds the system calls process_vm_readv,
> +	  process_vm_writev and process_vm_mmap, which allow a process
> +	  with the correct privileges to directly read from or write to
> +	  or mmap another process' address space.
>  	  See the man page for more details.
>  
>  config USELIB
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 4d9ae5ea6caf..6f51634f4f7e 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -316,6 +316,8 @@ COND_SYSCALL(process_vm_readv);
>  COND_SYSCALL_COMPAT(process_vm_readv);
>  COND_SYSCALL(process_vm_writev);
>  COND_SYSCALL_COMPAT(process_vm_writev);
> +COND_SYSCALL(process_vm_mmap);
> +COND_SYSCALL_COMPAT(process_vm_mmap);
>  
>  /* compare kernel pointers */
>  COND_SYSCALL(kcmp);
> 



* Re: [PATCH v2 2/7] mm: Extend copy_vma()
  2019-05-20 14:00 ` [PATCH v2 2/7] mm: Extend copy_vma() Kirill Tkhai
@ 2019-05-21  8:18   ` Kirill A. Shutemov
  2019-05-21  8:48     ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2019-05-21  8:18 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On Mon, May 20, 2019 at 05:00:12PM +0300, Kirill Tkhai wrote:
> This prepares the function to copy a vma between
> two processes. Two new arguments are introduced.

This kind of change requires a lot more explanation in the commit message,
describing all possible corner cases.

For instance, I would really like to see a story on why the logic around
need_rmap_locks is safe after the change.

> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  include/linux/mm.h |    4 ++--
>  mm/mmap.c          |   33 ++++++++++++++++++++++++---------
>  mm/mremap.c        |    4 ++--
>  3 files changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0e8834ac32b7..afe07e4a76f8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2329,8 +2329,8 @@ extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
>  	struct rb_node **, struct rb_node *);
>  extern void unlink_file_vma(struct vm_area_struct *);
>  extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
> -	unsigned long addr, unsigned long len, pgoff_t pgoff,
> -	bool *need_rmap_locks);
> +	struct mm_struct *, unsigned long addr, unsigned long len,
> +	pgoff_t pgoff, bool *need_rmap_locks, bool clear_flags_ctx);
>  extern void exit_mmap(struct mm_struct *);
>  
>  static inline int check_data_rlimit(unsigned long rlim,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 57803a0a3a5c..99778e724ad1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3195,19 +3195,21 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
>  }
>  
>  /*
> - * Copy the vma structure to a new location in the same mm,
> - * prior to moving page table entries, to effect an mremap move.
> + * Copy the vma structure to new location in the same vma
> + * prior to moving page table entries, to effect an mremap move;
>   */
>  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> -	unsigned long addr, unsigned long len, pgoff_t pgoff,
> -	bool *need_rmap_locks)
> +				struct mm_struct *mm, unsigned long addr,
> +				unsigned long len, pgoff_t pgoff,
> +				bool *need_rmap_locks, bool clear_flags_ctx)
>  {
>  	struct vm_area_struct *vma = *vmap;
>  	unsigned long vma_start = vma->vm_start;
> -	struct mm_struct *mm = vma->vm_mm;
> +	struct vm_userfaultfd_ctx uctx;
>  	struct vm_area_struct *new_vma, *prev;
>  	struct rb_node **rb_link, *rb_parent;
>  	bool faulted_in_anon_vma = true;
> +	unsigned long flags;
>  
>  	/*
>  	 * If anonymous vma has not yet been faulted, update new pgoff
> @@ -3220,15 +3222,25 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  
>  	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
>  		return NULL;	/* should never get here */
> -	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> -			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> -			    vma->vm_userfaultfd_ctx);
> +
> +	uctx = vma->vm_userfaultfd_ctx;
> +	flags = vma->vm_flags;
> +	if (clear_flags_ctx) {
> +		uctx = NULL_VM_UFFD_CTX;
> +		flags &= ~(VM_UFFD_MISSING | VM_UFFD_WP | VM_MERGEABLE |
> +			   VM_LOCKED | VM_LOCKONFAULT | VM_WIPEONFORK |
> +			   VM_DONTCOPY);
> +	}

Why is the new logic required? No justification given.

> +
> +	new_vma = vma_merge(mm, prev, addr, addr + len, flags, vma->anon_vma,
> +			    vma->vm_file, pgoff, vma_policy(vma), uctx);
>  	if (new_vma) {
>  		/*
>  		 * Source vma may have been merged into new_vma
>  		 */
>  		if (unlikely(vma_start >= new_vma->vm_start &&
> -			     vma_start < new_vma->vm_end)) {
> +			     vma_start < new_vma->vm_end) &&
> +			     vma->vm_mm == mm) {

How can vma_merge() succeed if vma->vm_mm != mm?

>  			/*
>  			 * The only way we can get a vma_merge with
>  			 * self during an mremap is if the vma hasn't
> @@ -3249,6 +3261,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  		new_vma = vm_area_dup(vma);
>  		if (!new_vma)
>  			goto out;
> +		new_vma->vm_mm = mm;
> +		new_vma->vm_flags = flags;
> +		new_vma->vm_userfaultfd_ctx = uctx;
>  		new_vma->vm_start = addr;
>  		new_vma->vm_end = addr + len;
>  		new_vma->vm_pgoff = pgoff;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 37b5b2ad91be..9a96cfc28675 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -352,8 +352,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  		return err;
>  
>  	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
> -	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
> -			   &need_rmap_locks);
> +	new_vma = copy_vma(&vma, mm, new_addr, new_len, new_pgoff,
> +			   &need_rmap_locks, false);
>  	if (!new_vma)
>  		return -ENOMEM;
>  
> 



* Re: [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration
  2019-05-21  0:28   ` Ira Weiny
@ 2019-05-21  8:29     ` Kirill Tkhai
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-21  8:29 UTC (permalink / raw)
  To: Ira Weiny
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, andreyknvl, arunks, vbabka, cl, riel,
	keescook, hannes, npiggin, mathieu.desnoyers, shakeelb, guro,
	aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

Hi, Ira,

On 21.05.2019 03:28, Ira Weiny wrote:
> On Mon, May 20, 2019 at 05:00:07PM +0300, Kirill Tkhai wrote:
>> Similar to process_vm_readv() and process_vm_writev(),
>> add declarations of a new syscall, which will allow
>> mapping memory from or to another process.
> 
> Shouldn't this be the last patch in the series so that the syscall is actually
> implemented first?

It looks like the last patch has no dependencies on the declarations made
in the first patch, so we really can move it.

I'll do this after some comments on the logic have accumulated, to reduce
the number of patch series resends.

[...]

Thanks,
Kirill



* Re: [PATCH v2 2/7] mm: Extend copy_vma()
  2019-05-21  8:18   ` Kirill A. Shutemov
@ 2019-05-21  8:48     ` Kirill Tkhai
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-21  8:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

Hi, Kirill,

On 21.05.2019 11:18, Kirill A. Shutemov wrote:
> On Mon, May 20, 2019 at 05:00:12PM +0300, Kirill Tkhai wrote:
>> This prepares the function to copy a vma between
>> two processes. Two new arguments are introduced.
> 
> This kind of change requires a lot more explanation in the commit message,
> describing all possible corner cases.
> 
> For instance, I would really like to see a story on why the logic around
> need_rmap_locks is safe after the change.

Let me first give a quick answer to the question below; I'll write a more
detailed explanation later, since that requires much more time.
 
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>>  include/linux/mm.h |    4 ++--
>>  mm/mmap.c          |   33 ++++++++++++++++++++++++---------
>>  mm/mremap.c        |    4 ++--
>>  3 files changed, 28 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 0e8834ac32b7..afe07e4a76f8 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2329,8 +2329,8 @@ extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
>>  	struct rb_node **, struct rb_node *);
>>  extern void unlink_file_vma(struct vm_area_struct *);
>>  extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
>> -	unsigned long addr, unsigned long len, pgoff_t pgoff,
>> -	bool *need_rmap_locks);
>> +	struct mm_struct *, unsigned long addr, unsigned long len,
>> +	pgoff_t pgoff, bool *need_rmap_locks, bool clear_flags_ctx);
>>  extern void exit_mmap(struct mm_struct *);
>>  
>>  static inline int check_data_rlimit(unsigned long rlim,
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 57803a0a3a5c..99778e724ad1 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3195,19 +3195,21 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
>>  }
>>  
>>  /*
>> - * Copy the vma structure to a new location in the same mm,
>> - * prior to moving page table entries, to effect an mremap move.
>> + * Copy the vma structure to new location in the same vma
>> + * prior to moving page table entries, to effect an mremap move;
>>   */
>>  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>> -	unsigned long addr, unsigned long len, pgoff_t pgoff,
>> -	bool *need_rmap_locks)
>> +				struct mm_struct *mm, unsigned long addr,
>> +				unsigned long len, pgoff_t pgoff,
>> +				bool *need_rmap_locks, bool clear_flags_ctx)
>>  {
>>  	struct vm_area_struct *vma = *vmap;
>>  	unsigned long vma_start = vma->vm_start;
>> -	struct mm_struct *mm = vma->vm_mm;
>> +	struct vm_userfaultfd_ctx uctx;
>>  	struct vm_area_struct *new_vma, *prev;
>>  	struct rb_node **rb_link, *rb_parent;
>>  	bool faulted_in_anon_vma = true;
>> +	unsigned long flags;
>>  
>>  	/*
>>  	 * If anonymous vma has not yet been faulted, update new pgoff
>> @@ -3220,15 +3222,25 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>  
>>  	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
>>  		return NULL;	/* should never get here */
>> -	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>> -			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
>> -			    vma->vm_userfaultfd_ctx);
>> +
>> +	uctx = vma->vm_userfaultfd_ctx;
>> +	flags = vma->vm_flags;
>> +	if (clear_flags_ctx) {
>> +		uctx = NULL_VM_UFFD_CTX;
>> +		flags &= ~(VM_UFFD_MISSING | VM_UFFD_WP | VM_MERGEABLE |
>> +			   VM_LOCKED | VM_LOCKONFAULT | VM_WIPEONFORK |
>> +			   VM_DONTCOPY);
>> +	}
> 
> Why is the new logic required? No justification given.

Ditto.

>> +
>> +	new_vma = vma_merge(mm, prev, addr, addr + len, flags, vma->anon_vma,
>> +			    vma->vm_file, pgoff, vma_policy(vma), uctx);
>>  	if (new_vma) {
>>  		/*
>>  		 * Source vma may have been merged into new_vma
>>  		 */
>>  		if (unlikely(vma_start >= new_vma->vm_start &&
>> -			     vma_start < new_vma->vm_end)) {
>> +			     vma_start < new_vma->vm_end) &&
>> +			     vma->vm_mm == mm) {
> 
> How can vma_merge() succeed if vma->vm_mm != mm?

We don't pass vma as an argument to vma_merge(). We only use vma as a source
of vma->anon_vma, vma->vm_file and vma_policy().

We search for a new_vma in mm with the same characteristics that vma has in
vma->vm_mm, and on success vma_merge() returns it to us. For example, the
merge may succeed when vma->vm_mm is the mm_struct of a forked process,
while mm is the mm_struct of its parent.

[...]

Kirill



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping Kirill Tkhai
                   ` (6 preceding siblings ...)
  2019-05-20 14:00 ` [PATCH v2 7/7] mm: Add process_vm_mmap() Kirill Tkhai
@ 2019-05-21 14:43 ` Andy Lutomirski
  2019-05-21 15:52   ` Kirill Tkhai
  2019-05-22 15:22 ` Kirill A. Shutemov
  8 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-21 14:43 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Jann Horn, Adam Borowski, Linux API, LKML,
	Linux-MM

On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>

> [Summary]
>
> New syscall, which allows to clone a remote process VMA
> into local process VM. The remote process's page table
> entries related to the VMA are cloned into local process's
> page table (in any desired address, which makes this different
> from that happens during fork()). Huge pages are handled
> appropriately.
>
> This allows to improve performance in significant way like
> it's shows in the example below.
>
> [Description]
>
> This patchset adds a new syscall, which makes possible
> to clone a VMA from a process to current process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
>
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
>
>         struct iovec local_iov, remote_iov;
>         void *buf;
>
>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
>         local_iov->iov_base = buf;
>         local_iov->iov_len = n * PAGE_SIZE;
>         remote_iov = ...;
>
>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
>         munmap(buf, n * PAGE_SIZE);
>
>         (Note, that above completely ignores error handling)
>
> There are several problems with process_vm_writev() in this example:
>
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);

I don't see how your new syscall helps.  You're writing to remote
memory.  If that memory wasn't allocated, it's going to get allocated
regardless of whether you use a write-like interface or an mmap-like
interface.  Keep in mind that, on x86, just the hardware part of a
page fault is very slow -- populating the memory with a syscall
instead of a fault may well be faster.

>
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;

This seems disingenuous.  If you're writing p pages total in chunks of
n pages, you will use a total of p pages if you use mmap and p+n if
you use write.  That only doubles the amount of memory if you let n
scale linearly with p, which seems unlikely.

>
> 3)received data has no a chance to be properly swapped for
>   a long time.

...

> a)kernel moves @buf pages into swap right after recv();
> b)process_vm_writev() reads the data back from swap to pages;

If you're under that much memory pressure and thrashing that badly,
your performance is going to be awful no matter what you're doing.  If
you indeed observe this behavior under normal loads, then this seems
like a VM issue that should be addressed in its own right.

>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
> [Task 2]
>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>
> This creates a copy of VMA related to buf from task1 in task2's VM.
> Task1's page table entries are copied into corresponding page table
> entries of VM of task2.

You need to fully explain a whole bunch of details that you've
ignored.  For example, if the remote VMA is MAP_ANONYMOUS, do you get
a CoW copy of it?  I assume you don't since the whole point is to
write to remote memory, but it's at the very least quite unusual in
Linux to have two different anonymous VMAs such that writing one of
them changes the other one.  But there are plenty of other questions.
What happens if the remote VMA is a gate area or other special mapping
(vDSO, vvar area, etc)?  What if the remote memory comes from a driver
that wasn't expecting the mapping to get magically copied to a
different process?

This new API seems quite dangerous and complex to me, and I don't
think the value has been adequately demonstrated.



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 14:43 ` [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping Andy Lutomirski
@ 2019-05-21 15:52   ` Kirill Tkhai
  2019-05-21 15:59     ` Kirill Tkhai
                       ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-21 15:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Jann Horn, Adam Borowski, Linux API, LKML,
	Linux-MM

On 21.05.2019 17:43, Andy Lutomirski wrote:
> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
> 
>> [Summary]
>>
>> New syscall, which allows to clone a remote process VMA
>> into local process VM. The remote process's page table
>> entries related to the VMA are cloned into local process's
>> page table (in any desired address, which makes this different
>> from that happens during fork()). Huge pages are handled
>> appropriately.
>>
>> This allows to improve performance in significant way like
>> it's shows in the example below.
>>
>> [Description]
>>
>> This patchset adds a new syscall, which makes possible
>> to clone a VMA from a process to current process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
>>
>> For example, it allows to make a zero copy of data,
>> when process_vm_writev() was previously used:
>>
>>         struct iovec local_iov, remote_iov;
>>         void *buf;
>>
>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>>         local_iov->iov_base = buf;
>>         local_iov->iov_len = n * PAGE_SIZE;
>>         remote_iov = ...;
>>
>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
>>         munmap(buf, n * PAGE_SIZE);
>>
>>         (Note, that above completely ignores error handling)
>>
>> There are several problems with process_vm_writev() in this example:
>>
>> 1)it causes pagefault on remote process memory, and it forces
>>   allocation of a new page (if was not preallocated);
> 
> I don't see how your new syscall helps.  You're writing to remote
> memory.  If that memory wasn't allocated, it's going to get allocated
> regardless of whether you use a write-like interface or an mmap-like
> interface.

No, this is not just another interface for copying memory. It is about
borrowing a remote task's VMA together with the corresponding page table
content. The syscall allows copying part of a page table, with its
already-populated pages, from a remote process to the local one. See here:

[task1]                                                        [task2]

buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS, ...);

<task1 populates buf>

                                                               buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
munmap(buf);


process_vm_mmap() copies the PTEs related to the memory of buf in task1
to task2, just as we do during the fork syscall.

There is no copying of buf's memory content, unless COW happens. This is
the principal difference to process_vm_writev(), which just allocates
pages in the remote VM.
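
For completeness, here is a minimal userspace sketch of the receiver side
(task2). It assumes the raw syscall number (434 on x86-64) and the argument
order proposed in patch 1 of this series, so treat it as an illustration
only, not as a final ABI:

	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	#ifndef __NR_process_vm_mmap
	#define __NR_process_vm_mmap 434	/* x86-64 number from patch 1 */
	#endif

	/* Clone [src_addr, src_addr + len) of task @pid into our mm;
	 * returns the local address of the new mapping, like mmap(). */
	static void *process_vm_mmap(pid_t pid, unsigned long src_addr,
				     unsigned long len, unsigned long dst_addr,
				     unsigned long flags)
	{
		return (void *)syscall(__NR_process_vm_mmap, pid, src_addr,
				       len, dst_addr, flags);
	}

	/* Usage, matching the example above:
	 *	buf2 = process_vm_mmap(pid_of_task1, (unsigned long)buf,
	 *			       n * PAGE_SIZE, 0, 0);
	 */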

> Keep in mind that, on x86, just the hardware part of a
> page fault is very slow -- populating the memory with a syscall
> instead of a fault may well be faster.

It is not as slow as disk IO is. Just compare what happens when the anonymous
pages related to buf of task1 are swapped out:

1)process_vm_writev() reads them back into memory;

2)process_vm_mmap() just copies the swap PTEs from task1's page table
  to task2's page table.

Also, for faster page faults one may use huge pages for the mappings.
But really, it's funny to worry about page faults when there are the
disk IO problems I have shown.

>>
>> 2)amount of memory for this example is doubled in a moment --
>>   n pages in current and n pages in remote tasks are occupied
>>   at the same time;
> 
> This seems disingenuous.  If you're writing p pages total in chunks of
> n pages, you will use a total of p pages if you use mmap and p+n if
> you use write.

I didn't understand this sentence because of the many ifs, sorry. Could you
please explain your thought once again?

> That only doubles the amount of memory if you let n
> scale linearly with p, which seems unlikely.
>
>>
>> 3)received data has no a chance to be properly swapped for
>>   a long time.
> 
> ...
> 
>> a)kernel moves @buf pages into swap right after recv();
>> b)process_vm_writev() reads the data back from swap to pages;
> 
> If you're under that much memory pressure and thrashing that badly,
> your performance is going to be awful no matter what you're doing.  If
> you indeed observe this behavior under normal loads, then this seems
> like a VM issue that should be addressed in its own right.

I don't think so. Imagine: a container migrates from one node to another.
The nodes are identical; say, each of them has 4GB of RAM.

Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
After the page server on the second node has received the pages, we want these
pages to become swapped as soon as possible, and we don't want to read them
from swap to pass a read consumer.

The page server is task1 in the example. The real consumer is task2.

This is a rather normal load, I think.

>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>> [Task 2]
>>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>
>> This creates a copy of VMA related to buf from task1 in task2's VM.
>> Task1's page table entries are copied into corresponding page table
>> entries of VM of task2.
> 
> You need to fully explain a whole bunch of details that you've
> ignored.

Yeah, that's not a problem :) I'm ready to explain and describe anything
that may raise a question. Just ask ;)

> For example, if the remote VMA is MAP_ANONYMOUS, do you get
> a CoW copy of it? I assume you don't since the whole point is to
> write to remote memory

But no, there *is* COW semantics. We do not copy memory. We copy
page table content. This is just the same as what we have on fork(), when
a child duplicates the parent's VMA and the related page table subset,
and the parent's PTEs lose the _PAGE_RW flag.

The copy_page_range() code is reused for that. Please see [3/7]
for the details.

I expect a further performance gain from THP, where the number of entries
to copy is smaller than in the PTE case.

Copying several PMDs from one task's page table to another's is much, much
faster than process_vm_writev() copying pages (not even mentioning its
reading from swap).

> , but it's at the very least quite unusual in
> Linux to have two different anonymous VMAs such that writing one of
> them changes the other one.

Writing to the new VMA does not affect the old VMA. The old VMA is just
used to get vma->anon_vma and vma->vm_file from. The two VMAs remain
independent of each other.

> But there are plenty of other questions.
> What happens if the remote VMA is a gate area or other special mapping
> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
> that wasn't expecting the mapping to get magically copied to a
> different process?

In case someone wants to duplicate such mappings, we may consider
that and extend the interface in the future to VMA types which are
safe for it.

But for now the logic is very overprotective, and all the unusual mappings
you mentioned (also AIO, etc.) are prohibited. Please see [7/7]
for the details.

> This new API seems quite dangerous and complex to me, and I don't
> think the value has been adequately demonstrated.

I don't think it's dangerous or complex, because I haven't introduced
any principal VMA concepts different from what we have now. We just
borrow vma->anon_vma and vma->vm_file from the remote process into the
local one, like we do on fork() (borrowing vma->anon_vma means not
blindly copying it, but the ordinary anon_vma_fork()).

Maybe I should have focused the description more on the copying of
PTEs/PMDs instead of on VMA duplication. It was unexpected for me that
people would think about simple memory copying after reading the example
I gave. But I have given more explanation here, so I hope the situation
has become clearer for readers. Anyway, if you have any questions, please
ask me.

Thanks,
Kirill



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 15:52   ` Kirill Tkhai
@ 2019-05-21 15:59     ` Kirill Tkhai
  2019-05-21 16:20     ` Jann Horn
  2019-05-21 16:43     ` Andy Lutomirski
  2 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-21 15:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Jann Horn, Adam Borowski, Linux API, LKML,
	Linux-MM

On 21.05.2019 18:52, Kirill Tkhai wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>
>>
>>> [Summary]
>>>
>>> New syscall, which allows to clone a remote process VMA
>>> into local process VM. The remote process's page table
>>> entries related to the VMA are cloned into local process's
>>> page table (in any desired address, which makes this different
>>> from that happens during fork()). Huge pages are handled
>>> appropriately.
>>>
>>> This allows to improve performance in significant way like
>>> it's shows in the example below.
>>>
>>> [Description]
>>>
>>> This patchset adds a new syscall, which makes possible
>>> to clone a VMA from a process to current process.
>>> The syscall supplements the functionality provided
>>> by process_vm_writev() and process_vm_readv() syscalls,
>>> and it may be useful in many situation.
>>>
>>> For example, it allows to make a zero copy of data,
>>> when process_vm_writev() was previously used:
>>>
>>>         struct iovec local_iov, remote_iov;
>>>         void *buf;
>>>
>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>>         local_iov->iov_base = buf;
>>>         local_iov->iov_len = n * PAGE_SIZE;
>>>         remote_iov = ...;
>>>
>>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
>>>         munmap(buf, n * PAGE_SIZE);
>>>
>>>         (Note, that above completely ignores error handling)
>>>
>>> There are several problems with process_vm_writev() in this example:
>>>
>>> 1)it causes pagefault on remote process memory, and it forces
>>>   allocation of a new page (if was not preallocated);
>>
>> I don't see how your new syscall helps.  You're writing to remote
>> memory.  If that memory wasn't allocated, it's going to get allocated
>> regardless of whether you use a write-like interface or an mmap-like
>> interface.
> 
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
> 
> [task1]                                                        [task2]
> 
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> 
> <task1 populates buf>
> 
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
> 
> 
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.
> 
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
> 
>> Keep in mind that, on x86, just the hardware part of a
>> page fault is very slow -- populating the memory with a syscall
>> instead of a fault may well be faster.
> 
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
> 
> 1)process_vm_writev() reads them back into memory;
> 
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
> 
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.
> 
>>>
>>> 2)amount of memory for this example is doubled in a moment --
>>>   n pages in current and n pages in remote tasks are occupied
>>>   at the same time;
>>
>> This seems disingenuous.  If you're writing p pages total in chunks of
>> n pages, you will use a total of p pages if you use mmap and p+n if
>> you use write.
> 
> I didn't understand this sentence because of many ifs, sorry. Could you
> please explain your thought once again?
> 
>> That only doubles the amount of memory if you let n
>> scale linearly with p, which seems unlikely.
>>
>>>
>>> 3)received data has no a chance to be properly swapped for
>>>   a long time.
>>
>> ...
>>
>>> a)kernel moves @buf pages into swap right after recv();
>>> b)process_vm_writev() reads the data back from swap to pages;
>>
>> If you're under that much memory pressure and thrashing that badly,
>> your performance is going to be awful no matter what you're doing.  If
>> you indeed observe this behavior under normal loads, then this seems
>> like a VM issue that should be addressed in its own right.
> 
> I don't think so. Imagine: a container migrates from one node to another.
> The nodes are the same, say, every of them has 4GB of RAM.
> 
> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> After the page server on the second node received the pages, we want these
> pages become swapped as soon as possible, and we don't want to read them from
> swap to pass a read consumer.

Should be "to pass a *real* consumer".

> 
> The page server is task1 in the example. The real consumer is task2.
> 
> This is a rather normal load, I think.
> 
>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> [Task 2]
>>>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>>
>>> This creates a copy of VMA related to buf from task1 in task2's VM.
>>> Task1's page table entries are copied into corresponding page table
>>> entries of VM of task2.
>>
>> You need to fully explain a whole bunch of details that you've
>> ignored.
> 
> Yeah, it's not a problem :) I'm ready to explain and describe everything,
> what may cause a question. Just ask ;) 
> 
>> For example, if the remote VMA is MAP_ANONYMOUS, do you get
>> a CoW copy of it? I assume you don't since the whole point is to
>> write to remote memory
> 
> But, no, there *is* COW semantic. We do not copy memory. We copy
> page table content. This is just the same we have on fork(), when
> children duplicates parent's VMA and related page table subset,
> and parent's PTEs lose _PAGE_RW flag.
> 
> There is all copy_page_range() code reused for that. Please, see [3/7]
> for the details.
> 
> I'm going to get special performance using THP, when number of entries
> to copy is smaller than in case of PTE.
> 
> Copy several of PMD from one task page table to another's is much much much faster,
> than process_vm_write() copies pages (even not mention about its reading from swap).
> 
>> ,but it's at the very least quite unusual in
>> Linux to have two different anonymous VMAs such that writing one of
>> them changes the other one.
> Writing to a new VMA does not affect old VMA. Old VMA is just used to
> get vma->anon_vma and vma->vm_file from there. Two VMAs remain independent
> each other.
> 
>> But there are plenty of other questions.
>> What happens if the remote VMA is a gate area or other special mapping
>> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
>> that wasn't expecting the mapping to get magically copied to a
>> different process?
> 
> In case of someone wants to duplicate such the mappings, we may consider
> that, and extend the interface in the future for VMA types, which are
> safe for that.
> 
> But now the logic is very overprotective, and all the unusual mappings
> like you mentioned (also AIO, etc) is prohibited. Please, see [7/7]
> for the details.
> 
>> This new API seems quite dangerous and complex to me, and I don't
>> think the value has been adequately demonstrated.
> 
> I don't think it's dangerous and complex, because of I haven't introduced
> any principal VMA conceptions different to what we have now. We just
> borrow vma->anon_vma and vma->vm_file from remote process to local
> like we did on fork() (borrowing of vma->anon_vma means not blindly
> copying, but ordinary anon_vma_fork()).
> 
> Maybe I had to focus the description more on copying of PTE/PMD
> instead of vma duplication. So, it's unexpected for me, that people
> think about simple memory copying after reading the example I gave.
> But I gave more explanation here, so I hope the situation became
> clearer for a reader. Anyway, if you have any questions, please
> ask me.
> 
> Thanks,
> Kirill
> 



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 15:52   ` Kirill Tkhai
  2019-05-21 15:59     ` Kirill Tkhai
@ 2019-05-21 16:20     ` Jann Horn
  2019-05-21 17:03       ` Kirill Tkhai
  2019-05-21 16:43     ` Andy Lutomirski
  2 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2019-05-21 16:20 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Adam Borowski, Linux API, LKML, Linux-MM

On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
> > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> New syscall, which allows to clone a remote process VMA
> >> into local process VM. The remote process's page table
> >> entries related to the VMA are cloned into local process's
> >> page table (in any desired address, which makes this different
> >> from that happens during fork()). Huge pages are handled
> >> appropriately.
[...]
> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >
> > I don't see how your new syscall helps.  You're writing to remote
> > memory.  If that memory wasn't allocated, it's going to get allocated
> > regardless of whether you use a write-like interface or an mmap-like
> > interface.
>
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
>
> [task1]                                                        [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
> <task1 populates buf>
>
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
>
>
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.
>
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
>
> > Keep in mind that, on x86, just the hardware part of a
> > page fault is very slow -- populating the memory with a syscall
> > instead of a fault may well be faster.
>
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
>
> 1)process_vm_writev() reads them back into memory;
>
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.
[...]
> > That only doubles the amount of memory if you let n
> > scale linearly with p, which seems unlikely.
> >
> >>
> >> 3)received data has no a chance to be properly swapped for
> >>   a long time.
> >
> > ...
> >
> >> a)kernel moves @buf pages into swap right after recv();
> >> b)process_vm_writev() reads the data back from swap to pages;
> >
> > If you're under that much memory pressure and thrashing that badly,
> > your performance is going to be awful no matter what you're doing.  If
> > you indeed observe this behavior under normal loads, then this seems
> > like a VM issue that should be addressed in its own right.
>
> I don't think so. Imagine: a container migrates from one node to another.
> The nodes are the same, say, every of them has 4GB of RAM.
>
> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> After the page server on the second node received the pages, we want these
> pages become swapped as soon as possible, and we don't want to read them from
> swap to pass a read consumer.

But you don't have to copy that memory into the container's tasks all
at once, right? Can't you, every time you've received a few dozen
kilobytes of data or whatever, shove them into the target task? That
way you don't have problems with swap because the time before the data
has arrived in its final VMA is tiny.



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 15:52   ` Kirill Tkhai
  2019-05-21 15:59     ` Kirill Tkhai
  2019-05-21 16:20     ` Jann Horn
@ 2019-05-21 16:43     ` Andy Lutomirski
  2019-05-21 17:44       ` Kirill Tkhai
  2 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-21 16:43 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Jann Horn, Adam Borowski, Linux API, LKML,
	Linux-MM

On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 21.05.2019 17:43, Andy Lutomirski wrote:
> > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >
> >> [Summary]
> >>
> >> New syscall, which allows to clone a remote process VMA
> >> into local process VM. The remote process's page table
> >> entries related to the VMA are cloned into local process's
> >> page table (in any desired address, which makes this different
> >> from that happens during fork()). Huge pages are handled
> >> appropriately.
> >>
> >> This allows to improve performance in significant way like
> >> it's shows in the example below.
> >>
> >> [Description]
> >>
> >> This patchset adds a new syscall, which makes possible
> >> to clone a VMA from a process to current process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> >>
> >> For example, it allows to make a zero copy of data,
> >> when process_vm_writev() was previously used:
> >>
> >>         struct iovec local_iov, remote_iov;
> >>         void *buf;
> >>
> >>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> >>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
> >>         recv(sock, buf, n * PAGE_SIZE, 0);
> >>
> >>         local_iov->iov_base = buf;
> >>         local_iov->iov_len = n * PAGE_SIZE;
> >>         remote_iov = ...;
> >>
> >>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
> >>         munmap(buf, n * PAGE_SIZE);
> >>
> >>         (Note, that above completely ignores error handling)
> >>
> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >
> > I don't see how your new syscall helps.  You're writing to remote
> > memory.  If that memory wasn't allocated, it's going to get allocated
> > regardless of whether you use a write-like interface or an mmap-like
> > interface.
>
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
>
> [task1]                                                        [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
> <task1 populates buf>
>
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
>
>
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.

If I understand this correctly, your intended use is to have one task
allocate memory and fill it, have the other task clone the VMA, and
have the first task free the VMA?  If so, that wasn't at all obvious
from your original email.

Why don't you use splice() instead?  splice() the data to the remote
task and have the remote task read() it?  All these VMA games will
result in a lot of flushes, which is bad for performance.  Or,
depending on your exact constraints, you could map a memfd in both
tasks instead, which has the same flushing issues but at least has a
sensible API.
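
For concreteness, a minimal sketch of that memfd approach could look like
this (the fd would be passed to the other task over a unix socket with
SCM_RIGHTS, which is omitted here; the names are illustrative, not from
the patch set):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	/* Sender: create the shared buffer and an fd the peer can mmap. */
	static void *share_buf(size_t len, int *fd_out)
	{
		int fd = memfd_create("xfer", MFD_CLOEXEC);

		if (fd < 0 || ftruncate(fd, len) < 0)
			return MAP_FAILED;
		*fd_out = fd;	/* send this fd to the peer */
		return mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
	}

	/* Receiver: after obtaining fd via SCM_RIGHTS:
	 *	buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	 */

Both tasks then see the same pages, with no page-table cloning needed.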

>
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
>
> > Keep in mind that, on x86, just the hardware part of a
> > page fault is very slow -- populating the memory with a syscall
> > instead of a fault may well be faster.
>
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
>
> 1)process_vm_writev() reads them back into memory;
>
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.

What are you doing that is causing *disk* IO in any of this?  I
suspect your real problem is that you are using far too large of a
buffer. See below.

>
> >>
> >> 2)amount of memory for this example is doubled in a moment --
> >>   n pages in current and n pages in remote tasks are occupied
> >>   at the same time;
> >
> > This seems disingenuous.  If you're writing p pages total in chunks of
> > n pages, you will use a total of p pages if you use mmap and p+n if
> > you use write.
>
> I didn't understand this sentence because of many ifs, sorry. Could you
> please explain your thought once again?

You seem to have a function that tries to populate p pages of memory
with data received from a socket.  It looks like you're doing
something like this:

void copy_p_pages(int fd, pid_t pid, char *remote_base, size_t p)
{
  size_t n = some_value(p);           /* chunk size, in pages */
  char *buf = malloc(n * PAGE_SIZE);
  for (size_t i = 0; i < p; i += n) { /* p pages total, n per chunk */
    struct iovec lv = { buf, n * PAGE_SIZE };
    struct iovec rv = { remote_base + i * PAGE_SIZE, n * PAGE_SIZE };
    read(fd, buf, n * PAGE_SIZE);     /* check return value, etc */
    process_vm_writev(pid, &lv, 1, &rv, 1, 0);
  }
  free(buf);
}

If you have a *constant* n (i.e. some_value(p) is just a number like
16), then you aren't doubling memory usage.  If you have
some_value(p) return p, then you are indeed doubling memory usage.  So
don't do that!

If buf is getting swapped out, you are very likely doing something
wrong.  If you're using a 100MB buffer or a 10GB, then I'm not
surprised you have problems.  Try something reasonable like 128kB. For
extra fun, you could mlock() that buf, but if you're thrashing on
access to a 128kB working set, you will probably also get your *code*
swapped out, in which case you pretty much lose.


> > For example, if the remote VMA is MAP_ANONYMOUS, do you get
> > a CoW copy of it? I assume you don't since the whole point is to
> > write to remote memory
>
> But, no, there *is* COW semantic. We do not copy memory. We copy
> page table content. This is just the same we have on fork(), when
> children duplicates parent's VMA and related page table subset,
> and parent's PTEs lose _PAGE_RW flag.

Then you need to document this very carefully, because other people
will use your syscall in different ways than you use it.

And, if you are doing CoW like this, then your syscall is basically
only useful for your really weird use case in which you're using it to
import an already-populated VMA.  Maybe this is a reasonable feature
to add to the kernel, but it needs a benchmark against a reasonable
alternative.

>
> There is all copy_page_range() code reused for that. Please, see [3/7]
> for the details.

You can't ask users of a syscall to read the nitty-gritty mm code to
figure out what the syscall does from a user's perspective.

> > But there are plenty of other questions.
> > What happens if the remote VMA is a gate area or other special mapping
> > (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
> > that wasn't expecting the mapping to get magically copied to a
> > different process?
>
> In case of someone wants to duplicate such the mappings, we may consider
> that, and extend the interface in the future for VMA types, which are
> safe for that.

Do you mean that the code you sent rejects this case?  If so, please
document it.  In any case, I looked at the code, and it seems to be
trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
would reject copying a vDSO.



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 16:20     ` Jann Horn
@ 2019-05-21 17:03       ` Kirill Tkhai
  2019-05-21 17:28         ` Jann Horn
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-21 17:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Adam Borowski, Linux API, LKML, Linux-MM

On 21.05.2019 19:20, Jann Horn wrote:
> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> New syscall, which allows to clone a remote process VMA
>>>> into local process VM. The remote process's page table
>>>> entries related to the VMA are cloned into local process's
>>>> page table (in any desired address, which makes this different
>>>> from that happens during fork()). Huge pages are handled
>>>> appropriately.
> [...]
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>
>>> I don't see how your new syscall helps.  You're writing to remote
>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>> regardless of whether you use a write-like interface or an mmap-like
>>> interface.
>>
>> No, the talk is not about just another interface for copying memory.
>> The talk is about borrowing of remote task's VMA and corresponding
>> page table's content. Syscall allows to copy part of page table
>> with preallocated pages from remote to local process. See here:
>>
>> [task1]                                                        [task2]
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>
>> <task1 populates buf>
>>
>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>> munmap(buf);
>>
>>
>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>> just like in the way we do during fork syscall.
>>
>> There is no copying of buf memory content, unless COW happens. This is
>> the principal difference to process_vm_writev(), which just allocates
>> pages in remote VM.
>>
>>> Keep in mind that, on x86, just the hardware part of a
>>> page fault is very slow -- populating the memory with a syscall
>>> instead of a fault may well be faster.
>>
>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>> pages related to buf of task1 are swapped:
>>
>> 1)process_vm_writev() reads them back into memory;
>>
>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>   to task2 page table.
>>
>> Also, for faster page faults one may use huge pages for the mappings.
>> But really, it's funny to think about page faults, when there are
>> disk IO problems I shown.
> [...]
>>> That only doubles the amount of memory if you let n
>>> scale linearly with p, which seems unlikely.
>>>
>>>>
>>>> 3)received data has no a chance to be properly swapped for
>>>>   a long time.
>>>
>>> ...
>>>
>>>> a)kernel moves @buf pages into swap right after recv();
>>>> b)process_vm_writev() reads the data back from swap to pages;
>>>
>>> If you're under that much memory pressure and thrashing that badly,
>>> your performance is going to be awful no matter what you're doing.  If
>>> you indeed observe this behavior under normal loads, then this seems
>>> like a VM issue that should be addressed in its own right.
>>
>> I don't think so. Imagine: a container migrates from one node to another.
>> The nodes are the same, say, every of them has 4GB of RAM.
>>
>> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
>> After the page server on the second node received the pages, we want these
>> pages become swapped as soon as possible, and we don't want to read them from
>> swap to pass a read consumer.
> 
> But you don't have to copy that memory into the container's tasks all
> at once, right? Can't you, every time you've received a few dozen
> kilobytes of data or whatever, shove them into the target task? That
> way you don't have problems with swap because the time before the data
> has arrived in its final VMA is tiny.

We try to keep the downtime of live migration as small as possible,
and the container on the source node is completely stopped only at the
very end. The memory of the container's tasks is copied in the background,
without stopping the container, and _PAGE_SOFT_DIRTY is used to track
dirty pages.
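
For reference, here is a minimal sketch of that soft-dirty tracking step,
using only the documented /proc interface (an illustration, not code from
this series): clear the bits before a pre-copy pass, then test bit 55 of
the /proc/<pid>/pagemap entries to find pages dirtied in the meantime:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int clear_soft_dirty(pid_t pid)
	{
		char path[64];
		int fd, ret;

		snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return -1;
		/* Writing "4" resets the soft-dirty bits of all PTEs. */
		ret = write(fd, "4", 1) == 1 ? 0 : -1;
		close(fd);
		return ret;
	}

	static int page_is_soft_dirty(pid_t pid, unsigned long vaddr)
	{
		char path[64];
		uint64_t entry;
		int fd, ret = -1;

		snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;
		if (pread(fd, &entry, sizeof(entry),
			  (vaddr / getpagesize()) * sizeof(entry)) == sizeof(entry))
			ret = (int)((entry >> 55) & 1);	/* bit 55: soft-dirty */
		close(fd);
		return ret;
	}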

Container may create any new processes during the migration, and these
processes may contain any memory mappings.

Imagine the situation. We migrate a big web server with a lot of processes,
and some of children processes have the same COW mapping as parent has.
In case of all memory dump is available at the moment of the grand parent
web server process creation, we populate the mapping in parent, and all
the children may inherit the mapping in case of they want after fork.
COW works here. But in case of some processes are created before all memory
is available on destination node, we can't do such the COW inheritance.
This will be the reason, the memory consumed by container grows many
times after the migration. So, the only solution is to create process
tree after memory is available and all mappings are known.

That is just one of the examples. Believe me, there are many other
reasons why the process tree should be created only after the whole
source tree is frozen and no new tasks can appear on the source: PGID
and SID inheritance, for example. All of this requires a special order
of task creation. If you try to restore a process tree with correct
namespaces, especially with many user namespaces in the container,
before that point, hell opens before your eyes, and we never even
consider it.

So, no, we can't create any task before the whole process tree is known.
Believe me, the reason is weighty and serious.

Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 17:03       ` Kirill Tkhai
@ 2019-05-21 17:28         ` Jann Horn
  2019-05-22 10:03           ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2019-05-21 17:28 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Adam Borowski, Linux API, LKML, Linux-MM

On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 21.05.2019 19:20, Jann Horn wrote:
> > On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> On 21.05.2019 17:43, Andy Lutomirski wrote:
> >>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>>> New syscall, which allows to clone a remote process VMA
> >>>> into local process VM. The remote process's page table
> >>>> entries related to the VMA are cloned into local process's
> >>>> page table (in any desired address, which makes this different
> >>>> from that happens during fork()). Huge pages are handled
> >>>> appropriately.
> > [...]
> >>>> There are several problems with process_vm_writev() in this example:
> >>>>
> >>>> 1)it causes pagefault on remote process memory, and it forces
> >>>>   allocation of a new page (if was not preallocated);
> >>>
> >>> I don't see how your new syscall helps.  You're writing to remote
> >>> memory.  If that memory wasn't allocated, it's going to get allocated
> >>> regardless of whether you use a write-like interface or an mmap-like
> >>> interface.
> >>
> >> No, the talk is not about just another interface for copying memory.
> >> The talk is about borrowing of remote task's VMA and corresponding
> >> page table's content. Syscall allows to copy part of page table
> >> with preallocated pages from remote to local process. See here:
> >>
> >> [task1]                                                        [task2]
> >>
> >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> >>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> >>
> >> <task1 populates buf>
> >>
> >>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> >> munmap(buf);
> >>
> >>
> >> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> >> just like in the way we do during fork syscall.
> >>
> >> There is no copying of buf memory content, unless COW happens. This is
> >> the principal difference to process_vm_writev(), which just allocates
> >> pages in remote VM.
> >>
> >>> Keep in mind that, on x86, just the hardware part of a
> >>> page fault is very slow -- populating the memory with a syscall
> >>> instead of a fault may well be faster.
> >>
> >> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> >> pages related to buf of task1 are swapped:
> >>
> >> 1)process_vm_writev() reads them back into memory;
> >>
> >> 2)process_vm_mmap() just copies swap PTEs from task1 page table
> >>   to task2 page table.
> >>
> >> Also, for faster page faults one may use huge pages for the mappings.
> >> But really, it's funny to think about page faults, when there are
> >> disk IO problems I shown.
> > [...]
> >>> That only doubles the amount of memory if you let n
> >>> scale linearly with p, which seems unlikely.
> >>>
> >>>>
> >>>> 3)received data has no a chance to be properly swapped for
> >>>>   a long time.
> >>>
> >>> ...
> >>>
> >>>> a)kernel moves @buf pages into swap right after recv();
> >>>> b)process_vm_writev() reads the data back from swap to pages;
> >>>
> >>> If you're under that much memory pressure and thrashing that badly,
> >>> your performance is going to be awful no matter what you're doing.  If
> >>> you indeed observe this behavior under normal loads, then this seems
> >>> like a VM issue that should be addressed in its own right.
> >>
> >> I don't think so. Imagine: a container migrates from one node to another.
> >> The nodes are the same, say, every of them has 4GB of RAM.
> >>
> >> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> >> After the page server on the second node received the pages, we want these
> >> pages become swapped as soon as possible, and we don't want to read them from
> >> swap to pass a read consumer.
> >
> > But you don't have to copy that memory into the container's tasks all
> > at once, right? Can't you, every time you've received a few dozen
> > kilobytes of data or whatever, shove them into the target task? That
> > way you don't have problems with swap because the time before the data
> > has arrived in its final VMA is tiny.
>
> We try to maintain online migration with as small downtime as possible,
> and the container on source node is completely stopped at the very end.
> Memory of container tasks is copied in background without container
> completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages.
>
> Container may create any new processes during the migration, and these
> processes may contain any memory mappings.
>
> Imagine the situation. We migrate a big web server with a lot of processes,
> and some of children processes have the same COW mapping as parent has.
> In case of all memory dump is available at the moment of the grand parent
> web server process creation, we populate the mapping in parent, and all
> the children may inherit the mapping in case of they want after fork.
> COW works here. But in case of some processes are created before all memory
> is available on destination node, we can't do such the COW inheritance.
> This will be the reason, the memory consumed by container grows many
> times after the migration. So, the only solution is to create process
> tree after memory is available and all mappings are known.

But if one of the processes modifies the memory after you've started
migrating it to the new machine, that memory can't be CoW anymore
anyway, right? So it should work if you do a first pass of
copying the memory and creating the process hierarchy, and then copy
more recent changes into the individual processes, breaking the CoW
for those pages, right?

> It's on of the examples. But believe me, there are a lot of another reasons,
> why process tree should be created only after all process tree is freezed,
> and no new tasks on source are possible. PGID and SSID inheritance, for
> example. All of this requires special order of tasks creation. In case of
> you try to restore process tree with correct namespaces and especial in
> case of many user namespaces in a container, you will just see like a hell
> will open before your eyes, and we never can think about this.

Could you elaborate on why that is so hellish?


> So, no, we can't create any task before the whole process tree is knows.
> Believe me, the reason is heavy and serious.
>
> Kirill
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 16:43     ` Andy Lutomirski
@ 2019-05-21 17:44       ` Kirill Tkhai
  2019-05-23 16:19         ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-21 17:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Jann Horn, Adam Borowski, Linux API, LKML,
	Linux-MM

On 21.05.2019 19:43, Andy Lutomirski wrote:
> On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>
>>>> [Summary]
>>>>
>>>> New syscall, which allows to clone a remote process VMA
>>>> into local process VM. The remote process's page table
>>>> entries related to the VMA are cloned into local process's
>>>> page table (in any desired address, which makes this different
>>>> from that happens during fork()). Huge pages are handled
>>>> appropriately.
>>>>
>>>> This allows to improve performance in significant way like
>>>> it's shows in the example below.
>>>>
>>>> [Description]
>>>>
>>>> This patchset adds a new syscall, which makes possible
>>>> to clone a VMA from a process to current process.
>>>> The syscall supplements the functionality provided
>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>> and it may be useful in many situation.
>>>>
>>>> For example, it allows to make a zero copy of data,
>>>> when process_vm_writev() was previously used:
>>>>
>>>>         struct iovec local_iov, remote_iov;
>>>>         void *buf;
>>>>
>>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>>
>>>>         local_iov->iov_base = buf;
>>>>         local_iov->iov_len = n * PAGE_SIZE;
>>>>         remove_iov = ...;
>>>>
>>>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>>>>         munmap(buf, n * PAGE_SIZE);
>>>>
>>>>         (Note, that above completely ignores error handling)
>>>>
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>
>>> I don't see how your new syscall helps.  You're writing to remote
>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>> regardless of whether you use a write-like interface or an mmap-like
>>> interface.
>>
>> No, the talk is not about just another interface for copying memory.
>> The talk is about borrowing of remote task's VMA and corresponding
>> page table's content. Syscall allows to copy part of page table
>> with preallocated pages from remote to local process. See here:
>>
>> [task1]                                                        [task2]
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>
>> <task1 populates buf>
>>
>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>> munmap(buf);
>>
>>
>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>> just like in the way we do during fork syscall.
> 
> If I understand this correctly, your intended use is to have one task
> allocate memory and fill it, have the other task clone the VMA, and
> have the first task free the VMA?  If so, that wasn't at all obvious
> from your original email.

Yes, exactly this. Sorry for the confusing initial description; that was not intentional.
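
(Spelled out, the intended pattern is roughly the following. The
prototype is illustrative; the real declaration is added in patch 1/7,
and the flag argument corresponds to the PVMMAP_* flags mentioned in
the cover letter.)

	/* Illustrative prototype; see patch 1/7 for the real declaration. */
	void *process_vm_mmap(pid_t pid, unsigned long src_addr,
			      unsigned long len, unsigned long dst_addr,
			      unsigned long flags);

	/* task1: allocate and fill */
	src = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	<task1 populates src>

	/* task2: clone task1's VMA, PTEs (including swap entries) and all */
	dst = process_vm_mmap(pid_of_task1, (unsigned long)src,
			      n * PAGE_SIZE, 0, 0);

	/* task1: drop its copy */
	munmap(src, n * PAGE_SIZE);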

> Why don't you use splice() instead?

I just don't see how anonymous memory could be moved from one process
to another via splice(). Could you explain your idea in more detail?

> splice() the data to the remote
> task and have the remove task read() it?  All these VMA games will
> result in a lot of flushes, which is bad for performance.  Or,
> depending on your exact constraints, you could map a memfd in both
> tasks instead, which has the same flushing issues but at least has a
> sensible API.

memfd() gives a file-backed mapping, and that is not suitable here.
If a process had an anonymous mapping before the migration, it wants
the mapping to remain anonymous after the migration. So if we used
memfd(), we would have to copy the memory from the memfd mapping to
its real anonymous mapping target, which has the same problems as
process_vm_writev().
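
(For completeness, the memfd variant Andy suggests would look roughly
like the sketch below, with fd passing over a unix socket assumed. It
also illustrates the objection: the received pages live in a
file-backed MAP_SHARED mapping, so restoring a MAP_PRIVATE|MAP_ANONYMOUS
area from them still costs one more copy.)

	/* page server */
	int fd = memfd_create("pages", MFD_CLOEXEC);
	ftruncate(fd, n * PAGE_SIZE);
	void *p = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		       MAP_SHARED, fd, 0);
	recv(sock, p, n * PAGE_SIZE, 0);
	/* ...pass fd to the target task over a unix socket (SCM_RIGHTS)... */

	/* target task: sees the same pages without a copy, but only as a
	 * file-backed MAP_SHARED mapping, not as the MAP_ANONYMOUS
	 * mapping the restored process originally had */
	void *q = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		       MAP_SHARED, fd, 0);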

>>
>> There is no copying of buf memory content, unless COW happens. This is
>> the principal difference to process_vm_writev(), which just allocates
>> pages in remote VM.
>>
>>> Keep in mind that, on x86, just the hardware part of a
>>> page fault is very slow -- populating the memory with a syscall
>>> instead of a fault may well be faster.
>>
>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>> pages related to buf of task1 are swapped:
>>
>> 1)process_vm_writev() reads them back into memory;
>>
>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>   to task2 page table.
>>
>> Also, for faster page faults one may use huge pages for the mappings.
>> But really, it's funny to think about page faults, when there are
>> disk IO problems I shown.
> 
> What are you doing that is causing *disk* IO in any of this?  I
> suspect your real problem is that you are using far too large of a
> buffer. See below.

Imagine we are migrating a container which consists of 9GB of pages,
and we have 8GB of RAM on the destination node. Before the migration,
some of its pages were in RAM and some were in swap.

The source node sends pages to the destination node, and there are
limitations which do not allow us to start creating the process tree
on the destination node before all memory has been received.

Pages are received by a page server task on the destination. After all
pages have been received, we create the process tree and populate the
container tasks' mappings.

When we populate a task's mapping, we have to copy memory from the page
server to the target task. If those pages have been swapped out of the
page server's address space, we have to read them back from swap
synchronously. This introduces big latency, and this is the big IO I
talked about.

> 
>>
>>>>
>>>> 2)amount of memory for this example is doubled in a moment --
>>>>   n pages in current and n pages in remote tasks are occupied
>>>>   at the same time;
>>>
>>> This seems disingenuous.  If you're writing p pages total in chunks of
>>> n pages, you will use a total of p pages if you use mmap and p+n if
>>> you use write.
>>
>> I didn't understand this sentence because of many ifs, sorry. Could you
>> please explain your thought once again?
> 
> You seem to have a function that tries to populate p pages of memory
> with data received from a socket.  It looks like you're doing
> something like this:
> 
> void copy_p_pages(size_t p)
> {
>   size_t n = some_value(p);
>   char *buf = malloc(n * PAGE_SIZE);
>   for (size_t i = 0; i < p; i += n) {  /* p and n both count pages */
>     read(fd, buf, n*PAGE_SIZE);  /* check return value, etc */
>     process_vm_writev(/* write n*PAGE_SIZE bytes to the remote process */);
>   }
>   free(buf);
> }
> 
> If you have a *constant* n (i.e. some_value(p) is just a number like
> 16)), then you aren't doubling memory usage.  If you have
> some_value(p) return p, then you are indeed doubling memory usage.  So
> don't do that!
> If buf is getting swapped out, you are very likely doing something
> wrong.  If you're using a 100MB buffer or a 10GB, then I'm not
> surprised you have problems.  Try something reasonable like 128kB. For
> extra fun, you could mlock() that buf, but if you're thrashing on
> access to a 128kB working set, you will probably also get your *code*
> swapped out, in which case you pretty much lose.

The thing is that we can't use a small buffer. We have to receive all
the restored tasks' pages on the destination node before we start
creating the process tree, as I wrote above. All the anonymous memory
is mapped into the page server's MM, so it may already be swapped out
by the time the container's process tree starts to be created.
 
>>> For example, if the remote VMA is MAP_ANONYMOUS, do you get
>>> a CoW copy of it? I assume you don't since the whole point is to
>>> write to remote memory
>>
>> But, no, there *is* COW semantic. We do not copy memory. We copy
>> page table content. This is just the same we have on fork(), when
>> children duplicates parent's VMA and related page table subset,
>> and parent's PTEs lose _PAGE_RW flag.
> 
> Then you need to document this very carefully, because other people
> will use your syscall in different ways than you use it.

Ok, I'll document that.

> And, if you are doing CoW like this, then your syscall is basically
> only useful for your really weird use case in which you're using it to
> import an already-populated VMA.  Maybe this is a reasonable feature
> to add to the kernel, but it needs a benchmark against a reasonable
> alternative.

Do you mean a comparison with process_vm_writev()/process_vm_readv()
or something like that?

>>
>> There is all copy_page_range() code reused for that. Please, see [3/7]
>> for the details.
> 
> You can't as users of a syscall to read the nitty gritty mm code to
> figure out what the syscall does from a user's perspective.

Yeah, sure :)
 
>>> But there are plenty of other questions.
>>> What happens if the remote VMA is a gate area or other special mapping
>>> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
>>> that wasn't expecting the mapping to get magically copied to a
>>> different process?
>>
>> In case of someone wants to duplicate such the mappings, we may consider
>> that, and extend the interface in the future for VMA types, which are
>> safe for that.
> 
> Do you mean that the code you sent rejects this case?  If so, please
> document it.  In any case, I looked at the code, and it seems to be
> trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
> would reject copying a vDSO.

I prohibit all VMAs which contain one of the flags VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO.
I'll check carefully whether that is enough for the vDSO.
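
(In kernel style, the filter amounts to a check like the one below;
the helper name is hypothetical, and only the flag mask is taken from
the sentence above. The vDSO and similar special mappings are installed
with VM_DONTEXPAND set, which is why they may already be covered.)

	/* Hypothetical helper; the mask is the one quoted above. */
	static bool pvm_mmap_allowed(struct vm_area_struct *vma)
	{
		/* Reject hugetlb, driver and special mappings. */
		if (vma->vm_flags & (VM_HUGETLB | VM_DONTEXPAND |
				     VM_PFNMAP | VM_IO))
			return false;
		return true;
	}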

Thanks,
Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 17:28         ` Jann Horn
@ 2019-05-22 10:03           ` Kirill Tkhai
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-22 10:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Adam Borowski, Linux API, LKML, Linux-MM

On 21.05.2019 20:28, Jann Horn wrote:
> On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 21.05.2019 19:20, Jann Horn wrote:
>>> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>> New syscall, which allows to clone a remote process VMA
>>>>>> into local process VM. The remote process's page table
>>>>>> entries related to the VMA are cloned into local process's
>>>>>> page table (in any desired address, which makes this different
>>>>>> from that happens during fork()). Huge pages are handled
>>>>>> appropriately.
>>> [...]
>>>>>> There are several problems with process_vm_writev() in this example:
>>>>>>
>>>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>>>   allocation of a new page (if was not preallocated);
>>>>>
>>>>> I don't see how your new syscall helps.  You're writing to remote
>>>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>>>> regardless of whether you use a write-like interface or an mmap-like
>>>>> interface.
>>>>
>>>> No, the talk is not about just another interface for copying memory.
>>>> The talk is about borrowing of remote task's VMA and corresponding
>>>> page table's content. Syscall allows to copy part of page table
>>>> with preallocated pages from remote to local process. See here:
>>>>
>>>> [task1]                                                        [task2]
>>>>
>>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>>
>>>> <task1 populates buf>
>>>>
>>>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>>>> munmap(buf);
>>>>
>>>>
>>>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>>>> just like in the way we do during fork syscall.
>>>>
>>>> There is no copying of buf memory content, unless COW happens. This is
>>>> the principal difference to process_vm_writev(), which just allocates
>>>> pages in remote VM.
>>>>
>>>>> Keep in mind that, on x86, just the hardware part of a
>>>>> page fault is very slow -- populating the memory with a syscall
>>>>> instead of a fault may well be faster.
>>>>
>>>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>>>> pages related to buf of task1 are swapped:
>>>>
>>>> 1)process_vm_writev() reads them back into memory;
>>>>
>>>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>>>   to task2 page table.
>>>>
>>>> Also, for faster page faults one may use huge pages for the mappings.
>>>> But really, it's funny to think about page faults, when there are
>>>> disk IO problems I shown.
>>> [...]
>>>>> That only doubles the amount of memory if you let n
>>>>> scale linearly with p, which seems unlikely.
>>>>>
>>>>>>
>>>>>> 3)received data has no a chance to be properly swapped for
>>>>>>   a long time.
>>>>>
>>>>> ...
>>>>>
>>>>>> a)kernel moves @buf pages into swap right after recv();
>>>>>> b)process_vm_writev() reads the data back from swap to pages;
>>>>>
>>>>> If you're under that much memory pressure and thrashing that badly,
>>>>> your performance is going to be awful no matter what you're doing.  If
>>>>> you indeed observe this behavior under normal loads, then this seems
>>>>> like a VM issue that should be addressed in its own right.
>>>>
>>>> I don't think so. Imagine: a container migrates from one node to another.
>>>> The nodes are the same, say, every of them has 4GB of RAM.
>>>>
>>>> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
>>>> After the page server on the second node received the pages, we want these
>>>> pages become swapped as soon as possible, and we don't want to read them from
>>>> swap to pass a read consumer.
>>>
>>> But you don't have to copy that memory into the container's tasks all
>>> at once, right? Can't you, every time you've received a few dozen
>>> kilobytes of data or whatever, shove them into the target task? That
>>> way you don't have problems with swap because the time before the data
>>> has arrived in its final VMA is tiny.
>>
>> We try to maintain online migration with as small downtime as possible,
>> and the container on source node is completely stopped at the very end.
>> Memory of container tasks is copied in background without container
>> completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages.
>>
>> Container may create any new processes during the migration, and these
>> processes may contain any memory mappings.
>>
>> Imagine the situation. We migrate a big web server with a lot of processes,
>> and some of children processes have the same COW mapping as parent has.
>> In case of all memory dump is available at the moment of the grand parent
>> web server process creation, we populate the mapping in parent, and all
>> the children may inherit the mapping in case of they want after fork.
>> COW works here. But in case of some processes are created before all memory
>> is available on destination node, we can't do such the COW inheritance.
>> This will be the reason, the memory consumed by container grows many
>> times after the migration. So, the only solution is to create process
>> tree after memory is available and all mappings are known.
> 
> But if one of the processes modifies the memory after you've started
> migrating it to the new machine, that memory can't be CoW anymore
> anyway, right? So it should work if you first do a first pass of
> copying the memory and creating the process hierarchy, and then copy
> more recent changes into the individual processes, breaking the CoW
> for those pages, right?

Not so. All processes must be stopped on the source node before we
create the same process tree on the destination machine. The process
tree has to be analyzed completely before any attempt to recreate it
from the ground up. The analysis lets us choose the strategy and the
sequence for creating each process and for inheriting entities like
namespaces, mm, fdtables, etc. It's impossible to restore a process
tree if you have already started to create it but haven't stopped the
processes on the source node. Also, we can restore only a subset of
possible process trees, and you never know what a live process tree
will do next. So the source process tree must be frozen before the
restore.

Restoring an arbitrary process tree within all the Linux constraints
on the sequence of actions needed to recreate each process entity is
a nontrivial combinatorial problem, which has no solution at the
moment, and it's unknown whether it has one at all.

So, at the moment the restore starts, we know everything about all
tasks and their relationships. Yes, we start copying memory from source
to destination while the container on the source is still alive, but we
track that memory with the _PAGE_SOFT_DIRTY flag, and dirtied pages are
repopulated with the new content. Please see the comment about
hellishness below.
 
>> It's on of the examples. But believe me, there are a lot of another reasons,
>> why process tree should be created only after all process tree is freezed,
>> and no new tasks on source are possible. PGID and SSID inheritance, for
>> example. All of this requires special order of tasks creation. In case of
>> you try to restore process tree with correct namespaces and especial in
>> case of many user namespaces in a container, you will just see like a hell
>> will open before your eyes, and we never can think about this.
> 
> Could you elaborate on why that is so hellish?

Because you never know by which path the system came into the state
you're seeing at the moment. Even in a simple process chain like:

          task1
            |
          task2
            |
          task3
          /   \
      task4  task5

any of these processes may change its namespace, pgid or sid, unshare
its mm or files, become a child subreaper, or die and have its children
reparented, and it may do any of these before or after its parent
performed one of its own actions. These actions are not independent of
each other, and some actions prohibit others. You can restore the
process tree only if you replay the whole sequence in the same order
the container performed it.

It's impossible to say "task2, set your session to 5!", because the
only way to set the sid is to call the setsid() syscall, which may be
done only once in a process's life. setsid() in turn restricts later
setpgid() calls (see the code if you're interested), but that does not
mean we should always call setpgid() before it; there are no such
simple rules (a small sketch of this interaction follows below). The
same applies, even more so, to inheritance of namespaces: some of a
task's children may inherit the task's old namespaces, some its current
ones, and some its future ones. A child can enter a namespace, say a
net ns, only if its userns allows it. This imposes yet another
constraint on the process creation order, yet it still does not yield
a stable rule for choosing the order in which to create the process
tree. The only rule is "at each moment in time, one of the tasks must
perform one of the above actions".

This is a combinatorial problem of ordering a finite number of actors
under a finite number of rules and relationships between them, and the
rules do not allow an unambiguous replay from scratch given only the
end state.

We handle a limited set of likely process tree configurations well,
and that is enough for most cases. But the general problem is very
difficult, and it is currently not even proven to have a solution in
the form of a finite set of rules you can apply to restore an
arbitrary process tree.

Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping Kirill Tkhai
                   ` (7 preceding siblings ...)
  2019-05-21 14:43 ` [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping Andy Lutomirski
@ 2019-05-22 15:22 ` Kirill A. Shutemov
  2019-05-23 16:11   ` Kirill Tkhai
                     ` (2 more replies)
  8 siblings, 3 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2019-05-22 15:22 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes possible
> to clone a VMA from a process to current process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.

Kirill, could you explain how the change affects rmap and why it is safe.

My concern is that the patchset allows mapping the same page multiple
times within one process, or even mapping a page allocated by a child
into the parent.

It was not allowed before.

In the best case it makes reasoning about rmap substantially more difficult.

But I worry it will introduce hard-to-debug bugs, like the ones described in
https://lwn.net/Articles/383162/.

Note that in some cases we care about rmap walk order (see for instance
the mremap() case). I'm not convinced that the feature will not break
something in this area.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-22 15:22 ` Kirill A. Shutemov
@ 2019-05-23 16:11   ` Kirill Tkhai
  2019-05-24 10:45   ` Kirill Tkhai
  2019-06-03 14:38   ` Kirill Tkhai
  2 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-23 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a VMA from a process to current process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
> 
> Kirill, could you explain how the change affects rmap and how it is safe.
> 
> My concern is that the patchset allows to map the same page multiple times
> within one process or even map page allocated by child to the parrent.
> 
> It was not allowed before.
>
> In the best case it makes reasoning about rmap substantially more difficult.

I don't think there is a big impact from process relationships because,
as before, the main rule of VMA chaining is that VMAs are younger or
older relative to each other. For example, reusing an anon_vma in
anon_vma_clone() may happen for either children or siblings.
Reparenting is also possible after one of the processes dies, as is
splitting two branches of processes that share the same grandparent
into two chains after the grandparent dies, so many combinations are
already possible today.

Mapping the same page multiple times is a different thing, and it was
never allowed for rmap.

Could you say more specifically what looks suspicious to you, and I'll
try to answer that? Otherwise it's possible to write an explanation as
big as a dissertation and still miss all the answers you're
interested in :)

> 
> But I'm worry it will introduce hard-to-debug bugs, like described in
> https://lwn.net/Articles/383162/.

I read the article, but there are a lot of messages in that thread,
and I'm not sure I found the actual fix. It looks like one of the
fixes may be the use of anon_vma->root in __page_set_anon_rmap().
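
(For reference, the hunk in question in mm/rmap.c looks roughly like
this; quoted from memory, not from this patchset.)

	/* __page_set_anon_rmap(), mm/rmap.c, roughly: */
	if (!exclusive)
		anon_vma = anon_vma->root;	/* non-exclusive pages must
						 * point at the oldest
						 * (root) anon_vma */
	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
	page->mapping = (struct address_space *) anon_vma;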

> Note, that is some cases we care about rmap walk order (see for instance
> mremap() case). I'm not convinced that the feature will not break
> something in the area.

Yeah, thanks for pointing that out, I'll check it.

Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-21 17:44       ` Kirill Tkhai
@ 2019-05-23 16:19         ` Andy Lutomirski
  2019-05-24 10:36           ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-23 16:19 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin,
	Andrea Arcangeli, Hugh Dickins, Jerome Glisse, Mel Gorman,
	daniel.m.jordan, Jann Horn, Adam Borowski, Linux API, LKML,
	Linux-MM

On Tue, May 21, 2019 at 10:44 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 21.05.2019 19:43, Andy Lutomirski wrote:
> > On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 21.05.2019 17:43, Andy Lutomirski wrote:

> > Do you mean that the code you sent rejects this case?  If so, please
> > document it.  In any case, I looked at the code, and it seems to be
> > trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
> > would reject copying a vDSO.
>
> I prohibit all the VMAs, which contain on of flags: VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO.
> I'll check carefully, whether it's enough for vDSO.

I think you could make the new syscall a lot more comprehensible by
restricting it to just MAP_ANONYMOUS, by making it unmap the source,
or possibly both.  If the new syscall unmaps the source (ordered so
that the source is gone before the newly mapped pages become
accessible), then you avoid issues in which you need to define
sensible semantics for what happens if both copies are accessed
simultaneously.


--Andy


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-23 16:19         ` Andy Lutomirski
@ 2019-05-24 10:36           ` Kirill Tkhai
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-24 10:36 UTC (permalink / raw)
  To: Andy Lutomirski, Kirill A. Shutemov
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	alexander.h.duyck, Weiny Ira, Andrey Konovalov, arunks,
	Vlastimil Babka, Christoph Lameter, Rik van Riel, Kees Cook,
	Johannes Weiner, Nicholas Piggin, Mathieu Desnoyers,
	Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
	Jerome Glisse, Mel Gorman, daniel.m.jordan, Jann Horn,
	Adam Borowski, Linux API, LKML, Linux-MM

On 23.05.2019 19:19, Andy Lutomirski wrote:
> On Tue, May 21, 2019 at 10:44 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 21.05.2019 19:43, Andy Lutomirski wrote:
>>> On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> On 21.05.2019 17:43, Andy Lutomirski wrote:
> 
>>> Do you mean that the code you sent rejects this case?  If so, please
>>> document it.  In any case, I looked at the code, and it seems to be
>>> trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
>>> would reject copying a vDSO.
>>
>> I prohibit all the VMAs, which contain on of flags: VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO.
>> I'll check carefully, whether it's enough for vDSO.
> 
> I think you could make the new syscall a lot more comprehensible bg
> restricting it to just MAP_ANONYMOUS, by making it unmap the source,
> or possibly both.  If the new syscall unmaps the source (in order so
> that the source is gone before the newly mapped pages become
> accessible), then you avoid issues in which you need to define
> sensible semantics for what happens if both copies are accessed
> simultaneously.

If we unmap the source, this does not introduce the fundamentally new
behavior of the same page being mapped twice in a single process that
Kirill pointed out. This sounds like a good idea, and it covers my
application area.

The only fundamentally new thing is that a child process will be able
to inherit a parent's VMA, which is not possible now. But it looks like
we never depend on process relationships in the mapping code, and
process reparenting already produces many combinations, so the new
change should not affect this much.

Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-22 15:22 ` Kirill A. Shutemov
  2019-05-23 16:11   ` Kirill Tkhai
@ 2019-05-24 10:45   ` Kirill Tkhai
  2019-05-24 11:52     ` Kirill A. Shutemov
  2019-06-03 14:38   ` Kirill Tkhai
  2 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-24 10:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a VMA from a process to current process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
> 
> Kirill, could you explain how the change affects rmap and how it is safe.
> 
> My concern is that the patchset allows to map the same page multiple times
> within one process or even map page allocated by child to the parrent.
> 
> It was not allowed before.
> 
> In the best case it makes reasoning about rmap substantially more difficult.
> 
> But I'm worry it will introduce hard-to-debug bugs, like described in
> https://lwn.net/Articles/383162/.

Andy suggested unmapping the PTEs from the source page table, which
makes it impossible for a single page to ever be mapped twice in the
same process. This is OK for my use case, and here we would just take
the small step of "allow a child process to inherit a parent's VMA",
which we didn't have before. If someone still needs the same page
mapped twice in a single process in the future, that person will have
a supported basis in this small step. I believe someone like a debugger
may want that to make a fast snapshot of a process's private memory
(while the task is stopped for a short time to read its memory). But
for me, remapping is enough at the moment.

What do you think about this?

[...]

Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-24 10:45   ` Kirill Tkhai
@ 2019-05-24 11:52     ` Kirill A. Shutemov
  2019-05-24 14:00       ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2019-05-24 11:52 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> > On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >> This patchset adds a new syscall, which makes possible
> >> to clone a VMA from a process to current process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> > 
> > Kirill, could you explain how the change affects rmap and how it is safe.
> > 
> > My concern is that the patchset allows to map the same page multiple times
> > within one process or even map page allocated by child to the parrent.
> > 
> > It was not allowed before.
> > 
> > In the best case it makes reasoning about rmap substantially more difficult.
> > 
> > But I'm worry it will introduce hard-to-debug bugs, like described in
> > https://lwn.net/Articles/383162/.
> 
> Andy suggested to unmap PTEs from source page table, and this make the single
> page never be mapped in the same process twice. This is OK for my use case,
> and here we will just do a small step "allow to inherit VMA by a child process",
> which we didn't have before this. If someone still needs to continue the work
> to allow the same page be mapped twice in a single process in the future, this
> person will have a supported basis we do in this small step. I believe, someone
> like debugger may want to have this to make a fast snapshot of a process private
> memory (when the task is stopped for a small time to get its memory). But for
> me remapping is enough at the moment.
> 
> What do you think about this?

I don't think that unmapping alone will do. Consider the following
scenario:

1. Task A creates and populates the mapping.
2. Task A forks. We have now Task B mapping the same pages, but
write-protected.
3. Task B calls process_vm_mmap() and passes the mapping to the parent.

After this Task A will have the same anon pages mapped twice.
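
(The same scenario in code, assuming the remote-to-local form of the
syscall, i.e. the parent pulls the mapping from its child; the wrapper
and its arguments are hypothetical. The end state for Task A is the
same as in the steps above.)

	/* Task A */
	char *p = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
		       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	memset(p, 0x5a, n * PAGE_SIZE);		/* 1. create and populate */

	pid_t child = fork();		/* 2. Task B maps the same pages,
					 *    write-protected for CoW */
	if (child > 0) {
		/* 3. pull B's copy: even if B's PTEs are unmapped in the
		 * process, Task A now maps the same anon pages at both
		 * p and q. */
		char *q = process_vm_mmap(child, (unsigned long)p,
					  n * PAGE_SIZE, 0, 0);
	}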

One possible way out would be to force CoW on all pages in the mapping,
before passing the mapping to the new process.

Thanks,
Kirill.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-24 11:52     ` Kirill A. Shutemov
@ 2019-05-24 14:00       ` Kirill Tkhai
  2019-05-27 23:30         ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-24 14:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 24.05.2019 14:52, Kirill A. Shutemov wrote:
> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>> This patchset adds a new syscall, which makes possible
>>>> to clone a VMA from a process to current process.
>>>> The syscall supplements the functionality provided
>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>> and it may be useful in many situation.
>>>
>>> Kirill, could you explain how the change affects rmap and how it is safe.
>>>
>>> My concern is that the patchset allows to map the same page multiple times
>>> within one process or even map page allocated by child to the parrent.
>>>
>>> It was not allowed before.
>>>
>>> In the best case it makes reasoning about rmap substantially more difficult.
>>>
>>> But I'm worry it will introduce hard-to-debug bugs, like described in
>>> https://lwn.net/Articles/383162/.
>>
>> Andy suggested to unmap PTEs from source page table, and this make the single
>> page never be mapped in the same process twice. This is OK for my use case,
>> and here we will just do a small step "allow to inherit VMA by a child process",
>> which we didn't have before this. If someone still needs to continue the work
>> to allow the same page be mapped twice in a single process in the future, this
>> person will have a supported basis we do in this small step. I believe, someone
>> like debugger may want to have this to make a fast snapshot of a process private
>> memory (when the task is stopped for a small time to get its memory). But for
>> me remapping is enough at the moment.
>>
>> What do you think about this?
> 
> I don't think that unmapping alone will do. Consider the following
> scenario:
> 
> 1. Task A creates and populates the mapping.
> 2. Task A forks. We have now Task B mapping the same pages, but
> write-protected.
> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
> 
> After this Task A will have the same anon pages mapped twice.

Ah, sure.

> One possible way out would be to force CoW on all pages in the mapping,
> before passing the mapping to the new process.

This would pull all the swapped pages back in, which is exactly what
the patchset aims to prevent.

Hm, what about allowing remapping only of a VMA whose
anon_vma::rb_root contains a single chain and whose
vma->anon_vma_chain contains a single entry? That is a VMA which has
been faulted, but whose mm has never been duplicated (or whose forks
have already died).
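
(In kernel style, the proposed condition could look roughly like the
helper below; the name is hypothetical, and the rb-tree test assumes
the rb_root_cached layout of anon_vma::rb_root.)

	/* Hypothetical helper implementing the condition above. */
	static bool pvm_vma_exclusive(struct vm_area_struct *vma)
	{
		struct anon_vma *av = vma->anon_vma;
		struct rb_node *n;

		if (!av)
			return false;	/* not faulted yet */
		/* The VMA references a single anon_vma... */
		if (!list_is_singular(&vma->anon_vma_chain))
			return false;
		/* ...and that anon_vma covers a single VMA: the mm was
		 * never duplicated, or all forks have already died. */
		n = av->rb_root.rb_root.rb_node;
		return n && !n->rb_left && !n->rb_right;
	}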

Thanks,
Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-24 14:00       ` Kirill Tkhai
@ 2019-05-27 23:30         ` Kirill A. Shutemov
  2019-05-28  9:15           ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2019-05-27 23:30 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
> > On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
> >> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> >>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >>>> This patchset adds a new syscall, which makes possible
> >>>> to clone a VMA from a process to current process.
> >>>> The syscall supplements the functionality provided
> >>>> by process_vm_writev() and process_vm_readv() syscalls,
> >>>> and it may be useful in many situation.
> >>>
> >>> Kirill, could you explain how the change affects rmap and how it is safe.
> >>>
> >>> My concern is that the patchset allows to map the same page multiple times
> >>> within one process or even map page allocated by child to the parrent.
> >>>
> >>> It was not allowed before.
> >>>
> >>> In the best case it makes reasoning about rmap substantially more difficult.
> >>>
> >>> But I'm worry it will introduce hard-to-debug bugs, like described in
> >>> https://lwn.net/Articles/383162/.
> >>
> >> Andy suggested to unmap PTEs from source page table, and this make the single
> >> page never be mapped in the same process twice. This is OK for my use case,
> >> and here we will just do a small step "allow to inherit VMA by a child process",
> >> which we didn't have before this. If someone still needs to continue the work
> >> to allow the same page be mapped twice in a single process in the future, this
> >> person will have a supported basis we do in this small step. I believe, someone
> >> like debugger may want to have this to make a fast snapshot of a process private
> >> memory (when the task is stopped for a small time to get its memory). But for
> >> me remapping is enough at the moment.
> >>
> >> What do you think about this?
> > 
> > I don't think that unmapping alone will do. Consider the following
> > scenario:
> > 
> > 1. Task A creates and populates the mapping.
> > 2. Task A forks. We have now Task B mapping the same pages, but
> > write-protected.
> > 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
> > 
> > After this Task A will have the same anon pages mapped twice.
> 
> Ah, sure.
> 
> > One possible way out would be to force CoW on all pages in the mapping,
> > before passing the mapping to the new process.
> 
> This will pop all swapped pages up, which is the thing the patchset aims
> to prevent.
> 
> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain
> only chain and which vma->anon_vma_chain contains single entry? This is
> a vma, which were faulted, but its mm never were duplicated (or which
> forks already died).

The requirement for the VMA to be faulted (to have any pages mapped)
looks excessive to me, but the general idea may work.

One issue I see is that userspace may not have full control over
creating such a VMA. vma_merge() can merge the VMA with the next one
without any consent from userspace, and you'll get an anon_vma
inherited from the VMA you've just merged with.

I don't have a good idea of how to get around this.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-27 23:30         ` Kirill A. Shutemov
@ 2019-05-28  9:15           ` Kirill Tkhai
  2019-05-28 16:15             ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-28  9:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 28.05.2019 02:30, Kirill A. Shutemov wrote:
> On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
>> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
>>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
>>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>>>> This patchset adds a new syscall, which makes possible
>>>>>> to clone a VMA from a process to current process.
>>>>>> The syscall supplements the functionality provided
>>>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>>>> and it may be useful in many situation.
>>>>>
>>>>> Kirill, could you explain how the change affects rmap and how it is safe.
>>>>>
>>>>> My concern is that the patchset allows to map the same page multiple times
>>>>> within one process or even map page allocated by child to the parrent.
>>>>>
>>>>> It was not allowed before.
>>>>>
>>>>> In the best case it makes reasoning about rmap substantially more difficult.
>>>>>
>>>>> But I'm worry it will introduce hard-to-debug bugs, like described in
>>>>> https://lwn.net/Articles/383162/.
>>>>
>>>> Andy suggested to unmap PTEs from source page table, and this make the single
>>>> page never be mapped in the same process twice. This is OK for my use case,
>>>> and here we will just do a small step "allow to inherit VMA by a child process",
>>>> which we didn't have before this. If someone still needs to continue the work
>>>> to allow the same page be mapped twice in a single process in the future, this
>>>> person will have a supported basis we do in this small step. I believe, someone
>>>> like debugger may want to have this to make a fast snapshot of a process private
>>>> memory (when the task is stopped for a small time to get its memory). But for
>>>> me remapping is enough at the moment.
>>>>
>>>> What do you think about this?
>>>
>>> I don't think that unmapping alone will do. Consider the following
>>> scenario:
>>>
>>> 1. Task A creates and populates the mapping.
>>> 2. Task A forks. We have now Task B mapping the same pages, but
>>> write-protected.
>>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
>>>
>>> After this Task A will have the same anon pages mapped twice.
>>
>> Ah, sure.
>>
>>> One possible way out would be to force CoW on all pages in the mapping,
>>> before passing the mapping to the new process.
>>
>> This will pop all swapped pages up, which is the thing the patchset aims
>> to prevent.
>>
>> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain
>> only chain and which vma->anon_vma_chain contains single entry? This is
>> a vma, which were faulted, but its mm never were duplicated (or which
>> forks already died).
> 
> The requirement for the VMA to be faulted (have any pages mapped) looks
> excessive to me, but the general idea may work.
> 
> One issue I see is that userspace may not have full control to create such
> VMA. vma_merge() can merge the VMA to the next one without any consent
> from userspace and you'll get anon_vma inherited from the VMA you've
> justed merged with.
> 
> I don't have any valid idea on how to get around this.

Technically it is possible by creating boundary 1-page VMAs with a
different protection, one just above and one just below the desired
region, and then creating the desired mapping between them (see the
sketch below). But this is not convenient.
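
(The boundary-VMA trick from userspace, roughly: the PROT_NONE guard
pages have different vm_flags from the middle mapping, so vma_merge()
cannot merge it with unrelated neighbors.)

	long page = sysconf(_SC_PAGESIZE);
	/* Reserve the region plus two guard pages in one go. */
	char *base = mmap(NULL, len + 2 * page, PROT_NONE,
			  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	/* Carve the usable mapping out of the middle; the PROT_NONE
	 * guards bound it on both sides. */
	char *buf = mmap(base + page, len, PROT_READ|PROT_WRITE,
			 MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);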

I don't think it's difficult to find a natural limitation which
prevents mapping a single page twice, if we want to avoid that at
least at the start. Another suggestion:

prohibit mapping a remote process's VMA only in the case where its
vm_area_struct::anon_vma::root is the same as the root of one of the
local process's VMAs.

What about this?

Thanks,
Kirill


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-28  9:15           ` Kirill Tkhai
@ 2019-05-28 16:15             ` Kirill A. Shutemov
  2019-05-29 14:33               ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2019-05-28 16:15 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On Tue, May 28, 2019 at 12:15:16PM +0300, Kirill Tkhai wrote:
> On 28.05.2019 02:30, Kirill A. Shutemov wrote:
> > On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
> >> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
> >>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
> >>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> >>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >>>>>> This patchset adds a new syscall, which makes possible
> >>>>>> to clone a VMA from a process to current process.
> >>>>>> The syscall supplements the functionality provided
> >>>>>> by process_vm_writev() and process_vm_readv() syscalls,
> >>>>>> and it may be useful in many situation.
> >>>>>
> >>>>> Kirill, could you explain how the change affects rmap and how it is safe.
> >>>>>
> >>>>> My concern is that the patchset allows to map the same page multiple times
> >>>>> within one process or even map page allocated by child to the parrent.
> >>>>>
> >>>>> It was not allowed before.
> >>>>>
> >>>>> In the best case it makes reasoning about rmap substantially more difficult.
> >>>>>
> >>>>> But I'm worry it will introduce hard-to-debug bugs, like described in
> >>>>> https://lwn.net/Articles/383162/.
> >>>>
> >>>> Andy suggested unmapping the PTEs from the source page table, and this makes
> >>>> a single page never be mapped twice in the same process. This is OK for my
> >>>> use case, and here we just take a small step, "allow a VMA to be inherited
> >>>> by a child process", which we didn't have before. If someone later needs to
> >>>> continue the work and allow the same page to be mapped twice in a single
> >>>> process, that person will have the supported basis we create in this small
> >>>> step. I believe something like a debugger may want this, to take a fast
> >>>> snapshot of a process's private memory (while the task is stopped for a
> >>>> short time to read its memory). But for me, remapping is enough at the moment.
> >>>>
> >>>> What do you think about this?
> >>>
> >>> I don't think that unmapping alone will do. Consider the following
> >>> scenario:
> >>>
> >>> 1. Task A creates and populates the mapping.
> >>> 2. Task A forks. We now have Task B mapping the same pages, but
> >>> write-protected.
> >>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
> >>>
> >>> After this Task A will have the same anon pages mapped twice.
> >>
> >> Ah, sure.
> >>
> >>> One possible way out would be to force CoW on all pages in the mapping,
> >>> before passing the mapping to the new process.
> >>
> >> This will bring all the swapped pages back in, which is exactly what the
> >> patchset aims to prevent.
> >>
> >> Hm, what about allowing remapping only of a VMA whose anon_vma::rb_root
> >> contains only a single chain and whose vma->anon_vma_chain contains a
> >> single entry? That is a VMA which has been faulted, but whose mm has
> >> never been duplicated (or whose forks have already died).
> > 
> > The requirement for the VMA to be faulted (have any pages mapped) looks
> > excessive to me, but the general idea may work.
> > 
> > One issue I see is that userspace may not have full control to create such
> > a VMA. vma_merge() can merge the VMA into the next one without any consent
> > from userspace, and you'll get an anon_vma inherited from the VMA you've
> > just merged with.
> > 
> > I don't have any valid idea on how to get around this.
> 
> Technically it is possible by creating boundary 1-page VMAs with a different
> protection, one above and one below the desired region, and then mapping the
> desired region between them. But this is not convenient.
> 
> I don't think it's difficult to find a natural limitation that prevents mapping
> a single page twice, if we want to avoid this at least at the start. Another suggestion:
> 
> prohibit mapping a remote process's VMA only when its vm_area_struct::anon_vma::root
> is the same as the root of one of the local process's VMAs.
> 
> What about this?

I don't see anything immediately wrong with this, but it's still going to
produce puzzling errors for a user. How would you document such a limitation
in a way that makes sense to a userspace developer?

-- 
 Kirill A. Shutemov



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-28 16:15             ` Kirill A. Shutemov
@ 2019-05-29 14:33               ` Kirill Tkhai
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-05-29 14:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 28.05.2019 19:15, Kirill A. Shutemov wrote:
> On Tue, May 28, 2019 at 12:15:16PM +0300, Kirill Tkhai wrote:
>> On 28.05.2019 02:30, Kirill A. Shutemov wrote:
>>> On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
>>>> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
>>>>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
>>>>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>>>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>>>>>> This patchset adds a new syscall, which makes it possible
>>>>>>>> to clone a VMA from another process into the current process.
>>>>>>>> The syscall supplements the functionality provided
>>>>>>>> by the process_vm_writev() and process_vm_readv() syscalls,
>>>>>>>> and it may be useful in many situations.
>>>>>>>
>>>>>>> Kirill, could you explain how the change affects rmap and why it is safe.
>>>>>>>
>>>>>>> My concern is that the patchset allows mapping the same page multiple times
>>>>>>> within one process, or even mapping a page allocated by a child into the parent.
>>>>>>>
>>>>>>> It was not allowed before.
>>>>>>>
>>>>>>> In the best case it makes reasoning about rmap substantially more difficult.
>>>>>>>
>>>>>>> But I worry it will introduce hard-to-debug bugs, like the ones described in
>>>>>>> https://lwn.net/Articles/383162/.
>>>>>>
>>>>>> Andy suggested unmapping the PTEs from the source page table, and this makes
>>>>>> a single page never be mapped twice in the same process. This is OK for my
>>>>>> use case, and here we just take a small step, "allow a VMA to be inherited
>>>>>> by a child process", which we didn't have before. If someone later needs to
>>>>>> continue the work and allow the same page to be mapped twice in a single
>>>>>> process, that person will have the supported basis we create in this small
>>>>>> step. I believe something like a debugger may want this, to take a fast
>>>>>> snapshot of a process's private memory (while the task is stopped for a
>>>>>> short time to read its memory). But for me, remapping is enough at the moment.
>>>>>>
>>>>>> What do you think about this?
>>>>>
>>>>> I don't think that unmapping alone will do. Consider the following
>>>>> scenario:
>>>>>
>>>>> 1. Task A creates and populates the mapping.
>>>>> 2. Task A forks. We now have Task B mapping the same pages, but
>>>>> write-protected.
>>>>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
>>>>>
>>>>> After this Task A will have the same anon pages mapped twice.
>>>>
>>>> Ah, sure.
>>>>
>>>>> One possible way out would be to force CoW on all pages in the mapping,
>>>>> before passing the mapping to the new process.
>>>>
>>>> This will bring all the swapped pages back in, which is exactly what the
>>>> patchset aims to prevent.
>>>>
>>>> Hm, what about allowing remapping only of a VMA whose anon_vma::rb_root
>>>> contains only a single chain and whose vma->anon_vma_chain contains a
>>>> single entry? That is a VMA which has been faulted, but whose mm has
>>>> never been duplicated (or whose forks have already died).
>>>
>>> The requirement for the VMA to be faulted (have any pages mapped) looks
>>> excessive to me, but the general idea may work.
>>>
>>> One issue I see is that userspace may not have full control to create such
>>> a VMA. vma_merge() can merge the VMA into the next one without any consent
>>> from userspace, and you'll get an anon_vma inherited from the VMA you've
>>> just merged with.
>>>
>>> I don't have any valid idea on how to get around this.
>>
>> Technically it is possible by creating boundary 1-page VMAs with a different
>> protection, one above and one below the desired region, and then mapping the
>> desired region between them. But this is not convenient.
>>
>> I don't think it's difficult to find a natural limitation that prevents mapping
>> a single page twice, if we want to avoid this at least at the start. Another suggestion:
>>
>> prohibit mapping a remote process's VMA only when its vm_area_struct::anon_vma::root
>> is the same as the root of one of the local process's VMAs.
>>
>> What about this?
> 
> I don't see anything immediately wrong with this, but it's still going to
> produce puzzling errors for a user. How would you document such a limitation
> in a way that makes sense to a userspace developer?

It's difficult, since the limitation is artificial.

I can only suggest a stricter limitation.

Something like: "A VMA may be remapped only as a whole region,
and only if there was no fork() after the VMA appeared in the
process (whether by mmap() or by remapping from another remote
process). If the VMA was merged with a neighbouring VMA, the
same rules apply to the neighbours."

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..0bcd6f598e73 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -287,13 +287,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#define VM_MAY_REMOTE_REMAP	VM_HIGH_ARCH_5
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
diff --git a/kernel/fork.c b/kernel/fork.c
index ff4efd16fd82..a3c758c8cd54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -584,8 +584,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
 			retval = copy_page_range(mm, oldmm, mpnt);
+			mpnt->vm_flags &= ~VM_MAY_REMOTE_REMAP;
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
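
For illustration, a hypothetical consumer of this flag (my sketch, not part
of the diff above): under such a rule, process_vm_mmap() would accept only a
whole VMA that still carries VM_MAY_REMOTE_REMAP:

	/* Eligible only if the range covers the whole VMA and the VMA has
	 * not been through fork() since it appeared in this mm. */
	static bool may_remote_remap(struct vm_area_struct *vma,
				     unsigned long start, unsigned long end)
	{
		if (start != vma->vm_start || end != vma->vm_end)
			return false;
		return !!(vma->vm_flags & VM_MAY_REMOTE_REMAP);
	}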



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-05-22 15:22 ` Kirill A. Shutemov
  2019-05-23 16:11   ` Kirill Tkhai
  2019-05-24 10:45   ` Kirill Tkhai
@ 2019-06-03 14:38   ` Kirill Tkhai
  2019-06-03 14:56     ` Kirill Tkhai
  2 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-06-03 14:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new syscall, which makes it possible
>> to clone a VMA from another process into the current process.
>> The syscall supplements the functionality provided
>> by the process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situations.
> 
> Kirill, could you explain how the change affects rmap and why it is safe.
> 
> My concern is that the patchset allows mapping the same page multiple times
> within one process, or even mapping a page allocated by a child into the parent.

Speaking honestly, we already support this model, since ZERO_PAGE() may
be mapped multiple times in any number of mappings.
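
For illustration, a small userspace demo of this point (the program is mine,
not from the patchset; reading PFNs from /proc/self/pagemap requires
CAP_SYS_ADMIN on recent kernels, otherwise they read as 0):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* pagemap: one 64-bit entry per page, PFN in bits 0-54 */
	static uint64_t pfn_of(int fd, void *addr)
	{
		uint64_t entry = 0;
		long psz = sysconf(_SC_PAGESIZE);

		pread(fd, &entry, sizeof(entry),
		      ((uintptr_t)addr / psz) * sizeof(entry));
		return entry & ((1ULL << 55) - 1);
	}

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *a = mmap(NULL, psz, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *b = mmap(NULL, psz, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* read faults on private anon memory map ZERO_PAGE() */
		volatile char touch = a[0] + b[0];
		int fd = open("/proc/self/pagemap", O_RDONLY);

		printf("same physical page: %d\n",
		       pfn_of(fd, a) == pfn_of(fd, b));
		close(fd);
		return touch;
	}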

Kirill



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-06-03 14:38   ` Kirill Tkhai
@ 2019-06-03 14:56     ` Kirill Tkhai
  2019-06-03 17:47       ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2019-06-03 14:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 03.06.2019 17:38, Kirill Tkhai wrote:
> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>> This patchset adds a new syscall, which makes it possible
>>> to clone a VMA from another process into the current process.
>>> The syscall supplements the functionality provided
>>> by the process_vm_writev() and process_vm_readv() syscalls,
>>> and it may be useful in many situations.
>>
>> Kirill, could you explain how the change affects rmap and why it is safe.
>>
>> My concern is that the patchset allows mapping the same page multiple times
>> within one process, or even mapping a page allocated by a child into the parent.
> 
> Speaking honestly, we already support this model, since ZERO_PAGE() may
> be mapped multiple times in any number of mappings.

Faulting in the huge_zero_page and then mremapping its VMA to an unaligned
address also gives a case where the same huge page is mapped both as a huge
page and as a set of ordinary pages in the same process.

Summing up the two cases above, is there really a fundamental problem with
the functionality the patchset introduces? It looks like these cases are
already supported in stable kernels.

Thanks,
Kirill



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-06-03 14:56     ` Kirill Tkhai
@ 2019-06-03 17:47       ` Kirill A. Shutemov
  2019-06-04  9:32         ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2019-06-03 17:47 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote:
> On 03.06.2019 17:38, Kirill Tkhai wrote:
> > On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> >> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >>> This patchset adds a new syscall, which makes it possible
> >>> to clone a VMA from another process into the current process.
> >>> The syscall supplements the functionality provided
> >>> by the process_vm_writev() and process_vm_readv() syscalls,
> >>> and it may be useful in many situations.
> >>
> >> Kirill, could you explain how the change affects rmap and why it is safe.
> >>
> >> My concern is that the patchset allows mapping the same page multiple times
> >> within one process, or even mapping a page allocated by a child into the parent.
> > 
> > Speaking honestly, we already support this model, since ZERO_PAGE() may
> > be mapped multiple times in any number of mappings.
> 
> Faulting in the huge_zero_page and then mremapping its VMA to an unaligned
> address also gives a case where the same huge page is mapped both as a huge
> page and as a set of ordinary pages in the same process.
> 
> Summing up the two cases above, is there really a fundamental problem with
> the functionality the patchset introduces? It looks like these cases are
> already supported in stable kernels.

It *might* work. But it requires a lot of auditing to prove that it actually
*does* work.

For instance, are you sure it will not break KSM? What does it mean for
memory accounting? memcg?

My point is that you are breaking a long-standing invariant in Linux MM, and
it has to be properly justified.

I would expect to see some strange deadlocks or permanent trylock failures
as a result of such a change.

-- 
 Kirill A. Shutemov



* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
  2019-06-03 17:47       ` Kirill A. Shutemov
@ 2019-06-04  9:32         ` Kirill Tkhai
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2019-06-04  9:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm

On 03.06.2019 20:47, Kirill A. Shutemov wrote:
> On Mon, Jun 03, 2019 at 05:56:32PM +0300, Kirill Tkhai wrote:
>> On 03.06.2019 17:38, Kirill Tkhai wrote:
>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>>> This patchset adds a new syscall, which makes it possible
>>>>> to clone a VMA from another process into the current process.
>>>>> The syscall supplements the functionality provided
>>>>> by the process_vm_writev() and process_vm_readv() syscalls,
>>>>> and it may be useful in many situations.
>>>>
>>>> Kirill, could you explain how the change affects rmap and why it is safe.
>>>>
>>>> My concern is that the patchset allows mapping the same page multiple times
>>>> within one process, or even mapping a page allocated by a child into the parent.
>>>
>>> Speaking honestly, we already support this model, since ZERO_PAGE() may
>>> be mapped multiple times in any number of mappings.
>>
>> Faulting in the huge_zero_page and then mremapping its VMA to an unaligned
>> address also gives a case where the same huge page is mapped both as a huge
>> page and as a set of ordinary pages in the same process.
>>
>> Summing up the two cases above, is there really a fundamental problem with
>> the functionality the patchset introduces? It looks like these cases are
>> already supported in stable kernels.
> 
> It *might* work. But it requires a lot of auditing to prove that it actually
> *does* work.

Please describe what form the audit results should take for you. If I hadn't
done an audit before preparing the patchset, I wouldn't have sent it. So,
please give an idea of what you expect here.

> For instance, are you sure it will not break KSM?

Yes, it does not break KSM. The main point is that with KSM we may already
have not just a page mapped twice in a single process, but even a page
mapped twice in a single VMA. And this is just a particular case of the
generally supported set. (An ordinary page still can't be mapped twice in a
single VMA, since the pgoff difference won't allow merging two such chunks.)

The generic rule of KSM is "anything may happen to a page in real time,
and all of it will be reflected in the stable and unstable trees and in
rmap_items some time later". Pages of a duplicated VMA will be interpreted
as a KSM fork, and the corresponding checks in unstable_tree_search_insert()
and stable_tree_search() provide this.

When both the source and destination VMAs are mergeable:
1)if a page was added to the stable tree before the duplication of the
  related VMA, then while scanning the destination VMA in
  cmp_and_merge_page() it will be detected as a duplicate, and we will
  just add the related rmap_item to the stable node chain;
2)if a page was added to the unstable tree before the duplication of the
  related VMA, and it is still there, then the page will be detected as a
  duplicate in the destination VMA, and the scan of the page will be
  skipped until the next turn;
3)if a page was not added to any tree before the duplication, it may be
  added to one of the trees later and will be handled by one of the two
  rules above.

When one of the source or destination VMAs is not mergeable, while a page
becomes PageKsm() during the scan of the other one, the unmergeable VMA ends
up referring to a PageKsm() page, which does not have an rmap_item. But it
is still possible to unmap that page from the unmergeable VMA, since
rmap_walk_ksm() iterates over all the anon_vmas under the rb_root. This is
just the same as what happens when a process forks and its child makes the
VMA unmergeable.
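
For reference, a tiny userspace illustration of the KSM point above (my
example, assuming KSM is enabled via /sys/kernel/mm/ksm/run and that ksmd
gets enough time with the default scan settings):

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(p, 0x5a, 2 * psz);	/* two pages, identical content */
		madvise(p, 2 * psz, MADV_MERGEABLE);
		sleep(60);			/* let ksmd scan and merge */
		/* Both virtual pages now typically map one write-protected
		 * KSM page: the same page mapped twice in a single VMA.
		 * See /sys/kernel/mm/ksm/pages_sharing. */
		return 0;
	}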

> What does it mean for memory accounting? memcg?

Once assigned, the memcg remains the same after VMA duplication. The mapped
page range advances the counters in vm_stat_account(). Since we keep fork()
semantics, the same thing occurs as after fork()+mremap().

> My point is that you are breaking a long-standing invariant in Linux MM, and
> it has to be properly justified.

I'm not against that. Please say which form of justification you expect.
I assume you do not mean a retelling of every line of the existing code:
that would take 10 times more words than the code itself, which is simply
not humanly possible.

Please give a specific request for what you expect and how it should look.

> I would expect to see some strange deadlocks or permanent trylock failures
> as a result of such a change.

Are you hinting at some specific area? Do you expect me to run some specific
test cases? Do you want us to add some debugging engine on top of page
locking to detect such trylock failures?

Thanks,
Kirill



Thread overview: 36+ messages
2019-05-20 14:00 [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping Kirill Tkhai
2019-05-20 14:00 ` [PATCH v2 1/7] mm: Add process_vm_mmap() syscall declaration Kirill Tkhai
2019-05-21  0:28   ` Ira Weiny
2019-05-21  8:29     ` Kirill Tkhai
2019-05-20 14:00 ` [PATCH v2 2/7] mm: Extend copy_vma() Kirill Tkhai
2019-05-21  8:18   ` Kirill A. Shutemov
2019-05-21  8:48     ` Kirill Tkhai
2019-05-20 14:00 ` [PATCH v2 3/7] mm: Extend copy_page_range() Kirill Tkhai
2019-05-20 14:00 ` [PATCH v2 4/7] mm: Export round_hint_to_min() Kirill Tkhai
2019-05-20 14:00 ` [PATCH v2 5/7] mm: Introduce may_mmap_overlapped_region() helper Kirill Tkhai
2019-05-20 14:00 ` [PATCH v2 6/7] mm: Introduce find_vma_filter_flags() helper Kirill Tkhai
2019-05-20 14:00 ` [PATCH v2 7/7] mm: Add process_vm_mmap() Kirill Tkhai
2019-05-21 14:43 ` [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping Andy Lutomirski
2019-05-21 15:52   ` Kirill Tkhai
2019-05-21 15:59     ` Kirill Tkhai
2019-05-21 16:20     ` Jann Horn
2019-05-21 17:03       ` Kirill Tkhai
2019-05-21 17:28         ` Jann Horn
2019-05-22 10:03           ` Kirill Tkhai
2019-05-21 16:43     ` Andy Lutomirski
2019-05-21 17:44       ` Kirill Tkhai
2019-05-23 16:19         ` Andy Lutomirski
2019-05-24 10:36           ` Kirill Tkhai
2019-05-22 15:22 ` Kirill A. Shutemov
2019-05-23 16:11   ` Kirill Tkhai
2019-05-24 10:45   ` Kirill Tkhai
2019-05-24 11:52     ` Kirill A. Shutemov
2019-05-24 14:00       ` Kirill Tkhai
2019-05-27 23:30         ` Kirill A. Shutemov
2019-05-28  9:15           ` Kirill Tkhai
2019-05-28 16:15             ` Kirill A. Shutemov
2019-05-29 14:33               ` Kirill Tkhai
2019-06-03 14:38   ` Kirill Tkhai
2019-06-03 14:56     ` Kirill Tkhai
2019-06-03 17:47       ` Kirill A. Shutemov
2019-06-04  9:32         ` Kirill Tkhai
