* [RFC PATCH 00/13] Introduce first class virtual address spaces
@ 2017-03-13 22:14 Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region' Till Smejkal
                   ` (14 more replies)
  0 siblings, 15 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

First class virtual address spaces (also called VAS) are a new feature of
the Linux kernel that allows address spaces to exist independently of
processes. The general idea behind this feature is described in the
ASPLOS'16 paper 'SpaceJMP: Programming with Multiple Virtual Address
Spaces' [1].

This patchset extends the kernel memory management subsystem with a new
type of address space (called VAS) which can be created and destroyed
independently of processes by a user of the system. During its lifetime
such a VAS can be attached to processes, which allows a process to have
multiple address spaces and thereby multiple, potentially different,
views of the system's main memory. The threads belonging to the process
can switch freely between the attached VAS and the process' original AS
during execution, enabling them to utilize the different available views
of memory. Having multiple virtual address spaces per process and being
able to switch between them freely can be used in multiple interesting
ways, as also outlined in the mentioned paper. Some of the many possible
applications are, for example, compartmentalizing a process for security
reasons, improving the performance of data-centric applications, and
introducing new application models [1].
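
To illustrate the intended usage from user space, here is a minimal
sketch. The wrapper names vas_create(), vas_attach() and vas_switch()
are placeholders for the new system calls introduced later in this
series; the actual names, prototypes and flag values are defined in the
patches themselves and may differ from this sketch.

/* Hypothetical user-space sketch -- the VAS syscall wrappers below are
 * placeholders, not the authoritative API introduced by this series. */
#include <stdio.h>
#include <stdlib.h>

extern int vas_create(const char *name, int mode); /* returns a VAS id */
extern int vas_attach(int pid, int vid, int type); /* attach a VAS to a process */
extern int vas_switch(int vid);                    /* switch the calling thread */

int main(void)
{
	int vid = vas_create("scratch", 0600);

	if (vid < 0) {
		perror("vas_create");
		return EXIT_FAILURE;
	}

	/* Make the VAS available to the calling process (pid 0 is taken
	 * to mean "current process" in this sketch). */
	if (vas_attach(0, vid, 0) < 0 || vas_switch(vid) < 0) {
		perror("vas_attach/vas_switch");
		return EXIT_FAILURE;
	}

	/* The calling thread now runs in the attached VAS while other
	 * threads of the process keep using the original AS. */

	/* Switching to id 0 is assumed to return to the original AS. */
	vas_switch(0);
	return EXIT_SUCCESS;
}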

In addition to the concept of first class virtual address spaces, this
patchset introduces a second feature called VAS segments. VAS segments
are memory regions that have a fixed size and position in the virtual
address space and can be shared between multiple first class virtual
address spaces. Such shareable memory regions are especially useful for
in-memory pointer-based data structures and other purely in-memory data.
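
The main benefit of the fixed position is that raw pointers stored
inside a VAS segment stay valid in every address space that attaches the
segment. A small illustration (plain C, independent of the actual VAS
segment API; the base address is an arbitrary example value):

#include <stdint.h>

/* Example only: assume the segment is attached at this fixed virtual
 * address in every VAS that maps it. */
#define SEG_BASE ((uintptr_t)0x700000000000ULL)

struct node {
	int value;
	struct node *next;	/* raw pointer into the same segment */
};

/* Because the segment occupies the same address range in every VAS, a
 * list built in one address space can be walked unchanged in another
 * one -- no pointer swizzling or base-offset arithmetic is needed. */
static int sum_list(void)
{
	const struct node *n = (const struct node *)SEG_BASE;
	int sum = 0;

	while (n) {
		sum += n->value;
		n = n->next;
	}
	return sum;
}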

First class virtual address spaces have a significant advantage over
forking a process and using inter-process communication mechanisms,
namely that creating and switching between VAS is significantly faster
than creating and switching between processes. As can be seen in the
following table, measured on an Intel Xeon E5620 CPU at 2.40 GHz,
creating a VAS is about 7 times faster than forking and switching
between VAS is up to 4 times faster than switching between processes.

            |     VAS     |  processes  |
    -------------------------------------
    switch  |       468ns |      1944ns |
    create  |     20003ns |    150491ns |

Hence, first class virtual address spaces provide a fast mechanism for
applications to utilize multiple virtual address spaces in parallel,
with higher performance than splitting the application into multiple
independent processes.

Both VAS and VAS segments have another significant advantage when
combined with non-volatile memory. Because their life cycle is
independent of processes and other kernel data structures, they can be
used to save special memory regions or even whole AS in non-volatile
memory, making it possible to reuse them across multiple system reboots.

At the current state of development, first class virtual address spaces
have one limitation that we have not been able to solve so far. The
feature allows different threads of the same process to execute in
different AS at the same time. This is possible because the VAS-switch
operation only changes the active mm_struct of the calling thread's
task_struct. However, when a thread switches into a first class virtual
address space, some parts of its original AS are duplicated into the new
one so that the thread can continue its execution in its current state.
Accordingly, parts of the process's AS (e.g. the code section, data
section, heap section and stack sections) exist in multiple AS if the
process has a VAS attached to it. Changes to these shared memory regions
are synchronized between the address spaces whenever a thread switches
between two of them. Unfortunately, in some scenarios the kernel is not
able to properly synchronize all of these shared memory regions because
of conflicting changes. One such example occurs with two threads, one
executing in an attached first class virtual address space and the other
in the task's original address space. If both threads make changes to
the heap section that cause the underlying vm_area_struct to expand, the
kernel cannot correctly synchronize these changes, because doing so
would overwrite parts of the virtual address space with unrelated data.
In the current implementation such conflicts are only detected but not
resolved, and result in an error code being returned by the kernel
during the VAS-switch operation. Unfortunately, that means the thread
that attempted the switch can never switch again in the future and
accordingly has to be killed.
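
For illustration, the conflicting scenario described above roughly
corresponds to the following sketch (hypothetical user-space code;
vas_switch() is the assumed wrapper from the earlier sketch and the VAS
id is a placeholder):

#include <pthread.h>
#include <stdlib.h>

extern int vas_switch(int vid);	/* assumed wrapper, see above */

/* Both threads allocate and touch enough memory to force the heap
 * vm_area_struct to grow. */
static void *grow_heap(void *arg)
{
	(void)arg;
	for (int i = 0; i < 256; i++) {
		char *p = malloc(1 << 20);

		if (p)
			p[0] = 1;
	}
	return NULL;
}

static void *grow_heap_in_vas(void *arg)
{
	/* This thread runs in an attached VAS ... */
	vas_switch(*(int *)arg);
	grow_heap(NULL);
	/* ... and the switch back may now fail with an error, because
	 * the heap grew differently in the VAS and in the original AS
	 * and the two copies can no longer be synchronized. */
	vas_switch(0);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	int vid = 1;	/* id of an already attached VAS (example) */

	pthread_create(&a, NULL, grow_heap_in_vas, &vid);
	pthread_create(&b, NULL, grow_heap, NULL);	/* original AS */
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}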

This code was developed during an internship at Hewlett Packard Enterprise.

[1] http://impact.crhc.illinois.edu/shared/Papers/ASPLOS16-SpaceJMP.pdf

Till Smejkal (13):
  mm: Add mm_struct argument to 'mmap_region'
  mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff'
  mm: Rename 'unmap_region' and add mm_struct argument
  mm: Add mm_struct argument to 'get_unmapped_area' and
    'vm_unmapped_area'
  mm: Add mm_struct argument to 'mm_populate' and '__mm_populate'
  mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem
  kernel/fork: Split and export 'mm_alloc' and 'mm_init'
  kernel/fork: Define explicitly which mm_struct to duplicate during
    fork
  mm/memory: Add function to one-to-one duplicate page ranges
  mm: Introduce first class virtual address spaces
  mm/vas: Introduce VAS segments - shareable address space regions
  mm/vas: Add lazy-attach support for first class virtual address spaces
  fs/proc: Add procfs support for first class virtual address spaces

 MAINTAINERS                                  |   10 +
 arch/alpha/kernel/osf_sys.c                  |   19 +-
 arch/arc/mm/mmap.c                           |    8 +-
 arch/arm/kernel/process.c                    |    2 +-
 arch/arm/mach-rpc/ecard.c                    |    2 +-
 arch/arm/mm/mmap.c                           |   19 +-
 arch/arm64/kernel/vdso.c                     |    2 +-
 arch/blackfin/include/asm/pgtable.h          |    3 +-
 arch/blackfin/kernel/sys_bfin.c              |    5 +-
 arch/frv/mm/elf-fdpic.c                      |   11 +-
 arch/hexagon/kernel/vdso.c                   |    2 +-
 arch/ia64/kernel/perfmon.c                   |    3 +-
 arch/ia64/kernel/sys_ia64.c                  |    6 +-
 arch/ia64/mm/hugetlbpage.c                   |    7 +-
 arch/metag/mm/hugetlbpage.c                  |   11 +-
 arch/mips/kernel/vdso.c                      |    4 +-
 arch/mips/mm/mmap.c                          |   27 +-
 arch/parisc/kernel/sys_parisc.c              |   19 +-
 arch/parisc/mm/hugetlbpage.c                 |    7 +-
 arch/powerpc/include/asm/book3s/64/hugetlb.h |    6 +-
 arch/powerpc/include/asm/page_64.h           |    3 +-
 arch/powerpc/kernel/vdso.c                   |    2 +-
 arch/powerpc/mm/hugetlbpage-radix.c          |    9 +-
 arch/powerpc/mm/hugetlbpage.c                |    9 +-
 arch/powerpc/mm/mmap.c                       |   17 +-
 arch/powerpc/mm/slice.c                      |   25 +-
 arch/s390/kernel/vdso.c                      |    3 +-
 arch/s390/mm/mmap.c                          |   42 +-
 arch/sh/kernel/vsyscall/vsyscall.c           |    2 +-
 arch/sh/mm/mmap.c                            |   19 +-
 arch/sparc/include/asm/pgtable_64.h          |    4 +-
 arch/sparc/kernel/sys_sparc_32.c             |    6 +-
 arch/sparc/kernel/sys_sparc_64.c             |   31 +-
 arch/sparc/mm/hugetlbpage.c                  |   26 +-
 arch/tile/kernel/vdso.c                      |    2 +-
 arch/tile/mm/elf.c                           |    2 +-
 arch/tile/mm/hugetlbpage.c                   |   26 +-
 arch/x86/entry/syscalls/syscall_32.tbl       |   16 +
 arch/x86/entry/syscalls/syscall_64.tbl       |   16 +
 arch/x86/entry/vdso/vma.c                    |    2 +-
 arch/x86/kernel/sys_x86_64.c                 |   19 +-
 arch/x86/mm/hugetlbpage.c                    |   26 +-
 arch/x86/mm/mpx.c                            |    6 +-
 arch/xtensa/kernel/syscall.c                 |    7 +-
 drivers/char/mem.c                           |   15 +-
 drivers/dax/dax.c                            |   10 +-
 drivers/media/usb/uvc/uvc_v4l2.c             |    6 +-
 drivers/media/v4l2-core/v4l2-dev.c           |    8 +-
 drivers/media/v4l2-core/videobuf2-v4l2.c     |    5 +-
 drivers/mtd/mtdchar.c                        |    3 +-
 drivers/usb/gadget/function/uvc_v4l2.c       |    3 +-
 fs/aio.c                                     |    4 +-
 fs/exec.c                                    |    5 +-
 fs/hugetlbfs/inode.c                         |    8 +-
 fs/proc/base.c                               |  528 ++++
 fs/proc/inode.c                              |   11 +-
 fs/proc/internal.h                           |    1 +
 fs/ramfs/file-mmu.c                          |    5 +-
 fs/ramfs/file-nommu.c                        |   10 +-
 fs/romfs/mmap-nommu.c                        |    3 +-
 include/linux/fs.h                           |    2 +-
 include/linux/huge_mm.h                      |   12 +-
 include/linux/hugetlb.h                      |   10 +-
 include/linux/mm.h                           |   53 +-
 include/linux/mm_types.h                     |   16 +-
 include/linux/sched.h                        |   34 +-
 include/linux/shmem_fs.h                     |    5 +-
 include/linux/syscalls.h                     |   21 +
 include/linux/vas.h                          |  322 +++
 include/linux/vas_types.h                    |  173 ++
 include/media/v4l2-dev.h                     |    3 +-
 include/media/videobuf2-v4l2.h               |    5 +-
 include/uapi/asm-generic/unistd.h            |   34 +-
 include/uapi/linux/Kbuild                    |    1 +
 include/uapi/linux/vas.h                     |   28 +
 init/main.c                                  |    2 +
 ipc/shm.c                                    |   22 +-
 kernel/events/uprobes.c                      |    2 +-
 kernel/exit.c                                |    2 +
 kernel/fork.c                                |   99 +-
 kernel/sys_ni.c                              |   18 +
 mm/Kconfig                                   |   47 +
 mm/Makefile                                  |    1 +
 mm/gup.c                                     |    4 +-
 mm/huge_memory.c                             |   83 +-
 mm/hugetlb.c                                 |  205 +-
 mm/internal.h                                |   19 +
 mm/memory.c                                  |  469 +++-
 mm/mlock.c                                   |   21 +-
 mm/mmap.c                                    |  124 +-
 mm/mremap.c                                  |   13 +-
 mm/nommu.c                                   |   17 +-
 mm/shmem.c                                   |   14 +-
 mm/util.c                                    |    4 +-
 mm/vas.c                                     | 3466 ++++++++++++++++++++++++++
 sound/core/pcm_native.c                      |    3 +-
 96 files changed, 5927 insertions(+), 545 deletions(-)
 create mode 100644 include/linux/vas.h
 create mode 100644 include/linux/vas_types.h
 create mode 100644 include/uapi/linux/vas.h
 create mode 100644 mm/vas.c

-- 
2.12.0


* [RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region'
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 02/13] mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff' Till Smejkal
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Add the mm_struct that 'mmap_region' should operate on as an additional
argument to the function. Previously, the function simply used the
memory map of the current task. However, with the introduction of first
class virtual address spaces, mmap_region also needs to be able to
operate on memory maps other than the current task's. By passing the
memory map as an argument we can now define explicitly which one to use.

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 arch/mips/kernel/vdso.c |  2 +-
 arch/tile/mm/elf.c      |  2 +-
 include/linux/mm.h      |  5 +++--
 mm/mmap.c               | 10 +++++-----
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index f9dbfb14af33..9631b42908f3 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -108,7 +108,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		return -EINTR;
 
 	/* Map delay slot emulation page */
-	base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
+	base = mmap_region(mm, NULL, STACK_TOP, PAGE_SIZE,
 			   VM_READ|VM_WRITE|VM_EXEC|
 			   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
 			   0);
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 6225cc998db1..a22768059b7a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -141,7 +141,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 	 */
 	if (!retval) {
 		unsigned long addr = MEM_USER_INTRPT;
-		addr = mmap_region(NULL, addr, INTRPT_SIZE,
+		addr = mmap_region(mm, NULL, addr, INTRPT_SIZE,
 				   VM_READ|VM_EXEC|
 				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0);
 		if (addr > (unsigned long) -PAGE_SIZE)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b0f64c..fa483d2ff3eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2016,8 +2016,9 @@ extern int install_special_mapping(struct mm_struct *mm,
 
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
-extern unsigned long mmap_region(struct file *file, unsigned long addr,
-	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+				 unsigned long addr, unsigned long len,
+				 vm_flags_t vm_flags, unsigned long pgoff);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
diff --git a/mm/mmap.c b/mm/mmap.c
index dc4291dcc99b..5ac276ac9807 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1447,7 +1447,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff);
+	addr = mmap_region(mm, file, addr, len, vm_flags, pgoff);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1582,10 +1582,10 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
-unsigned long mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, vm_flags_t vm_flags,
+		unsigned long pgoff)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
 	int error;
 	struct rb_node **rb_link, *rb_parent;
@@ -1704,7 +1704,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm)))
+					vma == get_gate_vma(mm)))
 			mm->locked_vm += (len >> PAGE_SHIFT);
 		else
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-- 
2.12.0


* [RFC PATCH 02/13] mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff'
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region' Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 03/13] mm: Rename 'unmap_region' and add mm_struct argument Till Smejkal
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Add the mm_struct that 'do_mmap' and 'do_mmap_pgoff' should operate on
as an additional argument to both functions. Previously, both functions
simply used the memory map of the current task. However, with the
introduction of first class virtual address spaces, these functions also
need to be usable for memory maps other than just the current process's.
Hence, the memory map to use is now defined explicitly at the call site.

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 arch/x86/mm/mpx.c  |  4 ++--
 fs/aio.c           |  4 ++--
 include/linux/mm.h | 11 ++++++-----
 ipc/shm.c          |  3 ++-
 mm/mmap.c          | 16 ++++++++--------
 mm/nommu.c         |  7 ++++---
 mm/util.c          |  2 +-
 7 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index af59f808742f..99c664a97c35 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -50,8 +50,8 @@ static unsigned long mpx_mmap(unsigned long len)
 		return -EINVAL;
 
 	down_write(&mm->mmap_sem);
-	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-			MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
+	addr = do_mmap(mm, NULL, 0, len, PROT_READ | PROT_WRITE,
+		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
 	up_write(&mm->mmap_sem);
 	if (populate)
 		mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 873b4ca82ccb..df9bba5a2aff 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -510,8 +510,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 		return -EINTR;
 	}
 
-	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
-				       PROT_READ | PROT_WRITE,
+	ctx->mmap_base = do_mmap_pgoff(current->mm, ctx->aio_ring_file, 0,
+				       ctx->mmap_size, PROT_READ | PROT_WRITE,
 				       MAP_SHARED, 0, &unused);
 	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa483d2ff3eb..fb11be77545f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2019,17 +2019,18 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 				 unsigned long addr, unsigned long len,
 				 vm_flags_t vm_flags, unsigned long pgoff);
-extern unsigned long do_mmap(struct file *file, unsigned long addr,
-	unsigned long len, unsigned long prot, unsigned long flags,
-	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
+extern unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+	unsigned long addr, unsigned long len, unsigned long prot,
+	unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
+	unsigned long *populate);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
-do_mmap_pgoff(struct file *file, unsigned long addr,
+do_mmap_pgoff(struct mm_struct *mm, struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate)
 {
-	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
+	return do_mmap(mm, file, addr, len, prot, flags, 0, pgoff, populate);
 }
 
 #ifdef CONFIG_MMU
diff --git a/ipc/shm.c b/ipc/shm.c
index 81203e8ba013..64c21fb32ca9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1222,7 +1222,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
 			goto invalid;
 	}
 
-	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate);
+	addr = do_mmap_pgoff(mm, file, addr, size, prot, flags, 0,
+			     &populate);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ac276ac9807..70028bf7b58d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1299,14 +1299,14 @@ static inline int mlock_future_check(struct mm_struct *mm,
 }
 
 /*
- * The caller must hold down_write(&current->mm->mmap_sem).
+ * The caller must hold down_write(&mm->mmap_sem).
  */
-unsigned long do_mmap(struct file *file, unsigned long addr,
-			unsigned long len, unsigned long prot,
-			unsigned long flags, vm_flags_t vm_flags,
-			unsigned long pgoff, unsigned long *populate)
+unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+		      unsigned long addr, unsigned long len,
+		      unsigned long prot, unsigned long flags,
+		      vm_flags_t vm_flags, unsigned long pgoff,
+		      unsigned long *populate)
 {
-	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
 	*populate = 0;
@@ -2779,8 +2779,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	}
 
 	file = get_file(vma->vm_file);
-	ret = do_mmap_pgoff(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate);
+	ret = do_mmap_pgoff(mm, vma->vm_file, start, size,
+			    prot, flags, pgoff, &populate);
 	fput(file);
 out:
 	up_write(&mm->mmap_sem);
diff --git a/mm/nommu.c b/mm/nommu.c
index 24f9f5f39145..54825d29f50b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1198,7 +1198,8 @@ static int do_mmap_private(struct vm_area_struct *vma,
 /*
  * handle mapping creation for uClinux
  */
-unsigned long do_mmap(struct file *file,
+unsigned long do_mmap(struct mm_struct *mm,
+			struct file *file,
 			unsigned long addr,
 			unsigned long len,
 			unsigned long prot,
@@ -1375,10 +1376,10 @@ unsigned long do_mmap(struct file *file,
 	/* okay... we have a mapping; now we have to register it */
 	result = vma->vm_start;
 
-	current->mm->total_vm += len >> PAGE_SHIFT;
+	mm->total_vm += len >> PAGE_SHIFT;
 
 share:
-	add_vma_to_mm(current->mm, vma);
+	add_vma_to_mm(mm, vma);
 
 	/* we flush the region from the icache only when the first executable
 	 * mapping of it is made  */
diff --git a/mm/util.c b/mm/util.c
index 3cb2164f4099..46d05eef9a6b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -302,7 +302,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
-		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
+		ret = do_mmap_pgoff(mm, file, addr, len, prot, flag, pgoff,
 				    &populate);
 		up_write(&mm->mmap_sem);
 		if (populate)
-- 
2.12.0


* [RFC PATCH 03/13] mm: Rename 'unmap_region' and add mm_struct argument
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region' Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 02/13] mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff' Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 04/13] mm: Add mm_struct argument to 'get_unmapped_area' and 'vm_unmapped_area' Till Smejkal
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Rename the 'unmap_region' function to 'munmap_region' so that it follows
the same naming pattern as the do_mmap <-> mmap_region pair. In
addition, make the new 'munmap_region' function available to all other
kernel sources.

Also add the mm_struct that the function should operate on as an
additional argument. Previously, the function simply used the memory map
of the current task. However, with the introduction of first class
virtual address spaces, munmap_region also needs to be able to operate
on memory maps other than just the current task's. Accordingly, the new
argument allows callers to define explicitly which memory map should be
used.

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 include/linux/mm.h |  4 ++++
 mm/mmap.c          | 14 +++++---------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb11be77545f..71a90604d21f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2023,6 +2023,10 @@ extern unsigned long do_mmap(struct mm_struct *mm, struct file *file,
 	unsigned long addr, unsigned long len, unsigned long prot,
 	unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
 	unsigned long *populate);
+
+extern void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+			  struct vm_area_struct *prev, unsigned long start,
+			  unsigned long end);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
diff --git a/mm/mmap.c b/mm/mmap.c
index 70028bf7b58d..ea79bc4da5b7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -70,10 +70,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
 static bool ignore_rlimit_data;
 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
-static void unmap_region(struct mm_struct *mm,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long start, unsigned long end);
-
 /* description of effects of mapping type and prot in current implementation.
  * this is due to the limited x86 page protection hardware.  The expected
  * behavior is in parens:
@@ -1731,7 +1727,7 @@ unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 	fput(file);
 
 	/* Undo any partial mapping done by a device driver. */
-	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+	munmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 	if (vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
@@ -2447,9 +2443,9 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long start, unsigned long end)
+void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+		struct vm_area_struct *prev, unsigned long start,
+		unsigned long end)
 {
 	struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
 	struct mmu_gather tlb;
@@ -2654,7 +2650,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 	 * Remove the vma's, and unmap the actual pages
 	 */
 	detach_vmas_to_be_unmapped(mm, vma, prev, end);
-	unmap_region(mm, vma, prev, start, end);
+	munmap_region(mm, vma, prev, start, end);
 
 	arch_unmap(mm, vma, start, end);
 
-- 
2.12.0


* [RFC PATCH 04/13] mm: Add mm_struct argument to 'get_unmapped_area' and 'vm_unmapped_area'
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (2 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 03/13] mm: Rename 'unmap_region' and add mm_struct argument Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 05/13] mm: Add mm_struct argument to 'mm_populate' and '__mm_populate' Till Smejkal
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Add the mm_struct in which an unmapped area should be found as an
explicit argument to the 'get_unmapped_area' function. Previously, the
function simply searched for an unmapped area in the memory map of the
current task. However, with the introduction of first class virtual
address spaces, get_unmapped_area must also be able to look for unmapped
areas in memory maps other than the current task's.

Changing the signature of the generic 'get_unmapped_area' function also
requires that all the 'arch_get_unmapped_area' functions, as well as the
'vm_unmapped_area' function and its dependents, take the memory map they
should work on as an additional argument. Simply using the current
task's memory map, as these functions did before, is no longer correct
and leads to incorrect results.

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 arch/alpha/kernel/osf_sys.c                  | 19 ++++++------
 arch/arc/mm/mmap.c                           |  8 ++---
 arch/arm/kernel/process.c                    |  2 +-
 arch/arm/mm/mmap.c                           | 19 ++++++------
 arch/arm64/kernel/vdso.c                     |  2 +-
 arch/blackfin/include/asm/pgtable.h          |  3 +-
 arch/blackfin/kernel/sys_bfin.c              |  5 ++--
 arch/frv/mm/elf-fdpic.c                      | 11 +++----
 arch/hexagon/kernel/vdso.c                   |  2 +-
 arch/ia64/kernel/perfmon.c                   |  3 +-
 arch/ia64/kernel/sys_ia64.c                  |  6 ++--
 arch/ia64/mm/hugetlbpage.c                   |  7 +++--
 arch/metag/mm/hugetlbpage.c                  | 11 +++----
 arch/mips/kernel/vdso.c                      |  2 +-
 arch/mips/mm/mmap.c                          | 27 +++++++++--------
 arch/parisc/kernel/sys_parisc.c              | 19 ++++++------
 arch/parisc/mm/hugetlbpage.c                 |  7 +++--
 arch/powerpc/include/asm/book3s/64/hugetlb.h |  6 ++--
 arch/powerpc/include/asm/page_64.h           |  3 +-
 arch/powerpc/kernel/vdso.c                   |  2 +-
 arch/powerpc/mm/hugetlbpage-radix.c          |  9 +++---
 arch/powerpc/mm/hugetlbpage.c                |  9 +++---
 arch/powerpc/mm/mmap.c                       | 17 +++++------
 arch/powerpc/mm/slice.c                      | 25 ++++++++--------
 arch/s390/kernel/vdso.c                      |  3 +-
 arch/s390/mm/mmap.c                          | 42 +++++++++++++-------------
 arch/sh/kernel/vsyscall/vsyscall.c           |  2 +-
 arch/sh/mm/mmap.c                            | 19 ++++++------
 arch/sparc/include/asm/pgtable_64.h          |  4 +--
 arch/sparc/kernel/sys_sparc_32.c             |  6 ++--
 arch/sparc/kernel/sys_sparc_64.c             | 31 +++++++++++---------
 arch/sparc/mm/hugetlbpage.c                  | 26 ++++++++--------
 arch/tile/kernel/vdso.c                      |  2 +-
 arch/tile/mm/hugetlbpage.c                   | 26 ++++++++--------
 arch/x86/entry/vdso/vma.c                    |  2 +-
 arch/x86/kernel/sys_x86_64.c                 | 19 ++++++------
 arch/x86/mm/hugetlbpage.c                    | 26 ++++++++--------
 arch/xtensa/kernel/syscall.c                 |  7 +++--
 drivers/char/mem.c                           | 15 ++++++----
 drivers/dax/dax.c                            | 10 +++----
 drivers/media/usb/uvc/uvc_v4l2.c             |  6 ++--
 drivers/media/v4l2-core/v4l2-dev.c           |  8 ++---
 drivers/media/v4l2-core/videobuf2-v4l2.c     |  5 ++--
 drivers/mtd/mtdchar.c                        |  3 +-
 drivers/usb/gadget/function/uvc_v4l2.c       |  3 +-
 fs/hugetlbfs/inode.c                         |  8 ++---
 fs/proc/inode.c                              | 10 +++----
 fs/ramfs/file-mmu.c                          |  5 ++--
 fs/ramfs/file-nommu.c                        | 10 ++++---
 fs/romfs/mmap-nommu.c                        |  3 +-
 include/linux/fs.h                           |  2 +-
 include/linux/huge_mm.h                      |  6 ++--
 include/linux/hugetlb.h                      |  5 ++--
 include/linux/mm.h                           | 16 ++++++----
 include/linux/mm_types.h                     |  7 +++--
 include/linux/sched.h                        | 10 +++----
 include/linux/shmem_fs.h                     |  5 ++--
 include/media/v4l2-dev.h                     |  3 +-
 include/media/videobuf2-v4l2.h               |  5 ++--
 ipc/shm.c                                    | 10 +++----
 kernel/events/uprobes.c                      |  2 +-
 mm/huge_memory.c                             | 18 +++++++-----
 mm/mmap.c                                    | 44 ++++++++++++++--------------
 mm/mremap.c                                  | 11 +++----
 mm/nommu.c                                   | 10 ++++---
 mm/shmem.c                                   | 14 ++++-----
 sound/core/pcm_native.c                      |  3 +-
 67 files changed, 370 insertions(+), 326 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 54d8616644e2..281109bcdc5d 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -1308,8 +1308,8 @@ SYSCALL_DEFINE1(old_adjtimex, struct timex32 __user *, txc_p)
    generic version except that we know how to honor ADDR_LIMIT_32BIT.  */
 
 static unsigned long
-arch_get_unmapped_area_1(unsigned long addr, unsigned long len,
-		         unsigned long limit)
+arch_get_unmapped_area_1(struct mm_struct *mm, unsigned long addr,
+			 unsigned long len, unsigned long limit)
 {
 	struct vm_unmapped_area_info info;
 
@@ -1319,13 +1319,13 @@ arch_get_unmapped_area_1(unsigned long addr, unsigned long len,
 	info.high_limit = limit;
 	info.align_mask = 0;
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		       unsigned long len, unsigned long pgoff,
-		       unsigned long flags)
+arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		       unsigned long addr, unsigned long len,
+		       unsigned long pgoff, unsigned long flags)
 {
 	unsigned long limit;
 
@@ -1352,19 +1352,20 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	   this feature should be incorporated into all ports?  */
 
 	if (addr) {
-		addr = arch_get_unmapped_area_1 (PAGE_ALIGN(addr), len, limit);
+		addr = arch_get_unmapped_area_1 (mm, PAGE_ALIGN(addr), len,
+						 limit);
 		if (addr != (unsigned long) -ENOMEM)
 			return addr;
 	}
 
 	/* Next, try allocating at TASK_UNMAPPED_BASE.  */
-	addr = arch_get_unmapped_area_1 (PAGE_ALIGN(TASK_UNMAPPED_BASE),
+	addr = arch_get_unmapped_area_1 (mm, PAGE_ALIGN(TASK_UNMAPPED_BASE),
 					 len, limit);
 	if (addr != (unsigned long) -ENOMEM)
 		return addr;
 
 	/* Finally, try allocating in low memory.  */
-	addr = arch_get_unmapped_area_1 (PAGE_SIZE, len, limit);
+	addr = arch_get_unmapped_area_1 (mm, PAGE_SIZE, len, limit);
 
 	return addr;
 }
diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c
index 2e06d56e987b..2fa95d302b02 100644
--- a/arch/arc/mm/mmap.c
+++ b/arch/arc/mm/mmap.c
@@ -28,10 +28,10 @@
  * SHMLBA bytes.
  */
 unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int do_align = 0;
 	int aliasing = cache_is_vipt_aliasing();
@@ -74,5 +74,5 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.high_limit = TASK_SIZE;
 	info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 91d2d5b01414..4c6f9595fdbe 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -426,7 +426,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
 	hint = sigpage_addr(mm, npages);
-	addr = get_unmapped_area(NULL, hint, npages << PAGE_SHIFT, 0, 0);
+	addr = get_unmapped_area(mm, NULL, hint, npages << PAGE_SHIFT, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
 		ret = addr;
 		goto up_fail;
diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index 66353caa35b9..48b49bfdd1d7 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -52,10 +52,10 @@ static unsigned long mmap_base(unsigned long rnd)
  * in the VIVT case, we optimise out the alignment rules.
  */
 unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int do_align = 0;
 	int aliasing = cache_is_vipt_aliasing();
@@ -99,16 +99,15 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.high_limit = TASK_SIZE;
 	info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
-			const unsigned long len, const unsigned long pgoff,
-			const unsigned long flags)
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			const unsigned long addr0, const unsigned long len,
+			const unsigned long pgoff, const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	int do_align = 0;
 	int aliasing = cache_is_vipt_aliasing();
@@ -150,7 +149,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.high_limit = mm->mmap_base;
 	info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -163,7 +162,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = mm->mmap_base;
 		info.high_limit = TASK_SIZE;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index a2c2478e7d78..403e71456297 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -166,7 +166,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
-	vdso_base = get_unmapped_area(NULL, 0, vdso_mapping_len, 0, 0);
+	vdso_base = get_unmapped_area(mm, NULL, 0, vdso_mapping_len, 0, 0);
 	if (IS_ERR_VALUE(vdso_base)) {
 		ret = ERR_PTR(vdso_base);
 		goto up_fail;
diff --git a/arch/blackfin/include/asm/pgtable.h b/arch/blackfin/include/asm/pgtable.h
index c1ee3d6533fb..fe2c5d41d62a 100644
--- a/arch/blackfin/include/asm/pgtable.h
+++ b/arch/blackfin/include/asm/pgtable.h
@@ -92,7 +92,8 @@ extern char empty_zero_page[];
 #define	VMALLOC_END	0xffffffff
 
 /* provide a special get_unmapped_area for framebuffer mmaps of nommu */
-extern unsigned long get_fb_unmapped_area(struct file *filp, unsigned long,
+extern unsigned long get_fb_unmapped_area(struct mm_struct *mm,
+					  struct file *filp, unsigned long,
 					  unsigned long, unsigned long,
 					  unsigned long);
 #define HAVE_ARCH_FB_UNMAPPED_AREA
diff --git a/arch/blackfin/kernel/sys_bfin.c b/arch/blackfin/kernel/sys_bfin.c
index d998383cb956..65acfd2bc8d1 100644
--- a/arch/blackfin/kernel/sys_bfin.c
+++ b/arch/blackfin/kernel/sys_bfin.c
@@ -42,8 +42,9 @@ asmlinkage void *sys_dma_memcpy(void *dest, const void *src, size_t len)
 #if defined(CONFIG_FB) || defined(CONFIG_FB_MODULE)
 #include <linux/fb.h>
 #include <linux/export.h>
-unsigned long get_fb_unmapped_area(struct file *filp, unsigned long orig_addr,
-	unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long get_fb_unmapped_area(struct mm_struct *mm, struct file *filp,
+	unsigned long orig_addr, unsigned long len, unsigned long pgoff,
+	unsigned long flags)
 {
 	struct fb_info *info = filp->private_data;
 	return (unsigned long)info->screen_base;
diff --git a/arch/frv/mm/elf-fdpic.c b/arch/frv/mm/elf-fdpic.c
index 836f14707a62..c7158d8883e2 100644
--- a/arch/frv/mm/elf-fdpic.c
+++ b/arch/frv/mm/elf-fdpic.c
@@ -56,7 +56,8 @@ void elf_fdpic_arch_lay_out_mm(struct elf_fdpic_params *exec_params,
  * place non-fixed mmaps firstly in the bottom part of memory, working up, and then in the top part
  * of memory, working down
  */
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len,
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+				     unsigned long addr, unsigned long len,
 				     unsigned long pgoff, unsigned long flags)
 {
 	struct vm_area_struct *vma;
@@ -72,7 +73,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
 	/* only honour a hint if we're not going to clobber something doing so */
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
-		vma = find_vma(current->mm, addr);
+		vma = find_vma(mm, addr);
 		if (TASK_SIZE - len >= addr &&
 		    (!vma || addr + len <= vma->vm_start))
 			goto success;
@@ -82,10 +83,10 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = (current->mm->start_stack - 0x00200000);
+	info.high_limit = (mm->start_stack - 0x00200000);
 	info.align_mask = 0;
 	info.align_offset = 0;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 	if (!(addr & ~PAGE_MASK))
 		goto success;
 	VM_BUG_ON(addr != -ENOMEM);
@@ -93,7 +94,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
 	/* search from just above the WorkRAM area to the top of memory */
 	info.low_limit = PAGE_ALIGN(0x80000000);
 	info.high_limit = TASK_SIZE;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 	if (!(addr & ~PAGE_MASK))
 		goto success;
 	VM_BUG_ON(addr != -ENOMEM);
diff --git a/arch/hexagon/kernel/vdso.c b/arch/hexagon/kernel/vdso.c
index 3ea968415539..1ec46d4dd0e6 100644
--- a/arch/hexagon/kernel/vdso.c
+++ b/arch/hexagon/kernel/vdso.c
@@ -71,7 +71,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	/* Try to get it loaded right near ld.so/glibc. */
 	vdso_base = STACK_TOP;
 
-	vdso_base = get_unmapped_area(NULL, vdso_base, PAGE_SIZE, 0, 0);
+	vdso_base = get_unmapped_area(mm, NULL, vdso_base, PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(vdso_base)) {
 		ret = vdso_base;
 		goto up_fail;
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 677a86826771..8a5cb50e698f 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2308,7 +2308,8 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 	down_write(&task->mm->mmap_sem);
 
 	/* find some free area in address space, must have mmap sem held */
-	vma->vm_start = get_unmapped_area(NULL, 0, size, 0, MAP_PRIVATE|MAP_ANONYMOUS);
+	vma->vm_start = get_unmapped_area(mm, NULL, 0, size, 0,
+					  MAP_PRIVATE|MAP_ANONYMOUS);
 	if (IS_ERR_VALUE(vma->vm_start)) {
 		DPRINT(("Cannot find unmapped area for size %ld\n", size));
 		up_write(&task->mm->mmap_sem);
diff --git a/arch/ia64/kernel/sys_ia64.c b/arch/ia64/kernel/sys_ia64.c
index a09c12230bc5..2f33e1c9d002 100644
--- a/arch/ia64/kernel/sys_ia64.c
+++ b/arch/ia64/kernel/sys_ia64.c
@@ -21,12 +21,12 @@
 #include <linux/uaccess.h>
 
 unsigned long
-arch_get_unmapped_area (struct file *filp, unsigned long addr, unsigned long len,
+arch_get_unmapped_area (struct mm_struct *mm, struct file *filp,
+			unsigned long addr, unsigned long len,
 			unsigned long pgoff, unsigned long flags)
 {
 	long map_shared = (flags & MAP_SHARED);
 	unsigned long align_mask = 0;
-	struct mm_struct *mm = current->mm;
 	struct vm_unmapped_area_info info;
 
 	if (len > RGN_MAP_LIMIT)
@@ -61,7 +61,7 @@ arch_get_unmapped_area (struct file *filp, unsigned long addr, unsigned long len
 	info.high_limit = TASK_SIZE;
 	info.align_mask = align_mask;
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 asmlinkage long
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 85de86d36fdf..fd9ff12b37f4 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -134,8 +134,9 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	free_pgd_range(tlb, addr, end, floor, ceiling);
 }
 
-unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
-		unsigned long pgoff, unsigned long flags)
+unsigned long hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	struct vm_unmapped_area_info info;
 
@@ -161,7 +162,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, u
 	info.high_limit = HPAGE_REGION_BASE + RGN_MAP_LIMIT;
 	info.align_mask = PAGE_MASK & (HPAGE_SIZE - 1);
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 static int __init hugetlb_setup_sz(char *str)
diff --git a/arch/metag/mm/hugetlbpage.c b/arch/metag/mm/hugetlbpage.c
index db1b7da91e4f..4bf5b3c1ae7e 100644
--- a/arch/metag/mm/hugetlbpage.c
+++ b/arch/metag/mm/hugetlbpage.c
@@ -179,7 +179,7 @@ hugetlb_get_unmapped_area_existing(unsigned long len)
 
 /* Do a full search to find an area without any nearby normal pages. */
 static unsigned long
-hugetlb_get_unmapped_area_new_pmd(unsigned long len)
+hugetlb_get_unmapped_area_new_pmd(struct mm_struct *mm, unsigned long len)
 {
 	struct vm_unmapped_area_info info;
 
@@ -189,12 +189,13 @@ hugetlb_get_unmapped_area_new_pmd(unsigned long len)
 	info.high_limit = TASK_SIZE;
 	info.align_mask = PAGE_MASK & HUGEPT_MASK;
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 unsigned long
-hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 
@@ -227,7 +228,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	 * Find an unmapped naturally aligned set of 4MB blocks that we can use
 	 * for huge pages.
 	 */
-	return hugetlb_get_unmapped_area_new_pmd(len);
+	return hugetlb_get_unmapped_area_new_pmd(mm, len);
 }
 
 #endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 9631b42908f3..05d87680a05a 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -129,7 +129,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vvar_size = gic_size + PAGE_SIZE;
 	size = vvar_size + image->size;
 
-	base = get_unmapped_area(NULL, 0, size, 0, 0);
+	base = get_unmapped_area(mm, NULL, 0, size, 0, 0);
 	if (IS_ERR_VALUE(base)) {
 		ret = base;
 		goto out;
diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index d08ea3ff0f53..86a8cb07a584 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -51,11 +51,11 @@ static unsigned long mmap_base(unsigned long rnd)
 
 enum mmap_allocation_direction {UP, DOWN};
 
-static unsigned long arch_get_unmapped_area_common(struct file *filp,
-	unsigned long addr0, unsigned long len, unsigned long pgoff,
-	unsigned long flags, enum mmap_allocation_direction dir)
+static unsigned long arch_get_unmapped_area_common(struct mm_struct *mm,
+	struct file *filp, unsigned long addr0, unsigned long len,
+	unsigned long pgoff, unsigned long flags,
+	enum mmap_allocation_direction dir)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long addr = addr0;
 	int do_color_align;
@@ -104,7 +104,7 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
 		info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 		info.low_limit = PAGE_SIZE;
 		info.high_limit = mm->mmap_base;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 
 		if (!(addr & ~PAGE_MASK))
 			return addr;
@@ -120,13 +120,14 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
 	info.flags = 0;
 	info.low_limit = mm->mmap_base;
 	info.high_limit = TASK_SIZE;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0,
-	unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+	unsigned long addr0, unsigned long len, unsigned long pgoff,
+	unsigned long flags)
 {
-	return arch_get_unmapped_area_common(filp,
+	return arch_get_unmapped_area_common(mm, filp,
 			addr0, len, pgoff, flags, UP);
 }
 
@@ -134,11 +135,11 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0,
  * There is no need to export this but sched.h declares the function as
  * extern so making it static here results in an error.
  */
-unsigned long arch_get_unmapped_area_topdown(struct file *filp,
-	unsigned long addr0, unsigned long len, unsigned long pgoff,
-	unsigned long flags)
+unsigned long arch_get_unmapped_area_topdown(struct mm_struct *mm,
+	struct file *filp, unsigned long addr0, unsigned long len,
+	unsigned long pgoff, unsigned long flags)
 {
-	return arch_get_unmapped_area_common(filp,
+	return arch_get_unmapped_area_common(mm, filp,
 			addr0, len, pgoff, flags, DOWN);
 }
 
diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c
index bf3294171230..68664e9a3317 100644
--- a/arch/parisc/kernel/sys_parisc.c
+++ b/arch/parisc/kernel/sys_parisc.c
@@ -84,10 +84,10 @@ static unsigned long mmap_upper_limit(void)
 }
 
 
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long task_size = TASK_SIZE;
 	int do_color_align, last_mmap;
@@ -127,7 +127,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.high_limit = mmap_upper_limit();
 	info.align_mask = last_mmap ? (PAGE_MASK & (SHM_COLOUR - 1)) : 0;
 	info.align_offset = shared_align_offset(last_mmap, pgoff);
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 found_addr:
 	if (do_color_align && !last_mmap && !(addr & ~PAGE_MASK))
@@ -137,12 +137,11 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 }
 
 unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
-			  const unsigned long len, const unsigned long pgoff,
-			  const unsigned long flags)
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			  const unsigned long addr0, const unsigned long len,
+			  const unsigned long pgoff, const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	int do_color_align, last_mmap;
 	struct vm_unmapped_area_info info;
@@ -187,7 +186,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.high_limit = mm->mmap_base;
 	info.align_mask = last_mmap ? (PAGE_MASK & (SHM_COLOUR - 1)) : 0;
 	info.align_offset = shared_align_offset(last_mmap, pgoff);
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 	if (!(addr & ~PAGE_MASK))
 		goto found_addr;
 	VM_BUG_ON(addr != -ENOMEM);
@@ -198,7 +197,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	 * can happen with large stack limits and large mmap()
 	 * allocations.
 	 */
-	return arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
+	return arch_get_unmapped_area(mm, filp, addr0, len, pgoff, flags);
 
 found_addr:
 	if (do_color_align && !last_mmap && !(addr & ~PAGE_MASK))
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index 5d6eea925cf4..a0f93122e142 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -21,8 +21,9 @@
 
 
 unsigned long
-hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
 
@@ -39,7 +40,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		addr = ALIGN(addr, huge_page_size(h));
 
 	/* we need to make sure the colouring is OK */
-	return arch_get_unmapped_area(file, addr, len, pgoff, flags);
+	return arch_get_unmapped_area(mm, file, addr, len, pgoff, flags);
 }
 
 
diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index c62f14d0bec1..5ff88a3d946e 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -8,9 +8,9 @@
 void radix__flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 void radix__local_flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 extern unsigned long
-radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-				unsigned long len, unsigned long pgoff,
-				unsigned long flags);
+radix__hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+				unsigned long addr, unsigned long len,
+				unsigned long pgoff, unsigned long flags);
 
 static inline int hstate_get_psize(struct hstate *hstate)
 {
diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h
index dd5f0712afa2..1ac2c5814d14 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -115,7 +115,8 @@ struct slice_mask {
 
 struct mm_struct;
 
-extern unsigned long slice_get_unmapped_area(unsigned long addr,
+extern unsigned long slice_get_unmapped_area(struct mm_struct *mm,
+					     unsigned long addr,
 					     unsigned long len,
 					     unsigned long flags,
 					     unsigned int psize,
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 4111d30badfa..3f7c4b334620 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -198,7 +198,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	 */
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
-	vdso_base = get_unmapped_area(NULL, vdso_base,
+	vdso_base = get_unmapped_area(mm, NULL, vdso_base,
 				      (vdso_pages << PAGE_SHIFT) +
 				      ((VDSO_ALIGNMENT - 1) & PAGE_MASK),
 				      0, 0);
diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c
index 35254a678456..86e99682033a 100644
--- a/arch/powerpc/mm/hugetlbpage-radix.c
+++ b/arch/powerpc/mm/hugetlbpage-radix.c
@@ -41,11 +41,10 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
  * ie, use topdown or not based on mmap_is_legacy check ?
  */
 unsigned long
-radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-				unsigned long len, unsigned long pgoff,
-				unsigned long flags)
+radix__hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+				unsigned long addr, unsigned long len,
+				unsigned long pgoff, unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
@@ -78,5 +77,5 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	info.high_limit = current->mm->mmap_base;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 8c3389cbcd12..817bf97e59cc 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -698,17 +698,18 @@ int gup_huge_pd(hugepd_t hugepd, unsigned long addr, unsigned pdshift,
 }
 
 #ifdef CONFIG_PPC_MM_SLICES
-unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-					unsigned long len, unsigned long pgoff,
+unsigned long hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+					unsigned long addr, unsigned long len,
+					unsigned long pgoff,
 					unsigned long flags)
 {
 	struct hstate *hstate = hstate_file(file);
 	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
 	if (radix_enabled())
-		return radix__hugetlb_get_unmapped_area(file, addr, len,
+		return radix__hugetlb_get_unmapped_area(mm, file, addr, len,
 						       pgoff, flags);
-	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
+	return slice_get_unmapped_area(mm, addr, len, flags, mmu_psize, 1);
 }
 #endif
 
diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index 2f1e44362198..b05ad3e33c40 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -88,11 +88,10 @@ static inline unsigned long mmap_base(unsigned long rnd)
  * HAVE_ARCH_UNMAPPED_AREA
  */
 static unsigned long
-radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
-			     unsigned long len, unsigned long pgoff,
-			     unsigned long flags)
+radix__arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+			     unsigned long addr, unsigned long len,
+			     unsigned long pgoff, unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 
@@ -115,18 +114,18 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.low_limit = mm->mmap_base;
 	info.high_limit = TASK_SIZE;
 	info.align_mask = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 static unsigned long
-radix__arch_get_unmapped_area_topdown(struct file *filp,
+radix__arch_get_unmapped_area_topdown(struct mm_struct *mm,
+				     struct file *filp,
 				     const unsigned long addr0,
 				     const unsigned long len,
 				     const unsigned long pgoff,
 				     const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
@@ -151,7 +150,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
 	info.high_limit = mm->mmap_base;
 	info.align_mask = 0;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -164,7 +163,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = TASK_SIZE;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 2b27458902ee..ebcb47c62d47 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -296,7 +296,7 @@ static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
 		}
 		info.high_limit = addr;
 
-		found = vm_unmapped_area(&info);
+		found = vm_unmapped_area(mm, &info);
 		if (!(found & ~PAGE_MASK))
 			return found;
 	}
@@ -339,7 +339,7 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
 		}
 		info.low_limit = addr;
 
-		found = vm_unmapped_area(&info);
+		found = vm_unmapped_area(mm, &info);
 		if (!(found & ~PAGE_MASK))
 			return found;
 	}
@@ -380,9 +380,9 @@ static unsigned long slice_find_area(struct mm_struct *mm, unsigned long len,
 #define MMU_PAGE_BASE	MMU_PAGE_4K
 #endif
 
-unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
-				      unsigned long flags, unsigned int psize,
-				      int topdown)
+unsigned long slice_get_unmapped_area(struct mm_struct *mm, unsigned long addr,
+				      unsigned long len, unsigned long flags,
+				      unsigned int psize, int topdown)
 {
 	struct slice_mask mask = {0, 0};
 	struct slice_mask good_mask;
@@ -390,7 +390,6 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 	struct slice_mask compat_mask = {0, 0};
 	int fixed = (flags & MAP_FIXED);
 	int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
-	struct mm_struct *mm = current->mm;
 	unsigned long newaddr;
 
 	/* Sanity checks */
@@ -544,24 +543,26 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 }
 EXPORT_SYMBOL_GPL(slice_get_unmapped_area);
 
-unsigned long arch_get_unmapped_area(struct file *filp,
+unsigned long arch_get_unmapped_area(struct mm_struct *mm,
+				     struct file *filp,
 				     unsigned long addr,
 				     unsigned long len,
 				     unsigned long pgoff,
 				     unsigned long flags)
 {
-	return slice_get_unmapped_area(addr, len, flags,
-				       current->mm->context.user_psize, 0);
+	return slice_get_unmapped_area(mm, addr, len, flags,
+				       mm->context.user_psize, 0);
 }
 
-unsigned long arch_get_unmapped_area_topdown(struct file *filp,
+unsigned long arch_get_unmapped_area_topdown(struct mm_struct *mm,
+					     struct file *filp,
 					     const unsigned long addr0,
 					     const unsigned long len,
 					     const unsigned long pgoff,
 					     const unsigned long flags)
 {
-	return slice_get_unmapped_area(addr0, len, flags,
-				       current->mm->context.user_psize, 1);
+	return slice_get_unmapped_area(mm, addr0, len, flags,
+				       mm->context.user_psize, 1);
 }
 
 unsigned int get_slice_psize(struct mm_struct *mm, unsigned long addr)
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 5904abf6b1ae..2e8d5d6d7806 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -218,7 +218,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	 */
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
-	vdso_base = get_unmapped_area(NULL, 0, vdso_pages << PAGE_SHIFT, 0, 0);
+	vdso_base = get_unmapped_area(mm, NULL, 0, vdso_pages << PAGE_SHIFT,
+				      0, 0);
 	if (IS_ERR_VALUE(vdso_base)) {
 		rc = vdso_base;
 		goto out_up;
diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
index eb9df2822da1..e9622a8bc740 100644
--- a/arch/s390/mm/mmap.c
+++ b/arch/s390/mm/mmap.c
@@ -81,10 +81,10 @@ static inline unsigned long mmap_base(unsigned long rnd)
 }
 
 unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 
@@ -111,16 +111,15 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	else
 		info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
-			  const unsigned long len, const unsigned long pgoff,
-			  const unsigned long flags)
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			  const unsigned long addr0, const unsigned long len,
+			  const unsigned long pgoff, const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
@@ -149,7 +148,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	else
 		info.align_mask = 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -162,7 +161,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = TASK_SIZE;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
@@ -180,14 +179,14 @@ int s390_mmap_check(unsigned long addr, unsigned long len, unsigned long flags)
 }
 
 static unsigned long
-s390_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+s390_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	unsigned long area;
 	int rc;
 
-	area = arch_get_unmapped_area(filp, addr, len, pgoff, flags);
+	area = arch_get_unmapped_area(mm, filp, addr, len, pgoff, flags);
 	if (!(area & ~PAGE_MASK))
 		return area;
 	if (area == -ENOMEM && !is_compat_task() && TASK_SIZE < TASK_MAX_SIZE) {
@@ -195,21 +194,22 @@ s390_get_unmapped_area(struct file *filp, unsigned long addr,
 		rc = crst_table_upgrade(mm);
 		if (rc)
 			return (unsigned long) rc;
-		area = arch_get_unmapped_area(filp, addr, len, pgoff, flags);
+		area = arch_get_unmapped_area(mm, filp, addr, len, pgoff,
+					      flags);
 	}
 	return area;
 }
 
 static unsigned long
-s390_get_unmapped_area_topdown(struct file *filp, const unsigned long addr,
-			  const unsigned long len, const unsigned long pgoff,
-			  const unsigned long flags)
+s390_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			  const unsigned long addr, const unsigned long len,
+			  const unsigned long pgoff, const unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	unsigned long area;
 	int rc;
 
-	area = arch_get_unmapped_area_topdown(filp, addr, len, pgoff, flags);
+	area = arch_get_unmapped_area_topdown(mm, filp, addr, len, pgoff,
+					      flags);
 	if (!(area & ~PAGE_MASK))
 		return area;
 	if (area == -ENOMEM && !is_compat_task() && TASK_SIZE < TASK_MAX_SIZE) {
@@ -217,7 +217,7 @@ s390_get_unmapped_area_topdown(struct file *filp, const unsigned long addr,
 		rc = crst_table_upgrade(mm);
 		if (rc)
 			return (unsigned long) rc;
-		area = arch_get_unmapped_area_topdown(filp, addr, len,
+		area = arch_get_unmapped_area_topdown(mm, filp, addr, len,
 						      pgoff, flags);
 	}
 	return area;
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index cc0cc5b4ff18..81f68365b511 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -67,7 +67,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
 
-	addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+	addr = get_unmapped_area(mm, NULL, 0, PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
 		ret = addr;
 		goto up_fail;
diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c
index 6777177807c2..ab873264c260 100644
--- a/arch/sh/mm/mmap.c
+++ b/arch/sh/mm/mmap.c
@@ -30,10 +30,10 @@ static inline unsigned long COLOUR_ALIGN(unsigned long addr,
 	return base + off;
 }
 
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
-	unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+	unsigned long addr, unsigned long len, unsigned long pgoff,
+	unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int do_colour_align;
 	struct vm_unmapped_area_info info;
@@ -73,16 +73,15 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.high_limit = TASK_SIZE;
 	info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
-			  const unsigned long len, const unsigned long pgoff,
-			  const unsigned long flags)
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			  const unsigned long addr0, const unsigned long len,
+			  const unsigned long pgoff, const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	int do_colour_align;
 	struct vm_unmapped_area_info info;
@@ -123,7 +122,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.high_limit = mm->mmap_base;
 	info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -136,7 +135,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = TASK_SIZE;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 314b66851348..e5e47115ba43 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1013,8 +1013,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 /* We provide a special get_unmapped_area for framebuffer mmaps to try and use
  * the largest alignment possible such that larget PTEs can be used.
  */
-unsigned long get_fb_unmapped_area(struct file *filp, unsigned long,
-				   unsigned long, unsigned long,
+unsigned long get_fb_unmapped_area(struct mm_struct *mm, struct file *filp,
+				   unsigned long, unsigned long, unsigned long,
 				   unsigned long);
 #define HAVE_ARCH_FB_UNMAPPED_AREA
 
diff --git a/arch/sparc/kernel/sys_sparc_32.c b/arch/sparc/kernel/sys_sparc_32.c
index fb7b185ee941..52319c194f96 100644
--- a/arch/sparc/kernel/sys_sparc_32.c
+++ b/arch/sparc/kernel/sys_sparc_32.c
@@ -36,7 +36,9 @@ asmlinkage unsigned long sys_getpagesize(void)
 	return PAGE_SIZE; /* Possibly older binaries want 8192 on sun4's? */
 }
 
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+				     unsigned long addr, unsigned long len,
+				     unsigned long pgoff, unsigned long flags)
 {
 	struct vm_unmapped_area_info info;
 
@@ -63,7 +65,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
 	info.align_mask = (flags & MAP_SHARED) ?
 		(PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 /*
diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c
index 884c70331345..3e77c04700ef 100644
--- a/arch/sparc/kernel/sys_sparc_64.c
+++ b/arch/sparc/kernel/sys_sparc_64.c
@@ -83,9 +83,10 @@ static inline unsigned long COLOR_ALIGN(unsigned long addr,
 	return base + off;
 }
 
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+				     unsigned long addr, unsigned long len,
+				     unsigned long pgoff, unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct * vma;
 	unsigned long task_size = TASK_SIZE;
 	int do_color_align;
@@ -128,25 +129,24 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
 	info.high_limit = min(task_size, VA_EXCLUDE_START);
 	info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	if ((addr & ~PAGE_MASK) && task_size > VA_EXCLUDE_END) {
 		VM_BUG_ON(addr != -ENOMEM);
 		info.low_limit = VA_EXCLUDE_END;
 		info.high_limit = task_size;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
 }
 
 unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
-			  const unsigned long len, const unsigned long pgoff,
-			  const unsigned long flags)
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			  const unsigned long addr0, const unsigned long len,
+			  const unsigned long pgoff, const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long task_size = STACK_TOP32;
 	unsigned long addr = addr0;
 	int do_color_align;
@@ -191,7 +191,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.high_limit = mm->mmap_base;
 	info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -204,20 +204,23 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = STACK_TOP32;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
 }
 
 /* Try to align mapping such that we align it as much as possible. */
-unsigned long get_fb_unmapped_area(struct file *filp, unsigned long orig_addr, unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long get_fb_unmapped_area(struct mm_struct *mm, struct file *filp,
+				   unsigned long orig_addr, unsigned long len,
+				   unsigned long pgoff, unsigned long flags)
 {
 	unsigned long align_goal, addr = -ENOMEM;
-	unsigned long (*get_area)(struct file *, unsigned long,
-				  unsigned long, unsigned long, unsigned long);
+	unsigned long (*get_area)(struct mm_struct *, struct file *,
+				  unsigned long, unsigned long, unsigned long,
+				  unsigned long);
 
-	get_area = current->mm->get_unmapped_area;
+	get_area = mm->get_unmapped_area;
 
 	if (flags & MAP_FIXED) {
 		/* Ok, don't mess with it. */
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 988acc8b1b80..92841d9ddd59 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -22,7 +22,8 @@
  * definition we don't have to worry about any page coloring stuff
  */
 
-static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *filp,
+static unsigned long hugetlb_get_unmapped_area_bottomup(struct mm_struct *mm,
+							struct file *filp,
 							unsigned long addr,
 							unsigned long len,
 							unsigned long pgoff,
@@ -40,25 +41,26 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *filp,
 	info.high_limit = min(task_size, VA_EXCLUDE_START);
 	info.align_mask = PAGE_MASK & ~HPAGE_MASK;
 	info.align_offset = 0;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	if ((addr & ~PAGE_MASK) && task_size > VA_EXCLUDE_END) {
 		VM_BUG_ON(addr != -ENOMEM);
 		info.low_limit = VA_EXCLUDE_END;
 		info.high_limit = task_size;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
 }
 
 static unsigned long
-hugetlb_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
+hugetlb_get_unmapped_area_topdown(struct mm_struct *mm,
+				  struct file *filp,
+				  const unsigned long addr0,
 				  const unsigned long len,
 				  const unsigned long pgoff,
 				  const unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
@@ -71,7 +73,7 @@ hugetlb_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.high_limit = mm->mmap_base;
 	info.align_mask = PAGE_MASK & ~HPAGE_MASK;
 	info.align_offset = 0;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -84,17 +86,17 @@ hugetlb_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = STACK_TOP32;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
 }
 
 unsigned long
-hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long task_size = TASK_SIZE;
 
@@ -120,10 +122,10 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 			return addr;
 	}
 	if (mm->get_unmapped_area == arch_get_unmapped_area)
-		return hugetlb_get_unmapped_area_bottomup(file, addr, len,
+		return hugetlb_get_unmapped_area_bottomup(mm, file, addr, len,
 				pgoff, flags);
 	else
-		return hugetlb_get_unmapped_area_topdown(file, addr, len,
+		return hugetlb_get_unmapped_area_topdown(mm, file, addr, len,
 				pgoff, flags);
 }
 
diff --git a/arch/tile/kernel/vdso.c b/arch/tile/kernel/vdso.c
index 5bc51d7dfdcb..74aa55aa11ec 100644
--- a/arch/tile/kernel/vdso.c
+++ b/arch/tile/kernel/vdso.c
@@ -150,7 +150,7 @@ int setup_vdso_pages(void)
 	if (pages == 0)
 		return 0;
 
-	vdso_base = get_unmapped_area(NULL, vdso_base,
+	vdso_base = get_unmapped_area(mm, NULL, vdso_base,
 				      (pages << PAGE_SHIFT) +
 				      ((VDSO_ALIGNMENT - 1) & PAGE_MASK),
 				      0, 0);
diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
index 77ceaa343fce..24e8c99b5b3f 100644
--- a/arch/tile/mm/hugetlbpage.c
+++ b/arch/tile/mm/hugetlbpage.c
@@ -161,7 +161,8 @@ int pud_huge(pud_t pud)
 }
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
-static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
+static unsigned long hugetlb_get_unmapped_area_bottomup(struct mm_struct *mm,
+		struct file *file,
 		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
@@ -174,10 +175,11 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 	info.high_limit = TASK_SIZE;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
-static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
+static unsigned long hugetlb_get_unmapped_area_topdown(struct mm_struct *mm,
+		struct file *file,
 		unsigned long addr0, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
@@ -188,10 +190,10 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = current->mm->mmap_base;
+	info.high_limit = mm->mmap_base;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -204,17 +206,17 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = TASK_SIZE;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
 }
 
-unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 
 	if (len & ~huge_page_mask(h))
@@ -235,11 +237,11 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		    (!vma || addr + len <= vma->vm_start))
 			return addr;
 	}
-	if (current->mm->get_unmapped_area == arch_get_unmapped_area)
-		return hugetlb_get_unmapped_area_bottomup(file, addr, len,
+	if (mm->get_unmapped_area == arch_get_unmapped_area)
+		return hugetlb_get_unmapped_area_bottomup(mm, file, addr, len,
 				pgoff, flags);
 	else
-		return hugetlb_get_unmapped_area_topdown(file, addr, len,
+		return hugetlb_get_unmapped_area_topdown(mm, file, addr, len,
 				pgoff, flags);
 }
 #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 10820f6cefbf..8f728f3a1698 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -153,7 +153,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
 
-	addr = get_unmapped_area(NULL, addr,
+	addr = get_unmapped_area(mm, NULL, addr,
 				 image->size - image->sym_vvar_start, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
 		ret = addr;
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index a55ed63b9f91..bf9198998895 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -120,10 +120,10 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
 }
 
 unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 	unsigned long begin, end;
@@ -154,16 +154,15 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
 unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
-			  const unsigned long len, const unsigned long pgoff,
-			  const unsigned long flags)
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			  const unsigned long addr0, const unsigned long len,
+			  const unsigned long pgoff, const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
@@ -197,7 +196,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 	if (!(addr & ~PAGE_MASK))
 		return addr;
 	VM_BUG_ON(addr != -ENOMEM);
@@ -209,5 +208,5 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	 * can happen with large stack limits and large mmap()
 	 * allocations.
 	 */
-	return arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
+	return arch_get_unmapped_area(mm, filp, addr0, len, pgoff, flags);
 }
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 2ae8584b44c7..93f0dccd0fcd 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -72,7 +72,8 @@ int pud_huge(pud_t pud)
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
+static unsigned long hugetlb_get_unmapped_area_bottomup(struct mm_struct *mm,
+		struct file *file,
 		unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
@@ -81,14 +82,15 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
 
 	info.flags = 0;
 	info.length = len;
-	info.low_limit = current->mm->mmap_legacy_base;
+	info.low_limit = mm->mmap_legacy_base;
 	info.high_limit = TASK_SIZE;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 
-static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
+static unsigned long hugetlb_get_unmapped_area_topdown(struct mm_struct *mm,
+		struct file *file,
 		unsigned long addr0, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
@@ -99,10 +101,10 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = current->mm->mmap_base;
+	info.high_limit = mm->mmap_base;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -115,18 +117,18 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = TASK_SIZE;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
 }
 
 unsigned long
-hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	struct hstate *h = hstate_file(file);
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 
 	if (len & ~huge_page_mask(h))
@@ -148,10 +150,10 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 			return addr;
 	}
 	if (mm->get_unmapped_area == arch_get_unmapped_area)
-		return hugetlb_get_unmapped_area_bottomup(file, addr, len,
+		return hugetlb_get_unmapped_area_bottomup(mm, file, addr, len,
 				pgoff, flags);
 	else
-		return hugetlb_get_unmapped_area_topdown(file, addr, len,
+		return hugetlb_get_unmapped_area_topdown(mm, file, addr, len,
 				pgoff, flags);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
diff --git a/arch/xtensa/kernel/syscall.c b/arch/xtensa/kernel/syscall.c
index d3fd100dffc9..4a17828917df 100644
--- a/arch/xtensa/kernel/syscall.c
+++ b/arch/xtensa/kernel/syscall.c
@@ -58,8 +58,9 @@ asmlinkage long xtensa_fadvise64_64(int fd, int advice,
 }
 
 #ifdef CONFIG_MMU
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	struct vm_area_struct *vmm;
 
@@ -83,7 +84,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	else
 		addr = PAGE_ALIGN(addr);
 
-	for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
+	for (vmm = find_vma(mm, addr); ; vmm = vmm->vm_next) {
 		/* At this point:  (!vmm || addr < vmm->vm_end). */
 		if (TASK_SIZE - len < addr)
 			return -ENOMEM;
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 6d9cc2d39d22..4b6ac3f373df 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -273,7 +273,8 @@ static pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 #endif
 
 #ifndef CONFIG_MMU
-static unsigned long get_unmapped_area_mem(struct file *file,
+static unsigned long get_unmapped_area_mem(struct mm_struct *mm,
+					   struct file *file,
 					   unsigned long addr,
 					   unsigned long len,
 					   unsigned long pgoff,
@@ -662,9 +663,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
-static unsigned long get_unmapped_area_zero(struct file *file,
-				unsigned long addr, unsigned long len,
-				unsigned long pgoff, unsigned long flags)
+static unsigned long get_unmapped_area_zero(struct mm_struct *mm,
+				struct file *file, unsigned long addr,
+				unsigned long len, unsigned long pgoff,
+				unsigned long flags)
 {
 #ifdef CONFIG_MMU
 	if (flags & MAP_SHARED) {
@@ -674,11 +676,12 @@ static unsigned long get_unmapped_area_zero(struct file *file,
 		 * and pass NULL for file as in mmap.c's get_unmapped_area(),
 		 * so as not to confuse shmem with our handle on "/dev/zero".
 		 */
-		return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+		return shmem_get_unmapped_area(mm, NULL, addr, len, pgoff,
+					       flags);
 	}
 
 	/* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
-	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+	return mm->get_unmapped_area(mm, file, addr, len, pgoff, flags);
 #else
 	return -ENOSYS;
 #endif
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index ed758b74ddf0..ab60b8c05b7f 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -552,9 +552,9 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 }
 
 /* return an unmapped area aligned to the dax region specified alignment */
-static unsigned long dax_get_unmapped_area(struct file *filp,
-		unsigned long addr, unsigned long len, unsigned long pgoff,
-		unsigned long flags)
+static unsigned long dax_get_unmapped_area(struct mm_struct *mm,
+		struct file *filp, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags)
 {
 	unsigned long off, off_end, off_align, len_align, addr_align, align;
 	struct dax_dev *dax_dev = filp ? filp->private_data : NULL;
@@ -576,14 +576,14 @@ static unsigned long dax_get_unmapped_area(struct file *filp,
 	if ((off + len_align) < off)
 		goto out;
 
-	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
+	addr_align = mm->get_unmapped_area(mm, filp, addr, len_align,
 			pgoff, flags);
 	if (!IS_ERR_VALUE(addr_align)) {
 		addr_align += (off - addr_align) & (align - 1);
 		return addr_align;
 	}
  out:
-	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+	return mm->get_unmapped_area(mm, filp, addr, len, pgoff, flags);
 }
 
 static int dax_open(struct inode *inode, struct file *filp)
diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
index 3e7e283a44a8..2224c5ebb459 100644
--- a/drivers/media/usb/uvc/uvc_v4l2.c
+++ b/drivers/media/usb/uvc/uvc_v4l2.c
@@ -1434,9 +1434,9 @@ static unsigned int uvc_v4l2_poll(struct file *file, poll_table *wait)
 }
 
 #ifndef CONFIG_MMU
-static unsigned long uvc_v4l2_get_unmapped_area(struct file *file,
-		unsigned long addr, unsigned long len, unsigned long pgoff,
-		unsigned long flags)
+static unsigned long uvc_v4l2_get_unmapped_area(struct mm_struct *mm,
+		struct file *file, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags)
 {
 	struct uvc_fh *handle = file->private_data;
 	struct uvc_streaming *stream = handle->stream;
diff --git a/drivers/media/v4l2-core/v4l2-dev.c b/drivers/media/v4l2-core/v4l2-dev.c
index fa2124cb31bd..f227ebc369f7 100644
--- a/drivers/media/v4l2-core/v4l2-dev.c
+++ b/drivers/media/v4l2-core/v4l2-dev.c
@@ -369,9 +369,9 @@ static long v4l2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 #ifdef CONFIG_MMU
 #define v4l2_get_unmapped_area NULL
 #else
-static unsigned long v4l2_get_unmapped_area(struct file *filp,
-		unsigned long addr, unsigned long len, unsigned long pgoff,
-		unsigned long flags)
+static unsigned long v4l2_get_unmapped_area(struct mm_struct *mm,
+		struct file *filp, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags)
 {
 	struct video_device *vdev = video_devdata(filp);
 	int ret;
@@ -380,7 +380,7 @@ static unsigned long v4l2_get_unmapped_area(struct file *filp,
 		return -ENOSYS;
 	if (!video_is_registered(vdev))
 		return -ENODEV;
-	ret = vdev->fops->get_unmapped_area(filp, addr, len, pgoff, flags);
+	ret = vdev->fops->get_unmapped_area(mm, filp, addr, len, pgoff, flags);
 	if (vdev->dev_debug & V4L2_DEV_DEBUG_FOP)
 		printk(KERN_DEBUG "%s: get_unmapped_area (%d)\n",
 			video_device_node_name(vdev), ret);
diff --git a/drivers/media/v4l2-core/videobuf2-v4l2.c b/drivers/media/v4l2-core/videobuf2-v4l2.c
index 3529849d2218..953417fb4914 100644
--- a/drivers/media/v4l2-core/videobuf2-v4l2.c
+++ b/drivers/media/v4l2-core/videobuf2-v4l2.c
@@ -932,8 +932,9 @@ unsigned int vb2_fop_poll(struct file *file, poll_table *wait)
 EXPORT_SYMBOL_GPL(vb2_fop_poll);
 
 #ifndef CONFIG_MMU
-unsigned long vb2_fop_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long vb2_fop_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	struct video_device *vdev = video_devdata(file);
 
diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index ce5ccc573a9c..8774808c69e5 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -1156,7 +1156,8 @@ static long mtdchar_compat_ioctl(struct file *file, unsigned int cmd,
  *   mappings)
  */
 #ifndef CONFIG_MMU
-static unsigned long mtdchar_get_unmapped_area(struct file *file,
+static unsigned long mtdchar_get_unmapped_area(struct mm_struct *mm,
+					   struct file *file,
 					   unsigned long addr,
 					   unsigned long len,
 					   unsigned long pgoff,
diff --git a/drivers/usb/gadget/function/uvc_v4l2.c b/drivers/usb/gadget/function/uvc_v4l2.c
index 3e22b45687d3..96ad4de14dc7 100644
--- a/drivers/usb/gadget/function/uvc_v4l2.c
+++ b/drivers/usb/gadget/function/uvc_v4l2.c
@@ -343,7 +343,8 @@ uvc_v4l2_poll(struct file *file, poll_table *wait)
 }
 
 #ifndef CONFIG_MMU
-static unsigned long uvcg_v4l2_get_unmapped_area(struct file *file,
+static unsigned long uvcg_v4l2_get_unmapped_area(struct mm_struct *mm,
+		struct file *file,
 		unsigned long addr, unsigned long len, unsigned long pgoff,
 		unsigned long flags)
 {
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 54de77e78775..69037eb6a728 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -168,10 +168,10 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 #ifndef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 static unsigned long
-hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
@@ -201,7 +201,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	info.high_limit = TASK_SIZE;
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 #endif
 
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 842a5ff5b85c..cb2d5702bdce 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -291,9 +291,9 @@ static int proc_reg_mmap(struct file *file, struct vm_area_struct *vma)
 }
 
 static unsigned long
-proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
-			   unsigned long len, unsigned long pgoff,
-			   unsigned long flags)
+proc_reg_get_unmapped_area(struct mm_struct *mm, struct file *file,
+			   unsigned long orig_addr, unsigned long len,
+			   unsigned long pgoff, unsigned long flags)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	unsigned long rv = -EIO;
@@ -304,11 +304,11 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
 		get_area = pde->proc_fops->get_unmapped_area;
 #ifdef CONFIG_MMU
 		if (!get_area)
-			get_area = current->mm->get_unmapped_area;
+			get_area = mm->get_unmapped_area;
 #endif
 
 		if (get_area)
-			rv = get_area(file, orig_addr, len, pgoff, flags);
+			rv = get_area(mm, file, orig_addr, len, pgoff, flags);
 		else
 			rv = orig_addr;
 		unuse_pde(pde);
diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 12af0490322f..a4584130234a 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -31,11 +31,12 @@
 
 #include "internal.h"
 
-static unsigned long ramfs_mmu_get_unmapped_area(struct file *file,
+static unsigned long ramfs_mmu_get_unmapped_area(struct mm_struct *mm,
+		struct file *file,
 		unsigned long addr, unsigned long len, unsigned long pgoff,
 		unsigned long flags)
 {
-	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+	return mm->get_unmapped_area(mm, file, addr, len, pgoff, flags);
 }
 
 const struct file_operations ramfs_file_operations = {
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index 2ef7ce75c062..564ba84d61b6 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -27,7 +27,8 @@
 #include "internal.h"
 
 static int ramfs_nommu_setattr(struct dentry *, struct iattr *);
-static unsigned long ramfs_nommu_get_unmapped_area(struct file *file,
+static unsigned long ramfs_nommu_get_unmapped_area(struct mm_struct *mm,
+						   struct file *file,
 						   unsigned long addr,
 						   unsigned long len,
 						   unsigned long pgoff,
@@ -202,9 +203,10 @@ static int ramfs_nommu_setattr(struct dentry *dentry, struct iattr *ia)
  *   - the pages to be mapped must exist
  *   - the pages be physically contiguous in sequence
  */
-static unsigned long ramfs_nommu_get_unmapped_area(struct file *file,
-					    unsigned long addr, unsigned long len,
-					    unsigned long pgoff, unsigned long flags)
+static unsigned long ramfs_nommu_get_unmapped_area(struct mm_struct *mm,
+					struct file *file, unsigned long addr,
+					unsigned long len, unsigned long pgoff,
+					unsigned long flags)
 {
 	unsigned long maxpages, lpages, nr, loop, ret;
 	struct inode *inode = file_inode(file);
diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c
index 1118a0dc6b45..0bfebcafb077 100644
--- a/fs/romfs/mmap-nommu.c
+++ b/fs/romfs/mmap-nommu.c
@@ -19,7 +19,8 @@
  *   mappings)
  * - attempts to map through to the underlying MTD device
  */
-static unsigned long romfs_get_unmapped_area(struct file *file,
+static unsigned long romfs_get_unmapped_area(struct mm_struct *mm,
+					     struct file *file,
 					     unsigned long addr,
 					     unsigned long len,
 					     unsigned long pgoff,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..45f8cf874ea1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1662,7 +1662,7 @@ struct file_operations {
 	int (*fasync) (int, struct file *, int);
 	int (*lock) (struct file *, int, struct file_lock *);
 	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
-	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
+	unsigned long (*get_unmapped_area)(struct mm_struct *, struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 	int (*check_flags)(int);
 	int (*flock) (struct file *, int, struct file_lock *);
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 97e478d6b690..94a0e680b7d7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -87,9 +87,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
 
-extern unsigned long thp_get_unmapped_area(struct file *filp,
-		unsigned long addr, unsigned long len, unsigned long pgoff,
-		unsigned long flags);
+extern unsigned long thp_get_unmapped_area(struct mm_struct *mm,
+		struct file *filp, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags);
 
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 48c76d612d40..72260cc252f2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -287,8 +287,9 @@ hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
-unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
-					unsigned long len, unsigned long pgoff,
+unsigned long hugetlb_get_unmapped_area(struct mm_struct *mm, struct file *file,
+					unsigned long addr, unsigned long len,
+					unsigned long pgoff,
 					unsigned long flags);
 #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 71a90604d21f..1520da8f9c67 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2014,7 +2014,9 @@ extern int install_special_mapping(struct mm_struct *mm,
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
-extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
+extern unsigned long get_unmapped_area(struct mm_struct *, struct file *,
+				       unsigned long, unsigned long,
+				       unsigned long, unsigned long);
 
 extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 				 unsigned long addr, unsigned long len,
@@ -2066,8 +2068,10 @@ struct vm_unmapped_area_info {
 	unsigned long align_offset;
 };
 
-extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
-extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
+extern unsigned long unmapped_area(struct mm_struct *mm,
+				   struct vm_unmapped_area_info *info);
+extern unsigned long unmapped_area_topdown(struct mm_struct *mm,
+					   struct vm_unmapped_area_info *info);
 
 /*
  * Search for an unmapped address range.
@@ -2079,12 +2083,12 @@ extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
  * - satisfies (begin_addr & align_mask) == (align_offset & align_mask)
  */
 static inline unsigned long
-vm_unmapped_area(struct vm_unmapped_area_info *info)
+vm_unmapped_area(struct mm_struct *mm, struct vm_unmapped_area_info *info)
 {
 	if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
-		return unmapped_area_topdown(info);
+		return unmapped_area_topdown(mm, info);
 	else
-		return unmapped_area(info);
+		return unmapped_area(mm, info);
 }
 
 /* truncate.c */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 808751d7b737..6aa03e88dcff 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -398,9 +398,10 @@ struct mm_struct {
 	struct rb_root mm_rb;
 	u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
-	unsigned long (*get_unmapped_area) (struct file *filp,
-				unsigned long addr, unsigned long len,
-				unsigned long pgoff, unsigned long flags);
+	unsigned long (*get_unmapped_area) (struct mm_struct *mm,
+				struct file *filp, unsigned long addr,
+				unsigned long len, unsigned long pgoff,
+				unsigned long flags);
 #endif
 	unsigned long mmap_base;		/* base of mmap area */
 	unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9ec61f7..42b9b93a50ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -476,12 +476,12 @@ struct user_namespace;
 #ifdef CONFIG_MMU
 extern void arch_pick_mmap_layout(struct mm_struct *mm);
 extern unsigned long
-arch_get_unmapped_area(struct file *, unsigned long, unsigned long,
-		       unsigned long, unsigned long);
+arch_get_unmapped_area(struct mm_struct *mm, struct file *, unsigned long,
+		       unsigned long, unsigned long, unsigned long);
 extern unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
-			  unsigned long len, unsigned long pgoff,
-			  unsigned long flags);
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			       unsigned long addr, unsigned long len,
+			       unsigned long pgoff, unsigned long flags);
 #else
 static inline void arch_pick_mmap_layout(struct mm_struct *mm) {}
 #endif
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff078e7043b6..37d27369bb42 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -54,8 +54,9 @@ extern struct file *shmem_file_setup(const char *name,
 extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
 					    unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
-extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags);
+extern unsigned long shmem_get_unmapped_area(struct mm_struct *, struct file *,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
 extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
diff --git a/include/media/v4l2-dev.h b/include/media/v4l2-dev.h
index e657614521e3..d7287a53f590 100644
--- a/include/media/v4l2-dev.h
+++ b/include/media/v4l2-dev.h
@@ -156,7 +156,8 @@ struct v4l2_file_operations {
 #ifdef CONFIG_COMPAT
 	long (*compat_ioctl32) (struct file *, unsigned int, unsigned long);
 #endif
-	unsigned long (*get_unmapped_area) (struct file *, unsigned long,
+	unsigned long (*get_unmapped_area) (struct mm_struct *mm,
+				struct file *filep, unsigned long,
 				unsigned long, unsigned long, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
 	int (*open) (struct file *);
diff --git a/include/media/videobuf2-v4l2.h b/include/media/videobuf2-v4l2.h
index 036127c54bbf..43f4d0a3b333 100644
--- a/include/media/videobuf2-v4l2.h
+++ b/include/media/videobuf2-v4l2.h
@@ -264,8 +264,9 @@ ssize_t vb2_fop_read(struct file *file, char __user *buf,
 		size_t count, loff_t *ppos);
 unsigned int vb2_fop_poll(struct file *file, poll_table *wait);
 #ifndef CONFIG_MMU
-unsigned long vb2_fop_get_unmapped_area(struct file *file, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags);
+unsigned long vb2_fop_get_unmapped_area(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags);
 #endif
 
 /**
diff --git a/ipc/shm.c b/ipc/shm.c
index 64c21fb32ca9..2fd73cda4ec8 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -465,14 +465,14 @@ static long shm_fallocate(struct file *file, int mode, loff_t offset,
 	return sfd->file->f_op->fallocate(file, mode, offset, len);
 }
 
-static unsigned long shm_get_unmapped_area(struct file *file,
-	unsigned long addr, unsigned long len, unsigned long pgoff,
-	unsigned long flags)
+static unsigned long shm_get_unmapped_area(struct mm_struct *mm,
+	struct file *file, unsigned long addr, unsigned long len,
+	unsigned long pgoff, unsigned long flags)
 {
 	struct shm_file_data *sfd = shm_file_data(file);
 
-	return sfd->file->f_op->get_unmapped_area(sfd->file, addr, len,
-						pgoff, flags);
+	return sfd->file->f_op->get_unmapped_area(mm, sfd->file, addr, len,
+						  pgoff, flags);
 }
 
 static const struct file_operations shm_file_operations = {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d416f3baf392..ed95aae0f386 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1142,7 +1142,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 
 	if (!area->vaddr) {
 		/* Try to map as high as possible, this is only a hint. */
-		area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
+		area->vaddr = get_unmapped_area(mm, NULL, TASK_SIZE - PAGE_SIZE,
 						PAGE_SIZE, 0, 0);
 		if (area->vaddr & ~PAGE_MASK) {
 			ret = area->vaddr;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5f3ad65c85de..d5b2604867e5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -499,8 +499,9 @@ void prep_transhuge_page(struct page *page)
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
-unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
-		loff_t off, unsigned long flags, unsigned long size)
+unsigned long __thp_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long len, loff_t off, unsigned long flags,
+		unsigned long size)
 {
 	unsigned long addr;
 	loff_t off_end = off + len;
@@ -514,8 +515,8 @@ unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
 	if (len_pad < len || (off + len_pad) < off)
 		return 0;
 
-	addr = current->mm->get_unmapped_area(filp, 0, len_pad,
-					      off >> PAGE_SHIFT, flags);
+	addr = mm->get_unmapped_area(mm, filp, 0, len_pad, off >> PAGE_SHIFT,
+				     flags);
 	if (IS_ERR_VALUE(addr))
 		return 0;
 
@@ -523,8 +524,9 @@ unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
 	return addr;
 }
 
-unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long thp_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
@@ -533,12 +535,12 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
 
-	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
+	addr = __thp_get_unmapped_area(mm, filp, len, off, flags, PMD_SIZE);
 	if (addr)
 		return addr;
 
  out:
-	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+	return mm->get_unmapped_area(mm, filp, addr, len, pgoff, flags);
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index ea79bc4da5b7..8a73dc458e69 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1339,7 +1339,7 @@ unsigned long do_mmap(struct mm_struct *mm, struct file *file,
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
-	addr = get_unmapped_area(file, addr, len, pgoff, flags);
+	addr = get_unmapped_area(mm, file, addr, len, pgoff, flags);
 	if (offset_in_page(addr))
 		return addr;
 
@@ -1742,7 +1742,8 @@ unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 	return error;
 }
 
-unsigned long unmapped_area(struct vm_unmapped_area_info *info)
+unsigned long unmapped_area(struct mm_struct *mm,
+			    struct vm_unmapped_area_info *info)
 {
 	/*
 	 * We implement the search by looking for an rbtree node that
@@ -1752,7 +1753,6 @@ unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 	 * - gap_end - gap_start >= length
 	 */
 
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long length, low_limit, high_limit, gap_start, gap_end;
 
@@ -1844,9 +1844,9 @@ unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 	return gap_start;
 }
 
-unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
+unsigned long unmapped_area_topdown(struct mm_struct *mm,
+				    struct vm_unmapped_area_info *info)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long length, low_limit, high_limit, gap_start, gap_end;
 
@@ -1955,10 +1955,10 @@ unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
  */
 #ifndef HAVE_ARCH_UNMAPPED_AREA
 unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
-		unsigned long len, unsigned long pgoff, unsigned long flags)
+arch_get_unmapped_area(struct mm_struct *mm, struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 
@@ -1981,7 +1981,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.low_limit = mm->mmap_base;
 	info.high_limit = TASK_SIZE;
 	info.align_mask = 0;
-	return vm_unmapped_area(&info);
+	return vm_unmapped_area(mm, &info);
 }
 #endif
 
@@ -1991,12 +1991,11 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
  */
 #ifndef HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
 unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
-			  const unsigned long len, const unsigned long pgoff,
-			  const unsigned long flags)
+arch_get_unmapped_area_topdown(struct mm_struct *mm, struct file *filp,
+			  const unsigned long addr0, const unsigned long len,
+			  const unsigned long pgoff, const unsigned long flags)
 {
 	struct vm_area_struct *vma;
-	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
@@ -2021,7 +2020,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
 	info.high_limit = mm->mmap_base;
 	info.align_mask = 0;
-	addr = vm_unmapped_area(&info);
+	addr = vm_unmapped_area(mm, &info);
 
 	/*
 	 * A failed mmap() very likely causes application failure,
@@ -2034,7 +2033,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = TASK_SIZE;
-		addr = vm_unmapped_area(&info);
+		addr = vm_unmapped_area(mm, &info);
 	}
 
 	return addr;
@@ -2042,11 +2041,12 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 #endif
 
 unsigned long
-get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
-		unsigned long pgoff, unsigned long flags)
+get_unmapped_area(struct mm_struct *mm, struct file *file, unsigned long addr,
+		  unsigned long len, unsigned long pgoff, unsigned long flags)
 {
-	unsigned long (*get_area)(struct file *, unsigned long,
-				  unsigned long, unsigned long, unsigned long);
+	unsigned long (*get_area)(struct mm_struct *, struct file *,
+				  unsigned long, unsigned long, unsigned long,
+				  unsigned long);
 
 	unsigned long error = arch_mmap_check(addr, len, flags);
 	if (error)
@@ -2056,7 +2056,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
-	get_area = current->mm->get_unmapped_area;
+	get_area = mm->get_unmapped_area;
 	if (file) {
 		if (file->f_op->get_unmapped_area)
 			get_area = file->f_op->get_unmapped_area;
@@ -2070,7 +2070,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		get_area = shmem_get_unmapped_area;
 	}
 
-	addr = get_area(file, addr, len, pgoff, flags);
+	addr = get_area(mm, file, addr, len, pgoff, flags);
 	if (IS_ERR_VALUE(addr))
 		return addr;
 
@@ -2819,7 +2819,7 @@ static int do_brk(unsigned long addr, unsigned long request)
 
 	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
 
-	error = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED);
+	error = get_unmapped_area(mm, NULL, addr, len, 0, MAP_FIXED);
 	if (offset_in_page(error))
 		return error;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 30d7d2482eea..f085eed57bac 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -452,8 +452,9 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (vma->vm_flags & VM_MAYSHARE)
 		map_flags |= MAP_SHARED;
 
-	ret = get_unmapped_area(vma->vm_file, new_addr, new_len, vma->vm_pgoff +
-				((addr - vma->vm_start) >> PAGE_SHIFT),
+	ret = get_unmapped_area(mm, vma->vm_file, new_addr, new_len,
+				vma->vm_pgoff +
+					((addr - vma->vm_start) >> PAGE_SHIFT),
 				map_flags);
 	if (offset_in_page(ret))
 		goto out1;
@@ -475,8 +476,8 @@ static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
 		return 0;
 	if (vma->vm_next && vma->vm_next->vm_start < end) /* intersection */
 		return 0;
-	if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
-			      0, MAP_FIXED) & ~PAGE_MASK)
+	if (get_unmapped_area(vma->vm_mm, NULL, vma->vm_start,
+			      end - vma->vm_start, 0, MAP_FIXED) & ~PAGE_MASK)
 		return 0;
 	return 1;
 }
@@ -583,7 +584,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		if (vma->vm_flags & VM_MAYSHARE)
 			map_flags |= MAP_SHARED;
 
-		new_addr = get_unmapped_area(vma->vm_file, 0, new_len,
+		new_addr = get_unmapped_area(mm, vma->vm_file, 0, new_len,
 					vma->vm_pgoff +
 					((addr - vma->vm_start) >> PAGE_SHIFT),
 					map_flags);
diff --git a/mm/nommu.c b/mm/nommu.c
index 54825d29f50b..5075a56b75ec 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1333,8 +1333,9 @@ unsigned long do_mmap(struct mm_struct *mm,
 		 *   tell us the location of a shared mapping
 		 */
 		if (capabilities & NOMMU_MAP_DIRECT) {
-			addr = file->f_op->get_unmapped_area(file, addr, len,
-							     pgoff, flags);
+			addr = file->f_op->get_unmapped_area(mm, file, addr,
+							     len, pgoff,
+							     flags);
 			if (IS_ERR_VALUE(addr)) {
 				ret = addr;
 				if (ret != -ENOSYS)
@@ -1782,8 +1783,9 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 }
 EXPORT_SYMBOL(remap_vmalloc_range);
 
-unsigned long arch_get_unmapped_area(struct file *file, unsigned long addr,
-	unsigned long len, unsigned long pgoff, unsigned long flags)
+unsigned long arch_get_unmapped_area(struct mm_struct *mm, struct file *file,
+	unsigned long addr, unsigned long len, unsigned long pgoff,
+	unsigned long flags)
 {
 	return -ENOMEM;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 3a7587a0314d..5e27727c05a8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1971,11 +1971,11 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
-unsigned long shmem_get_unmapped_area(struct file *file,
+unsigned long shmem_get_unmapped_area(struct mm_struct *mm, struct file *file,
 				      unsigned long uaddr, unsigned long len,
 				      unsigned long pgoff, unsigned long flags)
 {
-	unsigned long (*get_area)(struct file *,
+	unsigned long (*get_area)(struct mm_struct *mm, struct file *,
 		unsigned long, unsigned long, unsigned long, unsigned long);
 	unsigned long addr;
 	unsigned long offset;
@@ -1986,8 +1986,8 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	if (len > TASK_SIZE)
 		return -ENOMEM;
 
-	get_area = current->mm->get_unmapped_area;
-	addr = get_area(file, uaddr, len, pgoff, flags);
+	get_area = mm->get_unmapped_area;
+	addr = get_area(mm, file, uaddr, len, pgoff, flags);
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
 		return addr;
@@ -2043,7 +2043,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	if (inflated_len < len)
 		return addr;
 
-	inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+	inflated_addr = get_area(mm, NULL, 0, inflated_len, 0, flags);
 	if (IS_ERR_VALUE(inflated_addr))
 		return addr;
 	if (inflated_addr & ~PAGE_MASK)
@@ -3971,11 +3971,11 @@ void shmem_unlock_mapping(struct address_space *mapping)
 }
 
 #ifdef CONFIG_MMU
-unsigned long shmem_get_unmapped_area(struct file *file,
+unsigned long shmem_get_unmapped_area(struct mm_struct *mm, struct file *file,
 				      unsigned long addr, unsigned long len,
 				      unsigned long pgoff, unsigned long flags)
 {
-	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+	return mm->get_unmapped_area(mm, file, addr, len, pgoff, flags);
 }
 #endif
 
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index 9d33c1e85c79..f3d2ee8ecf9f 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -3649,7 +3649,8 @@ static int snd_pcm_hw_params_old_user(struct snd_pcm_substream *substream,
 #endif /* CONFIG_SND_SUPPORT_OLD_API */
 
 #ifndef CONFIG_MMU
-static unsigned long snd_pcm_get_unmapped_area(struct file *file,
+static unsigned long snd_pcm_get_unmapped_area(struct mm_struct *mm,
+					       struct file *file,
 					       unsigned long addr,
 					       unsigned long len,
 					       unsigned long pgoff,
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 05/13] mm: Add mm_struct argument to 'mm_populate' and '__mm_populate'
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (3 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 04/13] mm: Add mm_struct argument to 'get_unmapped_area' and 'vm_unmapped_area' Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 06/13] mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem Till Smejkal
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Add an additional argument to the 'mm_populate' and '__mm_populate'
functions that specifies which mm_struct they should use during their
execution. Before, these functions simply used the memory map of the
current task. However, with the introduction of first class virtual
address spaces, both functions also need to be able to operate on
memory maps other than the one of the current task. Accordingly, the
memory map these functions should use can now be specified explicitly
via the additional argument.
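
As an illustration only (not part of this patch), a minimal sketch of a
caller that populates a range in an explicitly chosen memory map; the
helper and its 'vas_mm' argument are hypothetical and merely stand in
for whatever mm_struct the caller obtained elsewhere:

/*
 * Fault in the pages of [addr, addr + len) in the given memory map and
 * report errors instead of ignoring them.  The mmap_sem of 'vas_mm'
 * must not be held, as required by __mm_populate().
 */
static int vas_populate_range(struct mm_struct *vas_mm,
			      unsigned long addr, unsigned long len)
{
	return __mm_populate(vas_mm, addr, len, 0);
}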

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 arch/x86/mm/mpx.c  |  2 +-
 include/linux/mm.h | 13 ++++++++-----
 ipc/shm.c          |  9 +++++----
 mm/gup.c           |  4 ++--
 mm/mlock.c         | 21 +++++++++++----------
 mm/mmap.c          |  6 +++---
 mm/mremap.c        |  2 +-
 mm/util.c          |  2 +-
 8 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 99c664a97c35..b46f7cdbdad8 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -54,7 +54,7 @@ static unsigned long mpx_mmap(unsigned long len)
 		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
 	up_write(&mm->mmap_sem);
 	if (populate)
-		mm_populate(addr, populate);
+		mm_populate(mm, addr, populate);
 
 	return addr;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1520da8f9c67..92925d97da20 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2040,15 +2040,18 @@ do_mmap_pgoff(struct mm_struct *mm, struct file *file, unsigned long addr,
 }
 
 #ifdef CONFIG_MMU
-extern int __mm_populate(unsigned long addr, unsigned long len,
-			 int ignore_errors);
-static inline void mm_populate(unsigned long addr, unsigned long len)
+extern int __mm_populate(struct mm_struct *mm, unsigned long addr,
+			 unsigned long len, int ignore_errors);
+static inline void mm_populate(struct mm_struct *mm, unsigned long addr,
+			       unsigned long len)
 {
 	/* Ignore errors */
-	(void) __mm_populate(addr, len, 1);
+	(void) __mm_populate(mm, addr, len, 1);
 }
 #else
-static inline void mm_populate(unsigned long addr, unsigned long len) {}
+static inline void mm_populate(struct mm_struct *mm, unsigned long addr,
+			       unsigned long len)
+{}
 #endif
 
 /* These take the mm semaphore themselves */
diff --git a/ipc/shm.c b/ipc/shm.c
index 2fd73cda4ec8..be692e0abe79 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1106,6 +1106,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
 	struct shm_file_data *sfd;
 	struct path path;
 	fmode_t f_mode;
+	struct mm_struct *mm = current->mm;
 	unsigned long populate = 0;
 
 	err = -EINVAL;
@@ -1208,7 +1209,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
 	if (err)
 		goto out_fput;
 
-	if (down_write_killable(&current->mm->mmap_sem)) {
+	if (down_write_killable(&mm->mmap_sem)) {
 		err = -EINTR;
 		goto out_fput;
 	}
@@ -1218,7 +1219,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
 		if (addr + size < addr)
 			goto invalid;
 
-		if (find_vma_intersection(current->mm, addr, addr + size))
+		if (find_vma_intersection(mm, addr, addr + size))
 			goto invalid;
 	}
 
@@ -1229,9 +1230,9 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
 	if (IS_ERR_VALUE(addr))
 		err = (long)addr;
 invalid:
-	up_write(&current->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 	if (populate)
-		mm_populate(addr, populate);
+		mm_populate(mm, addr, populate);
 
 out_fput:
 	fput(file);
diff --git a/mm/gup.c b/mm/gup.c
index 55315555489d..ca5ba2703b40 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1053,9 +1053,9 @@ long populate_vma_page_range(struct vm_area_struct *vma,
  * flags. VMAs must be already marked with the desired vm_flags, and
  * mmap_sem must not be held.
  */
-int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
+int __mm_populate(struct mm_struct *mm, unsigned long start, unsigned long len,
+		  int ignore_errors)
 {
-	struct mm_struct *mm = current->mm;
 	unsigned long end, nstart, nend;
 	struct vm_area_struct *vma = NULL;
 	int locked = 0;
diff --git a/mm/mlock.c b/mm/mlock.c
index cdbed8aaa426..9d74948c7b22 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -664,6 +664,7 @@ static int count_mm_mlocked_page_nr(struct mm_struct *mm,
 
 static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
 {
+	struct mm_struct *mm = current->mm;
 	unsigned long locked;
 	unsigned long lock_limit;
 	int error = -ENOMEM;
@@ -680,10 +681,10 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 	lock_limit >>= PAGE_SHIFT;
 	locked = len >> PAGE_SHIFT;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
 
-	locked += current->mm->locked_vm;
+	locked += mm->locked_vm;
 	if ((locked > lock_limit) && (!capable(CAP_IPC_LOCK))) {
 		/*
 		 * It is possible that the regions requested intersect with
@@ -691,19 +692,18 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 		 * should not be counted to new mlock increment count. So check
 		 * and adjust locked count if necessary.
 		 */
-		locked -= count_mm_mlocked_page_nr(current->mm,
-				start, len);
+		locked -= count_mm_mlocked_page_nr(mm, start, len);
 	}
 
 	/* check against resource limits */
 	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
 		error = apply_vma_lock_flags(start, len, flags);
 
-	up_write(&current->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 	if (error)
 		return error;
 
-	error = __mm_populate(start, len, 0);
+	error = __mm_populate(mm, start, len, 0);
 	if (error)
 		return __mlock_posix_error_return(error);
 	return 0;
@@ -790,6 +790,7 @@ static int apply_mlockall_flags(int flags)
 
 SYSCALL_DEFINE1(mlockall, int, flags)
 {
+	struct mm_struct *mm = current->mm;
 	unsigned long lock_limit;
 	int ret;
 
@@ -805,16 +806,16 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
 
 	ret = -ENOMEM;
-	if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
+	if (!(flags & MCL_CURRENT) || (mm->total_vm <= lock_limit) ||
 	    capable(CAP_IPC_LOCK))
 		ret = apply_mlockall_flags(flags);
-	up_write(&current->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 	if (!ret && (flags & MCL_CURRENT))
-		mm_populate(0, TASK_SIZE);
+		mm_populate(mm, 0, TASK_SIZE);
 
 	return ret;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 8a73dc458e69..3f60c8ebd6b6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -236,7 +236,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0;
 	up_write(&mm->mmap_sem);
 	if (populate)
-		mm_populate(oldbrk, newbrk - oldbrk);
+		mm_populate(mm, oldbrk, newbrk - oldbrk);
 	return brk;
 
 out:
@@ -2781,7 +2781,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 out:
 	up_write(&mm->mmap_sem);
 	if (populate)
-		mm_populate(ret, populate);
+		mm_populate(mm, ret, populate);
 	if (!IS_ERR_VALUE(ret))
 		ret = 0;
 	return ret;
@@ -2898,7 +2898,7 @@ int vm_brk(unsigned long addr, unsigned long len)
 	populate = ((mm->def_flags & VM_LOCKED) != 0);
 	up_write(&mm->mmap_sem);
 	if (populate && !ret)
-		mm_populate(addr, len);
+		mm_populate(mm, addr, len);
 	return ret;
 }
 EXPORT_SYMBOL(vm_brk);
diff --git a/mm/mremap.c b/mm/mremap.c
index f085eed57bac..0f18ec452441 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -602,6 +602,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	}
 	up_write(&current->mm->mmap_sem);
 	if (locked && new_len > old_len)
-		mm_populate(new_addr + old_len, new_len - old_len);
+		mm_populate(mm, new_addr + old_len, new_len - old_len);
 	return ret;
 }
diff --git a/mm/util.c b/mm/util.c
index 46d05eef9a6b..7b1116b400a3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -306,7 +306,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 				    &populate);
 		up_write(&mm->mmap_sem);
 		if (populate)
-			mm_populate(ret, populate);
+			mm_populate(mm, ret, populate);
 	}
 	return ret;
 }
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 06/13] mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (4 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 05/13] mm: Add mm_struct argument to 'mm_populate' and '__mm_populate' Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init' Till Smejkal
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Make the functions 'vma_link' and 'find_vma_links' accessible to the
other source files in the kernel's mm/ directory so that they can also
perform low-level changes to mm_struct data structures.
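
As an illustration only (not part of this patch), a minimal sketch of
the insertion pattern these two helpers make available to other mm/
files; the wrapper is hypothetical and assumes the caller already holds
the target mm's mmap_sem for writing:

/* Insert a fully prepared VMA into 'mm'. */
static int vas_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma)
{
	struct vm_area_struct *prev;
	struct rb_node **rb_link, *rb_parent;

	/*
	 * Find the insertion point; a non-zero return means the range
	 * overlaps an existing mapping.
	 */
	if (find_vma_links(mm, vma->vm_start, vma->vm_end,
			   &prev, &rb_link, &rb_parent))
		return -ENOMEM;

	/*
	 * Link the VMA into the mm's VMA list and rbtree and, for file
	 * mappings, into the address_space interval tree.
	 */
	vma_link(mm, vma, prev, rb_link, rb_parent);

	return 0;
}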

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 mm/internal.h | 11 +++++++++++
 mm/mmap.c     | 12 ++++++------
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 7aa2ea0a8623..e22cb031b45b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -76,6 +76,17 @@ static inline void set_page_refcounted(struct page *page)
 extern unsigned long highest_memmap_pfn;
 
 /*
+ * in mm/mmap.c
+ */
+extern void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
+		     struct vm_area_struct *prev, struct rb_node **rb_link,
+		     struct rb_node *rb_parent);
+extern int find_vma_links(struct mm_struct *mm, unsigned long addr,
+			  unsigned long end, struct vm_area_struct **pprev,
+			  struct rb_node ***rb_link,
+			  struct rb_node **rb_parent);
+
+/*
  * in mm/vmscan.c:
  */
 extern int isolate_lru_page(struct page *page);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f60c8ebd6b6..d35c6b51cadf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -466,9 +466,9 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-static int find_vma_links(struct mm_struct *mm, unsigned long addr,
-		unsigned long end, struct vm_area_struct **pprev,
-		struct rb_node ***rb_link, struct rb_node **rb_parent)
+int find_vma_links(struct mm_struct *mm, unsigned long addr,
+		   unsigned long end, struct vm_area_struct **pprev,
+		   struct rb_node ***rb_link, struct rb_node **rb_parent)
 {
 	struct rb_node **__rb_link, *__rb_parent, *rb_prev;
 
@@ -580,9 +580,9 @@ __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 	__vma_link_rb(mm, vma, rb_link, rb_parent);
 }
 
-static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
-			struct vm_area_struct *prev, struct rb_node **rb_link,
-			struct rb_node *rb_parent)
+void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
+	      struct vm_area_struct *prev, struct rb_node **rb_link,
+	      struct rb_node *rb_parent)
 {
 	struct address_space *mapping = NULL;
 
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (5 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 06/13] mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-14 10:18   ` David Laight
  2017-03-13 22:14 ` [RFC PATCH 08/13] kernel/fork: Define explicitly which mm_struct to duplicate during fork Till Smejkal
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Until now, the only way to create a new memory map was via the exported
function 'mm_alloc'. Unfortunately, this function not only allocates a
new memory map, but also completely initializes it. However, with the
introduction of first class virtual address spaces, some initialization
steps done in 'mm_alloc' are not applicable to the memory maps needed
for this feature and hence would lead to errors in the kernel code.

Instead of introducing a new function that can allocate and initialize
memory maps for first class virtual address spaces, and thereby
potentially duplicating code, I decided to split the 'mm_alloc'
function as well as the 'mm_init' function that it uses.

Now there are four exported functions instead of only one. The new
'mm_alloc' function only allocates a new mm_struct and zeros it out. If
one wants the old behavior of 'mm_alloc', one can use the newly
introduced function 'mm_alloc_and_setup', which not only allocates a
new mm_struct but also fully initializes it.

The old 'mm_init' function, which fully initialized a mm_struct, was
split up into two separate functions. The first one, 'mm_setup', does
all the initialization of the mm_struct that is not related to a
particular task. The task-related part of the initialization is done in
the 'mm_set_task' function. This way it is possible to create memory
maps that don't carry any task-specific information, as needed by the
first class virtual address space feature. Both functions, 'mm_setup'
and 'mm_set_task', are also exported so that they can be used
throughout the source tree.
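
As an illustration only (not part of this patch), a sketch of the two
call sequences the split enables; both wrapper functions are
hypothetical:

/* Old behaviour: a fully initialized, task-bound memory map. */
static struct mm_struct *alloc_task_mm(void)
{
	/* Equivalent to what the old 'mm_alloc' used to do. */
	return mm_alloc_and_setup();
}

/*
 * A memory map without any task-specific state, as needed for a first
 * class virtual address space.
 */
static struct mm_struct *alloc_vas_mm(void)
{
	struct mm_struct *mm;

	mm = mm_alloc();	/* allocate and zero only */
	if (!mm)
		return NULL;

	/* Task-unrelated initialization; frees mm on failure. */
	return mm_setup(mm);
}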

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 arch/arm/mach-rpc/ecard.c |  2 +-
 fs/exec.c                 |  2 +-
 include/linux/sched.h     |  7 +++++-
 kernel/fork.c             | 64 +++++++++++++++++++++++++++++++++++++----------
 4 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index dc67a7fb3831..15845e8abd7e 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -245,7 +245,7 @@ static void ecard_init_pgtables(struct mm_struct *mm)
 
 static int ecard_init_mm(void)
 {
-	struct mm_struct * mm = mm_alloc();
+	struct mm_struct *mm = mm_alloc_and_setup();
 	struct mm_struct *active_mm = current->active_mm;
 
 	if (!mm)
diff --git a/fs/exec.c b/fs/exec.c
index e57946610733..68d7908a1e5a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -380,7 +380,7 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	int err;
 	struct mm_struct *mm = NULL;
 
-	bprm->mm = mm = mm_alloc();
+	bprm->mm = mm = mm_alloc_and_setup();
 	err = -ENOMEM;
 	if (!mm)
 		goto err;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42b9b93a50ac..7955adc00397 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2922,7 +2922,12 @@ static inline unsigned long sigsp(unsigned long sp, struct ksignal *ksig)
 /*
  * Routines for handling mm_structs
  */
-extern struct mm_struct * mm_alloc(void);
+extern struct mm_struct *mm_setup(struct mm_struct *mm);
+extern struct mm_struct *mm_set_task(struct mm_struct *mm,
+				     struct task_struct *p,
+				     struct user_namespace *user_ns);
+extern struct mm_struct *mm_alloc(void);
+extern struct mm_struct *mm_alloc_and_setup(void);
 
 /* mmdrop drops the mm and the page tables */
 extern void __mmdrop(struct mm_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8ab827c..9209f6d5d7c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -747,8 +747,10 @@ static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
-	struct user_namespace *user_ns)
+/**
+ * Initialize all the task-unrelated fields of a mm_struct.
+ **/
+struct mm_struct *mm_setup(struct mm_struct *mm)
 {
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
@@ -767,24 +769,37 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	spin_lock_init(&mm->page_table_lock);
 	mm_init_cpumask(mm);
 	mm_init_aio(mm);
-	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
 
+	mm->flags = default_dump_filter;
+	mm->def_flags = 0;
+
+	if (mm_alloc_pgd(mm))
+		goto fail_nopgd;
+
+	return mm;
+
+fail_nopgd:
+	free_mm(mm);
+	return NULL;
+}
+
+/**
+ * Initialize all the task-related fields of a mm_struct.
+ **/
+struct mm_struct *mm_set_task(struct mm_struct *mm, struct task_struct *p,
+			      struct user_namespace *user_ns)
+{
 	if (current->mm) {
 		mm->flags = current->mm->flags & MMF_INIT_MASK;
 		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
-	} else {
-		mm->flags = default_dump_filter;
-		mm->def_flags = 0;
 	}
 
-	if (mm_alloc_pgd(mm))
-		goto fail_nopgd;
-
+	mm_init_owner(mm, p);
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
@@ -793,11 +808,21 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 
 fail_nocontext:
 	mm_free_pgd(mm);
-fail_nopgd:
 	free_mm(mm);
 	return NULL;
 }
 
+static struct mm_struct *mm_setup_all(struct mm_struct *mm,
+				      struct task_struct *p,
+				      struct user_namespace *user_ns)
+{
+	mm = mm_setup(mm);
+	if (!mm)
+		return NULL;
+
+	return mm_set_task(mm, p, user_ns);
+}
+
 static void check_mm(struct mm_struct *mm)
 {
 	int i;
@@ -822,10 +847,22 @@ static void check_mm(struct mm_struct *mm)
 #endif
 }
 
+struct mm_struct *mm_alloc(void)
+{
+	struct mm_struct *mm;
+
+	mm = allocate_mm();
+	if (!mm)
+		return NULL;
+
+	memset(mm, 0, sizeof(*mm));
+	return mm;
+}
+
 /*
  * Allocate and initialize an mm_struct.
  */
-struct mm_struct *mm_alloc(void)
+struct mm_struct *mm_alloc_and_setup(void)
 {
 	struct mm_struct *mm;
 
@@ -834,9 +871,10 @@ struct mm_struct *mm_alloc(void)
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));
-	return mm_init(mm, current, current_user_ns());
+	return mm_setup_all(mm, current, current_user_ns());
 }
 
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -1131,7 +1169,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
 
 	memcpy(mm, oldmm, sizeof(*mm));
 
-	if (!mm_init(mm, tsk, mm->user_ns))
+	if (!mm_setup_all(mm, tsk, mm->user_ns))
 		goto fail_nomem;
 
 	err = dup_mmap(mm, oldmm);
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 08/13] kernel/fork: Define explicitly which mm_struct to duplicate during fork
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (6 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init' Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 09/13] mm/memory: Add function to one-to-one duplicate page ranges Till Smejkal
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

The 'dup_mm' function, used during 'do_fork' to duplicate the current
task's mm_struct for the newly forked task, always implicitly uses
current->mm for this purpose. However, 'copy_mm' has already decided
which mm_struct to copy/duplicate by the time it calls 'dup_mm'. So
pass this mm_struct to 'dup_mm' instead of deciding again which
mm_struct to use.

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 kernel/fork.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9209f6d5d7c0..d3087d870855 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1158,9 +1158,10 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
  * Allocate a new mm structure and copy contents from the
  * mm structure of the passed in task structure.
  */
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk,
+				struct mm_struct *oldmm)
 {
-	struct mm_struct *mm, *oldmm = current->mm;
+	struct mm_struct *mm;
 	int err;
 
 	mm = allocate_mm();
@@ -1226,7 +1227,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	}
 
 	retval = -ENOMEM;
-	mm = dup_mm(tsk);
+	mm = dup_mm(tsk, oldmm);
 	if (!mm)
 		goto fail_nomem;
 
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 09/13] mm/memory: Add function to one-to-one duplicate page ranges
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (7 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 08/13] kernel/fork: Define explicitly which mm_struct to duplicate during fork Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 10/13] mm: Introduce first class virtual address spaces Till Smejkal
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Add a new function to duplicate a page table range of one memory map
one-to-one into another memory map. The new function 'dup_page_range'
copies the page table entries for the specified region from the page
table of the source memory map to the page table of the destination
memory map and thereby allows actual sharing of the referenced memory
pages, instead of relying on copy-on-write for anonymous memory pages
or on page faults for read-only memory pages as the existing function
'copy_page_range' does. Hence, 'dup_page_range' produces pages shared
between two address spaces, whereas 'copy_page_range' results in copies
of pages where necessary.

Preexisting mappings in the page table of the destination memory map
that differ from the ones in the source memory map are properly zapped
by 'dup_page_range' before they are replaced with the new ones.
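
As an illustration only (not part of this patch), a sketch of how a
caller might use 'dup_page_range' to mirror the mappings of one memory
map into another; the helper itself and the assumption that matching
VMAs already exist in the destination are hypothetical:

/* Share the pages backing every VMA of 'src' with 'dst'. */
static int mirror_mm(struct mm_struct *dst, struct mm_struct *src)
{
	struct vm_area_struct *src_vma, *dst_vma;
	int ret;

	for (src_vma = src->mmap; src_vma; src_vma = src_vma->vm_next) {
		/* Assume a matching VMA was already created in dst. */
		dst_vma = find_vma(dst, src_vma->vm_start);
		if (!dst_vma)
			return -EFAULT;

		/* Map the same pages instead of marking them COW. */
		ret = dup_page_range(dst, dst_vma, src, src_vma);
		if (ret)
			return ret;
	}

	return 0;
}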

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 include/linux/huge_mm.h |   6 +
 include/linux/hugetlb.h |   5 +
 include/linux/mm.h      |   2 +
 mm/huge_memory.c        |  65 +++++++
 mm/hugetlb.c            | 205 +++++++++++++++------
 mm/memory.c             | 461 +++++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 620 insertions(+), 124 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 94a0e680b7d7..52c0498426ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 			 struct vm_area_struct *vma);
+extern int dup_huge_pmd(struct mm_struct *dst_mm,
+			struct vm_area_struct *dst_vma,
+			struct mm_struct *src_mm,
+			struct vm_area_struct *src_vma,
+			struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd,
+			unsigned long addr);
 extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
 extern int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
 extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 72260cc252f2..d8eb682e39a1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -63,6 +63,10 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
 #endif
 
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+int dup_hugetlb_page_range(struct mm_struct *dst_mm,
+			   struct vm_area_struct *dst_vma,
+			   struct mm_struct *src_mm,
+			   struct vm_area_struct *src_vma);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 struct page **, struct vm_area_struct **,
 			 unsigned long *, unsigned long *, long, unsigned int);
@@ -134,6 +138,7 @@ static inline unsigned long hugetlb_total_pages(void)
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
+#define dup_hugetlb_page_range(dst, dst_vma, src, src_vma) ({ BUG(); 0; })
 static inline void hugetlb_report_meminfo(struct seq_file *m)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 92925d97da20..b39ec795f64c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1208,6 +1208,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
+int dup_page_range(struct mm_struct *dst, struct vm_area_struct *dst_vma,
+		   struct mm_struct *src, struct vm_area_struct *src_vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b2604867e5..1edf8c6d1814 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -887,6 +887,71 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }
 
+int dup_huge_pmd(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+		 struct mm_struct *src_mm, struct vm_area_struct *src_vma,
+		 struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd,
+		 unsigned long addr)
+{
+	spinlock_t *dst_ptl, *src_ptl;
+	struct page *page;
+	pmd_t pmd;
+	pgtable_t pgtable;
+	int ret;
+
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (!pgtable)
+		return -ENOMEM;
+
+	if (!pmd_none_or_clear_bad(dst_pmd) &&
+	    unlikely(zap_huge_pmd(tlb, dst_vma, dst_pmd, addr)))
+		return -ENOMEM;
+
+	dst_ptl = pmd_lock(dst_mm, dst_pmd);
+	src_ptl = pmd_lockptr(src_mm, src_pmd);
+	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+
+	if (!pmd_trans_huge(*src_pmd)) {
+		pte_free(dst_mm, pgtable);
+		ret = -EAGAIN;
+		goto out_unlock;
+	}
+
+	if (is_huge_zero_pmd(*src_pmd)) {
+		struct page *zero_page;
+
+		zero_page = mm_get_huge_zero_page(dst_mm);
+		set_huge_zero_page(pgtable, dst_mm, dst_vma, addr, dst_pmd,
+				   zero_page);
+
+		ret = 0;
+		goto out_unlock;
+	}
+
+	pmd = *src_pmd;
+
+	page = pmd_page(pmd);
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	get_page(page);
+	page_dup_rmap(page, true);
+
+	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	atomic_long_inc(&dst_mm->nr_ptes);
+	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+
+	if (!(dst_vma->vm_flags & VM_WRITE))
+		pmd = pmd_wrprotect(pmd);
+	pmd = pmd_mkold(pmd);
+
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	ret = 0;
+
+out_unlock:
+	spin_unlock(src_ptl);
+	spin_unlock(dst_ptl);
+
+	return ret;
+}
+
 void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	pmd_t entry;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c7025c132670..776c024de7c1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3286,6 +3286,74 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	return ret;
 }
 
+static inline int
+unmap_one_hugepage(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		   pte_t *ptep, unsigned long addr, struct page *ref_page)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t pte;
+	spinlock_t *ptl;
+	struct page *page;
+	struct hstate *h = hstate_vma(vma);
+
+	ptl = huge_pte_lock(h, mm, ptep);
+	if (huge_pmd_unshare(mm, &addr, ptep)) {
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	pte = huge_ptep_get(ptep);
+	if (huge_pte_none(pte)) {
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	/*
+	 * Migrating hugepage or HWPoisoned hugepage is already
+	 * unmapped and its refcount is dropped, so just clear pte here.
+	 */
+	if (unlikely(!pte_present(pte))) {
+		huge_pte_clear(mm, addr, ptep);
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	page = pte_page(pte);
+	/*
+	 * If a reference page is supplied, it is because a specific
+	 * page is being unmapped, not a range. Ensure the page we
+	 * are about to unmap is the actual page of interest.
+	 */
+	if (ref_page) {
+		if (page != ref_page) {
+			spin_unlock(ptl);
+			return 0;
+		}
+		/*
+		 * Mark the VMA as having unmapped its page so that
+		 * future faults in this VMA will fail rather than
+		 * looking like data was lost
+		 */
+		set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
+	}
+
+	pte = huge_ptep_get_and_clear(mm, addr, ptep);
+	tlb_remove_huge_tlb_entry(h, tlb, ptep, addr);
+	if (huge_pte_dirty(pte))
+		set_page_dirty(page);
+
+	hugetlb_count_sub(pages_per_huge_page(h), mm);
+	page_remove_rmap(page, true);
+
+	spin_unlock(ptl);
+	tlb_remove_page_size(tlb, page, huge_page_size(h));
+
+	/*
+	 * Bail out after unmapping reference page if supplied
+	 */
+	return ref_page ? 1 : 0;
+}
+
 void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			    unsigned long start, unsigned long end,
 			    struct page *ref_page)
@@ -3293,9 +3361,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
 	pte_t *ptep;
-	pte_t pte;
-	spinlock_t *ptl;
-	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
 	const unsigned long mmun_start = start;	/* For mmu_notifiers */
@@ -3318,62 +3383,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (!ptep)
 			continue;
 
-		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, &address, ptep)) {
-			spin_unlock(ptl);
-			continue;
-		}
-
-		pte = huge_ptep_get(ptep);
-		if (huge_pte_none(pte)) {
-			spin_unlock(ptl);
-			continue;
-		}
-
-		/*
-		 * Migrating hugepage or HWPoisoned hugepage is already
-		 * unmapped and its refcount is dropped, so just clear pte here.
-		 */
-		if (unlikely(!pte_present(pte))) {
-			huge_pte_clear(mm, address, ptep);
-			spin_unlock(ptl);
-			continue;
-		}
-
-		page = pte_page(pte);
-		/*
-		 * If a reference page is supplied, it is because a specific
-		 * page is being unmapped, not a range. Ensure the page we
-		 * are about to unmap is the actual page of interest.
-		 */
-		if (ref_page) {
-			if (page != ref_page) {
-				spin_unlock(ptl);
-				continue;
-			}
-			/*
-			 * Mark the VMA as having unmapped its page so that
-			 * future faults in this VMA will fail rather than
-			 * looking like data was lost
-			 */
-			set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
-		}
-
-		pte = huge_ptep_get_and_clear(mm, address, ptep);
-		tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
-		if (huge_pte_dirty(pte))
-			set_page_dirty(page);
-
-		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, true);
-
-		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, page, huge_page_size(h));
-		/*
-		 * Bail out after unmapping reference page if supplied
-		 */
-		if (ref_page)
+		if (unlikely(unmap_one_hugepage(tlb, vma, ptep, address,
+						ref_page)))
 			break;
+
 	}
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	tlb_end_vma(tlb, vma);
@@ -3411,6 +3424,82 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 	tlb_finish_mmu(&tlb, start, end);
 }
 
+int dup_hugetlb_page_range(struct mm_struct *dst_mm,
+			   struct vm_area_struct *dst_vma,
+			   struct mm_struct *src_mm,
+			   struct vm_area_struct *src_vma)
+{
+	pte_t *dst_pte, *src_pte;
+	struct mmu_gather tlb;
+	struct hstate *h = hstate_vma(dst_vma);
+	unsigned long addr;
+	unsigned long mmu_start = dst_vma->vm_start;
+	unsigned long mmu_end = dst_vma->vm_end;
+	unsigned long size = huge_page_size(h);
+	int ret;
+
+	tlb_gather_mmu(&tlb, dst_mm, mmu_start, mmu_end);
+	tlb_remove_check_page_size_change(&tlb, size);
+	mmu_notifier_invalidate_range_start(dst_mm, mmu_start, mmu_end);
+
+	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += size) {
+		pte_t pte;
+		spinlock_t *dst_ptl, *src_ptl;
+		struct page *page;
+
+		dst_pte = huge_pte_offset(dst_mm, addr);
+		src_pte = huge_pte_offset(src_mm, addr);
+
+		if (dst_pte == src_pte)
+			/* Just continue if the ptes are already equal. */
+			continue;
+		else if (dst_pte && !huge_pte_none(*dst_pte))
+			/*
+			 * ptes are not equal, so we have to get rid of the old
+			 * mapping in the destination page table.
+			 */
+			unmap_one_hugepage(&tlb, dst_vma, dst_pte, addr, NULL);
+
+		if (!src_pte || huge_pte_none(*src_pte))
+			continue;
+
+		dst_pte = huge_pte_alloc(dst_mm, addr, size);
+		if (!dst_pte) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		dst_ptl = huge_pte_lock(h, dst_mm, dst_pte);
+		src_ptl = huge_pte_lockptr(h, src_mm, src_pte);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+
+		pte = huge_ptep_get(src_pte);
+		page = pte_page(pte);
+		if (page) {
+			get_page(page);
+			page_dup_rmap(page, true);
+		}
+
+		if (likely(!is_hugetlb_entry_migration(pte) &&
+			   !is_hugetlb_entry_hwpoisoned(pte)))
+			hugetlb_count_add(pages_per_huge_page(h), dst_mm);
+
+		if (!(dst_vma->vm_flags & VM_WRITE))
+			pte = pte_wrprotect(pte);
+		pte = pte_mkold(pte);
+
+		set_huge_pte_at(dst_mm, addr, dst_pte, pte);
+
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+	}
+
+	mmu_notifier_invalidate_range_end(dst_mm, mmu_start, mmu_end);
+	tlb_finish_mmu(&tlb, mmu_start, mmu_end);
+
+	return ret;
+}
+
 /*
  * This is called when the original mapper is failing to COW a MAP_PRIVATE
  * mappping it owns the reserve page for. The intention is to unmap the page
diff --git a/mm/memory.c b/mm/memory.c
index 6bf2b471e30c..7026f2146fcd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1108,6 +1108,82 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }
 
+static unsigned long zap_one_pte(struct mmu_gather *tlb,
+				 struct vm_area_struct *vma, pte_t *pte,
+				 unsigned long *paddr, int *force_flush,
+				 int *rss, struct zap_details *details)
+{
+	unsigned long addr = *paddr;
+	struct mm_struct *mm = tlb->mm;
+	swp_entry_t entry;
+	pte_t ptent = *pte;
+
+	if (pte_present(ptent)) {
+		struct page *page;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (unlikely(details) && page) {
+			/*
+			 * unmap_shared_mapping_pages() wants to
+			 * invalidate cache without truncating:
+			 * unmap shared but keep private pages.
+			 */
+			if (details->check_mapping &&
+			    details->check_mapping != page_rmapping(page))
+				return 0;
+		}
+		ptent = ptep_get_and_clear_full(mm, addr, pte,
+						tlb->fullmm);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		if (unlikely(!page))
+			return 0;
+
+		if (!PageAnon(page)) {
+			if (pte_dirty(ptent)) {
+				/*
+				 * oom_reaper cannot tear down dirty
+				 * pages
+				 */
+				if (unlikely(details && details->ignore_dirty))
+					return 0;
+				*force_flush = 1;
+				set_page_dirty(page);
+			}
+			if (pte_young(ptent) &&
+			    likely(!(vma->vm_flags & VM_SEQ_READ)))
+				mark_page_accessed(page);
+		}
+		rss[mm_counter(page)]--;
+		page_remove_rmap(page, false);
+		if (unlikely(page_mapcount(page) < 0))
+			print_bad_pte(vma, addr, ptent, page);
+		if (unlikely(__tlb_remove_page(tlb, page))) {
+			*force_flush = 1;
+			*paddr += PAGE_SIZE;
+			return 1;
+		}
+		return 0;
+	}
+	/* only check swap_entries if explicitly asked for in details */
+	if (unlikely(details && !details->check_swap_entries))
+		return 0;
+
+	entry = pte_to_swp_entry(ptent);
+	if (!non_swap_entry(entry))
+		rss[MM_SWAPENTS]--;
+	else if (is_migration_entry(entry)) {
+		struct page *page;
+
+		page = migration_entry_to_page(entry);
+		rss[mm_counter(page)]--;
+	}
+	if (unlikely(!free_swap_and_cache(entry)))
+		print_bad_pte(vma, addr, ptent, NULL);
+	pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+
+	return 0;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1119,7 +1195,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	spinlock_t *ptl;
 	pte_t *start_pte;
 	pte_t *pte;
-	swp_entry_t entry;
 
 	tlb_remove_check_page_size_change(tlb, PAGE_SIZE);
 again:
@@ -1128,73 +1203,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte = start_pte;
 	arch_enter_lazy_mmu_mode();
 	do {
-		pte_t ptent = *pte;
-		if (pte_none(ptent)) {
-			continue;
-		}
-
-		if (pte_present(ptent)) {
-			struct page *page;
-
-			page = vm_normal_page(vma, addr, ptent);
-			if (unlikely(details) && page) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping &&
-				    details->check_mapping != page_rmapping(page))
-					continue;
-			}
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-			if (unlikely(!page))
-				continue;
-
-			if (!PageAnon(page)) {
-				if (pte_dirty(ptent)) {
-					/*
-					 * oom_reaper cannot tear down dirty
-					 * pages
-					 */
-					if (unlikely(details && details->ignore_dirty))
-						continue;
-					force_flush = 1;
-					set_page_dirty(page);
-				}
-				if (pte_young(ptent) &&
-				    likely(!(vma->vm_flags & VM_SEQ_READ)))
-					mark_page_accessed(page);
-			}
-			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
-				force_flush = 1;
-				addr += PAGE_SIZE;
-				break;
-			}
-			continue;
-		}
-		/* only check swap_entries if explicitly asked for in details */
-		if (unlikely(details && !details->check_swap_entries))
+		if (pte_none(*pte))
 			continue;
 
-		entry = pte_to_swp_entry(ptent);
-		if (!non_swap_entry(entry))
-			rss[MM_SWAPENTS]--;
-		else if (is_migration_entry(entry)) {
-			struct page *page;
-
-			page = migration_entry_to_page(entry);
-			rss[mm_counter(page)]--;
-		}
-		if (unlikely(!free_swap_and_cache(entry)))
-			print_bad_pte(vma, addr, ptent, NULL);
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		if (unlikely(zap_one_pte(tlb, vma, pte, &addr, &force_flush,
+					 rss, details)))
+			break;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
@@ -1445,6 +1459,321 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
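+/*
+ * Duplicate a single pte from src_mm into dst_mm. In contrast to
+ * copy_one_pte(), no copy-on-write is set up: the destination simply
+ * references the same page (or swap entry) as the source. A pre-existing,
+ * differing destination mapping is zapped first.
+ */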
+static inline int
+dup_one_pte(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+	    struct mm_struct *src_mm, struct vm_area_struct *src_vma,
+	    struct mmu_gather *tlb, pte_t *dst_pte, pte_t *src_pte,
+	    unsigned long addr, int *force_flush, int *rss)
+{
+	unsigned long raddr = addr;
+	pte_t pte = *src_pte;
+	struct page *page;
+
+	/*
+	 * If the ptes are already exactly the same, we don't have to do
+	 * anything.
+	 */
+	if (likely(src_pte == dst_pte))
+		return 0;
+
+	/* Remove the old mapping first */
+	if (!pte_none(*dst_pte) &&
+	    unlikely(zap_one_pte(tlb, dst_vma, dst_pte, &raddr, force_flush,
+				 rss, NULL)))
+		return -ENOMEM;
+
+	/* pte contains position in swap or file, so copy. */
+	if (unlikely(!pte_present(pte))) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if (likely(!non_swap_entry(entry))) {
+			if (swap_duplicate(entry) < 0)
+				return entry.val;
+
+			/* make sure dst_mm is on swapoff's mmlist. */
+			if (unlikely(list_empty(&dst_mm->mmlist))) {
+				spin_lock(&mmlist_lock);
+				if (list_empty(&dst_mm->mmlist))
+					list_add(&dst_mm->mmlist,
+							&src_mm->mmlist);
+				spin_unlock(&mmlist_lock);
+			}
+			rss[MM_SWAPENTS]++;
+		} else if (is_migration_entry(entry)) {
+			page = migration_entry_to_page(entry);
+
+			rss[mm_counter(page)]++;
+		}
+		goto out_set_pte;
+	}
+
+	pte = pte_mkold(pte);
+
+	page = vm_normal_page(src_vma, addr, pte);
+	if (page) {
+		get_page(page);
+		page_dup_rmap(page, false);
+		rss[mm_counter(page)]++;
+	}
+
+out_set_pte:
+	if (!(dst_vma->vm_flags & VM_WRITE))
+		pte = pte_wrprotect(pte);
+
+	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return 0;
+}
+
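+/*
+ * Walk one pte page of source and destination in lockstep, duplicating
+ * present and swap entries and zapping destination entries that have no
+ * counterpart in the source. Mirrors the structure of copy_pte_range(),
+ * including the periodic lock break driven by the 'progress' counter.
+ */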
+static inline int
+dup_pte_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+	      struct mm_struct *src_mm, struct vm_area_struct *src_vma,
+	      struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd,
+	      unsigned long addr, unsigned long end)
+{
+	pte_t *orig_dst_pte, *orig_src_pte;
+	pte_t *dst_pte, *src_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	int force_flush = 0;
+	int progress = 0;
+	int rss[NR_MM_COUNTERS];
+	swp_entry_t entry = (swp_entry_t){0};
+
+again:
+	init_rss_vec(rss);
+
+	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
+	if (!dst_pte)
+		return -ENOMEM;
+	src_pte = pte_offset_map(src_pmd, addr);
+	src_ptl = pte_lockptr(src_mm, src_pmd);
+	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+	orig_dst_pte = dst_pte;
+	orig_src_pte = src_pte;
+
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		/* Make sure that we are not holding the locks too long. */
+		if (progress >= 32) {
+			progress = 0;
+			if (need_resched() || spin_needbreak(src_ptl) ||
+			    spin_needbreak(dst_ptl))
+				break;
+		}
+
+		if (pte_none(*src_pte) && pte_none(*dst_pte)) {
+			progress++;
+			continue;
+		} else if (pte_none(*src_pte)) {
+			unsigned long raddr = addr;
+			int ret;
+
+			ret = zap_one_pte(tlb, dst_vma, dst_pte, &raddr,
+					  &force_flush, rss, NULL);
+			pte_clear(dst_mm, addr, dst_pte);
+
+			progress += 8;
+			if (ret)
+				break;
+
+			continue;
+		}
+
+		entry.val = dup_one_pte(dst_mm, dst_vma, src_mm, src_vma,
+					tlb, dst_pte, src_pte, addr,
+					&force_flush, rss);
+
+		if (entry.val)
+			break;
+		progress += 8;
+	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(src_ptl);
+	pte_unmap(orig_src_pte);
+	add_mm_rss_vec(dst_mm, rss);
+
+	/* Do the TLB flush before unlocking the destination ptl */
+	if (force_flush)
+		tlb_flush_mmu_tlbonly(tlb);
+	pte_unmap_unlock(orig_dst_pte, dst_ptl);
+
+	/* If we forced a TLB flush, also free the batched memory now. */
+	if (force_flush) {
+		force_flush = 0;
+		tlb_flush_mmu_free(tlb);
+	}
+
+	cond_resched();
+	if (entry.val) {
+		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+			return -ENOMEM;
+		progress = 0;
+	}
+	if (addr != end)
+		goto again;
+
+	return 0;
+}
+
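+/*
+ * Duplicate one pmd level. Ranges that are only mapped in the destination
+ * are freed, transparent huge pmds are handed to dup_huge_pmd() (after any
+ * conflicting regular mapping in the destination has been torn down), and
+ * everything else is handled by dup_pte_range().
+ */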
+static inline int
+dup_pmd_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+	      struct mm_struct *src_mm, struct vm_area_struct *src_vma,
+	      struct mmu_gather *tlb, pud_t *dst_pud, pud_t *src_pud,
+	      unsigned long addr, unsigned long end)
+{
+	pmd_t *dst_pmd, *src_pmd;
+	unsigned long next;
+
+	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
+	if (!dst_pmd)
+		return -ENOMEM;
+	src_pmd = pmd_offset(src_pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+
+		if (pmd_none_or_clear_bad(src_pmd) &&
+		    pmd_none_or_clear_bad(dst_pmd)) {
+			continue;
+		} else if (pmd_none_or_clear_bad(src_pmd)) {
+			/* src unmapped, but dst not --> free dst too */
+			zap_pte_range(tlb, dst_vma, dst_pmd, addr, next, NULL);
+			free_pte_range(tlb, dst_pmd, addr);
+
+			continue;
+		} else if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+			int err;
+
+			VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
+
+			/*
+			 * We may need to unmap the content of the destination
+			 * page table first. So check this here, because
+			 * inside dup_huge_pmd we cannot do it anymore.
+			 */
+			if (unlikely(!pmd_trans_huge(*dst_pmd) &&
+				     !pmd_devmap(*dst_pmd) &&
+				     !pmd_none_or_clear_bad(dst_pmd))) {
+				zap_pte_range(tlb, dst_vma, dst_pmd, addr, next,
+					      NULL);
+				free_pte_range(tlb, dst_pmd, addr);
+			}
+
+			err = dup_huge_pmd(dst_mm, dst_vma, src_mm, src_vma,
+					   tlb, dst_pmd, src_pmd, addr);
+
+			if (err == -ENOMEM)
+				return -ENOMEM;
+			if (!err)
+				continue;
+			/* explicit fall through */
+
+		}
+
+		if (unlikely(dup_pte_range(dst_mm, dst_vma, src_mm, src_vma,
+					   tlb, dst_pmd, src_pmd, addr, next)))
+			return -ENOMEM;
+	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
+
+	return 0;
+}
+
+static inline int
+dup_pud_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+	      struct mm_struct *src_mm, struct vm_area_struct *src_vma,
+	      struct mmu_gather *tlb, pgd_t *dst_pgd, pgd_t *src_pgd,
+	      unsigned long addr, unsigned long end)
+{
+	pud_t *dst_pud, *src_pud;
+	unsigned long next;
+
+	dst_pud = pud_alloc(dst_mm, dst_pgd, addr);
+	if (!dst_pud)
+		return -ENOMEM;
+	src_pud = pud_offset(src_pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(src_pud) &&
+		    pud_none_or_clear_bad(dst_pud)) {
+			continue;
+		} else if (pud_none_or_clear_bad(src_pud)) {
+			/* src is unmapped, but dst not --> free dst too */
+			zap_pmd_range(tlb, dst_vma, dst_pud, addr, next, NULL);
+			free_pmd_range(tlb, dst_pud, addr, next, addr, next);
+
+			continue;
+		}
+
+		if (unlikely(dup_pmd_range(dst_mm, dst_vma, src_mm, src_vma,
+					   tlb, dst_pud, src_pud, addr, next)))
+			return -ENOMEM;
+	} while (dst_pud++, src_pud++, addr = next, addr != end);
+
+	return 0;
+}
+
+/**
+ * One-to-one duplicate the page table entries of one memory map into another
+ * memory map. After this function, the destination memory map has exactly the
+ * same page table entries for the specified region as the source memory map.
+ * Preexisting mappings in the destination memory map that differ from the
+ * source are removed before they are overwritten.
+ *
+ * The difference between this function and copy_page_range() is that
+ * copy_page_range() copies the underlying memory pages where necessary (e.g.
+ * for anonymous memory) with the help of copy-on-write, while dup_page_range()
+ * only duplicates the page table entries and hence lets both memory maps
+ * share the referenced memory pages.
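+ *
+ * Note that both VMAs are expected to describe the same virtual address
+ * range: the page table walk is driven by src_vma's range, while the
+ * mmu_gather and mmu_notifier window are taken from dst_vma's range.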
+ **/
+int dup_page_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+		   struct mm_struct *src_mm, struct vm_area_struct *src_vma)
+{
+	pgd_t *dst_pgd, *src_pgd;
+	struct mmu_gather tlb;
+	unsigned long next;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
+	unsigned long mmu_start = dst_vma->vm_start;
+	unsigned long mmu_end = dst_vma->vm_end;
+	int ret = 0;
+
+	if (is_vm_hugetlb_page(src_vma))
+		return dup_hugetlb_page_range(dst_mm, dst_vma, src_mm,
+					      src_vma);
+
+	tlb_gather_mmu(&tlb, dst_mm, mmu_start, mmu_end);
+	mmu_notifier_invalidate_range_start(dst_mm, mmu_start, mmu_end);
+
+	dst_pgd = pgd_offset(dst_mm, addr);
+	src_pgd = pgd_offset(src_mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(src_pgd) &&
+		    pgd_none_or_clear_bad(dst_pgd)) {
+			continue;
+		} else if (pgd_none_or_clear_bad(src_pgd)) {
+			/* src is unmapped, but dst not --> free dst too */
+			zap_pud_range(&tlb, dst_vma, dst_pgd, addr, next, NULL);
+			free_pud_range(&tlb, dst_pgd, addr, next, addr, next);
+
+			continue;
+		}
+
+		if (unlikely(dup_pud_range(dst_mm, dst_vma, src_mm, src_vma,
+					   &tlb, dst_pgd, src_pgd, addr,
+					   next))) {
+			ret = -ENOMEM;
+			break;
+		}
+	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	mmu_notifier_invalidate_range_end(dst_mm, mmu_start, mmu_end);
+	tlb_finish_mmu(&tlb, mmu_start, mmu_end);
+
+	return ret;
+}
+
 pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 			spinlock_t **ptl)
 {
-- 
2.12.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 10/13] mm: Introduce first class virtual address spaces
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (8 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 09/13] mm/memory: Add function to one-to-one duplicate page ranges Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 23:52   ` Greg Kroah-Hartman
  2017-03-14  1:35   ` Vineet Gupta
  2017-03-13 22:14 ` [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions Till Smejkal
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Introduce a new type of address space which is a first class citizen in the
OS. That means that the kernel now handles two types of AS: those which are
closely coupled with a process and those which aren't. While the former are
created and destroyed together with the process by the kernel and are the
default type of AS in the Linux kernel, the latter have to be managed
explicitly by the user and are the newly introduced type.

Accordingly, a first class AS (also called VAS == virtual address space)
can exist in the OS independently of any process. Users have to create and
destroy them explicitly. Processes and VAS are combined by attaching a
previously created VAS to a process, which adds an additional AS that the
process' threads are able to execute in. Hence, VAS allow a process to have
different views onto the system's main memory (its original AS and the
attached VAS) between which its threads can switch arbitrarily during their
lifetime.

The functionality made available through first class virtual address spaces
can be used in various ways. One possible way to utilize VAS is to
compartmentalize a process for security reasons. Another is to improve the
performance of data-centric applications by managing different sets of data
in memory without the need to map or unmap them.

Furthermore, a first class virtual address space can be attached to multiple
processes at the same time if the underlying memory is only readable. This
mechanism allows whole address spaces to be shared between multiple
processes, all of which can execute in them and use the contained memory.
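
To illustrate the intended usage, here is a minimal, purely illustrative
user-space sketch (not part of this patch). It assumes the x86-64 syscall
numbers from the table added below and that the 'type' argument of
vas_attach() takes the usual O_* access-mode bits, as suggested by
__build_vas_access_type() in mm/vas.c:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Syscall numbers from arch/x86/entry/syscalls/syscall_64.tbl below. */
	#define NR_vas_create	332
	#define NR_vas_attach	335
	#define NR_vas_switch	337

	int main(void)
	{
		/* Create a new VAS named "example" with mode 0600. */
		long vid = syscall(NR_vas_create, "example", 0600);

		if (vid < 0) {
			perror("vas_create");
			return 1;
		}

		/* Attach the VAS read-write to the calling process ... */
		if (syscall(NR_vas_attach, getpid(), (int)vid, O_RDWR) < 0) {
			perror("vas_attach");
			return 1;
		}

		/* ... and switch the calling thread into it. */
		if (syscall(NR_vas_switch, (int)vid) < 0) {
			perror("vas_switch");
			return 1;
		}

		return 0;
	}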

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
---
 MAINTAINERS                            |   10 +
 arch/x86/entry/syscalls/syscall_32.tbl |    9 +
 arch/x86/entry/syscalls/syscall_64.tbl |    9 +
 fs/exec.c                              |    3 +
 include/linux/mm_types.h               |    8 +
 include/linux/sched.h                  |   17 +
 include/linux/syscalls.h               |   11 +
 include/linux/vas.h                    |  182 +++
 include/linux/vas_types.h              |   88 ++
 include/uapi/asm-generic/unistd.h      |   20 +-
 include/uapi/linux/Kbuild              |    1 +
 include/uapi/linux/vas.h               |   16 +
 init/main.c                            |    2 +
 kernel/exit.c                          |    2 +
 kernel/fork.c                          |   28 +-
 kernel/sys_ni.c                        |   11 +
 mm/Kconfig                             |   20 +
 mm/Makefile                            |    1 +
 mm/internal.h                          |    8 +
 mm/memory.c                            |    3 +
 mm/mmap.c                              |   22 +
 mm/vas.c                               | 2188 ++++++++++++++++++++++++++++++++
 22 files changed, 2657 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/vas.h
 create mode 100644 include/linux/vas_types.h
 create mode 100644 include/uapi/linux/vas.h
 create mode 100644 mm/vas.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 527d13759ecc..060b1c64e67a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5040,6 +5040,16 @@ F:	Documentation/firmware_class/
 F:	drivers/base/firmware*.c
 F:	include/linux/firmware.h
 
+FIRST CLASS VIRTUAL ADDRESS SPACES
+M:	Till Smejkal <till.smejkal@gmail.com>
+L:	linux-kernel@vger.kernel.org
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	include/linux/vas_types.h
+F:	include/linux/vas.h
+F:	include/uapi/linux/vas.h
+F:	mm/vas.c
+
 FLASH ADAPTER DRIVER (IBM Flash Adapter 900GB Full Height PCI Flash Card)
 M:	Joshua Morris <josh.h.morris@us.ibm.com>
 M:	Philip Kelleher <pjk1939@linux.vnet.ibm.com>
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2b3618542544..8c553eef8c44 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -389,3 +389,12 @@
 380	i386	pkey_mprotect		sys_pkey_mprotect
 381	i386	pkey_alloc		sys_pkey_alloc
 382	i386	pkey_free		sys_pkey_free
+383	i386	vas_create		sys_vas_create
+384	i386	vas_delete		sys_vas_delete
+385	i386	vas_find		sys_vas_find
+386	i386	vas_attach		sys_vas_attach
+387	i386	vas_detach		sys_vas_detach
+388	i386	vas_switch		sys_vas_switch
+389	i386	active_vas		sys_active_vas
+390	i386	vas_getattr		sys_vas_getattr
+391	i386	vas_setattr		sys_vas_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index e93ef0b38db8..72f1f0495710 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -338,6 +338,15 @@
 329	common	pkey_mprotect		sys_pkey_mprotect
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
+332	common	vas_create		sys_vas_create
+333	common	vas_delete		sys_vas_delete
+334	common	vas_find		sys_vas_find
+335	common	vas_attach		sys_vas_attach
+336	common	vas_detach		sys_vas_detach
+337	common	vas_switch		sys_vas_switch
+338	common	active_vas		sys_active_vas
+339	common	vas_getattr		sys_vas_getattr
+340	common	vas_setattr		sys_vas_setattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index 68d7908a1e5a..e1ac0a8c76bf 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1020,6 +1020,9 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+#ifdef CONFIG_VAS
+	tsk->original_mm = mm;
+#endif
 	activate_mm(active_mm, mm);
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6aa03e88dcff..82bf78ea83ee 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/ktime.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -358,6 +359,10 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_VAS
+	struct mm_struct *vas_reference;
+	ktime_t vas_last_update;
+#endif
 };
 
 struct core_thread {
@@ -514,6 +519,9 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#ifdef CONFIG_VAS
+	ktime_t vas_last_update;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7955adc00397..216876912e77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1508,6 +1508,18 @@ struct tlbflush_unmap_batch {
 	bool writable;
 };
 
+/* Shared information of attached VASes between processes */
+#ifdef CONFIG_VAS
+struct vas_context {
+	spinlock_t lock;
+	u16 refcount;			/* < the number of tasks using this   *
+					 *   VAS context.                     */
+
+	struct list_head vases;		/* < the list of attached VASes which *
+					 *   are handled by this VAS context. */
+};
+#endif
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1583,6 +1595,11 @@ struct task_struct {
 #endif
 
 	struct mm_struct *mm, *active_mm;
+#ifdef CONFIG_VAS
+	struct mm_struct *original_mm;
+	struct vas_context *vas_ctx;
+	int active_vas;
+#endif
 	/* per-thread vma caching */
 	u32 vmacache_seqnum;
 	struct vm_area_struct *vmacache[VMACACHE_SIZE];
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 91a740f6b884..fdea27d37c96 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -65,6 +65,7 @@ struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
+struct vas_attr;
 union bpf_attr;
 
 #include <linux/types.h>
@@ -903,4 +904,14 @@ asmlinkage long sys_pkey_mprotect(unsigned long start, size_t len,
 asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
 asmlinkage long sys_pkey_free(int pkey);
 
+asmlinkage long sys_vas_create(const char __user *name, umode_t mode);
+asmlinkage long sys_vas_delete(int vid);
+asmlinkage long sys_vas_find(const char __user *name);
+asmlinkage long sys_vas_attach(pid_t pid, int vid, int type);
+asmlinkage long sys_vas_detach(pid_t pid, int vid);
+asmlinkage long sys_vas_switch(int vid);
+asmlinkage long sys_active_vas(void);
+asmlinkage long sys_vas_getattr(int vid, struct vas_attr __user *attr);
+asmlinkage long sys_vas_setattr(int vid, struct vas_attr __user *attr);
+
 #endif
diff --git a/include/linux/vas.h b/include/linux/vas.h
new file mode 100644
index 000000000000..6a72e42f96d2
--- /dev/null
+++ b/include/linux/vas.h
@@ -0,0 +1,182 @@
+#ifndef _LINUX_VAS_H
+#define _LINUX_VAS_H
+
+
+#include <linux/sched.h>
+#include <linux/vas_types.h>
+
+
+/***
+ * General management of the VAS subsystem
+ ***/
+
+#ifdef CONFIG_VAS
+
+/***
+ * Management of VASes
+ ***/
+
+/**
+ * Lock and unlock helper for VAS.
+ **/
+#define vas_lock(vas) mutex_lock(&(vas)->mtx)
+#define vas_unlock(vas) mutex_unlock(&(vas)->mtx)
+
+/**
+ * Create a new VAS.
+ *
+ * @param[in] name:		The name of the new VAS.
+ * @param[in] mode:		The access rights for the VAS.
+ *
+ * @returns:			The VAS ID on success, -ERRNO otherwise.
+ **/
+extern int vas_create(const char *name, umode_t mode);
+
+/**
+ * Get a pointer to a VAS data structure.
+ *
+ * @param[in] vid:		The ID of the VAS whose data structure should be
+ *				returned.
+ *
+ * @returns:			The pointer to the VAS data structure on
+ *				success, or NULL otherwise.
+ **/
+extern struct vas *vas_get(int vid);
+
+/**
+ * Release a reference to a VAS data structure previously obtained via
+ * vas_get().
+ *
+ * @param[in] vas:		The pointer to the VAS data structure whose
+ *				reference should be dropped.
+ **/
+extern void vas_put(struct vas *vas);
+
+/**
+ * Get the ID of the VAS belonging to the given name.
+ *
+ * @param[in] name:		The name of the VAS for which the ID should be
+ *				returned.
+ *
+ * @returns:			The VAS ID on success, -ERRNO otherwise.
+ **/
+extern int vas_find(const char *name);
+
+/**
+ * Delete the given VAS.
+ *
+ * @param[in] vid:		The ID of the VAS which should be deleted.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_delete(int vid);
+
+/**
+ * Attach a VAS to a process.
+ *
+ * @param[in] tsk:		The task to which the VAS should be attached to.
+ * @param[in] vid:		The ID of the VAS which should be attached.
+ * @param[in] type:		The access type with which the VAS should be
+ *				attached (read-only, write-only or read-write).
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_attach(struct task_struct *tsk, int vid, int type);
+
+/**
+ * Detach a VAS from a process.
+ *
+ * @param[in] tsk:		The task from which the VAS should be detached.
+ * @param[in] vid:		The ID of the VAS which should be detached.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_detach(struct task_struct *tsk, int vid);
+
+/**
+ * Switch to a different VAS.
+ *
+ * @param[in] tsk:		The task for which the VAS should be switched.
+ * @param[in] vid:		The ID of the VAS which should be activated.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_switch(struct task_struct *tsk, int vid);
+
+/**
+ * Get attributes of a VAS.
+ *
+ * @param[in] vid:		The ID of the VAS for which the attributes
+ *				should be returned.
+ * @param[out] attr:		The pointer to the struct where the attributes
+ *				should be saved.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_getattr(int vid, struct vas_attr *attr);
+
+/**
+ * Set attributes of a VAS.
+ *
+ * @param[in] vid:		The ID of the VAS for which the attributes
+ *				should be updated.
+ * @param[in] attr:		The pointer to the struct containing the new
+ *				attributes.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_setattr(int vid, struct vas_attr *attr);
+
+
+/***
+ * Management of VAS contexts
+ ***/
+
+/**
+ * Lock and unlock helper for VAS contexts.
+ **/
+#define vas_context_lock(ctx) spin_lock(&(ctx)->lock)
+#define vas_context_unlock(ctx) spin_unlock(&(ctx)->lock)
+
+
+/***
+ * Management of the VAS subsystem
+ ***/
+
+/**
+ * Initialize the VAS subsystem
+ **/
+extern void vas_init(void);
+
+
+/***
+ * Management of the VAS subsystem during fork and exit
+ ***/
+
+/**
+ * Initialize the task-specific VAS data structures during the clone system
+ * call.
+ *
+ * @param[in] clone_flags:	The flags which were given to the system call by
+ *				the user.
+ * @param[in] tsk:		The new task which should be initialized.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_clone(int clone_flags, struct task_struct *tsk);
+
+/**
+ * Destroy the task-specific VAS data structures during the exit system call.
+ *
+ * @param[in] tsk:		The task for which data structures should be
+ *				destructed.
+ **/
+extern void vas_exit(struct task_struct *tsk);
+
+#else /* CONFIG_VAS */
+
+static inline void __init vas_init(void) {}
+static inline int vas_clone(int cf, struct task_struct *tsk) { return 0; }
+static inline void vas_exit(struct task_struct *tsk) {}
+
+#endif /* CONFIG_VAS */
+
+#endif
diff --git a/include/linux/vas_types.h b/include/linux/vas_types.h
new file mode 100644
index 000000000000..f06bfa9ef729
--- /dev/null
+++ b/include/linux/vas_types.h
@@ -0,0 +1,88 @@
+#ifndef _LINUX_VAS_TYPES_H
+#define _LINUX_VAS_TYPES_H
+
+#include <uapi/linux/vas.h>
+
+#include <linux/kobject.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/spinlock_types.h>
+#include <linux/types.h>
+
+
+#define VAS_MAX_NAME_LENGTH 256
+
+#define VAS_IS_ERROR(id) ((id) < 0)
+
+/**
+ * Forward declare various important shared data structures.
+ **/
+struct mm_struct;
+struct task_struct;
+
+/**
+ * The struct representing a Virtual Address Space (VAS).
+ *
+ * This data structure contains all the necessary information about a VAS,
+ * such as its name and ID, as well as access rights and other management
+ * information.
+ **/
+struct vas {
+	struct kobject kobj;		/* < the internal kobject that we use *
+					 *   for reference counting and sysfs *
+					 *   handling.                        */
+
+	int id;				/* < ID                               */
+	char name[VAS_MAX_NAME_LENGTH];	/* < name                             */
+
+	struct mutex mtx;		/* < lock for parallel access.        */
+
+	struct mm_struct *mm;		/* < a partial memory map containing  *
+					 *   all mappings of this VAS.        */
+
+	struct list_head link;		/* < the link in the global VAS list. */
+	struct rcu_head rcu;		/* < the RCU helper used for          *
+					 *   asynchronous VAS deletion.       */
+
+	u16 refcount;			/* < how often is the VAS attached.   */
+	struct list_head attaches;	/* < the list of tasks which have     *
+					 *   this VAS attached.               */
+
+	spinlock_t share_lock;		/* < lock for protecting sharing      *
+					 *   state.                           */
+	u32 sharing;			/* < the variable used to keep track  *
+					 *   of the current sharing state of  *
+					 *   the VAS.                         */
+
+	umode_t mode;			/* < the access rights to this VAS.   */
+	kuid_t uid;			/* < the UID of the owning user of    *
+					 *   this VAS.                        */
+	kgid_t gid;			/* < the GID of the owning group of   *
+					 *   this VAS.                        */
+};
+
+/**
+ * The struct representing a VAS being attached to a process.
+ *
+ * Once a VAS is attached to a process, additional per-attachment information
+ * is necessary. This data structure contains all of this information, which
+ * makes using a VAS fast and easy.
+ **/
+struct att_vas {
+	struct vas *vas;		/* < the reference to the actual VAS  *
+					 *   containing all the information.  */
+
+	struct task_struct *tsk;	/* < the reference to the task to     *
+					 *   which the VAS is attached to.    */
+
+	struct mm_struct *mm;		/* < the backing memory map.          */
+
+	struct list_head tsk_link;	/* < the link in the list managed     *
+					 *   inside the task.                 */
+	struct list_head vas_link;	/* < the link in the list managed     *
+					 *   inside the VAS.                  */
+
+	int type;			/* < the type of attaching (RO/RW).   */
+};
+
+#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 9b1462e38b82..35df7d40a443 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -730,9 +730,27 @@ __SYSCALL(__NR_pkey_mprotect, sys_pkey_mprotect)
 __SYSCALL(__NR_pkey_alloc,    sys_pkey_alloc)
 #define __NR_pkey_free 290
 __SYSCALL(__NR_pkey_free,     sys_pkey_free)
+#define __NR_vas_create 291
+__SYSCALL(__NR_vas_create, sys_vas_create)
+#define __NR_vas_delete 292
+__SYSCALL(__NR_vas_delete, sys_vas_delete)
+#define __NR_vas_find 293
+__SYSCALL(__NR_vas_find, sys_vas_find)
+#define __NR_vas_attach 294
+__SYSCALL(__NR_vas_attach, sys_vas_attach)
+#define __NR_vas_detach 295
+__SYSCALL(__NR_vas_detach, sys_vas_detach)
+#define __NR_vas_switch 296
+__SYSCALL(__NR_vas_switch, sys_vas_switch)
+#define __NR_active_vas 297
+__SYSCALL(__NR_active_vas, sys_active_vas)
+#define __NR_vas_getattr 298
+__SYSCALL(__NR_vas_getattr, sys_vas_getattr)
+#define __NR_vas_setattr 299
+__SYSCALL(__NR_vas_setattr, sys_vas_setattr)
 
 #undef __NR_syscalls
-#define __NR_syscalls 291
+#define __NR_syscalls 300
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index f330ba4547cf..5666900bdf06 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -446,6 +446,7 @@ header-y += v4l2-controls.h
 header-y += v4l2-dv-timings.h
 header-y += v4l2-mediabus.h
 header-y += v4l2-subdev.h
+header-y += vas.h
 header-y += veth.h
 header-y += vfio.h
 header-y += vhost.h
diff --git a/include/uapi/linux/vas.h b/include/uapi/linux/vas.h
new file mode 100644
index 000000000000..02f70f88bdcb
--- /dev/null
+++ b/include/uapi/linux/vas.h
@@ -0,0 +1,16 @@
+#ifndef _UAPI_LINUX_VAS_H
+#define _UAPI_LINUX_VAS_H
+
+#include <linux/types.h>
+
+
+/**
+ * The struct containing attributes of a VAS.
+ **/
+struct vas_attr {
+	__kernel_mode_t mode;		/* < the access rights to the VAS.    */
+	__kernel_uid_t user;		/* < the owning user of the VAS.      */
+	__kernel_gid_t group;		/* < the owning group of the VAS.     */
+};
+
+#endif
diff --git a/init/main.c b/init/main.c
index b0c9d6facef9..16f33b04f8ea 100644
--- a/init/main.c
+++ b/init/main.c
@@ -82,6 +82,7 @@
 #include <linux/proc_ns.h>
 #include <linux/io.h>
 #include <linux/cache.h>
+#include <linux/vas.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -538,6 +539,7 @@ asmlinkage __visible void __init start_kernel(void)
 	sort_main_extable();
 	trap_init();
 	mm_init();
+	vas_init();
 
 	/*
 	 * Set up the scheduler prior starting any interrupts (such as the
diff --git a/kernel/exit.c b/kernel/exit.c
index 8f14b866f9f6..b9687ea70a5b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -55,6 +55,7 @@
 #include <linux/shm.h>
 #include <linux/kcov.h>
 #include <linux/random.h>
+#include <linux/vas.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -823,6 +824,7 @@ void __noreturn do_exit(long code)
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
 
+	vas_exit(tsk);
 	exit_mm(tsk);
 
 	if (group_dead)
diff --git a/kernel/fork.c b/kernel/fork.c
index d3087d870855..292299c7995e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,6 +76,8 @@
 #include <linux/compiler.h>
 #include <linux/sysctl.h>
 #include <linux/kcov.h>
+#include <linux/vas.h>
+#include <linux/timekeeping.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -781,6 +783,10 @@ struct mm_struct *mm_setup(struct mm_struct *mm)
 	if (mm_alloc_pgd(mm))
 		goto fail_nopgd;
 
+#ifdef CONFIG_VAS
+	mm->vas_last_update = ktime_get();
+#endif
+
 	return mm;
 
 fail_nopgd:
@@ -1207,6 +1213,9 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 
 	tsk->mm = NULL;
 	tsk->active_mm = NULL;
+#ifdef CONFIG_VAS
+	tsk->original_mm = NULL;
+#endif
 
 	/*
 	 * Are we cloning a kernel thread?
@@ -1217,6 +1226,15 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldmm)
 		return 0;
 
+#ifdef CONFIG_VAS
+	/*
+	 * Never fork the address space of a VAS but use the process'
+	 * original one.
+	 */
+	if (oldmm != current->original_mm)
+		oldmm = current->original_mm;
+#endif
+
 	/* initialize the new vmacache entries */
 	vmacache_flush(tsk);
 
@@ -1234,6 +1252,9 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 good_mm:
 	tsk->mm = mm;
 	tsk->active_mm = mm;
+#ifdef CONFIG_VAS
+	tsk->original_mm = mm;
+#endif
 	return 0;
 
 fail_nomem:
@@ -1700,9 +1721,12 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_mm(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_signal;
-	retval = copy_namespaces(clone_flags, p);
+	retval = vas_clone(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_mm;
+	retval = copy_namespaces(clone_flags, p);
+	if (retval)
+		goto bad_fork_cleanup_vas;
 	retval = copy_io(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
@@ -1885,6 +1909,8 @@ static __latent_entropy struct task_struct *copy_process(
 		exit_io_context(p);
 bad_fork_cleanup_namespaces:
 	exit_task_namespaces(p);
+bad_fork_cleanup_vas:
+	vas_exit(p);
 bad_fork_cleanup_mm:
 	if (p->mm)
 		mmput(p->mm);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 8acef8576ce9..f6f83c5ec1a1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -258,3 +258,14 @@ cond_syscall(sys_membarrier);
 cond_syscall(sys_pkey_mprotect);
 cond_syscall(sys_pkey_alloc);
 cond_syscall(sys_pkey_free);
+
+/* first class virtual address spaces */
+cond_syscall(sys_vas_create);
+cond_syscall(sys_vas_delete);
+cond_syscall(sys_vas_find);
+cond_syscall(sys_vas_attach);
+cond_syscall(sys_vas_detach);
+cond_syscall(sys_vas_switch);
+cond_syscall(sys_active_vas);
+cond_syscall(sys_vas_getattr);
+cond_syscall(sys_vas_setattr);
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb969dc..9a80877f3536 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -707,3 +707,23 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+
+config VAS
+	bool "Support for First Class Virtual Address Spaces"
+	default n
+	help
+	  Support for First Class Virtual Address Spaces, which are address
+	  spaces that are not bound to the lifetime of any process but can
+	  exist independently in the system. With this feature, processes are
+	  allowed to have multiple address spaces between which their threads
+	  can switch arbitrarily.
+
+	  If unsure, say N.
+
+config VAS_DEBUG
+	bool "Debugging output for First Class Virtual Address Spaces"
+	depends on VAS
+	default n
+	help
+	  Enable extensive debugging output for the First Class Virtual Address
+	  Spaces feature.
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..ba8995e944d7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -100,3 +100,4 @@ obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
+obj-$(CONFIG_VAS)	+= vas.o
diff --git a/mm/internal.h b/mm/internal.h
index e22cb031b45b..f947e8c50bae 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -499,4 +499,12 @@ extern const struct trace_print_flags pageflag_names[];
 extern const struct trace_print_flags vmaflag_names[];
 extern const struct trace_print_flags gfpflag_names[];
 
+#ifdef CONFIG_VAS
+void mm_updated(struct mm_struct *mm);
+void vm_area_updated(struct vm_area_struct *vma);
+#else
+static inline void mm_updated(struct mm_struct *mm) {}
+static inline void vm_area_updated(struct vm_area_struct *vma) {}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 7026f2146fcd..e4747b3fd5b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4042,6 +4042,9 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
 		ret = VM_FAULT_SIGBUS;
 
+	if (ret)
+		vm_area_updated(vma);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
diff --git a/mm/mmap.c b/mm/mmap.c
index d35c6b51cadf..1d82b2260448 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -44,6 +44,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
+#include <linux/timekeeping.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -942,6 +943,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		uprobe_mmap(insert);
 
 	validate_mm(mm);
+	vm_area_updated(vma);
 
 	return 0;
 }
@@ -1135,6 +1137,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 		if (err)
 			return NULL;
 		khugepaged_enter_vma_merge(prev, vm_flags);
+		vm_area_updated(prev);
 		return prev;
 	}
 
@@ -1162,6 +1165,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 		if (err)
 			return NULL;
 		khugepaged_enter_vma_merge(area, vm_flags);
+		vm_area_updated(area);
 		return area;
 	}
 
@@ -1719,6 +1723,7 @@ unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 	vma->vm_flags |= VM_SOFTDIRTY;
 
 	vma_set_page_prot(vma);
+	vm_area_updated(vma);
 
 	return addr;
 
@@ -2263,6 +2268,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	anon_vma_unlock_write(vma->anon_vma);
 	khugepaged_enter_vma_merge(vma, vma->vm_flags);
 	validate_mm(mm);
+	vm_area_updated(vma);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -2332,6 +2338,7 @@ int expand_downwards(struct vm_area_struct *vma,
 	anon_vma_unlock_write(vma->anon_vma);
 	khugepaged_enter_vma_merge(vma, vma->vm_flags);
 	validate_mm(mm);
+	vm_area_updated(vma);
 	return error;
 }
 
@@ -2457,6 +2464,7 @@ void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
 	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 				 next ? next->vm_start : USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, start, end);
+	mm_updated(mm);
 }
 
 /*
@@ -2656,6 +2664,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 
 	/* Fix up all other VM information */
 	remove_vma_list(mm, vma);
+	mm_updated(mm);
 
 	return 0;
 }
@@ -2882,6 +2891,7 @@ static int do_brk(unsigned long addr, unsigned long request)
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
+	vm_area_updated(vma);
 	return 0;
 }
 
@@ -3058,6 +3068,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		*need_rmap_locks = false;
 	}
+	vm_area_updated(new_vma);
 	return new_vma;
 
 out_free_mempol:
@@ -3204,6 +3215,7 @@ static struct vm_area_struct *__install_special_mapping(
 	vm_stat_account(mm, vma->vm_flags, len >> PAGE_SHIFT);
 
 	perf_event_mmap(vma);
+	vm_area_updated(vma);
 
 	return vma;
 
@@ -3550,3 +3562,13 @@ static int __meminit init_reserve_notifier(void)
 	return 0;
 }
 subsys_initcall(init_reserve_notifier);
+
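+/*
+ * Record that a memory map, or one of its VMAs, has been modified. The
+ * resulting timestamps are consumed by the VAS code in mm/vas.c, which keeps
+ * track of when the memory maps and VMAs it duplicates were last changed.
+ */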
+void mm_updated(struct mm_struct *mm)
+{
+	mm->vas_last_update = ktime_get();
+}
+
+void vm_area_updated(struct vm_area_struct *vma)
+{
+	vma->vas_last_update = vma->vm_mm->vas_last_update = ktime_get();
+}
diff --git a/mm/vas.c b/mm/vas.c
new file mode 100644
index 000000000000..447d61e1da79
--- /dev/null
+++ b/mm/vas.c
@@ -0,0 +1,2188 @@
+/*
+ *  First Class Virtual Address Spaces
+ *  Copyright (c) 2016-2017, Hewlett Packard Enterprise
+ *
+ *  Code Authors:
+ *     Marco A Benatto <marco.antonio.780@gmail.com>
+ *     Till Smejkal <till.smejkal@gmail.com>
+ */
+
+
+#include <linux/vas.h>
+
+#include <linux/atomic.h>
+#include <linux/cred.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/fcntl.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+#include <linux/ktime.h>
+#include <linux/list.h>
+#include <linux/lockdep.h>
+#include <linux/mempolicy.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmu_context.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/rmap.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/stat.h>
+#include <linux/string.h>
+#include <linux/syscalls.h>
+#include <linux/uidgid.h>
+#include <linux/uaccess.h>
+#include <linux/vmacache.h>
+
+#include <asm/mman.h>
+#include <asm/processor.h>
+
+#include "internal.h"
+
+
+/***
+ * Internally used defines
+ ***/
+
+/**
+ * Make sure we are not overflowing the VAS sharing variable.
+ **/
+#define VAS_MAX_SHARES U16_MAX
+
+#define VAS_MAX_ID INT_MAX
+
+/**
+ * Masks and bits to implement sharing of VAS: the lower 16 bits of the
+ * 'sharing' counter track readers, the upper 16 bits track writers.
+ **/
+#define VAS_SHARE_READABLE (1 << 0)
+#define VAS_SHARE_WRITABLE (1 << 16)
+#define VAS_SHARE_READ_MASK 0xffff
+#define VAS_SHARE_WRITE_MASK 0xffff0000
+#define VAS_SHARE_READ_WRITE_MASK (VAS_SHARE_READ_MASK | VAS_SHARE_WRITE_MASK)
+
+/**
+ * Safely get the next vm_area_struct in the memory map's VMA list (handles a
+ * NULL starting VMA).
+ **/
+#define next_vma_safe(vma) ((vma) ? (vma)->vm_next : NULL)
+
+/**
+ * Get a string representation of the access type to a VAS.
+ **/
+#define access_type_str(type) ((type) & MAY_WRITE ?			\
+			       ((type) & MAY_READ ? "rw" : "wo") : "ro")
+
+
+/***
+ * Debugging functions
+ ***/
+
+#ifdef CONFIG_VAS_DEBUG
+
+/**
+ * Dump the content of the given memory map.
+ *
+ * @param title:	The title that should be printed above the dump.
+ * @param mm:		The memory map which should be dumped.
+ **/
+static void __dump_memory_map(const char *title, struct mm_struct *mm)
+{
+	int count;
+	struct vm_area_struct *vma;
+
+	down_read(&mm->mmap_sem);
+
+	/* Dump some general information. */
+	pr_info("-- %s [%p] --\n"
+		"> General information <\n"
+		"  PGD value: %#lx\n"
+		"  Task size: %#lx\n"
+		"  Map count: %d\n"
+		"  Last update: %lld\n"
+		"  Code:  %#lx - %#lx\n"
+		"  Data:  %#lx - %#lx\n"
+		"  Heap:  %#lx - %#lx\n"
+		"  Stack: %#lx\n"
+		"  Args:  %#lx - %#lx\n"
+		"  Env:   %#lx - %#lx\n",
+		title, mm, pgd_val(*mm->pgd), mm->task_size, mm->map_count,
+		mm->vas_last_update, mm->start_code, mm->end_code,
+		mm->start_data, mm->end_data, mm->start_brk, mm->brk,
+		mm->start_stack, mm->arg_start, mm->arg_end, mm->env_start,
+		mm->env_end);
+
+	/* Dump current RSS state counters of the memory map. */
+	pr_cont("> RSS Counter <\n");
+	for (count = 0; count < NR_MM_COUNTERS; ++count)
+		pr_cont(" %d: %lu\n", count, get_mm_counter(mm, count));
+
+	/* Dump the information for each region. */
+	pr_cont("> Mapped Regions <\n");
+	for (vma = mm->mmap, count = 0; vma; vma = vma->vm_next, ++count) {
+		pr_cont("  VMA %3d: %#14lx - %#-14lx", count, vma->vm_start,
+			vma->vm_end);
+
+		if (is_exec_mapping(vma->vm_flags))
+			pr_cont(" EXEC  ");
+		else if (is_data_mapping(vma->vm_flags))
+			pr_cont(" DATA  ");
+		else if (is_stack_mapping(vma->vm_flags))
+			pr_cont(" STACK ");
+		else
+			pr_cont(" OTHER ");
+
+		pr_cont("%c%c%c%c [%c]",
+			vma->vm_flags & VM_READ ? 'r' : '-',
+			vma->vm_flags & VM_WRITE ? 'w' : '-',
+			vma->vm_flags & VM_EXEC ? 'x' : '-',
+			vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
+			vma->vas_reference ? 'v' : '-');
+
+		if (vma->vm_file) {
+			struct file *f = vma->vm_file;
+			char *buf;
+
+			buf = kmalloc(PATH_MAX, GFP_TEMPORARY);
+			if (buf) {
+				char *p;
+
+				p = file_path(f, buf, PATH_MAX);
+				if (IS_ERR(p))
+					p = "?";
+
+				pr_cont(" --> %s @%lu\n", p, vma->vm_pgoff);
+				kfree(buf);
+			} else {
+				pr_cont(" --> NA @%lu\n", vma->vm_pgoff);
+			}
+		} else if (vma->vm_ops && vma->vm_ops->name) {
+			pr_cont(" --> %s\n", vma->vm_ops->name(vma));
+		} else {
+			pr_cont(" ANON\n");
+		}
+	}
+	if (count == 0)
+		pr_cont("  EMPTY\n");
+
+	up_read(&mm->mmap_sem);
+}
+
+#define pr_vas_debug(fmt, args...) pr_info("[VAS] %s - " fmt, __func__, ##args)
+#define dump_memory_map(title, mm) __dump_memory_map(title, mm)
+
+#else /* CONFIG_VAS_DEBUG */
+
+#define pr_vas_debug(...) do {} while (0)
+#define dump_memory_map(...) do {} while (0)
+
+#endif /* CONFIG_VAS_DEBUG */
+
+/***
+ * Internally used variables
+ ***/
+
+/**
+ * All SLAB caches used to improve allocation performance.
+ **/
+static struct kmem_cache *vas_cachep;
+static struct kmem_cache *att_vas_cachep;
+static struct kmem_cache *vas_context_cachep;
+
+/**
+ * Global management data structures and their associated locks.
+ **/
+static struct idr vases;
+static spinlock_t vases_lock;
+
+/**
+ * The placeholder variables that are used to identify to-be-deleted items in
+ * our global management data structures.
+ **/
+static struct vas *INVALID_VAS;
+
+/**
+ * Kernel 'ksets' where all objects will be managed.
+ **/
+static struct kset *vases_kset;
+
+
+/***
+ * Constructors and destructors for the data structures.
+ ***/
+static inline struct vm_area_struct *__new_vm_area(void)
+{
+	return kmem_cache_zalloc(vm_area_cachep, GFP_ATOMIC);
+}
+
+static inline void __delete_vm_area(struct vm_area_struct *vma)
+{
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static inline struct vas *__new_vas(void)
+{
+	return kmem_cache_zalloc(vas_cachep, GFP_KERNEL);
+}
+
+static inline void __delete_vas(struct vas *vas)
+{
+	WARN_ON(vas->refcount != 0);
+
+	mutex_destroy(&vas->mtx);
+
+	if (vas->mm)
+		mmput_async(vas->mm);
+	kmem_cache_free(vas_cachep, vas);
+}
+
+static inline void __delete_vas_rcu(struct rcu_head *rp)
+{
+	struct vas *vas = container_of(rp, struct vas, rcu);
+
+	__delete_vas(vas);
+}
+
+static inline struct att_vas *__new_att_vas(void)
+{
+	return kmem_cache_zalloc(att_vas_cachep, GFP_ATOMIC);
+}
+
+static inline void __delete_att_vas(struct att_vas *avas)
+{
+	if (avas->mm)
+		mmput_async(avas->mm);
+	kmem_cache_free(att_vas_cachep, avas);
+}
+
+static inline struct vas_context *__new_vas_context(void)
+{
+	return kmem_cache_zalloc(vas_context_cachep, GFP_KERNEL);
+}
+
+static inline void __delete_vas_context(struct vas_context *ctx)
+{
+	WARN_ON(ctx->refcount != 0);
+
+	kmem_cache_free(vas_context_cachep, ctx);
+}
+
+/***
+ * Kobject management of data structures
+ ***/
+
+/**
+ * Correctly take and put VAS.
+ **/
+static inline struct vas *__vas_get(struct vas *vas)
+{
+	return container_of(kobject_get(&vas->kobj), struct vas, kobj);
+}
+
+static inline void __vas_put(struct vas *vas)
+{
+	kobject_put(&vas->kobj);
+}
+
+/**
+ * The sysfs structure we need to handle attributes of a VAS.
+ **/
+struct vas_sysfs_attr {
+	struct attribute attr;
+	ssize_t (*show)(struct vas *vas, struct vas_sysfs_attr *vsattr,
+			char *buf);
+	ssize_t (*store)(struct vas *vas, struct vas_sysfs_attr *vsattr,
+			 const char *buf, size_t count);
+};
+
+#define VAS_SYSFS_ATTR(NAME, MODE, SHOW, STORE)				\
+static struct vas_sysfs_attr vas_sysfs_attr_##NAME =			\
+	__ATTR(NAME, MODE, SHOW, STORE)
+
+/**
+ * Functions for all the sysfs operations.
+ **/
+static ssize_t __vas_sysfs_attr_show(struct kobject *kobj,
+				     struct attribute *attr,
+				     char *buf)
+{
+	struct vas *vas;
+	struct vas_sysfs_attr *vsattr;
+
+	vas = container_of(kobj, struct vas, kobj);
+	vsattr = container_of(attr, struct vas_sysfs_attr, attr);
+
+	if (!vsattr->show)
+		return -EIO;
+
+	return vsattr->show(vas, vsattr, buf);
+}
+
+static ssize_t __vas_sysfs_attr_store(struct kobject *kobj,
+				      struct attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct vas *vas;
+	struct vas_sysfs_attr *vsattr;
+
+	vas = container_of(kobj, struct vas, kobj);
+	vsattr = container_of(attr, struct vas_sysfs_attr, attr);
+
+	if (!vsattr->store)
+		return -EIO;
+
+	return vsattr->store(vas, vsattr, buf, count);
+}
+
+/**
+ * The sysfs operations structure for a VAS.
+ **/
+static const struct sysfs_ops vas_sysfs_ops = {
+	.show = __vas_sysfs_attr_show,
+	.store = __vas_sysfs_attr_store,
+};
+
+/**
+ * Default attributes of a VAS.
+ **/
+static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr *vsattr,
+			       char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%s", vas->name);
+}
+VAS_SYSFS_ATTR(name, 0444, __show_vas_name, NULL);
+
+static ssize_t __show_vas_mode(struct vas *vas, struct vas_sysfs_attr *vsattr,
+			       char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%#03o", vas->mode);
+}
+VAS_SYSFS_ATTR(mode, 0444, __show_vas_mode, NULL);
+
+static ssize_t __show_vas_user(struct vas *vas, struct vas_sysfs_attr *vsattr,
+			       char *buf)
+{
+	struct user_namespace *ns = current_user_ns();
+
+	return scnprintf(buf, PAGE_SIZE, "%d", from_kuid(ns, vas->uid));
+}
+VAS_SYSFS_ATTR(user, 0444, __show_vas_user, NULL);
+
+static ssize_t __show_vas_group(struct vas *vas, struct vas_sysfs_attr *vsattr,
+				char *buf)
+{
+	struct user_namespace *ns = current_user_ns();
+
+	return scnprintf(buf, PAGE_SIZE, "%d", from_kgid(ns, vas->gid));
+}
+VAS_SYSFS_ATTR(group, 0444, __show_vas_group, NULL);
+
+static struct attribute *vas_default_attr[] = {
+	&vas_sysfs_attr_name.attr,
+	&vas_sysfs_attr_mode.attr,
+	&vas_sysfs_attr_user.attr,
+	&vas_sysfs_attr_group.attr,
+	NULL
+};
+
+/**
+ * Function to release the VAS after its kobject is gone.
+ **/
+static void __vas_release(struct kobject *kobj)
+{
+	struct vas *vas = container_of(kobj, struct vas, kobj);
+
+	spin_lock(&vases_lock);
+	idr_remove(&vases, vas->id);
+	spin_unlock(&vases_lock);
+
+	/*
+	 * Wait for the full RCU grace period before actually deleting the VAS
+	 * data structure since we haven't done it earlier.
+	 */
+	call_rcu(&vas->rcu, __delete_vas_rcu);
+}
+
+/**
+ * The ktype data structure representing a VAS.
+ **/
+static struct kobj_type vas_ktype = {
+	.sysfs_ops = &vas_sysfs_ops,
+	.release = __vas_release,
+	.default_attrs = vas_default_attr,
+};
+
+
+/***
+ * Internally visible functions
+ ***/
+
+/**
+ * Working with the global VAS list.
+ **/
+static inline void vas_remove(struct vas *vas)
+{
+	spin_lock(&vases_lock);
+
+	/*
+	 * Only put the to-be-deleted placeholder into the IDR here; the actual
+	 * removal from the IDR and the freeing of the object are done when the
+	 * kobject is released. We have to do it this way to keep the ID
+	 * reserved. Otherwise it could happen that we try to create a new VAS
+	 * with a reused ID in the sysfs before the current VAS has been removed
+	 * from the sysfs.
+	 */
+	idr_replace(&vases, INVALID_VAS, vas->id);
+	spin_unlock(&vases_lock);
+
+	/*
+	 * No need to wait for the RCU period here, we will do it before
+	 * actually deleting the VAS in the 'vas_release' function.
+	 */
+	__vas_put(vas);
+}
+
+static inline int vas_insert(struct vas *vas)
+{
+	int ret;
+
+	/* Add the VAS in the IDR cache. */
+	spin_lock(&vases_lock);
+
+	ret = idr_alloc(&vases, vas, 1, VAS_MAX_ID, GFP_KERNEL);
+
+	spin_unlock(&vases_lock);
+
+	if (ret < 0) {
+		__delete_vas(vas);
+		return ret;
+	}
+
+	/* Add the last data to the VAS' data structure. */
+	vas->id = ret;
+	vas->kobj.kset = vases_kset;
+
+	/* Initialize the kobject and add it to the sysfs. */
+	ret = kobject_init_and_add(&vas->kobj, &vas_ktype, NULL, "%d", vas->id);
+	if (ret != 0) {
+		vas_remove(vas);
+		return ret;
+	}
+
+	/* The VAS is ready, trigger the corresponding UEVENT. */
+	kobject_uevent(&vas->kobj, KOBJ_ADD);
+
+	/*
+	 * We don't put or get the VAS again, because its reference count will
+	 * be initialized with '1'. This will be reduced to 0 when we remove the
+	 * VAS again from the internal global management list.
+	 */
+	return 0;
+}
+
+static inline struct vas *vas_lookup(int id)
+{
+	struct vas *vas;
+
+	rcu_read_lock();
+
+	vas = idr_find(&vases, id);
+	if (vas == INVALID_VAS)
+		vas = NULL;
+	if (vas)
+		vas = __vas_get(vas);
+
+	rcu_read_unlock();
+
+	return vas;
+}
+
+static inline struct vas *vas_lookup_by_name(const char *name)
+{
+	struct vas *vas;
+	int id;
+
+	rcu_read_lock();
+
+	idr_for_each_entry(&vases, vas, id) {
+		if (vas == INVALID_VAS)
+			continue;
+
+		if (strcmp(vas->name, name) == 0)
+			break;
+	}
+
+	if (vas)
+		vas = __vas_get(vas);
+
+	rcu_read_unlock();
+
+	return vas;
+}
+
+/**
+ * Management of the sharing of VAS: a VAS can be shared writable by at most
+ * one attacher at a time (and only while there are no readers), or readable
+ * by arbitrarily many as long as there is no writer. vas_take_share() returns
+ * 1 if the requested share could be taken and 0 otherwise.
+ **/
+static inline int vas_take_share(int type, struct vas *vas)
+{
+	int ret;
+
+	spin_lock(&vas->share_lock);
+	if (type & MAY_WRITE) {
+		if ((vas->sharing & VAS_SHARE_READ_WRITE_MASK) == 0) {
+			vas->sharing += VAS_SHARE_WRITABLE;
+			ret = 1;
+		} else
+			ret = 0;
+	} else {
+		if ((vas->sharing & VAS_SHARE_WRITE_MASK) == 0) {
+			vas->sharing += VAS_SHARE_READABLE;
+			ret = 1;
+		} else
+			ret = 0;
+	}
+	spin_unlock(&vas->share_lock);
+
+	return ret;
+}
+
+static inline void vas_put_share(int type, struct vas *vas)
+{
+	spin_lock(&vas->share_lock);
+	if (type & MAY_WRITE)
+		vas->sharing -= VAS_SHARE_WRITABLE;
+	else
+		vas->sharing -= VAS_SHARE_READABLE;
+	spin_unlock(&vas->share_lock);
+}
+
+/**
+ * Management of the memory maps.
+ **/
+static int init_vas_mm(struct vas *vas)
+{
+	struct mm_struct *mm;
+
+	mm = mm_alloc();
+	if (!mm)
+		return -ENOMEM;
+
+	mm = mm_setup(mm);
+	if (!mm)
+		return -ENOMEM;
+
+	arch_pick_mmap_layout(mm);
+
+	vas->mm = mm;
+	return 0;
+}
+
+static int init_att_vas_mm(struct att_vas *avas, struct task_struct *owner)
+{
+	struct mm_struct *mm, *orig_mm = owner->original_mm;
+
+	mm = mm_alloc();
+	if (!mm)
+		return -ENOMEM;
+
+	mm = mm_setup(mm);
+	if (!mm)
+		return -ENOMEM;
+
+	mm = mm_set_task(mm, owner, orig_mm->user_ns);
+	if (!mm)
+		return -ENOMEM;
+
+	arch_pick_mmap_layout(mm);
+
+	/* Additional setup of the memory map. */
+	set_mm_exe_file(mm, get_mm_exe_file(orig_mm));
+	mm->vas_last_update = orig_mm->vas_last_update;
+
+	avas->mm = mm;
+	return 0;
+}
+
+/**
+ * Lookup the corresponding vm_area in the referenced memory map.
+ *
+ * The function is very similar to 'find_exact_vma'. However, it can also handle
+ * cases where a VMA was resized while the referenced one wasn't, or vice versa.
+ **/
+static struct vm_area_struct *vas_find_reference(struct mm_struct *mm,
+						 struct vm_area_struct *vma)
+{
+	struct vm_area_struct *ref;
+
+	ref = find_vma(mm, vma->vm_start);
+	if (ref) {
+		/*
+		 * OK, we found a VMA in the other memory map. Let's check
+		 * whether this is really the VMA we are referring to.
+		 */
+		if (ref->vm_start == vma->vm_start &&
+		    ref->vm_end == vma->vm_end)
+			/* This is an exact match. */
+			return ref;
+
+		if (ref->vm_start != vma->vm_start &&
+		    ref->vm_end == vma->vm_end &&
+		    vma->vm_flags & VM_GROWSDOWN)
+			/* This might be the stack VMA. */
+			return ref;
+	}
+
+	return NULL;
+}
+
+/**
+ * Translate a bit field with O_* bits into fs-like bit field with MAY_* bits.
+ **/
+static inline int __build_vas_access_type(int acc_type)
+{
+	/* We are only interested in access modes. */
+	acc_type &= O_ACCMODE;
+
+	if (acc_type == O_RDONLY)
+		return MAY_READ;
+	else if (acc_type == O_WRONLY)
+		return MAY_WRITE;
+	else if (acc_type == O_RDWR)
+		return MAY_READ | MAY_WRITE;
+
+	return -1;
+}
+
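+/**
+ * Check whether the current user may access a VAS with the given MAY_* access
+ * type, based on the VAS' owner, group and mode bits.
+ **/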
+static inline int __check_permission(kuid_t uid, kgid_t gid, umode_t mode,
+				     int type)
+{
+	kuid_t cur_uid = current_uid();
+
+	/* root can do anything with a VAS. */
+	if (unlikely(uid_eq(cur_uid, GLOBAL_ROOT_UID)))
+		return 0;
+
+	if (likely(uid_eq(cur_uid, uid)))
+		mode >>= 6;
+	else if (in_group_p(gid))
+		mode >>= 3;
+
+	if ((type & ~mode & (MAY_READ | MAY_WRITE)) == 0)
+		return 0;
+	return -EACCES;
+}
+
+/**
+ * Copy a vm_area from one memory map into another one.
+ *
+ * Requires that the semaphore of the destination memory map is taken in
+ * write-mode and the one of the source memory map at least in read-mode.
+ *
+ * @param[in] src_mm:	The memory map to which the vm_area belongs that should
+ *			be copied.
+ * @param[in] src_vma:	The vm_area that should be copied.
+ * @param[in] dst_mm:	The memory map to which the vm_area should be copied.
+ * @param[in] vm_flags:	The vm_flags that should be used for the new vm_area.
+ *
+ * @returns:		A pointer to the new vm_area on success, NULL
+ *			otherwise.
+ **/
+static inline
+struct vm_area_struct *__copy_vm_area(struct mm_struct *src_mm,
+				      struct vm_area_struct *src_vma,
+				      struct mm_struct *dst_mm,
+				      unsigned long vm_flags)
+{
+	struct vm_area_struct *vma, *prev;
+	struct rb_node **rb_link, *rb_parent;
+	int ret;
+
+	pr_vas_debug("Copying VMA - addr: %#lx - %#lx - to %p\n",
+		     src_vma->vm_start, src_vma->vm_end, dst_mm);
+
+	ret = find_vma_links(dst_mm, src_vma->vm_start, src_vma->vm_end,
+			     &prev, &rb_link, &rb_parent);
+	if (ret != 0) {
+		pr_vas_debug("Could not map VMA in the new memory map because of a conflict with a different mapping\n");
+		return NULL;
+	}
+
+	vma = __new_vm_area();
+	if (!vma)
+		return NULL;
+	*vma = *src_vma;
+
+	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	ret = vma_dup_policy(src_vma, vma);
+	if (ret != 0)
+		goto out_free_vma;
+	ret = anon_vma_clone(vma, src_vma);
+	if (ret != 0)
+		goto out_free_vma;
+	vma->vm_mm = dst_mm;
+	vma->vm_flags = vm_flags;
+	vma_set_page_prot(vma);
+	vma->vm_next = vma->vm_prev = NULL;
+	if (vma->vm_file)
+		get_file(vma->vm_file);
+	if (vma->vm_ops && vma->vm_ops->open)
+		vma->vm_ops->open(vma);
+	vma->vas_last_update = src_vma->vas_last_update;
+
+	vma_link(dst_mm, vma, prev, rb_link, rb_parent);
+
+	vm_stat_account(dst_mm, vma->vm_flags, vma_pages(vma));
+	if (unlikely(dup_page_range(dst_mm, vma, src_mm, src_vma)))
+		pr_vas_debug("Failed to copy page table for VMA %p from %p\n",
+			     vma, src_vma);
+
+	return vma;
+
+out_free_vma:
+	__delete_vm_area(vma);
+	return NULL;
+}
+
+/**
+ * Remove a vm_area from a given memory map.
+ *
+ * Requires that the semaphore of the memory map is taken in write-mode.
+ *
+ * @param mm:		The memory map from which the vm_area should be
+ *			removed.
+ * @param vma:		The vm_area that should be removed.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static inline int __remove_vm_area(struct mm_struct *mm,
+				   struct vm_area_struct *vma)
+{
+	pr_vas_debug("Removing VMA - addr: %#lx - %#lx - from %p\n",
+		     vma->vm_start, vma->vm_end, mm);
+
+	return do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+/**
+ * Update the information of a vm_area in one particular memory map with the
+ * information of the corresponding one in another memory map.
+ *
+ * Requires that the semaphores of both memory maps are taken in write-mode.
+ *
+ * @param[in] src_mm:	The memory map containing the vm_area from which the
+ *			information should be copied.
+ * @param[in] src_vma:	The vm_area from which the information should be
+ *			copied.
+ * @param[in] dst_mm:	The memory map containing the vm_area to which the
+ *			information should be copied.
+ * @param[in] dst_vma:	The vm_area that should be updated if already known,
+ *			otherwise this can be NULL and will be looked up in the
+ *			destination memory map.
+ *
+ * @returns:		A pointer to the updated vm_area on success, NULL
+ *			otherwise.
+ **/
+static inline
+struct vm_area_struct *__update_vm_area(struct mm_struct *src_mm,
+					struct vm_area_struct *src_vma,
+					struct mm_struct *dst_mm,
+					struct vm_area_struct *dst_vma)
+{
+	pr_vas_debug("Updating VMA - addr: %#lx - %#lx - in %p\n",
+		     src_vma->vm_start, src_vma->vm_end, dst_mm);
+
+	/* Lookup the destination vm_area if not yet known. */
+	if (!dst_vma)
+		dst_vma = vas_find_reference(dst_mm, src_vma);
+
+	if (!dst_vma) {
+		pr_vas_debug("Cannot find corresponding memory region in destination memory map -- Abort\n");
+		dst_vma = NULL;
+	} else if (ktime_compare(src_vma->vas_last_update,
+				 dst_vma->vas_last_update) == 0) {
+		pr_vas_debug("Memory region is unchanged -- Skip\n");
+	} else if (ktime_compare(src_vma->vas_last_update,
+				 dst_vma->vas_last_update) == -1) {
+		pr_vas_debug("Memory region is stale (%lld vs %lld) -- Abort\n",
+			     src_vma->vas_last_update,
+			     dst_vma->vas_last_update);
+		dst_vma = NULL;
+	} else if (src_vma->vm_start != dst_vma->vm_start ||
+		   src_vma->vm_end != dst_vma->vm_end) {
+		/*
+		 * The VMA changed completely. We have to represent this change
+		 * in the destination memory region.
+		 */
+		struct mm_struct *orig_vas_ref = dst_vma->vas_reference;
+		unsigned long orig_vm_flags = dst_vma->vm_flags;
+
+		if (__remove_vm_area(dst_mm, dst_vma) != 0) {
+			dst_vma = NULL;
+			goto out;
+		}
+
+		dst_vma = __copy_vm_area(src_mm, src_vma, dst_mm,
+					 orig_vm_flags);
+		if (!dst_vma)
+			goto out;
+
+		dst_vma->vas_reference = orig_vas_ref;
+	} else {
+		/*
+		 * The VMA itself did not change. However, mappings might have
+		 * changed. So at least update the page table entries belonging
+		 * to the VMA in the destination memory region.
+		 */
+		if (unlikely(dup_page_range(dst_mm, dst_vma, src_mm, src_vma)))
+			pr_vas_debug("Cannot update page table entries\n");
+
+		dst_vma->vas_last_update = src_vma->vas_last_update;
+	}
+
+out:
+	return dst_vma;
+}
+
+/**
+ * Merge the memory regions of the VAS into the attached-VAS memory map.
+ *
+ * Requires that the VAS is already locked.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure that
+ *			contains all the information for this attachment.
+ * @param[in] vas:	The pointer to the VAS of which the memory map should
+ *			be merged.
+ * @param[in] type:	The type of attaching (see vas_attach for more
+ *			information).
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int vas_merge(struct att_vas *avas, struct vas *vas, int type)
+{
+	struct vm_area_struct *vma, *new_vma;
+	struct mm_struct *vas_mm, *avas_mm;
+	int ret;
+
+	vas_mm = vas->mm;
+	avas_mm = avas->mm;
+
+	dump_memory_map("Before VAS MM", vas_mm);
+
+	if (down_write_killable(&avas_mm->mmap_sem))
+		return -EINTR;
+	down_read_nested(&vas_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+
+	/* Try to copy all VMAs of the VAS into the AS of the attached-VAS. */
+	for (vma = vas_mm->mmap; vma; vma = vma->vm_next) {
+		unsigned long merged_vm_flags = vma->vm_flags;
+
+		pr_vas_debug("Merging a VAS memory region (%#lx - %#lx)\n",
+			     vma->vm_start, vma->vm_end);
+
+		/*
+		 * Remove the writable bit from the vm_flags if the VAS is
+		 * attached only readable.
+		 */
+		if (!(type & MAY_WRITE))
+			merged_vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
+
+		new_vma = __copy_vm_area(vas_mm, vma, avas_mm,
+					 merged_vm_flags);
+		if (!new_vma) {
+			pr_vas_debug("Failed to merge a VAS memory region (%#lx - %#lx)\n",
+				     vma->vm_start, vma->vm_end);
+			ret = -EFAULT;
+			goto out_unlock;
+		}
+
+		/*
+		 * Remember for the VMA that we just added to the
+		 * attached-VAS that it actually belongs to the VAS.
+		 */
+		new_vma->vas_reference = vas_mm;
+	}
+
+	ret = 0;
+
+out_unlock:
+	up_read(&vas_mm->mmap_sem);
+	up_write(&avas_mm->mmap_sem);
+
+	dump_memory_map("After VAS MM", vas_mm);
+	dump_memory_map("After Attached-VAS MM", avas_mm);
+
+	return ret;
+}
+
+/**
+ * Unmerge the VAS-related parts of an attached-VAS memory map back into the
+ * VAS' memory map.
+ *
+ * Requires that the VAS is already locked.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure that
+ *			contains all the information for this attachment.
+ * @param[in] vas:	The pointer to the VAS for which the memory map should
+ *			be updated again.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int vas_unmerge(struct att_vas *avas, struct vas *vas)
+{
+	struct vm_area_struct *vma, *next;
+	struct mm_struct *vas_mm, *avas_mm;
+	int ret;
+
+	vas_mm = vas->mm;
+	avas_mm = avas->mm;
+
+	dump_memory_map("Before Attached-VAS MM", avas_mm);
+	dump_memory_map("Before VAS MM", vas_mm);
+
+	if (down_write_killable(&avas_mm->mmap_sem))
+		return -EINTR;
+	down_write_nested(&vas_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+
+	/* Update all VMAs of the VAS if they changed in the attached-VAS. */
+	for (vma = avas_mm->mmap, next = next_vma_safe(vma); vma;
+	     vma = next, next = next_vma_safe(next)) {
+		struct mm_struct *ref_mm = vma->vas_reference;
+
+		if (!ref_mm) {
+			struct vm_area_struct *new_vma;
+
+			/*
+			 * This is a VMA which was created while the VAS was
+			 * attached to the process and which does not yet exist
+			 * in the VAS. Copy it into the VAS' mm_struct.
+			 */
+			pr_vas_debug("Unmerging a new VAS memory region (%#lx - %#lx)\n",
+				     vma->vm_start, vma->vm_end);
+
+			new_vma = __copy_vm_area(avas_mm, vma, vas_mm,
+						 vma->vm_flags);
+			if (!new_vma) {
+				pr_vas_debug("Failed to unmerge a new VAS memory region (%#lx - %#lx)\n",
+					     vma->vm_start, vma->vm_end);
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			new_vma->vas_reference = NULL;
+		} else {
+			struct vm_area_struct *upd_vma;
+
+			/*
+			 * This VMA was previously copied into the memory map
+			 * when the VAS was attached to the process. So check if
+			 * we need to update the corresponding VMA in the VAS'
+			 * memory map.
+			 */
+			pr_vas_debug("Unmerging an existing VAS memory region (%#lx - %#lx)\n",
+				     vma->vm_start, vma->vm_end);
+
+			upd_vma = __update_vm_area(avas_mm, vma, vas_mm, NULL);
+			if (!upd_vma) {
+				pr_vas_debug("Failed to unmerge a VAS memory region (%#lx - %#lx)\n",
+					     vma->vm_start, vma->vm_end);
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+		}
+
+		/* Remove the current VMA from the attached-VAS memory map. */
+		__remove_vm_area(avas_mm, vma);
+	}
+
+	ret = 0;
+
+out_unlock:
+	up_write(&vas_mm->mmap_sem);
+	up_write(&avas_mm->mmap_sem);
+
+	dump_memory_map("After VAS MM", vas_mm);
+
+	return ret;
+}
+
+/**
+ * Merge the memory regions of the task into the attached-VAS memory map.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure that
+ *			contains all the information for this attachment.
+ * @param[in] tsk:	The pointer to the task of which the memory map
+ *			should be merged.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int task_merge(struct att_vas *avas, struct task_struct *tsk)
+{
+	struct vm_area_struct *vma, *new_vma;
+	struct mm_struct *avas_mm, *tsk_mm;
+	int ret;
+
+	tsk_mm = tsk->original_mm;
+	avas_mm = avas->mm;
+
+	dump_memory_map("Before Task MM", tsk_mm);
+	dump_memory_map("Before Attached-VAS MM", avas_mm);
+
+	if (down_write_killable(&avas_mm->mmap_sem))
+		return -EINTR;
+	down_read_nested(&tsk_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+
+	/*
+	 * Try to copy all necessary memory regions from the task's memory
+	 * map to the attached-VAS memory map.
+	 */
+	for (vma = tsk_mm->mmap; vma; vma = vma->vm_next) {
+		pr_vas_debug("Merging a task memory region (%#lx - %#lx)\n",
+			     vma->vm_start, vma->vm_end);
+
+		new_vma = __copy_vm_area(tsk_mm, vma, avas_mm, vma->vm_flags);
+		if (!new_vma) {
+			pr_vas_debug("Failed to merge a task memory region (%#lx - %#lx)\n",
+				     vma->vm_start, vma->vm_end);
+			ret = -EFAULT;
+			goto out_unlock;
+		}
+
+		/*
+		 * Remember for the VMA that we just added to the
+		 * attached-VAS that it actually belongs to the task.
+		 */
+		new_vma->vas_reference = tsk_mm;
+	}
+
+	ret = 0;
+
+out_unlock:
+	up_read(&tsk_mm->mmap_sem);
+	up_write(&avas_mm->mmap_sem);
+
+	dump_memory_map("After Task MM", tsk_mm);
+	dump_memory_map("After Attached-VAS MM", avas_mm);
+
+	return ret;
+}
+
+/**
+ * Unmerge task-related parts of an attached-VAS memory map back into the
+ * task's memory map.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure that
+ *			contains all the information for this attachment.
+ * @param[in] tsk:	The pointer to the task for which the memory map
+ *			should be updated again.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int task_unmerge(struct att_vas *avas, struct task_struct *tsk)
+{
+	struct vm_area_struct *vma, *next;
+	struct mm_struct *avas_mm, *tsk_mm;
+
+	tsk_mm = tsk->original_mm;
+	avas_mm = avas->mm;
+
+	dump_memory_map("Before Task MM", tsk_mm);
+	dump_memory_map("Before Attached-VAS MM", avas_mm);
+
+	if (down_write_killable(&avas_mm->mmap_sem))
+		return -EINTR;
+
+	/*
+	 * Since we are always syncing with the task's memory map at every
+	 * switch, unmerging the task's memory regions basically just means
+	 * removing them.
+	 */
+	for (vma = avas_mm->mmap, next = next_vma_safe(vma); vma;
+	     vma = next, next = next_vma_safe(next)) {
+		struct mm_struct *ref_mm = vma->vas_reference;
+
+		if (ref_mm != tsk_mm) {
+			pr_vas_debug("Skipping memory region (%#lx - %#lx) during task unmerging\n",
+				     vma->vm_start, vma->vm_end);
+			continue;
+		}
+
+		pr_vas_debug("Unmerging a task memory region (%#lx - %#lx)\n",
+			     vma->vm_start, vma->vm_end);
+
+		/* Remove the current VMA from the attached-VAS memory map. */
+		__remove_vm_area(avas_mm, vma);
+	}
+
+	up_write(&avas_mm->mmap_sem);
+
+	dump_memory_map("After Task MM", tsk_mm);
+	dump_memory_map("After Attached-VAS MM", avas_mm);
+
+	return 0;
+}
+
+/**
+ * Attach a VAS to a task -- update internal information ONLY
+ *
+ * Requires that the VAS is already locked.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure
+ *			containing all the information of this attaching.
+ * @param[in] tsk:	The pointer to the task to which the VAS should be
+ *			attached.
+ * @param[in] vas:	The pointer to the VAS which should be attached.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int __vas_attach(struct att_vas *avas, struct task_struct *tsk,
+			struct vas *vas)
+{
+	int ret;
+
+	/* Before doing anything, synchronize the RSS-stat of the task. */
+	sync_mm_rss(tsk->mm);
+
+	/*
+	 * Try to acquire the VAS share with the proper type. This will ensure
+	 * that the different sharing possibilities of VAS are respected.
+	 */
+	if (!vas_take_share(avas->type, vas)) {
+		pr_vas_debug("VAS is already attached exclusively\n");
+		return -EBUSY;
+	}
+
+	ret = vas_merge(avas, vas, avas->type);
+	if (ret != 0)
+		goto out_put_share;
+
+	ret = task_merge(avas, tsk);
+	if (ret != 0)
+		goto out_put_share;
+
+	vas->refcount++;
+
+	return 0;
+
+out_put_share:
+	vas_put_share(avas->type, vas);
+	return ret;
+}
+
+/**
+ * Detach a VAS from a task -- update internal information ONLY
+ *
+ * Requires that the VAS is already locked.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure
+ *			containing all the information of this attaching.
+ * @param[in] tsk:	The pointer to the task from which the VAS should be
+ *			detached.
+ * @param[in] vas:	The pointer to the VAS which should be detached.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int __vas_detach(struct att_vas *avas, struct task_struct *tsk,
+			struct vas *vas)
+{
+	int ret;
+
+	/* Before detaching the VAS, synchronize the RSS-stat of the task. */
+	sync_mm_rss(tsk->mm);
+
+	ret = task_unmerge(avas, tsk);
+	if (ret != 0)
+		return ret;
+
+	ret = vas_unmerge(avas, vas);
+	if (ret != 0)
+		return ret;
+
+	vas->refcount--;
+
+	/* Release our share of the VAS here to preserve the sharing rules. */
+	vas_put_share(avas->type, vas);
+
+	return 0;
+}
+
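+/**
+ * Copy or update all memory regions of the task's memory map in the
+ * attached-VAS memory map (sync direction: task -> attached-VAS).
+ *
+ * Requires that the semaphore of @avas_mm is taken in write-mode and the one
+ * of @tsk_mm at least in read-mode.
+ **/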
+static int __sync_from_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	ret = 0;
+	for (vma = tsk_mm->mmap; vma; vma = vma->vm_next) {
+		struct vm_area_struct *ref;
+
+		ref = vas_find_reference(avas_mm, vma);
+		if (!ref) {
+			ref = __copy_vm_area(tsk_mm, vma, avas_mm,
+					     vma->vm_flags);
+
+			if (!ref) {
+				pr_vas_debug("Failed to copy memory region (%#lx - %#lx) during task sync\n",
+					     vma->vm_start, vma->vm_end);
+				ret = -EFAULT;
+				break;
+			}
+
+			/*
+			 * Remember for the newly added memory region where we
+			 * copied it from.
+			 */
+			ref->vas_reference = tsk_mm;
+		} else {
+			ref = __update_vm_area(tsk_mm, vma, avas_mm, ref);
+			if (!ref) {
+				pr_vas_debug("Failed to update memory region (%#lx - %#lx) during task sync\n",
+					     vma->vm_start, vma->vm_end);
+				ret = -EFAULT;
+				break;
+			}
+		}
+	}
+
+	return ret;
+}
+
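+/**
+ * Update all task-related memory regions of the attached-VAS memory map in
+ * the task's memory map (sync direction: attached-VAS -> task).
+ *
+ * Requires that the semaphore of @tsk_mm is taken in write-mode and the one
+ * of @avas_mm at least in read-mode.
+ **/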
+static int __sync_to_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	ret = 0;
+	for (vma = avas_mm->mmap; vma; vma = vma->vm_next) {
+		if (vma->vas_reference != tsk_mm) {
+			pr_vas_debug("Skip unrelated memory region (%#lx - %#lx) during task resync\n",
+				     vma->vm_start, vma->vm_end);
+		} else {
+			struct vm_area_struct *ref;
+
+			ref = __update_vm_area(avas_mm, vma, tsk_mm, NULL);
+			if (!ref) {
+				pr_vas_debug("Failed to update memory region (%#lx - %#lx) during task resync\n",
+					     vma->vm_start, vma->vm_end);
+				ret = -EFAULT;
+				break;
+			}
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * Synchronize all task related parts of the memory maps to reflect the latest
+ * state.
+ *
+ * @param[in] avas_mm:	The memory map of the attached-VAS.
+ * @param[in] tsk_mm:	The memory map of the task.
+ * @param[in] dir:	The direction in which the sync should happen:
+ *				1 => tsk -> avas
+ *			       -1 => avas -> tsk
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int synchronize_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm,
+			    int dir)
+{
+	struct mm_struct *src_mm, *dst_mm;
+	int ret;
+
+	src_mm = dir == 1 ? tsk_mm : avas_mm;
+	dst_mm = dir == 1 ? avas_mm : tsk_mm;
+
+	/*
+	 * We have nothing to do if nothing has changed the memory maps since
+	 * the last sync.
+	 */
+	if (ktime_compare(src_mm->vas_last_update,
+			  dst_mm->vas_last_update) == 0) {
+		pr_vas_debug("Nothing to do during switch, memory map is up-to-date\n");
+		return 0;
+	}
+
+	pr_vas_debug("Synchronize memory map from %s to %s\n",
+		     dir == 1 ? "Task" : "Attached-VAS",
+		     dir == 1 ? "Attached-VAS" : "Task");
+
+	dump_memory_map("Before Task MM", tsk_mm);
+	dump_memory_map("Before Attached-VAS MM", avas_mm);
+
+	if (down_write_killable(&dst_mm->mmap_sem))
+		return -EINTR;
+	down_read_nested(&src_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+
+	if (dir == 1)
+		ret = __sync_from_task(avas_mm, tsk_mm);
+	else
+		ret = __sync_to_task(avas_mm, tsk_mm);
+
+	if (ret != 0)
+		goto out_unlock;
+
+	/*
+	 * Also update the information about where the different memory regions
+	 * such as code, data and stack start and end.
+	 */
+	dst_mm->start_code = src_mm->start_code;
+	dst_mm->end_code = src_mm->end_code;
+	dst_mm->start_data = src_mm->start_data;
+	dst_mm->end_data = src_mm->end_data;
+	dst_mm->start_brk = src_mm->start_brk;
+	dst_mm->brk = src_mm->brk;
+	dst_mm->start_stack = src_mm->start_stack;
+	dst_mm->arg_start = src_mm->arg_start;
+	dst_mm->arg_end = src_mm->arg_end;
+	dst_mm->env_start = src_mm->env_start;
+	dst_mm->env_end = src_mm->env_end;
+	dst_mm->task_size = src_mm->task_size;
+
+	dst_mm->vas_last_update = src_mm->vas_last_update;
+
+	ret = 0;
+
+out_unlock:
+	up_read(&src_mm->mmap_sem);
+	up_write(&dst_mm->mmap_sem);
+
+	dump_memory_map("After Task MM", tsk_mm);
+	dump_memory_map("After Attached-VAS MM", avas_mm);
+
+	return ret;
+}
+
+/**
+ * Properly update and setup the memory maps before performing the actual
+ * switch to a different address space.
+ *
+ * @param[in] from:	The attached-VAS that we are switching away from, or
+ *			NULL if we are switching away from the task's original
+ *			AS.
+ * @param[in] to:	The attached-VAS that we are switching to, or NULL if
+ *			we are switching to the task's original AS.
+ * @param[in] tsk:	The pointer to the task for which the switch should
+ *			happen.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int vas_prepare_switch(struct att_vas *from, struct att_vas *to,
+			      struct task_struct *tsk)
+{
+	int ret;
+
+	/* Before doing anything, synchronize the RSS-stat of the task. */
+	sync_mm_rss(tsk->mm);
+
+	/*
+	 * When switching away from a VAS we first have to update the task's
+	 * memory map so that it is always up-to-date.
+	 */
+	if (from) {
+		ret = synchronize_task(from->mm, tsk->original_mm, -1);
+		if (ret != 0)
+			return ret;
+	}
+
+	/*
+	 * When switching to a VAS we have to update the VAS' memory map so that
+	 * it contains all the up-to-date information of the task.
+	 */
+	if (to) {
+		ret = synchronize_task(to->mm, tsk->original_mm, 1);
+		if (ret != 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * Switch a task's address space to the given one.
+ *
+ * @param[in] tsk:	The pointer to the task for which the AS should be
+ *			switched.
+ * @param[in] vid:	The ID of the VAS to which the task should switch, or
+ *			0 if the task should switch to its original AS.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int __vas_switch(struct task_struct *tsk, int vid)
+{
+	struct vas_context *ctx = tsk->vas_ctx;
+	struct att_vas *next_avas, *old_avas;
+	struct mm_struct *nextmm, *oldmm;
+	bool is_attached;
+	int ret;
+
+	vas_context_lock(ctx);
+
+	if (vid == 0) {
+		pr_vas_debug("Switching to original mm\n");
+		next_avas = NULL;
+		nextmm = tsk->original_mm;
+	} else {
+		is_attached = false;
+		list_for_each_entry(next_avas, &ctx->vases, tsk_link) {
+			if (next_avas->vas->id == vid) {
+				is_attached = true;
+				break;
+			}
+		}
+		if (!is_attached) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+
+		pr_vas_debug("Switching to VAS - name: %s\n",
+			     next_avas->vas->name);
+		nextmm = next_avas->mm;
+	}
+
+	if (tsk->active_vas == 0) {
+		pr_vas_debug("Switching from original mm\n");
+		old_avas = NULL;
+		oldmm = tsk->active_mm;
+	} else {
+		is_attached = false;
+		list_for_each_entry(old_avas, &ctx->vases, tsk_link) {
+			if (old_avas->vas->id == tsk->active_vas) {
+				is_attached = true;
+				break;
+			}
+		}
+		if (!is_attached) {
+			WARN(!is_attached, "Could not find the task's active VAS.\n");
+			old_avas = NULL;
+			oldmm = tsk->mm;
+		} else {
+			pr_vas_debug("Switching from VAS - name: %s\n",
+				     old_avas->vas->name);
+			oldmm = old_avas->mm;
+		}
+	}
+
+	vas_context_unlock(ctx);
+
+	/* Check if we are already running on the specified mm. */
+	if (oldmm == nextmm)
+		return 0;
+
+	/*
+	 * Prepare the mm_struct data structure we are switching to. Update the
+	 * mappings for stack, code, data and other recent changes.
+	 */
+	ret = vas_prepare_switch(old_avas, next_avas, tsk);
+	if (ret != 0) {
+		pr_vas_debug("Failed to prepare memory maps for switch\n");
+		return ret;
+	}
+
+	task_lock(tsk);
+
+	/* Perform the actual switch to the new address space. */
+	vmacache_flush(tsk);
+	switch_mm(oldmm, nextmm, tsk);
+
+	tsk->mm = nextmm;
+	tsk->active_mm = nextmm;
+	tsk->active_vas = vid;
+
+	task_unlock(tsk);
+
+	return 0;
+
+out_unlock:
+	vas_context_unlock(ctx);
+
+	return ret;
+}
+
+
+/***
+ * Externally visible functions
+ ***/
+
+int vas_create(const char *name, umode_t mode)
+{
+	struct vas *vas;
+	int ret;
+
+	if (!name)
+		return -EINVAL;
+
+	if (vas_find(name) > 0)
+		return -EEXIST;
+
+	pr_vas_debug("Creating a new VAS - name: %s\n", name);
+
+	/* Allocate and initialize the VAS. */
+	vas = __new_vas();
+	if (!vas)
+		return -ENOMEM;
+
+	if (strscpy(vas->name, name, VAS_MAX_NAME_LENGTH) < 0) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	mutex_init(&vas->mtx);
+
+	ret = init_vas_mm(vas);
+	if (ret != 0)
+		goto out_free;
+
+	vas->refcount = 0;
+
+	INIT_LIST_HEAD(&vas->attaches);
+	spin_lock_init(&vas->share_lock);
+	vas->sharing = 0;
+
+	vas->mode = mode & 0666;
+	vas->uid = current_uid();
+	vas->gid = current_gid();
+
+	ret = vas_insert(vas);
+	if (ret != 0)
+		/*
+		 * We don't need to do anything here. @vas_insert will take
+		 * care of deleting the VAS before returning with an error.
+		 */
+		return ret;
+
+	return vas->id;
+
+out_free:
+	__delete_vas(vas);
+	return ret;
+}
+EXPORT_SYMBOL(vas_create);
+
+struct vas *vas_get(int vid)
+{
+	return vas_lookup(vid);
+}
+EXPORT_SYMBOL(vas_get);
+
+void vas_put(struct vas *vas)
+{
+	if (!vas)
+		return;
+
+	__vas_put(vas);
+}
+EXPORT_SYMBOL(vas_put);
+
+int vas_find(const char *name)
+{
+	struct vas *vas;
+
+	vas = vas_lookup_by_name(name);
+	if (vas) {
+		int vid = vas->id;
+
+		vas_put(vas);
+		return vid;
+	}
+
+	return -ESRCH;
+}
+EXPORT_SYMBOL(vas_find);
+
+int vas_delete(int vid)
+{
+	struct vas *vas;
+	int ret;
+
+	vas = vas_get(vid);
+	if (!vas)
+		return -EINVAL;
+
+	pr_vas_debug("Deleting VAS - name: %s\n", vas->name);
+
+	vas_lock(vas);
+
+	if (vas->refcount != 0) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/* The user needs write permission to the VAS to delete it. */
+	ret = __check_permission(vas->uid, vas->gid, vas->mode, MAY_WRITE);
+	if (ret != 0) {
+		pr_vas_debug("User doesn't have the appropriate permissions to delete the VAS\n");
+		goto out_unlock;
+	}
+
+	vas_unlock(vas);
+
+	vas_remove(vas);
+	vas_put(vas);
+
+	return 0;
+
+out_unlock:
+	vas_unlock(vas);
+	vas_put(vas);
+
+	return ret;
+}
+EXPORT_SYMBOL(vas_delete);
+
+int vas_attach(struct task_struct *tsk, int vid, int type)
+{
+	struct vas_context *ctx;
+	struct vas *vas;
+	struct att_vas *avas;
+	int ret;
+
+	if (!tsk)
+		return -EINVAL;
+
+	ctx = tsk->vas_ctx;
+	type &= (MAY_READ | MAY_WRITE);
+
+	vas = vas_get(vid);
+	if (!vas)
+		return -EINVAL;
+
+	pr_vas_debug("Attaching VAS - name: %s - to task - pid: %d - %s\n",
+		     vas->name, tsk->pid, access_type_str(type));
+
+	vas_lock(vas);
+
+	/*
+	 * Before we can attach the VAS to the task we first have to make some
+	 * sanity checks.
+	 */
+
+	/*
+	 * 1: Check that the user has adequate permissions to attach the VAS in
+	 * the given way.
+	 */
+	ret = __check_permission(vas->uid, vas->gid, vas->mode, type);
+	if (ret != 0) {
+		pr_vas_debug("User doesn't have the appropriate permissions to attach the VAS\n");
+		goto out_unlock;
+	}
+
+	/*
+	 * 2: Check if this VAS is already attached to the given task. If it
+	 * is, there is nothing left to do.
+	 */
+	list_for_each_entry(avas, &vas->attaches, vas_link) {
+		if (avas->tsk == tsk) {
+			pr_vas_debug("VAS is already attached to the task\n");
+			ret = 0;
+			goto out_unlock;
+		}
+	}
+
+	/* 3: Check if we reached the maximum number of shares for this VAS. */
+	if (vas->refcount == VAS_MAX_SHARES) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/*
+	 * All sanity checks are done. We can now safely attach the VAS to the
+	 * given task.
+	 */
+
+	/* Allocate and initialize the attached-VAS data structure. */
+	avas = __new_att_vas();
+	if (!avas) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	ret = init_att_vas_mm(avas, tsk);
+	if (ret != 0)
+		goto out_free_avas;
+
+	avas->vas = vas;
+	avas->tsk = tsk;
+	avas->type = type;
+
+	ret = __vas_attach(avas, tsk, vas);
+	if (ret != 0)
+		goto out_free_avas;
+
+	vas_context_lock(ctx);
+
+	list_add(&avas->tsk_link, &ctx->vases);
+	list_add(&avas->vas_link, &vas->attaches);
+
+	vas_context_unlock(ctx);
+
+	ret = 0;
+
+out_unlock:
+	vas_unlock(vas);
+	vas_put(vas);
+
+	return ret;
+
+out_free_avas:
+	__delete_att_vas(avas);
+	goto out_unlock;
+}
+EXPORT_SYMBOL(vas_attach);
+
+int vas_detach(struct task_struct *tsk, int vid)
+{
+	struct vas_context *ctx;
+	struct vas *vas;
+	struct att_vas *avas;
+	bool is_attached;
+	int ret;
+
+	if (!tsk)
+		return -EINVAL;
+
+	ctx = tsk->vas_ctx;
+
+	task_lock(tsk);
+	vas_context_lock(ctx);
+
+	is_attached = false;
+	list_for_each_entry(avas, &ctx->vases, tsk_link) {
+		if (avas->vas->id == vid) {
+			is_attached = true;
+			break;
+		}
+	}
+	if (!is_attached) {
+		pr_vas_debug("VAS is not attached to the given task\n");
+		ret = -EINVAL;
+		goto out_unlock_tsk;
+	}
+
+	vas = avas->vas;
+
+	/*
+	 * Make sure that the VAS cannot be deleted while we are still
+	 * working with it.
+	 */
+	__vas_get(vas);
+
+	pr_vas_debug("Detaching VAS - name: %s - from task - pid: %d\n",
+		     vas->name, tsk->pid);
+
+	/*
+	 * Before we can detach the VAS from the task we have to perform some
+	 * sanity checks.
+	 */
+
+	/*
+	 * 1: Check that the VAS we want to detach is not the task's currently
+	 * active VAS. An active VAS must not be detached; the user first has
+	 * to switch away from it.
+	 */
+	if (tsk->active_vas == vid) {
+		pr_vas_debug("VAS is currently in use by the task\n");
+		ret = -EBUSY;
+		goto out_put_vas;
+	}
+
+	/*
+	 * We are done with the sanity checks. It is now safe to detach the VAS
+	 * from the given task.
+	 */
+	list_del(&avas->tsk_link);
+
+	vas_context_unlock(ctx);
+	task_unlock(tsk);
+
+	vas_lock(vas);
+
+	list_del(&avas->vas_link);
+
+	ret = __vas_detach(avas, tsk, vas);
+	if (ret != 0)
+		goto out_reinsert;
+
+	__delete_att_vas(avas);
+
+	vas_unlock(vas);
+	__vas_put(vas);
+
+	return 0;
+
+out_reinsert:
+	vas_context_lock(ctx);
+
+	list_add(&avas->tsk_link, &ctx->vases);
+	list_add(&avas->vas_link, &vas->attaches);
+
+	vas_context_unlock(ctx);
+	vas_unlock(vas);
+	__vas_put(vas);
+
+	return ret;
+
+out_put_vas:
+	__vas_put(vas);
+
+out_unlock_tsk:
+	vas_context_unlock(ctx);
+	task_unlock(tsk);
+
+	return ret;
+}
+EXPORT_SYMBOL(vas_detach);
+
+int vas_switch(struct task_struct *tsk, int vid)
+{
+	if (!tsk)
+		return -EINVAL;
+
+	return __vas_switch(tsk, vid);
+}
+EXPORT_SYMBOL(vas_switch);
+
+int vas_getattr(int vid, struct vas_attr *attr)
+{
+	struct vas *vas;
+	struct user_namespace *ns = current_user_ns();
+
+	if (!attr)
+		return -EINVAL;
+
+	vas = vas_get(vid);
+	if (!vas)
+		return -EINVAL;
+
+	pr_vas_debug("Getting attributes for VAS - name: %s\n", vas->name);
+
+	vas_lock(vas);
+
+	memset(attr, 0, sizeof(struct vas_attr));
+	attr->mode = vas->mode;
+	attr->user = from_kuid(ns, vas->uid);
+	attr->group = from_kgid(ns, vas->gid);
+
+	vas_unlock(vas);
+	vas_put(vas);
+
+	return 0;
+}
+EXPORT_SYMBOL(vas_getattr);
+
+int vas_setattr(int vid, struct vas_attr *attr)
+{
+	struct vas *vas;
+	struct user_namespace *ns = current_user_ns();
+	int ret;
+
+	if (!attr)
+		return -EINVAL;
+
+	vas = vas_get(vid);
+	if (!vas)
+		return -EINVAL;
+
+	pr_vas_debug("Setting attributes for VAS - name: %s\n", vas->name);
+
+	vas_lock(vas);
+
+	/* The user needs write permission to change attributes for the VAS. */
+	ret = __check_permission(vas->uid, vas->gid, vas->mode, MAY_WRITE);
+	if (ret != 0) {
+		pr_vas_debug("User doesn't have the appropriate permissions to set attributes for the VAS\n");
+		goto out_unlock;
+	}
+
+	vas->mode = attr->mode & 0666;
+	vas->uid = make_kuid(ns, attr->user);
+	vas->gid = make_kgid(ns, attr->group);
+
+	ret = 0;
+
+out_unlock:
+	vas_unlock(vas);
+	vas_put(vas);
+
+	return ret;
+}
+EXPORT_SYMBOL(vas_setattr);
+
+void __init vas_init(void)
+{
+	/* Create the SLAB caches for our data structures. */
+	vas_cachep = KMEM_CACHE(vas, SLAB_PANIC|SLAB_NOTRACK);
+	att_vas_cachep = KMEM_CACHE(att_vas, SLAB_PANIC|SLAB_NOTRACK);
+	vas_context_cachep = KMEM_CACHE(vas_context, SLAB_PANIC|SLAB_NOTRACK);
+
+	/* Initialize the internal management data structures. */
+	idr_init(&vases);
+	spin_lock_init(&vases_lock);
+
+	/* Initialize the placeholder variables. */
+	INVALID_VAS = __new_vas();
+
+	/* Initialize the VAS context of the init task. */
+	vas_clone(0, &init_task);
+}
+
+/*
+ * We need to use a postcore_initcall to initialize the sysfs directories,
+ * because the 'sys/kernel' directory will be initialized in a core_initcall.
+ * Hence, we have to queue the initialization of the VAS sysfs directories after
+ * this.
+ */
+static int __init vas_sysfs_init(void)
+{
+	/* Setup the sysfs base directories. */
+	vases_kset = kset_create_and_add("vas", NULL, kernel_kobj);
+	if (!vases_kset) {
+		pr_err("Failed to initialize the VAS sysfs directory\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+postcore_initcall(vas_sysfs_init);
+
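+/**
+ * Set up the VAS context for a newly forked task. If the new task shares its
+ * memory with the parent (CLONE_VM), it also shares the parent's VAS context;
+ * otherwise it gets a fresh, empty one.
+ **/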
+int vas_clone(int clone_flags, struct task_struct *tsk)
+{
+	int ret = 0;
+
+	struct vas_context *ctx;
+
+	if (clone_flags & CLONE_VM) {
+		ctx = current->vas_ctx;
+
+		pr_vas_debug("Copy VAS context (%p -- %d) for task - %p - from task - %p\n",
+			     ctx, ctx->refcount, tsk, current);
+
+		vas_context_lock(ctx);
+		ctx->refcount++;
+		vas_context_unlock(ctx);
+	} else {
+		pr_vas_debug("Create a new VAS context for task - %p\n",
+			     tsk);
+
+		ctx = __new_vas_context();
+		if (!ctx) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		spin_lock_init(&ctx->lock);
+		ctx->refcount = 1;
+		INIT_LIST_HEAD(&ctx->vases);
+	}
+
+	tsk->vas_ctx = ctx;
+
+out:
+	return ret;
+}
+
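+/**
+ * Tear down the VAS state of an exiting task: switch back to the original
+ * memory map if necessary, drop the task's reference to its VAS context and,
+ * if this was the last reference, detach all still-attached VAS and delete
+ * the context.
+ **/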
+void vas_exit(struct task_struct *tsk)
+{
+	struct vas_context *ctx = tsk->vas_ctx;
+
+	if (tsk->active_vas != 0) {
+		int error;
+
+		pr_vas_debug("Switch to original MM before exit for task - %p\n",
+			     tsk);
+
+		error = __vas_switch(tsk, 0);
+		if (error != 0)
+			pr_alert("Switching back to original MM failed with %d\n",
+				 error);
+	}
+
+	pr_vas_debug("Exiting VAS context (%p -- %d) for task - %p\n", ctx,
+		     ctx->refcount, tsk);
+
+	vas_context_lock(ctx);
+
+	ctx->refcount--;
+	tsk->vas_ctx = NULL;
+
+	vas_context_unlock(ctx);
+
+	if (ctx->refcount == 0) {
+		/*
+		 * We have to detach all the VAS that are still attached to this
+		 * VAS context before it is safe to delete it. There is no need
+		 * to hold the lock while doing this since we are the last one
+		 * holding a reference to this particular VAS context.
+		 */
+		struct att_vas *avas, *s_avas;
+
+		list_for_each_entry_safe(avas, s_avas, &ctx->vases, tsk_link) {
+			struct vas *vas = avas->vas;
+			int error;
+
+			pr_vas_debug("Detaching VAS - name: %s - from exiting task - pid: %d\n",
+				     vas->name, tsk->pid);
+
+			/*
+			 * Make sure the VAS is not deleted while we are
+			 * still working with it.
+			 */
+			__vas_get(vas);
+
+			vas_lock(vas);
+
+			error = __vas_detach(avas, tsk, vas);
+			if (error != 0)
+				pr_alert("Detaching VAS from task failed with %d\n",
+					 error);
+
+			list_del(&avas->tsk_link);
+			list_del(&avas->vas_link);
+			__delete_att_vas(avas);
+
+			vas_unlock(vas);
+			__vas_put(vas);
+		}
+
+		/*
+		 * All the attached VAS are detached. Now it is safe to remove
+		 * this VAS context.
+		 */
+		__delete_vas_context(ctx);
+
+		pr_vas_debug("Deleted VAS context\n");
+	}
+}
+
+/***
+ * System Calls
+ ***/
+
+SYSCALL_DEFINE2(vas_create, const char __user *, name, umode_t, mode)
+{
+	char vas_name[VAS_MAX_NAME_LENGTH];
+	long len;
+
+	if (!name)
+		return -EINVAL;
+
+	len = strncpy_from_user(vas_name, name, VAS_MAX_NAME_LENGTH);
+	if (len < 0)
+		return -EFAULT;
+	if (len == VAS_MAX_NAME_LENGTH)
+		return -EINVAL;
+
+	return vas_create(vas_name, mode);
+}
+
+SYSCALL_DEFINE1(vas_delete, int, vid)
+{
+	if (vid < 0)
+		return -EINVAL;
+
+	return vas_delete(vid);
+}
+
+SYSCALL_DEFINE1(vas_find, const char __user *, name)
+{
+	char vas_name[VAS_MAX_NAME_LENGTH];
+	long len;
+
+	if (!name)
+		return -EINVAL;
+
+	len = strncpy_from_user(vas_name, name, VAS_MAX_NAME_LENGTH);
+	if (len < 0)
+		return -EFAULT;
+	if (len == VAS_MAX_NAME_LENGTH)
+		return -EINVAL;
+
+	return vas_find(vas_name);
+}
+
+SYSCALL_DEFINE3(vas_attach, pid_t, pid, int, vid, int, type)
+{
+	struct task_struct *tsk;
+	int vas_acc_type;
+
+	if (pid < 0 || vid < 0)
+		return -EINVAL;
+
+	tsk = pid == 0 ? current : find_task_by_vpid(pid);
+	if (!tsk)
+		return -ESRCH;
+
+	vas_acc_type = __build_vas_access_type(type);
+	if (vas_acc_type == -1)
+		return -EINVAL;
+
+	return vas_attach(tsk, vid, vas_acc_type);
+}
+
+SYSCALL_DEFINE2(vas_detach, pid_t, pid, int, vid)
+{
+	struct task_struct *tsk;
+
+	if (pid < 0 || vid < 0)
+		return -EINVAL;
+
+	tsk = pid == 0 ? current : find_task_by_vpid(pid);
+	if (!tsk)
+		return -ESRCH;
+
+	return vas_detach(tsk, vid);
+}
+
+SYSCALL_DEFINE1(vas_switch, int, vid)
+{
+	struct task_struct *tsk = current;
+
+	if (vid < 0)
+		return -EINVAL;
+
+	return vas_switch(tsk, vid);
+}
+
+SYSCALL_DEFINE0(active_vas)
+{
+	struct task_struct *tsk = current;
+
+	return tsk->active_vas;
+}
+
+SYSCALL_DEFINE2(vas_getattr, int, vid, struct vas_attr __user *, uattr)
+{
+	struct vas_attr attr;
+	int ret;
+
+	if (vid < 0 || !uattr)
+		return -EINVAL;
+
+	ret = vas_getattr(vid, &attr);
+	if (ret != 0)
+		return ret;
+
+	if (copy_to_user(uattr, &attr, sizeof(struct vas_attr)) != 0)
+		return -EFAULT;
+
+	return 0;
+}
+
+SYSCALL_DEFINE2(vas_setattr, int, vid, struct vas_attr __user *, uattr)
+{
+	struct vas_attr attr;
+
+	if (vid < 0 || !uattr)
+		return -EINVAL;
+
+	if (copy_from_user(&attr, uattr, sizeof(struct vas_attr)) != 0)
+		return -EFAULT;
+
+	return vas_setattr(vid, &attr);
+}
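
For illustration, here is a minimal user-space sketch of the interface added by
this patch (not part of the patch itself; it assumes the __NR_vas_* numbers
introduced by this series are available through the installed headers, that no
libc wrappers exist, and it shortens error handling):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	long vid;

	/* Create a new, empty first class virtual address space. */
	vid = syscall(__NR_vas_create, "example-vas", 0600);
	if (vid < 0) {
		perror("vas_create");
		return 1;
	}

	/* Attach it read-write to the calling process (pid 0 == current). */
	if (syscall(__NR_vas_attach, 0, vid, O_RDWR) < 0) {
		perror("vas_attach");
		return 1;
	}

	/* Switch into the VAS; mappings created now belong to the VAS. */
	if (syscall(__NR_vas_switch, vid) < 0) {
		perror("vas_switch");
		return 1;
	}

	/* Switch back to the original address space (vid 0). */
	syscall(__NR_vas_switch, 0);

	/* Detach and delete the VAS again. */
	syscall(__NR_vas_detach, 0, vid);
	syscall(__NR_vas_delete, vid);

	return 0;
}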
-- 
2.12.0


* [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (9 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 10/13] mm: Introduce first class virtual address spaces Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:27   ` Matthew Wilcox
  2017-03-13 22:14 ` [RFC PATCH 12/13] mm/vas: Add lazy-attach support for first class virtual address spaces Till Smejkal
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

VAS segments are an extension to first class virtual address spaces that
can be used to share specific memory regions between multiple first class
virtual address spaces. VAS segments have a specific size and position in a
virtual address space and can thereby be used to share in-memory
pointer-based data structures between multiple address spaces as well as other
in-memory data without the need to represent them in mmap-able files or
use shmem.

Similar to first class virtual address spaces, VAS segments must be created
and destroyed explicitly by a user. The system will never automatically
destroy or create a VAS segment. By attaching a VAS segment to a first
class virtual address space, the memory that is contained in the VAS
segment can be accessed and changed.
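
As an illustration of the intended use, a minimal user-space sketch might look
as follows (hypothetical example, not part of the patch; it assumes the
__NR_vas_seg_* numbers added below, that the attach type follows the same
O_RDONLY/O_RDWR convention as vas_attach, and it omits error handling):

#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	/* Example values: a 2 MiB segment at a fixed virtual address. */
	unsigned long start = 0x700000000000UL;
	unsigned long end = start + (2UL << 20);
	long sid, producer, consumer;

	/* Create the shareable region and two address spaces. */
	sid = syscall(__NR_vas_seg_create, "shared-data", start, end, 0600);
	producer = syscall(__NR_vas_create, "producer", 0600);
	consumer = syscall(__NR_vas_create, "consumer", 0600);

	/* The producer VAS gets the segment writable, the consumer read-only. */
	syscall(__NR_vas_seg_attach, producer, sid, O_RDWR);
	syscall(__NR_vas_seg_attach, consumer, sid, O_RDONLY);

	/*
	 * Processes that attach and switch to these VAS now see the segment
	 * at the same fixed address, so pointers into it remain valid across
	 * address spaces.
	 */
	return 0;
}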

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    7 +
 arch/x86/entry/syscalls/syscall_64.tbl |    7 +
 include/linux/syscalls.h               |   10 +
 include/linux/vas.h                    |  114 +++
 include/linux/vas_types.h              |   91 ++-
 include/uapi/asm-generic/unistd.h      |   16 +-
 include/uapi/linux/vas.h               |   12 +
 kernel/sys_ni.c                        |    7 +
 mm/vas.c                               | 1234 ++++++++++++++++++++++++++++++--
 9 files changed, 1451 insertions(+), 47 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 8c553eef8c44..a4f91d14a856 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,10 @@
 389	i386	active_vas		sys_active_vas
 390	i386	vas_getattr		sys_vas_getattr
 391	i386	vas_setattr		sys_vas_setattr
+392	i386	vas_seg_create		sys_vas_seg_create
+393	i386	vas_seg_delete		sys_vas_seg_delete
+394	i386	vas_seg_find		sys_vas_seg_find
+395	i386	vas_seg_attach		sys_vas_seg_attach
+396	i386	vas_seg_detach		sys_vas_seg_detach
+397	i386	vas_seg_getattr		sys_vas_seg_getattr
+398	i386	vas_seg_setattr		sys_vas_seg_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 72f1f0495710..a0f9503c3d28 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,13 @@
 338	common	active_vas		sys_active_vas
 339	common	vas_getattr		sys_vas_getattr
 340	common	vas_setattr		sys_vas_setattr
+341	common	vas_seg_create		sys_vas_seg_create
+342	common	vas_seg_delete		sys_vas_seg_delete
+343	common	vas_seg_find		sys_vas_seg_find
+344	common	vas_seg_attach		sys_vas_seg_attach
+345	common	vas_seg_detach		sys_vas_seg_detach
+346	common	vas_seg_getattr		sys_vas_seg_getattr
+347	common	vas_seg_setattr		sys_vas_seg_setattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fdea27d37c96..7380dcdc4bc1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -66,6 +66,7 @@ struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
 struct vas_attr;
+struct vas_seg_attr;
 union bpf_attr;
 
 #include <linux/types.h>
@@ -914,4 +915,13 @@ asmlinkage long sys_active_vas(void);
 asmlinkage long sys_vas_getattr(int vid, struct vas_attr __user *attr);
 asmlinkage long sys_vas_setattr(int vid, struct vas_attr __user *attr);
 
+asmlinkage long sys_vas_seg_create(const char __user *name, unsigned long start,
+				   unsigned long end, umode_t mode);
+asmlinkage long sys_vas_seg_delete(int sid);
+asmlinkage long sys_vas_seg_find(const char __user *name);
+asmlinkage long sys_vas_seg_attach(int vid, int sid, int type);
+asmlinkage long sys_vas_seg_detach(int vid, int sid);
+asmlinkage long sys_vas_seg_getattr(int sid, struct vas_seg_attr __user *attr);
+asmlinkage long sys_vas_seg_setattr(int sid, struct vas_seg_attr __user *attr);
+
 #endif
diff --git a/include/linux/vas.h b/include/linux/vas.h
index 6a72e42f96d2..376b9fa1ee27 100644
--- a/include/linux/vas.h
+++ b/include/linux/vas.h
@@ -138,6 +138,120 @@ extern int vas_setattr(int vid, struct vas_attr *attr);
 
 
 /***
+ * Management of VAS segments
+ ***/
+
+/**
+ * Lock and unlock helper for VAS segments.
+ **/
+#define vas_seg_lock(seg) mutex_lock(&(seg)->mtx)
+#define vas_seg_unlock(seg) mutex_unlock(&(seg)->mtx)
+
+/**
+ * Create a new VAS segment.
+ *
+ * @param[in] name:		The name of the new VAS segment.
+ * @param[in] start:		The address where the VAS segment begins.
+ * @param[in] end:		The address where the VAS segment ends.
+ * @param[in] mode:		The access rights for the VAS segment.
+ *
+ * @returns:			The VAS segment ID on success, -ERRNO otherwise.
+ **/
+extern int vas_seg_create(const char *name, unsigned long start,
+			  unsigned long end, umode_t mode);
+
+/**
+ * Get a pointer to a VAS segment data structure.
+ *
+ * @param[in] sid:		The ID of the VAS segment whose data structure
+ *				should be returned.
+ *
+ * @returns:			The pointer to the VAS segment data structure
+ *				on success, or NULL otherwise.
+ **/
+extern struct vas_seg *vas_seg_get(int sid);
+
+/**
+ * Release a reference to a VAS segment data structure taken via @vas_seg_get.
+ *
+ * @param[in] seg:		The pointer to the VAS segment data structure
+ *				whose reference should be released.
+ **/
+extern void vas_seg_put(struct vas_seg *seg);
+
+/**
+ * Get ID of the VAS segment belonging to a given name.
+ *
+ * @param[in] name:		The name of the VAS segment for which the ID
+ *				should be returned.
+ *
+ * @returns:			The VAS segment ID on success, -ERRNO
+ *				otherwise.
+ **/
+extern int vas_seg_find(const char *name);
+
+/**
+ * Delete the given VAS segment again.
+ *
+ * @param[in] id:		The ID of the VAS segment which should be
+ *				deleted.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_seg_delete(int id);
+
+/**
+ * Attach a VAS segment to a VAS.
+ *
+ * @param[in] vid:		The ID of the VAS to which the VAS segment
+ *				should be attached.
+ * @param[in] sid:		The ID of the VAS segment which should be
+ *				attached.
+ * @param[in] type:		The type how the VAS segment should be
+ *				attached.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_seg_attach(int vid, int sid, int type);
+
+/**
+ * Detach a VAS segment from a VAS.
+ *
+ * @param[in] vid:		The ID of the VAS from which the VAS segment
+ *				should be detached.
+ * @param[in] sid:		The ID of the VAS segment which should be
+ *				detached.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_seg_detach(int vid, int sid);
+
+/**
+ * Get attributes of a VAS segment.
+ *
+ * @param[in] sid:		The ID of the VAS segment for which the
+ *				attributes should be returned.
+ * @param[out] attr:		The pointer to the struct where the attributes
+ *				should be saved.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_seg_getattr(int sid, struct vas_seg_attr *attr);
+
+/**
+ * Set attributes of a VAS segment.
+ *
+ * @param[in] sid:		The ID of the VAS segment for which the
+ *				attributes should be updated.
+ * @param[in] attr:		The pointer to the struct containing the new
+ *				attributes.
+ *
+ * @returns:			0 on success, -ERRNO otherwise.
+ **/
+extern int vas_seg_setattr(int sid, struct vas_seg_attr *attr);
+
+
+/***
  * Management of the VAS subsystem
  ***/
 
diff --git a/include/linux/vas_types.h b/include/linux/vas_types.h
index f06bfa9ef729..a5291a18ea07 100644
--- a/include/linux/vas_types.h
+++ b/include/linux/vas_types.h
@@ -24,8 +24,8 @@ struct task_struct;
  * The struct representing a Virtual Address Space (VAS).
  *
  * This data structure contains all the necessary information of a VAS such as
- * its name, ID. It also contains access rights and other management
- * information.
+ * its name and ID, as well as the list of all VAS segments which are attached
+ * to it. It also contains access rights and other management information.
  **/
 struct vas {
 	struct kobject kobj;		/* < the internal kobject that we use *
@@ -38,7 +38,8 @@ struct vas {
 	struct mutex mtx;		/* < lock for parallel access.        */
 
 	struct mm_struct *mm;		/* < a partial memory map containing  *
-					 *   all mappings of this VAS.        */
+					 *   all mappings of this VAS and all *
+					 *   of its attached VAS segments.    */
 
 	struct list_head link;		/* < the link in the global VAS list. */
 	struct rcu_head rcu;		/* < the RCU helper used for          *
@@ -54,6 +55,11 @@ struct vas {
 					 *   of the current sharing state of  *
 					 *   the VAS.                         */
 
+	struct list_head segments;	/* < the list of attached VAS         *
+					 *   segments.                        */
+	u32 nr_segments;		/* < the number of VAS segments       *
+					 *   attached to this VAS.            */
+
 	umode_t mode;			/* < the access rights to this VAS.   */
 	kuid_t uid;			/* < the UID of the owning user of    *
 					 *   this VAS.                        */
@@ -85,4 +91,83 @@ struct att_vas {
 	int type;			/* < the type of attaching (RO/RW).   */
 };
 
+/**
+ * The struct representing a VAS segment.
+ *
+ * A VAS segment is a region in memory. Accordingly, it is very similar to a
+ * vm_area. However, in contrast to a vm_area it can only represent a memory
+ * region, not a file, and it also knows where it is mapped. In addition, VAS
+ * segments also have an ID, a name, access rights and a lock managing the way
+ * they can be shared between multiple VAS.
+ **/
+struct vas_seg {
+	struct kobject kobj;		/* < the internal kobject that we use *
+					 *   for reference counting and sysfs *
+					 *   handling.                        */
+
+	int id;				/* < ID                               */
+	char name[VAS_MAX_NAME_LENGTH];	/* < name                             */
+
+	struct mutex mtx;		/* < lock for parallel access.        */
+
+	unsigned long start;		/* < the virtual address where the    *
+					 *   VAS segment starts.              */
+	unsigned long end;		/* < the virtual address where the    *
+					 *   VAS segment ends.                */
+	unsigned long length;		/* < the size of the VAS segment in   *
+					 *   bytes.                           */
+
+	struct mm_struct *mm;		/* < a partial memory map containing  *
+					 *   all the mappings for this VAS    *
+					 *   segment.                         */
+
+	struct list_head link;		/* < the link in the global VAS       *
+					 *   segment list.                    */
+	struct rcu_head rcu;		/* < the RCU helper used for          *
+					 *   asynchronous VAS segment         *
+					 *   deletion.                        */
+
+	u16 refcount;			/* < how often is the VAS segment     *
+					 *   attached.                        */
+	struct list_head attaches;	/* < the list of VASes which have     *
+					 *   this VAS segment attached.       */
+
+	spinlock_t share_lock;		/* < lock for protecting sharing      *
+					 *   state.                           */
+	u32 sharing;			/* < the variable used to keep track  *
+					 *   of the current sharing state of  *
+					 *   the VAS segment.                 */
+
+	umode_t mode;			/* < the access rights to this VAS    *
+					 *   segment.                         */
+	kuid_t uid;			/* < the UID of the owning user of    *
+					 *   this VAS segment.                */
+	kgid_t gid;			/* < the GID of the owning group of   *
+					 *   this VAS segment.                */
+};
+
+/**
+ * The struct representing a VAS segment being attached to a VAS.
+ *
+ * Since a VAS segment can be attached to multiple VAS, this data structure is
+ * necessary. It forms the connection between the VAS and the VAS segment
+ * itself.
+ **/
+struct att_vas_seg {
+	struct vas_seg *seg;            /* < the reference to the actual VAS  *
+					 *   segment containing all the       *
+					 *   information.                     */
+
+	struct vas *vas;		/* < the reference to the VAS to      *
+					 *   which the VAS segment is         *
+					 *   attached to.                     */
+
+	struct list_head vas_link;	/* < the link in the list managed     *
+					 *   inside the VAS.                  */
+	struct list_head seg_link;	/* < the link in the list managed     *
+					 *   inside the VAS segment.          */
+
+	int type;			/* < the type of attaching (RO/RW).   */
+};
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 35df7d40a443..4014b4bd2f18 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -748,9 +748,23 @@ __SYSCALL(__NR_active_vas, sys_active_vas)
 __SYSCALL(__NR_vas_getattr, sys_vas_getattr)
 #define __NR_vas_setattr 299
 __SYSCALL(__NR_vas_setattr, sys_vas_setattr)
+#define __NR_vas_seg_create 300
+__SYSCALL(__NR_vas_seg_create, sys_vas_seg_create)
+#define __NR_vas_seg_delete 301
+__SYSCALL(__NR_vas_seg_delete, sys_vas_seg_delete)
+#define __NR_vas_seg_find 302
+__SYSCALL(__NR_vas_seg_find, sys_vas_seg_find)
+#define __NR_vas_seg_attach 303
+__SYSCALL(__NR_vas_seg_attach, sys_vas_seg_attach)
+#define __NR_vas_seg_detach 304
+__SYSCALL(__NR_vas_seg_detach, sys_vas_seg_detach)
+#define __NR_vas_seg_getattr 305
+__SYSCALL(__NR_vas_seg_getattr, sys_vas_seg_getattr)
+#define __NR_vas_seg_setattr 306
+__SYSCALL(__NR_vas_seg_setattr, sys_vas_seg_setattr)
 
 #undef __NR_syscalls
-#define __NR_syscalls 300
+#define __NR_syscalls 307
 
 /*
  * All syscalls below here should go away really,
diff --git a/include/uapi/linux/vas.h b/include/uapi/linux/vas.h
index 02f70f88bdcb..a8858b013a44 100644
--- a/include/uapi/linux/vas.h
+++ b/include/uapi/linux/vas.h
@@ -13,4 +13,16 @@ struct vas_attr {
 	__kernel_gid_t group;		/* < the owning group of the VAS.     */
 };
 
+/**
+ * The struct containing attributes of a VAS segment.
+ **/
+struct vas_seg_attr {
+	__kernel_mode_t mode;		/* < the access rights to the VAS     *
+					 *   segment.                         */
+	__kernel_uid_t user;		/* < the owning user of the VAS       *
+					 *   segment.                         */
+	__kernel_gid_t group;		/* < the owning group of the VAS      *
+					 *   segment.                         */
+};
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index f6f83c5ec1a1..659fe96afcfa 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -269,3 +269,10 @@ cond_syscall(sys_vas_switch);
 cond_syscall(sys_active_vas);
 cond_syscall(sys_vas_getattr);
 cond_syscall(sys_vas_setattr);
+cond_syscall(sys_vas_seg_create);
+cond_syscall(sys_vas_seg_delete);
+cond_syscall(sys_vas_seg_find);
+cond_syscall(sys_vas_seg_attach);
+cond_syscall(sys_vas_seg_detach);
+cond_syscall(sys_vas_seg_getattr);
+cond_syscall(sys_vas_seg_setattr);
diff --git a/mm/vas.c b/mm/vas.c
index 447d61e1da79..345b023c21aa 100644
--- a/mm/vas.c
+++ b/mm/vas.c
@@ -61,7 +61,7 @@
 #define VAS_MAX_ID INT_MAX
 
 /**
- * Masks and bits to implement sharing of VAS.
+ * Masks and bits to implement sharing of VAS and VAS segments.
  **/
 #define VAS_SHARE_READABLE (1 << 0)
 #define VAS_SHARE_WRITABLE (1 << 16)
@@ -194,6 +194,8 @@ static void __dump_memory_map(const char *title, struct mm_struct *mm)
 static struct kmem_cache *vas_cachep;
 static struct kmem_cache *att_vas_cachep;
 static struct kmem_cache *vas_context_cachep;
+static struct kmem_cache *seg_cachep;
+static struct kmem_cache *att_seg_cachep;
 
 /**
  * Global management data structures and their associated locks.
@@ -201,16 +203,21 @@ static struct kmem_cache *vas_context_cachep;
 static struct idr vases;
 static spinlock_t vases_lock;
 
+static struct idr vas_segs;
+static spinlock_t vas_segs_lock;
+
 /**
  * The place holder variables that are used to identify to-be-deleted items in
  * our global management data structures.
  **/
 static struct vas *INVALID_VAS;
+static struct vas_seg *INVALID_VAS_SEG;
 
 /**
  * Kernel 'ksets' where all objects will be managed.
  **/
 static struct kset *vases_kset;
+static struct kset *vas_segs_kset;
 
 
 /***
@@ -273,6 +280,40 @@ static inline void __delete_vas_context(struct vas_context *ctx)
 	kmem_cache_free(vas_context_cachep, ctx);
 }
 
+static inline struct vas_seg *__new_vas_seg(void)
+{
+	return kmem_cache_zalloc(seg_cachep, GFP_KERNEL);
+}
+
+static inline void __delete_vas_seg(struct vas_seg *seg)
+{
+	WARN_ON(seg->refcount != 0);
+
+	mutex_destroy(&seg->mtx);
+
+	if (seg->mm)
+		mmput_async(seg->mm);
+	kmem_cache_free(seg_cachep, seg);
+}
+
+static inline void __delete_vas_seg_rcu(struct rcu_head *rp)
+{
+	struct vas_seg *seg = container_of(rp, struct vas_seg, rcu);
+
+	__delete_vas_seg(seg);
+}
+
+static inline struct att_vas_seg *__new_att_vas_seg(void)
+{
+	return kmem_cache_zalloc(att_seg_cachep, GFP_ATOMIC);
+}
+
+static inline void __delete_att_vas_seg(struct att_vas_seg *aseg)
+{
+	kmem_cache_free(att_seg_cachep, aseg);
+}
+
+
 /***
  * Kobject management of data structures
  ***/
@@ -418,6 +459,161 @@ static struct kobj_type vas_ktype = {
 	.default_attrs = vas_default_attr,
 };
 
+/**
+ * Correctly get and put VAS segments.
+ **/
+static inline struct vas_seg *__vas_seg_get(struct vas_seg *seg)
+{
+	return container_of(kobject_get(&seg->kobj), struct vas_seg, kobj);
+}
+
+static inline void __vas_seg_put(struct vas_seg *seg)
+{
+	kobject_put(&seg->kobj);
+}
+
+/**
+ * The sysfs structure we need to handle attributes of a VAS segment.
+ **/
+struct vas_seg_sysfs_attr {
+	struct attribute attr;
+	ssize_t (*show)(struct vas_seg *seg, struct vas_seg_sysfs_attr *ssattr,
+			char *buf);
+	ssize_t (*store)(struct vas_seg *seg, struct vas_seg_sysfs_attr *ssattr,
+			 const char *buf, ssize_t count);
+};
+
+#define VAS_SEG_SYSFS_ATTR(NAME, MODE, SHOW, STORE)			\
+static struct vas_seg_sysfs_attr vas_seg_sysfs_attr_##NAME =		\
+	__ATTR(NAME, MODE, SHOW, STORE)
+
+/**
+ * Functions for all the sysfs operations for VAS segments.
+ **/
+static ssize_t __vas_seg_sysfs_attr_show(struct kobject *kobj,
+					 struct attribute *attr,
+					 char *buf)
+{
+	struct vas_seg *seg;
+	struct vas_seg_sysfs_attr *ssattr;
+
+	seg = container_of(kobj, struct vas_seg, kobj);
+	ssattr = container_of(attr, struct vas_seg_sysfs_attr, attr);
+
+	if (!ssattr->show)
+		return -EIO;
+
+	return ssattr->show(seg, ssattr, buf);
+}
+
+static ssize_t __vas_seg_sysfs_attr_store(struct kobject *kobj,
+					  struct attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct vas_seg *seg;
+	struct vas_seg_sysfs_attr *ssattr;
+
+	seg = container_of(kobj, struct vas_seg, kobj);
+	ssattr = container_of(attr, struct vas_seg_sysfs_attr, attr);
+
+	if (!ssattr->store)
+		return -EIO;
+
+	return ssattr->store(seg, ssattr, buf, count);
+}
+
+/**
+ * The sysfs operations structure for a VAS segment.
+ **/
+static const struct sysfs_ops vas_seg_sysfs_ops = {
+	.show = __vas_seg_sysfs_attr_show,
+	.store = __vas_seg_sysfs_attr_store,
+};
+
+/**
+ * Default attributes of a VAS segment.
+ **/
+static ssize_t __show_vas_seg_name(struct vas_seg *seg,
+				   struct vas_seg_sysfs_attr *ssattr,
+				   char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%s", seg->name);
+}
+VAS_SEG_SYSFS_ATTR(name, 0444, __show_vas_seg_name, NULL);
+
+static ssize_t __show_vas_seg_mode(struct vas_seg *seg,
+				   struct vas_seg_sysfs_attr *ssattr,
+				   char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%#03o", seg->mode);
+}
+VAS_SEG_SYSFS_ATTR(mode, 0444, __show_vas_seg_mode, NULL);
+
+static ssize_t __show_vas_seg_user(struct vas_seg *seg,
+				   struct vas_seg_sysfs_attr *ssattr,
+				   char *buf)
+{
+	struct user_namespace *ns = current_user_ns();
+
+	return scnprintf(buf, PAGE_SIZE, "%d", from_kuid(ns, seg->uid));
+}
+VAS_SEG_SYSFS_ATTR(user, 0444, __show_vas_seg_user, NULL);
+
+static ssize_t __show_vas_seg_group(struct vas_seg *seg,
+				    struct vas_seg_sysfs_attr *ssattr,
+				    char *buf)
+{
+	struct user_namespace *ns = current_user_ns();
+
+	return scnprintf(buf, PAGE_SIZE, "%d", from_kgid(ns, seg->gid));
+}
+VAS_SEG_SYSFS_ATTR(group, 0444, __show_vas_seg_group, NULL);
+
+static ssize_t __show_vas_seg_region(struct vas_seg *seg,
+				     struct vas_seg_sysfs_attr *ssattr,
+				     char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%lx-%lx", seg->start, seg->end);
+}
+VAS_SEG_SYSFS_ATTR(region, 0444, __show_vas_seg_region, NULL);
+
+static struct attribute *vas_seg_default_attr[] = {
+	&vas_seg_sysfs_attr_name.attr,
+	&vas_seg_sysfs_attr_mode.attr,
+	&vas_seg_sysfs_attr_user.attr,
+	&vas_seg_sysfs_attr_group.attr,
+	&vas_seg_sysfs_attr_region.attr,
+	NULL
+};
+
+/**
+ * Function to release the VAS segment after its kobject is gone.
+ **/
+static void __vas_seg_release(struct kobject *kobj)
+{
+	struct vas_seg *seg = container_of(kobj, struct vas_seg, kobj);
+
+	/* Give up the ID in the IDR that was occupied by this VAS segment. */
+	spin_lock(&vas_segs_lock);
+	idr_remove(&vas_segs, seg->id);
+	spin_unlock(&vas_segs_lock);
+
+	/*
+	 * Wait a full RCU grace period before actually deleting the VAS segment
+	 * data structure since we haven't done it earlier.
+	 */
+	call_rcu(&seg->rcu, __delete_vas_seg_rcu);
+}
+
+/**
+ * The ktype data structure representing a VAS segment.
+ **/
+static struct kobj_type vas_seg_ktype = {
+	.sysfs_ops = &vas_seg_sysfs_ops,
+	.release = __vas_seg_release,
+	.default_attrs = vas_seg_default_attr,
+};
+
 
 /***
  * Internally visible functions
@@ -526,8 +722,99 @@ static inline struct vas *vas_lookup_by_name(const char *name)
 	return vas;
 }
 
+/**
+ * Working with the global VAS segments list.
+ **/
+static inline void vas_seg_remove(struct vas_seg *seg)
+{
+	spin_lock(&vas_segs_lock);
+
+	/*
+	 * We only put a to-be-deleted place holder in the IDR at this point.
+	 * See @vas_remove for more details.
+	 */
+	idr_replace(&vas_segs, INVALID_VAS_SEG, seg->id);
+	spin_unlock(&vas_segs_lock);
+
+	/* No need to wait for grace period. See @vas_remove why. */
+	__vas_seg_put(seg);
+}
+
+static inline int vas_seg_insert(struct vas_seg *seg)
+{
+	int ret;
+
+	/* Add the VAS segment in the IDR cache. */
+	spin_lock(&vas_segs_lock);
+
+	ret = idr_alloc(&vas_segs, seg, 1, VAS_MAX_ID, GFP_KERNEL);
+
+	spin_unlock(&vas_segs_lock);
+
+	if (ret < 0) {
+		__delete_vas_seg(seg);
+		return ret;
+	}
+
+	/* Add the remaining data to the VAS segment's data structure. */
+	seg->id = ret;
+	seg->kobj.kset = vas_segs_kset;
+
+	/* Initialize the kobject and add it to the sysfs. */
+	ret = kobject_init_and_add(&seg->kobj, &vas_seg_ktype, NULL,
+				   "%d", seg->id);
+	if (ret != 0) {
+		vas_seg_remove(seg);
+		return ret;
+	}
+
+	kobject_uevent(&seg->kobj, KOBJ_ADD);
+
+	return 0;
+}
+
+static inline struct vas_seg *vas_seg_lookup(int id)
+{
+	struct vas_seg *seg;
+
+	rcu_read_lock();
+
+	seg = idr_find(&vas_segs, id);
+	if (seg == INVALID_VAS_SEG)
+		seg = NULL;
+	if (seg)
+		seg = __vas_seg_get(seg);
+
+	rcu_read_unlock();
+
+	return seg;
+}
+
+static inline struct vas_seg *vas_seg_lookup_by_name(const char *name)
+{
+	struct vas_seg *seg;
+	int id;
+
+	rcu_read_lock();
+
+	idr_for_each_entry(&vas_segs, seg, id) {
+		if (seg == INVALID_VAS_SEG)
+			continue;
+
+		if (strcmp(seg->name, name) == 0)
+			break;
+	}
+
+	if (seg)
+		seg = __vas_seg_get(seg);
+
+	rcu_read_unlock();
+
+	return seg;
+}
+
  /**
-  * Management of the sharing of VAS.
+  * Management of the sharing of VAS and VAS segments.
   **/
 static inline int vas_take_share(int type, struct vas *vas)
 {
@@ -562,6 +849,39 @@ static inline void vas_put_share(int type, struct vas *vas)
 	spin_unlock(&vas->share_lock);
 }
 
+static inline int vas_seg_take_share(int type, struct vas_seg *seg)
+{
+	int ret;
+
+	spin_lock(&seg->share_lock);
+	if (type & MAY_WRITE) {
+		if ((seg->sharing & VAS_SHARE_READ_WRITE_MASK) == 0) {
+			seg->sharing += VAS_SHARE_WRITABLE;
+			ret = 1;
+		} else
+			ret = 0;
+	} else {
+		if ((seg->sharing & VAS_SHARE_WRITE_MASK) == 0) {
+			seg->sharing += VAS_SHARE_READABLE;
+			ret = 1;
+		} else
+			ret = 0;
+	}
+	spin_unlock(&seg->share_lock);
+
+	return ret;
+}
+
+static inline void vas_seg_put_share(int type, struct vas_seg *seg)
+{
+	spin_lock(&seg->share_lock);
+	if (type & MAY_WRITE)
+		seg->sharing -= VAS_SHARE_WRITABLE;
+	else
+		seg->sharing -= VAS_SHARE_READABLE;
+	spin_unlock(&seg->share_lock);
+}
+
 /**
  * Management of the memory maps.
  **/
@@ -609,6 +929,59 @@ static int init_att_vas_mm(struct att_vas *avas, struct task_struct *owner)
 	return 0;
 }
 
+static int init_vas_seg_mm(struct vas_seg *seg)
+{
+	struct mm_struct *mm;
+	unsigned long map_flags, page_prot_flags;
+	vm_flags_t vm_flags;
+	unsigned long map_addr;
+	int ret;
+
+	mm = mm_alloc();
+	if (!mm)
+		return -ENOMEM;
+
+	mm = mm_setup(mm);
+	if (!mm)
+		return -ENOMEM;
+
+	arch_pick_mmap_layout(mm);
+
+	map_flags = MAP_ANONYMOUS | MAP_FIXED;
+	page_prot_flags = PROT_READ | PROT_WRITE;
+	vm_flags = calc_vm_prot_bits(page_prot_flags, 0) |
+		calc_vm_flag_bits(map_flags) | mm->def_flags |
+		VM_DONTEXPAND | VM_DONTCOPY;
+
+	/* Find the possible mapping address for the VAS segment. */
+	map_addr = get_unmapped_area(mm, NULL, seg->start, seg->length,
+				     0, map_flags);
+	if (map_addr != seg->start) {
+		ret = -EFAULT;
+		goto out_free;
+	}
+
+	/* Insert the mapping into the mm_struct of the VAS segment. */
+	map_addr = mmap_region(mm, NULL, seg->start, seg->length,
+			       vm_flags, 0);
+	if (map_addr != seg->start) {
+		ret = -EFAULT;
+		goto out_free;
+	}
+
+	/* Populate the VAS segment's memory region. */
+	mm_populate(mm, seg->start, seg->length);
+
+	/* The mm_struct is properly setup. We are done here. */
+	seg->mm = mm;
+
+	return 0;
+
+out_free:
+	mmput(mm);
+	return ret;
+}
+
 /**
  * Lookup the corresponding vm_area in the referenced memory map.
  *
@@ -1126,61 +1499,200 @@ static int task_unmerge(struct att_vas *avas, struct task_struct *tsk)
 }
 
 /**
- * Attach a VAS to a task -- update internal information ONLY
+ * Merge a VAS segment's memory map into a VAS memory map.
  *
- * Requires that the VAS is already locked.
+ * Requires that the VAS and the VAS segment are already locked.
  *
- * @param[in] avas:	The pointer to the attached-VAS data structure
- *			containing all the information of this attaching.
- * @param[in] tsk:	The pointer to the task to which the VAS should be
- *			attached.
- * @param[in] vas:	The pointer to the VAS which should be attached.
+ * @param[in] vas:	The pointer to the VAS into which the VAS segment should
+ *			be merged.
+ * @param[in] seg:	The pointer to the VAS segment that should be merged.
+ * @param[in] type:	The type of attaching (see attach_segment for more
+ *			information).
  *
- * @returns:		0 on succes, -ERRNO otherwise.
+ * @returns:		0 on success, -ERRNO otherwise.
  **/
-static int __vas_attach(struct att_vas *avas, struct task_struct *tsk,
-			struct vas *vas)
+static int vas_seg_merge(struct vas *vas, struct vas_seg *seg, int type)
 {
+	struct vm_area_struct *vma, *new_vma;
+	struct mm_struct *vas_mm, *seg_mm;
 	int ret;
 
-	/* Before doing anything, synchronize the RSS-stat of the task. */
-	sync_mm_rss(tsk->mm);
+	vas_mm = vas->mm;
+	seg_mm = seg->mm;
 
-	/*
-	 * Try to acquire the VAS share with the proper type. This will ensure
-	 * that the different sharing possibilities of VAS are respected.
-	 */
-	if (!vas_take_share(avas->type, vas)) {
-		pr_vas_debug("VAS is already attached exclusively\n");
-		return -EBUSY;
-	}
+	dump_memory_map("Before VAS MM", vas_mm);
+	dump_memory_map("Before VAS segment MM", seg_mm);
 
-	ret = vas_merge(avas, vas, avas->type);
-	if (ret != 0)
-		goto out_put_share;
+	if (down_write_killable(&vas_mm->mmap_sem))
+		return -EINTR;
+	down_read_nested(&seg_mm->mmap_sem, SINGLE_DEPTH_NESTING);
 
-	ret = task_merge(avas, tsk);
-	if (ret != 0)
-		goto out_put_share;
+	/* Try to copy all VMAs of the VAS segment into the VAS. */
+	for (vma = seg_mm->mmap; vma; vma = vma->vm_next) {
+		unsigned long merged_vm_flags = vma->vm_flags;
 
-	vas->refcount++;
+		pr_vas_debug("Merging a VAS segment memory region (%#lx - %#lx)\n",
+			     vma->vm_start, vma->vm_end);
 
-	return 0;
+		/*
+		 * Remove the writable bit from the vm_flags if the VAS segment
+		 * is attached only readable.
+		 */
+		if (!(type & MAY_WRITE))
+			merged_vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
 
-out_put_share:
-	vas_put_share(avas->type, vas);
-	return ret;
-}
+		new_vma = __copy_vm_area(seg_mm, vma, vas_mm, merged_vm_flags);
+		if (!new_vma) {
+			pr_vas_debug("Failed to merge a VAS segment memory region (%#lx - %#lx)\n",
+				     vma->vm_start, vma->vm_end);
+			ret = -EFAULT;
+			goto out_unlock;
+		}
 
-/**
- * Detach a VAS from a task -- update internal information ONLY
- *
- * Requires that the VAS is already locked.
- *
- * @param[in] avas:	The pointer to the attached-VAS data structure
- *			containing all the information of this attaching.
- * @param[in] tsk:	The pointer to the task from which the VAS should be
- *			detached.
+		/*
+		 * Remember for the VMA that we just added it to the VAS that it
+		 * Remember for the VMA that we just added to the VAS that it
+		 * actually belongs to the VAS segment.
+		new_vma->vas_reference = seg_mm;
+	}
+
+	ret = 0;
+
+out_unlock:
+	up_read(&seg_mm->mmap_sem);
+	up_write(&vas_mm->mmap_sem);
+
+	dump_memory_map("After VAS MM", vas_mm);
+	dump_memory_map("After VAS segment MM", seg_mm);
+
+	return ret;
+}
+
+/**
+ * Unmerge the VAS segment-related parts of a VAS' memory map back into the
+ * VAS segment's memory map.
+ *
+ * Requires that the VAS and the VAS segment are already locked.
+ *
+ * @param[in] vas:	The pointer to the VAS from which the VAS segment
+ *			related data should be taken.
+ * @param[in] seg:	The pointer to the VAS segment for which the memory map
+ *			should be updated again.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int vas_seg_unmerge(struct vas *vas, struct vas_seg *seg)
+{
+	struct vm_area_struct *vma, *next;
+	struct mm_struct *vas_mm, *seg_mm;
+	int ret;
+
+	vas_mm = vas->mm;
+	seg_mm = seg->mm;
+
+	dump_memory_map("Before VAS MM", vas_mm);
+	dump_memory_map("Before VAS segment MM", seg_mm);
+
+	if (down_write_killable(&vas_mm->mmap_sem))
+		return -EINTR;
+	down_write_nested(&seg_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+
+	/* Update all memory regions which belonged to the VAS segment. */
+	for (vma = vas_mm->mmap, next = next_vma_safe(vma); vma;
+	     vma = next, next = next_vma_safe(next)) {
+		struct mm_struct *ref_mm = vma->vas_reference;
+
+		if (ref_mm != seg_mm) {
+			pr_vas_debug("Skipping memory region (%#lx - %#lx) during VAS segment unmerging\n",
+				     vma->vm_start, vma->vm_end);
+			continue;
+		} else {
+			struct vm_area_struct *upd_vma;
+
+			pr_vas_debug("Unmerging a VAS segment memory region (%#lx - %#lx)\n",
+				     vma->vm_start, vma->vm_end);
+
+			upd_vma = __update_vm_area(vas_mm, vma, seg_mm, NULL);
+			if (!upd_vma) {
+				pr_vas_debug("Failed to unmerge a VAS segment memory region (%#lx - %#lx)\n",
+					     vma->vm_start, vma->vm_end);
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+		}
+
+		/* Remove the current VMA from the VAS memory map. */
+		__remove_vm_area(vas_mm, vma);
+	}
+
+	ret = 0;
+
+out_unlock:
+	up_write(&seg_mm->mmap_sem);
+	up_write(&vas_mm->mmap_sem);
+
+	dump_memory_map("After VAS MM", vas_mm);
+	dump_memory_map("After VAS segment MM", seg_mm);
+
+	return ret;
+}
+
+/**
+ * Attach a VAS to a task -- update internal information ONLY
+ *
+ * Requires that the VAS is already locked.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure
+ *			containing all the information of this attaching.
+ * @param[in] tsk:	The pointer to the task to which the VAS should be
+ *			attached.
+ * @param[in] vas:	The pointer to the VAS which should be attached.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int __vas_attach(struct att_vas *avas, struct task_struct *tsk,
+			struct vas *vas)
+{
+	int ret;
+
+	/* Before doing anything, synchronize the RSS-stat of the task. */
+	sync_mm_rss(tsk->mm);
+
+	/*
+	 * Try to acquire the VAS share with the proper type. This will ensure
+	 * that the different sharing possibilities of VAS are respected.
+	 */
+	if (!vas_take_share(avas->type, vas)) {
+		pr_vas_debug("VAS is already attached exclusively\n");
+		return -EBUSY;
+	}
+
+	ret = vas_merge(avas, vas, avas->type);
+	if (ret != 0)
+		goto out_put_share;
+
+	ret = task_merge(avas, tsk);
+	if (ret != 0)
+		goto out_put_share;
+
+	vas->refcount++;
+
+	return 0;
+
+out_put_share:
+	vas_put_share(avas->type, vas);
+	return ret;
+}
+
+/**
+ * Detach a VAS from a task -- update internal information ONLY
+ *
+ * Requires that the VAS is already locked.
+ *
+ * @param[in] avas:	The pointer to the attached-VAS data structure
+ *			containing all the information of this attaching.
+ * @param[in] tsk:	The pointer to the task from which the VAS should be
+ *			detached.
  * @param[in] vas:	The pointer to the VAS which should be detached.
  *
  * @returns:		0 on success, -ERRNO otherwise.
@@ -1209,6 +1721,83 @@ static int __vas_detach(struct att_vas *avas, struct task_struct *tsk,
 	return 0;
 }
 
+/**
+ * Attach a VAS segment to a VAS -- update internal information ONLY
+ *
+ * Requires that the VAS segment and the VAS are already locked.
+ *
+ * @param aseg:		The pointer to the attached VAS segment data structure
+ *			containing all the information of this attaching.
+ * @param vas:		The pointer to the VAS to which the VAS segment should
+ *			be attached.
+ * @param seg:		The pointer to the VAS segment which should be attached.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int __vas_seg_attach(struct att_vas_seg *aseg, struct vas *vas,
+			   struct vas_seg *seg)
+{
+	int ret;
+
+	/*
+	 * Try to acquire the VAS segment share with the proper type. This will
+	 * ensure that the different sharing possibilities of VAS segments are
+	 * respected.
+	 */
+	if (!vas_seg_take_share(aseg->type, seg)) {
+		pr_vas_debug("VAS segment is already attached writable to a VAS\n");
+		return -EBUSY;
+	}
+
+	/* Update the memory map of the VAS. */
+	ret = vas_seg_merge(vas, seg, aseg->type);
+	if (ret != 0)
+		goto out_put_share;
+
+	seg->refcount++;
+	vas->nr_segments++;
+
+	return 0;
+
+out_put_share:
+	vas_seg_put_share(aseg->type, seg);
+	return ret;
+}
+
+/**
+ * Detach a VAS segment from a VAS -- update internal information ONLY
+ *
+ * Requires that the VAS segment and the VAS are already locked.
+ *
+ * @param aseg:		The pointer to the attached VAS segment data structure
+ *			containing all the information of this attaching.
+ * @param vas:		The pointer to the VAS from which the VAS segment should
+ *			be detached.
+ * @param seg:		The pointer to the VAS segment which should be detached.
+ *
+ * @returns:		0 on success, -ERRNO otherwise.
+ **/
+static int __vas_seg_detach(struct att_vas_seg *aseg, struct vas *vas,
+			    struct vas_seg *seg)
+{
+	int ret;
+
+	/* Update the memory maps of the VAS segment and the VAS. */
+	ret = vas_seg_unmerge(vas, seg);
+	if (ret != 0)
+		return ret;
+
+	seg->refcount--;
+	vas->nr_segments--;
+
+	/*
+	 * Release the VAS segment share here so that the sharing semantics
+	 * stay consistent.
+	 */
+	vas_seg_put_share(aseg->type, seg);
+
+	return 0;
+}
+
 static int __sync_from_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
 {
 	struct vm_area_struct *vma;
@@ -1542,6 +2131,9 @@ int vas_create(const char *name, umode_t mode)
 	spin_lock_init(&vas->share_lock);
 	vas->sharing = 0;
 
+	INIT_LIST_HEAD(&vas->segments);
+	vas->nr_segments = 0;
+
 	vas->mode = mode & 0666;
 	vas->uid = current_uid();
 	vas->gid = current_gid();
@@ -1596,6 +2188,7 @@ EXPORT_SYMBOL(vas_find);
 int vas_delete(int vid)
 {
 	struct vas *vas;
+	struct att_vas_seg *aseg, *s_aseg;
 	int ret;
 
 	vas = vas_get(vid);
@@ -1618,6 +2211,39 @@ int vas_delete(int vid)
 		goto out_unlock;
 	}
 
+	/* Detach all still attached VAS segments. */
+	list_for_each_entry_safe(aseg, s_aseg, &vas->segments, vas_link) {
+		struct vas_seg *seg = aseg->seg;
+		int error;
+
+		pr_vas_debug("Detaching VAS segment - name: %s - from to-be-deleted VAS - name: %s\n",
+			     seg->name, vas->name);
+
+		/*
+		 * Make sure that our VAS segment reference is not removed while
+		 * we work with it.
+		 */
+		__vas_seg_get(seg);
+
+		/*
+		 * Since the VAS from which we detach this VAS segment is going
+		 * to be deleted anyway, we can shorten the detaching process.
+		 */
+		vas_seg_lock(seg);
+
+		error = __vas_seg_detach(aseg, vas, seg);
+		if (error != 0)
+			pr_alert("Detaching VAS segment from VAS failed with %d\n",
+				 error);
+
+		list_del(&aseg->seg_link);
+		list_del(&aseg->vas_link);
+		__delete_att_vas_seg(aseg);
+
+		vas_seg_unlock(seg);
+		__vas_seg_put(seg);
+	}
+
 	vas_unlock(vas);
 
 	vas_remove(vas);
@@ -1908,19 +2534,433 @@ int vas_setattr(int vid, struct vas_attr *attr)
 }
 EXPORT_SYMBOL(vas_setattr);
 
+int vas_seg_create(const char *name, unsigned long start, unsigned long end,
+		   umode_t mode)
+{
+	struct vas_seg *seg;
+	int ret;
+
+	if (!name || !PAGE_ALIGNED(start) || !PAGE_ALIGNED(end) ||
+	    (end <= start))
+		return -EINVAL;
+
+	if (vas_seg_find(name) > 0)
+		return -EEXIST;
+
+	pr_vas_debug("Creating a new VAS segment - name: %s start: %#lx end: %#lx\n",
+		     name, start, end);
+
+	/* Allocate and initialize the VAS segment. */
+	seg = __new_vas_seg();
+	if (!seg)
+		return -ENOMEM;
+
+	if (strscpy(seg->name, name, VAS_MAX_NAME_LENGTH) < 0) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
+	mutex_init(&seg->mtx);
+
+	seg->start = start;
+	seg->end = end;
+	seg->length = end - start;
+
+	ret = init_vas_seg_mm(seg);
+	if (ret != 0)
+		goto out_free;
+
+	seg->refcount = 0;
+
+	INIT_LIST_HEAD(&seg->attaches);
+	spin_lock_init(&seg->share_lock);
+	seg->sharing = 0;
+
+	seg->mode = mode & 0666;
+	seg->uid = current_uid();
+	seg->gid = current_gid();
+
+	ret = vas_seg_insert(seg);
+	if (ret != 0)
+		/*
+		 * We don't need to free anything here. @vas_seg_insert will
+		 * take care of the deletion if something went wrong.
+		 */
+		return ret;
+
+	return seg->id;
+
+out_free:
+	__delete_vas_seg(seg);
+	return ret;
+}
+EXPORT_SYMBOL(vas_seg_create);
+
+struct vas_seg *vas_seg_get(int sid)
+{
+	return vas_seg_lookup(sid);
+}
+EXPORT_SYMBOL(vas_seg_get);
+
+void vas_seg_put(struct vas_seg *seg)
+{
+	if (!seg)
+		return;
+
+	return __vas_seg_put(seg);
+}
+EXPORT_SYMBOL(vas_seg_put);
+
+int vas_seg_find(const char *name)
+{
+	struct vas_seg *seg;
+
+	seg = vas_seg_lookup_by_name(name);
+	if (seg) {
+		int sid = seg->id;
+
+		vas_seg_put(seg);
+		return sid;
+	}
+
+	return -ESRCH;
+}
+EXPORT_SYMBOL(vas_seg_find);
+
+int vas_seg_delete(int id)
+{
+	struct vas_seg *seg;
+	int ret;
+
+	seg = vas_seg_get(id);
+	if (!seg)
+		return -EINVAL;
+
+	pr_vas_debug("Deleting VAS segment - name: %s\n", seg->name);
+
+	vas_seg_lock(seg);
+
+	if (seg->refcount != 0) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/* The user needs write permission to the VAS segment to delete it. */
+	ret = __check_permission(seg->uid, seg->gid, seg->mode, MAY_WRITE);
+	if (ret != 0) {
+		pr_vas_debug("User doesn't have the appropriate permissions to delete the VAS segment\n");
+		goto out_unlock;
+	}
+
+	vas_seg_unlock(seg);
+
+	vas_seg_remove(seg);
+	vas_seg_put(seg);
+
+	return 0;
+
+out_unlock:
+	vas_seg_unlock(seg);
+	vas_seg_put(seg);
+
+	return ret;
+}
+EXPORT_SYMBOL(vas_seg_delete);
+
+int vas_seg_attach(int vid, int sid, int type)
+{
+	struct vas *vas;
+	struct vas_seg *seg;
+	struct att_vas_seg *aseg;
+	int ret;
+
+	type &= (MAY_READ | MAY_WRITE);
+
+	vas = vas_get(vid);
+	if (!vas)
+		return -EINVAL;
+
+	seg = vas_seg_get(sid);
+	if (!seg) {
+		vas_put(vas);
+		return -EINVAL;
+	}
+
+	pr_vas_debug("Attaching VAS segment - name: %s - to VAS - name: %s - %s\n",
+		     seg->name, vas->name, access_type_str(type));
+
+	vas_lock(vas);
+	vas_seg_lock(seg);
+
+	/*
+	 * Before we can attach the VAS segment to the VAS we have to make some
+	 * sanity checks.
+	 */
+
+	/*
+	 * 1: Check that the user has adequate permissions to attach the VAS
+	 * segment in the given way.
+	 */
+	ret = __check_permission(seg->uid, seg->gid, seg->mode, type);
+	if (ret != 0) {
+		pr_vas_debug("User doesn't have the appropriate permissions to attach the VAS segment\n");
+		goto out_unlock;
+	}
+
+	/*
+	 * 2: The user needs write permission to the VAS to attach a VAS segment
+	 * to it. Check that this requirement is fulfilled.
+	 */
+	ret = __check_permission(vas->uid, vas->gid, vas->mode, MAY_WRITE);
+	if (ret != 0) {
+		pr_vas_debug("User doesn't have the appropriate permissions on the VAS to attach the VAS segment\n");
+		goto out_unlock;
+	}
+
+
+	/*
+	 * 3: Check if the VAS is attached to a process. We do not support
+	 * changes to an attached VAS. A VAS must not be attached to a process
+	 * to be able to make changes to it. This ensures that the page tables
+	 * are always properly initialized.
+	 */
+	if (vas->refcount != 0) {
+		pr_vas_debug("VAS is attached to a process\n");
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/*
+	 * 4: Check if the VAS segment is already attached to this particular
+	 * VAS. Double-attaching would lead to unintended behavior.
+	 */
+	list_for_each_entry(aseg, &seg->attaches, seg_link) {
+		if (aseg->vas == vas) {
+			pr_vas_debug("VAS segment is already attached to the VAS\n");
+			ret = 0;
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * 5: Check if we reached the maximum number of shares for this VAS
+	 * segment.
+	 */
+	if (seg->refcount == VAS_MAX_SHARES) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/*
+	 * All sanity checks are done. It is safe to attach this VAS segment to
+	 * the VAS now.
+	 */
+
+	/* Allocate and initialize the attached VAS segment data structure. */
+	aseg = __new_att_vas_seg();
+	if (!aseg) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	aseg->seg = seg;
+	aseg->vas = vas;
+	aseg->type = type;
+
+	ret = __vas_seg_attach(aseg, vas, seg);
+	if (ret != 0)
+		goto out_free_aseg;
+
+	list_add(&aseg->vas_link, &vas->segments);
+	list_add(&aseg->seg_link, &seg->attaches);
+
+	ret = 0;
+
+out_unlock:
+	vas_seg_unlock(seg);
+	vas_seg_put(seg);
+
+	vas_unlock(vas);
+	vas_put(vas);
+
+	return ret;
+
+out_free_aseg:
+	__delete_att_vas_seg(aseg);
+	goto out_unlock;
+}
+EXPORT_SYMBOL(vas_seg_attach);
+
+int vas_seg_detach(int vid, int sid)
+{
+	struct vas *vas;
+	struct vas_seg *seg;
+	struct att_vas_seg *aseg;
+	bool is_attached;
+	int ret;
+
+	vas = vas_get(vid);
+	if (!vas)
+		return -EINVAL;
+
+	vas_lock(vas);
+
+	is_attached = false;
+	list_for_each_entry(aseg, &vas->segments, vas_link) {
+		if (aseg->seg->id == sid) {
+			is_attached = true;
+			break;
+		}
+	}
+	if (!is_attached) {
+		pr_vas_debug("VAS segment is not attached to the given VAS\n");
+		ret = -EINVAL;
+		goto out_unlock_vas;
+	}
+
+	seg = aseg->seg;
+
+	/*
+	 * Make sure that our reference to the VAS segment is not deleted while
+	 * we are working with it.
+	 */
+	__vas_seg_get(seg);
+
+	vas_seg_lock(seg);
+
+	pr_vas_debug("Detaching VAS segment - name: %s - from VAS - name: %s\n",
+		     seg->name, vas->name);
+
+	/*
+	 * Before we can detach the VAS segment from the VAS we have to do some
+	 * sanity checks.
+	 */
+
+	/*
+	 * 1: Check if the VAS is attached to a process. We do not support
+	 * changes to an attached VAS. A VAS must not be attached to a process
+	 * to be able to make changes to it. This ensures that the page tables
+	 * are always properly initialized.
+	 */
+	if (vas->refcount != 0) {
+		pr_vas_debug("VAS is attached to a process\n");
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/*
+	 * All sanity checks are done. It is safe to detach the VAS segment from
+	 * the VAS now.
+	 */
+	ret = __vas_seg_detach(aseg, vas, seg);
+	if (ret != 0)
+		goto out_unlock;
+
+	list_del(&aseg->seg_link);
+	list_del(&aseg->vas_link);
+	__delete_att_vas_seg(aseg);
+
+	ret = 0;
+
+out_unlock:
+	vas_seg_unlock(seg);
+	__vas_seg_put(seg);
+
+out_unlock_vas:
+	vas_unlock(vas);
+	vas_put(vas);
+
+	return ret;
+}
+EXPORT_SYMBOL(vas_seg_detach);
+
+int vas_seg_getattr(int sid, struct vas_seg_attr *attr)
+{
+	struct vas_seg *seg;
+	struct user_namespace *ns = current_user_ns();
+
+	if (!attr)
+		return -EINVAL;
+
+	seg = vas_seg_get(sid);
+	if (!seg)
+		return -EINVAL;
+
+	pr_vas_debug("Getting attributes for VAS segment - name: %s\n",
+		     seg->name);
+
+	vas_seg_lock(seg);
+
+	memset(attr, 0, sizeof(struct vas_seg_attr));
+	attr->mode = seg->mode;
+	attr->user = from_kuid(ns, seg->uid);
+	attr->group = from_kgid(ns, seg->gid);
+
+	vas_seg_unlock(seg);
+	vas_seg_put(seg);
+
+	return 0;
+}
+EXPORT_SYMBOL(vas_seg_getattr);
+
+int vas_seg_setattr(int sid, struct vas_seg_attr *attr)
+{
+	struct vas_seg *seg;
+	struct user_namespace *ns = current_user_ns();
+	int ret;
+
+	if (!attr)
+		return -EINVAL;
+
+	seg = vas_seg_get(sid);
+	if (!seg)
+		return -EINVAL;
+
+	pr_vas_debug("Setting attributes for VAS segment - name: %s\n",
+		     seg->name);
+
+	vas_seg_lock(seg);
+
+	/*
+	 * The user needs write permission to change attributes for the
+	 * VAS segment.
+	 */
+	ret = __check_permission(seg->uid, seg->gid, seg->mode, MAY_WRITE);
+	if (ret != 0) {
+		pr_vas_debug("User doesn't have the appropriate permissions to set attributes for the VAS segment\n");
+		goto out_unlock;
+	}
+
+	seg->mode = attr->mode & 0666;
+	seg->uid = make_kuid(ns, attr->user);
+	seg->gid = make_kgid(ns, attr->group);
+
+	ret = 0;
+
+out_unlock:
+	vas_seg_unlock(seg);
+	vas_seg_put(seg);
+
+	return ret;
+}
+EXPORT_SYMBOL(vas_seg_setattr);
+
 void __init vas_init(void)
 {
 	/* Create the SLAB caches for our data structures. */
 	vas_cachep = KMEM_CACHE(vas, SLAB_PANIC|SLAB_NOTRACK);
 	att_vas_cachep = KMEM_CACHE(att_vas, SLAB_PANIC|SLAB_NOTRACK);
 	vas_context_cachep = KMEM_CACHE(vas_context, SLAB_PANIC|SLAB_NOTRACK);
+	seg_cachep = KMEM_CACHE(vas_seg, SLAB_PANIC|SLAB_NOTRACK);
+	att_seg_cachep = KMEM_CACHE(att_vas_seg, SLAB_PANIC|SLAB_NOTRACK);
 
 	/* Initialize the internal management data structures. */
 	idr_init(&vases);
 	spin_lock_init(&vases_lock);
 
+	idr_init(&vas_segs);
+	spin_lock_init(&vas_segs_lock);
+
 	/* Initialize the place holder variables. */
 	INVALID_VAS = __new_vas();
+	INVALID_VAS_SEG = __new_vas_seg();
 
 	/* Initialize the VAS context of the init task. */
 	vas_clone(0, &init_task);
@@ -1941,6 +2981,12 @@ static int __init vas_sysfs_init(void)
 		return -ENOMEM;
 	}
 
+	vas_segs_kset = kset_create_and_add("vas_segs", NULL, kernel_kobj);
+	if (!vas_segs_kset) {
+		pr_err("Failed to initialize the VAS segment sysfs directory\n");
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 postcore_initcall(vas_sysfs_init);
@@ -2186,3 +3232,105 @@ SYSCALL_DEFINE2(vas_setattr, int, vid, struct vas_attr __user *, uattr)
 
 	return vas_setattr(vid, &attr);
 }
+
+SYSCALL_DEFINE4(vas_seg_create, const char __user *, name, unsigned long, begin,
+		unsigned long, end, umode_t, mode)
+{
+	char seg_name[VAS_MAX_NAME_LENGTH];
+	int len;
+
+	if (!name)
+		return -EINVAL;
+
+	len = strncpy_from_user(seg_name, name, VAS_MAX_NAME_LENGTH);
+	if (len < 0)
+		return -EFAULT;
+	if (len == VAS_MAX_NAME_LENGTH)
+		return -EINVAL;
+
+	return vas_seg_create(seg_name, begin, end, mode);
+}
+
+SYSCALL_DEFINE1(vas_seg_delete, int, id)
+{
+	if (id < 0)
+		return -EINVAL;
+
+	return vas_seg_delete(id);
+}
+
+SYSCALL_DEFINE1(vas_seg_find, const char __user *, name)
+{
+	char seg_name[VAS_MAX_NAME_LENGTH];
+	int len;
+
+	if (!name)
+		return -EINVAL;
+
+	len = strncpy_from_user(seg_name, name, VAS_MAX_NAME_LENGTH);
+	if (len < 0)
+		return -EFAULT;
+	if (len == VAS_MAX_NAME_LENGTH)
+		return -EINVAL;
+
+	return vas_seg_find(seg_name);
+}
+
+SYSCALL_DEFINE3(vas_seg_attach, int, vid, int, sid, int, type)
+{
+	int vas_acc_type;
+
+	if (vid < 0 || sid < 0)
+		return -EINVAL;
+
+	vas_acc_type = __build_vas_access_type(type);
+	if (vas_acc_type == -1)
+		return -EINVAL;
+
+	return vas_seg_attach(vid, sid, vas_acc_type);
+}
+
+SYSCALL_DEFINE2(vas_seg_detach, int, vid, int, sid)
+{
+	if (vid < 0 || sid < 0)
+		return -EINVAL;
+
+	return vas_seg_detach(vid, sid);
+}
+
+SYSCALL_DEFINE2(vas_seg_getattr, int, sid, struct vas_seg_attr __user *, uattr)
+{
+	struct vas_seg_attr attr;
+	int ret;
+
+	if (sid < 0 || !uattr)
+		return -EINVAL;
+
+	ret = vas_seg_getattr(sid, &attr);
+	if (ret != 0)
+		return ret;
+
+	if (copy_to_user(uattr, &attr, sizeof(struct vas_seg_attr)) != 0)
+		return -EFAULT;
+
+	return 0;
+}
+
+SYSCALL_DEFINE2(vas_seg_setattr, int, sid, struct vas_seg_attr __user *, uattr)
+{
+	struct vas_seg_attr attr;
+
+	if (sid < 0 || !uattr)
+		return -EINVAL;
+
+	if (copy_from_user(&attr, uattr, sizeof(struct vas_seg_attr)) != 0)
+		return -EFAULT;
+
+	return vas_seg_setattr(sid, &attr);
+}
-- 
2.12.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 12/13] mm/vas: Add lazy-attach support for first class virtual address spaces
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (10 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-13 22:14 ` [RFC PATCH 13/13] fs/proc: Add procfs " Till Smejkal
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Until now, whenever a task attaches a first class virtual address space,
all the memory regions currently present in the task are replicated into
the first class virtual address space so that the task can continue
executing as if nothing had changed. However, this technique makes the
attach and detach operations very costly, since the whole memory map of
the task has to be duplicated.

Lazy-attaching, on the other hand, uses a technique similar to the one
used to copy page tables during fork. Instead of completely duplicating
the memory map of the task together with its page tables, only a skeleton
memory map is created; it is filled in later, when page faults are
triggered by the process actually accessing the memory regions. The big
advantage is that unused memory regions are not duplicated at all, only
those that the process actually touches while executing inside the first
class virtual address space. The only memory region that is always
duplicated during the attach operation is the code section, because it is
needed for execution in any case, and copying it eagerly saves one page
fault later during process execution.
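
To make the behaviour visible from user space, the following minimal
sketch (not part of this patch) shows the intended usage pattern. The
vas_create(), vas_attach() and vas_switch() wrappers are assumed
stand-ins for the corresponding system calls introduced earlier in this
series; their names and signatures here are for illustration only.

/*
 * Hypothetical user-space sketch -- the vas_* wrappers below are assumed
 * stand-ins for the syscalls added earlier in this series and are not
 * defined here.
 */
#include <stdlib.h>
#include <unistd.h>

extern int vas_create(const char *name, int mode);   /* assumed wrapper */
extern int vas_attach(pid_t pid, int vid, int type); /* assumed wrapper */
extern int vas_switch(int vid);                      /* assumed wrapper */

int main(void)
{
	char *p = malloc(4096);
	int vid;

	p[0] = 'x';	/* the heap region exists before the switch */

	vid = vas_create("lazy-example", 0600);
	vas_attach(getpid(), vid, 0 /* access type, see earlier patches */);

	/*
	 * With CONFIG_VAS_LAZY_ATTACH=y only a skeleton memory map (plus
	 * the code region) is built here, so the switch stays cheap.
	 */
	vas_switch(vid);

	/*
	 * First access after the switch: the heap VMA was only lazily
	 * attached, so this faults and vas_lazy_attach_vma() copies the
	 * page tables for exactly this region before the fault is handled
	 * as usual.
	 */
	p[1] = 'y';

	return 0;
}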

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 include/linux/mm_types.h |   1 +
 include/linux/vas.h      |  26 ++++++++
 mm/Kconfig               |  18 ++++++
 mm/memory.c              |   5 ++
 mm/vas.c                 | 164 ++++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 197 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 82bf78ea83ee..65e04f14225d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -362,6 +362,7 @@ struct vm_area_struct {
 #ifdef CONFIG_VAS
 	struct mm_struct *vas_reference;
 	ktime_t vas_last_update;
+	bool vas_attached;
 #endif
 };
 
diff --git a/include/linux/vas.h b/include/linux/vas.h
index 376b9fa1ee27..8682bfc86568 100644
--- a/include/linux/vas.h
+++ b/include/linux/vas.h
@@ -2,6 +2,7 @@
 #define _LINUX_VAS_H
 
 
+#include <linux/mm_types.h>
 #include <linux/sched.h>
 #include <linux/vas_types.h>
 
@@ -293,4 +294,29 @@ static inline int vas_exit(struct task_struct *tsk) { return 0; }
 
 #endif /* CONFIG_VAS */
 
+
+/***
+ * Management of the VAS lazy attaching
+ ***/
+
+#ifdef CONFIG_VAS_LAZY_ATTACH
+
+/**
+ * Lazily update the page tables of a vm_area which was not completely set up
+ * during the VAS attaching.
+ *
+ * @param[in] vma:		The vm_area for which the page tables should be
+ *				set up before continuing the page fault handling.
+ *
+ * @returns:			0 if the lazy-attach was successful or not
+ *				necessary, or 1 if something went wrong.
+ */
+extern int vas_lazy_attach_vma(struct vm_area_struct *vma);
+
+#else /* CONFIG_VAS_LAZY_ATTACH */
+
+static inline int vas_lazy_attach_vma(struct vm_area_struct *vma) { return 0; }
+
+#endif /* CONFIG_VAS_LAZY_ATTACH */
+
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 9a80877f3536..934c56bcdbf4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -720,6 +720,24 @@ config VAS
 
 	  If not sure, then say N.
 
+config VAS_LAZY_ATTACH
+	bool "Use lazy-attach for First Class Virtual Address Spaces"
+	depends on VAS
+	default y
+	help
+	  When this option is enabled, memory regions of First Class Virtual
+	  Address Spaces will be mapped into the task's address space lazily
+	  after the switch has happened. That means the actual mapping is only
+	  established when a page fault occurs for the particular memory
+	  region. While this technique makes the switch operation itself less
+	  costly, it can add significant cost to the page fault handling.
+
+	  Hence, if the program touches a lot of different memory regions,
+	  this lazy-attaching technique can be more costly than doing the
+	  mapping eagerly during the switch.
+
+	  If not sure, then say Y.
+
 config VAS_DEBUG
 	bool "Debugging output for First Class Virtual Address Spaces"
 	depends on VAS
diff --git a/mm/memory.c b/mm/memory.c
index e4747b3fd5b9..cdefc99a50ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -64,6 +64,7 @@
 #include <linux/debugfs.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
+#include <linux/vas.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -4000,6 +4001,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
 
+	/* Check if this VMA belongs to a VAS and needs to be lazily attached. */
+	if (unlikely(vas_lazy_attach_vma(vma)))
+		return VM_FAULT_SIGSEGV;
+
 	/*
 	 * Enable the memcg OOM handling for faults triggered in user
 	 * space.  Kernel faults are handled more gracefully.
diff --git a/mm/vas.c b/mm/vas.c
index 345b023c21aa..953ba8d6e603 100644
--- a/mm/vas.c
+++ b/mm/vas.c
@@ -138,12 +138,13 @@ static void __dump_memory_map(const char *title, struct mm_struct *mm)
 		else
 			pr_cont(" OTHER ");
 
-		pr_cont("%c%c%c%c [%c]",
+		pr_cont("%c%c%c%c [%c:%c]",
 			vma->vm_flags & VM_READ ? 'r' : '-',
 			vma->vm_flags & VM_WRITE ? 'w' : '-',
 			vma->vm_flags & VM_EXEC ? 'x' : '-',
 			vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
-			vma->vas_reference ? 'v' : '-');
+			vma->vas_reference ? 'v' : '-',
+			vma->vas_attached ? 'a' : '-');
 
 		if (vma->vm_file) {
 			struct file *f = vma->vm_file;
@@ -883,6 +884,43 @@ static inline void vas_seg_put_share(int type, struct vas_seg *seg)
 }
 
 /**
+ * Identifying special regions of a memory map.
+ **/
+static inline unsigned long round_up_to_page(unsigned long addr)
+{
+	return PAGE_ALIGNED(addr) ? addr : ((addr & PAGE_MASK) + PAGE_SIZE);
+}
+
+static inline unsigned long round_down_to_page(unsigned long addr)
+{
+	return (addr & PAGE_MASK);
+}
+
+static inline bool is_code_region(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return ((vma->vm_start >= round_down_to_page(mm->start_code)) &&
+		(vma->vm_end <= round_up_to_page(mm->end_code)));
+}
+
+static inline bool is_data_region(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return ((vma->vm_start >= round_down_to_page(mm->start_data)) &&
+		(vma->vm_end <= round_up_to_page(mm->end_data)));
+}
+
+static inline bool is_heap_region(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	return ((vma->vm_start >= round_down_to_page(mm->start_brk)) &&
+		(vma->vm_end <= round_up_to_page(mm->brk)));
+}
+
+/**
  * Management of the memory maps.
  **/
 static int init_vas_mm(struct vas *vas)
@@ -1070,7 +1108,8 @@ static inline
 struct vm_area_struct *__copy_vm_area(struct mm_struct *src_mm,
 				      struct vm_area_struct *src_vma,
 				      struct mm_struct *dst_mm,
-				      unsigned long vm_flags)
+				      unsigned long vm_flags,
+				      bool dup_pages)
 {
 	struct vm_area_struct *vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
@@ -1105,11 +1144,13 @@ struct vm_area_struct *__copy_vm_area(struct mm_struct *src_mm,
 	if (vma->vm_ops && vma->vm_ops->open)
 		vma->vm_ops->open(vma);
 	vma->vas_last_update = src_vma->vas_last_update;
+	vma->vas_attached = dup_pages;
 
 	vma_link(dst_mm, vma, prev, rb_link, rb_parent);
 
 	vm_stat_account(dst_mm, vma->vm_flags, vma_pages(vma));
-	if (unlikely(dup_page_range(dst_mm, vma, src_mm, src_vma)))
+	if (dup_pages &&
+	    unlikely(dup_page_range(dst_mm, vma, src_mm, src_vma)))
 		pr_vas_debug("Failed to copy page table for VMA %p from %p\n",
 			     vma, src_vma);
 
@@ -1199,7 +1240,7 @@ struct vm_area_struct *__update_vm_area(struct mm_struct *src_mm,
 		}
 
 		dst_vma = __copy_vm_area(src_mm, src_vma, dst_mm,
-					 orig_vm_flags);
+					 orig_vm_flags, true);
 		if (!dst_vma)
 			goto out;
 
@@ -1264,7 +1305,7 @@ static int vas_merge(struct att_vas *avas, struct vas *vas, int type)
 			merged_vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
 
 		new_vma = __copy_vm_area(vas_mm, vma, avas_mm,
-					 merged_vm_flags);
+					 merged_vm_flags, true);
 		if (!new_vma) {
 			pr_vas_debug("Failed to merge a VAS memory region (%#lx - %#lx)\n",
 				     vma->vm_start, vma->vm_end);
@@ -1337,7 +1378,7 @@ static int vas_unmerge(struct att_vas *avas, struct vas *vas)
 				     vma->vm_start, vma->vm_end);
 
 			new_vma = __copy_vm_area(avas_mm, vma, vas_mm,
-						 vma->vm_flags);
+						 vma->vm_flags, true);
 			if (!new_vma) {
 				pr_vas_debug("Failed to unmerge a new VAS memory region (%#lx - %#lx)\n",
 					     vma->vm_start, vma->vm_end);
@@ -1346,7 +1387,8 @@ static int vas_unmerge(struct att_vas *avas, struct vas *vas)
 			}
 
 			new_vma->vas_reference = NULL;
-		} else {
+			new_vma->vas_attached = false;
+		} else if (vma->vas_attached) {
 			struct vm_area_struct *upd_vma;
 
 			/*
@@ -1365,6 +1407,9 @@ static int vas_unmerge(struct att_vas *avas, struct vas *vas)
 				ret = -EFAULT;
 				goto out_unlock;
 			}
+		} else {
+			pr_vas_debug("Skip not-attached memory region (%#lx - %#lx) during VAS unmerging\n",
+				     vma->vm_start, vma->vm_end);
 		}
 
 		/* Remove the current VMA from the attached-VAS memory map. */
@@ -1389,10 +1434,16 @@ static int vas_unmerge(struct att_vas *avas, struct vas *vas)
  *			contains all the information for this attachment.
  * @param[in] tsk:	The pointer to the task of which the memory map
  *			should be merged.
+ * @param[in] default_copy_eagerly:
+ *			Controls how all memory regions except the code
+ *			region are handled: if true, their page tables are
+ *			duplicated as well; if false, they are not.
  *
  * @returns:		0 on success, -ERRNO otherwise.
  **/
-static int task_merge(struct att_vas *avas, struct task_struct *tsk)
+static int _task_merge(struct att_vas *avas, struct task_struct *tsk,
+		       bool default_copy_eagerly)
 {
 	struct vm_area_struct *vma, *new_vma;
 	struct mm_struct *avas_mm, *tsk_mm;
@@ -1413,10 +1464,23 @@ static int task_merge(struct att_vas *avas, struct task_struct *tsk)
 	 * map to the attached-VAS memory map.
 	 */
 	for (vma = tsk_mm->mmap; vma; vma = vma->vm_next) {
-		pr_vas_debug("Merging a task memory region (%#lx - %#lx)\n",
-			     vma->vm_start, vma->vm_end);
+		bool copy_eagerly = default_copy_eagerly;
+
+		/*
+		 * The code region of the task will *always* be copied eagerly.
+		 * We need this region in any case to continue execution. All
+		 * the other memory regions are copied according to the
+		 * 'default_copy_eagerly' variable.
+		 */
+		if (is_code_region(vma))
+			copy_eagerly = true;
 
-		new_vma = __copy_vm_area(tsk_mm, vma, avas_mm, vma->vm_flags);
+		pr_vas_debug("Merging a task memory region (%#lx - %#lx) %s\n",
+			     vma->vm_start, vma->vm_end,
+			     copy_eagerly ? "eagerly" : "lazily");
+
+		new_vma = __copy_vm_area(tsk_mm, vma, avas_mm, vma->vm_flags,
+					 copy_eagerly);
 		if (!new_vma) {
 			pr_vas_debug("Failed to merge a task memory region (%#lx - %#lx)\n",
 				     vma->vm_start, vma->vm_end);
@@ -1443,6 +1507,16 @@ static int task_merge(struct att_vas *avas, struct task_struct *tsk)
 	return ret;
 }
 
+/*
+ * Decide based on the kernel configuration setting if we copy task memory
+ * regions eagerly or lazily.
+ */
+#ifdef CONFIG_VAS_LAZY_ATTACH
+#define task_merge(avas, tsk) _task_merge(avas, tsk, false)
+#else
+#define task_merge(avas, tsk) _task_merge(avas, tsk, true)
+#endif
+
 /**
  * Unmerge task-related parts of an attached-VAS memory map back into the
  * task's memory map.
@@ -1541,7 +1615,8 @@ static int vas_seg_merge(struct vas *vas, struct vas_seg *seg, int type)
 		if (!(type & MAY_WRITE))
 			merged_vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
 
-		new_vma = __copy_vm_area(seg_mm, vma, vas_mm, merged_vm_flags);
+		new_vma = __copy_vm_area(seg_mm, vma, vas_mm, merged_vm_flags,
+					 true);
 		if (!new_vma) {
 			pr_vas_debug("Failed to merge a VAS segment memory region (%#lx - %#lx)\n",
 				     vma->vm_start, vma->vm_end);
@@ -1606,7 +1681,7 @@ static int vas_seg_unmerge(struct vas *vas, struct vas_seg *seg)
 			pr_vas_debug("Skipping memory region (%#lx - %#lx) during VAS segment unmerging\n",
 				     vma->vm_start, vma->vm_end);
 			continue;
-		} else {
+		} else if (vma->vas_attached) {
 			struct vm_area_struct *upd_vma;
 
 			pr_vas_debug("Unmerging a VAS segment memory region (%#lx - %#lx)\n",
@@ -1619,6 +1694,9 @@ static int vas_seg_unmerge(struct vas *vas, struct vas_seg *seg)
 				ret = -EFAULT;
 				goto out_unlock;
 			}
+		} else {
+			pr_vas_debug("Skip not-attached memory region (%#lx - %#lx) during segment unmerging\n",
+				     vma->vm_start, vma->vm_end);
 		}
 
 		/* Remove the current VMA from the VAS memory map. */
@@ -1809,8 +1887,13 @@ static int __sync_from_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
 
 		ref = vas_find_reference(avas_mm, vma);
 		if (!ref) {
+#ifdef CONFIG_VAS_LAZY_ATTACH
 			ref = __copy_vm_area(tsk_mm, vma, avas_mm,
-					     vma->vm_flags);
+					     vma->vm_flags, false);
+#else
+			ref = __copy_vm_area(tsk_mm, vma, avas_mm,
+					     vma->vm_flags, true);
+#endif
 
 			if (!ref) {
 				pr_vas_debug("Failed to copy memory region (%#lx - %#lx) during task sync\n",
@@ -1824,7 +1907,7 @@ static int __sync_from_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
 			 * copied it from.
 			 */
 			ref->vas_reference = tsk_mm;
-		} else {
+		} else if (ref->vas_attached) {
 			ref = __update_vm_area(tsk_mm, vma, avas_mm, ref);
 			if (!ref) {
 				pr_vas_debug("Failed to update memory region (%#lx - %#lx) during task sync\n",
@@ -1832,6 +1915,9 @@ static int __sync_from_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
 				ret = -EFAULT;
 				break;
 			}
+		} else {
+			pr_vas_debug("Skip not-attached memory region (%#lx - %#lx) during task sync\n",
+				     vma->vm_start, vma->vm_end);
 		}
 	}
 
@@ -1848,7 +1934,7 @@ static int __sync_to_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
 		if (vma->vas_reference != tsk_mm) {
 			pr_vas_debug("Skip unrelated memory region (%#lx - %#lx) during task resync\n",
 				     vma->vm_start, vma->vm_end);
-		} else {
+		} else if (vma->vas_attached) {
 			struct vm_area_struct *ref;
 
 			ref = __update_vm_area(avas_mm, vma, tsk_mm, NULL);
@@ -1858,6 +1944,9 @@ static int __sync_to_task(struct mm_struct *avas_mm, struct mm_struct *tsk_mm)
 				ret = -EFAULT;
 				break;
 			}
+		} else {
+			pr_vas_debug("Skip not-attached memory region (%#lx - %#lx) during task resync\n",
+				     vma->vm_start, vma->vm_end);
 		}
 	}
 
@@ -3100,6 +3189,47 @@ void vas_exit(struct task_struct *tsk)
 	}
 }
 
+#ifdef CONFIG_VAS_LAZY_ATTACH
+
+int vas_lazy_attach_vma(struct vm_area_struct *vma)
+{
+	struct mm_struct *ref_mm, *mm;
+	struct vm_area_struct *ref_vma;
+
+	if (likely(!vma->vas_reference))
+		return 0;
+	if (vma->vas_attached)
+		return 0;
+
+	ref_mm = vma->vas_reference;
+	mm = vma->vm_mm;
+
+	down_read_nested(&ref_mm->mmap_sem, SINGLE_DEPTH_NESTING);
+	ref_vma = vas_find_reference(ref_mm, vma);
+	up_read(&ref_mm->mmap_sem);
+	if (!ref_vma) {
+		pr_vas_debug("Couldn't find VAS reference\n");
+		return 1;
+	}
+
+	pr_vas_debug("Lazy-attach memory region (%#lx - %#lx)\n",
+		     ref_vma->vm_start, ref_vma->vm_end);
+
+	if (unlikely(dup_page_range(mm, vma, ref_mm, ref_vma))) {
+		pr_vas_debug("Failed to copy page tables for VMA %p from %p\n",
+			     vma, ref_vma);
+		return 1;
+	}
+
+	vma->vas_last_update = ref_vma->vas_last_update;
+	vma->vas_attached = true;
+
+	return 0;
+}
+
+#endif /* CONFIG_VAS_LAZY_ATTACH */
+
+
 /***
  * System Calls
  ***/
-- 
2.12.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH 13/13] fs/proc: Add procfs support for first class virtual address spaces
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (11 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 12/13] mm/vas: Add lazy-attach support for first class virtual address spaces Till Smejkal
@ 2017-03-13 22:14 ` Till Smejkal
  2017-03-14  0:18 ` [RFC PATCH 00/13] Introduce " Richard Henderson
  2017-03-14  0:58 ` Andy Lutomirski
  14 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:14 UTC (permalink / raw)
  To: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Add new files and directories to the procfs file system that contain
various information about the first class virtual address spaces attached
to the processes in the system.

A new directory named 'vas' is added to the procfs directory of each
process in the system (/proc/$PID). It contains one entry per VAS that is
attached to the process. Each of these entries provides a file with
status information about the attachment, a file with the current memory
map of the attached VAS, and a link to the sysfs directory of the
underlying VAS.
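
As a quick illustration (not part of this patch), a small user-space
program could walk these entries. The sketch below assumes that each
attached VAS shows up under /proc/<pid>/vas/ as a directory named after
its VAS ID, containing the 'status' and 'maps' files and the 'vas'
symlink described above, and that the kernel was built with
CONFIG_VAS_PROCFS.

#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *dir = opendir("/proc/self/vas");
	struct dirent *de;
	char path[288], line[256];

	if (!dir) {
		perror("opendir");	/* e.g. kernel without CONFIG_VAS_PROCFS */
		return 1;
	}

	while ((de = readdir(dir)) != NULL) {
		FILE *f;

		if (de->d_name[0] == '.')
			continue;

		/*
		 * One directory per attached VAS (assumed to be named after
		 * the VAS ID); dump its 'status' file, which contains the
		 * pid, vid and access type of the attachment.
		 */
		snprintf(path, sizeof(path), "/proc/self/vas/%s/status",
			 de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;

		printf("--- %s ---\n", path);
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
	}

	closedir(dir);
	return 0;
}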

Signed-off-by: Till Smejkal <till.smejkal@gmail.com>
---
 fs/proc/base.c     | 528 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/proc/inode.c    |   1 +
 fs/proc/internal.h |   1 +
 mm/Kconfig         |   9 +
 4 files changed, 539 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 87c9a9aacda3..e60c13dd087c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -45,6 +45,9 @@
  *
  *  Paul Mundt <paul.mundt@nokia.com>:
  *  Overall revision about smaps.
+ *
+ *  Till Smejkal <till.smejkal@gmail.com>:
+ *  Add entries for first class virtual address spaces.
  */
 
 #include <linux/uaccess.h>
@@ -87,6 +90,7 @@
 #include <linux/slab.h>
 #include <linux/flex_array.h>
 #include <linux/posix-timers.h>
+#include <linux/vas.h>
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
@@ -2841,6 +2845,527 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
 	return err;
 }
 
+#ifdef CONFIG_VAS_PROCFS
+
+/**
+ * Get a string representation of the access type to a VAS.
+ **/
+#define vas_access_type_str(type) ((type) & MAY_WRITE ?			\
+				   ((type) & MAY_READ ? "rw" : "wo") : "ro")
+
+static int att_vas_show_status(struct seq_file *sf, void *unused)
+{
+	struct inode *inode = sf->private;
+	struct proc_inode *pi = PROC_I(inode);
+	struct task_struct *tsk;
+	struct vas_context *vas_ctx;
+	struct att_vas *avas;
+	int vid = pi->vas_id;
+
+	tsk = get_proc_task(inode);
+	if (!tsk)
+		return -ENOENT;
+
+	vas_ctx = tsk->vas_ctx;
+
+	vas_context_lock(vas_ctx);
+
+	list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+		if (vid == avas->vas->id)
+			goto good_att_vas;
+	}
+
+	vas_context_unlock(vas_ctx);
+	put_task_struct(tsk);
+
+	return -ENOENT;
+
+good_att_vas:
+	seq_printf(sf,
+		   "pid:  %d\n"
+		   "vid:  %d\n"
+		   "type: %s\n",
+		   avas->tsk->pid, avas->vas->id,
+		   vas_access_type_str(avas->type));
+
+	vas_context_unlock(vas_ctx);
+	put_task_struct(tsk);
+
+	return 0;
+}
+
+static int att_vas_show_status_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, att_vas_show_status, inode);
+}
+
+static const struct file_operations att_vas_show_status_fops = {
+	.open		= att_vas_show_status_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int att_vas_show_mappings(struct seq_file *sf, void *unused)
+{
+	struct inode *inode = sf->private;
+	struct proc_inode *pi = PROC_I(inode);
+	struct task_struct *tsk;
+	struct vas_context *vas_ctx;
+	struct att_vas *avas;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int vid = pi->vas_id;
+
+	tsk = get_proc_task(inode);
+	if (!tsk)
+		return -ENOENT;
+
+	vas_ctx = tsk->vas_ctx;
+
+	vas_context_lock(vas_ctx);
+
+	list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+		if (avas->vas->id == vid)
+			goto good_att_vas;
+	}
+
+	vas_context_unlock(vas_ctx);
+	put_task_struct(tsk);
+
+	return -ENOENT;
+
+good_att_vas:
+	mm = avas->mm;
+
+	down_read(&mm->mmap_sem);
+
+	if (!mm->mmap) {
+		seq_puts(sf, "EMPTY\n");
+		goto out_unlock;
+	}
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		vm_flags_t flags = vma->vm_flags;
+		struct file *file = vma->vm_file;
+		unsigned long long pgoff = 0;
+
+		if (file)
+			pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+
+		seq_printf(sf, "%08lx-%08lx %c%c%c%c [%c:%c] %08llx",
+			   vma->vm_start, vma->vm_end,
+			   flags & VM_READ ? 'r' : '-',
+			   flags & VM_WRITE ? 'w' : '-',
+			   flags & VM_EXEC ? 'x' : '-',
+			   flags & VM_MAYSHARE ? 's' : 'p',
+			   vma->vas_reference ? 'v' : '-',
+			   vma->vas_attached ? 'a' : '-',
+			   pgoff);
+
+		seq_putc(sf, ' ');
+
+		if (file) {
+			seq_file_path(sf, file, "\n");
+		} else if (vma->vm_ops && vma->vm_ops->name) {
+			seq_puts(sf, vma->vm_ops->name(vma));
+		} else {
+			if (!vma->vm_mm)
+				seq_puts(sf, "[vdso]");
+			else if (vma->vm_start <= mm->brk &&
+				 vma->vm_end >= mm->start_brk)
+				seq_puts(sf, "[heap]");
+			else if (vma->vm_start <= mm->start_stack &&
+				 vma->vm_end >= mm->start_stack)
+				seq_puts(sf, "[stack]");
+		}
+
+		seq_putc(sf, '\n');
+	}
+
+out_unlock:
+	up_read(&mm->mmap_sem);
+
+	vas_context_unlock(vas_ctx);
+	put_task_struct(tsk);
+
+	return 0;
+}
+
+static int att_vas_show_mappings_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, att_vas_show_mappings, inode);
+}
+
+static const struct file_operations att_vas_show_mappings_fops = {
+	.open		= att_vas_show_mappings_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int att_vas_vas_link(char *name, int *buflen, int vid)
+{
+	int len;
+
+	len = scnprintf(name, *buflen, "/sys/kernel/vas/%d", vid);
+	if (len >= *buflen)
+		return -E2BIG;
+
+	*buflen = len;
+	return 0;
+}
+
+static int att_vas_vas_link_readlink(struct dentry *dentry, char __user *buffer,
+				     int buflen)
+{
+	char *name;
+	int len, ret;
+	int vid;
+
+	len = PATH_MAX;
+	name = kmalloc(len, GFP_TEMPORARY);
+	if (!name)
+		return -ENOMEM;
+
+	vid = PROC_I(d_inode(dentry))->vas_id;
+
+	ret = att_vas_vas_link(name, &len, vid);
+	if (ret != 0)
+		goto out_free;
+
+	if (len > buflen)
+		len = buflen;
+	if (copy_to_user(buffer, name, len) != 0) {
+		ret = -EFAULT;
+		goto out_free;
+	}
+
+	ret = len;
+
+out_free:
+	kfree(name);
+
+	return ret;
+}
+
+static const char *att_vas_vas_link_get_link(struct dentry *dentry,
+					     struct inode *inode,
+					     struct delayed_call *done)
+{
+	char *name;
+	int len, ret;
+	int vid;
+
+	if (!dentry)
+		return ERR_PTR(-ECHILD);
+
+	len = PATH_MAX;
+	name = kmalloc(len, GFP_TEMPORARY);
+	if (!name)
+		return NULL;
+
+	vid = PROC_I(inode)->vas_id;
+
+	ret = att_vas_vas_link(name, &len, vid);
+	if (ret != 0) {
+		kfree(name);
+		return ERR_PTR(ret);
+	}
+
+	set_delayed_call(done, kfree_link, name);
+	return name;
+}
+
+static const struct inode_operations att_vas_vas_link_iops = {
+	.readlink	= att_vas_vas_link_readlink,
+	.get_link	= att_vas_vas_link_get_link,
+	.setattr	= proc_setattr
+};
+
+static const struct pid_entry att_vas_stuff[] = {
+	REG("status", 0444, att_vas_show_status_fops),
+	REG("maps", 0440, att_vas_show_mappings_fops),
+	NOD("vas", (S_IFLNK | 0777), &att_vas_vas_link_iops, NULL, {}),
+};
+
+static int att_vas_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct inode *inode;
+	struct task_struct *tsk;
+	struct vas_context *vas_ctx;
+	struct att_vas *avas;
+	int vid;
+	int ret;
+
+	if (flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	inode = d_inode(dentry);
+	tsk = get_proc_task(inode);
+	if (!tsk)
+		return 0;
+
+	vid = PROC_I(inode)->vas_id;
+	vas_ctx = tsk->vas_ctx;
+
+	vas_context_lock(vas_ctx);
+
+	ret = 0;
+	list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+		if (avas->vas->id == vid) {
+			ret = 1;
+			break;
+		}
+	}
+
+	vas_context_unlock(vas_ctx);
+	put_task_struct(tsk);
+
+	return ret;
+}
+
+static const struct dentry_operations att_vas_dops = {
+	.d_revalidate	= att_vas_revalidate,
+};
+
+static int att_vas_pident_instantiate(struct inode *dir,
+				      struct dentry *dentry,
+				      struct task_struct *task,
+				      const void *ptr)
+{
+	const struct pid_entry *p = ptr;
+	struct inode *inode;
+	struct proc_inode *ei;
+
+	inode = proc_pid_make_inode(dir->i_sb, task, p->mode);
+	if (!inode)
+		goto out;
+
+	ei = PROC_I(inode);
+	if (S_ISDIR(inode->i_mode))
+		set_nlink(inode, 2);
+	if (p->iop)
+		inode->i_op = p->iop;
+	if (p->fop)
+		inode->i_fop = p->fop;
+	ei->op = p->op;
+
+	/* Copy the VAS ID from the parent inode */
+	ei->vas_id = PROC_I(dir)->vas_id;
+
+	d_set_d_op(dentry, &att_vas_dops);
+	d_add(dentry, inode);
+
+	if (att_vas_revalidate(dentry, 0))
+		return 0;
+out:
+	return -ENOENT;
+}
+
+static struct dentry *att_vas_pident_lookup(struct inode *dir,
+					    struct dentry *dentry,
+					    unsigned int flags)
+{
+	int error;
+	struct task_struct *task = get_proc_task(dir);
+	const struct pid_entry *p, *last;
+	const struct pid_entry *entries = att_vas_stuff;
+	int nents = ARRAY_SIZE(att_vas_stuff);
+
+	error = -ENOENT;
+
+	if (!task)
+		goto out_no_task;
+
+	last = &entries[nents];
+	for (p = entries; p < last; p++) {
+		if (p->len != dentry->d_name.len)
+			continue;
+		if (!memcmp(dentry->d_name.name, p->name, p->len))
+			break;
+	}
+	if (p >= last)
+		goto out;
+
+	error = att_vas_pident_instantiate(dir, dentry, task, p);
+out:
+	put_task_struct(task);
+out_no_task:
+	return ERR_PTR(error);
+}
+
+static int att_vas_pident_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct task_struct *task = get_proc_task(file_inode(file));
+	const struct pid_entry *p, *last;
+	const struct pid_entry *entries = att_vas_stuff;
+	int nents = ARRAY_SIZE(att_vas_stuff);
+
+	if (!task)
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		goto out;
+
+	if (ctx->pos >= nents + 2)
+		goto out;
+
+	last = &entries[nents];
+	for (p = entries + (ctx->pos - 2); p < last; p++) {
+		if (!proc_fill_cache(file, ctx, p->name, p->len,
+				     att_vas_pident_instantiate, task, p))
+			break;
+		ctx->pos++;
+	}
+out:
+	put_task_struct(task);
+	return 0;
+}
+
+static const struct inode_operations proc_att_vas_iops = {
+	.lookup		= att_vas_pident_lookup,
+	.setattr	= proc_setattr,
+	.permission	= generic_permission,
+};
+
+static const struct file_operations proc_att_vas_fops = {
+	.read		= generic_read_dir,
+	.llseek		= generic_file_llseek,
+	.iterate_shared	= att_vas_pident_readdir,
+};
+
+static int proc_att_vas_dir_instantiate(struct inode *dir,
+					struct dentry *dentry,
+					struct task_struct *tsk,
+					const void *data)
+{
+	struct inode *inode;
+	struct proc_inode *pi;
+	const struct att_vas *avas = data;
+
+	inode = proc_pid_make_inode(dir->i_sb, tsk, S_IFDIR | 0555);
+
+	if (!inode)
+		return -ENOENT;
+
+	pi = PROC_I(inode);
+	pi->vas_id = avas->vas->id;
+
+	set_nlink(inode, 2);
+	inode->i_op = &proc_att_vas_iops;
+	inode->i_fop = &proc_att_vas_fops;
+
+	d_set_d_op(dentry, &att_vas_dops);
+	d_add(dentry, inode);
+
+	/*
+	 * No need to revalidate the dentry at this point, because we are still
+	 * holding the lock for the VAS context. Hence this VAS cannot be
+	 * detached from the task and hence the dentry is still valid.
+	 */
+	return 0;
+}
+
+static struct dentry *proc_att_vas_dir_lookup(struct inode *dir,
+					      struct dentry *dentry,
+					      unsigned int flags)
+{
+	struct task_struct *tsk;
+	struct vas_context *vas_ctx;
+	struct att_vas *avas;
+	int vid;
+	int ret;
+
+	tsk = get_proc_task(dir);
+	if (!tsk)
+		return ERR_PTR(-ENOENT);
+
+	if (kstrtoint(dentry->d_name.name, 10, &vid)) {
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	vas_ctx = tsk->vas_ctx;
+
+	vas_context_lock(vas_ctx);
+
+	ret = -ENOENT;
+	list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+		if (vid == avas->vas->id) {
+			ret = proc_att_vas_dir_instantiate(dir, dentry,
+							   tsk, avas);
+			break;
+		}
+	}
+
+	vas_context_unlock(vas_ctx);
+
+out_put:
+	put_task_struct(tsk);
+
+	return ERR_PTR(ret);
+}
+
+static int proc_att_vas_dir_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *tsk;
+	struct vas_context *vas_ctx;
+	struct att_vas *avas;
+	int pos = 2;
+
+	tsk = get_proc_task(inode);
+	if (!tsk)
+		return -ENOENT;
+
+	if (!dir_emit_dots(file, ctx))
+		goto out_put;
+
+	vas_ctx = tsk->vas_ctx;
+
+	vas_context_lock(vas_ctx);
+
+	list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+		char name[PROC_NUMBUF];
+		int len;
+
+		if (++pos <= ctx->pos)
+			continue;
+
+		snprintf(name, sizeof(name), "%d", avas->vas->id);
+		len = strnlen(name, PROC_NUMBUF);
+
+		if (!proc_fill_cache(file, ctx, name, len,
+				     proc_att_vas_dir_instantiate,
+				     tsk, avas))
+			break;
+
+		ctx->pos++;
+	}
+
+	vas_context_unlock(vas_ctx);
+
+out_put:
+	put_task_struct(tsk);
+
+	return 0;
+}
+
+const struct inode_operations proc_att_vas_dir_iops = {
+	.lookup		= proc_att_vas_dir_lookup,
+	.setattr	= proc_setattr,
+	.permission	= generic_permission,
+};
+
+const struct file_operations proc_att_vas_dir_fops = {
+	.read		= generic_read_dir,
+	.llseek		= generic_file_llseek,
+	.iterate_shared	= proc_att_vas_dir_readdir,
+};
+
+#endif
+
 /*
  * Thread groups
  */
@@ -2856,6 +3381,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
 #endif
+#ifdef CONFIG_VAS_PROCFS
+	DIR("vas",        S_IRUGO|S_IXUGO, proc_att_vas_dir_iops, proc_att_vas_dir_fops),
+#endif
 	REG("environ",    S_IRUSR, proc_environ_operations),
 	REG("auxv",       S_IRUSR, proc_auxv_operations),
 	ONE("status",     S_IRUGO, proc_pid_status),
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index cb2d5702bdce..cc8937d348df 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -63,6 +63,7 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
 	ei->pid = NULL;
 	ei->fd = 0;
 	ei->op.proc_get_link = NULL;
+	ei->vas_id = 0;
 	ei->pde = NULL;
 	ei->sysctl = NULL;
 	ei->sysctl_entry = NULL;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2de5194ba378..0cb6bb39d61d 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -61,6 +61,7 @@ union proc_op {
 struct proc_inode {
 	struct pid *pid;
 	unsigned int fd;
+	int vas_id;
 	union proc_op op;
 	struct proc_dir_entry *pde;
 	struct ctl_table_header *sysctl;
diff --git a/mm/Kconfig b/mm/Kconfig
index 934c56bcdbf4..9ef3efc16bed 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -745,3 +745,12 @@ config VAS_DEBUG
 	help
 	  Enable extensive debugging output for the First Class Virtual Address
 	  Spaces feature.
+
+config VAS_PROCFS
+	bool "procfs entries for First Class Virtual Address Spaces"
+	depends on VAS && PROC_FS
+	default y
+	help
+	  Provide information in /proc/$PID about all First Class 
+	  Virtual Address Spaces that are currently attached to the
+	  corresponding process.
-- 
2.12.0

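As a rough user-space sketch of how the entries added above could be consumed
(assuming the layout introduced by this patch: a /proc/<pid>/vas directory with
one subdirectory per attached VAS ID, each containing a "status" file, a "maps"
file and a "vas" symlink pointing into /sys/kernel/vas/<id>):

#include <dirent.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
	char path[PATH_MAX], buf[256];
	struct dirent *de;
	DIR *dir;

	dir = opendir("/proc/self/vas");
	if (!dir) {
		/* CONFIG_VAS_PROCFS disabled or no VAS attached */
		perror("opendir");
		return 1;
	}

	while ((de = readdir(dir))) {
		FILE *f;

		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */

		/* Each subdirectory is named after the attached VAS ID. */
		snprintf(path, sizeof(path), "/proc/self/vas/%s/status",
			 de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;

		printf("--- VAS %s ---\n", de->d_name);
		while (fgets(buf, sizeof(buf), f))
			fputs(buf, stdout);	/* pid, vid and access type */
		fclose(f);
	}

	closedir(dir);
	return 0;
}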

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions
  2017-03-13 22:14 ` [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions Till Smejkal
@ 2017-03-13 22:27   ` Matthew Wilcox
  2017-03-13 22:45     ` Till Smejkal
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2017-03-13 22:27 UTC (permalink / raw)
  To: Till Smejkal
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

On Mon, Mar 13, 2017 at 03:14:13PM -0700, Till Smejkal wrote:
> +/**
> + * Create a new VAS segment.
> + *
> + * @param[in] name:		The name of the new VAS segment.
> + * @param[in] start:		The address where the VAS segment begins.
> + * @param[in] end:		The address where the VAS segment ends.
> + * @param[in] mode:		The access rights for the VAS segment.
> + *
> + * @returns:			The VAS segment ID on success, -ERRNO otherwise.
> + **/

Please follow the kernel-doc conventions, as described in
Documentation/doc-guide/kernel-doc.rst.  Also, function documentation
goes with the implementation, not the declaration.
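
For the comment quoted above, the kernel-doc form placed directly above the
definition in the .c file would look roughly like this sketch (the function
name vas_seg_create is only a guess here, since the declaration itself is not
part of the quoted hunk):

/**
 * vas_seg_create() - Create a new VAS segment.
 * @name:  The name of the new VAS segment.
 * @start: The address where the VAS segment begins.
 * @end:   The address where the VAS segment ends.
 * @mode:  The access rights for the VAS segment.
 *
 * Return: The VAS segment ID on success, negative error code otherwise.
 */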

> +/**
> + * Get ID of the VAS segment belonging to a given name.
> + *
> + * @param[in] name:		The name of the VAS segment for which the ID
> + *				should be returned.
> + *
> + * @returns:			The VAS segment ID on success, -ERRNO
> + *				otherwise.
> + **/
> +extern int vas_seg_find(const char *name);

So ... segments have names, and IDs ... and access permissions ...
Why isn't this a special purpose filesystem?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions
  2017-03-13 22:27   ` Matthew Wilcox
@ 2017-03-13 22:45     ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-13 22:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Till Smejkal, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Vineet Gupta, Russell King, Catalin Marinas, Will Deacon,
	Steven Miao, Richard Kuo, Tony Luck, Fenghua Yu, James Hogan,
	Ralf Baechle, James E.J. Bottomley, Helge Deller,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	David S. Miller, Chris Metcalf, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Andy Lutomirski, Chris Zankel, Max Filippov,
	Arnd Bergmann, Greg Kroah-Hartman, Laurent Pinchart,
	Mauro Carvalho Chehab, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, David Woodhouse, Brian Norris, Boris Brezillon,
	Marek Vasut, Richard Weinberger, Cyrille Pitchen, Felipe Balbi,
	Alexander Viro, Benjamin LaHaise, Nadia Yvette Chambers,
	Jeff Layton, J. Bruce Fields, Peter Zijlstra, Hugh Dickins,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jaroslav Kysela,
	Takashi Iwai, linux-kernel, linux-alpha, linux-snps-arc,
	linux-arm-kernel, adi-buildroot-devel, linux-hexagon, linux-ia64,
	linux-metag, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, sparclinux, linux-xtensa, linux-media, linux-mtd,
	linux-usb, linux-fsdevel, linux-aio, linux-mm, linux-api,
	linux-arch, alsa-devel

Hi Matthew,

On Mon, 13 Mar 2017, Matthew Wilcox wrote:
> On Mon, Mar 13, 2017 at 03:14:13PM -0700, Till Smejkal wrote:
> > +/**
> > + * Create a new VAS segment.
> > + *
> > + * @param[in] name:		The name of the new VAS segment.
> > + * @param[in] start:		The address where the VAS segment begins.
> > + * @param[in] end:		The address where the VAS segment ends.
> > + * @param[in] mode:		The access rights for the VAS segment.
> > + *
> > + * @returns:			The VAS segment ID on success, -ERRNO otherwise.
> > + **/
> 
> Please follow the kernel-doc conventions, as described in
> Documentation/doc-guide/kernel-doc.rst.  Also, function documentation
> goes with the implementation, not the declaration.

Thank you for this pointer. I wasn't aware of this convention. I will change the
patches accordingly.

> > +/**
> > + * Get ID of the VAS segment belonging to a given name.
> > + *
> > + * @param[in] name:		The name of the VAS segment for which the ID
> > + *				should be returned.
> > + *
> > + * @returns:			The VAS segment ID on success, -ERRNO
> > + *				otherwise.
> > + **/
> > +extern int vas_seg_find(const char *name);
> 
> So ... segments have names, and IDs ... and access permissions ...
> Why isn't this a special purpose filesystem?

We also thought about this, but decided against implementing them as a special purpose
filesystem, mainly because we could not think of a good way to represent a VAS/VAS
segment in such a filesystem (should it rather be a file or a directory?) and we
weren't sure what a hierarchy in the filesystem would mean for the underlying address
spaces. Hence we went with a combination of IDR and sysfs instead. However, I don't
have any strong feelings about this and would also reimplement them as a special
purpose filesystem if people would prefer that.

Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces
  2017-03-13 22:14 ` [RFC PATCH 10/13] mm: Introduce first class virtual address spaces Till Smejkal
@ 2017-03-13 23:52   ` Greg Kroah-Hartman
  2017-03-14  0:24     ` Till Smejkal
  2017-03-14  1:35   ` Vineet Gupta
  1 sibling, 1 reply; 45+ messages in thread
From: Greg Kroah-Hartman @ 2017-03-13 23:52 UTC (permalink / raw)
  To: Till Smejkal
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

On Mon, Mar 13, 2017 at 03:14:12PM -0700, Till Smejkal wrote:

There's no way with that many cc: lists and people that this is really
making it through very many people's filters and actually on a mailing
list.  Please trim them down.

Minor sysfs questions/issues:

> +struct vas {
> +	struct kobject kobj;		/* < the internal kobject that we use *
> +					 *   for reference counting and sysfs *
> +					 *   handling.                        */
> +
> +	int id;				/* < ID                               */
> +	char name[VAS_MAX_NAME_LENGTH];	/* < name                             */

The kobject has a name, why not use that?

> +
> +	struct mutex mtx;		/* < lock for parallel access.        */
> +
> +	struct mm_struct *mm;		/* < a partial memory map containing  *
> +					 *   all mappings of this VAS.        */
> +
> +	struct list_head link;		/* < the link in the global VAS list. */
> +	struct rcu_head rcu;		/* < the RCU helper used for          *
> +					 *   asynchronous VAS deletion.       */
> +
> +	u16 refcount;			/* < how often is the VAS attached.   */

The kobject has a refcount, use that?  Don't have 2 refcounts in the
same structure, that way lies madness.  And bugs, lots of bugs...

And if this really is a refcount (hint, I don't think it is), you should
use the refcount_t type.
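
A rough sketch of what that looks like (the struct and helper names below are
purely illustrative):

#include <linux/refcount.h>
#include <linux/slab.h>

struct foo {
	refcount_t attach_count;
};

static void foo_init(struct foo *f)
{
	refcount_set(&f->attach_count, 1);	/* creator holds one reference */
}

static bool foo_get(struct foo *f)
{
	/* returns false once the count has already dropped to zero */
	return refcount_inc_not_zero(&f->attach_count);
}

static void foo_put(struct foo *f)
{
	if (refcount_dec_and_test(&f->attach_count))
		kfree(f);			/* last reference is gone */
}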

> +/**
> + * The sysfs structure we need to handle attributes of a VAS.
> + **/
> +struct vas_sysfs_attr {
> +	struct attribute attr;
> +	ssize_t (*show)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> +			char *buf);
> +	ssize_t (*store)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> +			 const char *buf, size_t count);
> +};
> +
> +#define VAS_SYSFS_ATTR(NAME, MODE, SHOW, STORE)				\
> +static struct vas_sysfs_attr vas_sysfs_attr_##NAME =			\
> +	__ATTR(NAME, MODE, SHOW, STORE)

__ATTR_RO and __ATTR_RW should work better for you.  If you really need
this.

Oh, and where is the Documentation/ABI/ updates to try to describe the
sysfs structure and files?  Did I miss that in the series?

> +static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr *vsattr,
> +			       char *buf)
> +{
> +	return scnprintf(buf, PAGE_SIZE, "%s", vas->name);

It's a page size, just use sprintf() and be done with it.  No need to
ever check, you "know" it will be correct.

Also, what about a trailing '\n' for these attributes?
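
A rough sketch combining the three points above (the attribute name is just an
example):

static ssize_t name_show(struct vas *vas, struct vas_sysfs_attr *vsattr,
			 char *buf)
{
	return sprintf(buf, "%s\n", vas->name);	/* trailing newline included */
}

/* __ATTR_RO() fills in mode 0444 and wires up name_show() automatically. */
static struct vas_sysfs_attr vas_sysfs_attr_name = __ATTR_RO(name);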

Oh wait, why have a name when the kobject name is already there in the
directory itself?  Do you really need this?

> +/**
> + * The ktype data structure representing a VAS.
> + **/
> +static struct kobj_type vas_ktype = {
> +	.sysfs_ops = &vas_sysfs_ops,
> +	.release = __vas_release,

Why the odd __vas* naming?  What's wrong with vas_release?


> +	.default_attrs = vas_default_attr,
> +};
> +
> +
> +/***
> + * Internally visible functions
> + ***/
> +
> +/**
> + * Working with the global VAS list.
> + **/
> +static inline void vas_remove(struct vas *vas)

<snip>

You have a ton of inline functions, for no good reason.  Make them all
"real" functions please.  Unless you can measure the size/speed
differences?  If so, please say so.


thanks,

greg k-h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (12 preceding siblings ...)
  2017-03-13 22:14 ` [RFC PATCH 13/13] fs/proc: Add procfs " Till Smejkal
@ 2017-03-14  0:18 ` Richard Henderson
  2017-03-14  0:39   ` Till Smejkal
  2017-03-14  0:58 ` Andy Lutomirski
  14 siblings, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2017-03-14  0:18 UTC (permalink / raw)
  To: Till Smejkal, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

On 03/14/2017 08:14 AM, Till Smejkal wrote:
> At the current state of the development, first class virtual address spaces
> have one limitation, that we haven't been able to solve so far. The feature
> allows, that different threads of the same process can execute in different
> AS at the same time. This is possible, because the VAS-switch operation
> only changes the active mm_struct for the task_struct of the calling
> thread. However, when a thread switches into a first class virtual address
> space, some parts of its original AS are duplicated into the new one to
> allow the thread to continue its execution at its current state.
> Accordingly, parts of the processes AS (e.g. the code section, data
> section, heap section and stack sections) exist in multiple AS if the
> process has a VAS attached to it. Changes to these shared memory regions
> are synchronized between the address spaces whenever a thread switches
> between two of them. Unfortunately, in some scenarios the kernel is not
> able to properly synchronize all these shared memory regions because of
> conflicting changes. One such example happens if there are two threads, one
> executing in an attached first class virtual address space, the other in
> the tasks original address space. If both threads make changes to the heap
> section that cause expansion of the underlying vm_area_struct, the kernel
> cannot correctly synchronize these changes, because that would cause parts
> of the virtual address space to be overwritten with unrelated data. In the
> current implementation such conflicts are only detected but not resolved
> and result in an error code being returned by the kernel during the VAS
> switch operation. Unfortunately, that means for the particular thread that
> tried to make the switch, that it cannot do this anymore in the future and
> accordingly has to be killed.

This sounds like a fairly fundamental problem to me.

Is this an indication that full virtual address spaces are useless?  It would 
seem like if you only use virtual address segments then you avoid all of the 
problems with executing code, active stacks, and brk.


r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces
  2017-03-13 23:52   ` Greg Kroah-Hartman
@ 2017-03-14  0:24     ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-14  0:24 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Till Smejkal, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Vineet Gupta, Russell King, Catalin Marinas, Will Deacon,
	Steven Miao, Richard Kuo, Tony Luck, Fenghua Yu, James Hogan,
	Ralf Baechle, James E.J. Bottomley, Helge Deller,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	David S. Miller, Chris Metcalf, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Andy Lutomirski, Chris Zankel, Max Filippov,
	Arnd Bergmann, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

Hi Greg,

First of all thanks for your reply.

On Tue, 14 Mar 2017, Greg Kroah-Hartman wrote:
> On Mon, Mar 13, 2017 at 03:14:12PM -0700, Till Smejkal wrote:
> 
> There's no way with that many cc: lists and people that this is really
> making it through very many people's filters and actually on a mailing
> list.  Please trim them down.

I am sorry that the patch's cc-list is too big. This is the list of people that the
get_maintainers.pl script produced. I realized that it was a huge number of people, but
I didn't want to remove anyone from the list because I wasn't sure who would be
interested in this patch set. Do you have any suggestion whom to remove from the list?
I don't want to annoy anyone with useless emails.

> Minor sysfs questions/issues:
> 
> > +struct vas {
> > +	struct kobject kobj;		/* < the internal kobject that we use *
> > +					 *   for reference counting and sysfs *
> > +					 *   handling.                        */
> > +
> > +	int id;				/* < ID                               */
> > +	char name[VAS_MAX_NAME_LENGTH];	/* < name                             */
> 
> The kobject has a name, why not use that?

The reason why I don't use the kobject's name is that I don't restrict the names that
can be used for VAS/VAS segments. Accordingly, a name like "foo/bar/xyz" would be a
valid VAS name. However, I am not sure what would happen in sysfs if I used such a name
for the kobject, especially since there could be another VAS named "foo/bar" whose name
would conflict with the first one although it does not necessarily have any connection
to it.

> > +
> > +	struct mutex mtx;		/* < lock for parallel access.        */
> > +
> > +	struct mm_struct *mm;		/* < a partial memory map containing  *
> > +					 *   all mappings of this VAS.        */
> > +
> > +	struct list_head link;		/* < the link in the global VAS list. */
> > +	struct rcu_head rcu;		/* < the RCU helper used for          *
> > +					 *   asynchronous VAS deletion.       */
> > +
> > +	u16 refcount;			/* < how often is the VAS attached.   */
> 
> The kobject has a refcount, use that?  Don't have 2 refcounts in the
> same structure, that way lies madness.  And bugs, lots of bugs...
> 
> And if this really is a refcount (hint, I don't think it is), you should
> use the refcount_t type.

I actually use both the internal kobject refcount to keep track of how often a VAS/VAS
segment is referenced and this 'refcount' variable to keep track of how often the VAS
is actually attached to a task. The two are not necessarily related to each other. I
can rename this variable to attach_count. Or, if preferred, I can alternatively use
only the kobject reference counter and remove this variable completely, though I would
then lose the information about how often the VAS is attached to a task, because the
kobject reference counter also tracks other references to the VAS.

> > +/**
> > + * The sysfs structure we need to handle attributes of a VAS.
> > + **/
> > +struct vas_sysfs_attr {
> > +	struct attribute attr;
> > +	ssize_t (*show)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > +			char *buf);
> > +	ssize_t (*store)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > +			 const char *buf, size_t count);
> > +};
> > +
> > +#define VAS_SYSFS_ATTR(NAME, MODE, SHOW, STORE)				\
> > +static struct vas_sysfs_attr vas_sysfs_attr_##NAME =			\
> > +	__ATTR(NAME, MODE, SHOW, STORE)
> 
> __ATTR_RO and __ATTR_RW should work better for you.  If you really need
> this.

Thank you. I will have a look at these functions.

> Oh, and where is the Documentation/ABI/ updates to try to describe the
> sysfs structure and files?  Did I miss that in the series?

Oh sorry, I forgot to add this file. I will add the ABI descriptions for future
submissions.

> > +static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > +			       char *buf)
> > +{
> > +	return scnprintf(buf, PAGE_SIZE, "%s", vas->name);
> 
> It's a page size, just use sprintf() and be done with it.  No need to
> ever check, you "know" it will be correct.

OK. I was following the sysfs example in the documentation that used scnprintf, but
if sprintf is preferred, I can change this.

> Also, what about a trailing '\n' for these attributes?

I will change this.

> Oh wait, why have a name when the kobject name is already there in the
> directory itself?  Do you really need this?

See above.

> > +/**
> > + * The ktype data structure representing a VAS.
> > + **/
> > +static struct kobj_type vas_ktype = {
> > +	.sysfs_ops = &vas_sysfs_ops,
> > +	.release = __vas_release,
> 
> Why the odd __vas* naming?  What's wrong with vas_release?

I was using the __* naming scheme for functions that are not used outside of my source
file. But I can change this if people don't like it. I have no strong feelings about
the function names.

> > +	.default_attrs = vas_default_attr,
> > +};
> > +
> > +
> > +/***
> > + * Internally visible functions
> > + ***/
> > +
> > +/**
> > + * Working with the global VAS list.
> > + **/
> > +static inline void vas_remove(struct vas *vas)
> 
> <snip>
> 
> You have a ton of inline functions, for no good reason.  Make them all
> "real" functions please.  Unless you can measure the size/speed
> differences?  If so, please say so.

There was no specific reason why I declared the functions as inline except the hope to
avoid the function-call overhead for some of my very small functions. I can look more
closely at this and check whether there is a real benefit in inlining them and, if not,
remove the inline keywords.

Thank you very much.

Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14  0:18 ` [RFC PATCH 00/13] Introduce " Richard Henderson
@ 2017-03-14  0:39   ` Till Smejkal
  2017-03-14  1:02     ` Richard Henderson
  0 siblings, 1 reply; 45+ messages in thread
From: Till Smejkal @ 2017-03-14  0:39 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Till Smejkal, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 08:14 AM, Till Smejkal wrote:
> > At the current state of the development, first class virtual address spaces
> > have one limitation, that we haven't been able to solve so far. The feature
> > allows, that different threads of the same process can execute in different
> > AS at the same time. This is possible, because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
> > Accordingly, parts of the processes AS (e.g. the code section, data
> > section, heap section and stack sections) exist in multiple AS if the
> > process has a VAS attached to it. Changes to these shared memory regions
> > are synchronized between the address spaces whenever a thread switches
> > between two of them. Unfortunately, in some scenarios the kernel is not
> > able to properly synchronize all these shared memory regions because of
> > conflicting changes. One such example happens if there are two threads, one
> > executing in an attached first class virtual address space, the other in
> > the tasks original address space. If both threads make changes to the heap
> > section that cause expansion of the underlying vm_area_struct, the kernel
> > cannot correctly synchronize these changes, because that would cause parts
> > of the virtual address space to be overwritten with unrelated data. In the
> > current implementation such conflicts are only detected but not resolved
> > and result in an error code being returned by the kernel during the VAS
> > switch operation. Unfortunately, that means for the particular thread that
> > tried to make the switch, that it cannot do this anymore in the future and
> > accordingly has to be killed.
> 
> This sounds like a fairly fundamental problem to me.

Yes, I agree. This is a significant limitation of first class virtual address spaces.
However, conflicts like this can be mitigated by being careful in the application that
uses multiple first class virtual address spaces. If all threads make sure that they
never resize shared memory regions while executing inside a VAS, such conflicts do not
occur. Another possibility that I investigated but have not yet finished is to
synchronize such resizes of shared memory regions more frequently than just at every
switch between VASes. If one, for example, "forwards" memory region resizes to all AS
that share the particular memory region during the resize operation, the problem can be
eliminated completely. Unfortunately, this introduces a significant cost as well as a
race condition that is difficult to handle.

> Is this an indication that full virtual address spaces are useless?  It
> would seem like if you only use virtual address segments then you avoid all
> of the problems with executing code, active stacks, and brk.

What do you mean with *virtual address segments*? The nice part of first class
virtual address spaces is that one can share/reuse collections of address space
segments easily.

Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
                   ` (13 preceding siblings ...)
  2017-03-14  0:18 ` [RFC PATCH 00/13] Introduce " Richard Henderson
@ 2017-03-14  0:58 ` Andy Lutomirski
  2017-03-14  2:07   ` Till Smejkal
  14 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2017-03-14  0:58 UTC (permalink / raw)
  To: Till Smejkal
  Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal
<till.smejkal@googlemail.com> wrote:
> This patchset extends the kernel memory management subsystem with a new
> type of address spaces (called VAS) which can be created and destroyed
> independently of processes by a user in the system. During its lifetime
> such a VAS can be attached to processes by the user which allows a process
> to have multiple address spaces and thereby multiple, potentially
> different, views on the system's main memory. During its execution the
> threads belonging to the process are able to switch freely between the
> different attached VAS and the process' original AS enabling them to
> utilize the different available views on the memory.

Sounds like the old SKAS feature for UML.

> In addition to the concept of first class virtual address spaces, this
> patchset introduces yet another feature called VAS segments. VAS segments
> are memory regions which have a fixed size and position in the virtual
> address space and can be shared between multiple first class virtual
> address spaces. Such shareable memory regions are especially useful for
> in-memory pointer-based data structures or other pure in-memory data.

This sounds rather complicated.  Getting TLB flushing right seems
tricky.  Why not just map the same thing into multiple mms?

>
>             |     VAS     |  processes  |
>     -------------------------------------
>     switch  |       468ns |      1944ns |

The solution here is IMO to fix the scheduler.

Also, FWIW, I have patches (that need a little work) that will make
switch_mm() waaaay faster on x86.

> At the current state of the development, first class virtual address spaces
> have one limitation, that we haven't been able to solve so far. The feature
> allows, that different threads of the same process can execute in different
> AS at the same time. This is possible, because the VAS-switch operation
> only changes the active mm_struct for the task_struct of the calling
> thread. However, when a thread switches into a first class virtual address
> space, some parts of its original AS are duplicated into the new one to
> allow the thread to continue its execution at its current state.

Ick.  Please don't do this.  Can we please keep an mm as just an mm
and not make it look magically different depending on which process
maps it?  If you need a trampoline (which you do, of course), just
write a trampoline in regular user code and map it manually.

--Andy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14  0:39   ` Till Smejkal
@ 2017-03-14  1:02     ` Richard Henderson
  2017-03-14  1:31       ` Till Smejkal
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2017-03-14  1:02 UTC (permalink / raw)
  To: Till Smejkal, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

On 03/14/2017 10:39 AM, Till Smejkal wrote:
>> Is this an indication that full virtual address spaces are useless?  It
>> would seem like if you only use virtual address segments then you avoid all
>> of the problems with executing code, active stacks, and brk.
>
> What do you mean with *virtual address segments*? The nice part of first class
> virtual address spaces is that one can share/reuse collections of address space
> segments easily.

What do *I* mean?  You introduced the term, didn't you?
Rereading your original I see you called them "VAS segments".

Anyway, whatever they are called, it would seem that these segments do not 
require any of the syncing mechanisms that are causing you problems.


r~


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14  1:02     ` Richard Henderson
@ 2017-03-14  1:31       ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-14  1:31 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Till Smejkal, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel

On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 10:39 AM, Till Smejkal wrote:
> > > Is this an indication that full virtual address spaces are useless?  It
> > > would seem like if you only use virtual address segments then you avoid all
> > > of the problems with executing code, active stacks, and brk.
> > 
> > What do you mean with *virtual address segments*? The nice part of first class
> > virtual address spaces is that one can share/reuse collections of address space
> > segments easily.
> 
> What do *I* mean?  You introduced the term, didn't you?
> Rereading your original I see you called them "VAS segments".

Oh, I am sorry. I thought that you were referring to some other feature that I didn't
know about.

> Anyway, whatever they are called, it would seem that these segments do not
> require any of the syncing mechanisms that are causing you problems.

Yes, VAS segments provide a way to share memory regions between multiple address spaces
without the need to synchronize heap, stack, etc. Unfortunately, the VAS segment
feature on its own, without the whole concept of first class virtual address spaces, is
not as powerful. With some additional work it can probably be represented with the
existing shmem functionality.

The first class virtual address space feature, on the other hand, provides in our
opinion a real benefit for applications, namely that an application can switch between
different views on its memory, which enables various interesting programming paradigms
as mentioned in the cover letter.

Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces
  2017-03-13 22:14 ` [RFC PATCH 10/13] mm: Introduce first class virtual address spaces Till Smejkal
  2017-03-13 23:52   ` Greg Kroah-Hartman
@ 2017-03-14  1:35   ` Vineet Gupta
  2017-03-14  2:34     ` Till Smejkal
  1 sibling, 1 reply; 45+ messages in thread
From: Vineet Gupta @ 2017-03-14  1:35 UTC (permalink / raw)
  To: Till Smejkal, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J.  Bottomley, Helge Deller, Benjamin  Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller
  Cc: linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel, Ingo Molnar, Thomas Gleixner, Andy Lutomirski

+CC Ingo, tglx

Hi Till,

On 03/13/2017 03:14 PM, Till Smejkal wrote:
> Introduce a different type of address spaces which are first class citizens
> in the OS. That means that the kernel now handles two types of AS, those
> which are closely coupled with a process and those which aren't. While the
> former ones are created and destroyed together with the process by the
> kernel and are the default type of AS in the Linux kernel, the latter ones
> have to be managed explicitly by the user and are the newly introduced
> type.
> 
> Accordingly, a first class AS (also called VAS == virtual address space)
> can exist in the OS independently from any process. A user has to
> explicitly create and destroy them in the system. Processes and VAS can be
> combined by attaching a previously created VAS to a process which basically
> adds an additional AS to the process that the process' threads are able to
> execute in. Hence, VAS allow a process to have different views onto the
> main memory of the system (its original AS and the attached VAS) between
> which its threads can switch arbitrarily during their lifetime.
> 
> The functionality made available through first class virtual address spaces
> can be used in various different ways. One possible way to utilize VAS is
> to compartmentalize a process for security reasons. Another possible usage
> is to improve the performance of data-centric applications by being able to
> manage different sets of data in memory without the need to map or unmap
> them.
> 
> Furthermore, first class virtual address spaces can be attached to
> different processes at the same time if the underlying memory is only
> readable. This mechanism allows sharing of whole address spaces between
> multiple processes that can both execute in them using the contained
> memory.

I've not looked at the patches closely (or read the referenced paper fully yet),
but at first glance it seems that on the ARC architecture we could potentially
use/leverage this mechanism to implement shared TLB entries. Before anyone shouts:
these are not the same as the IA64/x86 protection keys, which allow TLB entries
with different protection bits across processes etc. These TLB entries are
actually *shared* by processes.

Conceptually there are shared address spaces, independent of processes. E.g. ld.so
code is shared address space #1, libc (code) #2 .... The system can support a limited
number of shared addr spaces (say 64, enough for a typical embedded sys).

While normal TLB entries are tagged with an ASID (Addr Space ID) to keep them unique
across processes, shared TLB entries are tagged with a Shared Address Space ID.

A process MMU context consists of an ASID (a single number) and a SASID bitmap (to
allow "subscription" to multiple shared spaces). The subscriptions are set up by
userspace ld.so, which knows about the libs the process wants to map.

The restriction of course is that the spaces are mapped at the *same* vaddr in all
participating processes. I know this goes against the whole security / address space
randomization story - but it gives much better real-time performance. Why does each
process need to take an MMU exception for libc code...

So long story short - it seems there can be multiple uses of this infrastructure !

-Vineet


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14  0:58 ` Andy Lutomirski
@ 2017-03-14  2:07   ` Till Smejkal
  2017-03-14  5:37     ` Andy Lutomirski
  0 siblings, 1 reply; 45+ messages in thread
From: Till Smejkal @ 2017-03-14  2:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Till Smejkal, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Vineet Gupta, Russell King, Catalin Marinas, Will Deacon,
	Steven Miao, Richard Kuo, Tony Luck, Fenghua Yu, James Hogan,
	Ralf Baechle, James E.J. Bottomley, Helge Deller,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	David S. Miller, Chris Metcalf, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86 ML, Chris Zankel, Max Filippov,
	Arnd Bergmann, Greg Kroah-Hartman, Laurent Pinchart,
	Mauro Carvalho Chehab, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, David Woodhouse, Brian Norris, Boris Brezillon,
	Marek Vasut, Richard Weinberger, Cyrille Pitchen, Felipe Balbi,
	Alexander Viro, Benjamin LaHaise, Nadia Yvette Chambers,
	Jeff Layton, J. Bruce Fields, Peter Zijlstra, Hugh Dickins,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jaroslav Kysela,
	Takashi Iwai, linux-kernel, linux-alpha, arcml, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	Linux MIPS Mailing List, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, sparclinux, linux-xtensa, Linux Media Mailing List,
	linux-mtd, USB list, Linux FS Devel, linux-aio, linux-mm,
	Linux API, linux-arch, ALSA development

On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal
> <till.smejkal@googlemail.com> wrote:
> > This patchset extends the kernel memory management subsystem with a new
> > type of address spaces (called VAS) which can be created and destroyed
> > independently of processes by a user in the system. During its lifetime
> > such a VAS can be attached to processes by the user which allows a process
> > to have multiple address spaces and thereby multiple, potentially
> > different, views on the system's main memory. During its execution the
> > threads belonging to the process are able to switch freely between the
> > different attached VAS and the process' original AS enabling them to
> > utilize the different available views on the memory.
> 
> Sounds like the old SKAS feature for UML.

I hadn't heard of this feature before, but after briefly looking at the description
on the UML website it actually has some similarities with what I am proposing. But as
far as I can see this was not merged into the mainline kernel, was it? In addition, I
think that first class virtual address spaces go even one step further by allowing
AS to live independently of processes.

> > In addition to the concept of first class virtual address spaces, this
> > patchset introduces yet another feature called VAS segments. VAS segments
> > are memory regions which have a fixed size and position in the virtual
> > address space and can be shared between multiple first class virtual
> > address spaces. Such shareable memory regions are especially useful for
> > in-memory pointer-based data structures or other pure in-memory data.
> 
> This sounds rather complicated.  Getting TLB flushing right seems
> tricky.  Why not just map the same thing into multiple mms?

This is exactly what happens in the end. The memory region that is described by the
VAS segment will be mapped into the ASes that use the segment.

> >
> >             |     VAS     |  processes  |
> >     -------------------------------------
> >     switch  |       468ns |      1944ns |
> 
> The solution here is IMO to fix the scheduler.

IMHO it will be very difficult for the scheduler code to reach the same switching time
as the pure VAS switch, because switching between VASes does not involve saving any
registers or FPU state and does not require selecting the next runnable task. A VAS
switch is basically a system call that just changes the AS of the current thread,
which makes it a very lightweight operation.
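
To illustrate how simple this is from the application's point of view, here is a minimal
sketch. The wrapper names vas_create(), vas_attach() and vas_switch() are placeholders
and not necessarily the ABI of this patchset; error handling is omitted.

    #include <sys/types.h>

    /* hypothetical thin wrappers around the VAS system calls */
    extern int vas_create(const char *name, mode_t mode);  /* create an empty VAS  */
    extern int vas_attach(pid_t pid, int vid, int flags);  /* attach it to a task  */
    extern int vas_switch(int vid);                        /* run in the given VAS */

    void example(void)
    {
            int vid = vas_create("scratch", 0600);
            vas_attach(0, vid, 0);      /* 0 is assumed to mean the calling process */
            vas_switch(vid);            /* only mm/active_mm change; no FPU save,   */
                                        /* no scheduler involvement                 */
            /* ... work on the alternate view of memory ... */
            vas_switch(0);              /* assumed: 0 switches back to the original AS */
    }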

> Also, FWIW, I have patches (that need a little work) that will make
> switch_mm() waaaay faster on x86.

These patches will also improve the speed of the VAS switch operation. We also use the
switch_mm function in the background to perform the actual hardware switch between the
two ASes. The main reason why the VAS switch is faster than a task switch is that it
simply has to do fewer things.

> > At the current state of the development, first class virtual address spaces
> > have one limitation, that we haven't been able to solve so far. The feature
> > allows, that different threads of the same process can execute in different
> > AS at the same time. This is possible, because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
> 
> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> and not make it look magically different depending on which process
> maps it?  If you need a trampoline (which you do, of course), just
> write a trampoline in regular user code and map it manually.

Did I understand you correctly that you are proposing that the switching thread should
make sure by itself that its code, stack, … memory regions are properly set up in the
new AS before/after switching into it? I think this would make using first class
virtual address spaces much more difficult for user applications, to the extent that I
am not even sure whether they could be used at all. At the moment, switching into a VAS
is a very simple operation for an application because the kernel will simply do the
right thing.

Till


* Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces
  2017-03-14  1:35   ` Vineet Gupta
@ 2017-03-14  2:34     ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-14  2:34 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: Till Smejkal, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J.  Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	linux-kernel, linux-alpha, linux-snps-arc, linux-arm-kernel,
	adi-buildroot-devel, linux-hexagon, linux-ia64, linux-metag,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-media, linux-mtd, linux-usb,
	linux-fsdevel, linux-aio, linux-mm, linux-api, linux-arch,
	alsa-devel, Ingo Molnar, Thomas Gleixner, Andy Lutomirski

Hi Vineet,

On Mon, 13 Mar 2017, Vineet Gupta wrote:
> I've not looked at the patches closely (or read the referenced paper fully yet),
> but at first glance it seems that on the ARC architecture we could potentially
> use/leverage this mechanism to implement shared TLB entries. Before anyone shouts:
> these are not the same as the IA64/x86 protection keys which allow TLB entries
> with different protection bits across processes etc. These TLB entries are
> actually *shared* by processes.
> 
> Conceptually there are shared address spaces, independent of processes, e.g. ldso
> code is shared address space #1, libc (code) #2 ... The system can support a limited
> number of shared addr spaces (say 64, enough for a typical embedded sys).
> 
> While normal TLB entries are tagged with an ASID (Addr Space ID) to keep them unique
> across processes, shared TLB entries are tagged with a Shared Address Space ID.
> 
> A process MMU context consists of an ASID (a single number) and a SASID bitmap (to
> allow "subscription" to multiple shared spaces). The subscriptions are set up by
> userspace ld.so which knows about the libs the process wants to map.
> 
> The restriction of course is that the spaces are mapped at the *same* vaddr in all
> participating processes. I know this goes against the whole security, address space
> randomization story - but it gives much better real time performance. Why should each
> process need to take an MMU exception for libc code...
> 
> So long story short - it seems there can be multiple uses of this infrastructure!

During the development of this code, we also looked at shared TLB entries, but the
other way around. We wanted to use them to prevent flushing of TLB entries of shared
memory regions when switching between multiple ASes. Unfortunately, we never finished
this part of the code.

However, we also investigated a different use-case for first class virtual address
spaces that is related to what you propose, if I understood you correctly. The idea is
to move shared libraries into their own first class virtual address space and only load
some small trampoline code in the application's AS. This trampoline code performs the
VAS switch into the library's AS and executes the requested function there. If we
combine this architecture with tagged TLB entries to prevent TLB flushes during the
switch operation, it can also reach an acceptable performance. A side effect of moving
the shared library into its own AS is that it cannot be used for ROP attacks because it
is not accessible in the application's AS.
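
As a rough sketch of that trampoline (vas_switch() and the two VAS ids are placeholder
names, and the stack has to be shared between both ASes for this to work):

    static int lib_vas;     /* VAS that holds the shared library                  */
    static int app_vas;     /* the application's original AS, saved during attach */

    /* trampoline code mapped in the application's AS */
    long lib_call(long (*fn)(long), long arg)
    {
            long ret;

            vas_switch(lib_vas);    /* enter the AS that contains the library  */
            ret = fn(arg);          /* fn is only a valid pointer in lib_vas   */
            vas_switch(app_vas);    /* return to the application's AS          */
            return ret;
    }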

Till


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14  2:07   ` Till Smejkal
@ 2017-03-14  5:37     ` Andy Lutomirski
  2017-03-14 16:12       ` Till Smejkal
  0 siblings, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2017-03-14  5:37 UTC (permalink / raw)
  To: Andy Lutomirski, Till Smejkal, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Miao, Richard Kuo,
	Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
<till.smejkal@googlemail.com> wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> This sounds rather complicated.  Getting TLB flushing right seems
>> tricky.  Why not just map the same thing into multiple mms?
>
> This is exactly what happens at the end. The memory region that is described by the
> VAS segment will be mapped in the ASes that use the segment.

So why is this kernel feature better than just doing MAP_SHARED
manually in userspace?


>> Ick.  Please don't do this.  Can we please keep an mm as just an mm
>> and not make it look magically different depending on which process
>> maps it?  If you need a trampoline (which you do, of course), just
>> write a trampoline in regular user code and map it manually.
>
> Did I understand you correctly that you are proposing that the switching thread
> should make sure by itself that its code, stack, … memory regions are properly setup
> in the new AS before/after switching into it? I think, this would make using first
> class virtual address spaces much more difficult for user applications to the extend
> that I am not even sure if they can be used at all. At the moment, switching into a
> VAS is a very simple operation for an application because the kernel will just simply
> do the right thing.

Yes.  I think that having the same mm_struct look different from
different tasks is problematic.  Getting it right in the arch code is
going to be nasty.  The heuristics of what to share are also tough --
why would text + data + stack or whatever you're doing be adequate?
What if you're in a thread?  What if two tasks have their stacks in
the same place?

I could imagine that something like a sigaltstack() mode that lets you set
a signal up to also switch mm could be useful.


* RE: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'
  2017-03-13 22:14 ` [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init' Till Smejkal
@ 2017-03-14 10:18   ` David Laight
  2017-03-14 16:18     ` Till Smejkal
  0 siblings, 1 reply; 45+ messages in thread
From: David Laight @ 2017-03-14 10:18 UTC (permalink / raw)
  To: 'Till Smejkal',
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker
  Cc: linux-mips, alsa-devel, linux-ia64, linux-aio, linux-mm,
	linux-mtd, sparclinux, linux-arch, linux-s390, linux-hexagon,
	linux-sh, linux-snps-arc, linux-media, linux-xtensa,
	adi-buildroot-devel@lists.sourceforge.net

From: Linuxppc-dev Till Smejkal
> Sent: 13 March 2017 22:14
> The only way until now to create a new memory map was via the exported
> function 'mm_alloc'. Unfortunately, this function not only allocates a new
> memory map, but also completely initializes it. However, with the
> introduction of first class virtual address spaces, some initialization
> steps done in 'mm_alloc' are not applicable to the memory maps needed for
> this feature and hence would lead to errors in the kernel code.
> 
> Instead of introducing a new function that can allocate and initialize
> memory maps for first class virtual address spaces and potentially
> duplicate some code, I decided to split the mm_alloc function as well as
> the 'mm_init' function that it uses.
> 
> Now there are four functions exported instead of only one. The new
> 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
> want to have the old behavior of mm_alloc one can use the newly introduced
> function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
> also fully initializes it.
...

That looks like bugs waiting to happen.
You need unchanged code to fail to compile.

	David



* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14  5:37     ` Andy Lutomirski
@ 2017-03-14 16:12       ` Till Smejkal
  2017-03-14 19:53         ` Chris Metcalf
  2017-03-15 16:51         ` Andy Lutomirski
  0 siblings, 2 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-14 16:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Till Smejkal, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Miao, Richard Kuo,
	Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
> <till.smejkal@googlemail.com> wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> >> This sounds rather complicated.  Getting TLB flushing right seems
> >> tricky.  Why not just map the same thing into multiple mms?
> >
> > This is exactly what happens at the end. The memory region that is described by the
> > VAS segment will be mapped in the ASes that use the segment.
> 
> So why is this kernel feature better than just doing MAP_SHARED
> manually in userspace?

One advantage of VAS segments is that they can be globally queried by user programs,
which means that VAS segments can be shared by applications that do not necessarily
have to be related. If I am not mistaken, MAP_SHARED of pure in-memory data will only
work if the tasks that share the memory region are related (i.e. have a common parent
that initialized the shared mapping). Otherwise, the shared mapping has to be backed by
a file. VAS segments on the other hand allow sharing of pure in-memory data by
arbitrary, not necessarily related, tasks without the need for a file. This becomes
especially interesting if one combines VAS segments with non-volatile memory, since one
can keep data structures in the NVM and still be able to share them between multiple
tasks.
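
For comparison, the classic anonymous MAP_SHARED pattern only works across fork(), i.e.
between related tasks. A minimal sketch (error handling omitted):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            /* visible to the child only because it is inherited across fork();
             * an unrelated process has no way to attach to this mapping */
            char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);

            if (fork() == 0) {
                    strcpy(buf, "hello from the child");
                    _exit(0);
            }
            wait(NULL);
            printf("parent sees: %s\n", buf);
            return 0;
    }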

> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> >> and not make it look magically different depending on which process
> >> maps it?  If you need a trampoline (which you do, of course), just
> >> write a trampoline in regular user code and map it manually.
> >
> > Did I understand you correctly that you are proposing that the switching thread
> > should make sure by itself that its code, stack, … memory regions are properly setup
> > in the new AS before/after switching into it? I think, this would make using first
> > class virtual address spaces much more difficult for user applications to the extend
> > that I am not even sure if they can be used at all. At the moment, switching into a
> > VAS is a very simple operation for an application because the kernel will just simply
> > do the right thing.
> 
> Yes.  I think that having the same mm_struct look different from
> different tasks is problematic.  Getting it right in the arch code is
> going to be nasty.  The heuristics of what to share are also tough --
> why would text + data + stack or whatever you're doing be adequate?
> What if you're in a thread?  What if two tasks have their stacks in
> the same place?

The different ASes that a task can now have when it uses first class virtual address
spaces are not realized in the kernel by using only one mm_struct per task that just
looks different, but by using multiple mm_structs - one for each AS that the task can
execute in. When a task attaches a first class virtual address space to itself to be
able to use another AS, the kernel adds a temporary mm_struct to this task that
contains the mappings of the first class virtual address space and the ones shared
with the task's original AS. If a thread now wants to switch into this attached first
class virtual address space, the kernel only changes the 'mm' and 'active_mm' pointers
in the task_struct of the thread to the temporary mm_struct and performs the
corresponding mm_switch operation. The original mm_struct of the thread will not be
changed.

Accordingly, I do not magically make mm_structs look different depending on the task
that uses them, but create temporary mm_structs that only contain mappings to the same
memory regions.
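
In pseudo-kernel-code the switch itself is roughly the following (a greatly simplified
sketch loosely modelled on use_mm(); the function name is made up and locking, TLB and
mm refcounting details are omitted):

    static void vas_switch_mm(struct task_struct *tsk, struct mm_struct *next)
    {
            struct mm_struct *prev = tsk->active_mm;

            task_lock(tsk);
            tsk->mm = next;                 /* temporary mm built for the VAS    */
            tsk->active_mm = next;
            switch_mm(prev, next, tsk);     /* program the hardware MMU context  */
            task_unlock(tsk);
            /* the task's original mm_struct is left untouched */
    }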

I agree that finding a good heuristic of what to share is difficult. At the moment, all
memory regions that are available in the task's original AS will also be available
(i.e. are shared) when a thread switches into an attached first class virtual address
space. That means that, in the current state of the implementation, VAS can mainly be
used to extend the AS of a task. The reason why I implemented the sharing in this way
is that I didn't want to break shared libraries. If I only shared code+heap+stack,
shared libraries would not work anymore after switching into a VAS.

> I could imagine something like a sigaltstack() mode that lets you set
> a signal up to also switch mm could be useful.

This is a very interesting idea. I will keep it in mind for future use cases of
multiple virtual address spaces per task.

Thanks
Till


* Re: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'
  2017-03-14 10:18   ` David Laight
@ 2017-03-14 16:18     ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-14 16:18 UTC (permalink / raw)
  To: David Laight
  Cc: 'Till Smejkal',
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Andy Lutomirski, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-mips,
	alsa-devel, linux-ia64, linux-aio, linux-mm, linux-mtd,
	sparclinux, linux-arch, linux-s390, linux-hexagon, linux-sh,
	linux-snps-arc, linux-media, linux-xtensa, adi-buildroot-devel,
	linux-metag, linux-arm-kernel, linux-parisc, linux-api,
	linux-usb, linux-kernel, linux-alpha, linux-fsdevel,
	linuxppc-dev

On Tue, 14 Mar 2017, David Laight wrote:
> From: Linuxppc-dev Till Smejkal
> > Sent: 13 March 2017 22:14
> > The only way until now to create a new memory map was via the exported
> > function 'mm_alloc'. Unfortunately, this function not only allocates a new
> > memory map, but also completely initializes it. However, with the
> > introduction of first class virtual address spaces, some initialization
> > steps done in 'mm_alloc' are not applicable to the memory maps needed for
> > this feature and hence would lead to errors in the kernel code.
> > 
> > Instead of introducing a new function that can allocate and initialize
> > memory maps for first class virtual address spaces and potentially
> > duplicate some code, I decided to split the mm_alloc function as well as
> > the 'mm_init' function that it uses.
> > 
> > Now there are four functions exported instead of only one. The new
> > 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
> > want to have the old behavior of mm_alloc one can use the newly introduced
> > function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
> > also fully initializes it.
> ...
> 
> That looks like bugs waiting to happen.
> You need unchanged code to fail to compile.

Thank you for this hint. I can give the new mm_alloc function a different name so that
code using the *old* mm_alloc function will fail to compile. I just reused the old name
when I wrote the code, because mm_alloc was only used in very few locations in the
kernel (2 times in the whole kernel source), which made identifying and changing them
very easy. I also don't think that there will be many users of mm_alloc in the kernel
in the future, because it operates on a relatively low-level data structure. But if it
is better to use a different name for the new function, I am very happy to change this.
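
For example, the split could then look something like this (only mm_alloc and
mm_alloc_and_setup are named in the commit message; the name of the bare allocator is
purely illustrative):

    /* bare allocator: allocate a new mm_struct and zero it, nothing more */
    struct mm_struct *mm_alloc_bare(void);

    /* old mm_alloc() behaviour: allocate and fully initialize the mm_struct */
    struct mm_struct *mm_alloc_and_setup(void);

    /* 'mm_alloc' itself goes away, so any unconverted caller fails to compile */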

Till


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14 16:12       ` Till Smejkal
@ 2017-03-14 19:53         ` Chris Metcalf
  2017-03-14 21:14           ` Till Smejkal
  2017-03-15 16:51         ` Andy Lutomirski
  1 sibling, 1 reply; 45+ messages in thread
From: Chris Metcalf @ 2017-03-14 19:53 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Chris Zankel, Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On 3/14/2017 12:12 PM, Till Smejkal wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
>> <till.smejkal@googlemail.com> wrote:
>>> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>>>> This sounds rather complicated.  Getting TLB flushing right seems
>>>> tricky.  Why not just map the same thing into multiple mms?
>>> This is exactly what happens at the end. The memory region that is described by the
>>> VAS segment will be mapped in the ASes that use the segment.
>> So why is this kernel feature better than just doing MAP_SHARED
>> manually in userspace?
> One advantage of VAS segments is that they can be globally queried by user programs
> which means that VAS segments can be shared by applications that not necessarily have
> to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work
> if the tasks that share the memory region are related (aka. have a common parent that
> initialized the shared mapping). Otherwise, the shared mapping have to be backed by a
> file.

True, but why is this bad?  The shared mapping will be memory resident
regardless, even if backed by a file (unless swapped out under heavy
memory pressure, but arguably that's a feature anyway).  More importantly,
having a file name is a simple and consistent way of identifying such
shared memory segments.

With a little work, you can also arrange to map such files into memory
at a fixed address in all participating processes, thus making internal
pointers work correctly.
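
Something along these lines (the path and the agreed-upon address are arbitrary; note
that the result is checked instead of forcing it with MAP_FIXED):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_ADDR ((void *)0x700000000000UL)  /* agreed on by all participants */
    #define SEG_SIZE (1UL << 20)

    int main(void)
    {
            int fd = open("/dev/shm/segment", O_RDWR | O_CREAT, 0600);
            ftruncate(fd, SEG_SIZE);

            void *p = mmap(SEG_ADDR, SEG_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p != SEG_ADDR) {            /* hint was not honoured: bail out */
                    fprintf(stderr, "segment address unavailable\n");
                    return 1;
            }
            /* pointers stored inside the segment are now valid in every
             * process that mapped the file at the same address */
            return 0;
    }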

> VAS segments on the other side allow sharing of pure in memory data by
> arbitrary related tasks without the need of a file. This becomes especially
> interesting if one combines VAS segments with non-volatile memory since one can keep
> data structures in the NVM and still be able to share them between multiple tasks.

I am not fully up to speed on NV/pmem stuff, but isn't that exactly what
the DAX mode is supposed to allow you to do?  If so, isn't sharing a
mapped file on a DAX filesystem on top of pmem equivalent to what
you're proposing?

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14 19:53         ` Chris Metcalf
@ 2017-03-14 21:14           ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-14 21:14 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Chris Zankel, Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Tue, 14 Mar 2017, Chris Metcalf wrote:
> On 3/14/2017 12:12 PM, Till Smejkal wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
> > > <till.smejkal@googlemail.com> wrote:
> > > > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > > > This sounds rather complicated.  Getting TLB flushing right seems
> > > > > tricky.  Why not just map the same thing into multiple mms?
> > > > This is exactly what happens at the end. The memory region that is described by the
> > > > VAS segment will be mapped in the ASes that use the segment.
> > > So why is this kernel feature better than just doing MAP_SHARED
> > > manually in userspace?
> > One advantage of VAS segments is that they can be globally queried by user programs
> > which means that VAS segments can be shared by applications that not necessarily have
> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work
> > if the tasks that share the memory region are related (aka. have a common parent that
> > initialized the shared mapping). Otherwise, the shared mapping have to be backed by a
> > file.
> 
> True, but why is this bad?  The shared mapping will be memory resident
> regardless, even if backed by a file (unless swapped out under heavy
> memory pressure, but arguably that's a feature anyway).  More importantly,
> having a file name is a simple and consistent way of identifying such
> shared memory segments.
> 
> With a little work, you can also arrange to map such files into memory
> at a fixed address in all participating processes, thus making internal
> pointers work correctly.

I don't want to say that the interface provided by MAP_SHARED is bad. I am only arguing
that VAS segments and the interface that they provide have, in my opinion, an advantage
over the existing ones. However, Matthew Wilcox also suggested in an earlier mail that
VAS segments could be exported to user space via a special purpose filesystem. This
would enable users of VAS segments to just use special files to set up the shared
memory regions. But since the VAS segment itself already knows where it has to be
mapped in the virtual address space of the process, establishing the shared memory
region would be very easy for the user.

> > VAS segments on the other side allow sharing of pure in memory data by
> > arbitrary related tasks without the need of a file. This becomes especially
> > interesting if one combines VAS segments with non-volatile memory since one can keep
> > data structures in the NVM and still be able to share them between multiple tasks.
> 
> I am not fully up to speed on NV/pmem stuff, but isn't that exactly what
> the DAX mode is supposed to allow you to do?  If so, isn't sharing a
> mapped file on a DAX filesystem on top of pmem equivalent to what
> you're proposing?

If I read the documentation of DAX filesystems correctly, it is indeed possible to use
them to create files that live purely in NVM. I wasn't fully aware of this feature.
Thanks for the pointer.

However, the main contribution of this patchset is actually the idea of first class
virtual address spaces and that they can be used to allow processes to have multiple
different views on the system's main memory. For us, VAS segments were another logical
step in the same direction (from first class virtual address spaces to first class
address space segments). However, if there is already functionality in the Linux kernel
to achieve the exact same behavior, there is no real need to add VAS segments. I will
continue thinking about them and either find a different situation where the currently
available interface is not sufficient/too complicated or drop VAS segments from a
future version of the patch set.

Till


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-14 16:12       ` Till Smejkal
  2017-03-14 19:53         ` Chris Metcalf
@ 2017-03-15 16:51         ` Andy Lutomirski
  2017-03-15 16:57           ` Matthew Wilcox
  2017-03-15 19:44           ` Till Smejkal
  1 sibling, 2 replies; 45+ messages in thread
From: Andy Lutomirski @ 2017-03-15 16:51 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Tue, Mar 14, 2017 at 9:12 AM, Till Smejkal
<till.smejkal@googlemail.com> wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
>> <till.smejkal@googlemail.com> wrote:
>> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> >> This sounds rather complicated.  Getting TLB flushing right seems
>> >> tricky.  Why not just map the same thing into multiple mms?
>> >
>> > This is exactly what happens at the end. The memory region that is described by the
>> > VAS segment will be mapped in the ASes that use the segment.
>>
>> So why is this kernel feature better than just doing MAP_SHARED
>> manually in userspace?
>
> One advantage of VAS segments is that they can be globally queried by user programs
> which means that VAS segments can be shared by applications that not necessarily have
> to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work
> if the tasks that share the memory region are related (aka. have a common parent that
> initialized the shared mapping). Otherwise, the shared mapping have to be backed by a
> file.

What's wrong with memfd_create()?
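
(For reference, a minimal memfd sketch; the fd can afterwards be handed to an unrelated
process, e.g. over a UNIX-domain socket with SCM_RIGHTS. This assumes a libc that
provides the memfd_create() wrapper, otherwise call it via syscall(2); error handling
omitted.)

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            /* anonymous, file-like memory with no presence in any filesystem */
            int fd = memfd_create("segment", MFD_CLOEXEC);

            ftruncate(fd, 4096);
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            strcpy(p, "shared without a file on disk");
            /* pass fd to another process via fork() or SCM_RIGHTS */
            return 0;
    }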

> VAS segments on the other side allow sharing of pure in memory data by
> arbitrary related tasks without the need of a file. This becomes especially
> interesting if one combines VAS segments with non-volatile memory since one can keep
> data structures in the NVM and still be able to share them between multiple tasks.

What's wrong with regular mmap?

>
>> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
>> >> and not make it look magically different depending on which process
>> >> maps it?  If you need a trampoline (which you do, of course), just
>> >> write a trampoline in regular user code and map it manually.
>> >
>> > Did I understand you correctly that you are proposing that the switching thread
>> > should make sure by itself that its code, stack, … memory regions are properly setup
>> > in the new AS before/after switching into it? I think, this would make using first
>> > class virtual address spaces much more difficult for user applications to the extend
>> > that I am not even sure if they can be used at all. At the moment, switching into a
>> > VAS is a very simple operation for an application because the kernel will just simply
>> > do the right thing.
>>
>> Yes.  I think that having the same mm_struct look different from
>> different tasks is problematic.  Getting it right in the arch code is
>> going to be nasty.  The heuristics of what to share are also tough --
>> why would text + data + stack or whatever you're doing be adequate?
>> What if you're in a thread?  What if two tasks have their stacks in
>> the same place?
>
> The different ASes that a task now can have when it uses first class virtual address
> spaces are not realized in the kernel by using only one mm_struct per task that just
> looks differently but by using multiple mm_structs - one for each AS that the task
> can execute in. When a task attaches a first class virtual address space to itself to
> be able to use another AS, the kernel adds a temporary mm_struct to this task that
> contains the mappings of the first class virtual address space and the one shared
> with the task's original AS. If a thread now wants to switch into this attached first
> class virtual address space the kernel only changes the 'mm' and 'active_mm' pointers
> in the task_struct of the thread to the temporary mm_struct and performs the
> corresponding mm_switch operation. The original mm_struct of the thread will not be
> changed.
>
> Accordingly, I do not magically make mm_structs look differently depending on the
> task that uses it, but create temporary mm_structs that only contain mappings to the
> same memory regions.

This sounds complicated and fragile.  What happens if a heuristically
shared region coincides with a region in the "first class address
space" being selected?

I think the right solution is "you're a user program playing virtual
address games -- make sure you do it right".

--Andy


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 16:51         ` Andy Lutomirski
@ 2017-03-15 16:57           ` Matthew Wilcox
  2017-03-15 19:44           ` Till Smejkal
  1 sibling, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2017-03-15 16:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Till Smejkal, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Miao, Richard Kuo,
	Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, Mar 15, 2017 at 09:51:31AM -0700, Andy Lutomirski wrote:
> > VAS segments on the other side allow sharing of pure in memory data by
> > arbitrary related tasks without the need of a file. This becomes especially
> > interesting if one combines VAS segments with non-volatile memory since one can keep
> > data structures in the NVM and still be able to share them between multiple tasks.
> 
> What's wrong with regular mmap?

I think it's the usual misunderstandings about how to use mmap.
From the paper:

   Memory-centric computing demands careful organization of the
   virtual address space, but interfaces such as mmap only give limited
   control. Some systems do not support creation of address regions at
   specific offsets. In Linux, for example, mmap does not safely abort if
   a request is made to open a region of memory over an existing region;
   it simply writes over it.

The correct answer of course, is "Don't specify MAP_FIXED".  Specify the
'hint' address, and if you don't get it, either fix up your data structure
pointers, or just abort and complain noisily.
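
I.e. something along these lines; a plain hint never clobbers an existing mapping, the
caller just has to check what it actually got:

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 1UL << 20;
            /* grab some region first */
            void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            /* ask for the same region again, but only as a hint */
            void *b = mmap(a, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (b != a)     /* expected: the kernel picked a different spot */
                    printf("wanted %p, got %p - fix up pointers or abort\n", a, b);
            return 0;
    }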


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 16:51         ` Andy Lutomirski
  2017-03-15 16:57           ` Matthew Wilcox
@ 2017-03-15 19:44           ` Till Smejkal
  2017-03-15 19:47             ` Rich Felker
  2017-03-15 20:06             ` Andy Lutomirski
  1 sibling, 2 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-15 19:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Till Smejkal, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Miao, Richard Kuo,
	Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > One advantage of VAS segments is that they can be globally queried by user programs
> > which means that VAS segments can be shared by applications that not necessarily have
> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work
> > if the tasks that share the memory region are related (aka. have a common parent that
> > initialized the shared mapping). Otherwise, the shared mapping have to be backed by a
> > file.
> 
> What's wrong with memfd_create()?
> 
> > VAS segments on the other side allow sharing of pure in memory data by
> > arbitrary related tasks without the need of a file. This becomes especially
> > interesting if one combines VAS segments with non-volatile memory since one can keep
> > data structures in the NVM and still be able to share them between multiple tasks.
> 
> What's wrong with regular mmap?

I never wanted to say that there is something wrong with regular mmap. We just figured
that with VAS segments you could remove the need to mmap your shared data and instead
keep everything purely in memory.

Unfortunately, I am not fully up to speed with memfds. Is my understanding correct that
if the last user of such a file descriptor closes it, the corresponding memory is
freed? Accordingly, memfd cannot be used to keep data in memory while no program is
currently using it, can it? To be able to do this you again need some representation of
the data in a file? Yes, you can use a tmpfs to keep the file content in memory as
well, or some DAX filesystem to keep the file content in NVM, but this always requires
that such filesystems are mounted on the system that the application is currently
running on. VAS segments on the other hand would provide a way to achieve the same
without the need for any mounted filesystem. However, I agree that this is just a small
advantage compared to what can already be achieved with the existing functionality
provided by the Linux kernel. I probably need to revisit the whole idea of first class
virtual address space segments before continuing with this patchset. Thank you very
much for the great feedback.

> >> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> >> >> and not make it look magically different depending on which process
> >> >> maps it?  If you need a trampoline (which you do, of course), just
> >> >> write a trampoline in regular user code and map it manually.
> >> >
> >> > Did I understand you correctly that you are proposing that the switching thread
> >> > should make sure by itself that its code, stack, … memory regions are properly setup
> >> > in the new AS before/after switching into it? I think, this would make using first
> >> > class virtual address spaces much more difficult for user applications to the extend
> >> > that I am not even sure if they can be used at all. At the moment, switching into a
> >> > VAS is a very simple operation for an application because the kernel will just simply
> >> > do the right thing.
> >>
> >> Yes.  I think that having the same mm_struct look different from
> >> different tasks is problematic.  Getting it right in the arch code is
> >> going to be nasty.  The heuristics of what to share are also tough --
> >> why would text + data + stack or whatever you're doing be adequate?
> >> What if you're in a thread?  What if two tasks have their stacks in
> >> the same place?
> >
> > The different ASes that a task now can have when it uses first class virtual address
> > spaces are not realized in the kernel by using only one mm_struct per task that just
> > looks differently but by using multiple mm_structs - one for each AS that the task
> > can execute in. When a task attaches a first class virtual address space to itself to
> > be able to use another AS, the kernel adds a temporary mm_struct to this task that
> > contains the mappings of the first class virtual address space and the one shared
> > with the task's original AS. If a thread now wants to switch into this attached first
> > class virtual address space the kernel only changes the 'mm' and 'active_mm' pointers
> > in the task_struct of the thread to the temporary mm_struct and performs the
> > corresponding mm_switch operation. The original mm_struct of the thread will not be
> > changed.
> >
> > Accordingly, I do not magically make mm_structs look differently depending on the
> > task that uses it, but create temporary mm_structs that only contain mappings to the
> > same memory regions.
> 
> This sounds complicated and fragile.  What happens if a heuristically
> shared region coincides with a region in the "first class address
> space" being selected?

If such a conflict happens, the task cannot use the first class address space and the
corresponding system call will return an error. However, with the virtual address space
size available to programs today, such conflicts are probably rare. I could also
imagine some additional functionality that allows a user to mark parts of its AS as to
be shared or not to be shared when switching into a VAS. With this functionality in
place, there would be no need for a heuristic in the kernel; the user decides what to
share. The kernel would by default only share code, data, and stack, and the
application/libraries would have to mark all other memory regions as shared if they
also need to be available in the VAS.

> I think the right solution is "you're a user program playing virtual
> address games -- make sure you do it right".

Hm, in general I agree that the easier and more robust solution from the kernel
perspective is to let the user do the AS setup and only provide the functionality to
create new empty ASes. Though, I think that such an interface would be much more
difficult to use than my current design. Letting the user program set up the AS also
has another implication that I currently don't have. Since I share the code and stack
regions between all ASes that are available to a process, I don't need to save/restore
stack pointers or instruction pointers when threads switch between ASes. However, when
the user sets up the AS, the kernel cannot be sure that the code and stack are mapped
at the same virtual address and hence has to save and restore these registers (and
potentially others as well, since we can now basically jump between different execution
contexts).

When we first designed first class virtual address spaces, we had one special use-case
in mind, namely that one application wants to use different data sets that it does not
want to or cannot keep in the same AS. Hence, sharing code and stack between the
different ASes that the application uses was a logical step for us, because the code
memory region, for example, has to be available in all ASes anyway since all of them
execute the same application. Sharing the stack memory region enables the application
to keep volatile information that might be needed in the new AS on the stack, which
allows easy information flow between the different ASes.

For this patch, I extended the initial sharing of stack and code memory regions to all
memory regions that are available in the task's original AS, to also allow dynamically
linked applications and multi-threaded applications to flawlessly use first class
virtual address spaces.

To put it in a nutshell, we envisioned first class virtual address spaces to be used as
shareable/reusable data containers, which made sharing the various memory regions that
are crucial for the execution of the application a feasible implementation decision.

Thank you all very much for the feedback. I really appreciate it.

Till


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 19:44           ` Till Smejkal
@ 2017-03-15 19:47             ` Rich Felker
  2017-03-15 21:30               ` Till Smejkal
  2017-03-15 20:06             ` Andy Lutomirski
  1 sibling, 1 reply; 45+ messages in thread
From: Rich Felker @ 2017-03-15 19:47 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, David S. Miller, Chris Metcalf,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Chris Zankel, Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > One advantage of VAS segments is that they can be globally queried by user programs
> > > which means that VAS segments can be shared by applications that not necessarily have
> > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work
> > > if the tasks that share the memory region are related (aka. have a common parent that
> > > initialized the shared mapping). Otherwise, the shared mapping have to be backed by a
> > > file.
> > 
> > What's wrong with memfd_create()?
> > 
> > > VAS segments on the other side allow sharing of pure in memory data by
> > > arbitrary related tasks without the need of a file. This becomes especially
> > > interesting if one combines VAS segments with non-volatile memory since one can keep
> > > data structures in the NVM and still be able to share them between multiple tasks.
> > 
> > What's wrong with regular mmap?
> 
> I never wanted to say that there is something wrong with regular mmap. We just
> figured that with VAS segments you could remove the need to mmap your shared data but
> instead can keep everything purely in memory.
> 
> Unfortunately, I am not at full speed with memfds. Is my understanding correct that
> if the last user of such a file descriptor closes it, the corresponding memory is
> freed? Accordingly, memfd cannot be used to keep data in memory while no program is
> currently using it, can it? To be able to do this you need again some representation

I have a name for application-allocated kernel resources that persist
without a process holding a reference to them or a node in the
filesystem: a bug. See: sysvipc.

Rich


* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 19:44           ` Till Smejkal
  2017-03-15 19:47             ` Rich Felker
@ 2017-03-15 20:06             ` Andy Lutomirski
  2017-03-15 22:02               ` Till Smejkal
  1 sibling, 1 reply; 45+ messages in thread
From: Andy Lutomirski @ 2017-03-15 20:06 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal
<till.smejkal@googlemail.com> wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:
>> > One advantage of VAS segments is that they can be globally queried by user programs
>> > which means that VAS segments can be shared by applications that not necessarily have
>> > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work
>> > if the tasks that share the memory region are related (aka. have a common parent that
>> > initialized the shared mapping). Otherwise, the shared mapping have to be backed by a
>> > file.
>>
>> What's wrong with memfd_create()?
>>
>> > VAS segments, on the other hand, allow sharing of pure in-memory data by
>> > arbitrary, even unrelated, tasks without the need of a file. This becomes especially
>> > interesting if one combines VAS segments with non-volatile memory, since one can keep
>> > data structures in the NVM and still be able to share them between multiple tasks.
>>
>> What's wrong with regular mmap?
>
> I never wanted to say that there is something wrong with regular mmap. We just
> figured that with VAS segments you could remove the need to mmap your shared data and
> instead keep everything purely in memory.

memfd does that.
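
For illustration only, a minimal sketch of keeping shared data purely in memory with
memfd_create(2); the name and size below are arbitrary, and calling the glibc wrapper
directly assumes glibc >= 2.27 (older systems need syscall(SYS_memfd_create, ...)):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	/* Anonymous, memory-only "file"; it lives only as long as at least
	 * one file descriptor or mapping still references it. */
	int fd = memfd_create("shared-data", MFD_CLOEXEC);
	if (fd < 0) { perror("memfd_create"); return 1; }
	if (ftruncate(fd, 1 << 20) < 0) { perror("ftruncate"); return 1; }

	/* The fd can be handed to other tasks via fork(), SCM_RIGHTS or
	 * /proc/<pid>/fd and mapped with MAP_SHARED on both sides. */
	void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }
	return 0;
}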

>
> Unfortunately, I am not fully up to speed on memfds. Is my understanding correct that
> if the last user of such a file descriptor closes it, the corresponding memory is
> freed? Accordingly, memfd cannot be used to keep data in memory while no program is
> currently using it, can it?

No, stop right here.  If you want to have a bunch of memory that
outlives the program that allocates it, use a filesystem (tmpfs,
hugetlbfs, ext4, whatever).  Don't create new persistent kernel
things.
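
As a sketch of that suggestion (the path and size are arbitrary examples), memory
that should outlive its creator can simply be a file on a tmpfs mount such as
/dev/shm:

#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* The file, and therefore the data, stays in RAM until it is
	 * unlinked, whether or not any process currently has it open. */
	int fd = open("/dev/shm/my-dataset", O_RDWR | O_CREAT, 0600);
	if (fd < 0) { perror("open"); return 1; }
	if (ftruncate(fd, 1 << 20) < 0) { perror("ftruncate"); return 1; }

	void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }
	return 0;
}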

> VAS segments, on the other hand, would provide a way to
> achieve the same without the need for any mounted filesystem. However, I agree that
> this is just a small advantage compared to what can already be achieved with the
> existing functionality provided by the Linux kernel.

I see this "small advantage" as "resource leak and security problem".

>> This sounds complicated and fragile.  What happens if a heuristically
>> shared region coincides with a region in the "first class address
>> space" being selected?
>
> If such a conflict happens, the task cannot use the first class address space and the
> corresponding system call will return an error. However, with the currently available
> virtual address space size that programs can use, such conflicts are probably rare.

A bug that hits 1% of the time is often worse than one that hits 100%
of the time because debugging it is miserable.

--Andy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 19:47             ` Rich Felker
@ 2017-03-15 21:30               ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-15 21:30 UTC (permalink / raw)
  To: Rich Felker
  Cc: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, David S. Miller, Chris Metcalf,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, X86 ML,
	Chris Zankel, Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, 15 Mar 2017, Rich Felker wrote:
> On Wed, Mar 15, 2017 at 12:44:47PM -0700, Till Smejkal wrote:
> > On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> > > > One advantage of VAS segments is that they can be globally queried by user programs
> > > > which means that VAS segments can be shared by applications that do not necessarily
> > > > have to be related. If I am not mistaken, MAP_SHARED of pure in-memory data will only
> > > > work if the tasks that share the memory region are related (i.e. have a common parent
> > > > that initialized the shared mapping). Otherwise, the shared mapping has to be backed
> > > > by a file.
> > > 
> > > What's wrong with memfd_create()?
> > > 
> > > > VAS segments, on the other hand, allow sharing of pure in-memory data by
> > > > arbitrary, even unrelated, tasks without the need of a file. This becomes especially
> > > > interesting if one combines VAS segments with non-volatile memory, since one can keep
> > > > data structures in the NVM and still be able to share them between multiple tasks.
> > > 
> > > What's wrong with regular mmap?
> > 
> > I never wanted to say that there is something wrong with regular mmap. We just
> > figured that with VAS segments you could remove the need to mmap your shared data and
> > instead keep everything purely in memory.
> > 
> > Unfortunately, I am not fully up to speed on memfds. Is my understanding correct that
> > if the last user of such a file descriptor closes it, the corresponding memory is
> > freed? Accordingly, memfd cannot be used to keep data in memory while no program is
> > currently using it, can it? To be able to do this you again need some representation
> 
> I have a name for application-allocated kernel resources that persist
> without a process holding a reference to them or a node in the
> filesystem: a bug. See: sysvipc.

VAS segments are first class citizens of the OS, similar to processes. Accordingly, I
would not see this behavior as a bug. A VAS segment is a kernel handle to "persistent"
memory (in the sense that it is independent of the lifetime of the application that
created it). That means the memory described by a VAS segment can be reused by other
applications even if the segment was not used by any application in between. It is
very much like a pure in-memory file. An application creates a VAS segment, fills it
with content and, if it does not delete it again, can reopen and reuse it later. This
also means that if you know you will never want to use this memory again, you have to
remove it explicitly, just as you have to remove a file that you no longer want to
keep.
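
To make the analogy concrete, here is a sketch of the same lifecycle with POSIX
shared memory, which already behaves this way (the name and size are arbitrary
examples; error handling omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Create, size and fill the "in-memory file"; it persists after
	 * this process exits (link with -lrt on older glibc). */
	int fd = shm_open("/my-segment", O_CREAT | O_RDWR, 0600);
	ftruncate(fd, 1 << 20);
	/* ... mmap(fd, ..., MAP_SHARED, ...) and fill it ... */

	/* Any later, even unrelated, process can reopen it by name: */
	int fd2 = shm_open("/my-segment", O_RDWR, 0);
	(void)fd2;

	/* It stays around until someone removes it explicitly: */
	shm_unlink("/my-segment");
	return 0;
}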

I think it really might be better to implement VAS segments (if I keep this feature
at all) with a special-purpose filesystem. The way I've designed them seems to be
very misleading.

Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 20:06             ` Andy Lutomirski
@ 2017-03-15 22:02               ` Till Smejkal
  2017-03-15 22:09                 ` Luck, Tony
  2017-03-16  8:21                 ` Thomas Gleixner
  0 siblings, 2 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-15 22:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Till Smejkal, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Miao, Richard Kuo,
	Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> On Wed, Mar 15, 2017 at 12:44 PM, Till Smejkal
> <till.smejkal@googlemail.com> wrote:
> > On Wed, 15 Mar 2017, Andy Lutomirski wrote:
> >> > One advantage of VAS segments is that they can be globally queried by user programs
> >> > which means that VAS segments can be shared by applications that do not necessarily
> >> > have to be related. If I am not mistaken, MAP_SHARED of pure in-memory data will only
> >> > work if the tasks that share the memory region are related (i.e. have a common parent
> >> > that initialized the shared mapping). Otherwise, the shared mapping has to be backed
> >> > by a file.
> >>
> >> What's wrong with memfd_create()?
> >>
> >> > VAS segments, on the other hand, allow sharing of pure in-memory data by
> >> > arbitrary, even unrelated, tasks without the need of a file. This becomes especially
> >> > interesting if one combines VAS segments with non-volatile memory, since one can keep
> >> > data structures in the NVM and still be able to share them between multiple tasks.
> >>
> >> What's wrong with regular mmap?
> >
> > I never wanted to say that there is something wrong with regular mmap. We just
> > figured that with VAS segments you could remove the need to mmap your shared data and
> > instead keep everything purely in memory.
> 
> memfd does that.

Yes, that's right. Thanks for giving me the pointer to this. I should have researched
more carefully before starting to work on VAS segments.

> > VAS segments, on the other hand, would provide a way to
> > achieve the same without the need for any mounted filesystem. However, I agree that
> > this is just a small advantage compared to what can already be achieved with the
> > existing functionality provided by the Linux kernel.
> 
> I see this "small advantage" as "resource leak and security problem".

I don't agree here. VAS segments are basically in-memory files that are handled by
the kernel directly without using a file system. Hence, if an application uses a VAS
segment to store data, the same rules apply as if it used a file. Everything that it
saves in the VAS segment might be accessible by other applications. An application
using VAS segments should be aware of this fact. In addition, the resources that are
represented by a VAS segment are not leaked. As I said, VAS segments are much like
files. Hence, if you don't want to use them anymore, delete them. But as with files,
the kernel will not delete them for you (although something like this can be added).

> >> This sounds complicated and fragile.  What happens if a heuristically
> >> shared region coincides with a region in the "first class address
> >> space" being selected?
> >
> > If such a conflict happens, the task cannot use the first class address space and the
> > corresponding system call will return an error. However, with the currently available
> > virtual address space size that programs can use, such conflicts are probably rare.
> 
> A bug that hits 1% of the time is often worse than one that hits 100%
> of the time because debugging it is miserable.

I don't agree that this is a bug at all. If there is a conflict in the memory layout
of the ASes, the application simply cannot use this first class virtual address space.
Every application that wants to use first class virtual address spaces should check
for error return values and handle them.

This situation is similar to mapping a file at some specific address in memory because
the file contains pointer-based data structures and the application wants to use
them, but the kernel cannot map the file at this particular position in the
application's AS because there is already a different, conflicting mapping. If an
application wants to do such things, it should also handle all the errors that can
occur.
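
A sketch of that analogy (assuming some file descriptor fd and length len; the target
address is an arbitrary example): without MAP_FIXED the requested address is only a
hint, so the caller has to verify it and handle the conflict itself (newer kernels
also offer MAP_FIXED_NOREPLACE for exactly this case):

#include <stddef.h>
#include <sys/mman.h>

/* Map a file at the address its pointer-based data structures expect, or
 * fail so the caller can handle the conflict. */
static void *map_at(void *want, int fd, size_t len)
{
	void *got = mmap(want, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (got == MAP_FAILED)
		return NULL;
	if (got != want) {		/* kernel chose a different spot */
		munmap(got, len);
		return NULL;
	}
	return got;
}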

Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 22:02               ` Till Smejkal
@ 2017-03-15 22:09                 ` Luck, Tony
  2017-03-15 23:18                   ` Till Smejkal
  2017-03-16  8:21                 ` Thomas Gleixner
  1 sibling, 1 reply; 45+ messages in thread
From: Luck, Tony @ 2017-03-15 22:09 UTC (permalink / raw)
  To: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote:
> I don't agree here. VAS segments are basically in-memory files that are handled by
> the kernel directly without using a file system. Hence, if an application uses a VAS
> segment to store data, the same rules apply as if it used a file. Everything that it
> saves in the VAS segment might be accessible by other applications. An application
> using VAS segments should be aware of this fact. In addition, the resources that are
> represented by a VAS segment are not leaked. As I said, VAS segments are much like
> files. Hence, if you don't want to use them anymore, delete them. But as with files,
> the kernel will not delete them for you (although something like this can be added).

So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)?

Apart from VAS having better names, instead of silly "key_t key" ones.
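
For reference, the usual shape of those calls (the path, project id and size are
arbitrary examples):

#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	/* A key names a kernel-persistent in-memory segment that outlives
	 * its creator until it is removed explicitly with IPC_RMID. */
	key_t key = ftok("/some/existing/path", 42);
	int id = shmget(key, 1 << 20, IPC_CREAT | 0600);
	void *p = shmat(id, NULL, 0);	/* attach into this address space */
	/* ... use p ... */
	shmdt(p);			/* detach; the segment persists */
	shmctl(id, IPC_RMID, NULL);	/* explicit removal */
	return 0;
}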

-Tony


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 22:09                 ` Luck, Tony
@ 2017-03-15 23:18                   ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-15 23:18 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Andy Lutomirski, Till Smejkal,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	X86 ML, Chris Zankel, Max Filippov, Arnd Bergmann,
	Greg Kroah-Hartman, Laurent Pinchart, Mauro Carvalho Chehab,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, David Woodhouse,
	Brian Norris, Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, 15 Mar 2017, Luck, Tony wrote:
> On Wed, Mar 15, 2017 at 03:02:34PM -0700, Till Smejkal wrote:
> > I don't agree here. VAS segments are basically in-memory files that are handled by
> > the kernel directly without using a file system. Hence, if an application uses a VAS
> > segment to store data, the same rules apply as if it used a file. Everything that it
> > saves in the VAS segment might be accessible by other applications. An application
> > using VAS segments should be aware of this fact. In addition, the resources that are
> > represented by a VAS segment are not leaked. As I said, VAS segments are much like
> > files. Hence, if you don't want to use them anymore, delete them. But as with files,
> > the kernel will not delete them for you (although something like this can be added).
> 
> So how do they differ from shmget(2), shmat(2), shmdt(2), shmctl(2)?
> 
> Apart from VAS having better names, instead of silly "key_t key" ones.

Unfortunately, I have to admit that VAS segments don't differ from shm* a lot.
The implementation is different, but the functionality that you can achieve with it
is very similar. I am sorry. We should have looked more closely at the whole
functionality that is provided by the shmem subsystem before working on VAS segments.

However, VAS segments are not the key part of this patch set. The more interesting
functionality, in our opinion, is the introduction of first class virtual address
spaces and what they can be used for. VAS segments were just another logical step for
us (from first class virtual address spaces to first class virtual address space
segments), but since their functionality can be achieved with other already existing
features of the Linux kernel, I will probably drop them in future versions of the
patchset.

Thanks
Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-15 22:02               ` Till Smejkal
  2017-03-15 22:09                 ` Luck, Tony
@ 2017-03-16  8:21                 ` Thomas Gleixner
  2017-03-16 17:29                   ` Till Smejkal
  1 sibling, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2017-03-16  8:21 UTC (permalink / raw)
  To: Till Smejkal
  Cc: Andy Lutomirski, Andy Lutomirski, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Miao, Richard Kuo,
	Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Ingo Molnar, H. Peter Anvin, X86 ML, Chris Zankel,
	Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Wed, 15 Mar 2017, Till Smejkal wrote:
> On Wed, 15 Mar 2017, Andy Lutomirski wrote:

> > > VAS segments, on the other hand, would provide a way to
> > > achieve the same without the need for any mounted filesystem. However,
> > > I agree that this is just a small advantage compared to what can
> > > already be achieved with the existing functionality provided by the
> > > Linux kernel.
> > 
> > I see this "small advantage" as "resource leak and security problem".
> 
> I don't agree here. VAS segments are basically in-memory files that are
> handled by the kernel directly without using a file system. Hence, if an

Why do we need yet another mechanism to represent something which looks
like a file instead of simply using existing mechanisms and extending them?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-16  8:21                 ` Thomas Gleixner
@ 2017-03-16 17:29                   ` Till Smejkal
  2017-03-16 17:42                     ` Thomas Gleixner
  0 siblings, 1 reply; 45+ messages in thread
From: Till Smejkal @ 2017-03-16 17:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Till Smejkal, Andy Lutomirski, Andy Lutomirski,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Ingo Molnar, H. Peter Anvin, X86 ML, Chris Zankel,
	Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> Why do we need yet another mechanism to represent something which looks
> like a file instead of simply using existing mechanisms and extending them?

You are right. I also realized during the discussion with Andy, Chris, Matthew,
Luck, Rich and the others that there are already other techniques in the Linux kernel
that, when combined, can achieve the same functionality. As I also said to the others,
I will drop the VAS segments in future versions. The first class virtual address
space feature was the more interesting part of the patchset anyway.

Thanks
Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-16 17:29                   ` Till Smejkal
@ 2017-03-16 17:42                     ` Thomas Gleixner
  2017-03-16 17:50                       ` Till Smejkal
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Gleixner @ 2017-03-16 17:42 UTC (permalink / raw)
  To: Till Smejkal
  Cc: Andy Lutomirski, Andy Lutomirski, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Miao, Richard Kuo,
	Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Ingo Molnar, H. Peter Anvin, X86 ML, Chris Zankel,
	Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Thu, 16 Mar 2017, Till Smejkal wrote:
> On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> > Why do we need yet another mechanism to represent something which looks
> > like a file instead of simply using existing mechanisms and extending them?
> 
> You are right. I also realized during the discussion with Andy, Chris,
> Matthew, Luck, Rich and the others that there are already other
> techniques in the Linux kernel that, when combined, can achieve the same
> functionality. As I also said to the others, I will drop the VAS segments
> in future versions. The first class virtual address space feature was
> the more interesting part of the patchset anyway.

While you are at it, could you please drop this 'first class' marketing as
well? It has zero technical value, really.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
  2017-03-16 17:42                     ` Thomas Gleixner
@ 2017-03-16 17:50                       ` Till Smejkal
  0 siblings, 0 replies; 45+ messages in thread
From: Till Smejkal @ 2017-03-16 17:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Till Smejkal, Andy Lutomirski, Andy Lutomirski,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
	Russell King, Catalin Marinas, Will Deacon, Steven Miao,
	Richard Kuo, Tony Luck, Fenghua Yu, James Hogan, Ralf Baechle,
	James E.J. Bottomley, Helge Deller, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, David S. Miller,
	Chris Metcalf, Ingo Molnar, H. Peter Anvin, X86 ML, Chris Zankel,
	Max Filippov, Arnd Bergmann, Greg Kroah-Hartman,
	Laurent Pinchart, Mauro Carvalho Chehab, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, David Woodhouse, Brian Norris,
	Boris Brezillon, Marek Vasut, Richard Weinberger,
	Cyrille Pitchen, Felipe Balbi, Alexander Viro, Benjamin LaHaise,
	Nadia Yvette Chambers, Jeff Layton, J. Bruce Fields,
	Peter Zijlstra, Hugh Dickins, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jaroslav Kysela, Takashi Iwai, linux-kernel,
	linux-alpha, arcml, linux-arm-kernel, adi-buildroot-devel,
	linux-hexagon, linux-ia64, linux-metag, Linux MIPS Mailing List,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, Linux Media Mailing List, linux-mtd, USB list,
	Linux FS Devel, linux-aio, linux-mm, Linux API, linux-arch,
	ALSA development

On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> On Thu, 16 Mar 2017, Till Smejkal wrote:
> > On Thu, 16 Mar 2017, Thomas Gleixner wrote:
> > > Why do we need yet another mechanism to represent something which looks
> > > like a file instead of simply using existing mechanisms and extending them?
> > 
> > You are right. I also realized during the discussion with Andy, Chris,
> > Matthew, Luck, Rich and the others that there are already other
> > techniques in the Linux kernel that, when combined, can achieve the same
> > functionality. As I also said to the others, I will drop the VAS segments
> > in future versions. The first class virtual address space feature was
> > the more interesting part of the patchset anyway.
> 
> While you are at it, could you please drop this 'first class' marketing as
> well? It has zero technical value, really.

Yes, of course. I am sorry for the trouble that I have already caused.

Thanks
Till


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2017-03-16 17:50 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-13 22:14 [RFC PATCH 00/13] Introduce first class virtual address spaces Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region' Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 02/13] mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff' Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 03/13] mm: Rename 'unmap_region' and add mm_struct argument Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 04/13] mm: Add mm_struct argument to 'get_unmapped_area' and 'vm_unmapped_area' Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 05/13] mm: Add mm_struct argument to 'mm_populate' and '__mm_populate' Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 06/13] mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init' Till Smejkal
2017-03-14 10:18   ` David Laight
2017-03-14 16:18     ` Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 08/13] kernel/fork: Define explicitly which mm_struct to duplicate during fork Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 09/13] mm/memory: Add function to one-to-one duplicate page ranges Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 10/13] mm: Introduce first class virtual address spaces Till Smejkal
2017-03-13 23:52   ` Greg Kroah-Hartman
2017-03-14  0:24     ` Till Smejkal
2017-03-14  1:35   ` Vineet Gupta
2017-03-14  2:34     ` Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions Till Smejkal
2017-03-13 22:27   ` Matthew Wilcox
2017-03-13 22:45     ` Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 12/13] mm/vas: Add lazy-attach support for first class virtual address spaces Till Smejkal
2017-03-13 22:14 ` [RFC PATCH 13/13] fs/proc: Add procfs " Till Smejkal
2017-03-14  0:18 ` [RFC PATCH 00/13] Introduce " Richard Henderson
2017-03-14  0:39   ` Till Smejkal
2017-03-14  1:02     ` Richard Henderson
2017-03-14  1:31       ` Till Smejkal
2017-03-14  0:58 ` Andy Lutomirski
2017-03-14  2:07   ` Till Smejkal
2017-03-14  5:37     ` Andy Lutomirski
2017-03-14 16:12       ` Till Smejkal
2017-03-14 19:53         ` Chris Metcalf
2017-03-14 21:14           ` Till Smejkal
2017-03-15 16:51         ` Andy Lutomirski
2017-03-15 16:57           ` Matthew Wilcox
2017-03-15 19:44           ` Till Smejkal
2017-03-15 19:47             ` Rich Felker
2017-03-15 21:30               ` Till Smejkal
2017-03-15 20:06             ` Andy Lutomirski
2017-03-15 22:02               ` Till Smejkal
2017-03-15 22:09                 ` Luck, Tony
2017-03-15 23:18                   ` Till Smejkal
2017-03-16  8:21                 ` Thomas Gleixner
2017-03-16 17:29                   ` Till Smejkal
2017-03-16 17:42                     ` Thomas Gleixner
2017-03-16 17:50                       ` Till Smejkal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).