LKML Archive on lore.kernel.org
 help / color / Atom feed
* [RFC 0/9] Popcorn Linux Distributed Thread Execution
       [not found] <0>
@ 2020-04-29 19:32 ` Javier Malave
  2020-04-29 19:32   ` [RFC 1/9] Core Popcorn Changes Javier Malave
                     ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

This patch set adds the Popcorn Distributed Thread Execution support
to the kernel. It is based off of Linux 5.2 commit 72a20ce. We are
looking for feedback on design and implementation from the community.

Background
==========

Popcorn Linux is a collaborative work by the Software and Systems Research
Group at Virginia Tech. It is based on the original thesis by David
Katz. Principal contributors since include, Sang-Hoon Kim, Maria Sadini
Ajith Saya, Vicnent Legout, Antonio Barbalace and Binoy Ravindran. 

Popcorn Linux is a Linux kernel-based software stack that enables
applications to execute, with a shared code base, on distributed hosts.
Popcorn allows applications to start execution on a particular host and
migrate, at run-time, to a remote host. Multi-threaded applications may
migrate any particular thread to any remote host. 

Unlike userspace checkpoint-restart solutions (e.g., CRIU), Popcorn 
enables seamless and dynamic migration across hosts during execution
(no user interaction), and ensures coherent virtual memory across hosts 
for concurrent thread execution.

Popcorn Linux implements a software-based distributed shared memory 
by extending Linux's virtual memory subsystem and it enables processes 
on different machines to observe a common and coherent virtual address
space. Coherency of virtual memory pages of different hosts is ensured 
using a reader-replicate/writer-invalidate, page-level consistency protocol.

The version of Popcorn Linux presented in this RFC supports only x86
configurations. The stack in this RFC includes a modified kernel and a 
userspace run-time library.

There is a more advanced version of Popcorn Linux that allows applications
to concurrently execute across ISA-different cores (.e.g. X86, ARM). This 
feature-rich version adds a customized LLVM-compiler toolchain to the stack. 
Nevertheless, this RFC focuses on a simpler single-architecture form of 
Popcorn that does not require compiler modifications.

Both the compiler toolchain and the heterogeneous implementation of
Popcorn Linux may be found at these github locations. It should be noted
the heterogeneous version is under active development and is likely less 
stable than the one provided in this patch.

Heterogeneous Popcorn Linux
https://github.com/ssrg-vt/popcorn-kernel

LLVM Compiler
https://github.com/ssrg-vt/popcorn-compiler

More information on the Popcorn Linux research team and their current
work may be found here:

http://popcornlinux.org/

Popcorn Library
===============

The Popcorn kernel library is a suite of light tests that showcase
the use of Popcorn's core system calls. These tests allow for a 
basic migration of processes and threads between 2 or more nodes.

The test suite is being expanded and may be found in at 
https://github.com/ssrg-vt/popcorn-kernel-lib. The branch 
matching this RFC kernel code base is called "upstream". 

Reverting L1TF protections
==========================

This initial iteration of code currently reverts L1TF side channel
protections on x86 systems. Future iterations of code will comply
with the L1TF patches and current pagetable walking algorithms.

Security Disclaimer
===================

Popcorn Linux assumes that it is operating across equally-trusted host
systems. This patch-set is intended to initiate discussions and should
not be built and loaded onto public (internet-connected) machines.
Popcorn Linux, as-is, runs a kernel-based daemon that listens for
messages passed to it via a TCP socket.  This IP-based message layer
is intended for Popcorn testing and development purposes only, given
obvious latency and security issues stemming from passing pages and
kernel structures over TCP.

Andrew Hughes (9):
  Core Popcorn Changes Popcorn Linux
  Add x86 specifc files for Popcorn
  Temporarily revert L1TF mitigation for Popcorn
  Popcorn system call additions
  Popcorn Utility
  Process Server for Popcorn Distributed Thread Execution
  Virtual Memory Address Server for Distributed Thread Execution
  Page Server for Distributed Thread Execution
  Add Popcorn Message Layer and socket support

 arch/x86/Kconfig                       |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 arch/x86/include/asm/pgtable-2level.h  |   17 -
 arch/x86/include/asm/pgtable-3level.h  |    2 -
 arch/x86/include/asm/pgtable.h         |   52 +-
 arch/x86/include/asm/pgtable_64.h      |    2 -
 arch/x86/kernel/Makefile               |    1 +
 arch/x86/kernel/process_server.c       |  250 +++
 arch/x86/mm/fault.c                    |   18 +
 arch/x86/mm/mmap.c                     |   21 -
 drivers/msg_layer/Kconfig              |   28 +
 drivers/msg_layer/Makefile             |    2 +
 drivers/msg_layer/common.h             |   63 +
 drivers/msg_layer/socket.c             |  710 +++++++++
 fs/proc/base.c                         |    9 +
 fs/read_write.c                        |   15 +-
 include/linux/mm_types.h               |   12 +
 include/linux/sched.h                  |   27 +-
 include/linux/syscalls.h               |    9 +
 include/popcorn/bundle.h               |   38 +
 include/popcorn/debug.h                |   38 +
 include/popcorn/page_server.h          |   34 +
 include/popcorn/pcn_kmsg.h             |  205 +++
 include/popcorn/process_server.h       |   18 +
 include/popcorn/regset.h               |   96 ++
 include/popcorn/stat.h                 |   16 +
 include/popcorn/types.h                |   20 +
 include/popcorn/vma_server.h           |   33 +
 include/uapi/asm-generic/mman-common.h |    4 +
 include/uapi/asm-generic/unistd.h      |   11 +-
 kernel/Kconfig.popcorn                 |   54 +
 kernel/Makefile                        |    1 +
 kernel/exit.c                          |    9 +
 kernel/fork.c                          |   51 +-
 kernel/futex.c                         |   32 +
 kernel/popcorn/Makefile                |    7 +
 kernel/popcorn/bundle.c                |  115 ++
 kernel/popcorn/fh_action.c             |  207 +++
 kernel/popcorn/fh_action.h             |   34 +
 kernel/popcorn/init.c                  |   58 +
 kernel/popcorn/page_server.c           | 2019 ++++++++++++++++++++++++
 kernel/popcorn/page_server.h           |   16 +
 kernel/popcorn/pcn_kmsg.c              |  231 +++
 kernel/popcorn/pgtable.h               |   31 +
 kernel/popcorn/process_server.c        | 1037 ++++++++++++
 kernel/popcorn/process_server.h        |   21 +
 kernel/popcorn/stat.c                  |  165 ++
 kernel/popcorn/trace_events.h          |   76 +
 kernel/popcorn/types.h                 |  358 +++++
 kernel/popcorn/util.c                  |  121 ++
 kernel/popcorn/util.h                  |   14 +
 kernel/popcorn/vma_server.c            |  818 ++++++++++
 kernel/popcorn/vma_server.h            |   24 +
 kernel/popcorn/wait_station.c          |   84 +
 kernel/popcorn/wait_station.h          |   27 +
 kernel/sched/core.c                    |  106 +-
 kernel/sys_ni.c                        |    3 +
 mm/gup.c                               |   18 +
 mm/internal.h                          |    4 +
 mm/madvise.c                           |   51 +
 mm/memory.c                            |  236 ++-
 mm/mmap.c                              |   53 +
 mm/mprotect.c                          |   70 +-
 mm/mremap.c                            |   20 +
 64 files changed, 7763 insertions(+), 165 deletions(-)
 create mode 100644 arch/x86/kernel/process_server.c
 create mode 100644 drivers/msg_layer/Kconfig
 create mode 100644 drivers/msg_layer/Makefile
 create mode 100644 drivers/msg_layer/common.h
 create mode 100644 drivers/msg_layer/socket.c
 create mode 100644 include/popcorn/bundle.h
 create mode 100644 include/popcorn/debug.h
 create mode 100644 include/popcorn/page_server.h
 create mode 100644 include/popcorn/pcn_kmsg.h
 create mode 100644 include/popcorn/process_server.h
 create mode 100644 include/popcorn/regset.h
 create mode 100644 include/popcorn/stat.h
 create mode 100644 include/popcorn/types.h
 create mode 100644 include/popcorn/vma_server.h
 create mode 100644 kernel/Kconfig.popcorn
 create mode 100644 kernel/popcorn/Makefile
 create mode 100644 kernel/popcorn/bundle.c
 create mode 100644 kernel/popcorn/fh_action.c
 create mode 100644 kernel/popcorn/fh_action.h
 create mode 100644 kernel/popcorn/init.c
 create mode 100644 kernel/popcorn/page_server.c
 create mode 100644 kernel/popcorn/page_server.h
 create mode 100644 kernel/popcorn/pcn_kmsg.c
 create mode 100644 kernel/popcorn/pgtable.h
 create mode 100644 kernel/popcorn/process_server.c
 create mode 100644 kernel/popcorn/process_server.h
 create mode 100644 kernel/popcorn/stat.c
 create mode 100644 kernel/popcorn/trace_events.h
 create mode 100644 kernel/popcorn/types.h
 create mode 100644 kernel/popcorn/util.c
 create mode 100644 kernel/popcorn/util.h
 create mode 100644 kernel/popcorn/vma_server.c
 create mode 100644 kernel/popcorn/vma_server.h
 create mode 100644 kernel/popcorn/wait_station.c
 create mode 100644 kernel/popcorn/wait_station.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 1/9] Core Popcorn Changes
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 2/9] Add x86 specifc files for Popcorn Javier Malave
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

Popcorn Linux is a Linux kernel-based software stack that 
enables applications to execute, with a shared source
base, on distributed hosts.

To achieve its goal of distributed multi-threaded
applications, Popcorn introduces certain core modifications
to the Linux kernel. These include the addition of
several system calls, a message layer and the Popcorn
implementation itself.

System Calls

In this RFC there are three system calls associated with
the migration mechanism.

popcorn_migrate: allows user to migrate a thread to another
popcorn registered node.

popcorn_get_node_info: get status information on Popcorn
registered nodes.

popcorn_get_thread_status: get status information of current
distributed thread.

Message Layer

Popcorn views compute resources as nodes. Each Popcorn node
may be multiple cores running an instance of the Linux kernel.
Each node registers itself in the network via the message layer
TCP/IP socket. Popcorn processes communicate with each other 
using Popcorn specific messages. Popcorn messages are used for 
the VMA coherency protocol, managing the necessary distributed 
locks, signaling process migration and exit.

Popcorn Implementation

The heart of Popcorn's implementation resides in kernel/popcorn.
Popcorn implements a main kernel thread to execute process
migration from its origin node to remote nodes. A pair of work
queues are used to process incoming messages and requests.

Work is tracked via a remote_context struct introduced by
Popcorn Linux. This struct also contains necessary information
to implement the VMA coherency protocol. As such, this struct is
embedded in task_struct and mm_struct and forms part of the
memory management modifications necessary to achieve
dynamic thread migration.

We welcome feedback to these core modifications. And look
forward to an open and productive discussion of the complete
Popcorn Linux work.
---
 fs/proc/base.c                         |   9 ++
 fs/read_write.c                        |  15 +-
 include/linux/mm_types.h               |  12 ++
 include/linux/sched.h                  |  27 +++-
 include/uapi/asm-generic/mman-common.h |   4 +
 kernel/Kconfig.popcorn                 |  54 +++++++
 kernel/Makefile                        |   1 +
 kernel/exit.c                          |   9 ++
 kernel/fork.c                          |  51 ++++++-
 kernel/futex.c                         |  32 ++++
 kernel/sched/core.c                    | 106 +++++++++++++-
 kernel/sys_ni.c                        |   3 +
 mm/gup.c                               |  18 +++
 mm/internal.h                          |   4 +
 mm/madvise.c                           |  51 +++++++
 mm/memory.c                            | 195 ++++++++++++++++++++++++-
 mm/mmap.c                              |  53 +++++++
 mm/mprotect.c                          |  21 ++-
 mm/mremap.c                            |  20 +++
 19 files changed, 679 insertions(+), 6 deletions(-)
 create mode 100644 kernel/Kconfig.popcorn

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9c8ca6cd3..887f36c55 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -88,6 +88,9 @@
 #include <linux/user_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#endif
 #include <linux/sched/autogroup.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
@@ -345,6 +348,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 	tsk = get_proc_task(file_inode(file));
 	if (!tsk)
 		return -ESRCH;
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(tsk)) {
+		put_task_struct(tsk);
+		return 0;
+	}
+#endif
 	ret = get_task_cmdline(tsk, buf, count, pos);
 	put_task_struct(tsk);
 	if (ret > 0)
diff --git a/fs/read_write.c b/fs/read_write.c
index c543d965e..b0bc6aefc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -573,11 +573,18 @@ static inline loff_t *file_ppos(struct file *file)
 	return file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
 }
 
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+#include <popcorn/types.h>
+#endif
 ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
-
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+	if (WARN_ON(distributed_remote_process(current))) {
+		printk("  file read at remote thread is not supported yet\n");
+	}
+#endif
 	if (f.file) {
 		loff_t pos, *ppos = file_ppos(f.file);
 		if (ppos) {
@@ -602,6 +609,12 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+	if (WARN_ON(distributed_remote_process(current))) {
+		printk("  file write at remote thread is not supported yet\n");
+	}
+#endif
+
 	if (f.file) {
 		loff_t pos, *ppos = file_ppos(f.file);
 		if (ppos) {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8ec38b11b..b041bce9c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -17,6 +17,10 @@
 
 #include <asm/mmu.h>
 
+#ifdef CONFIG_POPCORN
+struct remote_context;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -505,6 +509,10 @@ struct mm_struct {
 		/* HMM needs to track a few things per mm */
 		struct hmm *hmm;
 #endif
+#ifdef CONFIG_POPCORN
+		struct remote_context *remote;
+#endif
+
 	} __randomize_layout;
 
 	/*
@@ -670,6 +678,10 @@ enum vm_fault_reason {
 	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
 	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
 	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
+#ifdef CONFIG_POPCORN
+	VM_FAULT_CONTINUE       = (__force vm_fault_t)0x004000,
+	VM_FAULT_KILLED         = (__force vm_fault_t)0x008000,
+#endif
 };
 
 /* Encode hstate index for a hwpoisoned large page */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 118374106..7c787d435 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,7 +29,9 @@
 #include <linux/mm_types_task.h>
 #include <linux/task_io_accounting.h>
 #include <linux/rseq.h>
-
+#ifdef CONFIG_POPCORN
+#include <linux/completion.h>
+#endif
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
 struct backing_dev_info;
@@ -1177,6 +1179,29 @@ struct task_struct {
 	unsigned long			task_state_change;
 #endif
 	int				pagefault_disabled;
+#ifdef CONFIG_POPCORN
+	struct remote_context *remote;
+	union {
+		int peer_nid;
+		int remote_nid;
+		int origin_nid;
+	};
+	union {
+		pid_t peer_pid;
+		pid_t remote_pid;
+		pid_t origin_pid;
+	};
+
+	bool is_worker;			/* kernel thread that manages the process*/
+	bool at_remote;			/* Is executing on behalf of another node? */
+
+	volatile void *remote_work;
+	struct completion remote_work_pended;
+
+	int migration_target_nid;
+	int backoff_weight;
+#endif
+
 #ifdef CONFIG_MMU
 	struct task_struct		*oom_reaper_list;
 #endif
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f..cd60c857e 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -64,6 +64,10 @@
 #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
 
+#ifdef CONFIG_POPCORN
+#define MADV_RELEASE 20
+#endif
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/Kconfig.popcorn b/kernel/Kconfig.popcorn
new file mode 100644
index 000000000..3ed8b4fc3
--- /dev/null
+++ b/kernel/Kconfig.popcorn
@@ -0,0 +1,54 @@
+menu "Popcorn Distributed Execution Support"
+
+# This is selected by all the architectures Popcorn supports
+config ARCH_SUPPORTS_POPCORN
+	bool
+
+config POPCORN
+	bool "Popcorn Distributed Execution Support"
+	depends on ARCH_SUPPORTS_POPCORN
+	default n
+	help
+	  Enable or disable the Popcorn multi-kernel Linux support.
+
+if POPCORN
+
+config POPCORN_DEBUG
+	bool "Log debug messages for Popcorn"
+	default n
+	help
+	  Enable or disable kernel messages that can help debug Popcorn issues.
+
+config POPCORN_DEBUG_PROCESS_SERVER
+	bool "Debug task migration"
+	depends on POPCORN_DEBUG
+	default n
+
+config POPCORN_DEBUG_PAGE_SERVER
+	bool "Debug page migration"
+	depends on POPCORN_DEBUG
+	default n
+
+config POPCORN_DEBUG_VMA_SERVER
+	bool "Debug VMA handling"
+	depends on POPCORN_DEBUG
+	default n
+
+config POPCORN_DEBUG_VERBOSE
+	bool "Log more debug messages"
+	depends on POPCORN_DEBUG
+	default n
+
+config POPCORN_CHECK_SANITY
+	bool "Perform extra-sanity checks"
+	default y
+
+
+comment "Popcorn is not currently supported on this architecture"
+	depends on !ARCH_SUPPORTS_POPCORN
+
+source "drivers/msg_layer/Kconfig"
+
+endif # POPCORN
+
+endmenu
diff --git a/kernel/Makefile b/kernel/Makefile
index a8d923b54..c082d96a8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_POPCORN) += popcorn/
 
 obj-$(CONFIG_HAS_IOMEM) += iomem.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
diff --git a/kernel/exit.c b/kernel/exit.c
index 1803efb29..207c2d12f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -69,6 +69,10 @@
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/process_server.h>
+#endif
+
 static void __unhash_process(struct task_struct *p, bool group_dead)
 {
 	nr_threads--;
@@ -503,6 +507,11 @@ static void exit_mm(void)
 	if (!mm)
 		return;
 	sync_mm_rss(mm);
+
+#ifdef CONFIG_POPCORN
+	process_server_task_exit(current);
+#endif
+
 	/*
 	 * Serialize with any possible pending coredump.
 	 * We must hold mmap_sem around checking core_state
diff --git a/kernel/fork.c b/kernel/fork.c
index 75675b9bf..c49a72b16 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -107,6 +107,11 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#include <popcorn/process_server.h>
+#endif
+
 /*
  * Minimum number of threads to boot the kernel
  */
@@ -923,6 +928,44 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
+
+#ifdef CONFIG_POPCORN
+	/*
+	 * Reset variables for tracking remote execution
+	 */
+	tsk->remote = NULL;
+	tsk->remote_nid = tsk->origin_nid = -1;
+	tsk->remote_pid = tsk->origin_pid = -1;
+
+	tsk->is_worker = false;
+
+	/*
+	 * If the new tsk is not in the same thread group as the parent,
+	 * then we do not need to propagate the old thread info.
+	 * Otherwise, make sure to keep an accurate record
+	 * of which node and thread group the new thread is a part of.
+	 */
+	if (orig->tgid != tsk->tgid) {
+		tsk->at_remote = false;
+	}
+
+	tsk->remote_work = NULL;
+	init_completion(&tsk->remote_work_pended);
+
+	tsk->migration_target_nid = -1;
+	tsk->backoff_weight = 0;
+
+	/*
+	 * Temporarily boost the priviledge to exploit thread bootstrapping
+	 * in copy_thread_tls() during kernel_thread(). Will be demoted in the
+	 * remote thread context.
+	 */
+	if (orig->is_worker) {
+		tsk->flags |= PF_KTHREAD;
+	}
+
+#endif // CONFIG_POPCORN
+
 	return tsk;
 
 free_stack:
@@ -1006,6 +1049,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
+#endif
+#ifdef CONFIG_POPCORN
+	mm->remote = NULL;
 #endif
 	mm_init_uprobes_state(mm);
 
@@ -1066,6 +1112,10 @@ static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+#ifdef CONFIG_POPCORN
+	if (mm->remote)
+		free_remote_context(mm->remote);
+#endif
 	mmdrop(mm);
 }
 
@@ -1927,7 +1977,6 @@ static __latent_entropy struct task_struct *copy_process(
 	p->utimescaled = p->stimescaled = 0;
 #endif
 	prev_cputime_init(&p->prev_cputime);
-
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_init(&p->vtime.seqcount);
 	p->vtime.starttime = 0;
diff --git a/kernel/futex.c b/kernel/futex.c
index 4b5b468c5..e374295e1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -59,6 +59,12 @@
 
 #include <asm/futex.h>
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#include <popcorn/process_server.h>
+#include <popcorn/page_server.h>
+#endif
+
 #include "locking/rtmutex_common.h"
 
 /*
@@ -2684,6 +2690,9 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int ret;
+#ifdef CONFIG_POPCORN
+	struct fault_handle *fh = NULL;
+#endif
 
 	if (!bitset)
 		return -EINVAL;
@@ -2701,11 +2710,19 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	}
 
 retry:
+#ifdef CONFIG_POPCORN
+	ret = page_server_get_userpage(uaddr, &fh, "wait");
+	if (ret < 0)
+		goto out;
+#endif
 	/*
 	 * Prepare to wait on uaddr. On success, holds hb lock and increments
 	 * q.key refs.
 	 */
 	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
+#ifdef CONFIG_POPCORN
+	page_server_put_userpage(fh, "wait");
+#endif
 	if (ret)
 		goto out;
 
@@ -3629,6 +3646,15 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 			return -ENOSYS;
 	}
 
+#ifdef CONFIG_POPCORN
+	if (distributed_process(current)) {
+		WARN_ON(cmd != FUTEX_WAIT &&
+				cmd != FUTEX_WAIT_BITSET &&
+				cmd != FUTEX_WAKE &&
+				cmd != FUTEX_WAKE_BITSET);
+	}
+#endif
+
 	switch (cmd) {
 	case FUTEX_WAIT:
 		val3 = FUTEX_BITSET_MATCH_ANY;
@@ -3695,6 +3721,12 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 	    cmd == FUTEX_CMP_REQUEUE_PI || cmd == FUTEX_WAKE_OP)
 		val2 = (u32) (unsigned long) utime;
 
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(current)) {
+		return process_server_do_futex_at_remote(
+				uaddr, op, val, tp ? true : false, &ts, uaddr2, val2, val3);
+	}
+#endif
 	return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c42774..4bcb43f18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2770,6 +2770,9 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 
 	calculate_sigpending();
 }
+#ifdef CONFIG_POPCORN_DEBUG
+extern void trace_task_status(void);
+#endif
 
 /*
  * context_switch - switch to the new MM and the new thread's register state.
@@ -2779,7 +2782,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
 	struct mm_struct *mm, *oldmm;
-
+#ifdef CONFIG_POPCORN_DEBUG
+	trace_task_status();
+#endif
 	prepare_task_switch(rq, prev, next);
 
 	mm = next->mm;
@@ -4912,6 +4917,105 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
 	return ret;
 }
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/bundle.h>
+#include <popcorn/types.h>
+#include <popcorn/process_server.h>
+
+SYSCALL_DEFINE1(popcorn_get_thread_status, struct popcorn_thread_status __user *, status)
+{
+	struct popcorn_thread_status st = {
+		.current_nid = my_nid,
+		.peer_nid = current->peer_nid,
+		.peer_pid = current->peer_pid,
+	};
+
+	if (!access_ok(status, sizeof(*status))) {
+		return -EINVAL;
+	}
+
+	if (copy_to_user(status, &st, sizeof(st))) {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+SYSCALL_DEFINE3(popcorn_get_node_info, int *, _my_nid, struct popcorn_node_info __user *, info, int, len)
+{
+	int i;
+
+	if (!access_ok(_my_nid, sizeof(*_my_nid))) {
+		return -EINVAL;
+	}
+	if (copy_to_user(_my_nid, &my_nid, sizeof(my_nid))) {
+		return -EINVAL;
+	}
+
+	if (!access_ok(info, sizeof(*info) * MAX_POPCORN_NODES)) {
+		return -EINVAL;
+	}
+	for (i = 0; i < len; i++) {
+		struct popcorn_node_info res = {
+			.status = 0,
+			.arch = POPCORN_ARCH_UNKNOWN,
+			.distance = 0,
+		};
+		struct popcorn_node_info __user *ni = info + i;
+
+		if (get_popcorn_node_online(i)) {
+			res.status = 1;
+			res.arch = get_popcorn_node_arch(i);
+		}
+
+		if (copy_to_user(ni, &res, sizeof(res))) {
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+#pragma GCC optimize ("no-omit-frame-pointer")
+#pragma GCC optimize ("no-optimize-sibling-calls")
+SYSCALL_DEFINE2(popcorn_migrate, int, nid, void __user *, uregs)
+{
+	int ret;
+	PSPRINTK("####### MIGRATE [%d] to %d\n", current->pid, nid);
+
+	if (nid == -1) {
+		nid = current->migration_target_nid;
+	}
+	if (nid < 0 || nid >= MAX_POPCORN_NODES) {
+		PSPRINTK("  [%d] invalid migration destination %d\n",
+				current->pid, nid);
+		return -EINVAL;
+	}
+	if (nid == my_nid) {
+		PSPRINTK("  [%d] already running at the destination %d\n",
+				current->pid, nid);
+		return -EBUSY;
+	}
+
+	if (!get_popcorn_node_online(nid)) {
+		PSPRINTK("  [%d] destination node %d is offline\n",
+				current->pid, nid);
+		return -EAGAIN;
+	}
+
+	ret = process_server_do_migration(current, nid, uregs);
+	if (ret) return ret;
+
+	current->migration_target_nid = -1;
+
+	update_frame_pointer();
+#ifdef CONFIG_POPCORN_DEBUG_VERBOSE
+	PSPRINTK("  [%d] resume execution\n", current->pid);
+#endif
+	return 0;
+}
+#pragma GCC reset_options
+#endif // CONFIG_POPCORN
+
 /**
  * sys_sched_yield - yield the current processor to other threads.
  *
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6..51e19ede1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -166,6 +166,9 @@ COND_SYSCALL(syslog);
 /* kernel/ptrace.c */
 
 /* kernel/sched/core.c */
+COND_SYSCALL(popcorn_migrate);
+COND_SYSCALL(popcorn_get_node_info);
+COND_SYSCALL(popcorn_get_thread_status);
 
 /* kernel/sys.c */
 COND_SYSCALL(setregid);
diff --git a/mm/gup.c b/mm/gup.c
index ddde097cf..f3ca58a7b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -22,6 +22,11 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/process_server.h>
+#include <popcorn/vma_server.h>
+#endif
+
 #include "internal.h"
 
 struct follow_page_context {
@@ -969,6 +974,19 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 
 retry:
 	vma = find_extend_vma(mm, address);
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(tsk)) {
+		if (!vma || address < vma->vm_start) {
+			if (vma_server_fetch_vma(tsk, address) == 0) {
+				/* Replace with updated VMA */
+				vma = find_extend_vma(mm, address);
+			} else {
+				return -ENOMEM;
+			}
+		}
+	}
+#endif
+
 	if (!vma || address < vma->vm_start)
 		return -EFAULT;
 
diff --git a/mm/internal.h b/mm/internal.h
index e32390802..e945732ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -9,6 +9,10 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#endif
 #include <linux/pagemap.h>
 #include <linux/tracepoint-defs.h>
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e67..4d13609d7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -28,6 +28,12 @@
 #include <asm/tlb.h>
 
 #include "internal.h"
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#include <popcorn/vma_server.h>
+#include <popcorn/page_server.h>
+#include <popcorn/bundle.h>
+#endif
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -686,6 +692,23 @@ static int madvise_inject_error(int behavior,
 }
 #endif
 
+#ifdef CONFIG_POPCORN
+int madvise_release(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	int nr_pages = 0;
+	unsigned long addr;
+
+	/* mmap_sem is held */
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		nr_pages += page_server_release_page_ownership(vma, addr);
+	}
+
+	VSPRINTK("  [%d] %d %d / %ld %lx-%lx\n", current->pid, my_nid,
+			nr_pages, (end - start) / PAGE_SIZE, start, end);
+	return 0;
+}
+#endif
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -698,6 +721,10 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
+#ifdef CONFIG_POPCORN
+	case MADV_RELEASE:
+		return madvise_release(vma, start, end);
+#endif
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -726,6 +753,10 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+
+#ifdef CONFIG_POPCORN
+	case MADV_RELEASE:
+#endif
 	case MADV_WIPEONFORK:
 	case MADV_KEEPONFORK:
 #ifdef CONFIG_MEMORY_FAILURE
@@ -809,6 +840,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	int write;
 	size_t len;
 	struct blk_plug plug;
+#ifdef CONFIG_POPCORN
+	unsigned long start_orig = start;
+	size_t len_orig = len_in;
+#endif
+
 
 	if (!madvise_behavior_valid(behavior))
 		return error;
@@ -893,5 +929,20 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	else
 		up_read(&current->mm->mmap_sem);
 
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(current)) {
+		error = vma_server_madvise_remote(start_orig, len_orig, behavior);
+		if (error)
+			return error;
+	}
+#endif
+
 	return error;
 }
+
+#ifdef CONFIG_POPCORN
+long ksys_madvise(unsigned long start, size_t len, int behavior)
+{
+	return __do_sys_madvise(start, len, behavior);
+}
+#endif
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c..dd972a6a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -81,6 +81,11 @@
 #include <asm/pgtable.h>
 
 #include "internal.h"
+#ifdef CONFIG_POPCORN
+#include <linux/delay.h>
+#include <popcorn/page_server.h>
+#include <popcorn/process_server.h>
+#endif
 
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
@@ -1059,6 +1064,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		pte_t ptent = *pte;
 		if (pte_none(ptent))
 			continue;
+#ifdef CONFIG_POPCORN
+		page_server_zap_pte(vma, addr, pte, &ptent);
+#endif
 
 		if (pte_present(ptent)) {
 			struct page *page;
@@ -3889,7 +3897,29 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			vmf->pte = NULL;
 		}
 	}
+#ifdef CONFIG_POPCORN
+	if (distributed_process(current)) {
+		int ret;
+		if (pmd_none(*vmf->pmd)) {
+			if (__pte_alloc(vmf->vma->vm_mm, vmf->pmd))
+				return VM_FAULT_OOM;
+		}
 
+		ret = page_server_handle_pte_fault(vmf);
+		if (ret == VM_FAULT_RETRY) {
+			int backoff = ++current->backoff_weight;
+			PGPRINTK("  [%d] backoff %d\n", current->pid, backoff);
+			if (backoff <= 10) {
+				udelay(backoff * 100);
+			} else {
+				msleep(backoff - 10);
+			}
+		} else {
+			current->backoff_weight /= 2;
+		}
+		if (ret != VM_FAULT_CONTINUE) return ret;
+	}
+#endif
 	if (!vmf->pte) {
 		if (vma_is_anonymous(vmf->vma))
 			return do_anonymous_page(vmf);
@@ -3897,8 +3927,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			return do_fault(vmf);
 	}
 
-	if (!pte_present(vmf->orig_pte))
+	if (!pte_present(vmf->orig_pte)) {
+#ifdef CONFIG_POPCORN
+		page_server_panic(true, vmf->vma->vm_mm,
+				  vmf->address, vmf->pte, entry);
+#endif
 		return do_swap_page(vmf);
+	}
 
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
@@ -3932,6 +3967,164 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_POPCORN
+struct page *get_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t *pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mem_cgroup *memcg;
+	struct page *page;
+	pte_t entry = *pte;
+
+	if ((page = vm_normal_page(vma, addr, entry)))
+		return page;
+
+	BUG_ON(!is_zero_pfn(pte_pfn(entry)) && "Cannot handle this special page");
+
+	page = alloc_zeroed_user_highpage_movable(vma, addr);
+	if (!page)
+		return NULL;
+
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+		put_page(page);
+		return NULL;
+	}
+
+	__SetPageUptodate(page);
+
+	entry = mk_pte(page, vma->vm_page_prot);
+	if (vma->vm_flags & VM_WRITE)
+		entry = pte_mkwrite(pte_mkdirty(entry));
+
+	inc_mm_counter_fast(mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, vma, addr, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
+	lru_cache_add_active_or_unevictable(page, vma);
+
+	set_pte_at_notify(mm, addr, pte, entry);
+	update_mmu_cache(vma, addr, pte);
+	flush_tlb_page(vma, addr);
+
+	return page;
+}
+
+int handle_pte_fault_origin(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags)
+{
+	struct mem_cgroup *memcg;
+	struct page *page;
+	spinlock_t *ptl;
+	pte_t entry = *pte;
+	struct vm_fault vmf = {
+		.vma = vma,
+		.address = address & PAGE_MASK,
+		.flags = flags,
+		.pgoff = linear_page_index(vma, address),
+		.gfp_mask = __get_fault_gfp_mask(vma),
+	};
+
+	barrier();
+
+	/* TODO this is broken, vmd is not populated. And cast probably breaks things */
+	if (!vma_is_anonymous(vma))
+	  return do_fault(&vmf);
+
+	/**
+	 * Following is for anonymous page. Almost same to do_anonymos_page
+	 * except it allocates page upon read
+	 */
+	pte_unmap(pte);
+
+	if (vma->vm_flags & VM_SHARED) return VM_FAULT_SIGBUS;
+
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+
+	page = alloc_zeroed_user_highpage_movable(vma, address);
+	if (!page)
+		return VM_FAULT_OOM;
+
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+		put_page(page);
+		return VM_FAULT_OOM;
+	}
+
+	__SetPageUptodate(page);
+
+	entry = mk_pte(page, vma->vm_page_prot);
+	if (vma->vm_flags & VM_WRITE)
+		entry = pte_mkwrite(pte_mkdirty(entry));
+
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_none(*pte)) {
+		/* Somebody already attached a page */
+		mem_cgroup_cancel_charge(page, memcg, false);
+		put_page(page);
+	} else {
+		inc_mm_counter_fast(mm, MM_ANONPAGES);
+		page_add_new_anon_rmap(page, vma, address, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
+		lru_cache_add_active_or_unevictable(page, vma);
+
+		set_pte_at(mm, address, pte, entry);
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, address, pte);
+	}
+	pte_unmap_unlock(pte, ptl);
+	return 0;
+}
+
+int cow_file_at_origin(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, pte_t *pte)
+{
+	struct page *new_page, *old_page;
+	struct mem_cgroup *memcg;
+	pte_t entry;
+
+	/**
+	 * Following is very similar to do_wp_page() and wp_page_copy()
+	 */
+	if (anon_vma_prepare(vma))
+		return VM_FAULT_OOM;
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+	if (!new_page) return VM_FAULT_OOM;
+
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
+		put_page(new_page);
+		return VM_FAULT_OOM;
+	}
+
+	old_page = vm_normal_page(vma, addr, *pte);
+	BUG_ON(!old_page);
+	BUG_ON(PageAnon(old_page));
+
+	get_page(old_page);
+
+	copy_user_highpage(new_page, old_page, addr, vma);
+	__SetPageUptodate(new_page);
+
+	dec_mm_counter_fast(mm, MM_FILEPAGES);
+	inc_mm_counter_fast(mm, MM_ANONPAGES);
+
+	flush_cache_page(vma, addr, pte_pfn(*pte));
+	entry = mk_pte(new_page, vma->vm_page_prot);
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+	ptep_clear_flush_notify(vma, addr, pte);
+	page_add_new_anon_rmap(new_page, vma, addr, false);
+	mem_cgroup_commit_charge(new_page, memcg, false, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
+
+	set_pte_at_notify(mm, addr, pte, entry);
+	update_mmu_cache(vma, addr, pte);
+
+	page_remove_rmap(old_page, false);
+	put_page(old_page);
+
+	return 0;
+}
+#endif /* CONFIG_POPCORN */
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae..9d25692e5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -53,6 +53,12 @@
 #include <asm/tlb.h>
 #include <asm/mmu_context.h>
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/bundle.h>
+#include <popcorn/types.h>
+#include <popcorn/vma_server.h>
+#endif
+
 #include "internal.h"
 
 #ifndef arch_mmap_check
@@ -200,6 +206,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	bool populate;
 	bool downgraded = false;
 	LIST_HEAD(uf);
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(current)) {
+		while (!down_write_trylock(&mm->mmap_sem))
+			schedule();
+	}
+#endif
 
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
@@ -281,6 +293,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	userfaultfd_unmap_complete(mm, &uf);
 	if (populate)
 		mm_populate(oldbrk, newbrk - oldbrk);
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(current)) {
+		if (vma_server_brk_remote(oldbrk, brk)) {
+			return brk;
+		}
+	}
+#endif
 	return brk;
 
 out:
@@ -289,6 +308,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	return retval;
 }
 
+#ifdef CONFIG_POPCORN
+long ksys_brk(unsigned long addr)
+{
+	return __do_sys_brk(addr);
+}
+#endif
+
 static long vma_compute_subtree_gap(struct vm_area_struct *vma)
 {
 	unsigned long max, prev_end, subtree_gap;
@@ -1607,6 +1633,12 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 	}
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(current)) {
+		retval = vma_server_mmap_remote(file, addr, len, prot, flags, pgoff);
+		goto out_fput;
+	}
+#endif
 
 	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 out_fput:
@@ -2846,9 +2878,20 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
 	int ret;
 	struct mm_struct *mm = current->mm;
 	LIST_HEAD(uf);
+#ifdef CONFIG_POPCORN
+	if (distributed_process(current)) {
+		while (!down_write_trylock(&mm->mmap_sem))
+			schedule();
+	} else {
+		if (down_write_killable(&mm->mmap_sem))
+			return -EINTR;
+
+	}
+#else
 
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
+#endif
 
 	ret = __do_munmap(mm, start, len, &uf, downgrade);
 	/*
@@ -2875,6 +2918,16 @@ EXPORT_SYMBOL(vm_munmap);
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
+
+#ifdef CONFIG_POPCORN
+	if (unlikely(distributed_process(current))) {
+		if (current->at_remote) {
+			return vma_server_munmap_remote(addr, len);
+		} else {
+			return vma_server_munmap_origin(addr, len, my_nid);
+		}
+	}
+#endif
 	return __vm_munmap(addr, len, true);
 }
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index bf38dfbbb..d78e9dbc5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -24,6 +24,11 @@
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/perf_event.h>
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#include <popcorn/vma_server.h>
+#endif
+
 #include <linux/pkeys.h>
 #include <linux/ksm.h>
 #include <linux/uaccess.h>
@@ -479,7 +484,13 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		return -ENOMEM;
 	if (!arch_validate_prot(prot, start))
 		return -EINVAL;
-
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(current)) {
+		error = vma_server_mprotect_remote(start, len, prot);
+		if (error)
+			return error;
+	}
+#endif
 	reqprot = prot;
 
 	if (down_write_killable(&current->mm->mmap_sem))
@@ -582,6 +593,14 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 	return do_mprotect_pkey(start, len, prot, -1);
 }
 
+#ifdef CONFIG_POPCORN
+long ksys_mprotect(unsigned long start, size_t len,
+		  unsigned long prot)
+{
+        return __do_sys_mprotect(start, len, prot);
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 
 SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
diff --git a/mm/mremap.c b/mm/mremap.c
index fc241d23c..3d9e26352 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -30,6 +30,11 @@
 
 #include "internal.h"
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#include <popcorn/vma_server.h>
+#endif
+
 static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
@@ -617,6 +622,12 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 
 	old_len = PAGE_ALIGN(old_len);
 	new_len = PAGE_ALIGN(new_len);
+#ifdef CONFIG_POPCORN
+	if (distributed_remote_process(current)) {
+		vma_server_mremap_remote(addr, old_len, new_len, flags, new_addr);
+	}
+#endif
+
 
 	/*
 	 * We allow a zero old-len as a special case
@@ -727,3 +738,12 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	userfaultfd_unmap_complete(mm, &uf_unmap);
 	return ret;
 }
+
+#ifdef CONFIG_POPCORN
+long ksys_mremap(unsigned long addr,
+		 unsigned long old_len, unsigned long new_len,
+		 unsigned long flags, unsigned long new_addr)
+{
+        return __do_sys_mremap(addr, old_len, new_len, flags, new_addr);
+}
+#endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 2/9] Add x86 specifc files for Popcorn
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
  2020-04-29 19:32   ` [RFC 1/9] Core Popcorn Changes Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 3/9] Temporary revert L1TF mitigation " Javier Malave
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

Popcorn Linux is a Linux kernel-based software stack
that enables applications to execute, with a shared
source base, on distributed hosts.

This Popcorn patch adds x86_64 functionality only. 
Future iterations of Popcorn will support
heterogeneous architectures.
---
 arch/x86/Kconfig               |  3 +++
 arch/x86/include/asm/pgtable.h | 11 +++++++++++
 arch/x86/kernel/Makefile       |  1 +
 arch/x86/mm/fault.c            | 18 ++++++++++++++++++
 4 files changed, 33 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1b..4ca75c6c3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -88,6 +88,7 @@ config X86
 	select ARCH_SUPPORTS_ACPI
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
+	select ARCH_SUPPORTS_POPCORN	if X86_64
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
@@ -353,6 +354,8 @@ config PGTABLE_LEVELS
 	default 3 if X86_PAE
 	default 2
 
+source "kernel/Kconfig.popcorn"
+
 config CC_HAS_SANE_STACKPROTECTOR
 	bool
 	default $(success,$(srctree)/scripts/gcc-x86_64-has-stack-protector.sh $(CC)) if 64BIT
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5e0509b41..7a0171c67 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -210,9 +210,13 @@ static inline u64 protnone_mask(u64 val);
 
 static inline unsigned long pte_pfn(pte_t pte)
 {
+#ifdef CONFIG_POPCORN
+	return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
+#else
 	phys_addr_t pfn = pte_val(pte);
 	pfn ^= protnone_mask(pfn);
 	return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
+#endif
 }
 
 static inline unsigned long pmd_pfn(pmd_t pmd)
@@ -602,7 +606,11 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
+#ifndef CONFIG_POPCORN
 	pteval_t val = pte_val(pte), oldval = val;
+#else
+	pteval_t val = pte_val(pte);
+#endif
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
@@ -610,8 +618,11 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 */
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+#ifndef CONFIG_POPCORN
 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
+#endif
 	return __pte(val);
+
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ce1b5cc36..3fb863285 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -133,6 +133,7 @@ obj-$(CONFIG_EFI)			+= sysfb_efi.o
 
 obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
+obj-$(CONFIG_POPCORN)		+= process_server.o
 obj-$(CONFIG_SCHED_MC_PRIO)		+= itmt.o
 obj-$(CONFIG_X86_INTEL_UMIP)		+= umip.o
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 46df4c6aa..14b9755f9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,11 @@
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
 
+#ifdef CONFIG_POPCORN
+#include <popcorn/types.h>
+#include <popcorn/vma_server.h>
+#endif
+
 /*
  * Returns 0 if mmiotrace is disabled, or if the fault is not
  * handled by mmiotrace:
@@ -1416,6 +1421,19 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 
 	vma = find_vma(mm, address);
+#ifdef CONFIG_POPCORN
+	/* vma worker should not fault */
+	BUG_ON(tsk->is_worker);
+
+	if (distributed_remote_process(tsk)) {
+		if (!vma || vma->vm_start > address) {
+			if (vma_server_fetch_vma(tsk, address) == 0) {
+				/* Replace with updated VMA */
+				vma = find_vma(mm, address);
+			}
+		}
+	}
+#endif
 	if (unlikely(!vma)) {
 		bad_area(regs, hw_error_code, address);
 		return;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 3/9] Temporary revert L1TF mitigation for Popcorn
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
  2020-04-29 19:32   ` [RFC 1/9] Core Popcorn Changes Javier Malave
  2020-04-29 19:32   ` [RFC 2/9] Add x86 specifc files for Popcorn Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 4/9] Popcorn system call additions Javier Malave
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

Popcorn Linux is a Linux kernel-based software stack
that enables applications to execute, with a shared
source base, on distributed hosts.

To ensure correct functionality across hosts and
focus feedback during the RFC process we have
reverted L1TF mitigations for x86 mitigations
temporarily. Future iterations of Popcorn will
comply with the L1TF mitigations.
---
 arch/x86/include/asm/pgtable-2level.h | 17 --------
 arch/x86/include/asm/pgtable-3level.h |  2 -
 arch/x86/include/asm/pgtable.h        | 59 +++++----------------------
 arch/x86/include/asm/pgtable_64.h     |  2 -
 arch/x86/mm/mmap.c                    | 21 ----------
 mm/memory.c                           | 41 +++++++------------
 mm/mprotect.c                         | 49 ----------------------
 7 files changed, 25 insertions(+), 166 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 60d0f9015..685ffe8a0 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -95,21 +95,4 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
-/* No inverted PFNs on 2 level page tables */
-
-static inline u64 protnone_mask(u64 val)
-{
-	return 0;
-}
-
-static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
-{
-	return val;
-}
-
-static inline bool __pte_needs_invert(u64 val)
-{
-	return false;
-}
-
 #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index f8b1ad2c3..fa1a7047f 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -332,6 +332,4 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 	return pte;
 }
 
-#include <asm/pgtable-invert.h>
-
 #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7a0171c67..21a97114d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -204,33 +204,19 @@ static inline int pte_special(pte_t pte)
 	return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
-/* Entries that were set to PROT_NONE are inverted */
-
-static inline u64 protnone_mask(u64 val);
-
 static inline unsigned long pte_pfn(pte_t pte)
 {
-#ifdef CONFIG_POPCORN
 	return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
-#else
-	phys_addr_t pfn = pte_val(pte);
-	pfn ^= protnone_mask(pfn);
-	return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
-#endif
 }
 
 static inline unsigned long pmd_pfn(pmd_t pmd)
 {
-	phys_addr_t pfn = pmd_val(pmd);
-	pfn ^= protnone_mask(pfn);
-	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
+	return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
 }
 
 static inline unsigned long pud_pfn(pud_t pud)
 {
-	phys_addr_t pfn = pud_val(pud);
-	pfn ^= protnone_mask(pfn);
-	return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
+	return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
 }
 
 static inline unsigned long p4d_pfn(p4d_t p4d)
@@ -568,26 +554,20 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
 
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
-	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
-	pfn ^= protnone_mask(pgprot_val(pgprot));
-	pfn &= PTE_PFN_MASK;
-	return __pte(pfn | check_pgprot(pgprot));
+	return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
+		     check_pgprot(pgprot));
 }
 
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
-	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
-	pfn ^= protnone_mask(pgprot_val(pgprot));
-	pfn &= PHYSICAL_PMD_PAGE_MASK;
-	return __pmd(pfn | check_pgprot(pgprot));
+	return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
+		     check_pgprot(pgprot));
 }
 
 static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)
 {
-	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
-	pfn ^= protnone_mask(pgprot_val(pgprot));
-	pfn &= PHYSICAL_PUD_PAGE_MASK;
-	return __pud(pfn | check_pgprot(pgprot));
+	return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |
+		     check_pgprot(pgprot));
 }
 
 static inline pmd_t pmd_mknotpresent(pmd_t pmd)
@@ -602,15 +582,9 @@ static inline pud_t pud_mknotpresent(pud_t pud)
 	      __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
-static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
-
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
-#ifndef CONFIG_POPCORN
-	pteval_t val = pte_val(pte), oldval = val;
-#else
 	pteval_t val = pte_val(pte);
-#endif
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
@@ -618,20 +592,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 */
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
-#ifndef CONFIG_POPCORN
-	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
-#endif
-	return __pte(val);
 
+	return __pte(val);
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
-	pmdval_t val = pmd_val(pmd), oldval = val;
+	pmdval_t val = pmd_val(pmd);
 
 	val &= _HPAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
-	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
+
 	return __pmd(val);
 }
 
@@ -1466,14 +1437,6 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
 	return __pte_access_permitted(pud_val(pud), write);
 }
 
-#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
-extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
-
-static inline bool arch_has_pfn_modify_check(void)
-{
-	return boot_cpu_has_bug(X86_BUG_L1TF);
-}
-
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 0bb566315..0e529a4a5 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -272,7 +272,5 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages)
 	return true;
 }
 
-#include <asm/pgtable-invert.h>
-
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index aae9a933d..1afca4ddf 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -227,24 +227,3 @@ int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
 
 	return phys_addr_valid(addr + count - 1);
 }
-
-/*
- * Only allow root to set high MMIO mappings to PROT_NONE.
- * This prevents an unpriv. user to set them to PROT_NONE and invert
- * them, then pointing to valid memory for L1TF speculation.
- *
- * Note: for locked down kernels may want to disable the root override.
- */
-bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
-{
-	if (!boot_cpu_has_bug(X86_BUG_L1TF))
-		return true;
-	if (!__pte_needs_invert(pgprot_val(prot)))
-		return true;
-	/* If it's real memory always allow */
-	if (pfn_valid(pfn))
-		return true;
-	if (pfn >= l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
-		return false;
-	return true;
-}
diff --git a/mm/memory.c b/mm/memory.c
index dd972a6a1..a93c9a9dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -119,9 +119,13 @@ EXPORT_SYMBOL(high_memory);
 int randomize_va_space __read_mostly =
 #ifdef CONFIG_COMPAT_BRK
 					1;
+#else
+#ifdef CONFIG_POPCORN
+					0;	/* Popcorn needs address space randomization to be turned off for the time being */
 #else
 					2;
 #endif
+#endif
 
 static int __init disable_randmaps(char *s)
 {
@@ -1706,9 +1710,6 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	if (!pfn_modify_allowed(pfn, pgprot))
-		return VM_FAULT_SIGBUS;
-
 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
 	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
@@ -1770,9 +1771,6 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 
 	track_pfn_insert(vma, &pgprot, pfn);
 
-	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
-		return VM_FAULT_SIGBUS;
-
 	/*
 	 * If we don't have pte special, then we have to use the pfn_valid()
 	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
@@ -1833,7 +1831,6 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 {
 	pte_t *pte;
 	spinlock_t *ptl;
-	int err = 0;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
@@ -1841,16 +1838,12 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	arch_enter_lazy_mmu_mode();
 	do {
 		BUG_ON(!pte_none(*pte));
-		if (!pfn_modify_allowed(pfn, prot)) {
-			err = -EACCES;
-			break;
-		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
-	return err;
+	return 0;
 }
 
 static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
@@ -1859,7 +1852,6 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 {
 	pmd_t *pmd;
 	unsigned long next;
-	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pmd = pmd_alloc(mm, pud, addr);
@@ -1868,10 +1860,9 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
 		next = pmd_addr_end(addr, end);
-		err = remap_pte_range(mm, pmd, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot);
-		if (err)
-			return err;
+		if (remap_pte_range(mm, pmd, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot))
+			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1882,7 +1873,6 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 {
 	pud_t *pud;
 	unsigned long next;
-	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pud = pud_alloc(mm, p4d, addr);
@@ -1890,10 +1880,9 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		err = remap_pmd_range(mm, pud, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot);
-		if (err)
-			return err;
+		if (remap_pmd_range(mm, pud, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot))
+			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
@@ -1904,7 +1893,6 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 {
 	p4d_t *p4d;
 	unsigned long next;
-	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	p4d = p4d_alloc(mm, pgd, addr);
@@ -1912,10 +1900,9 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		err = remap_pud_range(mm, p4d, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot);
-		if (err)
-			return err;
+		if (remap_pud_range(mm, p4d, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot))
+			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d78e9dbc5..a9b0dfc60 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -313,42 +313,6 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	return pages;
 }
 
-static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
-			       unsigned long next, struct mm_walk *walk)
-{
-	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
-}
-
-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				   unsigned long addr, unsigned long next,
-				   struct mm_walk *walk)
-{
-	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
-}
-
-static int prot_none_test(unsigned long addr, unsigned long next,
-			  struct mm_walk *walk)
-{
-	return 0;
-}
-
-static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
-			   unsigned long end, unsigned long newflags)
-{
-	pgprot_t new_pgprot = vm_get_page_prot(newflags);
-	struct mm_walk prot_none_walk = {
-		.pte_entry = prot_none_pte_entry,
-		.hugetlb_entry = prot_none_hugetlb_entry,
-		.test_walk = prot_none_test,
-		.mm = current->mm,
-		.private = &new_pgprot,
-	};
-
-	return walk_page_range(start, end, &prot_none_walk);
-}
-
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	unsigned long start, unsigned long end, unsigned long newflags)
@@ -366,19 +330,6 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 		return 0;
 	}
 
-	/*
-	 * Do PROT_NONE PFN permission checks here when we can still
-	 * bail out without undoing a lot of state. This is a rather
-	 * uncommon case, so doesn't need to be very optimized.
-	 */
-	if (arch_has_pfn_modify_check() &&
-	    (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-	    (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
-		error = prot_none_walk(vma, start, end, newflags);
-		if (error)
-			return error;
-	}
-
 	/*
 	 * If we make a private mapping writable we increase our commit;
 	 * but (without finer accounting) cannot reduce our commit if we
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 4/9] Popcorn system call additions
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
                     ` (2 preceding siblings ...)
  2020-04-29 19:32   ` [RFC 3/9] Temporary revert L1TF mitigation " Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 5/9] Popcorn Utility Javier Malave
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

The Popcorn system calls are a core component of
Popcorn Linux. All system calls have been added 
to kernel/sched/core.c

The main system call is popcorn_migrate. User
applications may invoke this system call to trigger
a migration from their origin node to a remote
node; and vice-versa. The system call relies on
several "server modules" (process server, vma server,
page server) to perform the migration and maintain
VMA coherency. A message layer for IPC has been
also added to communicate Popcorn messages across
the distributed threads. You may find a basic
example of the system call at the Popcorn kernel
library. All three system calls are showcased in the
Popcorn kernel library.
---
 arch/x86/entry/syscalls/syscall_64.tbl |  3 +++
 include/linux/syscalls.h               |  9 +++++++++
 include/uapi/asm-generic/unistd.h      | 11 +++++++++--
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e62..5f8aff57e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,9 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+434	64	popcorn_migrate		__x64_sys_popcorn_migrate
+435	64	popcorn_get_thread_status	__x64_sys_popcorn_get_thread_status
+436	64	popcorn_get_node_info	__x64_sys_popcorn_get_node_info
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2bcef4c70..e8e4430d5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1250,6 +1250,15 @@ ssize_t ksys_pread64(unsigned int fd, char __user *buf, size_t count,
 ssize_t ksys_pwrite64(unsigned int fd, const char __user *buf,
 		      size_t count, loff_t pos);
 int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+#ifdef CONFIG_POPCORN
+long ksys_brk(unsigned long brk);
+long ksys_mremap(unsigned long addr,
+		 unsigned long old_len, unsigned long new_len,
+		 unsigned long flags, unsigned long new_addr);
+long ksys_madvise(unsigned long start, size_t len, int behavior);
+long ksys_mprotect(unsigned long start, size_t len,
+		  unsigned long prot);
+#endif
 #ifdef CONFIG_ADVISE_SYSCALLS
 int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
 #else
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf..71a526a1b 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,16 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
-
+#ifdef CONFIG_POPCORN
+#define __NR_popcorn_migrate 434
+__SYSCALL(__NR_popcorn_migrate, sys_popcorn_migrate)
+#define __NR_popcorn_get_thread_status 435
+__SYSCALL(__NR_popcorn_get_thread_status, sys_popcorn_get_thread_status)
+#define __NR_popcorn_get_node_info 436
+__SYSCALL(__NR_popcorn_get_node_info, sys_popcorn_get_node_info)
+#endif
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 437
 
 /*
  * 32 bit systems traditionally used different
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 5/9] Popcorn Utility
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
                     ` (3 preceding siblings ...)
  2020-04-29 19:32   ` [RFC 4/9] Popcorn system call additions Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 6/9] Process Server for Popcorn Distributed Thread Execution Javier Malave
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

This patch contains utility and other
Popcorn supplementary modules, including
initialization routines.
---
 include/popcorn/bundle.h      |  38 ++++
 include/popcorn/debug.h       |  38 ++++
 include/popcorn/regset.h      |  96 +++++++++
 include/popcorn/stat.h        |  16 ++
 include/popcorn/types.h       |  20 ++
 kernel/popcorn/Makefile       |   7 +
 kernel/popcorn/bundle.c       | 115 +++++++++++
 kernel/popcorn/init.c         |  58 ++++++
 kernel/popcorn/stat.c         | 165 ++++++++++++++++
 kernel/popcorn/trace_events.h |  76 ++++++++
 kernel/popcorn/types.h        | 358 ++++++++++++++++++++++++++++++++++
 kernel/popcorn/util.c         | 121 ++++++++++++
 kernel/popcorn/util.h         |  14 ++
 kernel/popcorn/wait_station.c |  84 ++++++++
 kernel/popcorn/wait_station.h |  27 +++
 15 files changed, 1233 insertions(+)
 create mode 100644 include/popcorn/bundle.h
 create mode 100644 include/popcorn/debug.h
 create mode 100644 include/popcorn/regset.h
 create mode 100644 include/popcorn/stat.h
 create mode 100644 include/popcorn/types.h
 create mode 100644 kernel/popcorn/Makefile
 create mode 100644 kernel/popcorn/bundle.c
 create mode 100644 kernel/popcorn/init.c
 create mode 100644 kernel/popcorn/stat.c
 create mode 100644 kernel/popcorn/trace_events.h
 create mode 100644 kernel/popcorn/types.h
 create mode 100644 kernel/popcorn/util.c
 create mode 100644 kernel/popcorn/util.h
 create mode 100644 kernel/popcorn/wait_station.c
 create mode 100644 kernel/popcorn/wait_station.h

diff --git a/include/popcorn/bundle.h b/include/popcorn/bundle.h
new file mode 100644
index 000000000..280325e10
--- /dev/null
+++ b/include/popcorn/bundle.h
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef __POPCORN_RACK_H__
+#define __POPCORN_RACK_H__
+
+#define MAX_POPCORN_NODES 32
+#if (MAX_POPCORN_NODES > 62)
+#error Currently support up to 62 nodes
+#endif
+
+enum popcorn_arch {
+	POPCORN_ARCH_UNKNOWN = -1,
+	POPCORN_ARCH_X86 = 1,
+	POPCORN_ARCH_PPC = 2,
+	POPCORN_ARCH_MAX,
+};
+
+extern int my_nid;
+extern const enum popcorn_arch my_arch;
+
+bool get_popcorn_node_online(int nid);
+int get_popcorn_node_arch(int nid);
+int popcorn_nodes_init(void);
+void set_popcorn_node_online(int nid, bool online);
+void broadcast_my_node_info(int nr_nodes);
+
+struct popcorn_thread_status {
+	int current_nid;
+	int peer_nid;
+	pid_t peer_pid;
+};
+
+struct popcorn_node_info {
+	unsigned int status;
+	int arch;
+	int distance;
+};
+
+#endif
diff --git a/include/popcorn/debug.h b/include/popcorn/debug.h
new file mode 100644
index 000000000..4a12a7ba7
--- /dev/null
+++ b/include/popcorn/debug.h
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef __INCLUDE_POPCORN_DEBUG_H__
+#define __INCLUDE_POPCORN_DEBUG_H__
+
+#define PCNPRINTK(...) printk(KERN_INFO "popcorn: " __VA_ARGS__)
+#define PCNPRINTK_ERR(...) printk(KERN_ERR "popcorn: " __VA_ARGS__)
+
+#ifdef CONFIG_POPCORN_DEBUG
+#define PRINTK(...) printk(KERN_INFO __VA_ARGS__)
+#else
+#define PRINTK(...)
+#endif
+
+#ifdef CONFIG_POPCORN_DEBUG_PROCESS_SERVER
+#define PSPRINTK(...) printk(KERN_INFO __VA_ARGS__)
+#else
+#define PSPRINTK(...)
+#endif
+
+#ifdef CONFIG_POPCORN_DEBUG_VMA_SERVER
+#define VSPRINTK(...) printk(KERN_INFO __VA_ARGS__)
+#else
+#define VSPRINTK(...)
+#endif
+
+#ifdef CONFIG_POPCORN_DEBUG_PAGE_SERVER
+#define PGPRINTK(...) printk(KERN_INFO __VA_ARGS__)
+#else
+#define PGPRINTK(...)
+#endif
+
+#ifdef CONFIG_POPCORN_DEBUG_MSG_LAYER
+#define MSGPRINTK(...) printk(KERN_INFO __VA_ARGS__)
+#else
+#define MSGPRINTK(...)
+#endif
+
+#endif /*  __INCLUDE_POPCORN_DEBUG_H__ */
diff --git a/include/popcorn/regset.h b/include/popcorn/regset.h
new file mode 100644
index 000000000..c13c90525
--- /dev/null
+++ b/include/popcorn/regset.h
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+/*
+ * /include/popcorn/regset.h
+ *
+ *  This file provides the architecture specific macro and structures of the
+ *  helper functionality implementation of the process server
+ *
+ * author, Sharath Kumar Bhat, SSRG, VirginiaTech 2014
+ *
+ */
+
+#ifndef PROCESS_SERVER_ARCH_MACROS_H_
+#define PROCESS_SERVER_ARCH_MACROS_H_
+
+#include <popcorn/bundle.h>
+
+struct regset_x86_64 {
+	/* Program counter/instruction pointer */
+	uint64_t rip;
+
+	/* General purpose registers */
+	uint64_t rax, rdx, rcx, rbx,
+			 rsi, rdi, rbp, rsp,
+			 r8, r9, r10, r11,
+			 r12, r13, r14, r15;
+
+	/* Multimedia-extension (MMX) registers */
+	uint64_t mmx[8];
+
+	/* Streaming SIMD Extension (SSE) registers */
+	unsigned __int128 xmm[16];
+
+	/* x87 floating point registers */
+	long double st[8];
+
+	/* Segment registers */
+	uint32_t cs, ss, ds, es, fs, gs;
+
+	/* Flag register */
+	uint64_t rflags;
+};
+
+struct regset_aarch64 {
+	/* Stack pointer & program counter */
+	uint64_t sp;
+	uint64_t pc;
+
+	/* General purpose registers */
+	uint64_t x[31];
+
+	/* FPU/SIMD registers */
+	unsigned __int128 v[32];
+};
+
+struct regset_powerpc {
+	unsigned long nip;
+	unsigned long msr;
+	unsigned long ctr;
+	unsigned long link;
+	unsigned long xer;
+	unsigned long ccr;
+
+	unsigned long gpr[32];
+	uint64_t fpr[32];
+
+	unsigned long orig_gpr3;	/* Used for restarting system calls */
+	unsigned long softe;		/* Soft enabled/disabled */
+};
+
+struct field_arch {
+	unsigned long tls;
+	unsigned long oob[4];
+	bool fpu_active;
+
+	union {
+		unsigned long regsets;
+		struct regset_x86_64 regs_x86;
+		struct regset_aarch64 regs_aarch;
+		struct regset_powerpc regs_ppc;
+	};
+};
+
+static inline size_t regset_size(int arch) {
+	const size_t sizes[] = {
+		sizeof(struct regset_aarch64),
+		sizeof(struct regset_x86_64),
+		sizeof(struct regset_powerpc),
+	};
+
+	if(arch <= POPCORN_ARCH_UNKNOWN || arch >= POPCORN_ARCH_MAX)
+		return -EINVAL;
+
+	return sizes[arch];
+}
+
+#endif
diff --git a/include/popcorn/stat.h b/include/popcorn/stat.h
new file mode 100644
index 000000000..cbd024314
--- /dev/null
+++ b/include/popcorn/stat.h
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef __KERNEL_POPCORN_STAT_H__
+#define __KERNEL_POPCORN_STAT_H__
+
+struct pcn_kmsg_message;
+
+void account_pcn_message_sent(struct pcn_kmsg_message *msg);
+void account_pcn_message_recv(struct pcn_kmsg_message *msg);
+
+void account_pcn_rdma_write(size_t size);
+void account_pcn_rdma_read(size_t size);
+
+#define POPCORN_STAT_FMT  "%12llu  %12llu  %s\n"
+#define POPCORN_STAT_FMT2 "%8llu.%03llu  %8llu.%03llu  %s\n"
+
+#endif /* KERNEL_POPCORN_STAT_H_ */
diff --git a/include/popcorn/types.h b/include/popcorn/types.h
new file mode 100644
index 000000000..aaab9d923
--- /dev/null
+++ b/include/popcorn/types.h
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef __INCLUDE_POPCORN_TYPES_H__
+#define __INCLUDE_POPCORN_TYPES_H__
+
+#include <linux/sched.h>
+
+static inline bool distributed_process(struct task_struct *tsk)
+{
+	if (!tsk->mm) return false;
+	return !!tsk->mm->remote;
+}
+
+static inline bool distributed_remote_process(struct task_struct *tsk)
+{
+	return distributed_process(tsk) && tsk->at_remote;
+}
+
+#include <popcorn/debug.h>
+
+#endif /* __INCLUDE_POPCORN_TYPES_H__ */
diff --git a/kernel/popcorn/Makefile b/kernel/popcorn/Makefile
new file mode 100644
index 000000000..6821ec217
--- /dev/null
+++ b/kernel/popcorn/Makefile
@@ -0,0 +1,7 @@
+obj-$(CONFIG_POPCORN)	+= init.o util.o
+obj-$(CONFIG_POPCORN)	+= wait_station.o
+obj-$(CONFIG_POPCORN)	+= process_server.o vma_server.o
+obj-$(CONFIG_POPCORN)	+= page_server.o fh_action.o
+obj-$(CONFIG_POPCORN)	+= bundle.o
+obj-$(CONFIG_POPCORN)	+= pcn_kmsg.o
+obj-$(CONFIG_POPCORN)	+= stat.o
diff --git a/kernel/popcorn/bundle.c b/kernel/popcorn/bundle.c
new file mode 100644
index 000000000..54ffb59e1
--- /dev/null
+++ b/kernel/popcorn/bundle.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/bundle.c
+ *
+ * Popcorn node init
+ *
+ * Original file developed by SSRG at Virginia Tech.
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ */
+
+#include <asm/bug.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+
+#include <popcorn/pcn_kmsg.h>
+#include <popcorn/bundle.h>
+#include <popcorn/debug.h>
+#include "types.h"
+
+struct popcorn_node {
+	enum popcorn_arch arch;
+	int bundle_id;
+
+	bool is_connected;
+};
+
+static struct popcorn_node popcorn_nodes[MAX_POPCORN_NODES];
+
+bool get_popcorn_node_online(int nid)
+{
+	return popcorn_nodes[nid].is_connected;
+}
+EXPORT_SYMBOL(get_popcorn_node_online);
+
+void set_popcorn_node_online(int nid, bool online)
+{
+        popcorn_nodes[nid].is_connected = online;
+}
+EXPORT_SYMBOL(set_popcorn_node_online);
+
+int my_nid __read_mostly = -1;
+EXPORT_SYMBOL(my_nid);
+
+const enum popcorn_arch my_arch = POPCORN_ARCH_X86;
+EXPORT_SYMBOL(my_arch);
+
+int get_popcorn_node_arch(int nid)
+{
+	return popcorn_nodes[nid].arch;
+}
+EXPORT_SYMBOL(get_popcorn_node_arch);
+
+const char *archs_sz[] = {
+	"aarch64",
+	"x86_64",
+	"ppc64le",
+};
+
+void broadcast_my_node_info(int nr_nodes)
+{
+	int i;
+	node_info_t info = {
+		.nid = my_nid,
+		.arch = my_arch,
+	};
+	for (i = 0; i < nr_nodes; i++) {
+		if (i == my_nid)
+                        continue;
+		pcn_kmsg_send(PCN_KMSG_TYPE_NODE_INFO, i, &info, sizeof(info));
+	}
+}
+EXPORT_SYMBOL(broadcast_my_node_info);
+
+static bool my_node_info_printed = false;
+
+static int handle_node_info(struct pcn_kmsg_message *msg)
+{
+	node_info_t *info = (node_info_t *)msg;
+
+	if (my_nid != -1 && !my_node_info_printed) {
+		popcorn_nodes[my_nid].arch = my_arch;
+		my_node_info_printed = true;
+	}
+
+	PCNPRINTK("   %d joined, %s\n", info->nid, archs_sz[info->arch]);
+	popcorn_nodes[info->nid].arch = info->arch;
+	smp_mb();
+
+	pcn_kmsg_done(msg);
+	return 0;
+}
+
+int __init popcorn_nodes_init(void)
+{
+	int i;
+	BUG_ON(my_arch == POPCORN_ARCH_UNKNOWN);
+
+	for (i = 0; i < MAX_POPCORN_NODES; i++) {
+		struct popcorn_node *pn = popcorn_nodes + i;
+
+		pn->is_connected = false;
+		pn->arch = POPCORN_ARCH_UNKNOWN;
+		pn->bundle_id = -1;
+	}
+
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_NODE_INFO, node_info);
+
+	return 0;
+}
diff --git a/kernel/popcorn/init.c b/kernel/popcorn/init.c
new file mode 100644
index 000000000..a0cc9796f
--- /dev/null
+++ b/kernel/popcorn/init.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/init.c
+ *
+ * Popcorn node init
+ *
+ * Copyright (c) 2013 - 2014 Akshay Giridhar
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ * author, rewritten by Sang-Hoon Kim, 2016-2017
+ * author, modified by Antonio Barbalace, 2014
+ */
+
+#include <linux/kernel.h>
+#include <linux/workqueue.h>
+
+#include <popcorn/debug.h>
+#include "types.h"
+
+#define CREATE_TRACE_POINTS
+#include "trace_events.h"
+
+struct workqueue_struct *popcorn_wq;
+struct workqueue_struct *popcorn_ordered_wq;
+EXPORT_SYMBOL(popcorn_wq);
+EXPORT_SYMBOL(popcorn_ordered_wq);
+
+extern int pcn_kmsg_init(void);
+extern int popcorn_nodes_init(void);
+extern int process_server_init(void);
+extern int vma_server_init(void);
+extern int page_server_init(void);
+extern int statistics_init(void);
+
+static int __init popcorn_init(void)
+{
+	PRINTK("Initialize Popcorn subsystems...\n");
+
+	/*
+	 * Create work queues so that we can do bottom side
+	 * processing on data that was brought in by the
+	 * communications module interrupt handlers.
+	 */
+	popcorn_ordered_wq = create_singlethread_workqueue("pcn_wq_ordered");
+	popcorn_wq = alloc_workqueue("pcn_wq", WQ_MEM_RECLAIM, 0);
+
+	pcn_kmsg_init();
+
+	popcorn_nodes_init();
+	vma_server_init();
+	process_server_init();
+	page_server_init();
+
+	statistics_init();
+	return 0;
+}
+late_initcall(popcorn_init);
diff --git a/kernel/popcorn/stat.c b/kernel/popcorn/stat.c
new file mode 100644
index 000000000..55f05caf4
--- /dev/null
+++ b/kernel/popcorn/stat.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/stat.c
+ *
+ * Original file developed by SSRG at Virginia Tech.
+ *
+ * author , Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ */
+
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/slab.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/percpu.h>
+#include <asm/uaccess.h>
+
+#include <popcorn/pcn_kmsg.h>
+#include <popcorn/stat.h>
+
+static unsigned long long sent_stats[PCN_KMSG_TYPE_MAX] = {0};
+static unsigned long long recv_stats[PCN_KMSG_TYPE_MAX] = {0};
+
+static DEFINE_PER_CPU(unsigned long long, bytes_sent) = 0;
+static DEFINE_PER_CPU(unsigned long long, bytes_recv) = 0;
+static DEFINE_PER_CPU(unsigned long long, bytes_rdma_written) = 0;
+static DEFINE_PER_CPU(unsigned long long, bytes_rdma_read) = 0;
+
+static unsigned long long last_bytes_sent = 0;
+static unsigned long long last_bytes_recv = 0;
+static unsigned long long last_bytes_rdma_written = 0;
+static unsigned long long last_bytes_rdma_read = 0;
+static ktime_t last_stat = 0;
+
+const char *pcn_kmsg_type_name[PCN_KMSG_TYPE_MAX] = {
+	[PCN_KMSG_TYPE_TASK_MIGRATE] = "migration",
+	[PCN_KMSG_TYPE_VMA_INFO_REQUEST] = "VMA info",
+	[PCN_KMSG_TYPE_VMA_OP_REQUEST] = "VMA op",
+	[PCN_KMSG_TYPE_REMOTE_PAGE_REQUEST] = "remote page",
+	[PCN_KMSG_TYPE_PAGE_INVALIDATE_REQUEST] = "invalidate",
+	[PCN_KMSG_TYPE_FUTEX_REQUEST] = "futex",
+};
+
+void account_pcn_message_sent(struct pcn_kmsg_message *msg)
+{
+	struct pcn_kmsg_hdr *h = (struct pcn_kmsg_hdr *)msg;
+	this_cpu_add(bytes_sent, h->size);
+}
+
+void account_pcn_message_recv(struct pcn_kmsg_message *msg)
+{
+	struct pcn_kmsg_hdr *h = (struct pcn_kmsg_hdr *)msg;
+	this_cpu_add(bytes_recv, h->size);
+}
+
+void account_pcn_rdma_write(size_t size)
+{
+	this_cpu_add(bytes_rdma_written, size);
+}
+
+void account_pcn_rdma_read(size_t size)
+{
+	this_cpu_add(bytes_rdma_read, size);
+}
+
+void fh_action_stat(struct seq_file *seq, void *);
+
+static int __show_stats(struct seq_file *seq, void *v)
+{
+	int i;
+	unsigned long long sent = 0;
+	unsigned long long recv = 0;
+	ktime_t now;
+	unsigned long long rate_sent, rate_recv;
+	unsigned long elapsed;
+
+	now = ktime_get_real();
+	elapsed = last_stat - now;
+	last_stat = now;
+
+	for_each_present_cpu(i) {
+		sent += per_cpu(bytes_sent, i);
+		recv += per_cpu(bytes_recv, i);
+	}
+	seq_printf(seq, POPCORN_STAT_FMT, sent, recv, "Total network I/O");
+
+	rate_sent = (sent - last_bytes_sent);
+	rate_recv = (recv - last_bytes_recv);
+	seq_printf(seq, POPCORN_STAT_FMT2,
+			rate_sent / elapsed, (rate_sent % elapsed) * 1000 / elapsed,
+			rate_recv / elapsed, (rate_recv % elapsed) * 1000 / elapsed,
+			"MB/s");
+	last_bytes_sent = sent;
+	last_bytes_recv = recv;
+
+	if (pcn_kmsg_has_features(PCN_KMSG_FEATURE_RDMA) && elapsed) {
+		recv = sent = 0;
+		for_each_present_cpu(i) {
+			sent += per_cpu(bytes_rdma_written, i);
+			recv += per_cpu(bytes_rdma_read, i);
+		}
+		seq_printf(seq, POPCORN_STAT_FMT, sent, recv, "RDMA");
+
+		rate_sent = (sent - last_bytes_rdma_written);
+		rate_recv = (recv - last_bytes_rdma_read);
+		seq_printf(seq, POPCORN_STAT_FMT2,
+				rate_sent / elapsed, (rate_sent % elapsed) * 1000 / elapsed,
+				rate_recv / elapsed, (rate_recv % elapsed) * 1000 / elapsed,
+				"MB/s");
+		last_bytes_rdma_written = sent;
+		last_bytes_rdma_read = recv;
+	}
+
+	pcn_kmsg_stat(seq, NULL);
+
+	return 0;
+}
+
+static ssize_t __write_stats(struct file *file, const char __user *buffer,
+			     size_t size, loff_t *offset)
+{
+	int i;
+	for_each_present_cpu(i) {
+		per_cpu(bytes_sent, i) = 0;
+		per_cpu(bytes_recv, i) = 0;
+		per_cpu(bytes_rdma_written, i) = 0;
+		per_cpu(bytes_rdma_read, i) = 0;
+	}
+	pcn_kmsg_stat(NULL, NULL);
+
+	for (i = 0 ; i < PCN_KMSG_TYPE_MAX; i++) {
+		sent_stats[i] = 0;
+		recv_stats[i] = 0;
+	}
+	fh_action_stat(NULL, NULL);
+
+	return size;
+}
+
+static int __open_stats(struct inode *inode, struct file *file)
+{
+	return single_open(file, __show_stats, inode->i_private);
+}
+
+static struct file_operations stats_ops = {
+	.owner = THIS_MODULE,
+	.open = __open_stats,
+	.read = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release,
+	.write = __write_stats,
+};
+
+static struct proc_dir_entry *proc_entry = NULL;
+
+int statistics_init(void)
+{
+	proc_entry = proc_create("popcorn_stat", S_IRUGO | S_IWUGO, NULL, &stats_ops);
+	if (proc_entry == NULL) {
+		printk(KERN_ERR"cannot create proc_fs entry for popcorn stats\n");
+		return -ENOMEM;
+	}
+	return 0;
+}
diff --git a/kernel/popcorn/trace_events.h b/kernel/popcorn/trace_events.h
new file mode 100644
index 000000000..7ea5e8b6a
--- /dev/null
+++ b/kernel/popcorn/trace_events.h
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM popcorn
+
+#if !defined(_TRACE_EVENTS_POPCORN_H_) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_EVENTS_POPCORN_H_
+
+#include <linux/tracepoint.h>
+
+
+TRACE_EVENT(pgfault,
+	TP_PROTO(const int nid, const int pid, const char rw,
+		const unsigned long instr_addr, const unsigned long addr,
+		const int result),
+
+	TP_ARGS(nid, pid, rw, instr_addr, addr, result),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, pid)
+		__field(char, rw)
+		__field(unsigned long, instr_addr)
+		__field(unsigned long, addr)
+		__field(int, result)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->pid = pid;
+		__entry->rw = rw;
+		__entry->instr_addr = instr_addr;
+		__entry->addr = addr;
+		__entry->result = result;
+	),
+
+	TP_printk("%d %d %c %lx %lx %d",
+		__entry->nid, __entry->pid, __entry->rw,
+		__entry->instr_addr, __entry->addr, __entry->result)
+);
+
+
+TRACE_EVENT(pgfault_stat,
+	TP_PROTO(const unsigned long instr_addr, const unsigned long addr,
+		const int result, const int retries, const unsigned long time_us),
+
+	TP_ARGS(instr_addr, addr, result, retries, time_us),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, instr_addr)
+		__field(unsigned long, addr)
+		__field(int, result)
+		__field(int, retries)
+		__field(unsigned long, time_us)
+	),
+
+	TP_fast_assign(
+		__entry->instr_addr = instr_addr;
+		__entry->addr = addr;
+		__entry->result = result;
+		__entry->retries = retries;
+		__entry->time_us = time_us;
+	),
+
+	TP_printk("%lx %lx %d %d %lu",
+		__entry->instr_addr, __entry->addr, __entry->result,
+		__entry->retries, __entry->time_us)
+);
+
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+
+#define TRACE_INCLUDE_PATH ../../kernel/popcorn
+#define TRACE_INCLUDE_FILE trace_events
+#include <trace/define_trace.h>
diff --git a/kernel/popcorn/types.h b/kernel/popcorn/types.h
new file mode 100644
index 000000000..bd6f3db0e
--- /dev/null
+++ b/kernel/popcorn/types.h
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef __INTERNAL_POPCORN_TYPES_H__
+#define __INTERNAL_POPCORN_TYPES_H__
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/workqueue.h>
+#include <linux/signal.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+#include <linux/sched/task.h>
+#include <popcorn/pcn_kmsg.h>
+#include <popcorn/regset.h>
+#include <linux/sched.h>
+#include <popcorn/types.h>
+
+#define FAULTS_HASH 31
+
+/*
+ * Remote execution context
+ */
+struct remote_context {
+	struct list_head list;
+	atomic_t count;
+	struct mm_struct *mm;
+
+	int tgid;
+	bool for_remote;
+
+	/* Tracking page status */
+	struct radix_tree_root pages;
+
+	/* For page replication protocol */
+	spinlock_t faults_lock[FAULTS_HASH];
+	struct hlist_head faults[FAULTS_HASH];
+
+	/* For VMA management */
+	spinlock_t vmas_lock;
+	struct list_head vmas;
+
+	/* Remote worker */
+	bool stop_remote_worker;
+
+	struct task_struct *remote_worker;
+	struct completion remote_works_ready;
+	spinlock_t remote_works_lock;
+	struct list_head remote_works;
+
+	pid_t remote_tgids[MAX_POPCORN_NODES];
+};
+
+struct remote_context *__get_mm_remote(struct mm_struct *mm);
+struct remote_context *get_task_remote(struct task_struct *tsk);
+bool put_task_remote(struct task_struct *tsk);
+bool __put_task_remote(struct remote_context *rc);
+
+
+/*
+ * Process migration
+ */
+#define BACK_MIGRATION_FIELDS \
+	int remote_nid;\
+	pid_t remote_pid;\
+	pid_t origin_pid;\
+	unsigned int personality;\
+	struct field_arch arch;
+DEFINE_PCN_KMSG(back_migration_request_t, BACK_MIGRATION_FIELDS);
+
+#define CLONE_FIELDS \
+	pid_t origin_tgid;\
+	pid_t origin_pid;\
+	unsigned long task_size; \
+	unsigned long stack_start; \
+	unsigned long env_start;\
+	unsigned long env_end;\
+	unsigned long arg_start;\
+	unsigned long arg_end;\
+	unsigned long start_brk;\
+	unsigned long brk;\
+	unsigned long start_code ;\
+	unsigned long end_code;\
+	unsigned long start_data;\
+	unsigned long end_data;\
+	unsigned int personality;\
+	unsigned long def_flags;\
+	char exe_path[512];\
+	struct field_arch arch;
+DEFINE_PCN_KMSG(clone_request_t, CLONE_FIELDS);
+
+
+/*
+ * This message is sent in response to a clone request.
+ * Its purpose is to notify the requesting cpu that make
+ * the specified pid is executing on behalf of the
+ * requesting cpu.
+ */
+#define REMOTE_TASK_PAIRING_FIELDS \
+	pid_t my_tgid; \
+	pid_t my_pid; \
+	pid_t your_pid;
+DEFINE_PCN_KMSG(remote_task_pairing_t, REMOTE_TASK_PAIRING_FIELDS);
+
+
+#define REMOTE_TASK_EXIT_FIELDS  \
+	pid_t origin_pid; \
+	pid_t remote_pid; \
+	int exit_code;
+DEFINE_PCN_KMSG(remote_task_exit_t, REMOTE_TASK_EXIT_FIELDS);
+
+#define ORIGIN_TASK_EXIT_FIELDS \
+	pid_t origin_pid; \
+	pid_t remote_pid; \
+	int exit_code;
+DEFINE_PCN_KMSG(origin_task_exit_t, ORIGIN_TASK_EXIT_FIELDS);
+
+
+/*
+ * VMA management
+ */
+#define VMA_INFO_REQUEST_FIELDS \
+	pid_t origin_pid;	\
+	pid_t remote_pid;	\
+	unsigned long addr;
+DEFINE_PCN_KMSG(vma_info_request_t, VMA_INFO_REQUEST_FIELDS);
+
+#define VMA_INFO_RESPONSE_FIELDS \
+	pid_t remote_pid;	 \
+	int result;		 \
+	unsigned long addr;	 \
+	unsigned long vm_start;	 \
+	unsigned long vm_end;	 \
+	unsigned long vm_flags;	 \
+	unsigned long vm_pgoff;	 \
+	char vm_file_path[512];
+DEFINE_PCN_KMSG(vma_info_response_t, VMA_INFO_RESPONSE_FIELDS);
+
+#define vma_info_anon(x) ((x)->vm_file_path[0] == '\0' ? true : false)
+
+
+#define VMA_OP_REQUEST_FIELDS \
+	pid_t origin_pid; \
+	pid_t remote_pid; \
+	int remote_ws; \
+	int operation; \
+	union { \
+		unsigned long addr; \
+		unsigned long start; \
+		unsigned long brk; \
+	}; \
+	union { \
+		unsigned long len;		/* mmap */ \
+		unsigned long old_len;	/* mremap */ \
+	}; \
+	union { \
+		unsigned long prot;		/* mmap */ \
+		int behavior;			/* madvise */ \
+		unsigned long new_len;	/* mremap */ \
+	}; \
+	unsigned long flags;		/* mmap, remap */ \
+	union { \
+		unsigned long pgoff;	/* mmap */ \
+		unsigned long new_addr;	/* mremap */ \
+	}; \
+	char path[512];
+DEFINE_PCN_KMSG(vma_op_request_t, VMA_OP_REQUEST_FIELDS);
+
+#define VMA_OP_RESPONSE_FIELDS \
+	pid_t origin_pid;      \
+	pid_t remote_pid;      \
+	int remote_ws;	       \
+	int operation;	       \
+	long ret;	       \
+	union {			    \
+		unsigned long addr; \
+		unsigned long start; \
+		unsigned long brk; \
+	}; \
+	unsigned long len;
+DEFINE_PCN_KMSG(vma_op_response_t, VMA_OP_RESPONSE_FIELDS);
+
+
+/*
+ * Page management
+ */
+#define REMOTE_PAGE_REQUEST_FIELDS \
+	pid_t origin_pid;	   \
+	int origin_ws;		   \
+	pid_t remote_pid;	   \
+	unsigned long addr;	   \
+	unsigned long fault_flags; \
+	unsigned long instr_addr;  \
+	dma_addr_t rdma_addr;	   \
+	u32 rdma_key;
+DEFINE_PCN_KMSG(remote_page_request_t, REMOTE_PAGE_REQUEST_FIELDS);
+
+#define REMOTE_PAGE_RESPONSE_COMMON_FIELDS \
+	pid_t remote_pid;		   \
+	pid_t origin_pid;		   \
+	int origin_ws;			   \
+	unsigned long addr;		   \
+	int result;
+
+#define REMOTE_PAGE_RESPONSE_FIELDS \
+	REMOTE_PAGE_RESPONSE_COMMON_FIELDS \
+	unsigned char page[PAGE_SIZE];
+DEFINE_PCN_KMSG(remote_page_response_t, REMOTE_PAGE_RESPONSE_FIELDS);
+
+#define REMOTE_PAGE_GRANT_FIELDS \
+	REMOTE_PAGE_RESPONSE_COMMON_FIELDS
+DEFINE_PCN_KMSG(remote_page_response_short_t, REMOTE_PAGE_GRANT_FIELDS);
+
+
+#define REMOTE_PAGE_FLUSH_COMMON_FIELDS \
+	pid_t origin_pid;		\
+	int remote_nid;			\
+	pid_t remote_pid;		\
+	int remote_ws;			\
+	unsigned long addr;		\
+	unsigned long flags;
+
+#define REMOTE_PAGE_FLUSH_FIELDS \
+	REMOTE_PAGE_FLUSH_COMMON_FIELDS \
+	unsigned char page[PAGE_SIZE];
+DEFINE_PCN_KMSG(remote_page_flush_t, REMOTE_PAGE_FLUSH_FIELDS);
+
+#define REMOTE_PAGE_RELEASE_FIELDS \
+	REMOTE_PAGE_FLUSH_COMMON_FIELDS
+DEFINE_PCN_KMSG(remote_page_release_t, REMOTE_PAGE_RELEASE_FIELDS);
+
+#define REMOTE_PAGE_FLUSH_ACK_FIELDS \
+	int remote_ws; \
+	unsigned long flags;
+DEFINE_PCN_KMSG(remote_page_flush_ack_t, REMOTE_PAGE_FLUSH_ACK_FIELDS);
+
+
+#define PAGE_INVALIDATE_REQUEST_FIELDS \
+	pid_t origin_pid; \
+	int origin_ws; \
+	pid_t remote_pid; \
+	unsigned long addr;
+DEFINE_PCN_KMSG(page_invalidate_request_t, PAGE_INVALIDATE_REQUEST_FIELDS);
+
+#define PAGE_INVALIDATE_RESPONSE_FIELDS \
+	pid_t origin_pid; \
+	int origin_ws; \
+	pid_t remote_pid;
+DEFINE_PCN_KMSG(page_invalidate_response_t, PAGE_INVALIDATE_RESPONSE_FIELDS);
+
+
+/*
+ * Futex
+ */
+#define REMOTE_FUTEX_REQ_FIELDS \
+	pid_t origin_pid;	\
+	int remote_ws;		\
+	int op;			\
+	u32 val;		\
+	struct timespec64 ts;	\
+	void *uaddr;		\
+	void *uaddr2;		\
+	u32 val2;		\
+	u32 val3;
+DEFINE_PCN_KMSG(remote_futex_request, REMOTE_FUTEX_REQ_FIELDS);
+
+#define REMOTE_FUTEX_RES_FIELDS \
+	int remote_ws;		\
+	long ret;
+DEFINE_PCN_KMSG(remote_futex_response, REMOTE_FUTEX_RES_FIELDS);
+
+/*
+ * Node information
+ */
+#define NODE_INFO_FIELDS \
+	int nid;	 \
+	int bundle_id;	 \
+	int arch;
+DEFINE_PCN_KMSG(node_info_t, NODE_INFO_FIELDS);
+
+
+/*
+ * Schedule server. Not yet completely ported though
+ */
+#define SCHED_PERIODIC_FIELDS \
+	int power_1;	      \
+	int power_2;	      \
+	int power_3;
+DEFINE_PCN_KMSG(sched_periodic_req, SCHED_PERIODIC_FIELDS);
+
+/*
+ * Message routing using work queues
+ */
+extern struct workqueue_struct *popcorn_wq;
+extern struct workqueue_struct *popcorn_ordered_wq;
+
+struct pcn_kmsg_work {
+	struct work_struct work;
+	void *msg;
+};
+
+static inline int __handle_popcorn_work(struct pcn_kmsg_message *msg,
+					void (*handler)(struct work_struct *),
+					struct workqueue_struct *wq)
+{
+	struct pcn_kmsg_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);
+	BUG_ON(!w);
+
+	w->msg = msg;
+	INIT_WORK(&w->work, handler);
+	smp_wmb();
+	queue_work(wq, &w->work);
+
+	return 0;
+}
+
+int request_remote_work(pid_t pid, struct pcn_kmsg_message *req);
+
+#define DEFINE_KMSG_WQ_HANDLER(x) \
+static inline int handle_##x(struct pcn_kmsg_message *msg) {\
+	return __handle_popcorn_work(msg, process_##x, popcorn_wq);\
+}
+#define DEFINE_KMSG_ORDERED_WQ_HANDLER(x) \
+static inline int handle_##x(struct pcn_kmsg_message *msg) {\
+	return __handle_popcorn_work(msg, process_##x, popcorn_ordered_wq);\
+}
+#define DEFINE_KMSG_RW_HANDLER(x,type,member) \
+static inline int handle_##x(struct pcn_kmsg_message *msg) {\
+	type *req = (type *)msg; \
+	return request_remote_work(req->member, msg); \
+}
+
+#define REGISTER_KMSG_WQ_HANDLER(x, y) \
+	pcn_kmsg_register_callback(x, handle_##y)
+
+#define REGISTER_KMSG_HANDLER(x, y) \
+	pcn_kmsg_register_callback(x, handle_##y)
+
+#define START_KMSG_WORK(type, name, work) \
+	struct pcn_kmsg_work *__pcn_kmsg_work__ = (struct pcn_kmsg_work *)(work); \
+	type *name = __pcn_kmsg_work__->msg
+
+#define END_KMSG_WORK(name) \
+	pcn_kmsg_done(name); \
+	kfree(__pcn_kmsg_work__);
+
+static inline struct task_struct *__get_task_struct(pid_t pid)
+{
+	struct task_struct *tsk = NULL;
+	rcu_read_lock();
+	tsk = find_task_by_vpid(pid);
+	if (likely(tsk)) {
+		get_task_struct(tsk);
+	}
+	rcu_read_unlock();
+	return tsk;
+}
+
+#endif /* __INTERNAL_POPCORN_TYPES_H__ */
diff --git a/kernel/popcorn/util.c b/kernel/popcorn/util.c
new file mode 100644
index 000000000..cb031200b
--- /dev/null
+++ b/kernel/popcorn/util.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/util.c
+ *
+ * General Utility Functions
+ *
+ * Original file developed by SSRG at Virginia Tech.
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/sched/task_stack.h>
+#include <popcorn/bundle.h>
+
+void print_page_data(unsigned char *addr)
+{
+	int i;
+	for (i = 0; i < PAGE_SIZE; i++) {
+		if (i % 16 == 0) {
+			printk(KERN_INFO"%08lx:", (unsigned long)(addr + i));
+		}
+		if (i % 4 == 0) {
+			printk(" ");
+		}
+		printk("%02x", *(addr + i));
+	}
+	printk("\n");
+}
+
+void print_page_signature(unsigned char *addr)
+{
+	unsigned char *p = addr;
+	int i, j;
+	for (i = 0; i < PAGE_SIZE / 128; i++) {
+		unsigned char signature = 0;
+		for (j = 0; j < 32; j++) {
+			signature = (signature + *p++) & 0xff;
+		}
+		printk("%02x", signature);
+	}
+	printk("\n");
+}
+
+void print_page_signature_pid(pid_t pid, unsigned char *addr)
+{
+	printk("  [%d] ", pid);
+	print_page_signature(addr);
+}
+
+static DEFINE_SPINLOCK(__print_lock);
+static char *__print_buffer = NULL;
+
+void print_page_owner(unsigned long addr, unsigned long *owners, pid_t pid)
+{
+	if (unlikely(!__print_buffer)) {
+		__print_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	}
+	spin_lock(&__print_lock);
+	bitmap_print_to_pagebuf(
+			true, __print_buffer, owners, MAX_POPCORN_NODES);
+	printk("  [%d] %lx %s", pid, addr, __print_buffer);
+	spin_unlock(&__print_lock);
+}
+
+#include <linux/fs.h>
+
+static DEFINE_SPINLOCK(__file_path_lock);
+static char *__file_path_buffer = NULL;
+
+int get_file_path(struct file *file, char *sz, size_t size)
+{
+	char *ppath;
+	int retval = 0;
+
+	if (!file) {
+		BUG_ON(size < 1);
+		sz[0] = '\0';
+		return -EINVAL;
+	}
+
+	if (unlikely(!__file_path_buffer)) {
+		__file_path_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	}
+
+	spin_lock(&__file_path_lock);
+	ppath = file_path(file, __file_path_buffer, PAGE_SIZE);
+	if (IS_ERR(ppath)) {
+		retval = -ESRCH;
+		goto out_unlock;
+	}
+
+	strncpy(sz, ppath, size);
+
+out_unlock:
+	spin_unlock(&__file_path_lock);
+	return 0;
+}
+
+
+static const char *__comm_to_trace[] = {
+};
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+
+void trace_task_status(void)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(__comm_to_trace); i++) {
+		const char *comm = __comm_to_trace[i];
+		if (memcmp(current->comm, comm, strlen(comm)) == 0) {
+			printk("@@[%d] %s %lx\n", current->pid,
+					current->comm, instruction_pointer(current_pt_regs()));
+			break;
+		}
+	}
+}
diff --git a/kernel/popcorn/util.h b/kernel/popcorn/util.h
new file mode 100644
index 000000000..b94fb192a
--- /dev/null
+++ b/kernel/popcorn/util.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef __POPCORN_KERNEL_UTIL_H__
+#define __POPCORN_KERNEL_UTIL_H__
+struct page;
+
+void print_page_data(unsigned char *addr);
+void print_page_signature(unsigned char *addr);
+void print_page_signature_pid(pid_t pid, unsigned char *addr);
+void print_page_owner(unsigned long addr, unsigned long *owners, pid_t pid);
+
+int get_file_path(struct file *file, char *sz, size_t size);
+
+void trace_task_status(void);
+#endif
diff --git a/kernel/popcorn/wait_station.c b/kernel/popcorn/wait_station.c
new file mode 100644
index 000000000..3367222f4
--- /dev/null
+++ b/kernel/popcorn/wait_station.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/wait_station.c
+ *
+ * Waiting stations allows threads to be waited for a given
+ * number of events are completed
+ *
+ * Original file developed by SSRG at Virginia Tech.
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/err.h>
+
+#include "wait_station.h"
+
+#define MAX_WAIT_STATIONS 1024
+
+static struct wait_station wait_stations[MAX_WAIT_STATIONS];
+
+static DEFINE_SPINLOCK(wait_station_lock);
+static DECLARE_BITMAP(wait_station_available, MAX_WAIT_STATIONS) = { 0 };
+
+struct wait_station *get_wait_station_multiple(struct task_struct *tsk,
+					       int count)
+{
+	int id;
+	struct wait_station *ws;
+
+	spin_lock(&wait_station_lock);
+	id = find_first_zero_bit(wait_station_available, MAX_WAIT_STATIONS);
+	BUG_ON(id >= MAX_WAIT_STATIONS);
+	ws = wait_stations + id;
+	set_bit(id, wait_station_available);
+	spin_unlock(&wait_station_lock);
+
+	ws->id = id;
+	ws->pid = tsk->pid;
+	ws->private = (void *)0xbad0face;
+	init_completion(&ws->pendings);
+	atomic_set(&ws->pendings_count, count);
+	smp_wmb();
+
+	return ws;
+}
+EXPORT_SYMBOL_GPL(get_wait_station_multiple);
+
+struct wait_station *wait_station(int id)
+{
+	smp_rmb();
+	return wait_stations + id;
+}
+EXPORT_SYMBOL_GPL(wait_station);
+
+void put_wait_station(struct wait_station *ws)
+{
+	int id = ws->id;
+	spin_lock(&wait_station_lock);
+	BUG_ON(!test_bit(id, wait_station_available));
+	clear_bit(id, wait_station_available);
+	spin_unlock(&wait_station_lock);
+}
+EXPORT_SYMBOL_GPL(put_wait_station);
+
+void *wait_at_station(struct wait_station *ws)
+{
+	void *ret;
+	if (!try_wait_for_completion(&ws->pendings)) {
+		if (wait_for_completion_io_timeout(&ws->pendings, 300 * HZ) == 0) {
+			ret = ERR_PTR(-ETIMEDOUT);
+			goto out;
+		}
+	}
+	smp_rmb();
+	ret = (void *)ws->private;
+out:
+	put_wait_station(ws);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(wait_at_station);
diff --git a/kernel/popcorn/wait_station.h b/kernel/popcorn/wait_station.h
new file mode 100644
index 000000000..3ced49b84
--- /dev/null
+++ b/kernel/popcorn/wait_station.h
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef _POPCORN_WAIT_STATION_H_
+#define _POPCORN_WAIT_STATION_H_
+
+#include <linux/completion.h>
+#include <linux/atomic.h>
+
+struct wait_station {
+	int id;
+	pid_t pid;
+	volatile void *private;
+	struct completion pendings;
+	atomic_t pendings_count;
+};
+
+struct task_struct;
+
+struct wait_station *get_wait_station_multiple(struct task_struct *tsk,
+					int count);
+static inline struct wait_station *get_wait_station(struct task_struct *tsk)
+{
+	return get_wait_station_multiple(tsk, 1);
+}
+struct wait_station *wait_station(int id);
+void put_wait_station(struct wait_station *ws);
+void *wait_at_station(struct wait_station *ws);
+#endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 6/9] Process Server for Popcorn Distributed Thread Execution
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
                     ` (4 preceding siblings ...)
  2020-04-29 19:32   ` [RFC 5/9] Popcorn Utility Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 7/9] Virtual Memory Address Server for " Javier Malave
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

This is the core migration and back migration processor.
It contains handlers for migration, back migration and
distributed futex messages. It also handles remote process
exits.

This module is triggered by popcorn_migrate syscall via
the function process_server_do_migration. If the process
that triggers the system call is its origin node, then
__do_migration triggers the forward migration process to
the user specified destination node. The forward migration 
process uses a kernel thread called remote_thread_main.

Otherwise, if the process is in a remote node, it already
has been migrated away, so __do_back_migration triggers the 
back migration process to the specified destination node 
(which in this case should be the origin node).

Process Server is in charge of processing additional
functionality. When nodes receive Popcorn messages (see 
RFC 9/9  Message Layer) specific handlers to each Popcorn
server post work to the Popcorn work queues.    

The Process Server calls upon the appropriate "process"
function to complete the work. There are two routines in charge 
of this: process_remote_works and run_remote_worker.  
---
 arch/x86/kernel/process_server.c |  250 +++++++
 include/popcorn/process_server.h |   18 +
 kernel/popcorn/process_server.c  | 1037 ++++++++++++++++++++++++++++++
 kernel/popcorn/process_server.h  |   21 +
 4 files changed, 1326 insertions(+)
 create mode 100644 arch/x86/kernel/process_server.c
 create mode 100644 include/popcorn/process_server.h
 create mode 100644 kernel/popcorn/process_server.c
 create mode 100644 kernel/popcorn/process_server.h

diff --git a/arch/x86/kernel/process_server.c b/arch/x86/kernel/process_server.c
new file mode 100644
index 000000000..4819efb33
--- /dev/null
+++ b/arch/x86/kernel/process_server.c
@@ -0,0 +1,250 @@
+/*
+ * File:
+ * 	process_server.c
+ *
+ * Description:
+ * 	this file implements the x86 architecture specific
+ *  helper functionality of the process server
+ *
+ * Created on:
+ * 	Sep 19, 2014
+ *
+ * Author:
+ * 	Sharath Kumar Bhat, SSRG, VirginiaTech
+ *
+ */
+
+/* File includes */
+#include <linux/sched.h>
+#include <linux/kdebug.h>
+#include <linux/ptrace.h>
+#include <asm/uaccess.h>
+#include <asm/prctl.h>
+#include <asm/proto.h>
+#include <asm/desc.h>
+#include <asm/fpu/internal.h>
+
+#include <popcorn/types.h>
+#include <popcorn/regset.h>
+
+/*
+ * Function:
+ *		save_thread_info
+ *
+ * Description:
+ *		this function saves the architecture specific info of the task
+ *		to the struct struct field_arch structure passed
+ *
+ * Input:
+ *	regs,	pointer to the pt_regs field of the task
+ *
+ * Output:
+ *	arch,	pointer to the struct field_arch structure type where the
+ *			architecture specific information of the task has to be
+ *			saved
+ *
+ * Return value:
+ *	on success, returns 0
+ * 	on failure, returns negative integer
+ */
+int save_thread_info(struct field_arch *arch)
+{
+	unsigned short fsindex, gsindex;
+	unsigned long ds, es, fs, gs;
+	int cpu;
+
+	BUG_ON(!arch);
+
+	cpu = get_cpu();
+
+	/*
+	 * Segments
+	 * CS and SS are set during the user/kernel mode switch.
+	 * Thus, nothing to do with them.
+	 */
+
+	ds = current->thread.ds;
+	es = current->thread.es;
+
+	savesegment(fs, fsindex);
+	if (fsindex) {
+		fs = get_desc_base(current->thread.tls_array + current->thread.fsbase);
+	} else {
+		rdmsrl(MSR_FS_BASE, fs);
+	}
+
+	savesegment(gs, gsindex);
+	if (gsindex) {
+		gs = get_desc_base(current->thread.tls_array + current->thread.gsbase);
+	} else {
+		rdmsrl(MSR_KERNEL_GS_BASE, gs);
+	}
+
+	WARN_ON(ds);
+	WARN_ON(es);
+	WARN_ON(gs);
+	arch->tls = fs;
+
+	put_cpu();
+
+	/*
+	PSPRINTK("%s [%d] tls %lx\n", __func__, current->pid, arch->tls);
+	*/
+
+	return 0;
+}
+
+
+/*
+ * Function:
+ *		restore_thread_info
+ *
+ * Description:
+ *		this function restores the architecture specific info of the
+ *		task from the struct field_arch structure passed
+ *
+ * Input:
+ * 	arch,	pointer to the struct field_arch structure type from which the
+ *			architecture specific information of the task has to be
+ *			restored
+ *
+ *	restore_segments,
+ *			restore segmentations as well if segs is true. Unless, do
+ *			not restore the segmentation units (for back migration)
+ *
+ * Output:
+ *	none
+ *
+ * Return value:
+ *	on success, returns 0
+ * 	on failure, returns negative integer
+ */
+int restore_thread_info(struct field_arch *arch, bool restore_segments)
+{
+	struct pt_regs *regs = current_pt_regs();
+	struct regset_x86_64 *regset = &arch->regs_x86;
+	int cpu;
+
+	cpu = get_cpu();
+
+	regs->r15 = regset->r15;
+	regs->r14 = regset->r14;
+	regs->r13 = regset->r13;
+	regs->r12 = regset->r12;
+	regs->bp = regset->rbp;
+	regs->bx = regset->rbx;
+
+	regs->r11 = regset->r11;
+	regs->r10 = regset->r10;
+	regs->r9 = regset->r9;
+	regs->r8 = regset->r8;
+	regs->ax = regset->rax;
+	regs->cx = regset->rcx;
+	regs->dx = regset->rdx;
+	regs->si = regset->rsi;
+	regs->di = regset->rdi;
+
+	regs->ip = regset->rip;
+	regs->sp = regset->rsp;
+	regs->flags = regset->rflags;
+
+	if (restore_segments) {
+		regs->cs = __USER_CS;
+		regs->ss = __USER_DS;
+
+		/*
+		current->thread.ds = regset->ds;
+		current->thread.es = regset->es;
+		*/
+
+		if (arch->tls) {
+			do_arch_prctl_64(current, ARCH_SET_FS, arch->tls);
+		}
+		/*
+		if (arch->thread_gs) {
+			do_arch_prctl_64(current, ARCH_SET_GS, arch->thread_gs);
+		}
+		*/
+	}
+
+	put_cpu();
+
+#ifdef CONFIG_POPCORN_DEBUG_VERBOSE
+	PSPRINTK("%s [%d] ip %lx\n", __func__, current->pid, regs->ip);
+	PSPRINTK("%s [%d] sp %lx bp %lx\n", __func__, current->pid, regs->sp, regs->bp);
+#endif
+	return 0;
+}
+
+#include <asm/stacktrace.h>
+noinline_for_stack void update_frame_pointer(void)
+{
+	unsigned long *rbp;
+	rbp = __builtin_frame_address(0); /* update_frame_pointer */
+
+	/* User rbp is at one stack frames below */
+	*rbp = current_pt_regs()->bp;	/* sched_migrate */
+}
+
+
+/*
+ * Function:
+ *		dump_processor_regs
+ *
+ * Description:
+ *		this function prints the architecture specific registers specified
+ *		in the input argument
+ *
+ * Input:
+ * 	task,	pointer to the architecture specific registers
+ *
+ * Output:
+ * 	none
+ *
+ * Return value:
+ *	void
+ *
+ * Why don't use show_all() for x86?
+ */
+void dump_processor_regs(struct pt_regs* regs)
+{
+	unsigned long fs, gs;
+	unsigned long fsindex, gsindex;
+
+	dump_stack();
+	if (!regs) return;
+	printk(KERN_ALERT"DUMP REGS %s\n", __func__);
+
+	if (NULL != regs) {
+		printk(KERN_ALERT"r15{%lx}\n",regs->r15);
+		printk(KERN_ALERT"r14{%lx}\n",regs->r14);
+		printk(KERN_ALERT"r13{%lx}\n",regs->r13);
+		printk(KERN_ALERT"r12{%lx}\n",regs->r12);
+		printk(KERN_ALERT"r11{%lx}\n",regs->r11);
+		printk(KERN_ALERT"r10{%lx}\n",regs->r10);
+		printk(KERN_ALERT"r9{%lx}\n",regs->r9);
+		printk(KERN_ALERT"r8{%lx}\n",regs->r8);
+		printk(KERN_ALERT"bp{%lx}\n",regs->bp);
+		printk(KERN_ALERT"bx{%lx}\n",regs->bx);
+		printk(KERN_ALERT"ax{%lx}\n",regs->ax);
+		printk(KERN_ALERT"cx{%lx}\n",regs->cx);
+		printk(KERN_ALERT"dx{%lx}\n",regs->dx);
+		printk(KERN_ALERT"di{%lx}\n",regs->di);
+		printk(KERN_ALERT"orig_ax{%lx}\n",regs->orig_ax);
+		printk(KERN_ALERT"ip{%lx}\n",regs->ip);
+		printk(KERN_ALERT"cs{%lx}\n",regs->cs);
+		printk(KERN_ALERT"flags{%lx}\n",regs->flags);
+		printk(KERN_ALERT"sp{%lx}\n",regs->sp);
+		printk(KERN_ALERT"ss{%lx}\n",regs->ss);
+	}
+	rdmsrl(MSR_FS_BASE, fs);
+	rdmsrl(MSR_GS_BASE, gs);
+	printk(KERN_ALERT"fs{%lx} - %lx content %lx\n",fs, current->thread.fsbase, fs ? * (unsigned long*)fs : 0x1234567l);
+	printk(KERN_ALERT"gs{%lx} - %lx content %lx\n",gs, current->thread.gsbase, fs ? * (unsigned long*)gs : 0x1234567l);
+
+	savesegment(fs, fsindex);
+	savesegment(gs, gsindex);
+	printk(KERN_ALERT"fsindex{%lx} - %x\n",fsindex, current->thread.fsindex);
+	printk(KERN_ALERT"gsindex{%lx} - %x\n",gsindex, current->thread.gsindex);
+	printk(KERN_ALERT"REGS DUMP COMPLETE\n");
+}
diff --git a/include/popcorn/process_server.h b/include/popcorn/process_server.h
new file mode 100644
index 000000000..ebe0787c6
--- /dev/null
+++ b/include/popcorn/process_server.h
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef __POPCORN_PROCESS_SERVER_H
+#define __POPCORN_PROCESS_SERVER_H
+
+
+int process_server_do_migration(struct task_struct *tsk, unsigned int dst_nid,
+				void __user *uregs);
+int process_server_task_exit(struct task_struct *tsk);
+int update_frame_pointer(void);
+
+long process_server_do_futex_at_remote(u32 __user *uaddr, int op, u32 val,
+				       bool valid_ts, struct timespec64 *ts,
+				       u32 __user *uaddr2, u32 val2, u32 val3);
+
+struct remote_context;
+void free_remote_context(struct remote_context *rc);
+
+#endif /* __POPCORN_PROCESS_SERVER_H */
diff --git a/kernel/popcorn/process_server.c b/kernel/popcorn/process_server.c
new file mode 100644
index 000000000..48b72b7a5
--- /dev/null
+++ b/kernel/popcorn/process_server.c
@@ -0,0 +1,1037 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/process_server.c
+ *
+ * Popcorn Linux thread migration implementation.
+ *
+ * This is Popcorn's core migration handler. It
+ * also defines and registers handlers for
+ * other key Popcorn work.
+ *
+ * This work was an extension of David Katz MS Thesis,
+ * rewritten by Sang-Hoon to support multithread
+ * environment.
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ * author Sang-Hoon Kim, SSRG Virginia Tech 2017
+ * author Antonio Barbalace, SSRG Virginia Tech 2014-2016
+ * author Vincent Legout, Sharat Kumar Bath, Ajithchandra Saya, SSRG Virginia Tech 2014-2015
+ * author David Katz, Marina Sadini, SSRG Virginia 2013
+ */
+
+#include <linux/sched.h>
+#include <linux/threads.h>
+#include <linux/slab.h>
+#include <linux/kthread.h>
+#include <linux/ptrace.h>
+#include <linux/mmu_context.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <linux/sched/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/mmu_context.h>
+#include <asm/kdebug.h>
+
+#include <popcorn/bundle.h>
+
+#include "types.h"
+#include "process_server.h"
+#include "vma_server.h"
+#include "page_server.h"
+#include "wait_station.h"
+#include "util.h"
+
+static struct list_head remote_contexts[2];
+static spinlock_t remote_contexts_lock[2];
+
+inline void __lock_remote_contexts(spinlock_t *remote_contexts_lock, int index)
+{
+	spin_lock(remote_contexts_lock + index);
+}
+
+inline void __unlock_remote_contexts(spinlock_t *remote_contexts_lock, int index)
+{
+	spin_unlock(remote_contexts_lock + index);
+}
+
+/* Hold the corresponding remote_contexts_lock */
+static struct remote_context *__lookup_remote_contexts_in(int nid, int tgid)
+{
+	struct remote_context *rc;
+
+	list_for_each_entry(rc, remote_contexts + INDEX_INBOUND, list) {
+		if (rc->remote_tgids[nid] == tgid)
+			return rc;
+	}
+	return NULL;
+}
+
+inline struct remote_context *__get_mm_remote(struct mm_struct *mm)
+{
+	struct remote_context *rc = mm->remote;
+	atomic_inc(&rc->count);
+	return rc;
+}
+
+inline struct remote_context *get_task_remote(struct task_struct *tsk)
+{
+	return __get_mm_remote(tsk->mm);
+}
+
+inline bool __put_task_remote(struct remote_context *rc)
+{
+	if (!atomic_dec_and_test(&rc->count)) return false;
+
+	__lock_remote_contexts(remote_contexts_lock, rc->for_remote);
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+	BUG_ON(atomic_read(&rc->count));
+#endif
+	list_del(&rc->list);
+	__unlock_remote_contexts(remote_contexts_lock, rc->for_remote);
+
+	free_remote_context_pages(rc);
+	kfree(rc);
+	return true;
+}
+
+inline bool put_task_remote(struct task_struct *tsk)
+{
+	return __put_task_remote(tsk->mm->remote);
+}
+
+void free_remote_context(struct remote_context *rc)
+{
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+	BUG_ON(atomic_read(&rc->count) != 1 && atomic_read(&rc->count) != 2);
+#endif
+	__put_task_remote(rc);
+}
+
+static struct remote_context *__alloc_remote_context(int nid, int tgid,
+						     bool remote)
+{
+	struct remote_context *rc = kmalloc(sizeof(*rc), GFP_KERNEL);
+	int i;
+
+	if (!rc)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&rc->list);
+	atomic_set(&rc->count, 1); /* Account for mm->remote in a near future */
+	rc->mm = NULL;
+
+	rc->tgid = tgid;
+	rc->for_remote = remote;
+
+	for (i = 0; i < FAULTS_HASH; i++) {
+		INIT_HLIST_HEAD(&rc->faults[i]);
+		spin_lock_init(&rc->faults_lock[i]);
+	}
+
+	INIT_LIST_HEAD(&rc->vmas);
+	spin_lock_init(&rc->vmas_lock);
+
+	rc->stop_remote_worker = false;
+
+	rc->remote_worker = NULL;
+	INIT_LIST_HEAD(&rc->remote_works);
+	spin_lock_init(&rc->remote_works_lock);
+	init_completion(&rc->remote_works_ready);
+
+	memset(rc->remote_tgids, 0x00, sizeof(rc->remote_tgids));
+
+	INIT_RADIX_TREE(&rc->pages, GFP_ATOMIC);
+
+	return rc;
+}
+
+static void __build_task_comm(char *buffer, char *path)
+{
+	int i, ch;
+	for (i = 0; (ch = *(path++)) != '\0';) {
+		if (ch == '/')
+			i = 0;
+		else if (i < (TASK_COMM_LEN - 1))
+			buffer[i++] = ch;
+	}
+	buffer[i] = '\0';
+}
+
+/*
+ *  This function implements a distributed mutex.
+ */
+long process_server_do_futex_at_remote(u32 __user *uaddr, int op, u32 val,
+				       bool valid_ts, struct timespec64 *ts,
+				       u32 __user *uaddr2, u32 val2, u32 val3)
+{
+	struct wait_station *ws = get_wait_station(current);
+	remote_futex_request req = {
+		.origin_pid = current->origin_pid,
+		.remote_ws = ws->id,
+		.op = op,
+		.val = val,
+		.ts = {
+			.tv_sec = -1,
+		},
+		.uaddr = uaddr,
+		.uaddr2 = uaddr2,
+		.val2 = val2,
+		.val3 = val3,
+	};
+	remote_futex_response *res;
+	long ret = 0;
+
+	if (valid_ts) {
+		req.ts = *ts;
+	}
+
+
+	pcn_kmsg_send(PCN_KMSG_TYPE_FUTEX_REQUEST,
+			current->origin_nid, &req, sizeof(req));
+
+	res = wait_at_station(ws);
+	ret = res->ret;
+	pcn_kmsg_done(res);
+
+	return ret;
+}
+
+static int handle_remote_futex_response(struct pcn_kmsg_message *msg)
+{
+	remote_futex_response *res = (remote_futex_response *)msg;
+	struct wait_station *ws = wait_station(res->remote_ws);
+
+	ws->private = res;
+	complete(&ws->pendings);
+	return 0;
+}
+
+static int process_remote_futex_request(remote_futex_request *req)
+{
+	int ret;
+	int err = 0;
+	remote_futex_response *res;
+	ktime_t t, *tp = NULL;
+
+	if (timespec64_valid(&req->ts)) {
+		t = timespec64_to_ktime(req->ts);
+		t = ktime_add_safe(ktime_get(), t);
+		tp = &t;
+	}
+
+
+	ret = do_futex(req->uaddr, req->op, req->val,
+			tp, req->uaddr2, req->val2, req->val3);
+
+	res = pcn_kmsg_get(sizeof(*res));
+
+	if(!res) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	res->remote_ws = req->remote_ws;
+	res->ret = ret;
+
+	pcn_kmsg_post(PCN_KMSG_TYPE_FUTEX_RESPONSE,
+			current->remote_nid, res, sizeof(*res));
+
+out:
+	pcn_kmsg_done(req);
+	return err;
+}
+
+/*
+ *  This function handles process exit.
+ */
+static void __terminate_remotes(struct remote_context *rc)
+{
+	int nid;
+	origin_task_exit_t req = {
+		.origin_pid = current->pid,
+		.exit_code = current->exit_code,
+	};
+
+	/* Take down peer vma workers */
+	for (nid = 0; nid < MAX_POPCORN_NODES; nid++) {
+		if (nid == my_nid || rc->remote_tgids[nid] == 0) continue;
+		PSPRINTK("TERMINATE [%d/%d] with 0x%d\n",
+				rc->remote_tgids[nid], nid, req.exit_code);
+
+		req.remote_pid = rc->remote_tgids[nid];
+		pcn_kmsg_send(PCN_KMSG_TYPE_TASK_EXIT_ORIGIN, nid, &req, sizeof(req));
+	}
+}
+
+static int __exit_origin_task(struct task_struct *tsk)
+{
+	struct remote_context *rc = tsk->mm->remote;
+
+	if (tsk->remote) {
+		put_task_remote(tsk);
+	}
+	tsk->remote = NULL;
+	tsk->origin_nid = tsk->origin_pid = -1;
+
+	/**
+	 * Trigger peer termination if this is the last user thread
+	 * referring to this mm.
+	 */
+	if (atomic_read(&tsk->mm->mm_users) == 1) {
+		__terminate_remotes(rc);
+	}
+
+	return 0;
+}
+
+static int __exit_remote_task(struct task_struct *tsk)
+{
+	/* Something went south. Notify the origin. */
+	if (tsk->exit_code != TASK_PARKED) {
+		if (!get_task_remote(tsk)->stop_remote_worker) {
+			remote_task_exit_t req = {
+				.origin_pid = tsk->origin_pid,
+				.remote_pid = tsk->pid,
+				.exit_code = tsk->exit_code,
+			};
+			pcn_kmsg_send(PCN_KMSG_TYPE_TASK_EXIT_REMOTE,
+					tsk->origin_nid, &req, sizeof(req));
+		}
+		put_task_remote(tsk);
+	}
+
+	put_task_remote(tsk);
+	tsk->remote = NULL;
+	tsk->origin_nid = tsk->origin_pid = -1;
+
+	return 0;
+}
+
+int process_server_task_exit(struct task_struct *tsk)
+{
+	WARN_ON(tsk != current);
+
+	if (!distributed_process(tsk))
+		return -ESRCH;
+
+	PSPRINTK("EXITED [%d] %s%s / 0x%x\n", tsk->pid,
+			tsk->at_remote ? "remote" : "local",
+			tsk->is_worker ? " worker": "",
+			tsk->exit_code);
+
+	if (tsk->is_worker)
+		return 0;
+
+	if (tsk->at_remote)
+		return __exit_remote_task(tsk);
+	else
+		return __exit_origin_task(tsk);
+}
+
+/*
+ * Handle the notification of the task kill at the remote.
+ */
+static void process_remote_task_exit(remote_task_exit_t *req)
+{
+	struct task_struct *tsk = current;
+	int exit_code = req->exit_code;
+
+	if (tsk->remote_pid != req->remote_pid) {
+		printk(KERN_INFO"%s: pid mismatch %d != %d\n", __func__,
+				tsk->remote_pid, req->remote_pid);
+		pcn_kmsg_done(req);
+		return;
+	}
+
+	PSPRINTK("%s [%d] 0x%x\n", __func__, tsk->pid, req->exit_code);
+
+	tsk->remote = NULL;
+	tsk->remote_nid = -1;
+	tsk->remote_pid = -1;
+	put_task_remote(tsk);
+
+	exit_code = req->exit_code;
+	pcn_kmsg_done(req);
+
+	if (exit_code & CSIGNAL)
+		force_sig(exit_code & CSIGNAL, tsk);
+
+	do_exit(exit_code);
+}
+
+static void process_origin_task_exit(struct remote_context *rc,
+				     origin_task_exit_t *req)
+{
+	BUG_ON(!current->is_worker);
+
+	PSPRINTK("\nTERMINATE [%d] with 0x%x\n", current->pid, req->exit_code);
+	current->exit_code = req->exit_code;
+	rc->stop_remote_worker = true;
+
+	pcn_kmsg_done(req);
+}
+
+/*
+ *    This function handles back migration.
+ *    If there is a pid mismatch the function will exit.
+ *    Otherwise it will restore the process at origin.
+ */
+static void process_back_migration(back_migration_request_t *req)
+{
+	if (current->remote_pid != req->remote_pid) {
+		printk(KERN_INFO"%s: pid mismatch during back migration (%d != %d)\n",
+				__func__, current->remote_pid, req->remote_pid);
+		goto out_free;
+	}
+
+	PSPRINTK("### BACKMIG [%d] from [%d/%d]\n",
+			current->pid, req->remote_pid, req->remote_nid);
+
+	/* Welcome home */
+
+	current->remote = NULL;
+	current->remote_nid = -1;
+	current->remote_pid = -1;
+	put_task_remote(current);
+
+	current->personality = req->personality;
+
+	/* XXX signals */
+
+	/* mm is not updated here; has been synchronized through vma operations */
+
+	restore_thread_info(&req->arch, true);
+
+out_free:
+	pcn_kmsg_done(req);
+}
+
+/*
+ * Send a message to <dst_nid> for migrating back a task <task>.
+ * This is a back migration
+ *  => <task> must already been migrated to <dst_nid>.
+ * It returns -1 in error case.
+ */
+static int __do_back_migration(struct task_struct *tsk, int dst_nid,
+			       void __user *uregs)
+{
+	back_migration_request_t *req;
+	int ret = 0;
+	int size = 0;
+
+	might_sleep();
+
+	BUG_ON(tsk->origin_nid == -1 && tsk->origin_pid == -1);
+
+	req = pcn_kmsg_get(sizeof(*req));
+
+	if(!req) {
+		return -ENOMEM;
+	}
+
+	req->origin_pid = tsk->origin_pid;
+	req->remote_nid = my_nid;
+	req->remote_pid = tsk->pid;
+
+	req->personality = tsk->personality;
+
+	size = regset_size(get_popcorn_node_arch(dst_nid));
+	if(!size) {
+		return -EINVAL;
+
+	}
+	ret = copy_from_user(&req->arch.regsets, uregs,
+			size);
+
+	save_thread_info(&req->arch);
+
+	ret = pcn_kmsg_post(
+		PCN_KMSG_TYPE_TASK_MIGRATE_BACK, dst_nid, req, sizeof(*req));
+
+	do_exit(TASK_PARKED);
+
+	return ret;
+}
+
+/*
+ *  Remote thread
+ */
+static int handle_remote_task_pairing(struct pcn_kmsg_message *msg)
+{
+	remote_task_pairing_t *req = (remote_task_pairing_t *)msg;
+	struct task_struct *tsk;
+	int from_nid = PCN_KMSG_FROM_NID(req);
+	int ret = 0;
+
+	tsk = __get_task_struct(req->your_pid);
+	if (!tsk) {
+		ret = -ESRCH;
+		goto out;
+	}
+	BUG_ON(tsk->at_remote);
+	BUG_ON(!tsk->remote);
+
+	tsk->remote_nid = from_nid;
+	tsk->remote_pid = req->my_pid;
+	tsk->remote->remote_tgids[from_nid] = req->my_tgid;
+
+	put_task_struct(tsk);
+out:
+	pcn_kmsg_done(req);
+	return ret;
+}
+
+static int __pair_remote_task(void)
+{
+	remote_task_pairing_t req = {
+		.my_tgid = current->tgid,
+		.my_pid = current->pid,
+		.your_pid = current->origin_pid,
+	};
+	return pcn_kmsg_send(
+			PCN_KMSG_TYPE_TASK_PAIRING, current->origin_nid, &req, sizeof(req));
+}
+
+
+struct remote_thread_params {
+	clone_request_t *req;
+};
+
+/*
+ *   This function performs the main proceess
+ *   migration operation. It demotes temporary priviledge prior
+ *   to injecting the thread info into the clone request.
+ *   Upon success, the function returns to user-space.
+ */
+static int remote_thread_main(void *_args)
+{
+	struct remote_thread_params *params = _args;
+	clone_request_t *req = params->req;
+	int ret = 0;
+
+#ifdef CONFIG_POPCORN_DEBUG_VERBOSE
+	PSPRINTK("%s [%d] started for [%d/%d]\n", __func__,
+			current->pid, req->origin_pid, PCN_KMSG_FROM_NID(req));
+#endif
+
+	current->flags &= ~PF_KTHREAD;
+	current->origin_nid = PCN_KMSG_FROM_NID(req);
+	current->origin_pid = req->origin_pid;
+	current->remote = get_task_remote(current);
+
+	set_fs(USER_DS);
+
+	restore_thread_info(&req->arch, true);
+
+	if((ret = __pair_remote_task())) {
+#ifdef CONFIG_POPCORN_DEBUG_VERBOSE
+		PSPRINTK("%s [%d] failed __pair_remote_task() for [%d/%d]\n", __func__,
+			current->pid, req->origin_pid, PCN_KMSG_FROM_NID(req));
+#endif
+		return ret;
+	}
+
+	PSPRINTK("\n####### MIGRATED - [%d/%d] from [%d/%d]\n",
+			current->pid, my_nid, current->origin_pid, current->origin_nid);
+
+	kfree(params);
+	pcn_kmsg_done(req);
+
+	return ret;
+}
+
+static int __fork_remote_thread(clone_request_t *req)
+{
+	struct remote_thread_params *params;
+	params = kmalloc(sizeof(*params), GFP_KERNEL);
+	params->req = req;
+
+	/* The loop deals with signals between concurrent migration */
+	while (kernel_thread(remote_thread_main, params,
+					CLONE_THREAD | CLONE_SIGHAND | SIGCHLD) < 0) {
+		schedule();
+	}
+	return 0;
+}
+
+static int __construct_mm(clone_request_t *req, struct remote_context *rc)
+{
+	struct mm_struct *mm;
+	struct file *f;
+	struct rlimit rlim_stack;
+
+	if(!(mm = mm_alloc())) {
+		return -ENOMEM;
+	}
+
+	task_lock(current->group_leader);
+	rlim_stack = current->signal->rlim[RLIMIT_STACK];
+	task_unlock(current->group_leader);
+
+	arch_pick_mmap_layout(mm, &rlim_stack);
+
+	f = filp_open(req->exe_path, O_RDONLY | O_LARGEFILE | O_EXCL, 0);
+	if (IS_ERR(f)) {
+		PCNPRINTK_ERR("cannot open executable from %s\n", req->exe_path);
+		mmdrop(mm);
+		return -EINVAL;
+	}
+	set_mm_exe_file(mm, f);
+	filp_close(f, NULL);
+
+	mm->task_size = req->task_size;
+	mm->start_stack = req->stack_start;
+	mm->start_brk = req->start_brk;
+	mm->brk = req->brk;
+	mm->env_start = req->env_start;
+	mm->env_end = req->env_end;
+	mm->arg_start = req->arg_start;
+	mm->arg_end = req->arg_end;
+	mm->start_code = req->start_code;
+	mm->end_code = req->end_code;
+	mm->start_data = req->start_data;
+	mm->end_data = req->end_data;
+	mm->def_flags = req->def_flags;
+
+	use_mm(mm);
+
+	rc->mm = mm;  /* No need to increase mm_users due to mm_alloc() */
+	mm->remote = rc;
+
+	return 0;
+}
+
+static void __terminate_remote_threads(struct remote_context *rc)
+{
+	struct task_struct *tsk;
+
+	/* Terminate userspace threads. Tried to use do_group_exit() but it
+	 * didn't work */
+	rcu_read_lock();
+	for_each_thread(current, tsk) {
+		if (tsk->is_worker)
+			continue;
+		force_sig(current->exit_code, tsk);
+	}
+	rcu_read_unlock();
+}
+
+static void __run_remote_worker(struct remote_context *rc)
+{
+	while (!rc->stop_remote_worker) {
+		struct work_struct *work = NULL;
+		struct pcn_kmsg_message *msg;
+		int wait_ret;
+		unsigned long flags;
+
+		wait_ret = wait_for_completion_interruptible_timeout(
+					&rc->remote_works_ready, HZ);
+		if (wait_ret == 0)
+			continue;
+
+		spin_lock_irqsave(&rc->remote_works_lock, flags);
+
+		if (!list_empty(&rc->remote_works)) {
+			work = list_first_entry(
+					&rc->remote_works, struct work_struct, entry);
+			list_del(&work->entry);
+		}
+		spin_unlock_irqrestore(&rc->remote_works_lock, flags);
+
+		if (!work)
+			continue;
+
+		msg = ((struct pcn_kmsg_work *)work)->msg;
+
+		switch (msg->header.type) {
+		case PCN_KMSG_TYPE_TASK_MIGRATE:
+			__fork_remote_thread((clone_request_t *)msg);
+			break;
+		case PCN_KMSG_TYPE_VMA_OP_REQUEST:
+			process_vma_op_request((vma_op_request_t *)msg);
+			break;
+		case PCN_KMSG_TYPE_TASK_EXIT_ORIGIN:
+			process_origin_task_exit(rc, (origin_task_exit_t *)msg);
+			break;
+		default:
+			printk("Unknown remote work type %d\n", msg->header.type);
+			break;
+		}
+
+		/* msg is released (pcn_kmsg_done()) in each handler */
+		kfree(work);
+	}
+}
+
+struct remote_worker_params {
+	clone_request_t *req;
+	struct remote_context *rc;
+	char comm[TASK_COMM_LEN];
+};
+
+static int remote_worker_main(void *data)
+{
+	struct remote_worker_params *params = (struct remote_worker_params *)data;
+	struct remote_context *rc = params->rc;
+	clone_request_t *req = params->req;
+	int mm_err = 0;
+
+	might_sleep();
+	kfree(params);
+
+	PSPRINTK("%s: [%d] for [%d/%d]\n", __func__,
+			current->pid, req->origin_tgid, PCN_KMSG_FROM_NID(req));
+	PSPRINTK("%s: [%d] %s\n", __func__,
+			current->pid, req->exe_path);
+
+	current->flags &= ~PF_RANDOMIZE;	/* Disable ASLR for now*/
+	current->flags &= ~PF_KTHREAD;	/* Demote to a user thread */
+
+	current->personality = req->personality;
+	current->is_worker = true;
+	current->at_remote = true;
+	current->origin_nid = PCN_KMSG_FROM_NID(req);
+	current->origin_pid = req->origin_pid;
+
+	set_user_nice(current, 0);
+
+	if ((mm_err = __construct_mm(req, rc))) {
+		return mm_err;
+	}
+
+	get_task_remote(current);
+	rc->tgid = current->tgid;
+
+	__run_remote_worker(rc);
+
+	__terminate_remote_threads(rc);
+
+	put_task_remote(current);
+	return current->exit_code;
+}
+
+static void __schedule_remote_work(struct remote_context *rc,
+				   struct pcn_kmsg_work *work)
+{
+	/* Exploit the list_head in work_struct */
+	struct list_head *entry = &((struct work_struct *)work)->entry;
+	unsigned long flags;
+
+	INIT_LIST_HEAD(entry);
+	spin_lock_irqsave(&rc->remote_works_lock, flags);
+	list_add(entry, &rc->remote_works);
+	spin_unlock_irqrestore(&rc->remote_works_lock, flags);
+
+	complete(&rc->remote_works_ready);
+}
+
+static void clone_remote_thread(struct work_struct *_work)
+{
+	struct pcn_kmsg_work *work = (struct pcn_kmsg_work *)_work;
+	clone_request_t *req = work->msg;
+	int nid_from = PCN_KMSG_FROM_NID(req);
+	int tgid_from = req->origin_tgid;
+	struct remote_context *rc;
+	struct remote_context *rc_new =
+			__alloc_remote_context(nid_from, tgid_from, true);
+
+	BUG_ON(!rc_new);
+
+	__lock_remote_contexts(remote_contexts_lock, INDEX_INBOUND);
+	rc = __lookup_remote_contexts_in(nid_from, tgid_from);
+	if (!rc) {
+		struct remote_worker_params *params;
+
+		rc = rc_new;
+		rc->remote_tgids[nid_from] = tgid_from;
+		list_add(&rc->list, remote_contexts + INDEX_INBOUND);
+		__unlock_remote_contexts(remote_contexts_lock, INDEX_INBOUND);
+
+		params = kmalloc(sizeof(*params), GFP_KERNEL);
+		BUG_ON(!params);
+
+		params->rc = rc;
+		params->req = req;
+		__build_task_comm(params->comm, req->exe_path);
+		smp_wmb();
+
+		rc->remote_worker =
+				kthread_run(remote_worker_main, params, params->comm);
+	} else {
+		__unlock_remote_contexts(remote_contexts_lock, INDEX_INBOUND);
+		kfree(rc_new);
+	}
+
+	/* Schedule this fork request */
+	__schedule_remote_work(rc, work);
+	return;
+}
+
+static int handle_clone_request(struct pcn_kmsg_message *msg)
+{
+	clone_request_t *req = (clone_request_t *)msg;
+	struct pcn_kmsg_work *work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if(!work)
+		return -ENOMEM;
+
+	work->msg = req;
+	INIT_WORK((struct work_struct *)work, clone_remote_thread);
+	queue_work(popcorn_wq, (struct work_struct *)work);
+
+	return 0;
+}
+
+/*
+ * Handle remote works at the origin
+ */
+int request_remote_work(pid_t pid, struct pcn_kmsg_message *req)
+{
+	struct task_struct *tsk = __get_task_struct(pid);
+
+	if (!tsk) {
+		printk(KERN_INFO"%s: invalid origin task %d for remote work %d\n",
+				__func__, pid, req->header.type);
+		pcn_kmsg_done(req);
+		return -ESRCH;
+	}
+
+	/*
+	 * Origin-initiated remote works are node-wide operations, thus, enqueue
+	 * such requests into the remote work queue.
+	 * On the other hand, remote-initated remote works are thread-wise requests.
+	 * So, pending the requests to the per-thread work queue.
+	 */
+	if (tsk->at_remote) {
+		struct remote_context *rc = get_task_remote(tsk);
+		struct pcn_kmsg_work *work = kmalloc(sizeof(*work), GFP_ATOMIC);
+
+		BUG_ON(!tsk->is_worker);
+		work->msg = req;
+
+		__schedule_remote_work(rc, work);
+
+		__put_task_remote(rc);
+	} else {
+		BUG_ON(tsk->remote_work);
+		tsk->remote_work = req;
+		complete(&tsk->remote_work_pended);
+	}
+
+	put_task_struct(tsk);
+	return 0;
+}
+
+static int __process_remote_works(void)
+{
+	int err = 0;
+	bool run = true;
+	BUG_ON(current->at_remote);
+
+	while (run) {
+		struct pcn_kmsg_message *req;
+		long ret;
+		ret = wait_for_completion_interruptible_timeout(
+				&current->remote_work_pended, HZ);
+		if (ret == 0)
+			continue;
+
+		req = (struct pcn_kmsg_message *)current->remote_work;
+		current->remote_work = NULL;
+		smp_wmb();
+
+		if (!req)
+			continue;
+
+		switch (req->header.type) {
+		case PCN_KMSG_TYPE_VMA_OP_REQUEST:
+			process_vma_op_request((vma_op_request_t *)req);
+			break;
+		case PCN_KMSG_TYPE_VMA_INFO_REQUEST:
+			process_vma_info_request((vma_info_request_t *)req);
+			break;
+		case PCN_KMSG_TYPE_FUTEX_REQUEST:
+			err = process_remote_futex_request((remote_futex_request *)req);
+			break;
+		case PCN_KMSG_TYPE_TASK_EXIT_REMOTE:
+			process_remote_task_exit((remote_task_exit_t *)req);
+			run = false;
+			break;
+		case PCN_KMSG_TYPE_TASK_MIGRATE_BACK:
+			process_back_migration((back_migration_request_t *)req);
+			run = false;
+			break;
+		default:
+			if (WARN_ON("Received unsupported remote work")) {
+				printk("  type: %d\n", req->header.type);
+			}
+		}
+	}
+	return err;
+}
+
+/*
+ * Send a message to <dst_nid> for migrating a task <task>.
+ * This function will ask the remote node to create a thread to host the task.
+ * It returns <0 in error case.
+ */
+static int __request_clone_remote(int dst_nid, struct task_struct *tsk,
+				  void __user *uregs)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	clone_request_t *req;
+	int ret = 0;
+	int size = 0;
+
+	req = pcn_kmsg_get(sizeof(*req));
+
+	if (!req) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* struct mm_struct */
+	if (get_file_path(mm->exe_file, req->exe_path, sizeof(req->exe_path))) {
+		printk("%s: cannot get path to exe binary\n", __func__);
+		ret = -ESRCH;
+		pcn_kmsg_put(req);
+		goto out;
+	}
+
+	req->task_size = mm->task_size;
+	req->stack_start = mm->start_stack;
+	req->start_brk = mm->start_brk;
+	req->brk = mm->brk;
+	req->env_start = mm->env_start;
+	req->env_end = mm->env_end;
+	req->arg_start = mm->arg_start;
+	req->arg_end = mm->arg_end;
+	req->start_code = mm->start_code;
+	req->end_code = mm->end_code;
+	req->start_data = mm->start_data;
+	req->end_data = mm->end_data;
+	req->def_flags = mm->def_flags;
+
+	/* struct tsk_struct */
+	req->origin_tgid = tsk->tgid;
+	req->origin_pid = tsk->pid;
+
+	req->personality = tsk->personality;
+
+	/* Register sets from userspace */
+	size = regset_size(get_popcorn_node_arch(dst_nid));
+	if(!size) {
+		return -EINVAL;
+
+	}
+	ret = copy_from_user(&req->arch.regsets, uregs,
+			     size);
+
+	save_thread_info(&req->arch);
+
+	ret = pcn_kmsg_post(PCN_KMSG_TYPE_TASK_MIGRATE, dst_nid, req, sizeof(*req));
+
+out:
+	mmput(mm);
+	return ret;
+}
+
+/*
+ * do_migration takes care of the original process migration
+ * to a remote node. Its complementary function being
+ * do_back_migration which returns a process from a remote node
+ * back to the origin node.
+ *
+ * The first thread gets migrated attaches the remote context to
+ * mm->remote, which indicates some threads in this process is
+ * distributed. At this point tsk->remote gets a link to the
+ * remote context.
+ *
+ */
+static int __do_migration(struct task_struct *tsk, int dst_nid,
+			  void __user *uregs)
+{
+	int ret = 0;
+	struct remote_context *rc;
+
+	rc = __alloc_remote_context(my_nid, tsk->tgid, false);
+	if (IS_ERR(rc))
+		return PTR_ERR(rc);
+
+	if (cmpxchg(&tsk->mm->remote, 0, rc)) {
+		kfree(rc);
+	} else {
+		rc->mm = tsk->mm;
+		rc->remote_tgids[my_nid] = tsk->tgid;
+
+		__lock_remote_contexts(remote_contexts_lock, INDEX_OUTBOUND);
+		list_add(&rc->list, remote_contexts + INDEX_OUTBOUND);
+		__unlock_remote_contexts(remote_contexts_lock, INDEX_OUTBOUND);
+	}
+
+	tsk->remote = get_task_remote(tsk);
+
+	ret = __request_clone_remote(dst_nid, tsk, uregs);
+	if (ret)
+		return ret;
+
+	ret = __process_remote_works();
+	return ret;
+}
+
+/*
+ * Migrate the specified task <task> to node <dst_nid>
+ * Currently, this function will put the specified task to sleep,
+ * and push its info over to the remote node.
+ * The remote node will then create a new thread and import that
+ * info into its new context.
+ */
+int process_server_do_migration(struct task_struct *tsk, unsigned int dst_nid,
+				void __user *uregs)
+{
+	int ret = 0;
+
+	if (tsk->origin_nid == dst_nid) {
+		ret = __do_back_migration(tsk, dst_nid, uregs);
+	} else {
+		ret = __do_migration(tsk, dst_nid, uregs);
+		if (ret) {
+			tsk->remote = NULL;
+			tsk->remote_pid = tsk->remote_nid = -1;
+			put_task_remote(tsk);
+		}
+	}
+
+	return ret;
+}
+
+DEFINE_KMSG_RW_HANDLER(origin_task_exit, origin_task_exit_t, remote_pid);
+DEFINE_KMSG_RW_HANDLER(remote_task_exit, remote_task_exit_t, origin_pid);
+DEFINE_KMSG_RW_HANDLER(back_migration, back_migration_request_t, origin_pid);
+DEFINE_KMSG_RW_HANDLER(remote_futex_request, remote_futex_request, origin_pid);
+
+/*
+ * Initialize the process server.
+ */
+int __init process_server_init(void)
+{
+	INIT_LIST_HEAD(&remote_contexts[0]);
+	INIT_LIST_HEAD(&remote_contexts[1]);
+
+	spin_lock_init(&remote_contexts_lock[0]);
+	spin_lock_init(&remote_contexts_lock[1]);
+
+	/* Register handlers */
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_TASK_MIGRATE, clone_request);
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_TASK_MIGRATE_BACK, back_migration);
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_TASK_PAIRING, remote_task_pairing);
+
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_TASK_EXIT_REMOTE, remote_task_exit);
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_TASK_EXIT_ORIGIN, origin_task_exit);
+
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_FUTEX_REQUEST, remote_futex_request);
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_FUTEX_RESPONSE, remote_futex_response);
+
+	return 0;
+}
diff --git a/kernel/popcorn/process_server.h b/kernel/popcorn/process_server.h
new file mode 100644
index 000000000..22e8f51f1
--- /dev/null
+++ b/kernel/popcorn/process_server.h
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef __INTERNAL_PROCESS_SERVER_H__
+#define __INTERNAL_PROCESS_SERVER_H__
+
+#include <popcorn/process_server.h>
+
+enum {
+	INDEX_OUTBOUND = 0,
+	INDEX_INBOUND = 1,
+};
+
+struct task_struct;
+struct field_arch;
+
+inline void __lock_remote_contexts(spinlock_t *remote_contexts_lock, int index);
+inline void __unlock_remote_contexts(spinlock_t *remote_contexts_lock,
+				     int index);
+
+int save_thread_info(struct field_arch *arch);
+int restore_thread_info(struct field_arch *arch, bool restore_segments);
+#endif /* __INTERNAL_PROCESS_SERVER_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 7/9] Virtual Memory Address Server for Distributed Thread Execution
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
                     ` (5 preceding siblings ...)
  2020-04-29 19:32   ` [RFC 6/9] Process Server for Popcorn Distributed Thread Execution Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 8/9] Page " Javier Malave
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

Popcorn Linux implements software-based distributed
shared memory by extending Linux's virtual memory subsystem,
and enables processes on different machines to observe a
common and coherent virtual address space. Coherency of virtual
memory pages of different hosts is ensured using a
reader-replicate/writer-invalidate, page-level consistency protocol.

The VMA server's job is to implement the aforementioned protocol.
Currently this work is fully managed at the origin node.
Remote delegation of VMA operations is not supported except for
MUNMAP.
---
 include/popcorn/vma_server.h |  33 ++
 kernel/popcorn/vma_server.c  | 818 +++++++++++++++++++++++++++++++++++
 kernel/popcorn/vma_server.h  |  24 +
 3 files changed, 875 insertions(+)
 create mode 100644 include/popcorn/vma_server.h
 create mode 100644 kernel/popcorn/vma_server.c
 create mode 100644 kernel/popcorn/vma_server.h

diff --git a/include/popcorn/vma_server.h b/include/popcorn/vma_server.h
new file mode 100644
index 000000000..162f41368
--- /dev/null
+++ b/include/popcorn/vma_server.h
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef INCLUDE_POPCORN_VMA_SERVER_H_
+#define INCLUDE_POPCORN_VMA_SERVER_H_
+
+
+/*
+ * VMA operation handlers for origin
+ */
+int vma_server_munmap_origin(unsigned long start, size_t len, int nid_except);
+
+
+/*
+ * Retrieve VMAs from origin
+ */
+int vma_server_fetch_vma(struct task_struct *tsk, unsigned long address);
+
+/*
+ * VMA operation handler for remote
+ */
+unsigned long vma_server_mmap_remote(struct file *file, unsigned long addr,
+				     unsigned long len, unsigned long prot,
+				     unsigned long flags, unsigned long pgoff);
+
+int vma_server_munmap_remote(unsigned long start, size_t len);
+int vma_server_brk_remote(unsigned long oldbrk, unsigned long brk);
+int vma_server_madvise_remote(unsigned long start, size_t len, int behavior);
+int vma_server_mprotect_remote(unsigned long start, size_t len,
+			       unsigned long prot);
+int vma_server_mremap_remote(unsigned long addr, unsigned long old_len,
+			     unsigned long new_len, unsigned long flags,
+			     unsigned long new_addr);
+
+#endif /* INCLUDE_POPCORN_VMA_SERVER_H_ */
diff --git a/kernel/popcorn/vma_server.c b/kernel/popcorn/vma_server.c
new file mode 100644
index 000000000..256267c0e
--- /dev/null
+++ b/kernel/popcorn/vma_server.c
@@ -0,0 +1,818 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/vma_server.c
+ *
+ * Popcorn Linux VMA handler implementation.
+ *
+ * VMA Server implements a reader-replicate/
+ * writer invalidate page-level coherency protocol.
+ *
+ * This work was an extension of David Katz MS Thesis,
+ * rewritten by Sang-Hoon to support multithread environment.
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ * author Sang-Hoon Kim, SSRG Virginia Tech 2016-2017
+ * author Vincent Legout, Antonio Barbalace, SSRG Virginia Tech 2016
+ * author Ajith Saya, Sharath Bhat, SSRG Virginia Tech 2015
+ * author Marina Sadini, Antonio Barbalace, SSRG Virginia Tech 2014
+ * author Marina Sadini, SSRG Virginia Tech 2013
+ */
+
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/kthread.h>
+
+#include <linux/mman.h>
+#include <linux/highmem.h>
+#include <linux/ptrace.h>
+#include <linux/sched/mm.h>
+#include <linux/elf.h>
+#include <linux/syscalls.h>
+#include <popcorn/bundle.h>
+
+#include "types.h"
+#include "util.h"
+#include "vma_server.h"
+#include "page_server.h"
+#include "wait_station.h"
+
+
+const char *vma_op_code_sz[] = {
+	"mmap", "munmap", "mprotect", "mremap", "madvise", "brk"
+};
+
+/*
+ *   This function performs a diff between all VMA's pcorand the current VMA
+ */
+static unsigned long map_difference(struct mm_struct *mm, struct file *file,
+				    unsigned long start, unsigned long end,
+				    unsigned long prot, unsigned long flags,
+				    unsigned long pgoff)
+{
+	unsigned long ret = start;
+	unsigned long error;
+	unsigned long populate = 0;
+	struct vm_area_struct *vma;
+
+	VSPRINTK("  [%d] map+ %lx %lx\n", current->pid, start, end);
+	for (vma = current->mm->mmap; start < end; vma = vma->vm_next) {
+
+		if (vma == NULL || end <= vma->vm_start) {
+			/*
+			 * We've reached the end of the list, or the VMA is fully
+			 * above the region of interest
+			 */
+			VSPRINTK("  [%d] map0 %lx -- %lx @ %lx, %lx\n", current->pid,
+					start, end, pgoff, prot);
+			error = do_mmap_pgoff(file, start, end - start,
+					      prot, flags, pgoff, &populate, NULL);
+			if (error != start) {
+				ret = VM_FAULT_SIGBUS;
+			}
+			break;
+		} else if (start >= vma->vm_start && end <= vma->vm_end) {
+			/*
+			 * VMA fully encompases the region of interest. nothing to do
+			 */
+			break;
+		} else if (start >= vma->vm_start
+				&& start < vma->vm_end && end > vma->vm_end) {
+			/*
+			 * VMA includes the start of the region of interest
+			 * but not the end. advance start (no mapping to do)
+			 */
+			pgoff += ((vma->vm_end - start) >> PAGE_SHIFT);
+			start = vma->vm_end;
+		} else if (start < vma->vm_start
+				&& vma->vm_start < end && end <= vma->vm_end) {
+			/*
+			 * VMA includes the end of the region of interest
+			 * but not the start
+			 */
+			VSPRINTK("  [%d] map1 %lx -- %lx @ %lx\n", current->pid,
+					start, vma->vm_start, pgoff);
+			error = do_mmap_pgoff(file, start, vma->vm_start - start,
+					      prot, flags, pgoff, &populate, NULL);
+			if (error != start) {
+				ret = VM_FAULT_SIGBUS;;
+			}
+			break;
+		} else if (start <= vma->vm_start && vma->vm_end <= end) {
+			/* VMA is fully within the region of interest */
+			VSPRINTK("  [%d] map2 %lx -- %lx @ %lx\n", current->pid,
+					start, vma->vm_start, pgoff);
+			error = do_mmap_pgoff(file, start, vma->vm_start - start,
+					      prot, flags, pgoff, &populate, NULL);
+			if (error != start) {
+				ret = VM_FAULT_SIGBUS;
+				break;
+			}
+
+			/*
+			 * Then advance to the end of this VMA
+			 */
+			pgoff += ((vma->vm_end - start) >> PAGE_SHIFT);
+			start = vma->vm_end;
+		}
+	}
+	WARN_ON(populate);
+	return ret;
+}
+
+/*
+ * VMA operation delegators at remotes
+ */
+static vma_op_request_t *__alloc_vma_op_request(enum vma_op_code opcode)
+{
+	vma_op_request_t *req = kmalloc(sizeof(*req), GFP_KERNEL);
+
+	if(!req)
+		return req;
+
+	req->origin_pid = current->origin_pid,
+	req->remote_pid = current->pid,
+	req->operation = opcode;
+
+	return req;
+}
+
+static int __delegate_vma_op(vma_op_request_t *req, vma_op_response_t **resp)
+{
+	vma_op_response_t *res;
+	struct wait_station *ws = get_wait_station(current);
+
+	req->remote_ws = ws->id;
+
+	pcn_kmsg_send(PCN_KMSG_TYPE_VMA_OP_REQUEST,
+			current->origin_nid, req, sizeof(*req));
+	res = wait_at_station(ws);
+	WARN_ON(res->operation != req->operation);
+
+	*resp = res;
+	return res->ret;
+}
+
+static int handle_vma_op_response(struct pcn_kmsg_message *msg)
+{
+	vma_op_response_t *res = (vma_op_response_t *)msg;
+	struct wait_station *ws = wait_station(res->remote_ws);
+
+	ws->private = res;
+	complete(&ws->pendings);
+
+	return 0;
+}
+
+unsigned long vma_server_mmap_remote(struct file *file,
+				     unsigned long addr, unsigned long len,
+				     unsigned long prot, unsigned long flags,
+				     unsigned long pgoff)
+{
+	unsigned long ret = 0;
+	vma_op_request_t *req;
+	vma_op_response_t *res;
+
+	if(!(req = __alloc_vma_op_request(VMA_OP_MMAP)))
+		return -ENOMEM;
+
+	req->addr = addr;
+	req->len = len;
+	req->prot = prot;
+	req->flags = flags;
+	req->pgoff = pgoff;
+	get_file_path(file, req->path, sizeof(req->path));
+
+	VSPRINTK("\n## VMA mmap [%d] %lx - %lx, %lx %lx\n", current->pid,
+			addr, addr + len, prot, flags);
+	if (req->path[0] != '\0') {
+		VSPRINTK("  [%d] %s\n", current->pid, req->path);
+	}
+
+	ret = __delegate_vma_op(req, &res);
+
+	VSPRINTK("  [%d] %ld %lx -- %lx\n", current->pid,
+			ret, res->addr, res->addr + res->len);
+
+	if (ret)
+		goto out_free;
+
+	while (!down_write_trylock(&current->mm->mmap_sem)) {
+		schedule();
+	}
+	ret = map_difference(current->mm, file, res->addr, res->addr + res->len,
+			prot, flags, pgoff);
+	up_write(&current->mm->mmap_sem);
+
+out_free:
+	kfree(req);
+	pcn_kmsg_done(res);
+
+	return ret;
+}
+
+int vma_server_munmap_remote(unsigned long start, size_t len)
+{
+	int ret;
+	vma_op_request_t *req;
+	vma_op_response_t *res;
+
+	VSPRINTK("\n## VMA munmap [%d] %lx %lx\n", current->pid, start, len);
+
+	ret = vm_munmap(start, len);
+	if (ret)
+		return ret;
+
+	if(!(req = __alloc_vma_op_request(VMA_OP_MUNMAP)))
+		return -ENOMEM;
+
+	req->addr = start;
+	req->len = len;
+
+	ret = __delegate_vma_op(req, &res);
+
+	VSPRINTK("  [%d] %d %lx -- %lx\n", current->pid,
+			ret, res->addr, res->addr + res->len);
+
+	kfree(req);
+	pcn_kmsg_done(res);
+
+	return ret;
+}
+
+int vma_server_brk_remote(unsigned long oldbrk, unsigned long brk)
+{
+	int ret;
+	vma_op_request_t *req;
+	vma_op_response_t *res;
+
+	if(!(req = __alloc_vma_op_request(VMA_OP_BRK)))
+		return -ENOMEM;
+
+	req->brk = brk;
+
+	VSPRINTK("\n## VMA brk-ed [%d] %lx --> %lx\n", current->pid, oldbrk, brk);
+
+	ret = __delegate_vma_op(req, &res);
+
+	VSPRINTK("  [%d] %d %lx\n", current->pid, ret, res->brk);
+
+	kfree(req);
+	pcn_kmsg_done(res);
+
+	return ret;
+}
+
+int vma_server_madvise_remote(unsigned long start, size_t len, int behavior)
+{
+	int ret;
+	vma_op_request_t *req;
+	vma_op_response_t *res;
+
+	if(!(req = __alloc_vma_op_request(VMA_OP_MADVISE)))
+		return -ENOMEM;
+
+	req->addr = start;
+	req->len = len;
+	req->behavior = behavior;
+
+	VSPRINTK("\n## VMA madvise-d [%d] %lx %lx %d\n", current->pid,
+			start, len, behavior);
+
+	ret = __delegate_vma_op(req, &res);
+
+	VSPRINTK("  [%d] %d %lx -- %lx %d\n", current->pid,
+			ret, res->addr, res->addr + res->len, behavior);
+
+	kfree(req);
+	pcn_kmsg_done(res);
+
+	return ret;
+}
+
+int vma_server_mprotect_remote(unsigned long start, size_t len,
+			       unsigned long prot)
+{
+	int ret;
+	vma_op_request_t *req;
+	vma_op_response_t *res;
+
+	if(!(req = __alloc_vma_op_request(VMA_OP_MPROTECT)))
+		return -ENOMEM;
+
+	req->start = start;
+	req->len = len;
+	req->prot = prot;
+
+	VSPRINTK("\nVMA mprotect [%d] %lx %lx %lx\n", current->pid,
+			start, len, prot);
+
+	ret = __delegate_vma_op(req, &res);
+
+	VSPRINTK("  [%d] %d %lx -- %lx %lx\n", current->pid,
+			ret, res->start, res->start + res->len, prot);
+
+	kfree(req);
+	pcn_kmsg_done(res);
+
+	return ret;
+}
+
+int vma_server_mremap_remote(unsigned long addr, unsigned long old_len,
+			     unsigned long new_len, unsigned long flags,
+			     unsigned long new_addr)
+{
+	WARN_ON_ONCE("Does not support remote mremap yet");
+	VSPRINTK("\nVMA mremap [%d] %lx %lx %lx %lx %lx\n", current->pid,
+			addr, old_len, new_len, flags, new_addr);
+	return -EINVAL;
+}
+
+
+/*
+ * VMA handlers for origin
+ */
+int vma_server_munmap_origin(unsigned long start, size_t len, int nid_except)
+{
+	int nid;
+	vma_op_request_t *req;
+	struct remote_context *rc = get_task_remote(current);
+
+	if(!(req = __alloc_vma_op_request(VMA_OP_MUNMAP)))
+		return -ENOMEM;
+
+	req->start = start;
+	req->len = len;
+
+	for (nid = 0; nid < MAX_POPCORN_NODES; nid++) {
+		struct wait_station *ws;
+		vma_op_response_t *res;
+
+		if (!get_popcorn_node_online(nid) || !rc->remote_tgids[nid])
+			continue;
+
+		if (nid == my_nid || nid == nid_except)
+			continue;
+
+		ws = get_wait_station(current);
+		req->remote_ws = ws->id;
+		req->origin_pid = rc->remote_tgids[nid];
+
+		VSPRINTK("  [%d] ->munmap [%d/%d] %lx+%lx\n", current->pid,
+				req->origin_pid, nid, start, len);
+		pcn_kmsg_send(PCN_KMSG_TYPE_VMA_OP_REQUEST, nid, req, sizeof(*req));
+		res = wait_at_station(ws);
+	}
+	kfree(req);
+	put_task_remote(current);
+
+	vm_munmap(start, len);
+	return 0;
+}
+
+/*
+ * VMA worker
+ *
+ * We do this because functions related to memory mapping operate
+ * on "current". Thus, we need mmap/munmap/madvise in our process
+ */
+static void __reply_vma_op(vma_op_request_t *req, long ret)
+{
+	vma_op_response_t *res = pcn_kmsg_get(sizeof(*res));
+
+	res->origin_pid = current->pid;
+	res->remote_pid = req->remote_pid;
+	res->remote_ws = req->remote_ws;
+
+	res->operation = req->operation;
+	res->ret = ret;
+	res->addr = req->addr;
+	res->len = req->len;
+
+	pcn_kmsg_post(PCN_KMSG_TYPE_VMA_OP_RESPONSE,
+			PCN_KMSG_FROM_NID(req), res, sizeof(*res));
+}
+
+/*
+ * Handle delegated VMA operations
+ * Currently, the remote worker only handles munmap VMA operations.
+ */
+static long __process_vma_op_at_remote(vma_op_request_t *req)
+{
+	long ret = -EPERM;
+
+	switch (req->operation) {
+	case VMA_OP_MUNMAP:
+		ret = vm_munmap(req->addr, req->len);
+		break;
+	case VMA_OP_MMAP:
+	case VMA_OP_MPROTECT:
+	case VMA_OP_MREMAP:
+	case VMA_OP_BRK:
+	case VMA_OP_MADVISE:
+		WARN_ON("Not implemented yet");
+		break;
+	default:
+		WARN_ON("unreachable");
+	}
+	return ret;
+}
+
+static long __process_vma_op_at_origin(vma_op_request_t *req)
+{
+	long ret = -EPERM;
+	int from_nid = PCN_KMSG_FROM_NID(req);
+
+	switch (req->operation) {
+	case VMA_OP_MMAP: {
+		unsigned long populate = 0;
+		unsigned long raddr;
+		struct file *f = NULL;
+		struct mm_struct *mm = get_task_mm(current);
+
+		if (req->path[0] != '\0')
+			f = filp_open(req->path, O_RDONLY | O_LARGEFILE, 0);
+
+		if (IS_ERR(f)) {
+			ret = PTR_ERR(f);
+			printk("  [%d] Cannot open %s %ld\n", current->pid, req->path, ret);
+			mmput(mm);
+			break;
+		}
+		down_write(&mm->mmap_sem);
+		raddr = do_mmap_pgoff(f, req->addr, req->len, req->prot,
+				      req->flags, req->pgoff, &populate, NULL);
+		up_write(&mm->mmap_sem);
+		if (populate)
+			mm_populate(raddr, populate);
+
+		ret = IS_ERR_VALUE(raddr) ? raddr : 0;
+		req->addr = raddr;
+		VSPRINTK("  [%d] %lx %lx -- %lx %lx %lx\n", current->pid,
+				ret, req->addr, req->addr + req->len, req->prot, req->flags);
+
+		if (f)
+			filp_close(f, NULL);
+		mmput(mm);
+		break;
+	}
+	case VMA_OP_BRK: {
+		unsigned long brk = req->brk;
+		req->brk = ksys_brk(req->brk);
+		ret = brk != req->brk;
+		break;
+	}
+	case VMA_OP_MUNMAP:
+		ret = vma_server_munmap_origin(req->addr, req->len, from_nid);
+		break;
+	case VMA_OP_MPROTECT:
+		ret = ksys_mprotect(req->addr, req->len, req->prot);
+		break;
+	case VMA_OP_MREMAP:
+		ret = ksys_mremap(req->addr, req->old_len, req->new_len,
+				  req->flags, req->new_addr);
+		break;
+	case VMA_OP_MADVISE:
+		if (req->behavior == MADV_RELEASE) {
+			ret = process_madvise_release_from_remote(
+					from_nid, req->start, req->start + req->len);
+		} else {
+			ret = ksys_madvise(req->start, req->len, req->behavior);
+		}
+		break;
+	default:
+		WARN_ON("unreachable");
+	}
+
+	return ret;
+}
+
+void process_vma_op_request(vma_op_request_t *req)
+{
+	long ret = 0;
+	VSPRINTK("\nVMA_OP_REQUEST [%d] %s %lx %lx\n", current->pid,
+			vma_op_code_sz[req->operation], req->addr, req->len);
+
+	if (current->at_remote) {
+		ret = __process_vma_op_at_remote(req);
+	} else {
+		ret = __process_vma_op_at_origin(req);
+	}
+
+	VSPRINTK("  [%d] ->%s %ld\n", current->pid,
+			vma_op_code_sz[req->operation], ret);
+
+	__reply_vma_op(req, ret);
+	pcn_kmsg_done(req);
+}
+
+/*
+ * Response for remote VMA request and handling the response
+ */
+struct vma_info {
+	struct list_head list;
+	unsigned long addr;
+	atomic_t pendings;
+	struct completion complete;
+	wait_queue_head_t pendings_wait;
+
+	volatile int ret;
+	volatile vma_info_response_t *response;
+};
+
+static struct vma_info *__lookup_pending_vma_request(struct remote_context *rc,
+						     unsigned long addr)
+{
+	struct vma_info *vi;
+
+	list_for_each_entry(vi, &rc->vmas, list) {
+		if (vi->addr == addr) return vi;
+	}
+	return NULL;
+}
+
+static int handle_vma_info_response(struct pcn_kmsg_message *msg)
+{
+	vma_info_response_t *res = (vma_info_response_t *)msg;
+	struct task_struct *tsk;
+	unsigned long flags;
+	struct vma_info *vi;
+	struct remote_context *rc;
+
+	tsk = __get_task_struct(res->remote_pid);
+	if (WARN_ON(!tsk)) {
+		goto out_free;
+	}
+	rc = get_task_remote(tsk);
+
+	spin_lock_irqsave(&rc->vmas_lock, flags);
+	vi = __lookup_pending_vma_request(rc, res->addr);
+	spin_unlock_irqrestore(&rc->vmas_lock, flags);
+	put_task_remote(tsk);
+	put_task_struct(tsk);
+
+	if (WARN_ON(!vi)) {
+		goto out_free;
+	}
+
+	vi->response = res;
+	complete(&vi->complete);
+	return 0;
+
+out_free:
+	pcn_kmsg_done(res);
+	return 0;
+}
+
+/*
+ * Handle VMA info requests at the origin.
+ * This is invoked through the remote work delegation.
+ */
+void process_vma_info_request(vma_info_request_t *req)
+{
+	vma_info_response_t *res = NULL;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	unsigned long addr = req->addr;
+
+	might_sleep();
+
+	while (!res) {
+		res = kmalloc(sizeof(*res), GFP_KERNEL);
+	}
+	res->addr = addr;
+
+	mm = get_task_mm(current);
+	down_read(&mm->mmap_sem);
+
+	vma = find_vma(mm, addr);
+	if (unlikely(!vma)) {
+		printk("vma_info: vma does not exist at %lx\n", addr);
+		res->result = -ENOENT;
+		goto out_up;
+	}
+	if (likely(vma->vm_start <= addr)) {
+		goto good;
+	}
+	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
+		printk("vma_info: vma does not really exist at %lx\n", addr);
+		res->result = -ENOENT;
+		goto out_up;
+	}
+
+good:
+	res->vm_start = vma->vm_start;
+	res->vm_end = vma->vm_end;
+	res->vm_flags = vma->vm_flags;
+	res->vm_pgoff = vma->vm_pgoff;
+
+	get_file_path(vma->vm_file, res->vm_file_path, sizeof(res->vm_file_path));
+	res->result = 0;
+
+out_up:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+
+	if (res->result == 0) {
+		VSPRINTK("\n## VMA_INFO [%d] %lx -- %lx %lx\n", current->pid,
+				res->vm_start, res->vm_end, res->vm_flags);
+		if (!vma_info_anon(res)) {
+			VSPRINTK("  [%d] %s + %lx\n", current->pid,
+					res->vm_file_path, res->vm_pgoff);
+		}
+	}
+
+	res->remote_pid = req->remote_pid;
+	pcn_kmsg_send(PCN_KMSG_TYPE_VMA_INFO_RESPONSE,
+			PCN_KMSG_FROM_NID(req), res, sizeof(*res));
+
+	pcn_kmsg_done(req);
+	kfree(res);
+	return;
+}
+
+
+static struct vma_info *__alloc_vma_info_request(struct task_struct *tsk,
+						 unsigned long addr,
+						 vma_info_request_t **preq)
+{
+	struct vma_info *vi = kmalloc(sizeof(*vi), GFP_KERNEL);
+	vma_info_request_t *req = kmalloc(sizeof(*req), GFP_KERNEL);
+
+	if(!vi || !req)
+		return vi;
+
+	INIT_LIST_HEAD(&vi->list);
+	vi->addr = addr;
+	vi->response = (volatile vma_info_response_t *)0xdeadbeaf;
+	atomic_set(&vi->pendings, 0);
+	init_completion(&vi->complete);
+	init_waitqueue_head(&vi->pendings_wait);
+
+	req->origin_pid = tsk->origin_pid;
+	req->remote_pid = tsk->pid;
+	req->addr = addr;
+
+	*preq = req;
+
+	return vi;
+}
+
+static int __update_vma(struct task_struct *tsk, vma_info_response_t *res)
+{
+	struct mm_struct *mm = tsk->mm;
+	struct vm_area_struct *vma;
+	unsigned long prot;
+	unsigned flags = MAP_FIXED;
+	struct file *f = NULL;
+	unsigned long err = 0;
+	int ret = 0;
+	unsigned long addr = res->addr;
+
+	if (res->result) {
+		down_read(&mm->mmap_sem);
+		return res->result;
+	}
+
+	while (!down_write_trylock(&mm->mmap_sem)) {
+		schedule();
+	}
+	vma = find_vma(mm, addr);
+	VSPRINTK("  [%d] %lx %lx\n", tsk->pid, vma ? vma->vm_start : 0, addr);
+
+	if (vma && vma->vm_start <= addr)
+		goto out;
+
+	if (vma_info_anon(res)) {
+		flags |= MAP_ANONYMOUS;
+	} else {
+		f = filp_open(res->vm_file_path, O_RDONLY | O_LARGEFILE, 0);
+		if (IS_ERR(f)) {
+			printk(KERN_ERR"%s: cannot find backing file %s\n",__func__,
+				res->vm_file_path);
+			ret = -EIO;
+			goto out;
+		}
+
+		VSPRINTK("  [%d] %s + %lx\n", tsk->pid,
+				res->vm_file_path, res->vm_pgoff);
+	}
+
+	prot  = ((res->vm_flags & VM_READ) ? PROT_READ : 0)
+			| ((res->vm_flags & VM_WRITE) ? PROT_WRITE : 0)
+			| ((res->vm_flags & VM_EXEC) ? PROT_EXEC : 0);
+
+	flags = flags
+			| ((res->vm_flags & VM_DENYWRITE) ? MAP_DENYWRITE : 0)
+			| ((res->vm_flags & VM_SHARED) ? MAP_SHARED : MAP_PRIVATE)
+			| ((res->vm_flags & VM_GROWSDOWN) ? MAP_GROWSDOWN : 0);
+
+	err = map_difference(mm, f, res->vm_start, res->vm_end,
+				prot, flags, res->vm_pgoff);
+
+	if (f) filp_close(f, NULL);
+
+out:
+	downgrade_write(&mm->mmap_sem);
+	return ret;
+}
+
+/*
+ * Fetch VMA information from the origin.
+ * mm->mmap_sem is down_read() at this point and should be downed upon return.
+ */
+int vma_server_fetch_vma(struct task_struct *tsk, unsigned long address)
+{
+	struct vma_info *vi = NULL;
+	unsigned long flags;
+	DEFINE_WAIT(wait);
+	int ret = 0;
+	unsigned long addr = address & PAGE_MASK;
+	vma_info_request_t *req = NULL;
+	struct remote_context *rc = get_task_remote(tsk);
+
+	might_sleep();
+
+	VSPRINTK("\n## VMAFAULT [%d] %lx %lx\n", current->pid,
+			address, instruction_pointer(current_pt_regs()));
+
+	spin_lock_irqsave(&rc->vmas_lock, flags);
+	vi = __lookup_pending_vma_request(rc, addr);
+	if (!vi) {
+		struct vma_info *v;
+		spin_unlock_irqrestore(&rc->vmas_lock, flags);
+
+		vi = __alloc_vma_info_request(tsk, addr, &req);
+
+		if(!vi || !req)
+			return -ENOMEM;
+
+		spin_lock_irqsave(&rc->vmas_lock, flags);
+		v = __lookup_pending_vma_request(rc, addr);
+		if (!v) {
+			list_add(&vi->list, &rc->vmas);
+		} else {
+			kfree(vi);
+			vi = v;
+			kfree(req);
+			req = NULL;
+		}
+	}
+	up_read(&tsk->mm->mmap_sem);
+
+	if (req) {
+		spin_unlock_irqrestore(&rc->vmas_lock, flags);
+
+		VSPRINTK("  [%d] %lx ->[%d/%d]\n", current->pid,
+				addr, tsk->origin_pid, tsk->origin_nid);
+		pcn_kmsg_send(PCN_KMSG_TYPE_VMA_INFO_REQUEST,
+				tsk->origin_nid, req, sizeof(*req));
+		wait_for_completion(&vi->complete);
+
+		ret = vi->ret =
+			__update_vma(tsk, (vma_info_response_t *)vi->response);
+
+		spin_lock_irqsave(&rc->vmas_lock, flags);
+		list_del(&vi->list);
+		spin_unlock_irqrestore(&rc->vmas_lock, flags);
+
+		pcn_kmsg_done((void *)vi->response);
+		wake_up_all(&vi->pendings_wait);
+
+		kfree(req);
+	} else {
+		VSPRINTK("  [%d] %lx already pended\n", current->pid, addr);
+		atomic_inc(&vi->pendings);
+		prepare_to_wait(&vi->pendings_wait, &wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irqrestore(&rc->vmas_lock, flags);
+
+		io_schedule();
+		finish_wait(&vi->pendings_wait, &wait);
+
+		smp_rmb();
+		ret = vi->ret;
+		if (atomic_dec_and_test(&vi->pendings)) {
+			kfree(vi);
+		}
+		down_read(&tsk->mm->mmap_sem);
+	}
+
+	put_task_remote(tsk);
+	return ret;
+}
+
+DEFINE_KMSG_RW_HANDLER(vma_info_request, vma_info_request_t, origin_pid);
+DEFINE_KMSG_RW_HANDLER(vma_op_request, vma_op_request_t, origin_pid);
+
+int vma_server_init(void)
+{
+	REGISTER_KMSG_HANDLER(
+			PCN_KMSG_TYPE_VMA_INFO_REQUEST, vma_info_request);
+	REGISTER_KMSG_HANDLER(
+			PCN_KMSG_TYPE_VMA_INFO_RESPONSE, vma_info_response);
+
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_VMA_OP_REQUEST, vma_op_request);
+	REGISTER_KMSG_HANDLER(PCN_KMSG_TYPE_VMA_OP_RESPONSE, vma_op_response);
+
+	return 0;
+}
diff --git a/kernel/popcorn/vma_server.h b/kernel/popcorn/vma_server.h
new file mode 100644
index 000000000..7ca760a85
--- /dev/null
+++ b/kernel/popcorn/vma_server.h
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef __INTERNAL_VMA_SERVER_H_
+#define __INTERNAL_VMA_SERVER_H_
+
+#include <popcorn/vma_server.h>
+
+enum vma_op_code {
+	VMA_OP_NOP = -1,
+	VMA_OP_MMAP,
+	VMA_OP_MUNMAP,
+	VMA_OP_MPROTECT,
+	VMA_OP_MREMAP,
+	VMA_OP_MADVISE,
+	VMA_OP_BRK,
+	VMA_OP_MAX,
+};
+
+struct remote_context;
+
+void process_vma_info_request(vma_info_request_t *req);
+
+void process_vma_op_request(vma_op_request_t *req);
+
+#endif /* __INTERNAL_VMA_SERVER_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 8/9] Page Server for Distributed Thread Execution
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
                     ` (6 preceding siblings ...)
  2020-04-29 19:32   ` [RFC 7/9] Virtual Memory Address Server for " Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-04-29 19:32   ` [RFC 9/9] Add Popcorn Message Layer and socket support Javier Malave
  2020-05-07 17:46   ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Pavel Machek
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

Popcorn threads execute in a distributed manner,
thus pages may fall into an inconsistent state.
A few examples may be,

A) A new page is created in a remote node, and is needed
at the origin node, after back migration.

B) A page has been changed in one node, the thread
migrates to a previous node where it had executed code,
and attempts to read the remotely modified page.

As part of the coherent VMA space, the page server takes
care of managing page consistency. Page server must mark
pages invalid when applicable, handle faults and verify
ownership.
---
 include/popcorn/page_server.h |   34 +
 kernel/popcorn/fh_action.c    |  207 ++++
 kernel/popcorn/fh_action.h    |   34 +
 kernel/popcorn/page_server.c  | 2019 +++++++++++++++++++++++++++++++++
 kernel/popcorn/page_server.h  |   16 +
 kernel/popcorn/pgtable.h      |   31 +
 6 files changed, 2341 insertions(+)
 create mode 100644 include/popcorn/page_server.h
 create mode 100644 kernel/popcorn/fh_action.c
 create mode 100644 kernel/popcorn/fh_action.h
 create mode 100644 kernel/popcorn/page_server.c
 create mode 100644 kernel/popcorn/page_server.h
 create mode 100644 kernel/popcorn/pgtable.h

diff --git a/include/popcorn/page_server.h b/include/popcorn/page_server.h
new file mode 100644
index 000000000..f4b25970c
--- /dev/null
+++ b/include/popcorn/page_server.h
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef INCLUDE_POPCORN_PAGE_SERVER_H_
+#define INCLUDE_POPCORN_PAGE_SERVER_H_
+
+struct fault_handle;
+
+/*
+ * Entry points for dealing with page fault in Popcorn Rack
+ */
+int page_server_handle_pte_fault(struct vm_fault *vmf);
+
+void page_server_zap_pte(struct vm_area_struct *vma, unsigned long addr,
+			 pte_t *pte, pte_t *pteval);
+
+int page_server_get_userpage(u32 __user *uaddr, struct fault_handle **handle,
+			     char *mode);
+void page_server_put_userpage(struct fault_handle *fh, char *mode);
+
+void page_server_panic(bool condition, struct mm_struct *mm,
+		       unsigned long address, pte_t *pte,
+		       pte_t pte_val);
+
+int page_server_release_page_ownership(struct vm_area_struct *vma,
+				       unsigned long addr);
+
+/* Implemented in mm/memory.c */
+int handle_pte_fault_origin(struct mm_struct *, struct vm_area_struct *,
+			    unsigned long, pte_t *, pmd_t *, unsigned int);
+struct page *get_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t *pte);
+int cow_file_at_origin(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t *pte);
+
+#endif /* INCLUDE_POPCORN_PAGE_SERVER_H_ */
diff --git a/kernel/popcorn/fh_action.c b/kernel/popcorn/fh_action.c
new file mode 100644
index 000000000..0ba0aea90
--- /dev/null
+++ b/kernel/popcorn/fh_action.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/fh_action.c
+ *
+ * Original file developed by SSRG at Virginia Tech.
+ *
+ * Fault handling statistics.
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ */
+
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+#include "fh_action.h"
+
+/*
+ * Current fault handling type
+ *   (L/R) For local or remote
+ *   (-/i) Ownership revocation pending
+ *   (R/W) For read or write
+ *
+ * Current fault type
+ *   (L/R) For local or remote
+ *   (R/W) For read or write
+ *
+ * e.g., at origin, L-WRW means the page is currently locked for handling
+ * local "fault for write" and "request to handle" remote fault for write.
+ * In this case, just retry immediately.
+ */
+
+static unsigned long __fh_action_stat[64] = { 0 };
+static const unsigned short fh_action_table[64] = {
+	/* L - R, LR (at origin) */
+	FH_ACTION_FOLLOW,
+	/* L - R, RR */
+	FH_ACTION_RETRY | FH_ACTION_WAIT | FH_ACTION_LOCAL,
+	/* L - R, LW */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* L - R, RW */
+	FH_ACTION_RETRY,
+        /* L - W, LR */
+	FH_ACTION_FOLLOW,
+	/* L - W, RR */
+	FH_ACTION_RETRY,
+	/* L - W, LW */
+	FH_ACTION_FOLLOW,
+	/* L - W, RW */
+	FH_ACTION_RETRY,
+	/* L i R, LR */
+	FH_ACTION_INVALID,
+	/* L i R, RR (at origin) */
+	FH_ACTION_INVALID,
+	/* L i R, LW */
+	FH_ACTION_INVALID,
+	/* L i R, RW */
+	FH_ACTION_INVALID,
+	/* L i W, LR */
+	FH_ACTION_INVALID,
+	/* L i W, RR */
+	FH_ACTION_INVALID,
+	/* L i W, LW */
+	FH_ACTION_INVALID,
+	/* L i W, RW */
+	FH_ACTION_INVALID,
+
+	/* R - R, LR */
+	FH_ACTION_FOLLOW,
+	/* R - R, RR */
+	FH_ACTION_RETRY | FH_ACTION_WAIT | FH_ACTION_LOCAL,
+	/* R - R, LW */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* R - R, RW */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* R - W, LR */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* R - W, RR */
+	FH_ACTION_RETRY,
+	/* R - W, LW */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* R - W, RW */
+	FH_ACTION_RETRY,
+	/* R i R, LR */
+	FH_ACTION_INVALID,
+	/* R i R, RR (at origin) */
+	FH_ACTION_INVALID,
+        /* R i R, LW */
+	FH_ACTION_INVALID,
+	/* R i R, RW */
+	FH_ACTION_INVALID,
+	/* R i W, LR */
+	FH_ACTION_INVALID,
+	/* R i W, RR */
+	FH_ACTION_INVALID,
+	/* R i W, LW */
+	FH_ACTION_INVALID,
+	/* R i W, RW */
+	FH_ACTION_INVALID,
+
+	/*
+	 * At remote
+	 *      - *i*R* are impossible since the origin never asks remotes for
+	 *        pages while an ownership revocation is pended.
+	 *      - R**R* are impossible since the origin never asks a page twice.
+	 *      - Ri*** are impossible henceforth.
+	 */
+
+	/* L - R, LR */
+	FH_ACTION_FOLLOW,
+	/* L - R, RR, implies remote doesn't own this page */
+	FH_ACTION_INVALID,
+	/* L - R, LW */
+	FH_ACTION_RETRY  | FH_ACTION_WAIT,
+	/* L - R, RW */
+	FH_ACTION_INVALID,
+	/* L - W, LR */
+	FH_ACTION_FOLLOW,
+	/* L - W, RR */
+	FH_ACTION_RETRY  | FH_ACTION_WAIT | FH_ACTION_LOCAL | FH_ACTION_DELAY,
+	/* L - W, LW */
+	FH_ACTION_FOLLOW,
+	/* L - W, RW */
+	FH_ACTION_RETRY  | FH_ACTION_WAIT | FH_ACTION_LOCAL | FH_ACTION_DELAY,
+	/* L i R, LR, no waiter should exist when finishing ownership revocation */
+	FH_ACTION_FOLLOW | FH_ACTION_RETRY | FH_ACTION_DELAY,
+	/* L i R, RR */
+	FH_ACTION_INVALID,
+	/* L i R, LW */
+	FH_ACTION_RETRY  | FH_ACTION_WAIT | FH_ACTION_DELAY,
+	/* L i R, RW */
+	FH_ACTION_INVALID,
+	/* L i W, LR (same to LiRLR) */
+	FH_ACTION_FOLLOW | FH_ACTION_RETRY | FH_ACTION_DELAY,
+	/* L i W, RR */
+	FH_ACTION_INVALID,
+	/* L i W, LW (same to LiRLR) */
+	FH_ACTION_FOLLOW | FH_ACTION_RETRY | FH_ACTION_DELAY,
+	/* L i W, RW */
+	FH_ACTION_INVALID,
+	/* R - R, LR */
+	FH_ACTION_INVALID,
+	/* R - R, RR */
+	FH_ACTION_INVALID,
+	/* R - R, LW */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* R - R, RW */
+	FH_ACTION_INVALID,
+	/* R - W, LR */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* R - W, RR */
+	FH_ACTION_INVALID,
+	/* R - W, LW */
+	FH_ACTION_RETRY | FH_ACTION_WAIT,
+	/* R - W, RW */
+	FH_ACTION_INVALID,
+	/* R i R, LR */
+	FH_ACTION_INVALID,
+	/* R i R, RR */
+	FH_ACTION_INVALID,
+	/* R i R, LW */
+	FH_ACTION_INVALID,
+	/* R i R, RW */
+	FH_ACTION_INVALID,
+	/* R i W, LR */
+	FH_ACTION_INVALID,
+	/* R i W, RR */
+	FH_ACTION_INVALID,
+	/* R i W, LW */
+	FH_ACTION_INVALID,
+	/* R i W, RW */
+	FH_ACTION_INVALID,
+};
+
+unsigned short get_fh_action(bool at_remote,
+                             unsigned long fh_flags,
+                             unsigned fault_flags)
+{
+	unsigned short i;
+	i  = (at_remote << 5);
+	i |= (fh_flags & 0x07) << 2;
+	i |= !!(fault_for_write(fault_flags)) << 1;
+	i |= !!(fault_flags & PC_FAULT_FLAG_REMOTE) << 0;
+
+	return fh_action_table[i];
+}
+
+void fh_action_stat(struct seq_file *seq, void *v)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(__fh_action_stat) / 4; i++) {
+		if (seq) {
+			seq_printf(seq,
+                       "%2d %-12lu %2d %-12lu %2d %-12lu %2d %-12lu\n",
+				       i,
+					   __fh_action_stat[i], i + 16,
+                       __fh_action_stat[i + 16], i + 32,
+                       __fh_action_stat[i + 32], i + 48,
+                       __fh_action_stat[i + 48]);
+		} else {
+			__fh_action_stat[i] = 0;
+			__fh_action_stat[i + 16] = 0;
+			__fh_action_stat[i + 32] = 0;
+			__fh_action_stat[i + 48] = 0;
+		}
+	}
+}
diff --git a/kernel/popcorn/fh_action.h b/kernel/popcorn/fh_action.h
new file mode 100644
index 000000000..ac9183d0d
--- /dev/null
+++ b/kernel/popcorn/fh_action.h
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef __POPCORN_FAULT_HANDLING_ACTION_H__
+#define __POPCORN_FAULT_HANDLING_ACTION_H__
+
+#include <linux/mm.h>
+
+enum {
+	FH_ACTION_INVALID = 0x00,
+	FH_ACTION_FOLLOW = 0x10,
+	FH_ACTION_RETRY = 0x20,
+	FH_ACTION_WAIT = 0x01,
+	FH_ACTION_LOCAL = 0x02,
+	FH_ACTION_DELAY = 0x04,
+
+	PC_FAULT_FLAG_REMOTE = 0x200,
+
+	FH_ACTION_MAX_FOLLOWER = 8,
+};
+
+static inline bool fault_for_write(unsigned long flags)
+{
+	return !!(flags & FAULT_FLAG_WRITE);
+}
+
+static inline bool fault_for_read(unsigned long flags)
+{
+	return !fault_for_write(flags);
+}
+
+unsigned short get_fh_action(bool at_remote,
+                             unsigned long fh_flags,
+                             unsigned fault_flags
+                             );
+#endif
diff --git a/kernel/popcorn/page_server.c b/kernel/popcorn/page_server.c
new file mode 100644
index 000000000..ace549be1
--- /dev/null
+++ b/kernel/popcorn/page_server.c
@@ -0,0 +1,2019 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+/*
+ * /kernel/popcorn/page_server.c
+ *
+ * Popcorn Linux page server implementation.
+ * As part of the coherent VMA space, the page server takes
+ * care of managing page consistency.
+ *
+ * This work was an extension of Marina Sadini's MS Thesis.
+ * It has since been modified for multi-threaded support.
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ * author Sang-Hoon Kim, SSRG Virginia Tech 2017
+ * author Marina Sadini, SSRG Virginia 2013
+ */
+
+#include <linux/compiler.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
+#include <linux/wait.h>
+#include <linux/ptrace.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/delay.h>
+#include <linux/random.h>
+#include <linux/radix-tree.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/task_stack.h>
+#include <linux/sched/mm.h>
+
+#include <asm/tlbflush.h>
+#include <asm/cacheflush.h>
+#include <asm/mmu_context.h>
+
+#include <popcorn/bundle.h>
+#include <popcorn/pcn_kmsg.h>
+
+#include "types.h"
+#include "pgtable.h"
+#include "wait_station.h"
+#include "page_server.h"
+#include "fh_action.h"
+
+#include "trace_events.h"
+
+static inline int __fault_hash_key(unsigned long address)
+{
+	return (address >> PAGE_SHIFT) % FAULTS_HASH;
+}
+
+/**************************************************************************
+ * Page ownership tracking mechanism
+ */
+#define PER_PAGE_INFO_SIZE \
+		(sizeof(unsigned long) * BITS_TO_LONGS(MAX_POPCORN_NODES))
+#define PAGE_INFO_PER_REGION (PAGE_SIZE / PER_PAGE_INFO_SIZE)
+
+static inline void __get_page_info_key(unsigned long addr, unsigned long *key,
+				       unsigned long *offset)
+{
+	unsigned long paddr = addr >> PAGE_SHIFT;
+	*key = paddr / PAGE_INFO_PER_REGION;
+	*offset = (paddr % PAGE_INFO_PER_REGION) *
+			(PER_PAGE_INFO_SIZE / sizeof(unsigned long));
+}
+
+static inline struct page *__get_page_info_page(struct mm_struct *mm,
+						unsigned long addr,
+						unsigned long *offset)
+{
+	unsigned long key;
+	struct page *page;
+	struct remote_context *rc = mm->remote;
+	__get_page_info_key(addr, &key, offset);
+
+	page = radix_tree_lookup(&rc->pages, key);
+	if (!page)
+		return NULL;
+
+	return page;
+}
+
+static inline unsigned long *__get_page_info_mapped(struct mm_struct *mm,
+						    unsigned long addr,
+						    unsigned long *offset)
+{
+	unsigned long key;
+	struct page *page;
+	struct remote_context *rc = mm->remote;
+	__get_page_info_key(addr, &key, offset);
+
+	page = radix_tree_lookup(&rc->pages, key);
+	if (!page)
+		return NULL;
+
+	return (unsigned long *)kmap_atomic(page) + *offset;
+}
+
+#define FREE_BATCH 16
+void free_remote_context_pages(struct remote_context *rc)
+{
+	int nr_pages;
+	struct page *pages[FREE_BATCH];
+
+	do {
+		int i;
+		nr_pages = radix_tree_gang_lookup(&rc->pages,
+				(void **)pages, 0, FREE_BATCH);
+
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pages[i];
+			radix_tree_delete(&rc->pages, page_private(page));
+			__free_page(page);
+		}
+	} while (nr_pages == FREE_BATCH);
+}
+
+#define PI_FLAG_COWED 62
+#define PI_FLAG_DISTRIBUTED 63
+
+static struct page *__lookup_page_info_page(struct remote_context *rc,
+					    unsigned long key)
+{
+	struct page *page = radix_tree_lookup(&rc->pages, key);
+	if (!page) {
+		int ret;
+		page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+		BUG_ON(!page);
+		set_page_private(page, key);
+
+		ret = radix_tree_insert(&rc->pages, key, page);
+		BUG_ON(ret);
+	}
+	return page;
+}
+
+static inline void SetPageDistributed(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long key, offset;
+	unsigned long *region;
+	struct page *page;
+	struct remote_context *rc = mm->remote;
+	__get_page_info_key(addr, &key, &offset);
+
+	page = __lookup_page_info_page(rc, key);
+	region = kmap_atomic(page);
+	set_bit(PI_FLAG_DISTRIBUTED, region + offset);
+	kunmap_atomic(region);
+}
+
+static inline void SetPageCowed(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long key, offset;
+	unsigned long *region;
+	struct page *page;
+	struct remote_context *rc = mm->remote;
+	__get_page_info_key(addr, &key, &offset);
+
+	page = __lookup_page_info_page(rc, key);
+	region = kmap_atomic(page);
+	set_bit(PI_FLAG_COWED, region + offset);
+	kunmap_atomic(region);
+}
+
+static inline void ClearPageInfo(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long offset;
+	unsigned long *pi = __get_page_info_mapped(mm, addr, &offset);
+
+	if (!pi)
+		return;
+	clear_bit(PI_FLAG_DISTRIBUTED, pi);
+	clear_bit(PI_FLAG_COWED, pi);
+	bitmap_clear(pi, 0, MAX_POPCORN_NODES);
+	kunmap_atomic(pi - offset);
+}
+
+static inline bool PageDistributed(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long offset;
+	unsigned long *pi = __get_page_info_mapped(mm, addr, &offset);
+	bool ret;
+
+	if (!pi)
+		return false;
+	ret = test_bit(PI_FLAG_DISTRIBUTED, pi);
+	kunmap_atomic(pi - offset);
+	return ret;
+}
+
+static inline bool PageCowed(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long offset;
+	unsigned long *pi = __get_page_info_mapped(mm, addr, &offset);
+	bool ret;
+
+	if (!pi)
+		return false;
+	ret = test_bit(PI_FLAG_COWED, pi);
+	kunmap_atomic(pi - offset);
+	return ret;
+}
+
+static inline bool page_is_mine(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long offset;
+	unsigned long *pi = __get_page_info_mapped(mm, addr, &offset);
+	bool ret = true;
+
+	if (!pi)
+		return true;
+	if (!test_bit(PI_FLAG_DISTRIBUTED, pi))
+		goto out;
+	ret = test_bit(my_nid, pi);
+out:
+	kunmap_atomic(pi - offset);
+	return ret;
+}
+
+static inline bool test_page_owner(int nid, struct mm_struct *mm,
+				   unsigned long addr)
+{
+	unsigned long offset;
+	unsigned long *pi = __get_page_info_mapped(mm, addr, &offset);
+	bool ret;
+
+	if (!pi)
+		return false;
+	ret = test_bit(nid, pi);
+	kunmap_atomic(pi - offset);
+	return ret;
+}
+
+static inline void set_page_owner(int nid, struct mm_struct *mm,
+				  unsigned long addr)
+{
+	unsigned long offset;
+	unsigned long *pi = __get_page_info_mapped(mm, addr, &offset);
+	set_bit(nid, pi);
+	kunmap_atomic(pi - offset);
+}
+
+static inline void clear_page_owner(int nid, struct mm_struct *mm,
+				    unsigned long addr)
+{
+	unsigned long offset;
+	unsigned long *pi = __get_page_info_mapped(mm, addr, &offset);
+	if (!pi)
+		return;
+
+	clear_bit(nid, pi);
+	kunmap_atomic(pi - offset);
+}
+
+/*
+ * Fault tracking mechanism
+ */
+enum {
+	FAULT_HANDLE_WRITE = 0x01,
+	FAULT_HANDLE_INVALIDATE = 0x02,
+	FAULT_HANDLE_REMOTE = 0x04,
+};
+
+static struct kmem_cache *__fault_handle_cache = NULL;
+
+struct fault_handle {
+	struct hlist_node list;
+
+	unsigned long addr;
+	unsigned long flags;
+
+	unsigned int limit;
+	pid_t pid;
+	int ret;
+
+	atomic_t pendings;
+	atomic_t pendings_retry;
+	wait_queue_head_t waits;
+	wait_queue_head_t waits_retry;
+	struct remote_context *rc;
+
+	struct completion *complete;
+};
+
+static struct fault_handle *__alloc_fault_handle(struct task_struct *tsk,
+						 unsigned long addr)
+{
+	struct fault_handle *fh =
+			kmem_cache_alloc(__fault_handle_cache, GFP_ATOMIC);
+	int fk = __fault_hash_key(addr);
+	BUG_ON(!fh);
+
+	INIT_HLIST_NODE(&fh->list);
+
+	fh->addr = addr;
+	fh->flags = 0;
+
+	init_waitqueue_head(&fh->waits);
+	init_waitqueue_head(&fh->waits_retry);
+	atomic_set(&fh->pendings, 1);
+	atomic_set(&fh->pendings_retry, 0);
+	fh->limit = 0;
+	fh->ret = 0;
+	fh->rc = get_task_remote(tsk);
+	fh->pid = tsk->pid;
+	fh->complete = NULL;
+
+	hlist_add_head(&fh->list, &fh->rc->faults[fk]);
+	return fh;
+}
+
+
+static struct fault_handle *__start_invalidation(struct task_struct *tsk,
+						 unsigned long addr,
+						 spinlock_t *ptl)
+{
+	unsigned long flags;
+	struct remote_context *rc = get_task_remote(tsk);
+	struct fault_handle *fh;
+	bool found = false;
+	DECLARE_COMPLETION_ONSTACK(complete);
+	int fk = __fault_hash_key(addr);
+
+	spin_lock_irqsave(&rc->faults_lock[fk], flags);
+	hlist_for_each_entry(fh, &rc->faults[fk], list) {
+		if (fh->addr == addr) {
+			PGPRINTK("  [%d] %s %s ongoing, wait\n", tsk->pid,
+				fh->flags & FAULT_HANDLE_REMOTE ? "remote" : "local",
+				fh->flags & FAULT_HANDLE_WRITE ? "write" : "read");
+			BUG_ON(fh->flags & FAULT_HANDLE_INVALIDATE);
+			fh->flags |= FAULT_HANDLE_INVALIDATE;
+			fh->complete = &complete;
+			found = true;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&rc->faults_lock[fk], flags);
+	put_task_remote(tsk);
+
+	if (found) {
+		spin_unlock(ptl);
+		PGPRINTK(" +[%d] %lx %p\n", tsk->pid, addr, fh);
+		wait_for_completion(&complete);
+		PGPRINTK(" =[%d] %lx %p\n", tsk->pid, addr, fh);
+		spin_lock(ptl);
+	} else {
+		fh = NULL;
+		PGPRINTK(" =[%d] %lx\n", tsk->pid, addr);
+	}
+	return fh;
+}
+
+static void __finish_invalidation(struct fault_handle *fh)
+{
+	unsigned long flags;
+	int fk;
+
+	if (!fh)
+		return;
+	fk = __fault_hash_key(fh->addr);
+
+	BUG_ON(atomic_read(&fh->pendings));
+	spin_lock_irqsave(&fh->rc->faults_lock[fk], flags);
+	hlist_del(&fh->list);
+	spin_unlock_irqrestore(&fh->rc->faults_lock[fk], flags);
+
+	__put_task_remote(fh->rc);
+	if (atomic_read(&fh->pendings_retry)) {
+		wake_up_all(&fh->waits_retry);
+	} else {
+		kmem_cache_free(__fault_handle_cache, fh);
+	}
+}
+
+static struct fault_handle *__start_fault_handling(struct task_struct *tsk,
+						   unsigned long addr,
+						   unsigned long fault_flags,
+						   spinlock_t *ptl, bool *leader)
+	__releases(ptl)
+{
+	unsigned long flags;
+	struct fault_handle *fh;
+	bool found = false;
+	struct remote_context *rc = get_task_remote(tsk);
+	DEFINE_WAIT(wait);
+	int fk = __fault_hash_key(addr);
+
+	spin_lock_irqsave(&rc->faults_lock[fk], flags);
+	spin_unlock(ptl);
+
+	hlist_for_each_entry(fh, &rc->faults[fk], list) {
+		if (fh->addr == addr) {
+			found = true;
+			break;
+		}
+	}
+
+	if (found) {
+		unsigned long action =
+				get_fh_action(tsk->at_remote, fh->flags, fault_flags);
+
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+		BUG_ON(action == FH_ACTION_INVALID);
+#endif
+		if (action & FH_ACTION_RETRY) {
+			if (action & FH_ACTION_WAIT) {
+				goto out_wait_retry;
+			}
+			goto out_retry;
+		}
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+		BUG_ON(action != FH_ACTION_FOLLOW);
+#endif
+
+		if (fh->limit++ > FH_ACTION_MAX_FOLLOWER) {
+			goto out_wait_retry;
+		}
+
+		atomic_inc(&fh->pendings);
+#ifndef CONFIG_POPCORN_DEBUG_PAGE_SERVER
+		prepare_to_wait(&fh->waits, &wait, TASK_UNINTERRUPTIBLE);
+#else
+		prepare_to_wait_exclusive(&fh->waits, &wait, TASK_UNINTERRUPTIBLE);
+#endif
+		spin_unlock_irqrestore(&rc->faults_lock[fk], flags);
+		PGPRINTK(" +[%d] %lx %p\n", tsk->pid, addr, fh);
+		put_task_remote(tsk);
+
+		io_schedule();
+		finish_wait(&fh->waits, &wait);
+
+		fh->pid = tsk->pid;
+		*leader = false;
+		return fh;
+	}
+
+	fh = __alloc_fault_handle(tsk, addr);
+	fh->flags |= fault_for_write(fault_flags) ? FAULT_HANDLE_WRITE : 0;
+	fh->flags |= (fault_flags & PC_FAULT_FLAG_REMOTE) ? FAULT_HANDLE_REMOTE : 0;
+
+	spin_unlock_irqrestore(&rc->faults_lock[fk], flags);
+	put_task_remote(tsk);
+
+	*leader = true;
+	return fh;
+
+out_wait_retry:
+	atomic_inc(&fh->pendings_retry);
+	prepare_to_wait(&fh->waits_retry, &wait, TASK_UNINTERRUPTIBLE);
+	spin_unlock_irqrestore(&rc->faults_lock[fk], flags);
+	put_task_remote(tsk);
+
+	PGPRINTK("  [%d] waits %p\n", tsk->pid, fh);
+	io_schedule();
+	finish_wait(&fh->waits_retry, &wait);
+	if (atomic_dec_and_test(&fh->pendings_retry)) {
+		kmem_cache_free(__fault_handle_cache, fh);
+	}
+	return NULL;
+
+out_retry:
+	spin_unlock_irqrestore(&rc->faults_lock[fk], flags);
+	put_task_remote(tsk);
+
+	PGPRINTK("  [%d] locked. retry %p\n", tsk->pid, fh);
+	return NULL;
+}
+
+static bool __finish_fault_handling(struct fault_handle *fh)
+{
+	unsigned long flags;
+	bool last = false;
+	int fk = __fault_hash_key(fh->addr);
+
+	spin_lock_irqsave(&fh->rc->faults_lock[fk], flags);
+	if (atomic_dec_return(&fh->pendings)) {
+		PGPRINTK(" >[%d] %lx %p\n", fh->pid, fh->addr, fh);
+#ifndef CONFIG_POPCORN_DEBUG_PAGE_SERVER
+		wake_up_all(&fh->waits);
+#else
+		wake_up(&fh->waits);
+#endif
+	} else {
+		PGPRINTK(">>[%d] %lx %p\n", fh->pid, fh->addr, fh);
+		if (fh->complete) {
+			complete(fh->complete);
+		} else {
+			hlist_del(&fh->list);
+			last = true;
+		}
+	}
+	spin_unlock_irqrestore(&fh->rc->faults_lock[fk], flags);
+
+	if (last) {
+		__put_task_remote(fh->rc);
+		if (atomic_read(&fh->pendings_retry)) {
+			wake_up_all(&fh->waits_retry);
+		} else {
+			kmem_cache_free(__fault_handle_cache, fh);
+		}
+	}
+	return last;
+}
+
+
+/*
+ * Helper functions for PTE following
+ */
+static pte_t *__get_pte_at(struct mm_struct *mm, unsigned long addr,
+			   pmd_t **ppmd,
+			   spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd || pgd_none(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, addr);
+	if (!p4d || p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
+	if (!pud || pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd || pmd_none(*pmd))
+		return NULL;
+
+	*ppmd = pmd;
+	*ptlp = pte_lockptr(mm, pmd);
+
+	return pte_offset_map(pmd, addr);
+}
+
+static pte_t *__get_pte_at_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t **ppmd,
+				 spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd)
+		return NULL;
+
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return NULL;
+
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return NULL;
+
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return NULL;
+
+	pte = pte_alloc_map(mm, pmd, addr);
+
+	*ppmd = pmd;
+	*ptlp = pte_lockptr(mm, pmd);
+	return pte;
+}
+
+static struct page *__find_page_at(struct mm_struct *mm, unsigned long addr,
+				   pte_t **ptep, spinlock_t **ptlp)
+{
+	pmd_t *pmd;
+	pte_t *pte = NULL;
+	spinlock_t *ptl = NULL;
+	struct page *page = ERR_PTR(-ENOMEM);
+
+	pte = __get_pte_at(mm, addr, &pmd, &ptl);
+
+	if (pte == NULL) {
+		pte = NULL;
+		ptl = NULL;
+		page = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	if (pte_none(*pte)) {
+		pte_unmap(pte);
+		pte = NULL;
+		ptl = NULL;
+		page = ERR_PTR(-ENOENT);
+		goto out;
+	}
+
+	spin_lock(ptl);
+	page = pte_page(*pte);
+	get_page(page);
+
+out:
+	*ptep = pte;
+	*ptlp = ptl;
+	return page;
+}
+
+
+/*
+ * Panicked by bug!!!!!
+ */
+void page_server_panic(bool condition, struct mm_struct *mm,
+		       unsigned long address, pte_t *pte,
+		       pte_t pte_val)
+{
+	unsigned long *pi;
+	unsigned long pi_val = -1;
+	unsigned long offset;
+
+	if (!condition)
+		return;
+
+	pi = __get_page_info_mapped(mm, address, &offset);
+	if (pi) {
+		pi_val = *pi;
+		kunmap_atomic(pi - offset);
+	}
+
+	printk(KERN_ERR "------------------ Start panicking -----------------\n");
+	printk(KERN_ERR "%s: %lx %p %lx %p %lx\n", __func__,
+			address, pi, pi_val, pte, pte_flags(pte_val));
+	show_regs(current_pt_regs());
+	BUG_ON("Page server panicked!!");
+}
+
+
+/*
+ * Flush pages to the origin
+ */
+enum {
+	FLUSH_FLAG_START = 0x01,
+	FLUSH_FLAG_FLUSH = 0x02,
+	FLUSH_FLAG_RELEASE = 0x04,
+	FLUSH_FLAG_LAST = 0x10,
+};
+
+
+static void process_remote_page_flush(struct work_struct *work)
+{
+	START_KMSG_WORK(remote_page_flush_t, req, work);
+	unsigned long addr = req->addr;
+	struct task_struct *tsk;
+	struct mm_struct *mm;
+	struct remote_context *rc;
+	struct page *page;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	void *paddr;
+	struct vm_area_struct *vma;
+	remote_page_flush_ack_t res = {
+		.remote_ws = req->remote_ws,
+	};
+
+	PGPRINTK("  [%d] flush ->[%d/%d] %lx\n",
+			req->origin_pid, req->remote_pid, req->remote_nid, addr);
+
+	tsk = __get_task_struct(req->origin_pid);
+	if (!tsk)
+		goto out_free;
+
+	mm = get_task_mm(tsk);
+	rc = get_task_remote(tsk);
+
+	if (req->flags & FLUSH_FLAG_START) {
+		res.flags = FLUSH_FLAG_START;
+		pcn_kmsg_send(PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH_ACK,
+				req->remote_nid, &res, sizeof(res));
+		goto out_put;
+	} else if (req->flags & FLUSH_FLAG_LAST) {
+		res.flags = FLUSH_FLAG_LAST;
+		pcn_kmsg_send(PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH_ACK,
+				req->remote_nid, &res, sizeof(res));
+		goto out_put;
+	}
+
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, addr);
+	BUG_ON(!vma || vma->vm_start > addr);
+
+	page = __find_page_at(mm, addr, &pte, &ptl);
+	BUG_ON(IS_ERR(page));
+
+	/* XXX should be outside of ptl lock */
+	if (req->flags & FLUSH_FLAG_FLUSH) {
+		paddr = kmap(page);
+		copy_to_user_page(vma, page, addr, paddr, req->page, PAGE_SIZE);
+		kunmap(page);
+	}
+
+	SetPageDistributed(mm, addr);
+	set_page_owner(my_nid, mm, addr);
+	clear_page_owner(req->remote_nid, mm, addr);
+
+	/* XXX Should update through clear_flush and set */
+	entry = pte_make_valid(*pte);
+
+	set_pte_at_notify(mm, addr, pte, entry);
+	update_mmu_cache(vma, addr, pte);
+	flush_tlb_page(vma, addr);
+
+	put_page(page);
+
+	pte_unmap_unlock(pte, ptl);
+	up_read(&mm->mmap_sem);
+
+out_put:
+	put_task_remote(tsk);
+	put_task_struct(tsk);
+	mmput(mm);
+
+out_free:
+	END_KMSG_WORK(req);
+}
+
+
+static int __do_pte_flush(pte_t *pte, unsigned long addr, unsigned long next,
+			  struct mm_walk *walk)
+{
+	remote_page_flush_t *req = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page;
+	int req_size;
+	enum pcn_kmsg_type req_type;
+	char type;
+
+	if (pte_none(*pte))
+		return 0;
+
+	page = pte_page(*pte);
+	BUG_ON(!page);
+
+	if (test_page_owner(my_nid, vma->vm_mm, addr)) {
+		req->addr = addr;
+		if ((vma->vm_flags & VM_WRITE) && pte_write(*pte)) {
+			void *paddr;
+			flush_cache_page(vma, addr, page_to_pfn(page));
+			paddr = kmap_atomic(page);
+			copy_from_user_page(walk->vma, page, addr, req->page, paddr, PAGE_SIZE);
+			kunmap_atomic(paddr);
+
+			req_type = PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH;
+			req_size = sizeof(remote_page_flush_t);
+			req->flags = FLUSH_FLAG_FLUSH;
+			type = '*';
+		} else {
+			req_type = PCN_KMSG_TYPE_REMOTE_PAGE_RELEASE;
+			req_size = sizeof(remote_page_release_t);
+			req->flags = FLUSH_FLAG_RELEASE;
+			type = '+';
+		}
+		clear_page_owner(my_nid, vma->vm_mm, addr);
+
+		pcn_kmsg_send(req_type, current->origin_nid, req, req_size);
+	} else {
+		*pte = pte_make_valid(*pte);
+		type = '-';
+	}
+	PGPRINTK("  [%d] %c %lx\n", current->pid, type, addr);
+
+	return 0;
+}
+
+
+int page_server_flush_remote_pages(struct remote_context *rc)
+{
+	remote_page_flush_t *req = kmalloc(sizeof(*req), GFP_KERNEL);
+	struct mm_struct *mm = rc->mm;
+	struct mm_walk walk = {
+		.pte_entry = __do_pte_flush,
+		.mm = mm,
+		.private = req,
+	};
+	struct vm_area_struct *vma;
+	struct wait_station *ws = get_wait_station(current);
+
+	BUG_ON(!req);
+
+	PGPRINTK("FLUSH_REMOTE_PAGES [%d]\n", current->pid);
+
+	req->remote_nid = my_nid;
+	req->remote_pid = current->pid;
+	req->remote_ws = ws->id;
+	req->origin_pid = current->origin_pid;
+	req->addr = 0;
+
+	/* Notify the start synchronously */
+	req->flags = FLUSH_FLAG_START;
+	pcn_kmsg_send(PCN_KMSG_TYPE_REMOTE_PAGE_RELEASE,
+			current->origin_nid, req, sizeof(*req));
+	wait_at_station(ws);
+
+	/* Send pages asynchronously */
+	ws = get_wait_station(current);
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		walk.vma = vma;
+		walk_page_vma(vma, &walk);
+	}
+	up_read(&mm->mmap_sem);
+
+	/* Notify the completion synchronously */
+	req->flags = FLUSH_FLAG_LAST;
+	pcn_kmsg_send(PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH,
+			current->origin_nid, req, sizeof(*req));
+	wait_at_station(ws);
+
+	kfree(req);
+
+	// XXX: make sure there is no backlog.
+	msleep(1000);
+
+	return 0;
+}
+
+static int handle_remote_page_flush_ack(struct pcn_kmsg_message *msg)
+{
+	remote_page_flush_ack_t *req = (remote_page_flush_ack_t *)msg;
+	struct wait_station *ws = wait_station(req->remote_ws);
+
+	complete(&ws->pendings);
+
+	pcn_kmsg_done(req);
+	return 0;
+}
+
+
+/*
+ * Page invalidation protocol
+ */
+static void __do_invalidate_page(struct task_struct *tsk,
+				 page_invalidate_request_t *req)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	struct vm_area_struct *vma;
+	pmd_t *pmd;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	int ret = 0;
+	unsigned long addr = req->addr;
+	struct fault_handle *fh;
+
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	PGPRINTK("\nINVALIDATE_PAGE [%d] %lx [%d/%d]\n", tsk->pid, addr,
+			req->origin_pid, PCN_KMSG_FROM_NID(req));
+
+	pte = __get_pte_at(mm, addr, &pmd, &ptl);
+	if (!pte)
+		goto out;
+
+	spin_lock(ptl);
+	fh = __start_invalidation(tsk, addr, ptl);
+
+	clear_page_owner(my_nid, mm, addr);
+
+	BUG_ON(!pte_present(*pte));
+	entry = ptep_clear_flush(vma, addr, pte);
+	entry = pte_make_invalid(entry);
+
+	set_pte_at_notify(mm, addr, pte, entry);
+	update_mmu_cache(vma, addr, pte);
+
+	__finish_invalidation(fh);
+	pte_unmap_unlock(pte, ptl);
+
+out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+}
+
+static void process_page_invalidate_request(struct work_struct *work)
+{
+	START_KMSG_WORK(page_invalidate_request_t, req, work);
+	page_invalidate_response_t *res;
+	struct task_struct *tsk;
+
+	res = pcn_kmsg_get(sizeof(*res));
+	res->origin_pid = req->origin_pid;
+	res->origin_ws = req->origin_ws;
+	res->remote_pid = req->remote_pid;
+
+	/* Only home issues invalidate requests. Hence, I am a remote */
+	tsk = __get_task_struct(req->remote_pid);
+	if (!tsk) {
+		PGPRINTK("%s: no such process %d %d %lx\n", __func__,
+				req->origin_pid, req->remote_pid, req->addr);
+		pcn_kmsg_put(res);
+		goto out_free;
+	}
+
+	__do_invalidate_page(tsk, req);
+
+	PGPRINTK(">>[%d] ->[%d/%d]\n", req->remote_pid, res->origin_pid,
+			PCN_KMSG_FROM_NID(req));
+	pcn_kmsg_post(PCN_KMSG_TYPE_PAGE_INVALIDATE_RESPONSE,
+			PCN_KMSG_FROM_NID(req), res, sizeof(*res));
+
+	put_task_struct(tsk);
+
+out_free:
+	END_KMSG_WORK(req);
+}
+
+
+static int handle_page_invalidate_response(struct pcn_kmsg_message *msg)
+{
+	page_invalidate_response_t *res = (page_invalidate_response_t *)msg;
+	struct wait_station *ws = wait_station(res->origin_ws);
+
+	if (atomic_dec_and_test(&ws->pendings_count)) {
+		complete(&ws->pendings);
+	}
+
+	pcn_kmsg_done(res);
+	return 0;
+}
+
+
+static void __revoke_page_ownership(struct task_struct *tsk, int nid, pid_t pid,
+				    unsigned long addr, int ws_id)
+{
+	page_invalidate_request_t *req = pcn_kmsg_get(sizeof(*req));
+
+	req->addr = addr;
+	req->origin_pid = tsk->pid;
+	req->origin_ws = ws_id;
+	req->remote_pid = pid;
+
+	PGPRINTK("  [%d] revoke %lx [%d/%d]\n", tsk->pid, addr, pid, nid);
+	pcn_kmsg_post(PCN_KMSG_TYPE_PAGE_INVALIDATE_REQUEST, nid, req, sizeof(*req));
+}
+
+
+/*
+ * Voluntarily release page ownership
+ */
+int process_madvise_release_from_remote(int from_nid, unsigned long start,
+					unsigned long end)
+{
+	struct mm_struct *mm;
+	unsigned long addr;
+	int nr_pages = 0;
+
+	mm = get_task_mm(current);
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		pmd_t *pmd;
+		pte_t *pte;
+		spinlock_t *ptl;
+		pte = __get_pte_at(mm, addr, &pmd, &ptl);
+		if (!pte)
+			continue;
+		spin_lock(ptl);
+		if (!pte_none(*pte)) {
+			clear_page_owner(from_nid, mm, addr);
+			nr_pages++;
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+	mmput(mm);
+	VSPRINTK("  [%d] %d %d / %ld %lx-%lx\n", current->pid, from_nid,
+			nr_pages, (end - start) / PAGE_SIZE, start, end);
+	return 0;
+}
+
+int page_server_release_page_ownership(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	pte_t *pte;
+	pte_t pte_val;
+	spinlock_t *ptl;
+
+	pte = __get_pte_at(mm, addr, &pmd, &ptl);
+	if (!pte)
+		return 0;
+
+	spin_lock(ptl);
+	if (pte_none(*pte) || !pte_present(*pte)) {
+		pte_unmap_unlock(pte, ptl);
+		return 0;
+	}
+
+	clear_page_owner(my_nid, mm, addr);
+	pte_val = ptep_clear_flush(vma, addr, pte);
+	pte_val = pte_make_invalid(pte_val);
+
+	set_pte_at_notify(mm, addr, pte, pte_val);
+	update_mmu_cache(vma, addr, pte);
+	pte_unmap_unlock(pte, ptl);
+	return 1;
+}
+
+
+/*
+ * Handle page faults happened at remote nodes.
+ */
+static int handle_remote_page_response(struct pcn_kmsg_message *msg)
+{
+	remote_page_response_t *res = (remote_page_response_t *)msg;
+	struct wait_station *ws = wait_station(res->origin_ws);
+
+	PGPRINTK("  [%d] <-[%d/%d] %lx %x\n",
+			ws->pid, res->remote_pid, PCN_KMSG_FROM_NID(res),
+			res->addr, res->result);
+	ws->private = res;
+
+	if (atomic_dec_and_test(&ws->pendings_count))
+		complete(&ws->pendings);
+	return 0;
+}
+
+#define TRANSFER_PAGE_WITH_RDMA \
+		pcn_kmsg_has_features(PCN_KMSG_FEATURE_RDMA)
+
+static int __request_remote_page(struct task_struct *tsk, int from_nid,
+				 pid_t from_pid, unsigned long addr,
+				 unsigned long fault_flags, int ws_id,
+				 struct pcn_kmsg_rdma_handle **rh)
+{
+	remote_page_request_t *req;
+
+	*rh = NULL;
+
+	req = pcn_kmsg_get(sizeof(*req));
+	req->addr = addr;
+	req->fault_flags = fault_flags;
+
+	req->origin_pid = tsk->pid;
+	req->origin_ws = ws_id;
+
+	req->remote_pid = from_pid;
+	req->instr_addr = instruction_pointer(current_pt_regs());
+
+	if (TRANSFER_PAGE_WITH_RDMA) {
+		struct pcn_kmsg_rdma_handle *handle =
+				pcn_kmsg_pin_rdma_buffer(NULL, PAGE_SIZE);
+		if (IS_ERR(handle)) {
+			pcn_kmsg_put(req);
+			return PTR_ERR(handle);
+		}
+		*rh = handle;
+		req->rdma_addr = handle->dma_addr;
+		req->rdma_key = handle->rkey;
+	} else {
+		req->rdma_addr = 0;
+		req->rdma_key = 0;
+	}
+
+	PGPRINTK("  [%d] ->[%d/%d] %lx %lx\n", tsk->pid,
+			from_pid, from_nid, addr, req->instr_addr);
+
+	pcn_kmsg_post(PCN_KMSG_TYPE_REMOTE_PAGE_REQUEST,
+			from_nid, req, sizeof(*req));
+	return 0;
+}
+
+static remote_page_response_t *__fetch_page_from_origin(struct task_struct *tsk,
+							struct vm_area_struct *vma,
+							unsigned long addr,
+							unsigned long fault_flags,
+							struct page *page)
+{
+	remote_page_response_t *rp;
+	struct wait_station *ws = get_wait_station(tsk);
+	struct pcn_kmsg_rdma_handle *rh;
+
+	__request_remote_page(tsk, tsk->origin_nid, tsk->origin_pid,
+			addr, fault_flags, ws->id, &rh);
+
+	rp = wait_at_station(ws);
+	if (rp->result == 0) {
+		void *paddr = kmap(page);
+		if (TRANSFER_PAGE_WITH_RDMA) {
+			copy_to_user_page(vma, page, addr, paddr, rh->addr, PAGE_SIZE);
+		} else {
+			copy_to_user_page(vma, page, addr, paddr, rp->page, PAGE_SIZE);
+		}
+		kunmap(page);
+		flush_dcache_page(page);
+		__SetPageUptodate(page);
+	}
+
+	if (rh)
+		pcn_kmsg_unpin_rdma_buffer(rh);
+
+	return rp;
+}
+
+static int __claim_remote_page(struct task_struct *tsk, struct mm_struct *mm,
+			       struct vm_area_struct *vma, unsigned long addr,
+			       unsigned long fault_flags, struct page *page)
+{
+	int peers;
+	unsigned int random = prandom_u32();
+	struct wait_station *ws;
+	struct remote_context *rc = __get_mm_remote(mm);
+	remote_page_response_t *rp;
+	int from, from_nid = 0;
+	/* Read when @from becomes zero and save the nid to @from_nid */
+	int nid;
+	struct pcn_kmsg_rdma_handle *rh = NULL;
+	unsigned long offset;
+	struct page *pip = __get_page_info_page(mm, addr, &offset);
+	unsigned long *pi = (unsigned long *)kmap(pip) + offset;
+	BUG_ON(!pip);
+
+	peers = bitmap_weight(pi, MAX_POPCORN_NODES);
+
+	if (test_bit(my_nid, pi)) {
+		peers--;
+	}
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+	page_server_panic(peers == 0, mm, addr, NULL, __pte(0));
+#endif
+	from = random % peers;
+
+	if (fault_for_read(fault_flags)) {
+		peers = 1;
+	}
+	ws = get_wait_station_multiple(tsk, peers);
+
+	for_each_set_bit(nid, pi, MAX_POPCORN_NODES) {
+		pid_t pid = rc->remote_tgids[nid];
+		if (nid == my_nid) continue;
+		if (from-- == 0) {
+			from_nid = nid;
+			__request_remote_page(tsk, nid, pid, addr, fault_flags, ws->id, &rh);
+		} else {
+			if (fault_for_write(fault_flags)) {
+				clear_bit(nid, pi);
+				__revoke_page_ownership(tsk, nid, pid, addr, ws->id);
+			}
+		}
+		if (--peers == 0) break;
+	}
+
+	rp = wait_at_station(ws);
+
+	if (fault_for_write(fault_flags)) {
+		clear_bit(from_nid, pi);
+	}
+
+	if (rp->result == 0) {
+		void *paddr = kmap(page);
+		if (TRANSFER_PAGE_WITH_RDMA) {
+			copy_to_user_page(vma, page, addr, paddr, rh->addr, PAGE_SIZE);
+		} else {
+			copy_to_user_page(vma, page, addr, paddr, rp->page, PAGE_SIZE);
+		}
+		kunmap(page);
+		flush_dcache_page(page);
+		__SetPageUptodate(page);
+	}
+	pcn_kmsg_done(rp);
+
+	if (rh)
+		pcn_kmsg_unpin_rdma_buffer(rh);
+	__put_task_remote(rc);
+	kunmap(pip);
+	return 0;
+}
+
+static void __claim_local_page(struct task_struct *tsk, unsigned long addr,
+			       int except_nid)
+{
+	struct mm_struct *mm = tsk->mm;
+	unsigned long offset;
+	struct page *pip = __get_page_info_page(mm, addr, &offset);
+	unsigned long *pi;
+	int peers;
+
+	if (!pip) return; /* skip claiming non-distributed page */
+	pi = (unsigned long *)kmap(pip) + offset;
+	peers = bitmap_weight(pi, MAX_POPCORN_NODES);
+	if (!peers) {
+		kunmap(pip);
+		return;	/* skip claiming the page that is not distributed */
+	}
+
+	BUG_ON(!test_bit(except_nid, pi));
+	peers--;	/* exclude except_nid from peers */
+
+	if (test_bit(my_nid, pi) && except_nid != my_nid)
+		peers--;
+
+	if (peers > 0) {
+		int nid;
+		struct remote_context *rc = get_task_remote(tsk);
+		struct wait_station *ws = get_wait_station_multiple(tsk, peers);
+
+		for_each_set_bit(nid, pi, MAX_POPCORN_NODES) {
+			pid_t pid = rc->remote_tgids[nid];
+			if (nid == except_nid || nid == my_nid)
+				continue;
+
+			clear_bit(nid, pi);
+			__revoke_page_ownership(tsk, nid, pid, addr, ws->id);
+		}
+		put_task_remote(tsk);
+
+		wait_at_station(ws);
+	}
+	kunmap(pip);
+}
+
+void page_server_zap_pte(struct vm_area_struct *vma, unsigned long addr,
+			 pte_t *pte, pte_t *pteval)
+{
+	if (!vma->vm_mm->remote) return;
+
+	ClearPageInfo(vma->vm_mm, addr);
+
+	*pteval = pte_make_valid(*pte);
+	*pteval = pte_mkyoung(*pteval);
+	if (ptep_set_access_flags(vma, addr, pte, *pteval, 1)) {
+		update_mmu_cache(vma, addr, pte);
+	}
+#ifdef CONFIG_POPCORN_DEBUG_VERBOSE
+	PGPRINTK("  [%d] zap %lx\n", current->pid, addr);
+#endif
+}
+
+static void __make_pte_valid(struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long addr,
+		unsigned long fault_flags, pte_t *pte)
+{
+	pte_t entry;
+
+	entry = ptep_clear_flush(vma, addr, pte);
+	entry = pte_make_valid(entry);
+
+	if (fault_for_write(fault_flags)) {
+		entry = pte_mkwrite(entry);
+		entry = pte_mkdirty(entry);
+	} else {
+		entry = pte_wrprotect(entry);
+	}
+	entry = pte_mkyoung(entry);
+
+	set_pte_at_notify(mm, addr, pte, entry);
+	update_mmu_cache(vma, addr, pte);
+
+	SetPageDistributed(mm, addr);
+	set_page_owner(my_nid, mm, addr);
+}
+
+
+/*
+ * Remote fault handler at a remote location
+ */
+static int __handle_remotefault_at_remote(struct task_struct *tsk,
+					  struct mm_struct *mm,
+					  struct vm_area_struct *vma,
+					  remote_page_request_t *req,
+					  remote_page_response_t *res)
+{
+	unsigned long addr = req->addr & PAGE_MASK;
+	unsigned fault_flags = req->fault_flags | PC_FAULT_FLAG_REMOTE;
+	unsigned char *paddr;
+	struct page *page;
+
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte;
+	pte_t entry;
+
+	struct fault_handle *fh;
+	bool leader;
+	bool present;
+
+	pte = __get_pte_at(mm, addr, &pmd, &ptl);
+	if (!pte) {
+		PGPRINTK("  [%d] No PTE!!\n", tsk->pid);
+		return VM_FAULT_OOM;
+	}
+
+	spin_lock(ptl);
+	fh = __start_fault_handling(tsk, addr, fault_flags, ptl, &leader);
+	if (!fh) {
+		pte_unmap(pte);
+		return VM_FAULT_LOCKED;
+	}
+
+	if (pte_none(*pte)) {
+		pte_unmap(pte);
+		return VM_FAULT_SIGSEGV;
+	}
+
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+	BUG_ON(!page_is_mine(mm, addr));
+#endif
+
+	spin_lock(ptl);
+	SetPageDistributed(mm, addr);
+	entry = ptep_clear_flush(vma, addr, pte);
+
+	if (fault_for_write(fault_flags)) {
+		clear_page_owner(my_nid, mm, addr);
+		entry = pte_make_invalid(entry);
+	} else {
+		entry = pte_wrprotect(entry);
+	}
+
+	set_pte_at_notify(mm, addr, pte, entry);
+	update_mmu_cache(vma, addr, pte);
+	pte_unmap_unlock(pte, ptl);
+	present = pte_is_present(*pte);
+	page = vm_normal_page(vma, addr, *pte);
+	BUG_ON(!page);
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	if (TRANSFER_PAGE_WITH_RDMA) {
+		paddr = kmap(page);
+		pcn_kmsg_rdma_write(PCN_KMSG_FROM_NID(req),
+				req->rdma_addr, paddr, PAGE_SIZE, req->rdma_key);
+		kunmap(page);
+	} else {
+		paddr = kmap_atomic(page);
+		copy_from_user_page(vma, page, addr, res->page, paddr, PAGE_SIZE);
+		kunmap_atomic(paddr);
+	}
+
+	__finish_fault_handling(fh);
+	return 0;
+}
+
+
+
+/*
+ * Remote fault handler at the origin
+ */
+static int __handle_remotefault_at_origin(struct task_struct *tsk,
+					  struct mm_struct *mm,
+					  struct vm_area_struct *vma,
+					  remote_page_request_t *req,
+					  remote_page_response_t *res)
+{
+	int from_nid = PCN_KMSG_FROM_NID(req);
+	unsigned long addr = req->addr;
+	unsigned long fault_flags = req->fault_flags | PC_FAULT_FLAG_REMOTE;
+	unsigned char *paddr;
+	struct page *page;
+
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	struct fault_handle *fh;
+	bool leader;
+	bool grant = false;
+
+again:
+	pte = __get_pte_at_alloc(mm, vma, addr, &pmd, &ptl);
+	if (!pte) {
+		PGPRINTK("  [%d] No PTE!!\n", tsk->pid);
+		return VM_FAULT_OOM;
+	}
+
+	spin_lock(ptl);
+	if (pte_none(*pte)) {
+		int ret;
+		spin_unlock(ptl);
+		PGPRINTK("  [%d] handle local fault at origin\n", tsk->pid);
+		ret = handle_pte_fault_origin(mm, vma, addr, pte, pmd, fault_flags);
+		/* returned with pte unmapped */
+		if (ret & VM_FAULT_RETRY) {
+			/* mmap_sem is released during do_fault */
+			return VM_FAULT_RETRY;
+		}
+		if (fault_for_write(fault_flags) && !vma_is_anonymous(vma))
+			SetPageCowed(mm, addr);
+		goto again;
+	}
+
+	fh = __start_fault_handling(tsk, addr, fault_flags, ptl, &leader);
+
+	/*
+	 * Indicates the same page is handled at the origin and it might cause
+	 * this node to be blocked recursively. This prevents forming the loop
+	 * by releasing everything from remote.
+	 */
+	if (!fh) {
+		pte_unmap(pte);
+		up_read(&mm->mmap_sem);
+		return VM_FAULT_RETRY;
+	}
+	page = get_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+
+	if (leader) {
+		pte_t entry;
+
+		/* Prepare the page if it is not mine. This should be leader */
+		PGPRINTK(" =[%d] %s%s %p\n",
+				tsk->pid, page_is_mine(mm, addr) ? "origin " : "",
+				test_page_owner(from_nid, mm, addr) ? "remote": "", fh);
+
+		if (test_page_owner(from_nid, mm, addr)) {
+			BUG_ON(fault_for_read(fault_flags) && "Read fault from owner??");
+			__claim_local_page(tsk, addr, from_nid);
+			grant = true;
+		} else {
+			if (!page_is_mine(mm, addr)) {
+				__claim_remote_page(tsk, mm, vma, addr, fault_flags, page);
+			} else {
+				if (fault_for_write(fault_flags))
+					__claim_local_page(tsk, addr, my_nid);
+			}
+		}
+		spin_lock(ptl);
+
+		SetPageDistributed(mm, addr);
+		set_page_owner(from_nid, mm, addr);
+
+		entry = ptep_clear_flush(vma, addr, pte);
+		if (fault_for_write(fault_flags)) {
+			clear_page_owner(my_nid, mm, addr);
+			entry = pte_make_invalid(entry);
+		} else {
+			/* For remote-claimed case */
+			entry = pte_make_valid(entry);
+			entry = pte_wrprotect(entry);
+			set_page_owner(my_nid, mm, addr);
+		}
+		set_pte_at_notify(mm, addr, pte, entry);
+		update_mmu_cache(vma, addr, pte);
+
+		spin_unlock(ptl);
+	}
+	pte_unmap(pte);
+
+	if (!grant) {
+		flush_cache_page(vma, addr, page_to_pfn(page));
+		if (TRANSFER_PAGE_WITH_RDMA) {
+			paddr = kmap(page);
+			pcn_kmsg_rdma_write(PCN_KMSG_FROM_NID(req),
+					req->rdma_addr, paddr, PAGE_SIZE, req->rdma_key);
+			kunmap(page);
+		} else {
+			paddr = kmap_atomic(page);
+			copy_from_user_page(vma, page, addr, res->page, paddr, PAGE_SIZE);
+			kunmap_atomic(paddr);
+		}
+	}
+
+	__finish_fault_handling(fh);
+	return grant ? VM_FAULT_CONTINUE : 0;
+}
+
+
+/*
+ * Entry point to remote fault handler
+ *
+ * To accelerate the ownership grant by skipping transferring page data,
+ * the response might be multiplexed between remote_page_response_short_t and
+ * remote_page_response_t.
+ */
+static void process_remote_page_request(struct work_struct *work)
+{
+	START_KMSG_WORK(remote_page_request_t, req, work);
+	remote_page_response_t *res;
+	int from_nid = PCN_KMSG_FROM_NID(req);
+	struct task_struct *tsk;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int res_size;
+	enum pcn_kmsg_type res_type;
+	int down_read_retry = 0;
+
+	if (TRANSFER_PAGE_WITH_RDMA) {
+		res = pcn_kmsg_get(sizeof(remote_page_response_short_t));
+	} else {
+		res = pcn_kmsg_get(sizeof(*res));
+	}
+
+	do {
+		tsk = __get_task_struct(req->remote_pid);
+		if (!tsk) {
+			res->result = VM_FAULT_SIGBUS;
+			PGPRINTK("  [%d] not found\n", req->remote_pid);
+			break;
+		}
+		mm = get_task_mm(tsk);
+
+		PGPRINTK("\nREMOTE_PAGE_REQUEST [%d] %lx %c %lx from [%d/%d]\n",
+			 req->remote_pid, req->addr,
+			 fault_for_write(req->fault_flags) ? 'W' : 'R',
+			 req->instr_addr, req->origin_pid, from_nid);
+
+		while (!down_read_trylock(&mm->mmap_sem)) {
+			if (!tsk->at_remote && down_read_retry++ > 4) {
+				res->result = VM_FAULT_RETRY;
+				goto out_up;
+			}
+			schedule();
+		}
+		vma = find_vma(mm, req->addr);
+		if (!vma || vma->vm_start > req->addr) {
+			res->result = VM_FAULT_SIGBUS;
+			goto out_up;
+		}
+
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+		BUG_ON(vma->vm_flags & VM_EXEC);
+#endif
+
+		if (tsk->at_remote) {
+			res->result = __handle_remotefault_at_remote(tsk, mm, vma, req, res);
+		} else {
+			res->result = __handle_remotefault_at_origin(tsk, mm, vma, req, res);
+		}
+
+		out_up:
+		if (res->result != VM_FAULT_RETRY) {
+			up_read(&mm->mmap_sem);
+		}
+		mmput(mm);
+		put_task_struct(tsk);
+
+	} while (res->result == VM_FAULT_LOCKED);
+
+	if (res->result != 0 || TRANSFER_PAGE_WITH_RDMA) {
+		res_type = PCN_KMSG_TYPE_REMOTE_PAGE_RESPONSE_SHORT;
+		res_size = sizeof(remote_page_response_short_t);
+	} else {
+		res_type = PCN_KMSG_TYPE_REMOTE_PAGE_RESPONSE;
+		res_size = sizeof(remote_page_response_t);
+	}
+	res->addr = req->addr;
+	res->remote_pid = req->remote_pid;
+
+	res->origin_pid = req->origin_pid;
+	res->origin_ws = req->origin_ws;
+
+	PGPRINTK("  [%d] ->[%d/%d] %x\n", req->remote_pid,
+			res->origin_pid, from_nid, res->result);
+
+	trace_pgfault(from_nid, req->remote_pid,
+			fault_for_write(req->fault_flags) ? 'W' : 'R',
+			req->instr_addr, req->addr, res->result);
+
+	pcn_kmsg_post(res_type, from_nid, res, res_size);
+
+	END_KMSG_WORK(req);
+}
+
+/*
+ * Exclusively keep a user page to the current node. Should put the user
+ * page after use. This routine is similar to localfault handler at origin
+ * thus may be refactored.
+ */
+int page_server_get_userpage(u32 __user *uaddr, struct fault_handle **handle,
+			     char *mode)
+{
+	unsigned long addr = (unsigned long)uaddr & PAGE_MASK;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+
+	const unsigned long fault_flags = 0;
+	struct fault_handle *fh = NULL;
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	bool leader;
+	int ret = 0;
+
+	*handle = NULL;
+	if (!distributed_process(current))
+		return 0;
+
+	mm = get_task_mm(current);
+retry:
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	pte = __get_pte_at(mm, addr, &pmd, &ptl);
+	if (!pte) {
+		ret = -EINVAL;
+		goto out;
+	}
+	spin_lock(ptl);
+	fh = __start_fault_handling(current, addr, fault_flags, ptl, &leader);
+
+	if (!fh) {
+		pte_unmap(pte);
+		up_read(&mm->mmap_sem);
+		io_schedule();
+		goto retry;
+	}
+
+	if (leader && !page_is_mine(mm, addr)) {
+		struct page *page = get_normal_page(vma, addr, pte);
+		__claim_remote_page(current, mm, vma, addr, fault_flags, page);
+
+		spin_lock(ptl);
+		__make_pte_valid(mm, vma, addr, fault_flags, pte);
+		spin_unlock(ptl);
+	}
+	pte_unmap(pte);
+	ret = 0;
+
+out:
+	*handle = fh;
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
+
+void page_server_put_userpage(struct fault_handle *fh, char *mode)
+{
+	if (!fh)
+		return;
+
+	__finish_fault_handling(fh);
+}
+
+
+/*
+ * Local fault handler at the remote
+ */
+static int __handle_localfault_at_remote(struct vm_fault *vmf)
+{
+	spinlock_t *ptl;
+	struct page *page;
+	bool populated = false;
+	struct mem_cgroup *memcg;
+	int ret = 0;
+
+	struct fault_handle *fh;
+	bool leader;
+	remote_page_response_t *rp;
+	unsigned long addr = vmf->address & PAGE_MASK;
+	bool present;
+
+	if (anon_vma_prepare(vmf->vma)) {
+		BUG_ON("Cannot prepare vma for anonymous page");
+		pte_unmap(vmf->pte);
+		return VM_FAULT_SIGBUS;
+	}
+
+	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	spin_lock(ptl);
+
+	if (!vmf->pte) {
+		vmf->pte = pte_alloc_map(vmf->vma->vm_mm, vmf->pmd, vmf->address);
+	}
+
+	/* setup and populate pte entry */
+	if (!pte_same(*vmf->pte, vmf->orig_pte)) {
+		pte_unmap_unlock(vmf->pte, ptl);
+		PGPRINTK("  [%d] %lx already handled\n", current->pid, addr);
+		return 0;
+	}
+	fh = __start_fault_handling(current, addr, vmf->flags, ptl, &leader);
+	if (!fh) {
+		pte_unmap(vmf->pte);
+		up_read(&vmf->vma->vm_mm->mmap_sem);
+		return VM_FAULT_RETRY;
+	}
+
+	PGPRINTK(" %c[%d] %lx %p\n", leader ? '=' : '-', current->pid, addr, fh);
+	if (!leader) {
+		pte_unmap(vmf->pte);
+		ret = fh->ret;
+		if (ret) up_read(&vmf->vma->vm_mm->mmap_sem);
+		goto out_follower;
+	}
+
+	present = pte_is_present(*vmf->pte);
+	if (!present) {
+		pte_make_valid(*vmf->pte);
+	}
+	page = vm_normal_page(vmf->vma, addr, *vmf->pte);
+	if (!present) {
+		pte_make_invalid(*vmf->pte);
+	}
+	if (pte_none(*vmf->pte) || !page) {
+		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vmf->vma, addr);
+		BUG_ON(!page);
+
+		if (mem_cgroup_try_charge(page, vmf->vma->vm_mm, GFP_KERNEL, &memcg, false)) {
+			BUG();
+		}
+		populated = true;
+	}
+
+	get_page(page);
+
+	rp = __fetch_page_from_origin(current, vmf->vma, addr, vmf->flags, page);
+
+	if (rp->result && rp->result != VM_FAULT_CONTINUE) {
+		if (rp->result != VM_FAULT_RETRY)
+			PGPRINTK("  [%d] failed 0x%x\n", current->pid, rp->result);
+		ret = rp->result;
+		pte_unmap(vmf->pte);
+		up_read(&vmf->vma->vm_mm->mmap_sem);
+		goto out_free;
+	}
+
+	if (rp->result == VM_FAULT_CONTINUE) {
+		/*
+		 * Page ownership is granted without transferring the page data
+		 * since this node already owns the up-to-dated page
+		 */
+		pte_t entry;
+		BUG_ON(populated);
+
+		spin_lock(ptl);
+		entry = pte_make_valid(*vmf->pte);
+		if (fault_for_write(vmf->flags)) {
+			entry = pte_mkwrite(entry);
+			entry = pte_mkdirty(entry);
+		} else {
+			entry = pte_wrprotect(entry);
+		}
+		entry = pte_mkyoung(entry);
+
+		if (ptep_set_access_flags(vmf->vma, addr, vmf->pte, entry, 1)) {
+			update_mmu_cache(vmf->vma, addr, vmf->pte);
+		}
+	} else {
+		spin_lock(ptl);
+		if (populated) {
+			alloc_set_pte(vmf, memcg, page);
+		} else {
+			__make_pte_valid(vmf->vma->vm_mm, vmf->vma, addr, vmf->flags, vmf->pte);
+		}
+	}
+	SetPageDistributed(vmf->vma->vm_mm, addr);
+	set_page_owner(my_nid, vmf->vma->vm_mm, addr);
+	pte_unmap_unlock(vmf->pte, ptl);
+	ret = 0;	/* The leader squashes both 0 and VM_FAULT_CONTINUE to 0 */
+
+out_free:
+	put_page(page);
+	pcn_kmsg_done(rp);
+	fh->ret = ret;
+
+out_follower:
+	__finish_fault_handling(fh);
+	return ret;
+}
+
+static bool __handle_copy_on_write(struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long addr,
+		pte_t *pte, pte_t *pte_val, unsigned int fault_flags)
+{
+	if (vma_is_anonymous(vma) || fault_for_read(fault_flags)) return false;
+	BUG_ON(vma->vm_flags & VM_SHARED);
+
+	/*
+	 * We need to determine whether the page is already cowed or not to
+	 * avoid unnecessary cows. But there is no explicit data structure that
+	 * bookkeeping such information. Also, explicitly tracking every CoW
+	 * including non-distributed processes is not desirable due to the
+	 * high frequency of CoW.
+	 * Fortunately, private vma is not flushed, implying the PTE dirty bit
+	 * is not cleared but kept throughout its lifetime. If the dirty bit is
+	 * set for a page, the page is written previously, which implies the page
+	 * is CoWed!!!
+	 */
+	if (pte_dirty(*pte_val)) return false;
+
+	if (PageCowed(mm, addr)) return false;
+
+	if (cow_file_at_origin(mm, vma, addr, pte)) return false;
+
+	*pte_val = *pte;
+	SetPageCowed(mm, addr);
+
+	return true;
+}
+
+/*
+ * Local fault handler at the origin
+ */
+static int __handle_localfault_at_origin(struct vm_fault *vmf)
+{
+	spinlock_t *ptl;
+	unsigned long addr = vmf->address & PAGE_MASK;
+	struct fault_handle *fh;
+	bool leader;
+
+	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+	spin_lock(ptl);
+
+	if (!vmf->pte) {
+		spin_unlock(ptl);
+		PGPRINTK("  [%d] %lx fresh at origin, continue\n", current->pid, addr);
+		return VM_FAULT_CONTINUE;
+	}
+
+	if (!pte_same(*vmf->pte, vmf->orig_pte)) {
+		pte_unmap_unlock(vmf->pte, ptl);
+		PGPRINTK("  [%d] %lx already handled\n", current->pid, addr);
+		return 0;
+	}
+
+	/* Fresh access to the address. Handle locally since we are at the origin */
+	if (pte_none(vmf->orig_pte)) {
+		BUG_ON(pte_present(vmf->orig_pte));
+		spin_unlock(ptl);
+		PGPRINTK("  [%d] fresh at origin. continue\n", current->pid);
+		return VM_FAULT_CONTINUE;
+	}
+
+	/* Nothing to do with DSM (e.g. COW). Handle locally */
+	if (!PageDistributed(vmf->vma->vm_mm, addr)) {
+		spin_unlock(ptl);
+		PGPRINTK("  [%d] local at origin. continue\n", current->pid);
+		return VM_FAULT_CONTINUE;
+	}
+
+	fh = __start_fault_handling(current, addr, vmf->flags, ptl, &leader);
+	if (!fh) {
+		pte_unmap(vmf->pte);
+		up_read(&vmf->vma->vm_mm->mmap_sem);
+		return VM_FAULT_RETRY;
+	}
+
+	/* Handle replicated page via the memory consistency protocol */
+	PGPRINTK(" %c[%d] %lx replicated %smine %p\n",
+			leader ? '=' : ' ', current->pid, addr,
+			page_is_mine(vmf->vma->vm_mm, addr) ? "" : "not ", fh);
+
+	if (!leader) {
+		pte_unmap(vmf->pte);
+		goto out_wakeup;
+	}
+
+	__handle_copy_on_write(vmf->vma->vm_mm, vmf->vma, addr, vmf->pte, &(vmf->orig_pte), vmf->flags);
+
+	if (page_is_mine(vmf->vma->vm_mm, addr)) {
+		if (fault_for_read(vmf->flags)) {
+			/* Racy exit */
+			pte_unmap(vmf->pte);
+			goto out_wakeup;
+		}
+
+		__claim_local_page(current, addr, my_nid);
+
+		spin_lock(ptl);
+		vmf->orig_pte = pte_mkwrite(vmf->orig_pte);
+		vmf->orig_pte = pte_mkdirty(vmf->orig_pte);
+		vmf->orig_pte = pte_mkyoung(vmf->orig_pte);
+
+		if (ptep_set_access_flags(vmf->vma, addr, vmf->pte, vmf->orig_pte, 1)) {
+			update_mmu_cache(vmf->vma, addr, vmf->pte);
+		}
+	} else {
+		struct page *page;
+		bool present;
+		present = pte_is_present(vmf->orig_pte);
+		if (!present) {
+			pte_make_valid(vmf->orig_pte);
+		}
+		page = vm_normal_page(vmf->vma, addr, vmf->orig_pte);
+		if (!present) {
+			pte_make_invalid(vmf->orig_pte);
+		}
+		BUG_ON(!page);
+
+		__claim_remote_page(current, vmf->vma->vm_mm, vmf->vma, addr, vmf->flags, page);
+
+		spin_lock(ptl);
+		__make_pte_valid(vmf->vma->vm_mm, vmf->vma, addr, vmf->flags, vmf->pte);
+	}
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+	BUG_ON(!test_page_owner(my_nid, vmf->vma->vm_mm, addr));
+#endif
+	pte_unmap_unlock(vmf->pte, ptl);
+
+out_wakeup:
+	__finish_fault_handling(fh);
+
+	return 0;
+}
+
+/*
+ * Function:
+ *	page_server_handle_pte_fault
+ *
+ * Description:
+ *	Handle PTE faults with Popcorn page replication protocol.
+ *  down_read(&mm->mmap_sem) is already held when getting in.
+ *  DO NOT FORGET to unmap pte before returning non-VM_FAULT_CONTINUE.
+ *
+ * Input:
+ *	All are from the PTE handler
+ *
+ * Return values:
+ *	VM_FAULT_CONTINUE when the page fault can be handled locally.
+ *	0 if the fault is fetched remotely and fixed.
+ *  ERROR otherwise
+ */
+int page_server_handle_pte_fault(struct vm_fault *vmf)
+{
+	unsigned long addr = vmf->address & PAGE_MASK;
+	int ret = 0;
+
+	might_sleep();
+
+	PGPRINTK("\n## PAGEFAULT [%d] %lx %c %lx %x %lx\n",
+			current->pid, vmf->address,
+			fault_for_write(vmf->flags) ? 'W' : 'R',
+			instruction_pointer(current_pt_regs()),
+		 vmf->flags, pte_flags(vmf->orig_pte));
+
+	/*
+	 * Thread at the origin
+	 */
+	if (!current->at_remote) {
+		ret = __handle_localfault_at_origin(vmf);
+		goto out;
+	}
+
+	/*
+	 * Thread running at a remote
+	 *
+	 * Fault handling at the remote side is simpler than at the origin.
+	 * There will be no copy-on-write case at the remote since no thread
+	 * creation is allowed at the remote side.
+	 */
+	if (pte_none(vmf->orig_pte)) {
+		/* Can we handle the fault locally? */
+		if (vmf->vma->vm_flags & VM_EXEC) {
+			PGPRINTK("  [%d] VM_EXEC. continue\n", current->pid);
+			ret = VM_FAULT_CONTINUE;
+			goto out;
+		}
+		if (!vma_is_anonymous(vmf->vma) &&
+				((vmf->vma->vm_flags & (VM_WRITE | VM_SHARED)) == 0)) {
+			PGPRINTK("  [%d] locally file-mapped read-only. continue\n",
+					current->pid);
+			ret = VM_FAULT_CONTINUE;
+			goto out;
+		}
+	}
+
+	if (!pte_present(vmf->orig_pte)) {
+		/* Remote page fault */
+		ret = __handle_localfault_at_remote(vmf);
+		goto out;
+	}
+
+	if ((vmf->vma->vm_flags & VM_WRITE) &&
+			fault_for_write(vmf->flags) && !pte_write(vmf->orig_pte)) {
+		/* wr-protected for keeping page consistency */
+		ret = __handle_localfault_at_remote(vmf);
+		goto out;
+	}
+
+	pte_unmap(vmf->pte);
+	PGPRINTK("  [%d] might be fixed by others???\n", current->pid);
+	ret = 0;
+
+out:
+	trace_pgfault(my_nid, current->pid,
+			fault_for_write(vmf->flags) ? 'W' : 'R',
+			instruction_pointer(current_pt_regs()), addr, ret);
+
+	return ret;
+}
+
+
+/*
+ * Routing popcorn messages to workers
+ */
+DEFINE_KMSG_WQ_HANDLER(remote_page_request);
+DEFINE_KMSG_WQ_HANDLER(page_invalidate_request);
+DEFINE_KMSG_ORDERED_WQ_HANDLER(remote_page_flush);
+
+int __init page_server_init(void)
+{
+	REGISTER_KMSG_WQ_HANDLER(
+			PCN_KMSG_TYPE_REMOTE_PAGE_REQUEST, remote_page_request);
+	REGISTER_KMSG_HANDLER(
+			PCN_KMSG_TYPE_REMOTE_PAGE_RESPONSE, remote_page_response);
+	REGISTER_KMSG_HANDLER(
+			PCN_KMSG_TYPE_REMOTE_PAGE_RESPONSE_SHORT, remote_page_response);
+	REGISTER_KMSG_WQ_HANDLER(
+			PCN_KMSG_TYPE_PAGE_INVALIDATE_REQUEST, page_invalidate_request);
+	REGISTER_KMSG_HANDLER(
+			PCN_KMSG_TYPE_PAGE_INVALIDATE_RESPONSE, page_invalidate_response);
+	REGISTER_KMSG_WQ_HANDLER(
+			PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH, remote_page_flush);
+	REGISTER_KMSG_WQ_HANDLER(
+			PCN_KMSG_TYPE_REMOTE_PAGE_RELEASE, remote_page_flush);
+	REGISTER_KMSG_HANDLER(
+			PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH_ACK, remote_page_flush_ack);
+
+	__fault_handle_cache = kmem_cache_create("fault_handle",
+			sizeof(struct fault_handle), 0, 0, NULL);
+
+	return 0;
+}
diff --git a/kernel/popcorn/page_server.h b/kernel/popcorn/page_server.h
new file mode 100644
index 000000000..49eb1022a
--- /dev/null
+++ b/kernel/popcorn/page_server.h
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef __INTERNAL_PAGE_SERVER_H__
+#define __INTERNAL_PAGE_SERVER_H__
+
+#include <popcorn/page_server.h>
+
+/*
+ * Flush pages in remote to the origin
+ */
+int page_server_flush_remote_pages(struct remote_context *rc);
+
+void free_remote_context_pages(struct remote_context *rc);
+int process_madvise_release_from_remote(int from_nid, unsigned long start,
+					unsigned long end);
+
+#endif /* __INTERNAL_PAGE_SERVER_H_ */
diff --git a/kernel/popcorn/pgtable.h b/kernel/popcorn/pgtable.h
new file mode 100644
index 000000000..b266b14f8
--- /dev/null
+++ b/kernel/popcorn/pgtable.h
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+#ifndef __KERNEL_POPCORN_PGTABLE_H__
+#define __KERNEL_POPCORN_PGTABLE_H__
+
+#include <asm/pgtable.h>
+
+#ifdef CONFIG_X86
+static inline pte_t pte_make_invalid(pte_t entry)
+{
+	entry = pte_modify(entry, __pgprot(pte_flags(entry) & ~_PAGE_PRESENT));
+
+	return entry;
+}
+
+static inline pte_t pte_make_valid(pte_t entry)
+{
+	entry = pte_modify(entry, __pgprot(pte_flags(entry) | _PAGE_PRESENT));
+
+	return entry;
+}
+
+static inline bool pte_is_present(pte_t entry)
+{
+	return (pte_val(entry) & _PAGE_PRESENT) == _PAGE_PRESENT;
+}
+
+#else
+#error "The architecture is not supported yet."
+#endif
+
+#endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 9/9] Add Popcorn Message Layer and socket support
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
                     ` (7 preceding siblings ...)
  2020-04-29 19:32   ` [RFC 8/9] Page " Javier Malave
@ 2020-04-29 19:32   ` Javier Malave
  2020-05-07 17:46   ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Pavel Machek
  9 siblings, 0 replies; 11+ messages in thread
From: Javier Malave @ 2020-04-29 19:32 UTC (permalink / raw)
  To: bx; +Cc: Javier Malave, linux-kernel, ah

This patch adds the Popcorn Message layer. It allows for Popcorn 
functionality over a TCP/IP network. The network is created when the 
module is inserted on all nodes, using a list of IP addresses located in 
/etc/popcorn/nodes. Each node ID is defined by the order of these IP 
addresses. 

All Popcorn work is handled via work queues. Popcorn uses two work queues: 
popcorn_ordered_wq and popcorn_wq. Each Popcorn server module (vma server, 
page server, process server) registers handlers with the Popcorn kernel 
message module. These handlers are in charge of posting work to the work
queues.

The message socket receive handler delegates the Popcorn kernel message 
module to call upon the proper work queue handlers. Moreover, the Popcorn 
kernel message module is responsible for posting and processing Popcorn 
messages. It forwards requests and responses for the different Popcorn
server modules invoking the appropriate handlers for each message type.  
---
 drivers/msg_layer/Kconfig  |  28 ++
 drivers/msg_layer/Makefile |   2 +
 drivers/msg_layer/common.h |  63 ++++
 drivers/msg_layer/socket.c | 710 +++++++++++++++++++++++++++++++++++++
 include/popcorn/pcn_kmsg.h | 205 +++++++++++
 kernel/popcorn/pcn_kmsg.c  | 231 ++++++++++++
 6 files changed, 1239 insertions(+)
 create mode 100644 drivers/msg_layer/Kconfig
 create mode 100644 drivers/msg_layer/Makefile
 create mode 100644 drivers/msg_layer/common.h
 create mode 100644 drivers/msg_layer/socket.c
 create mode 100644 include/popcorn/pcn_kmsg.h
 create mode 100644 kernel/popcorn/pcn_kmsg.c

diff --git a/drivers/msg_layer/Kconfig b/drivers/msg_layer/Kconfig
new file mode 100644
index 000000000..f5baf6d6e
--- /dev/null
+++ b/drivers/msg_layer/Kconfig
@@ -0,0 +1,28 @@
+if POPCORN
+
+config POPCORN_KMSG
+	bool "Inter-kernel messaging layer"
+	default y
+	help
+	  Enable support for various inter-kernel message passing
+	  implementations
+
+if POPCORN_KMSG
+
+# Socket over Ethernet
+config POPCORN_KMSG_SOCKET
+	tristate "Over TCP/IP (DANGEROUS)"
+	depends on INET && m
+	default m
+	help
+	  Send Popcorn messages through TCP/IP sockets
+
+# Debuging
+config POPCORN_DEBUG_MSG_LAYER
+	bool "Print debug messages"
+	depends on POPCORN_DEBUG
+	default n
+
+endif # POPCORN_KMSG
+
+endif # POPCORN
diff --git a/drivers/msg_layer/Makefile b/drivers/msg_layer/Makefile
new file mode 100644
index 000000000..5c8ffbefd
--- /dev/null
+++ b/drivers/msg_layer/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_POPCORN_KMSG_SOCKET) := msg_socket.o
+msg_socket-y := socket.o ring_buffer.o
diff --git a/drivers/msg_layer/common.h b/drivers/msg_layer/common.h
new file mode 100644
index 000000000..352f8d9f3
--- /dev/null
+++ b/drivers/msg_layer/common.h
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+
+#ifndef _MSG_LAYER_COMMON_H_
+#define _MSG_LAYER_COMMON_H_
+
+#include <popcorn/pcn_kmsg.h>
+#include <popcorn/bundle.h>
+#include <popcorn/debug.h>
+
+#include <linux/inet.h>
+#include <linux/inetdevice.h>
+#include <linux/netdevice.h>
+
+#define MAX_NUM_NODES	32
+
+static uint32_t ip_table[MAX_NUM_NODES] = { 0 };
+static uint32_t max_nodes = MAX_NUM_NODES;
+
+static uint32_t __init __get_host_ip(void)
+{
+	struct net_device *d;
+	for_each_netdev(&init_net, d) {
+		struct in_ifaddr *ifaddr;
+
+		for (ifaddr = d->ip_ptr->ifa_list; ifaddr; ifaddr = ifaddr->ifa_next) {
+			int i;
+			uint32_t addr = ifaddr->ifa_local;
+			for (i = 0; i < max_nodes; i++) {
+				if (addr == ip_table[i]) {
+					return addr;
+				}
+			}
+		}
+	}
+	return -1;
+}
+
+bool __init identify_myself(void)
+{
+	int i;
+	uint32_t my_ip;
+
+	PCNPRINTK("Loading node configuration...");
+
+	my_ip = __get_host_ip();
+
+	for (i = 0; i < max_nodes; i++) {
+		char *me = " ";
+		if (my_ip == ip_table[i]) {
+			my_nid = i;
+			me = "*";
+		}
+		PCNPRINTK("%s %d: %pI4\n", me, i, ip_table + i);
+	}
+
+	if (my_nid < 0) {
+		PCNPRINTK_ERR("My IP is not listed in the node configuration\n");
+		return false;
+	}
+
+	return true;
+}
+#endif
diff --git a/drivers/msg_layer/socket.c b/drivers/msg_layer/socket.c
new file mode 100644
index 000000000..80a172c6f
--- /dev/null
+++ b/drivers/msg_layer/socket.c
@@ -0,0 +1,710 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+/*
+ * /drivers/msg_layer/socket.c
+ *
+ * Messaging transport layer over TCP/IP
+ *
+ *  author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ *  Narf Industries 2020 (modifications for upstream RFC)
+ *  author, Ho-Ren (Jack) Chuang <horenc@vt.edu>
+ *  author, Sang-Hoon Kim <sanghoon@vt.edu>
+ */
+#include <linux/kthread.h>
+#include <popcorn/stat.h>
+#include <linux/module.h>
+#include <linux/inet.h>
+#include <linux/string.h>
+#include <linux/fs.h>
+#include "ring_buffer.h"
+#include "common.h"
+
+#define PORT 30467
+#define MAX_SEND_DEPTH	1024
+
+#define CONFIG_FILE_LEN	256
+#define CONFIG_FILE_PATH	"/etc/popcorn/nodes"
+#define CONFIG_FILE_CHUNK_SIZE	512
+
+enum {
+	SEND_FLAG_POSTED = 0,
+};
+
+struct q_item {
+	struct pcn_kmsg_message *msg;
+	unsigned long flags;
+	struct completion *done;
+};
+
+/* Per-node handle for socket */
+struct sock_handle {
+	int nid;
+
+	/* Ring buffer for queueing outbound messages */
+	struct q_item *msg_q;
+	unsigned long q_head;
+	unsigned long q_tail;
+	spinlock_t q_lock;
+	struct semaphore q_empty;
+	struct semaphore q_full;
+
+	struct socket *sock;
+	struct task_struct *send_handler;
+	struct task_struct *recv_handler;
+};
+static struct sock_handle sock_handles[MAX_NUM_NODES] = {};
+
+static struct socket *sock_listen = NULL;
+static struct ring_buffer send_buffer = {};
+
+static char config_file_path[CONFIG_FILE_LEN];
+
+/*
+ * Handle inbound messages
+ */
+static int ksock_recv(struct socket *sock, char *buf, size_t len)
+{
+	struct msghdr msg = {
+		.msg_flags = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_name = NULL,
+		.msg_namelen = 0,
+	};
+	struct kvec iov = {
+		.iov_base = buf,
+		.iov_len = len,
+	};
+
+	return kernel_recvmsg(sock, &msg, &iov, 1, len, MSG_WAITALL);
+}
+
+static int recv_handler(void* arg0)
+{
+	struct sock_handle *sh = arg0;
+	MSGPRINTK("RECV handler for %d is ready\n", sh->nid);
+
+	while (!kthread_should_stop()) {
+		int len;
+		int ret;
+		size_t offset;
+		struct pcn_kmsg_hdr header;
+		char *data;
+
+		/* compose header */
+		offset = 0;
+		len = sizeof(header);
+		while (len > 0) {
+			ret = ksock_recv(sh->sock, (char *)(&header) + offset, len);
+			if (ret == -1 || kthread_should_stop() ) {
+				return 0;
+			}
+			offset += ret;
+			len -= ret;
+		}
+		if (ret < 0)
+			break;
+
+#ifdef CONFIG_POPCORN_CHECK_SANITY
+		BUG_ON(header.type < 0 || header.type >= PCN_KMSG_TYPE_MAX);
+		BUG_ON(header.size < 0 || header.size >  PCN_KMSG_MAX_SIZE);
+#endif
+
+		/* compose body */
+		data = kmalloc(header.size, GFP_KERNEL);
+		BUG_ON(!data && "Unable to alloc a message");
+
+		memcpy(data, &header, sizeof(header));
+
+		offset = sizeof(header);
+		len = header.size - offset;
+
+		while (len > 0) {
+			ret = ksock_recv(sh->sock, data + offset, len);
+			if (ret == -1 || kthread_should_stop() ) {
+				return 0;
+			}
+			offset += ret;
+			len -= ret;
+		}
+		if (ret < 0)
+			break;
+
+		/* Call pcn_kmsg upper layer */
+		pcn_kmsg_process((struct pcn_kmsg_message *)data);
+	}
+	return 0;
+}
+
+
+/*
+ * Handle outbound messages
+ */
+static int ksock_send(struct socket *sock, char *buf, size_t len)
+{
+	struct msghdr msg = {
+		.msg_flags = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_name = NULL,
+		.msg_namelen = 0,
+	};
+	struct kvec iov = {
+		.iov_base = buf,
+		.iov_len = len,
+	};
+
+	return kernel_sendmsg(sock, &msg, &iov, 1, len);
+}
+
+static int enq_send(int dest_nid, struct pcn_kmsg_message *msg,
+		    unsigned long flags, struct completion *done)
+{
+	int ret;
+	unsigned long at;
+	struct sock_handle *sh = sock_handles + dest_nid;
+	struct q_item *qi;
+
+	if (!sh)
+		return -1;
+
+	do {
+		ret = down_interruptible(&sh->q_full);
+
+		/* Return if sleep is interrupted by a signal */
+		if (ret == -EINTR)
+			return -1;
+	} while (ret);
+
+	spin_lock(&sh->q_lock);
+	at = sh->q_tail;
+	qi = sh->msg_q + at;
+	sh->q_tail = (at + 1) & (MAX_SEND_DEPTH - 1);
+
+	qi->msg = msg;
+	qi->flags = flags;
+	qi->done = done;
+	spin_unlock(&sh->q_lock);
+	up(&sh->q_empty);
+
+	return at;
+}
+
+void sock_kmsg_put(struct pcn_kmsg_message *msg);
+
+static int deq_send(struct sock_handle *sh)
+{
+	int ret;
+	char *p;
+	unsigned long from;
+	size_t remaining;
+	struct pcn_kmsg_message *msg;
+	struct q_item *qi;
+	unsigned long flags;
+	struct completion *done;
+
+	do {
+		ret = down_interruptible(&sh->q_empty);
+
+		/* Return if sleep is interrupted by a signal */
+		if (ret == -EINTR || kthread_should_stop() )
+			return 0;
+	} while (ret);
+
+	spin_lock(&sh->q_lock);
+	from = sh->q_head;
+	qi = sh->msg_q + from;
+	sh->q_head = (from + 1) & (MAX_SEND_DEPTH - 1);
+
+	msg = qi->msg;
+	flags = qi->flags;
+	done = qi->done;
+	spin_unlock(&sh->q_lock);
+	up(&sh->q_full);
+
+	p = (char *)msg;
+	remaining = msg->header.size;
+
+	while (remaining > 0) {
+		int sent = ksock_send(sh->sock, p, remaining);
+		if (sent < 0) {
+			io_schedule();
+			continue;
+		}
+		p += sent;
+		remaining -= sent;
+	}
+	if (test_bit(SEND_FLAG_POSTED, &flags)) {
+		sock_kmsg_put(msg);
+	}
+	if (done) complete(done);
+
+	return 0;
+}
+
+static int send_handler(void* arg0)
+{
+	struct sock_handle *sh = arg0;
+	MSGPRINTK("SEND handler for %d is ready\n", sh->nid);
+
+	while (!kthread_should_stop()) {
+		deq_send(sh);
+	}
+	kfree(sh->msg_q);
+	return 0;
+}
+
+
+#define WORKAROUND_POOL
+/* Manage send buffer */
+struct pcn_kmsg_message *sock_kmsg_get(size_t size)
+{
+	struct pcn_kmsg_message *msg;
+	might_sleep();
+
+#ifdef WORKAROUND_POOL
+	msg = kmalloc(size, GFP_KERNEL);
+#else
+	while (!(msg = ring_buffer_get(&send_buffer, size))) {
+		WARN_ON_ONCE("ring buffer is full\n");
+		schedule();
+	}
+#endif
+	return msg;
+}
+
+void sock_kmsg_put(struct pcn_kmsg_message *msg)
+{
+#ifdef WORKAROUND_POOL
+	kfree(msg);
+#else
+	ring_buffer_put(&send_buffer, msg);
+#endif
+}
+
+
+/* This is the interface for message layer */
+int sock_kmsg_send(int dest_nid, struct pcn_kmsg_message *msg, size_t size)
+{
+	int ret;
+
+	DECLARE_COMPLETION_ONSTACK(done);
+	ret = enq_send(dest_nid, msg, 0, &done);
+
+	if (ret != -1) {
+		if (!try_wait_for_completion(&done)) {
+			int ret = wait_for_completion_io_timeout(&done, 60 * HZ);
+			if (!ret)
+				return -EAGAIN;
+		}
+	}
+	return 0;
+}
+
+int sock_kmsg_post(int dest_nid, struct pcn_kmsg_message *msg, size_t size)
+{
+	enq_send(dest_nid, msg, 1 << SEND_FLAG_POSTED, NULL);
+	return 0;
+}
+
+void sock_kmsg_done(struct pcn_kmsg_message *msg)
+{
+	kfree(msg);
+}
+
+void sock_kmsg_stat(struct seq_file *seq, void *v)
+{
+	if (seq) {
+		seq_printf(seq, POPCORN_STAT_FMT,
+				(unsigned long long)ring_buffer_usage(&send_buffer),
+				0ULL,
+				"socket");
+	}
+}
+
+struct pcn_kmsg_transport transport_socket = {
+	.name = "socket",
+	.features = 0,
+
+	.get = sock_kmsg_get,
+	.put = sock_kmsg_put,
+	.stat = sock_kmsg_stat,
+
+	.send = sock_kmsg_send,
+	.post = sock_kmsg_post,
+	.done = sock_kmsg_done,
+};
+
+
+static struct task_struct * __init __start_handler(const int nid, const char *type,
+						   int (*handler)(void *data))
+{
+	char name[40];
+	struct task_struct *tsk;
+
+	sprintf(name, "pcn_%s_%d", type, nid);
+	tsk = kthread_run(handler, sock_handles + nid, name);
+	if (IS_ERR(tsk)) {
+		MSGPRINTK(KERN_ERR "Cannot create %s handler, %ld\n", name, PTR_ERR(tsk));
+		return tsk;
+	}
+
+	return tsk;
+}
+
+static int __init __start_handlers(const int nid)
+{
+	struct task_struct *tsk_send, *tsk_recv;
+	tsk_send = __start_handler(nid, "send", send_handler);
+	if (IS_ERR(tsk_send)) {
+		return PTR_ERR(tsk_send);
+	}
+
+	tsk_recv = __start_handler(nid, "recv", recv_handler);
+	if (IS_ERR(tsk_recv)) {
+		kthread_stop(tsk_send);
+		return PTR_ERR(tsk_recv);
+	}
+	sock_handles[nid].send_handler = tsk_send;
+	sock_handles[nid].recv_handler = tsk_recv;
+	return 0;
+}
+
+static int __init __connect_to_server(int nid)
+{
+	int ret;
+	struct sockaddr_in addr;
+	struct socket *sock;
+
+	ret = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+	if (ret < 0) {
+		MSGPRINTK("Failed to create socket, %d\n", ret);
+		return ret;
+	}
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(PORT);
+	addr.sin_addr.s_addr = ip_table[nid];
+
+	MSGPRINTK("Connecting to %d at %pI4\n", nid, ip_table + nid);
+	do {
+		ret = kernel_connect(sock, (struct sockaddr *)&addr, sizeof(addr), 0);
+		if (ret < 0) {
+			MSGPRINTK("Failed to connect the socket %d. Attempt again!!\n", ret);
+			msleep(1000);
+		}
+	} while (ret < 0);
+
+	sock_handles[nid].sock = sock;
+	ret = __start_handlers(nid);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int __init __accept_client(int *nid)
+{
+	int i;
+	int ret;
+	int retry = 0;
+	bool found = false;
+	struct socket *sock;
+	struct sockaddr_in addr;
+
+	do {
+		ret = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+		if (ret < 0) {
+			MSGPRINTK("Failed to create socket, %d\n", ret);
+			return ret;
+		}
+
+		ret = kernel_accept(sock_listen, &sock, 0);
+		if (ret < 0) {
+			MSGPRINTK("Failed to accept, %d\n", ret);
+			goto out;
+		}
+
+		ret = kernel_getpeername(sock, (struct sockaddr *)&addr);
+		if (ret < 0) {
+			goto out_release;
+		}
+
+		/* Identify incoming peer nid */
+		for (i = 0; i < max_nodes; i++) {
+			if (addr.sin_addr.s_addr == ip_table[i]) {
+				*nid = i;
+				found = true;
+			}
+		}
+		if (!found) {
+			sock_release(sock);
+			continue;
+		}
+	} while (retry++ < 10 && !found);
+
+	if (!found)
+		return -EAGAIN;
+	sock_handles[*nid].sock = sock;
+
+	ret = __start_handlers(*nid);
+	if (ret)
+		goto out_release;
+
+	return 0;
+
+out_release:
+	sock_release(sock);
+out:
+	return ret;
+}
+
+static int __init __listen_to_connection(void)
+{
+	int ret;
+	struct sockaddr_in addr;
+
+	ret = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock_listen);
+	if (ret < 0) {
+		printk(KERN_ERR "Failed to create socket, %d", ret);
+		return ret;
+	}
+
+	addr.sin_family = AF_INET;
+	addr.sin_addr.s_addr = INADDR_ANY;
+	addr.sin_port = htons(PORT);
+
+	ret = kernel_bind(sock_listen, (struct sockaddr *)&addr, sizeof(addr));
+	if (ret < 0) {
+		printk(KERN_ERR "Failed to bind socket, %d\n", ret);
+		goto out_release;
+	}
+
+	ret = kernel_listen(sock_listen, max_nodes);
+	if (ret < 0) {
+		printk(KERN_ERR "Failed to listen to connections, %d\n", ret);
+		goto out_release;
+	}
+
+	MSGPRINTK("Ready to accept incoming connections\n");
+	return 0;
+
+out_release:
+	sock_release(sock_listen);
+	sock_listen = NULL;
+	return ret;
+}
+
+static bool load_config_file(char *file)
+{
+	struct file *fp;
+	int bytes_read, ret;
+	int num_nodes = 0;
+	bool retval = true;
+	char ip_addr[CONFIG_FILE_CHUNK_SIZE];
+	u8 i4_addr[4];
+	loff_t offset = 0;
+	const char *end;
+
+	/* If no path was passed in, use hard coded default */
+	if (file[0] == '\0') {
+		strlcpy(file, CONFIG_FILE_PATH, CONFIG_FILE_LEN);
+	}
+
+	fp = filp_open(file, O_RDONLY, 0);
+	if (IS_ERR(fp)) {
+		MSGPRINTK("Cannot open config file %ld\n", PTR_ERR(fp));
+		return false;
+	}
+
+	while (num_nodes < (max_nodes - 1)) {
+		bytes_read = kernel_read(fp, ip_addr, CONFIG_FILE_CHUNK_SIZE, &offset);
+		if (bytes_read > 0) {
+			int str_off, str_len, j;
+
+			/* Replace \n, \r with \0 */
+			for (j = 0; j < CONFIG_FILE_CHUNK_SIZE; j++) {
+				if (ip_addr[j] == '\n' || ip_addr[j] == '\r') {
+					ip_addr[j] = '\0';
+				}
+			}
+
+			str_off = 0;
+			str_len = strlen(ip_addr);
+			while (str_off < bytes_read) {
+				str_len = strlen(ip_addr + str_off);
+
+				/* Make sure IP address is a valid IPv4 address */
+				if(str_len > 0){
+					ret = in4_pton(ip_addr + str_off, -1, i4_addr, -1, &end);
+					if (!ret) {
+						MSGPRINTK("invalid IP address in config file\n");
+						retval = false;
+						goto done;
+					}
+
+					ip_table[num_nodes++] = *((uint32_t *) i4_addr);
+				}
+
+				str_off += str_len + 1;
+			}
+		} else {
+			break;
+		}
+	}
+
+	/* Update max_nodes with number of nodes read in from config file */
+	max_nodes = num_nodes;
+
+done:
+	filp_close(fp, NULL);
+	return retval;
+}
+
+static void __init bail_early(void)
+{
+        int i;
+        if (sock_listen) sock_release(sock_listen);
+        for (i = 0; i < max_nodes; i++) {
+                struct sock_handle *sh = sock_handles + i;
+                if (sh->send_handler) {
+                        wake_up_process(sh->send_handler);
+                } else {
+                        if (sh->msg_q) kfree(sh->msg_q);
+                }
+                if (sh->recv_handler) {
+                        wake_up_process(sh->recv_handler);
+                }
+                if (sh->sock) {
+                        sock_release(sh->sock);
+                }
+        }
+        ring_buffer_destroy(&send_buffer);
+
+        MSGPRINTK("Successfully unloaded module!\n");
+
+}
+
+static void __exit exit_kmsg_sock(void)
+{
+	int i;
+	if (sock_listen) sock_release(sock_listen);
+	for (i = 0; i < max_nodes; i++) {
+		struct sock_handle *sh = sock_handles + i;
+		if (sh->send_handler) {
+			wake_up_process(sh->send_handler);
+		} else {
+			if (sh->msg_q) kfree(sh->msg_q);
+		}
+		if (sh->recv_handler) {
+			wake_up_process(sh->recv_handler);
+		}
+		if (sh->sock) {
+			sock_release(sh->sock);
+		}
+	}
+	ring_buffer_destroy(&send_buffer);
+
+	MSGPRINTK("Successfully unloaded module!\n");
+}
+
+static int __init init_kmsg_sock(void)
+{
+	int i, ret;
+
+	MSGPRINTK("Loading Popcorn messaging layer over TCP/IP...\n");
+
+	/* Load node configuration */
+	if (!load_config_file(config_file_path))
+		return -EINVAL;
+
+	if (!identify_myself())
+		return -EINVAL;
+
+	pcn_kmsg_set_transport(&transport_socket);
+
+	for (i = 0; i < max_nodes; i++) {
+		struct sock_handle *sh = sock_handles + i;
+
+		sh->msg_q = kmalloc(sizeof(*sh->msg_q) * MAX_SEND_DEPTH, GFP_KERNEL);
+		if (!sh->msg_q) {
+			ret = -ENOMEM;
+			goto out_exit;
+		}
+
+		sh->nid = i;
+		sh->q_head = 0;
+		sh->q_tail = 0;
+		spin_lock_init(&sh->q_lock);
+
+		sema_init(&sh->q_empty, 0);
+		sema_init(&sh->q_full, MAX_SEND_DEPTH);
+	}
+
+	if ((ret = ring_buffer_init(&send_buffer, "sock_send")))
+		goto out_exit;
+
+	if ((ret = __listen_to_connection()))
+		return ret;
+
+	/* Wait for a while so that nodes are ready to listen to connections */
+	msleep(100);
+
+	/* Initilaize the sock.
+	 *
+	 *  Each node has a connection table like tihs:
+	 * --------------------------------------------------------------------
+	 * | connect | connect | (many)... | my_nid(one) | accept | (many)... |
+	 * --------------------------------------------------------------------
+	 * my_nid:  no need to talk to itself
+	 * connect: connecting to existing nodes
+	 * accept:  waiting for the connection requests from later nodes
+	 */
+	for (i = 0; i < my_nid; i++) {
+		if ((ret = __connect_to_server(i)))
+			goto out_exit;
+		set_popcorn_node_online(i, true);
+	}
+
+	set_popcorn_node_online(my_nid, true);
+
+	for (i = my_nid + 1; i < max_nodes; i++) {
+		int nid = 0;
+		if ((ret = __accept_client(&nid)))
+			goto out_exit;
+		set_popcorn_node_online(nid, true);
+	}
+
+	broadcast_my_node_info(i);
+
+	PCNPRINTK("Ready on TCP/IP\n");
+	return 0;
+
+out_exit:
+	bail_early();
+	return ret;
+}
+
+static int max_nodes_set(const char *val, const struct kernel_param *kp)
+{
+	int n = 0, ret;
+
+	ret = kstrtoint(val, 10, &n);
+	if (ret != 0 || n < 1 || n > MAX_NUM_NODES)
+		return -EINVAL;
+
+	return param_set_int(val, kp);
+}
+
+static const struct kernel_param_ops param_ops = {
+	.set	= max_nodes_set,
+};
+
+module_param_cb(simpcb, &param_ops, &max_nodes, 0664);
+MODULE_PARM_DESC(max_nodes, "Maximum number of nodes supported");
+
+module_param_string(config_file, config_file_path, CONFIG_FILE_LEN, 0400);
+MODULE_PARM_DESC(config_file, "Configuration file path");
+
+module_init(init_kmsg_sock);
+module_exit(exit_kmsg_sock);
+MODULE_LICENSE("GPL");
diff --git a/include/popcorn/pcn_kmsg.h b/include/popcorn/pcn_kmsg.h
new file mode 100644
index 000000000..87dcac702
--- /dev/null
+++ b/include/popcorn/pcn_kmsg.h
@@ -0,0 +1,205 @@
+// SPDX-License-Identifier: GPL-2.0, 3-clause BSD
+#ifndef __POPCORN_PCN_KMSG_H__
+#define __POPCORN_PCN_KMSG_H__
+
+#include <linux/types.h>
+#include <linux/seq_file.h>
+
+/* Enumerate message types */
+enum pcn_kmsg_type {
+	/* Thread migration */
+	PCN_KMSG_TYPE_NODE_INFO,
+	PCN_KMSG_TYPE_STAT_START,
+	PCN_KMSG_TYPE_TASK_MIGRATE,
+	PCN_KMSG_TYPE_TASK_MIGRATE_BACK,
+	PCN_KMSG_TYPE_TASK_PAIRING,
+	PCN_KMSG_TYPE_TASK_EXIT_ORIGIN,
+	PCN_KMSG_TYPE_TASK_EXIT_REMOTE,
+
+	/* VMA synchronization */
+	PCN_KMSG_TYPE_VMA_INFO_REQUEST,
+	PCN_KMSG_TYPE_VMA_INFO_RESPONSE,
+	PCN_KMSG_TYPE_VMA_OP_REQUEST,
+	PCN_KMSG_TYPE_VMA_OP_RESPONSE,
+
+	/* Page consistency protocol */
+	PCN_KMSG_TYPE_REMOTE_PAGE_REQUEST,
+	PCN_KMSG_TYPE_REMOTE_PAGE_RESPONSE,
+	PCN_KMSG_TYPE_REMOTE_PAGE_RESPONSE_SHORT,
+	PCN_KMSG_TYPE_PAGE_INVALIDATE_REQUEST,
+	PCN_KMSG_TYPE_PAGE_INVALIDATE_RESPONSE,
+	PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH,	/* XXX page flush is not working now */
+	PCN_KMSG_TYPE_REMOTE_PAGE_RELEASE,
+	PCN_KMSG_TYPE_REMOTE_PAGE_FLUSH_ACK,
+
+	/* Distributed futex */
+	PCN_KMSG_TYPE_FUTEX_REQUEST,
+	PCN_KMSG_TYPE_FUTEX_RESPONSE,
+	PCN_KMSG_TYPE_STAT_END,
+
+	/* Performance experiments */
+	PCN_KMSG_TYPE_TEST_REQUEST,
+	PCN_KMSG_TYPE_TEST_RESPONSE,
+	PCN_KMSG_TYPE_TEST_RDMA_REQUEST,
+	PCN_KMSG_TYPE_TEST_RDMA_RESPONSE,
+
+	/* Provide the single system image */
+	PCN_KMSG_TYPE_REMOTE_PROC_CPUINFO_REQUEST,
+	PCN_KMSG_TYPE_REMOTE_PROC_CPUINFO_RESPONSE,
+	PCN_KMSG_TYPE_REMOTE_PROC_MEMINFO_REQUEST,
+	PCN_KMSG_TYPE_REMOTE_PROC_MEMINFO_RESPONSE,
+	PCN_KMSG_TYPE_REMOTE_PROC_PS_REQUEST,
+	PCN_KMSG_TYPE_REMOTE_PROC_PS_RESPONSE,
+
+	/* Schedule server */
+	PCN_KMSG_TYPE_SCHED_PERIODIC,		/* XXX sched requires help!! */
+
+	PCN_KMSG_TYPE_MAX
+};
+
+/* Enumerate message priority. XXX Priority is not supported yet. */
+enum pcn_kmsg_prio {
+	PCN_KMSG_PRIO_LOW,
+	PCN_KMSG_PRIO_NORMAL,
+	PCN_KMSG_PRIO_HIGH,
+};
+
+/* Message header */
+struct pcn_kmsg_hdr {
+	int from_nid		:6;
+	enum pcn_kmsg_prio prio	:2;
+	enum pcn_kmsg_type type	:8;
+	size_t size;
+} __attribute__((packed));
+
+#define PCN_KMSG_FROM_NID(x) \
+	(((struct pcn_kmsg_message *)x)->header.from_nid)
+#define PCN_KMSG_SIZE(x) (sizeof(struct pcn_kmsg_hdr) + x)
+
+#define PCN_KMSG_MAX_SIZE (64UL << 10)
+#define PCN_KMSG_MAX_PAYLOAD_SIZE \
+	(PCN_KMSG_MAX_SIZE - sizeof(struct pcn_kmsg_hdr))
+
+#define DEFINE_PCN_KMSG(type, fields)		\
+	typedef struct {			\
+		struct pcn_kmsg_hdr header;	\
+		fields;				\
+	} __attribute__((packed)) type
+
+struct pcn_kmsg_message {
+	struct pcn_kmsg_hdr header;
+	unsigned char payload[PCN_KMSG_MAX_PAYLOAD_SIZE];
+} __attribute__((packed));
+
+void pcn_kmsg_dump(struct pcn_kmsg_message *msg);
+
+
+/* SETUP */
+
+/* Function pointer to callback functions */
+typedef int (*pcn_kmsg_cbftn)(struct pcn_kmsg_message *);
+
+/* Register a callback function to handle the message type */
+int pcn_kmsg_register_callback(enum pcn_kmsg_type type,
+			       pcn_kmsg_cbftn callback);
+
+/* Unregister a callback function for the message type */
+int pcn_kmsg_unregister_callback(enum pcn_kmsg_type type);
+
+
+/* MESSAGING */
+
+/*
+ * Send @msg whose size is @msg_size to the node @dest_nid.
+ * @msg is sent synchronously; it is safe to deallocate @msg after the return.
+ */
+int pcn_kmsg_send(enum pcn_kmsg_type type, int dest_nid, void *msg,
+		  size_t msg_size);
+
+/*
+ * Post @msg whose size is @msg_size to be sent to the node @dest_nid.
+ * The message should be allocated through pcn_kmsg_get(), and the message
+ * is reclaimed automatically once it is sent.
+ */
+int pcn_kmsg_post(enum pcn_kmsg_type type, int dest_nid, void *msg,
+		  size_t msg_size);
+
+/*
+ * Get message buffer for posting. Note pcn_kmsg_put() is for returning
+ * unused buffer without posting it; posted message is reclaimed automatically.
+ */
+void *pcn_kmsg_get(size_t size);
+void pcn_kmsg_put(void *msg);
+
+/*
+ * Process the received messag @msg. Each message layer should start processing
+ * the request by calling this function
+ */
+void pcn_kmsg_process(struct pcn_kmsg_message *msg);
+
+/*
+ * Return received message @msg after handling to recyle it. @msg becomes
+ * unavailable after the call. Make sure return received messages otherwise
+ * the message layer will panick.
+ */
+void pcn_kmsg_done(void *msg);
+
+/*
+ * Print out transport-specific statistics into @buffer
+ */
+void pcn_kmsg_stat(struct seq_file *seq, void *v);
+
+struct pcn_kmsg_rdma_handle {
+	u32 rkey;
+	void *addr;
+	dma_addr_t dma_addr;
+	void *private;
+};
+
+/*
+ * Pin @buffer for RDMA and get @rdma_addr and @rdma_key.
+ */
+struct pcn_kmsg_rdma_handle *pcn_kmsg_pin_rdma_buffer(void *buffer,
+						      size_t size);
+
+void pcn_kmsg_unpin_rdma_buffer(struct pcn_kmsg_rdma_handle *handle);
+
+int pcn_kmsg_rdma_write(int dest_nid, dma_addr_t rdma_addr, void *addr,
+			size_t size, u32 rdma_key);
+
+int pcn_kmsg_rdma_read(int from_nid, void *addr, dma_addr_t rdma_addr,
+		       size_t size, u32 rdma_key);
+
+/* TRANSPORT DESCRIPTOR */
+enum {
+	PCN_KMSG_FEATURE_RDMA = 1,
+};
+
+/*
+ * Check the features that the transport layer provides. Return true iff all
+ * features are supported.
+ */
+bool pcn_kmsg_has_features(unsigned int features);
+
+struct pcn_kmsg_transport {
+	char *name;
+	unsigned long features;
+
+	struct pcn_kmsg_message *(*get)(size_t);
+	void (*put)(struct pcn_kmsg_message *);
+
+	int (*send)(int, struct pcn_kmsg_message *, size_t);
+	int (*post)(int, struct pcn_kmsg_message *, size_t);
+	void (*done)(struct pcn_kmsg_message *);
+
+	void (*stat)(struct seq_file *, void *);
+
+	struct pcn_kmsg_rdma_handle *(*pin_rdma_buffer)(void *, size_t);
+	void (*unpin_rdma_buffer)(struct pcn_kmsg_rdma_handle *);
+	int (*rdma_write)(int, dma_addr_t, void *, size_t, u32);
+	int (*rdma_read)(int, void *, dma_addr_t, size_t, u32);
+};
+
+void pcn_kmsg_set_transport(struct pcn_kmsg_transport *tr);
+
+#endif /* __POPCORN_PCN_KMSG_H__ */
diff --git a/kernel/popcorn/pcn_kmsg.c b/kernel/popcorn/pcn_kmsg.c
new file mode 100644
index 000000000..882c82e25
--- /dev/null
+++ b/kernel/popcorn/pcn_kmsg.c
@@ -0,0 +1,231 @@
+// SPDX-License-Identifier: GPL-2.0, BSD
+/*
+ * /kernel/popcorn/pcn_kmsg.c - Kernel Module for Popcorn Messaging Layer over Socket
+ *
+ * author, Javier Malave, Rebecca Shapiro, Andrew Hughes,
+ * Narf Industries 2020 (modifications for upstream RFC)
+ *
+ * (Copyright):
+ * author, Sang-Hoon Kim, SSRG, Virginia Tech, 2017-2018
+ * author Ben Shelton, SSRG, Virginia Tech, 2013
+ */
+
+#include <linux/kernel.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+
+#include <popcorn/pcn_kmsg.h>
+#include <popcorn/debug.h>
+#include <popcorn/stat.h>
+#include <popcorn/bundle.h>
+
+static pcn_kmsg_cbftn pcn_kmsg_cbftns[PCN_KMSG_TYPE_MAX] = { NULL };
+
+static struct pcn_kmsg_transport *transport = NULL;
+
+void pcn_kmsg_set_transport(struct pcn_kmsg_transport *tr)
+{
+	if (transport && tr) {
+		printk(KERN_ERR "Replace hot transport at your own risk.\n");
+	}
+	transport = tr;
+}
+EXPORT_SYMBOL(pcn_kmsg_set_transport);
+
+int pcn_kmsg_register_callback(enum pcn_kmsg_type type, pcn_kmsg_cbftn callback)
+{
+	BUG_ON(type < 0 || type >= PCN_KMSG_TYPE_MAX);
+
+	pcn_kmsg_cbftns[type] = callback;
+	return 0;
+}
+EXPORT_SYMBOL(pcn_kmsg_register_callback);
+
+int pcn_kmsg_unregister_callback(enum pcn_kmsg_type type)
+{
+	return pcn_kmsg_register_callback(type, (pcn_kmsg_cbftn)NULL);
+}
+EXPORT_SYMBOL(pcn_kmsg_unregister_callback);
+
+
+static atomic_t __nr_outstanding_requests[PCN_KMSG_TYPE_MAX] = { ATOMIC_INIT(0) };
+
+void pcn_kmsg_process(struct pcn_kmsg_message *msg)
+{
+	pcn_kmsg_cbftn ftn;
+
+	if(IS_ENABLED(CONFIG_POPCORN_CHECK_SANITY)) {
+		BUG_ON(msg->header.type < 0 || msg->header.type >= PCN_KMSG_TYPE_MAX);
+		BUG_ON(msg->header.size < 0 || msg->header.size > PCN_KMSG_MAX_SIZE);
+		if (atomic_inc_return(__nr_outstanding_requests + msg->header.type) > 64) {
+			if (WARN_ON_ONCE("leaking received messages, ")) {
+				printk("type %d\n", msg->header.type);
+			}
+		}
+	}
+	account_pcn_message_recv(msg);
+
+	ftn = pcn_kmsg_cbftns[msg->header.type];
+
+	if (ftn != NULL) {
+		ftn(msg);
+	} else {
+		printk(KERN_ERR"No callback registered for %d\n", msg->header.type);
+		pcn_kmsg_done(msg);
+	}
+}
+EXPORT_SYMBOL(pcn_kmsg_process);
+
+static inline int __build_and_check_msg(enum pcn_kmsg_type type, int to,
+					struct pcn_kmsg_message *msg, size_t size)
+{
+
+	if(IS_ENABLED(CONFIG_POPCORN_CHECK_SANITY)) {
+		BUG_ON(type < 0 || type >= PCN_KMSG_TYPE_MAX);
+		BUG_ON(size > PCN_KMSG_MAX_SIZE);
+		BUG_ON(to < 0 || to >= MAX_POPCORN_NODES);
+		BUG_ON(to == my_nid);
+	}
+
+	msg->header.type = type;
+	msg->header.prio = PCN_KMSG_PRIO_NORMAL;
+	msg->header.size = size;
+	msg->header.from_nid = my_nid;
+	return 0;
+}
+
+int pcn_kmsg_send(enum pcn_kmsg_type type, int to, void *msg, size_t size)
+{
+	int ret;
+	if ((ret = __build_and_check_msg(type, to, msg, size)))
+		return ret;
+
+	account_pcn_message_sent(msg);
+	return transport->send(to, msg, size);
+}
+EXPORT_SYMBOL(pcn_kmsg_send);
+
+int pcn_kmsg_post(enum pcn_kmsg_type type, int to, void *msg, size_t size)
+{
+	int ret;
+	if ((ret = __build_and_check_msg(type, to, msg, size)))
+		return ret;
+
+	account_pcn_message_sent(msg);
+	return transport->post(to, msg, size);
+}
+EXPORT_SYMBOL(pcn_kmsg_post);
+
+void *pcn_kmsg_get(size_t size)
+{
+	if (transport && transport->get)
+		return transport->get(size);
+	return kmalloc(size, GFP_KERNEL);
+}
+EXPORT_SYMBOL(pcn_kmsg_get);
+
+void pcn_kmsg_put(void *msg)
+{
+	if (transport && transport->put) {
+		transport->put(msg);
+	} else {
+		kfree(msg);
+	}
+}
+EXPORT_SYMBOL(pcn_kmsg_put);
+
+
+void pcn_kmsg_done(void *msg)
+{
+	if(IS_ENABLED(CONFIG_POPCORN_CHECK_SANITY)) {
+		struct pcn_kmsg_hdr *h = msg;;
+		if (atomic_dec_return(__nr_outstanding_requests + h->type) < 0) {
+			printk(KERN_ERR
+			"Over-release message type %d\n", h->type);
+		}
+	}
+	if (transport && transport->done) {
+		transport->done(msg);
+	} else {
+		kfree(msg);
+	}
+}
+EXPORT_SYMBOL(pcn_kmsg_done);
+
+void pcn_kmsg_stat(struct seq_file *seq, void *v)
+{
+	if (transport && transport->stat) {
+		transport->stat(seq, v);
+	}
+}
+EXPORT_SYMBOL(pcn_kmsg_stat);
+
+bool pcn_kmsg_has_features(unsigned int features)
+{
+	if (!transport)
+		return false;
+
+	return (transport->features & features) == features;
+}
+EXPORT_SYMBOL(pcn_kmsg_has_features);
+
+int pcn_kmsg_rdma_read(int from_nid, void *addr, dma_addr_t rdma_addr,
+		size_t size, u32 rdma_key)
+{
+	if(IS_ENABLED(CONFIG_POPCORN_CHECK_SANITY)) {
+		if (!transport || !transport->rdma_read)
+			return -EPERM;
+	}
+
+	account_pcn_rdma_read(size);
+	return transport->rdma_read(from_nid, addr, rdma_addr, size, rdma_key);
+}
+EXPORT_SYMBOL(pcn_kmsg_rdma_read);
+
+int pcn_kmsg_rdma_write(int dest_nid, dma_addr_t rdma_addr, void *addr,
+			size_t size, u32 rdma_key)
+{
+	if(IS_ENABLED(CONFIG_POPCORN_CHECK_SANITY)) {
+		if (!transport || !transport->rdma_write)
+			return -EPERM;
+	}
+
+	account_pcn_rdma_write(size);
+    return transport->rdma_write(dest_nid, rdma_addr, addr, size, rdma_key);
+}
+EXPORT_SYMBOL(pcn_kmsg_rdma_write);
+
+
+struct pcn_kmsg_rdma_handle *pcn_kmsg_pin_rdma_buffer(void *buffer, size_t size)
+{
+	if (transport && transport->pin_rdma_buffer) {
+		return transport->pin_rdma_buffer(buffer, size);
+	}
+	return ERR_PTR(-EINVAL);
+}
+EXPORT_SYMBOL(pcn_kmsg_pin_rdma_buffer);
+
+void pcn_kmsg_unpin_rdma_buffer(struct pcn_kmsg_rdma_handle *handle)
+{
+	if (transport && transport->unpin_rdma_buffer) {
+		transport->unpin_rdma_buffer(handle);
+	}
+}
+EXPORT_SYMBOL(pcn_kmsg_unpin_rdma_buffer);
+
+
+void pcn_kmsg_dump(struct pcn_kmsg_message *msg)
+{
+	struct pcn_kmsg_hdr *h = &msg->header;
+	printk("MSG %p: from=%d type=%d size=%lu\n",
+			msg, h->from_nid, h->type, h->size);
+}
+EXPORT_SYMBOL(pcn_kmsg_dump);
+
+int __init pcn_kmsg_init(void)
+{
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 0/9] Popcorn Linux Distributed Thread Execution
  2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
                     ` (8 preceding siblings ...)
  2020-04-29 19:32   ` [RFC 9/9] Add Popcorn Message Layer and socket support Javier Malave
@ 2020-05-07 17:46   ` Pavel Machek
  9 siblings, 0 replies; 11+ messages in thread
From: Pavel Machek @ 2020-05-07 17:46 UTC (permalink / raw)
  To: Javier Malave; +Cc: bx, linux-kernel, ah

On Wed 2020-04-29 15:32:47, Javier Malave wrote:
> This patch set adds the Popcorn Distributed Thread Execution support
> to the kernel. It is based off of Linux 5.2 commit 72a20ce. We are
> looking for feedback on design and implementation from the community.

You may want to cc linux-api mailing list...?

> Popcorn Linux is a Linux kernel-based software stack that enables
> applications to execute, with a shared code base, on distributed hosts.
> Popcorn allows applications to start execution on a particular host and
> migrate, at run-time, to a remote host. Multi-threaded applications may
> migrate any particular thread to any remote host. 

Sounds like a lot of fun.

> Popcorn Linux implements a software-based distributed shared memory 
> by extending Linux's virtual memory subsystem and it enables processes 
> on different machines to observe a common and coherent virtual address
> space. Coherency of virtual memory pages of different hosts is ensured 
> using a reader-replicate/writer-invalidate, page-level consistency protocol.

Sounds interesting. I guess this needs very, very fast network. Do you have some performance numbers 
somewhere?

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <0>
2020-04-29 19:32 ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Javier Malave
2020-04-29 19:32   ` [RFC 1/9] Core Popcorn Changes Javier Malave
2020-04-29 19:32   ` [RFC 2/9] Add x86 specifc files for Popcorn Javier Malave
2020-04-29 19:32   ` [RFC 3/9] Temporary revert L1TF mitigation " Javier Malave
2020-04-29 19:32   ` [RFC 4/9] Popcorn system call additions Javier Malave
2020-04-29 19:32   ` [RFC 5/9] Popcorn Utility Javier Malave
2020-04-29 19:32   ` [RFC 6/9] Process Server for Popcorn Distributed Thread Execution Javier Malave
2020-04-29 19:32   ` [RFC 7/9] Virtual Memory Address Server for " Javier Malave
2020-04-29 19:32   ` [RFC 8/9] Page " Javier Malave
2020-04-29 19:32   ` [RFC 9/9] Add Popcorn Message Layer and socket support Javier Malave
2020-05-07 17:46   ` [RFC 0/9] Popcorn Linux Distributed Thread Execution Pavel Machek

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git