linux-mm.kvack.org archive mirror
* [RFC PATCH 0/5] Remote mapping
@ 2020-09-03 17:47 Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-03 17:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Paolo Bonzini, Adalbert Lazăr

This patchset adds support for the remote mapping feature.
Remote mapping, as its name suggests, is a means for transparent and
zero-copy access of a remote process' address space.

The feature was designed according to a specification suggested by Paolo Bonzini:
>> The proposed API is a new pidfd system call, through which the parent
>> can map portions of its virtual address space into a file descriptor
>> and then pass that file descriptor to a child.
>>
>> This should be:
>>
>> - upstreamable, pidfd is the new cool thing and we could sell it as a
>> better way to do PTRACE_{PEEK,POKE}DATA
>>
>> - relatively easy to do based on the bitdefender remote process
>> mapping patches at.
>>
>> - pidfd_mem() takes a pidfd and some flags (which are 0) and returns
>> two file descriptors for respectively the control plane and the memory access.
>>
>> - the control plane accepts three ioctls
>>
>> PIDFD_MEM_MAP takes a struct like
>>
>>     struct pidfd_mem_map {
>>          uint64_t address;
>>          off_t offset;
>>          off_t size;
>>          int flags;
>>          int padding[7];
>>     }
>>
>> After this is done, the memory access fd can be mmap-ed at range
>> [offset,
>> offset+size), and it will read memory from range [address,
>> address+size) of the target descriptor.
>>
>> PIDFD_MEM_UNMAP takes a struct like
>>
>>     struct pidfd_mem_unmap {
>>          off_t offset;
>>          off_t size;
>>     }
>>
>> and unmaps the corresponding range of course.
>>
>> Finally PIDFD_MEM_LOCK forbids subsequent PIDFD_MEM_MAP or
>> PIDFD_MEM_UNMAP.  For now I think it should just check that the
>> argument is zero, bells and whistles can be added later.
>>
>> - the memory access fd can be mmap-ed as in the bitdefender patches
>> but also accessed with read/write/pread/pwrite/...  As in the
>> BitDefender patches, MMU notifiers can be used to adjust any mmap-ed
>> regions when the source address space changes.  In this case,
>> PIDFD_MEM_UNMAP could also cause a pre-existing mmap to "disappear".
(it currently doesn't support read/write/pread/pwrite/...)
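
For illustration only, the intended userspace flow would look roughly like the
sketch below. The pidfd_mem() wrapper and the address/size values are just
placeholders (the syscall is only wired up in patch 5), and headers and error
handling are omitted:

  int fds[2];    /* fds[0]: control fd, fds[1]: memory access fd */

  if (pidfd_mem(pidfd, 0, fds) < 0)     /* hypothetical wrapper */
          err(1, "pidfd_mem");

  /* source side: expose [address, address+size) at [offset, offset+size) */
  struct pidfd_mem_map map = {
          .address = 0x7f0000000000,    /* example remote VA */
          .offset  = 0,
          .size    = 0x200000,
  };
  if (ioctl(fds[0], PIDFD_MEM_MAP, &map) < 0)
          err(1, "PIDFD_MEM_MAP");
  ioctl(fds[0], PIDFD_MEM_LOCK, 0);     /* optionally freeze the layout */

  /* consumer side, after receiving fds[1]: mirror the range */
  void *p = mmap(NULL, map.size, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fds[1], map.offset);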

The main remote mapping patch also contains the legacy implementation which
creates a region the size of the whole process address space by means of the
REMOTE_PROC_MAP ioctl. The user is then free to mmap() any region of the
address space it wishes.
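
A rough sketch of the legacy flow (the /dev/mirror-proc path is an assumption
based on the misc device name registered by the patch; remote_pid, remote_addr
and length are placeholders):

  int fd = open("/dev/mirror-proc", O_RDWR);

  if (ioctl(fd, REMOTE_PROC_MAP, (unsigned long)remote_pid) < 0)
          err(1, "REMOTE_PROC_MAP");

  /* the implicit view is an identity map of the whole remote address
   * space, so the mmap() offset is simply the (page-aligned) remote
   * virtual address to mirror */
  void *p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fd, remote_addr);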

VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
process address space within the specified range. Pages are installed in the
current process page tables at fault time and removed by the mmu_interval_notifier
invalidate callback. No further memory management is involved.
On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
or if the remote process address space was reaped by OOM, the remote mapping
fault handler returns VM_FAULT_SIGBUS.
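
A consumer that probes ranges which may contain holes can therefore guard the
access; a minimal, purely illustrative pattern (the SIGBUS arrives in the
consumer when it touches such a page):

  static sigjmp_buf probe_env;

  static void on_sigbus(int sig)
  {
          siglongjmp(probe_env, 1);
  }

  ...
  signal(SIGBUS, on_sigbus);
  if (sigsetjmp(probe_env, 1) == 0)
          val = *(volatile unsigned char *)p;   /* faults on a hole */
  else
          fprintf(stderr, "no remote page behind %p\n", p);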

At Bitdefender we are using remote mapping for virtual machine introspection:
- the QEMU running the introspected machine creates the pair of file descriptors,
passes the access fd to the introspector QEMU (see the fd-passing sketch below),
and uses the control fd to allow access to the memslots it creates for its machine
- the QEMU running the introspector machine receives the access fd and mmap()s
the regions made available, then hotplugs the obtained memory in its machine
This setup results in nested invalidate_range_start/end MMU notifier calls.
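
The access fd can be handed from one QEMU to the other with the usual
SCM_RIGHTS passing over a UNIX domain socket; a minimal sketch (how the two
processes set up the socket is out of scope here; needs <sys/socket.h> and
<string.h>):

  static int send_fd(int sock, int fd)
  {
          char data = 0;
          struct iovec iov = { .iov_base = &data, .iov_len = 1 };
          union {
                  struct cmsghdr hdr;
                  char buf[CMSG_SPACE(sizeof(int))];
          } u;
          struct msghdr msg = {
                  .msg_iov = &iov,
                  .msg_iovlen = 1,
                  .msg_control = u.buf,
                  .msg_controllen = sizeof(u.buf),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

          return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }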

Patch organization:
- patch 1 allows unmap_page_range() to run without rescheduling
  Needed for remote mapping to zap current process page tables when OOM calls
  mmu_notifier_invalidate_range_start_nonblock(&range)

- patch 2 creates VMA-specific zapping behavior
  A remote mapping VMA does not own the pages it maps, so all it has to do is
  clear the PTEs.

- patch 3 removes the MMU notifier lockdep map
  It was just incompatible with our use case.

- patch 4 is the remote mapping implementation

- patch 5 adds suggested pidfd_mem system call

Mircea Cirjaliu (5):
  mm: add atomic capability to zap_details
  mm: let the VMA decide how zap_pte_range() acts on mapped pages
  mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in
    nested scenarios
  mm/remote_mapping: use a pidfd to access memory belonging to unrelated
    process
  pidfd_mem: implemented remote memory mapping system call

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 include/linux/mm.h                     |   22 +
 include/linux/mmu_notifier.h           |    5 +-
 include/linux/pid.h                    |    1 +
 include/linux/remote_mapping.h         |   22 +
 include/linux/syscalls.h               |    1 +
 include/uapi/asm-generic/unistd.h      |    2 +
 include/uapi/linux/remote_mapping.h    |   36 +
 kernel/exit.c                          |    2 +-
 kernel/pid.c                           |   55 +
 mm/Kconfig                             |   11 +
 mm/Makefile                            |    1 +
 mm/memory.c                            |  193 ++--
 mm/mmu_notifier.c                      |   19 -
 mm/remote_mapping.c                    | 1273 ++++++++++++++++++++++++
 16 files changed, 1535 insertions(+), 110 deletions(-)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c


CC: Christian Brauner <christian@brauner.io>
base-commit: ae83d0b416db002fe95601e7f97f64b59514d936



* [RFC PATCH 1/5] mm: add atomic capability to zap_details
  2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
@ 2020-09-03 17:47 ` Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages Adalbert Lazăr
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-03 17:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Paolo Bonzini, Mircea Cirjaliu, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

Force zap_xxx_range() functions to loop without rescheduling.
This is useful for unmapping memory from an atomic context, although no
checks for atomic context are made.
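
Illustrative only, condensing how the remote mapping patch later in this
series uses it when called from a non-blockable MMU notifier:

	struct zap_details details = {
		.atomic = true,		/* must not sleep here */
	};

	tlb_gather_mmu(&tlb, vma->vm_mm, start, end);
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
				vma, vma->vm_mm, start, end);
	mmu_notifier_invalidate_range_start_nonblock(&range);

	unmap_page_range(&tlb, vma, start, end, &details);	/* no cond_resched() */

	mmu_notifier_invalidate_range_end(&range);
	tlb_finish_mmu(&tlb, start, end);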

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/mm.h |  6 ++++++
 mm/memory.c        | 11 +++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..1be4482a7b81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1601,8 +1601,14 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool atomic;				/* Do not sleep. */
 };
 
+static inline bool zap_is_atomic(struct zap_details *details)
+{
+	return (unlikely(details) && details->atomic);
+}
+
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..8e78fb151f8f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1056,7 +1056,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (pte_none(ptent))
 			continue;
 
-		if (need_resched())
+		if (!zap_is_atomic(details) && need_resched())
 			break;
 
 		if (pte_present(ptent)) {
@@ -1159,7 +1159,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	}
 
 	if (addr != end) {
-		cond_resched();
+		if (!zap_is_atomic(details))
+			cond_resched();
 		goto again;
 	}
 
@@ -1195,7 +1196,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
 next:
-		cond_resched();
+		if (!zap_is_atomic(details))
+			cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
 	return addr;
@@ -1224,7 +1226,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 			continue;
 		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
 next:
-		cond_resched();
+		if (!zap_is_atomic(details))
+			cond_resched();
 	} while (pud++, addr = next, addr != end);
 
 	return addr;



* [RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages
  2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
@ 2020-09-03 17:47 ` Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Adalbert Lazăr
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-03 17:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Paolo Bonzini, Mircea Cirjaliu, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

Instead of having one big function to handle all cases of page unmapping,
have multiple implementation-defined callbacks, each for its own VMA type.
In the future, exotic VMA implementations won't have to bloat the common
zapping function with yet another special case.
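
For example, the remote-mapping mirror VMA added later in this series only
needs to clear the PTE, since it does not own the pages it maps; a condensed
version of its callback:

	static int mirror_zap_pte(struct vm_area_struct *vma, unsigned long addr,
				  pte_t *pte, int rss[], struct mmu_gather *tlb,
				  struct zap_details *details)
	{
		struct page *page = vm_normal_page(vma, addr, *pte);
		pte_t ptent = ptep_clear_flush_notify(vma, addr, pte);

		if (pte_dirty(ptent)) {
			set_page_dirty(page);
			return ZAP_PTE_FLUSH;	/* zap_pte_range() flushes the TLB */
		}
		return ZAP_PTE_CONTINUE;
	}

	static const struct vm_operations_struct mirror_vm_ops = {
		...
		.zap_pte	= mirror_zap_pte,
	};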

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/mm.h |  16 ++++
 mm/memory.c        | 182 +++++++++++++++++++++++++--------------------
 2 files changed, 116 insertions(+), 82 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1be4482a7b81..39e55467aa49 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct bdi_writeback;
+struct zap_details;
 
 void init_mm_internals(void);
 
@@ -601,6 +602,14 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+
+	/*
+	 * Called by zap_pte_range() for use by special VMAs that implement
+	 * custom zapping behavior.
+	 */
+	int (*zap_pte)(struct vm_area_struct *vma, unsigned long addr,
+		       pte_t *pte, int rss[], struct mmu_gather *tlb,
+		       struct zap_details *details);
 };
 
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
@@ -1594,6 +1603,13 @@ static inline bool can_do_mlock(void) { return false; }
 extern int user_shm_lock(size_t, struct user_struct *);
 extern void user_shm_unlock(size_t, struct user_struct *);
 
+/*
+ * Flags returned by zap_pte implementations
+ */
+#define ZAP_PTE_CONTINUE	0
+#define ZAP_PTE_FLUSH		(1 << 0)	/* Ask for TLB flush. */
+#define ZAP_PTE_BREAK		(1 << 1)	/* Break PTE iteration. */
+
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
diff --git a/mm/memory.c b/mm/memory.c
index 8e78fb151f8f..a225bfd01417 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1031,18 +1031,109 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }
 
+static int zap_pte_common(struct vm_area_struct *vma, unsigned long addr,
+			  pte_t *pte, int rss[], struct mmu_gather *tlb,
+			  struct zap_details *details)
+{
+	struct mm_struct *mm = tlb->mm;
+	pte_t ptent = *pte;
+	swp_entry_t entry;
+	int flags = 0;
+
+	if (pte_present(ptent)) {
+		struct page *page;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (unlikely(details) && page) {
+			/*
+			 * unmap_shared_mapping_pages() wants to
+			 * invalidate cache without truncating:
+			 * unmap shared but keep private pages.
+			 */
+			if (details->check_mapping &&
+			    details->check_mapping != page_rmapping(page))
+				return 0;
+		}
+		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		if (unlikely(!page))
+			return 0;
+
+		if (!PageAnon(page)) {
+			if (pte_dirty(ptent)) {
+				flags |= ZAP_PTE_FLUSH;
+				set_page_dirty(page);
+			}
+			if (pte_young(ptent) &&
+			    likely(!(vma->vm_flags & VM_SEQ_READ)))
+				mark_page_accessed(page);
+		}
+		rss[mm_counter(page)]--;
+		page_remove_rmap(page, false);
+		if (unlikely(page_mapcount(page) < 0))
+			print_bad_pte(vma, addr, ptent, page);
+		if (unlikely(__tlb_remove_page(tlb, page)))
+			flags |= ZAP_PTE_FLUSH | ZAP_PTE_BREAK;
+		return flags;
+	}
+
+	entry = pte_to_swp_entry(ptent);
+	if (non_swap_entry(entry) && is_device_private_entry(entry)) {
+		struct page *page = device_private_entry_to_page(entry);
+
+		if (unlikely(details && details->check_mapping)) {
+			/*
+			 * unmap_shared_mapping_pages() wants to
+			 * invalidate cache without truncating:
+			 * unmap shared but keep private pages.
+			 */
+			if (details->check_mapping != page_rmapping(page))
+				return 0;
+		}
+
+		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		rss[mm_counter(page)]--;
+		page_remove_rmap(page, false);
+		put_page(page);
+		return 0;
+	}
+
+	/* If details->check_mapping, we leave swap entries. */
+	if (unlikely(details))
+		return 0;
+
+	if (!non_swap_entry(entry))
+		rss[MM_SWAPENTS]--;
+	else if (is_migration_entry(entry)) {
+		struct page *page;
+
+		page = migration_entry_to_page(entry);
+		rss[mm_counter(page)]--;
+	}
+	if (unlikely(!free_swap_and_cache(entry)))
+		print_bad_pte(vma, addr, ptent, NULL);
+	pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+
+	return flags;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
-	int force_flush = 0;
+	int flags = 0;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
 	pte_t *start_pte;
 	pte_t *pte;
-	swp_entry_t entry;
+
+	int (*zap_pte)(struct vm_area_struct *vma, unsigned long addr,
+		       pte_t *pte, int rss[], struct mmu_gather *tlb,
+		       struct zap_details *details) = zap_pte_common;
+	if (vma->vm_ops && vma->vm_ops->zap_pte)
+		zap_pte = vma->vm_ops->zap_pte;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1058,92 +1149,19 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 		if (!zap_is_atomic(details) && need_resched())
 			break;
-
-		if (pte_present(ptent)) {
-			struct page *page;
-
-			page = vm_normal_page(vma, addr, ptent);
-			if (unlikely(details) && page) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping &&
-				    details->check_mapping != page_rmapping(page))
-					continue;
-			}
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-			if (unlikely(!page))
-				continue;
-
-			if (!PageAnon(page)) {
-				if (pte_dirty(ptent)) {
-					force_flush = 1;
-					set_page_dirty(page);
-				}
-				if (pte_young(ptent) &&
-				    likely(!(vma->vm_flags & VM_SEQ_READ)))
-					mark_page_accessed(page);
-			}
-			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
-				force_flush = 1;
-				addr += PAGE_SIZE;
-				break;
-			}
-			continue;
-		}
-
-		entry = pte_to_swp_entry(ptent);
-		if (non_swap_entry(entry) && is_device_private_entry(entry)) {
-			struct page *page = device_private_entry_to_page(entry);
-
-			if (unlikely(details && details->check_mapping)) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping !=
-				    page_rmapping(page))
-					continue;
-			}
-
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
-			put_page(page);
-			continue;
+		if (flags & ZAP_PTE_BREAK) {
+			flags &= ~ZAP_PTE_BREAK;
+			break;
 		}
 
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
-			continue;
-
-		if (!non_swap_entry(entry))
-			rss[MM_SWAPENTS]--;
-		else if (is_migration_entry(entry)) {
-			struct page *page;
-
-			page = migration_entry_to_page(entry);
-			rss[mm_counter(page)]--;
-		}
-		if (unlikely(!free_swap_and_cache(entry)))
-			print_bad_pte(vma, addr, ptent, NULL);
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		flags |= zap_pte(vma, addr, pte, rss, tlb, details);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (flags & ZAP_PTE_FLUSH)
 		tlb_flush_mmu_tlbonly(tlb);
 	pte_unmap_unlock(start_pte, ptl);
 
@@ -1153,8 +1171,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	 * entries before releasing the ptl), free the batched
 	 * memory too. Restart if we didn't do everything.
 	 */
-	if (force_flush) {
-		force_flush = 0;
+	if (flags & ZAP_PTE_FLUSH) {
+		flags &= ~ZAP_PTE_FLUSH;
 		tlb_flush_mmu(tlb);
 	}
 



* [RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios
  2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages Adalbert Lazăr
@ 2020-09-03 17:47 ` Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-03 17:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Paolo Bonzini, Mircea Cirjaliu, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

The combination of remote mapping + KVM causes nested range invalidations,
which trigger lockdep warnings.

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/mmu_notifier.h |  5 +----
 mm/mmu_notifier.c            | 19 -------------------
 2 files changed, 1 insertion(+), 23 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 736f6918335e..81ea457d41be 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -440,12 +440,10 @@ mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
 	might_sleep();
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (mm_has_notifiers(range->mm)) {
 		range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
 		__mmu_notifier_invalidate_range_start(range);
 	}
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
 static inline int
@@ -453,12 +451,11 @@ mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range)
 {
 	int ret = 0;
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (mm_has_notifiers(range->mm)) {
 		range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
 		ret = __mmu_notifier_invalidate_range_start(range);
 	}
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+
 	return ret;
 }
 
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..928751bd8630 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -22,12 +22,6 @@
 /* global SRCU for all MMs */
 DEFINE_STATIC_SRCU(srcu);
 
-#ifdef CONFIG_LOCKDEP
-struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
-	.name = "mmu_notifier_invalidate_range_start"
-};
-#endif
-
 /*
  * The mmu_notifier_subscriptions structure is allocated and installed in
  * mm->notifier_subscriptions inside the mm_take_all_locks() protected
@@ -242,8 +236,6 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 	 * will always clear the below sleep in some reasonable time as
 	 * subscriptions->invalidate_seq is even in the idle state.
 	 */
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 	if (is_invalidating)
 		wait_event(subscriptions->wq,
 			   READ_ONCE(subscriptions->invalidate_seq) != seq);
@@ -572,13 +564,11 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 	struct mmu_notifier_subscriptions *subscriptions =
 		range->mm->notifier_subscriptions;
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (subscriptions->has_itree)
 		mn_itree_inv_end(subscriptions);
 
 	if (!hlist_empty(&subscriptions->list))
 		mn_hlist_invalidate_end(subscriptions, range, only_end);
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
 void __mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -612,13 +602,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
@@ -1062,8 +1045,6 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *interval_sub)
 	 * The possible sleep on progress in the invalidation requires the
 	 * caller not hold any locks held by invalidation callbacks.
 	 */
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 	if (seq)
 		wait_event(subscriptions->wq,
 			   READ_ONCE(subscriptions->invalidate_seq) != seq);



* [RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (2 preceding siblings ...)
  2020-09-03 17:47 ` [RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Adalbert Lazăr
@ 2020-09-03 17:47 ` Adalbert Lazăr
  2020-09-03 17:47 ` [RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-03 17:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Paolo Bonzini, Mircea Cirjaliu, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

Remote mapping creates a mirror VMA that exposes memory owned by another
process in a zero-copy manner. The pages are mapped into the current process'
address space with no memory management involved and little impact on the
remote process' operation. Currently incompatible with THP.

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/remote_mapping.h      |   22 +
 include/uapi/linux/remote_mapping.h |   36 +
 mm/Kconfig                          |   11 +
 mm/Makefile                         |    1 +
 mm/remote_mapping.c                 | 1273 +++++++++++++++++++++++++++
 5 files changed, 1343 insertions(+)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c

diff --git a/include/linux/remote_mapping.h b/include/linux/remote_mapping.h
new file mode 100644
index 000000000000..5c1d43e8f669
--- /dev/null
+++ b/include/linux/remote_mapping.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_REMOTE_MAPPING_H
+#define _LINUX_REMOTE_MAPPING_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_REMOTE_MAPPING
+
+extern int task_remote_map(struct task_struct *task, int fds[]);
+
+#else /* CONFIG_REMOTE_MAPPING */
+
+static inline int task_remote_map(struct task_struct *task, int fds[])
+{
+	return -EINVAL;
+}
+
+#endif /* CONFIG_REMOTE_MAPPING */
+
+
+#endif /* _LINUX_REMOTE_MAPPING_H */
diff --git a/include/uapi/linux/remote_mapping.h b/include/uapi/linux/remote_mapping.h
new file mode 100644
index 000000000000..5d2828a6aa47
--- /dev/null
+++ b/include/uapi/linux/remote_mapping.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef __UAPI_REMOTE_MAPPING_H__
+#define __UAPI_REMOTE_MAPPING_H__
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+// device file interface
+#define REMOTE_PROC_MAP	_IOW('r', 0x01, int)
+
+
+// pidfd interface
+#define PIDFD_IO_MAGIC	'p'
+
+struct pidfd_mem_map {
+	uint64_t address;
+	off_t offset;
+	off_t size;
+	int flags;
+	int padding[7];
+};
+
+struct pidfd_mem_unmap {
+	off_t offset;
+	off_t size;
+};
+
+#define PIDFD_MEM_MAP	_IOW(PIDFD_IO_MAGIC, 0x01, struct pidfd_mem_map)
+#define PIDFD_MEM_UNMAP _IOW(PIDFD_IO_MAGIC, 0x02, struct pidfd_mem_unmap)
+#define PIDFD_MEM_LOCK	_IOW(PIDFD_IO_MAGIC, 0x03, int)
+
+#define PIDFD_MEM_REMAP _IOW(PIDFD_IO_MAGIC, 0x04, unsigned long)
+// TODO: actually this is not for pidfd, find better names
+
+#endif /* __UAPI_REMOTE_MAPPING_H__ */
diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c35..0ecc3f41a98e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -804,6 +804,17 @@ config HMM_MIRROR
 	bool
 	depends on MMU
 
+config REMOTE_MAPPING
+	bool "Remote memory mapping"
+	depends on X86_64 && MMU && !TRANSPARENT_HUGEPAGE
+	select MMU_NOTIFIER
+	default n
+	help
+	  Allows a client process to gain access to an unrelated process'
+	  address space on a range-basis. The feature maps pages found at
+	  the remote equivalent address in the current process' page tables
+	  in a lightweight manner.
+
 config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ZONE_DEVICE
diff --git a/mm/Makefile b/mm/Makefile
index fccd3756b25f..ce1a00e7bc8c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -112,3 +112,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_REMOTE_MAPPING) += remote_mapping.o
diff --git a/mm/remote_mapping.c b/mm/remote_mapping.c
new file mode 100644
index 000000000000..1dc53992424b
--- /dev/null
+++ b/mm/remote_mapping.c
@@ -0,0 +1,1273 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Remote memory mapping.
+ *
+ * Copyright (C) 2017-2020 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/pid.h>
+#include <linux/file.h>
+#include <linux/mmu_notifier.h>
+#include <linux/sched/task.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/refcount.h>
+#include <linux/miscdevice.h>
+#include <uapi/linux/remote_mapping.h>
+#include <linux/pfn_t.h>
+#include <linux/errno.h>
+#include <linux/limits.h>
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
+#include <asm/tlb.h>
+#include "internal.h"
+
+struct remote_file_context {
+	refcount_t refcount;
+
+	struct srcu_struct fault_srcu;
+	struct mm_struct *mm;
+
+	bool locked;
+	struct rb_root_cached rb_view;		/* view offset tree */
+	struct mutex views_lock;
+
+};
+
+struct remote_view {
+	refcount_t refcount;
+
+	unsigned long address;
+	unsigned long size;
+	unsigned long offset;
+	bool valid;
+
+	struct rb_node target_rb;		/* link for views tree */
+	unsigned long rb_subtree_last;		/* in remote_file_context */
+
+	struct mmu_interval_notifier mmin;
+	spinlock_t user_lock;
+
+	/*
+	 * interval tree for mapped ranges (indexed by source process HVA)
+	 * because of GPA->HVA aliasing, multiple ranges may overlap
+	 */
+	struct rb_root_cached rb_rmap;		/* rmap tree */
+	struct rw_semaphore rmap_lock;
+};
+
+struct remote_vma_context {
+	struct vm_area_struct *vma;		/* link back to VMA */
+	struct remote_view *view;		/* corresponding view */
+
+	struct rb_node rmap_rb;			/* link for rmap tree */
+	unsigned long rb_subtree_last;
+};
+
+/* view offset tree */
+static inline unsigned long view_start(struct remote_view *view)
+{
+	return view->offset + 1;
+}
+
+static inline unsigned long view_last(struct remote_view *view)
+{
+	return view->offset + view->size - 1;
+}
+
+INTERVAL_TREE_DEFINE(struct remote_view, target_rb,
+	unsigned long, rb_subtree_last, view_start, view_last,
+	static inline, view_interval_tree)
+
+#define view_tree_foreach(view, root, start, last)			\
+	for (view = view_interval_tree_iter_first(root, start, last);	\
+	     view; view = view_interval_tree_iter_next(view, start, last))
+
+/* rmap interval tree */
+static inline unsigned long ctx_start(struct remote_vma_context *ctx)
+{
+	struct vm_area_struct *vma = ctx->vma;
+	struct remote_view *view = ctx->view;
+	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	return offset - view->offset + view->address;
+}
+
+static inline unsigned long ctx_last(struct remote_vma_context *ctx)
+{
+	struct vm_area_struct *vma = ctx->vma;
+	struct remote_view *view = ctx->view;
+	unsigned long offset;
+
+	offset = (vma->vm_pgoff << PAGE_SHIFT) + (vma->vm_end - vma->vm_start);
+
+	return offset - view->offset + view->address;
+}
+
+static inline unsigned long ctx_rmap_start(struct remote_vma_context *ctx)
+{
+	return ctx_start(ctx) + 1;
+}
+
+static inline unsigned long ctx_rmap_last(struct remote_vma_context *ctx)
+{
+	return ctx_last(ctx) - 1;
+}
+
+INTERVAL_TREE_DEFINE(struct remote_vma_context, rmap_rb,
+	unsigned long, rb_subtree_last, ctx_rmap_start, ctx_rmap_last,
+	static inline, rmap_interval_tree)
+
+#define rmap_foreach(ctx, root, start, last)				\
+	for (ctx = rmap_interval_tree_iter_first(root, start, last);	\
+	     ctx; ctx = rmap_interval_tree_iter_next(ctx, start, last))
+
+static int mirror_zap_pte(struct vm_area_struct *vma, unsigned long addr,
+			  pte_t *pte, int rss[], struct mmu_gather *tlb,
+			  struct zap_details *details)
+{
+	pte_t ptent = *pte;
+	struct page *page;
+	int flags = 0;
+
+	page = vm_normal_page(vma, addr, ptent);
+	//ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+	ptent = ptep_clear_flush_notify(vma, addr, pte);
+	//tlb_remove_tlb_entry(tlb, pte, addr);
+
+	if (pte_dirty(ptent)) {
+		flags |= ZAP_PTE_FLUSH;
+		set_page_dirty(page);
+	}
+
+	return flags;
+}
+
+static void
+zap_remote_range(struct vm_area_struct *vma,
+		 unsigned long start, unsigned long end,
+		 bool atomic)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+	struct zap_details details = {
+		.atomic = atomic,
+	};
+
+	pr_debug("%s: vma %lx-%lx, zap range %lx-%lx\n",
+		__func__, vma->vm_start, vma->vm_end, start, end);
+
+	tlb_gather_mmu(&tlb, vma->vm_mm, start, end);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
+				vma, vma->vm_mm, start, end);
+	if (atomic)
+		mmu_notifier_invalidate_range_start_nonblock(&range);
+	else
+		mmu_notifier_invalidate_range_start(&range);
+
+	unmap_page_range(&tlb, vma, start, end, &details);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb, start, end);
+}
+
+static bool
+mirror_clear_view(struct remote_view *view,
+		  unsigned long start, unsigned long last, bool atomic)
+{
+	struct remote_vma_context *ctx;
+	unsigned long src_start, src_last;
+	unsigned long vma_start, vma_last;
+
+	pr_debug("%s: view %p [%lx-%lx), range [%lx-%lx)", __func__, view,
+		 view->offset, view->offset + view->size, start, last);
+
+	if (likely(!atomic))
+		down_read(&view->rmap_lock);
+	else if (!down_read_trylock(&view->rmap_lock))
+		return false;
+
+	rmap_foreach(ctx, &view->rb_rmap, start, last) {
+		struct vm_area_struct *vma = ctx->vma;
+
+		// intersect intervals (source process address range)
+		src_start = max(start, ctx_start(ctx));
+		src_last = min(last, ctx_last(ctx));
+
+		// translate to destination process address range
+		vma_start = vma->vm_start + (src_start - ctx_start(ctx));
+		vma_last = vma->vm_end - (ctx_last(ctx) - src_last);
+
+		zap_remote_range(vma, vma_start, vma_last, atomic);
+	}
+
+	up_read(&view->rmap_lock);
+
+	return true;
+}
+
+static bool mmin_invalidate(struct mmu_interval_notifier *interval_sub,
+			    const struct mmu_notifier_range *range,
+			    unsigned long cur_seq)
+{
+	struct remote_view *view =
+		container_of(interval_sub, struct remote_view, mmin);
+
+	pr_debug("%s: reason %d, range [%lx-%lx)\n", __func__,
+		 range->event, range->start, range->end);
+
+	spin_lock(&view->user_lock);
+	mmu_interval_set_seq(interval_sub, cur_seq);
+	spin_unlock(&view->user_lock);
+
+	/* mark view as invalid before zapping the page tables */
+	if (range->event == MMU_NOTIFY_RELEASE)
+		WRITE_ONCE(view->valid, false);
+
+	return mirror_clear_view(view, range->start, range->end,
+				 !mmu_notifier_range_blockable(range));
+}
+
+static const struct mmu_interval_notifier_ops mmin_ops = {
+	.invalidate = mmin_invalidate,
+};
+
+static void view_init(struct remote_view *view)
+{
+	refcount_set(&view->refcount, 1);
+	view->valid = true;
+	RB_CLEAR_NODE(&view->target_rb);
+	view->rb_rmap = RB_ROOT_CACHED;
+	init_rwsem(&view->rmap_lock);
+	spin_lock_init(&view->user_lock);
+}
+
+/* return working view or reason why it failed */
+static struct remote_view *
+view_alloc(struct mm_struct *mm, unsigned long address, unsigned long size, unsigned long offset)
+{
+	struct remote_view *view;
+	int result;
+
+	view = kzalloc(sizeof(*view), GFP_KERNEL);
+	if (!view)
+		return ERR_PTR(-ENOMEM);
+
+	view_init(view);
+
+	view->address = address;
+	view->size = size;
+	view->offset = offset;
+
+	pr_debug("%s: view %p [%lx-%lx)", __func__, view,
+		 view->offset, view->offset + view->size);
+
+	result = mmu_interval_notifier_insert(&view->mmin, mm, address, size, &mmin_ops);
+	if (result) {
+		kfree(view);
+		return ERR_PTR(result);
+	}
+
+	return view;
+}
+
+static void
+view_insert(struct remote_file_context *fctx, struct remote_view *view)
+{
+	view_interval_tree_insert(view, &fctx->rb_view);
+	refcount_inc(&view->refcount);
+}
+
+static struct remote_view *
+view_search_get(struct remote_file_context *fctx,
+	unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	lockdep_assert_held(&fctx->views_lock);
+
+	/*
+	 * loop & return the first view intersecting interval
+	 * further checks will be done down the road
+	 */
+	view_tree_foreach(view, &fctx->rb_view, start, last)
+		break;
+
+	if (view)
+		refcount_inc(&view->refcount);
+
+	return view;
+}
+
+static void
+view_put(struct remote_view *view)
+{
+	if (refcount_dec_and_test(&view->refcount)) {
+		pr_debug("%s: view %p [%lx-%lx) bye bye", __func__, view,
+			 view->offset, view->offset + view->size);
+
+		mmu_interval_notifier_remove(&view->mmin);
+		kfree(view);
+	}
+}
+
+static void
+view_remove(struct remote_file_context *fctx, struct remote_view *view)
+{
+	view_interval_tree_remove(view, &fctx->rb_view);
+	RB_CLEAR_NODE(&view->target_rb);
+	view_put(view);
+}
+
+static bool
+view_overlaps(struct remote_file_context *fctx,
+	unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	view_tree_foreach(view, &fctx->rb_view, start, last)
+		return true;
+
+	return false;
+}
+
+static struct remote_view *
+alloc_identity_view(struct mm_struct *mm)
+{
+	return view_alloc(mm, 0, ULONG_MAX, 0);
+}
+
+static void remote_file_context_init(struct remote_file_context *fctx)
+{
+	refcount_set(&fctx->refcount, 1);
+	init_srcu_struct(&fctx->fault_srcu);
+	fctx->locked = false;
+	fctx->rb_view = RB_ROOT_CACHED;
+	mutex_init(&fctx->views_lock);
+}
+
+static struct remote_file_context *remote_file_context_alloc(void)
+{
+	struct remote_file_context *fctx;
+
+	fctx = kzalloc(sizeof(*fctx), GFP_KERNEL);
+	if (fctx)
+		remote_file_context_init(fctx);
+
+	pr_debug("%s: fctx %p\n", __func__, fctx);
+
+	return fctx;
+}
+
+static void remote_file_context_get(struct remote_file_context *fctx)
+{
+	refcount_inc(&fctx->refcount);
+}
+
+static void remote_file_context_put(struct remote_file_context *fctx)
+{
+	struct remote_view *view, *n;
+
+	if (refcount_dec_and_test(&fctx->refcount)) {
+		pr_debug("%s: fctx %p\n", __func__, fctx);
+
+		rbtree_postorder_for_each_entry_safe(view, n,
+			&fctx->rb_view.rb_root, target_rb)
+			view_put(view);
+
+		if (fctx->mm)
+			mmdrop(fctx->mm);
+
+		kfree(fctx);
+	}
+}
+
+static void remote_vma_context_init(struct remote_vma_context *ctx)
+{
+	RB_CLEAR_NODE(&ctx->rmap_rb);
+}
+
+static struct remote_vma_context *remote_vma_context_alloc(void)
+{
+	struct remote_vma_context *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (ctx)
+		remote_vma_context_init(ctx);
+
+	return ctx;
+}
+
+static void remote_vma_context_free(struct remote_vma_context *ctx)
+{
+	kfree(ctx);
+}
+
+static int mirror_dev_open(struct inode *inode, struct file *file)
+{
+	struct remote_file_context *fctx;
+
+	pr_debug("%s: file %p\n", __func__, file);
+
+	fctx = remote_file_context_alloc();
+	if (!fctx)
+		return -ENOMEM;
+	file->private_data = fctx;
+
+	return 0;
+}
+
+static int do_remote_proc_map(struct file *file, int pid)
+{
+	struct remote_file_context *fctx = file->private_data;
+	struct task_struct *req_task;
+	struct mm_struct *req_mm;
+	struct remote_view *id;
+	int result = 0;
+
+	pr_debug("%s: pid %d\n", __func__, pid);
+
+	req_task = find_get_task_by_vpid(pid);
+	if (!req_task)
+		return -ESRCH;
+
+	req_mm = get_task_mm(req_task);
+	put_task_struct(req_task);
+	if (!req_mm)
+		return -EINVAL;
+
+	/* error on 2nd call or multithreaded race */
+	if (cmpxchg(&fctx->mm, (struct mm_struct *)NULL, req_mm) != NULL) {
+		result = -EALREADY;
+		goto out;
+	} else
+		mmgrab(req_mm);
+
+	id = alloc_identity_view(req_mm);
+	if (IS_ERR(id)) {
+		mmdrop(req_mm);
+		result = PTR_ERR(id);
+		goto out;
+	}
+
+	/* one view only, don't need to take mutex */
+	view_insert(fctx, id);
+	view_put(id);			/* usage reference */
+
+out:
+	mmput(req_mm);
+
+	return result;
+}
+
+static long mirror_dev_ioctl(struct file *file, unsigned int ioctl,
+	unsigned long arg)
+{
+	long result;
+
+	switch (ioctl) {
+	case REMOTE_PROC_MAP: {
+		int pid = (int)arg;
+
+		result = do_remote_proc_map(file, pid);
+		break;
+	}
+
+	default:
+		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+/*
+ * This is called after all reference to the file have been dropped,
+ * including mmap()s, even if the file is close()d first.
+ */
+static int mirror_dev_release(struct inode *inode, struct file *file)
+{
+	struct remote_file_context *fctx = file->private_data;
+
+	pr_debug("%s: file %p\n", __func__, file);
+
+	remote_file_context_put(fctx);
+
+	return 0;
+}
+
+static struct page *mm_remote_get_page(struct mm_struct *req_mm,
+	unsigned long address, unsigned int flags)
+{
+	struct page *req_page = NULL;
+	long nrpages;
+
+	might_sleep();
+
+	flags |= FOLL_ANON | FOLL_MIGRATION;
+
+	/* get host page corresponding to requested address */
+	nrpages = get_user_pages_remote(NULL, req_mm, address, 1,
+		flags, &req_page, NULL, NULL);
+	if (unlikely(nrpages == 0)) {
+		pr_err("no page at %lx\n", address);
+		return ERR_PTR(-ENOENT);
+	}
+	if (IS_ERR_VALUE(nrpages)) {
+		pr_err("get_user_pages_remote() failed: %d\n", (int)nrpages);
+		return ERR_PTR(nrpages);
+	}
+
+	/* limit introspection to anon memory (this also excludes zero-page) */
+	if (!PageAnon(req_page)) {
+		put_page(req_page);
+		pr_err("page at %lx not anon\n", address);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return req_page;
+}
+
+/*
+ * avoid PTE allocation in this function for 2 reasons:
+ * - it runs under user_lock, which is a spinlock and can't sleep
+ *   (user_lock can be a mutex if allocation is needed)
+ * - PTE allocation triggers reclaim, which causes a possible deadlock warning
+ */
+static vm_fault_t remote_map_page(struct vm_fault *vmf, struct page *page)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	pte_t entry;
+
+	if (vmf->prealloc_pte) {
+		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+		if (unlikely(!pmd_none(*vmf->pmd))) {
+			spin_unlock(vmf->ptl);
+			goto map_pte;
+		}
+
+		mm_inc_nr_ptes(vma->vm_mm);
+		pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+		spin_unlock(vmf->ptl);
+		vmf->prealloc_pte = NULL;
+	} else {
+		BUG_ON(pmd_none(*vmf->pmd));
+	}
+
+map_pte:
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+
+	if (!pte_none(*vmf->pte))
+		goto out_unlock;
+
+	entry = mk_pte(page, vma->vm_page_prot);
+	set_pte_at_notify(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+out_unlock:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+static vm_fault_t mirror_vm_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct remote_vma_context *ctx = vma->vm_private_data;
+	struct remote_view *view = ctx->view;
+	struct file *file = vma->vm_file;
+	struct remote_file_context *fctx = file->private_data;
+	unsigned long req_addr;
+	unsigned int gup_flags;
+	struct page *req_page;
+	vm_fault_t result = VM_FAULT_SIGBUS;
+	struct mm_struct *src_mm = fctx->mm;
+	unsigned long seq;
+	int idx;
+
+fault_retry:
+	seq = mmu_interval_read_begin(&view->mmin);
+
+	idx = srcu_read_lock(&fctx->fault_srcu);
+
+	/* check if view was invalidated */
+	if (unlikely(!READ_ONCE(view->valid))) {
+		pr_debug("%s: region [%lx-%lx) was invalidated!!\n", __func__,
+			view->offset, view->offset + view->size);
+		goto out_invalid;		/* VM_FAULT_SIGBUS */
+	}
+
+	/* drop current mm semaphore */
+	up_read(&current->mm->mmap_sem);
+
+	/* take remote mm semaphore */
+	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+		if (!down_read_trylock(&src_mm->mmap_sem)) {
+			pr_debug("%s: couldn't take source semaphore!!\n", __func__);
+			goto out_retry;
+		}
+	} else
+		down_read(&src_mm->mmap_sem);
+
+	/* set GUP flags depending on the VMA */
+	gup_flags = 0;
+	if (vma->vm_flags & VM_WRITE)
+		gup_flags |= FOLL_WRITE | FOLL_FORCE;
+
+	/* translate file offset to source process HVA */
+	req_addr = (vmf->pgoff << PAGE_SHIFT) - view->offset + view->address;
+	req_page = mm_remote_get_page(src_mm, req_addr, gup_flags);
+
+	/* check for validity of the page */
+	if (IS_ERR_OR_NULL(req_page)) {
+		up_read(&src_mm->mmap_sem);
+
+		if (PTR_ERR(req_page) == -ERESTARTSYS ||
+		    PTR_ERR(req_page) == -EBUSY) {
+			goto out_retry;
+		} else
+			goto out_err;	/* VM_FAULT_SIGBUS */
+	}
+
+	up_read(&src_mm->mmap_sem);
+
+	/* retake current mm semaphore */
+	down_read(&current->mm->mmap_sem);
+
+	/* expedite retry */
+	if (mmu_interval_check_retry(&view->mmin, seq)) {
+		put_page(req_page);
+
+		srcu_read_unlock(&fctx->fault_srcu, idx);
+
+		goto fault_retry;
+	}
+
+	/* make sure the VMA hasn't gone away */
+	vma = find_vma(current->mm, vmf->address);
+	if (vma == vmf->vma) {
+		spin_lock(&view->user_lock);
+
+		if (mmu_interval_read_retry(&view->mmin, seq)) {
+			spin_unlock(&view->user_lock);
+
+			put_page(req_page);
+
+			srcu_read_unlock(&fctx->fault_srcu, idx);
+
+			goto fault_retry;
+		}
+
+		result = remote_map_page(vmf, req_page);  /* install PTE here */
+
+		spin_unlock(&view->user_lock);
+	}
+
+	put_page(req_page);
+
+	srcu_read_unlock(&fctx->fault_srcu, idx);
+
+	return result;
+
+out_err:
+	/* retake current mm semaphore */
+	down_read(&current->mm->mmap_sem);
+out_invalid:
+	srcu_read_unlock(&fctx->fault_srcu, idx);
+
+	return result;
+
+out_retry:
+	/* retake current mm semaphore */
+	down_read(&current->mm->mmap_sem);
+
+	srcu_read_unlock(&fctx->fault_srcu, idx);
+
+	/* TODO: some optimizations work here when we arrive with FAULT_FLAG_ALLOW_RETRY */
+	/* TODO: mmap_sem doesn't need to be taken, then dropped */
+
+	/*
+	 * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+	 * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+	 * not set.
+	 *
+	 * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+	 * set, VM_FAULT_RETRY can still be returned if and only if there are
+	 * fatal_signal_pending()s, and the mmap_sem must be released before
+	 * returning it.
+	 */
+	if (vmf->flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_TRIED)) {
+		if (!(vmf->flags & FAULT_FLAG_KILLABLE))
+			if (current && fatal_signal_pending(current)) {
+				up_read(&current->mm->mmap_sem);
+				return VM_FAULT_RETRY;
+			}
+
+		if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
+			up_read(&mm->mmap_sem);
+
+		return VM_FAULT_RETRY;
+	} else
+		return VM_FAULT_SIGBUS;
+}
+
+/*
+ * This is called in remove_vma() at the end of __do_munmap() after the address
+ * space has been unmapped and the page tables have been freed.
+ */
+static void mirror_vm_close(struct vm_area_struct *vma)
+{
+	struct remote_vma_context *ctx = vma->vm_private_data;
+	struct remote_view *view = ctx->view;
+
+	pr_debug("%s: VMA %lx-%lx (%lu bytes)\n", __func__,
+		vma->vm_start, vma->vm_end, vma->vm_end - vma->vm_start);
+
+	/* will wait for any running invalidate notifiers to finish */
+	down_write(&view->rmap_lock);
+	rmap_interval_tree_remove(ctx, &view->rb_rmap);
+	up_write(&view->rmap_lock);
+	view_put(view);
+
+	remote_vma_context_free(ctx);
+}
+
+/* prevent partial unmap of destination VMA */
+static int mirror_vm_split(struct vm_area_struct *area, unsigned long addr)
+{
+	return -EINVAL;
+}
+
+static const struct vm_operations_struct mirror_vm_ops = {
+	.close = mirror_vm_close,
+	.fault = mirror_vm_fault,
+	.split = mirror_vm_split,
+	.zap_pte = mirror_zap_pte,
+};
+
+static bool is_mirror_vma(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &mirror_vm_ops;
+}
+
+static struct remote_view *
+getme_matching_view(struct remote_file_context *fctx,
+		    unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	/* lookup view for the VMA offset range */
+	view = view_search_get(fctx, start, last);
+	if (!view)
+		return NULL;
+
+	/* make sure the interval we're after is contained in the view */
+	if (start < view->offset || last > view->offset + view->size) {
+		view_put(view);
+		return NULL;
+	}
+
+	return view;
+}
+
+static struct remote_view *
+getme_exact_view(struct remote_file_context *fctx,
+		 unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	/* lookup view for the VMA offset range */
+	view = view_search_get(fctx, start, last);
+	if (!view)
+		return NULL;
+
+	/* make sure the interval we're after is contained in the view */
+	if (start != view->offset || last != view->offset + view->size) {
+		view_put(view);
+		return NULL;
+	}
+
+	return view;
+}
+
+static int mirror_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct remote_file_context *fctx = file->private_data;
+	struct remote_vma_context *ctx;
+	unsigned long start, length, last;
+	struct remote_view *view;
+
+	start = vma->vm_pgoff << PAGE_SHIFT;
+	length = vma->vm_end - vma->vm_start;
+	last = start + length;
+
+	pr_debug("%s: VMA %lx-%lx (%lu bytes), offsets %lx-%lx\n", __func__,
+		vma->vm_start, vma->vm_end, length, start, last);
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		pr_debug("%s: VMA is not shared\n", __func__);
+		return -EINVAL;
+	}
+
+	/* prepare the context */
+	ctx = remote_vma_context_alloc();
+	if (!ctx)
+		return -ENOMEM;
+
+	/* lookup view for the VMA offset range */
+	mutex_lock(&fctx->views_lock);
+	view = getme_matching_view(fctx, start, last);
+	mutex_unlock(&fctx->views_lock);
+	if (!view) {
+		pr_debug("%s: no view for range %lx-%lx\n", __func__, start, last);
+		remote_vma_context_free(ctx);
+		return -EINVAL;
+	}
+
+	/* VMA must be linked to ctx before adding to rmap tree !! */
+	vma->vm_private_data = ctx;
+	ctx->vma = vma;
+
+	/* view may already be invalidated by the time it's linked */
+	down_write(&view->rmap_lock);
+	ctx->view = view;	/* view reference goes here */
+	rmap_interval_tree_insert(ctx, &view->rb_rmap);
+	up_write(&view->rmap_lock);
+
+	/* set basic VMA properties */
+	vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_ops = &mirror_vm_ops;
+
+	return 0;
+}
+
+static const struct file_operations mirror_ops = {
+	.open = mirror_dev_open,
+	.unlocked_ioctl = mirror_dev_ioctl,
+	.compat_ioctl = mirror_dev_ioctl,
+	.llseek = no_llseek,
+	.mmap = mirror_dev_mmap,
+	.release = mirror_dev_release,
+};
+
+static struct miscdevice mirror_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mirror-proc",
+	.fops = &mirror_ops,
+};
+
+builtin_misc_device(mirror_dev);
+
+static int pidfd_mem_remap(struct remote_file_context *fctx, unsigned long address)
+{
+	struct vm_area_struct *vma;
+	unsigned long start, last;
+	struct remote_vma_context *ctx;
+	struct remote_view *view, *new_view;
+	int result = 0;
+
+	pr_debug("%s: address %lx\n", __func__, address);
+
+	down_write(&current->mm->mmap_sem);
+
+	vma = find_vma(current->mm, address);
+	if (!vma || !is_mirror_vma(vma)) {
+		result = -EINVAL;
+		goto out_vma;
+	}
+
+	ctx = vma->vm_private_data;
+	view = ctx->view;
+
+	if (view->valid)
+		goto out_vma;
+
+	start = vma->vm_pgoff << PAGE_SHIFT;
+	last = start + (vma->vm_end - vma->vm_start);
+
+	/* lookup view for the VMA offset range */
+	mutex_lock(&fctx->views_lock);
+	new_view = getme_matching_view(fctx, start, last);
+	mutex_unlock(&fctx->views_lock);
+	if (!new_view) {
+		result = -EINVAL;
+		goto out_vma;
+	}
+	/* do not link to another invalid view */
+	if (!new_view->valid) {
+		view_put(new_view);
+		result = -EINVAL;
+		goto out_vma;
+	}
+
+	/* we have current->mm->mmap_sem in write mode, so no faults going on */
+	down_write(&view->rmap_lock);
+	rmap_interval_tree_remove(ctx, &view->rb_rmap);
+	up_write(&view->rmap_lock);
+	view_put(view);		/* ctx reference */
+
+	/* replace with the new view */
+	down_write(&new_view->rmap_lock);
+	ctx->view = new_view;	/* new view reference goes here */
+	rmap_interval_tree_insert(ctx, &new_view->rb_rmap);
+	up_write(&new_view->rmap_lock);
+
+out_vma:
+	up_write(&current->mm->mmap_sem);
+
+	return result;
+}
+
+static long
+pidfd_mem_map_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
+{
+	struct remote_file_context *fctx = file->private_data;
+	long result = 0;
+
+	switch (ioctl) {
+	case PIDFD_MEM_REMAP:
+		result = pidfd_mem_remap(fctx, arg);
+		break;
+
+	default:
+		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static void pidfd_mem_lock(struct remote_file_context *fctx)
+{
+	pr_debug("%s:\n", __func__);
+
+	mutex_lock(&fctx->views_lock);
+	fctx->locked = true;
+	mutex_unlock(&fctx->views_lock);
+}
+
+static int pidfd_mem_map(struct remote_file_context *fctx, struct pidfd_mem_map *map)
+{
+	struct remote_view *view;
+	int result = 0;
+
+	pr_debug("%s: offset %lx, size %lx, address %lx\n",
+		__func__, map->offset, map->size, (long)map->address);
+
+	if (!PAGE_ALIGNED(map->offset))
+		return -EINVAL;
+	if (!PAGE_ALIGNED(map->size))
+		return -EINVAL;
+	if (!PAGE_ALIGNED(map->address))
+		return -EINVAL;
+
+	/* make sure we're creating the view for a valid address space */
+	if (!mmget_not_zero(fctx->mm))
+		return -EINVAL;
+
+	view = view_alloc(fctx->mm, map->address, map->size, map->offset);
+	if (IS_ERR(view)) {
+		result = PTR_ERR(view);
+		goto out_mm;
+	}
+
+	mutex_lock(&fctx->views_lock);
+
+	/* locked ? */
+	if (unlikely(fctx->locked)) {
+		pr_debug("%s: views locked\n", __func__);
+		result = -EINVAL;
+		goto out;
+	}
+
+	/* overlaps another view ? */
+	if (view_overlaps(fctx, map->offset, map->offset + map->size)) {
+		pr_debug("%s: range overlaps\n", __func__);
+		result = -EALREADY;
+		goto out;
+	}
+
+	view_insert(fctx, view);
+
+out:
+	mutex_unlock(&fctx->views_lock);
+
+	view_put(view);			/* usage reference */
+out_mm:
+	mmput(fctx->mm);
+
+	return result;
+}
+
+static int pidfd_mem_unmap(struct remote_file_context *fctx, struct pidfd_mem_unmap *unmap)
+{
+	struct remote_view *view;
+
+	pr_debug("%s: offset %lx, size %lx\n",
+		__func__, unmap->offset, unmap->size);
+
+	if (!PAGE_ALIGNED(unmap->offset))
+		return -EINVAL;
+	if (!PAGE_ALIGNED(unmap->size))
+		return -EINVAL;
+
+	mutex_lock(&fctx->views_lock);
+
+	if (unlikely(fctx->locked)) {
+		mutex_unlock(&fctx->views_lock);
+		return -EINVAL;
+	}
+
+	view = getme_exact_view(fctx, unmap->offset, unmap->offset + unmap->size);
+	if (!view) {
+		mutex_unlock(&fctx->views_lock);
+		return -EINVAL;
+	}
+
+	view_remove(fctx, view);
+
+	mutex_unlock(&fctx->views_lock);
+
+	/*
+	 * The view may still be referenced by a mapping VMA, so dropping
+	 * a reference here may not delete it. The view will be marked as
+	 * invalid, together with all the VMAs linked to it.
+	 */
+	WRITE_ONCE(view->valid, false);
+
+	/* wait until local faults finish */
+	synchronize_srcu(&fctx->fault_srcu);
+
+	/*
+	 * because the view is marked as invalid, faults will not succeed, so
+	 * we don't have to worry about synchronizing invalidations/faults
+	 */
+	mirror_clear_view(view, 0, ULONG_MAX, false);
+
+	view_put(view);			/* usage reference */
+
+	return 0;
+}
+
+static long
+pidfd_mem_ctrl_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
+{
+	struct remote_file_context *fctx = file->private_data;
+	void __user *argp = (void __user *)arg;
+	long result = 0;
+
+	switch (ioctl) {
+	case PIDFD_MEM_MAP: {
+		struct pidfd_mem_map map;
+
+		result = -EINVAL;
+		if (copy_from_user(&map, argp, sizeof(map)))
+			return result;
+
+		result = pidfd_mem_map(fctx, &map);
+		break;
+	}
+
+	case PIDFD_MEM_UNMAP: {
+		struct pidfd_mem_unmap unmap;
+
+		result = -EINVAL;
+		if (copy_from_user(&unmap, argp, sizeof(unmap)))
+			return result;
+
+		result = pidfd_mem_unmap(fctx, &unmap);
+		break;
+	}
+
+	case PIDFD_MEM_LOCK:
+		pidfd_mem_lock(fctx);
+		break;
+
+	default:
+		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static int pidfd_mem_ctrl_release(struct inode *inode, struct file *file)
+{
+	struct remote_file_context *fctx = file->private_data;
+
+	pr_debug("%s: file %p\n", __func__, file);
+
+	remote_file_context_put(fctx);
+
+	return 0;
+}
+
+static const struct file_operations pidfd_mem_ctrl_ops = {
+	.owner = THIS_MODULE,
+	.unlocked_ioctl = pidfd_mem_ctrl_ioctl,
+	.compat_ioctl = pidfd_mem_ctrl_ioctl,
+	.llseek = no_llseek,
+	.release = pidfd_mem_ctrl_release,
+};
+
+static unsigned long
+pidfd_mem_get_unmapped_area(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct remote_file_context *fctx = file->private_data;
+	unsigned long start = pgoff << PAGE_SHIFT;
+	unsigned long last = start + len;
+	unsigned long remote_addr, align_offset;
+	struct remote_view *view;
+	struct vm_area_struct *vma;
+	unsigned long result;
+
+	pr_debug("%s: addr %lx, len %lx, pgoff %lx, flags %lx\n",
+		__func__, addr, len, pgoff, flags);
+
+	if (flags & MAP_FIXED) {
+		if (addr == 0)
+			return -ENOMEM;
+		else
+			return addr;
+	}
+
+	// TODO: elaborate on this case, we must still have alignment!
+	// TODO: only if THP enabled
+	if (addr == 0)
+		return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+
+	/* use this backing VMA */
+	vma = find_vma(current->mm, addr);
+	if (!vma) {
+		pr_debug("%s: no VMA found at %lx\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	/* VMA was mapped with PROT_NONE */
+	if (vma_is_accessible(vma)) {
+		pr_debug("%s: VMA at %lx is not a backing VMA\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	/*
+	 * if the view somehow gets removed afterwards, we're gonna create a
+	 * VMA for which there's no backing view, so mmap() will fail
+	 */
+	mutex_lock(&fctx->views_lock);
+	view = getme_matching_view(fctx, start, last);
+	mutex_unlock(&fctx->views_lock);
+	if (!view) {
+		pr_debug("%s: no view for range %lx-%lx\n", __func__, start, last);
+		return -EINVAL;
+	}
+
+	/* this should be enough to ensure VMA alignment */
+	remote_addr = start - view->offset + view->address;
+	align_offset = remote_addr % PMD_SIZE;
+
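+	/*
+	 * choose the lowest address >= addr whose offset within a PMD matches
+	 * that of the corresponding remote address, so the local and remote
+	 * ranges share huge page alignment
+	 */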
+	if (addr % PMD_SIZE <= align_offset)
+		result = (addr & PMD_MASK) + align_offset;
+	else
+		result = (addr & PMD_MASK) + align_offset + PMD_SIZE;
+
+	view_put(view);		/* usage reference */
+
+	return result;
+}
+
+static const struct file_operations pidfd_mem_map_fops = {
+	.owner = THIS_MODULE,
+	.mmap = mirror_dev_mmap,
+	.get_unmapped_area = pidfd_mem_get_unmapped_area,
+	.unlocked_ioctl = pidfd_mem_map_ioctl,
+	.compat_ioctl = pidfd_mem_map_ioctl,
+	.llseek = no_llseek,
+	.release = mirror_dev_release,
+};
+
+int task_remote_map(struct task_struct *task, int fds[])
+{
+	struct mm_struct *mm;
+	struct remote_file_context *fctx;
+	struct file *ctrl, *map;
+	int ret;
+
+	// allocate common file context
+	fctx = remote_file_context_alloc();
+	if (!fctx)
+		return -ENOMEM;
+
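+	/*
+	 * fctx is created with a usage reference; each of the two files below
+	 * takes its own reference, and the usage reference is dropped before
+	 * returning, on both the success and the error paths
+	 */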
+	// create these 2 fds
+	fds[0] = fds[1] = -1;
+
+	fds[0] = anon_inode_getfd("[pidfd_mem.ctrl]", &pidfd_mem_ctrl_ops, fctx,
+				  O_RDWR | O_CLOEXEC);
+	if (fds[0] < 0) {
+		ret = fds[0];
+		goto out;
+	}
+	remote_file_context_get(fctx);
+
+	ctrl = fget(fds[0]);
+	ctrl->f_mode |= FMODE_WRITE_IOCTL;
+	fput(ctrl);
+
+	fds[1] = anon_inode_getfd("[pidfd_mem.map]", &pidfd_mem_map_fops, fctx,
+				  O_RDWR | O_CLOEXEC | O_LARGEFILE);
+	if (fds[1] < 0) {
+		ret = fds[1];
+		goto out;
+	}
+	remote_file_context_get(fctx);
+
+	map = fget(fds[1]);
+	map->f_mode |= FMODE_LSEEK | FMODE_UNSIGNED_OFFSET | FMODE_RANDOM;
+	fput(map);
+
+	mm = get_task_mm(task);
+	if (!mm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* reference this mm in fctx */
+	mmgrab(mm);
+	fctx->mm = mm;
+
+	mmput(mm);
+	remote_file_context_put(fctx);		/* usage reference */
+
+	return 0;
+
+out:
+	if (fds[0] != -1) {
+		__close_fd(current->files, fds[0]);
+		remote_file_context_put(fctx);
+	}
+
+	if (fds[1] != -1) {
+		__close_fd(current->files, fds[1]);
+		remote_file_context_put(fctx);
+	}
+
+	// TODO: using __close_fd() does not guarantee success, use other means
+	// for file allocation & error recovery
+
+	remote_file_context_put(fctx);
+
+	return ret;
+}


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call
  2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (3 preceding siblings ...)
  2020-09-03 17:47 ` [RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
@ 2020-09-03 17:47 ` Adalbert Lazăr
  2020-09-03 18:08 ` [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
  2020-09-04  9:54 ` Christian Brauner
  6 siblings, 0 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-03 17:47 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Paolo Bonzini, Mircea Cirjaliu, Christian Brauner,
	Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

This system call returns two file descriptors for inspecting the address
space of a remote process: one for control and one for access, to be used
according to the remote mapping specification.

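For illustration, a minimal userspace sketch of the intended call sequence,
assuming the uapi structures from include/uapi/linux/remote_mapping.h and the
syscall numbers added by this series (error handling omitted; target_pid,
remote_va and length are placeholders):

  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/remote_mapping.h>

  int fds[2];   /* fds[0]: control, fds[1]: access */
  int pidfd = syscall(__NR_pidfd_open, target_pid, 0);

  syscall(__NR_pidfd_mem, pidfd, fds, 0);

  /* expose [remote_va, remote_va + length) of the target at offset 0 */
  struct pidfd_mem_map map = {
          .address = remote_va,
          .offset  = 0,
          .size    = length,
  };
  ioctl(fds[0], PIDFD_MEM_MAP, &map);

  /* mirror the exposed range into the caller's address space */
  void *mem = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fds[1], 0);
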
Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/pid.h                    |  1 +
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +
 kernel/exit.c                          |  2 +-
 kernel/pid.c                           | 55 ++++++++++++++++++++++++++
 7 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 54581ac671b4..ca1b5a32dbc5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,5 +440,6 @@
 433	i386	fspick			sys_fspick
 434	i386	pidfd_open		sys_pidfd_open
 435	i386	clone3			sys_clone3
+436	i386	pidfd_mem		sys_pidfd_mem
 437	i386	openat2			sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37b844f839bc..6138d3d023f8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick			sys_fspick
 434	common	pidfd_open		sys_pidfd_open
 435	common	clone3			sys_clone3
+436	common	pidfd_mem		sys_pidfd_mem
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd		sys_pidfd_getfd
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..9ec23ab23fd4 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -76,6 +76,7 @@ extern const struct file_operations pidfd_fops;
 
 struct file;
 
+extern struct pid *pidfd_get_pid(unsigned int fd);
 extern struct pid *pidfd_pid(const struct file *file);
 
 static inline struct pid *get_pid(struct pid *pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..621f3d52ed4e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -934,6 +934,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
 asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
+asmlinkage long sys_pidfd_mem(int pidfd, int __user *fds, unsigned int flags);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 3a3201e4618e..2663afc03c86 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,6 +850,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
+#define __NR_pidfd_mem 436
+__SYSCALL(__NR_pidfd_mem, sys_pidfd_mem)
 
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
diff --git a/kernel/exit.c b/kernel/exit.c
index 389a88cb3081..37cd8949e606 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1464,7 +1464,7 @@ static long do_wait(struct wait_opts *wo)
 	return retval;
 }
 
-static struct pid *pidfd_get_pid(unsigned int fd)
+struct pid *pidfd_get_pid(unsigned int fd)
 {
 	struct fd f;
 	struct pid *pid;
diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..c9c49edf4a8a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/remote_mapping.h>
 
 struct pid init_struct_pid = {
 	.count		= REFCOUNT_INIT(1),
@@ -565,6 +566,60 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	return fd;
 }
 
+/**
+ * pidfd_mem() - Allow access to process address space.
+ *
+ * @pidfd: pid file descriptor for the target process
+ * @fds:   array where the control and access file descriptors are returned
+ * @flags: flags to pass
+ *
+ * This creates a pair of file descriptors used to gain access to the
+ * target process memory. The control fd is used to establish a linear
+ * mapping between an offset range and a userspace address range.
+ * The access fd is used to mmap(offset range) on the client side.
+ *
+ * Return: On success, 0 is returned.
+ *         On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE3(pidfd_mem, int, pidfd, int __user *, fds, unsigned int, flags)
+{
+	struct pid *pid;
+	struct task_struct *task;
+	int ret_fds[2];
+	int ret;
+
+	if (pidfd < 0)
+		return -EINVAL;
+	if (!fds)
+		return -EINVAL;
+	if (flags)
+		return -EINVAL;
+
+	pid = pidfd_get_pid(pidfd);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	put_pid(pid);
+	if (!task)
+		return -ESRCH;
+
+	ret = -EPERM;
+	if (unlikely(task == current) || capable(CAP_SYS_PTRACE))
+		ret = task_remote_map(task, ret_fds);
+	put_task_struct(task);
+	if (ret < 0)
+		return ret;
+
+	if (copy_to_user(fds, ret_fds, sizeof(ret_fds))) {
+		put_unused_fd(ret_fds[0]);
+		put_unused_fd(ret_fds[1]);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/5] Remote mapping
  2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (4 preceding siblings ...)
  2020-09-03 17:47 ` [RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
@ 2020-09-03 18:08 ` Adalbert Lazăr
  2020-09-04  9:54 ` Christian Brauner
  6 siblings, 0 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-03 18:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Paolo Bonzini, Mihai Donțu, Mircea Cirjaliu

CC+= Mihai, Mircea

On Thu,  3 Sep 2020 20:47:25 +0300, Adalbert Lazăr <alazar@bitdefender.com> wrote:
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.
> 
> The feature was designed according to a specification suggested by Paolo Bonzini:
> >> The proposed API is a new pidfd system call, through which the parent
> >> can map portions of its virtual address space into a file descriptor
> >> and then pass that file descriptor to a child.
> >>
> >> This should be:
> >>
> >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> >> better way to do PTRACE_{PEEK,POKE}DATA
> >>
> >> - relatively easy to do based on the bitdefender remote process
> >> mapping patches at.
> >>
> >> - pidfd_mem() takes a pidfd and some flags (which are 0) and returns
> >> two file descriptors for respectively the control plane and the memory access.
> >>
> >> - the control plane accepts three ioctls
> >>
> >> PIDFD_MEM_MAP takes a struct like
> >>
> >>     struct pidfd_mem_map {
> >>          uint64_t address;
> >>          off_t offset;
> >>          off_t size;
> >>          int flags;
> >>          int padding[7];
> >>     }
> >>
> >> After this is done, the memory access fd can be mmap-ed at range
> >> [offset,
> >> offset+size), and it will read memory from range [address,
> >> address+size) of the target descriptor.
> >>
> >> PIDFD_MEM_UNMAP takes a struct like
> >>
> >>     struct pidfd_mem_unmap {
> >>          off_t offset;
> >>          off_t size;
> >>     }
> >>
> >> and unmaps the corresponding range of course.
> >>
> >> Finally PIDFD_MEM_LOCK forbids subsequent PIDFD_MEM_MAP or
> >> PIDFD_MEM_UNMAP.  For now I think it should just check that the
> >> argument is zero, bells and whistles can be added later.
> >>
> >> - the memory access fd can be mmap-ed as in the bitdefender patches
> >> but also accessed with read/write/pread/pwrite/...  As in the
> >> BitDefender patches, MMU notifiers can be used to adjust any mmap-ed
> >> regions when the source address space changes.  In this case,
> >> PIDFD_MEM_UNMAP could also cause a pre-existing mmap to "disappear".
> (it currently doesn't support read/write/pread/pwrite/...)
> 
> The main remote mapping patch also contains the legacy implementation which
> creates a region the size of the whole process address space by means of the
> REMOTE_PROC_MAP ioctl. The user is then free to mmap() any region of the
> address space it wishes.
> 
> VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> process address space within the specified range. Pages are installed in the
> current process page tables at fault time and removed by the mmu_interval_notifier
> invalidate callback. No further memory management is involved.
> On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> or if the remote process address space was reaped by OOM, the remote mapping
> fault handler returns VM_FAULT_SIGBUS.
> 
> At Bitdefender we are using remote mapping for virtual machine introspection:
> - the QEMU running the introspected machine creates the pair of file descriptors,
> passes the access fd to the introspector QEMU, and uses the control fd to allow
> access to the memslots it creates for its machine
> - the QEMU running the introspector machine receives the access fd and mmap()s
> the regions made available, then hotplugs the obtained memory in its machine
> Having this setup creates nested invalidate_range_start/end MMU notifier calls.
> 
> Patch organization:
> - patch 1 allows unmap_page_range() to run without rescheduling
>   Needed for remote mapping to zap current process page tables when OOM calls
>   mmu_notifier_invalidate_range_start_nonblock(&range)
> 
> - patch 2 creates VMA-specific zapping behavior
>   A remote mapping VMA does not own the pages it maps, so all it has to do is
>   clear the PTEs.
> 
> - patch 3 removes the MMU notifier lockdep map
>   It was just incompatible with our use case.
> 
> - patch 4 is the remote mapping implementation
> 
> - patch 5 adds suggested pidfd_mem system call
> 
> Mircea Cirjaliu (5):
>   mm: add atomic capability to zap_details
>   mm: let the VMA decide how zap_pte_range() acts on mapped pages
>   mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in
>     nested scenarios
>   mm/remote_mapping: use a pidfd to access memory belonging to unrelated
>     process
>   pidfd_mem: implemented remote memory mapping system call
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 +
>  include/linux/mm.h                     |   22 +
>  include/linux/mmu_notifier.h           |    5 +-
>  include/linux/pid.h                    |    1 +
>  include/linux/remote_mapping.h         |   22 +
>  include/linux/syscalls.h               |    1 +
>  include/uapi/asm-generic/unistd.h      |    2 +
>  include/uapi/linux/remote_mapping.h    |   36 +
>  kernel/exit.c                          |    2 +-
>  kernel/pid.c                           |   55 +
>  mm/Kconfig                             |   11 +
>  mm/Makefile                            |    1 +
>  mm/memory.c                            |  193 ++--
>  mm/mmu_notifier.c                      |   19 -
>  mm/remote_mapping.c                    | 1273 ++++++++++++++++++++++++
>  16 files changed, 1535 insertions(+), 110 deletions(-)
>  create mode 100644 include/linux/remote_mapping.h
>  create mode 100644 include/uapi/linux/remote_mapping.h
>  create mode 100644 mm/remote_mapping.c
> 
> 
> CC:Christian Brauner <christian@brauner.io>
> base-commit: ae83d0b416db002fe95601e7f97f64b59514d936


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/5] Remote mapping
  2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (5 preceding siblings ...)
  2020-09-03 18:08 ` [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
@ 2020-09-04  9:54 ` Christian Brauner
  2020-09-04 11:34   ` Adalbert Lazăr
  6 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2020-09-04  9:54 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Sargun Dhillon, Aleksa Sarai, Oleg Nesterov, Jann Horn,
	Kees Cook, Matthew Wilcox, linux-api

On Thu, Sep 03, 2020 at 08:47:25PM +0300, Adalbert Lazăr wrote:
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.

Hey Adalbert,

Thanks for the patch. When you resend this patch series, could you
please make sure that everyone Cced on any individual patch receives the
full patch series? I only got patch 5/5 and it's a bit annoying because
one completely lacks context of what's going on. I first thought "did
someone just add a syscall with 3 lines of commit message?". :)

Could you please resend the patch series with linux-api, me and the
following people Cced:

Andy Lutomirski <luto@kernel.org>
Arnd Bergmann <arnd@arndb.de>
Sargun Dhillon <sargun@sargun.me>
Aleksa Sarai <cyphar@cyphar.com>
Oleg Nesterov <oleg@redhat.com>
Jann Horn <jannh@google.com>
Kees Cook <keescook@chromium.org>
Matthew Wilcox <willy@infradead.org>
linux-api@vger.kernel.org

Christian


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/5] Remote mapping
  2020-09-04  9:54 ` Christian Brauner
@ 2020-09-04 11:34   ` Adalbert Lazăr
  0 siblings, 0 replies; 9+ messages in thread
From: Adalbert Lazăr @ 2020-09-04 11:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-mm, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Andy Lutomirski, Arnd Bergmann,
	Sargun Dhillon, Aleksa Sarai, Oleg Nesterov, Jann Horn,
	Kees Cook, Matthew Wilcox, linux-api

On Fri, 4 Sep 2020 11:54:38 +0200, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Thu, Sep 03, 2020 at 08:47:25PM +0300, Adalbert Lazăr wrote:
> > This patchset adds support for the remote mapping feature.
> > Remote mapping, as its name suggests, is a means for transparent and
> > zero-copy access of a remote process' address space.
> 
> Hey Adalbert,
> 
> Thanks for the patch. When you resend this patch series, could you
> please make sure that everyone Cced on any individual patch receives the
> full patch series? I only got patch 5/5 and it's a bit annoying because
> one completely lacks context of what's going on. I first thought "did
> someone just add a syscall with 3 lines of commit message?". :)
> 
> Could you please resend the patch series with linux-api, me and the
> following people Cced:

Done :D
Thank you, Christian


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-09-04 11:34 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 17:47 [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
2020-09-03 17:47 ` [RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
2020-09-03 17:47 ` [RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages Adalbert Lazăr
2020-09-03 17:47 ` [RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Adalbert Lazăr
2020-09-03 17:47 ` [RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
2020-09-03 17:47 ` [RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
2020-09-03 18:08 ` [RFC PATCH 0/5] Remote mapping Adalbert Lazăr
2020-09-04  9:54 ` Christian Brauner
2020-09-04 11:34   ` Adalbert Lazăr

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).