linux-mm.kvack.org archive mirror
* [RESEND RFC PATCH 0/5] Remote mapping
@ 2020-09-04 11:31 Adalbert Lazăr
  2020-09-04 11:31 ` [RESEND RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Adalbert Lazăr @ 2020-09-04 11:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Mihai Donțu, Mircea Cirjaliu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner, Adalbert Lazăr

This patchset adds support for the remote mapping feature.
Remote mapping, as its name suggests, is a means of transparent and
zero-copy access to a remote process' address space.

The feature was designed according to a specification suggested by
Paolo Bonzini:
>> The proposed API is a new pidfd system call, through which the parent
>> can map portions of its virtual address space into a file descriptor
>> and then pass that file descriptor to a child.
>>
>> This should be:
>>
>> - upstreamable, pidfd is the new cool thing and we could sell it as a
>> better way to do PTRACE_{PEEK,POKE}DATA
>>
>> - relatively easy to do based on the BitDefender remote process
>> mapping patches.
>>
>> - pidfd_mem() takes a pidfd and some flags (which are 0) and returns
>> two file descriptors for respectively the control plane and the memory access.
>>
>> - the control plane accepts three ioctls
>>
>> PIDFD_MEM_MAP takes a struct like
>>
>>     struct pidfd_mem_map {
>>          uint64_t address;
>>          off_t offset;
>>          off_t size;
>>          int flags;
>>          int padding[7];
>>     }
>>
>> After this is done, the memory access fd can be mmap-ed at range
>> [offset, offset+size), and it will read memory from range
>> [address, address+size) of the target descriptor.
>>
>> PIDFD_MEM_UNMAP takes a struct like
>>
>>     struct pidfd_mem_unmap {
>>          off_t offset;
>>          off_t size;
>>     }
>>
>> and unmaps the corresponding range of course.
>>
>> Finally PIDFD_MEM_LOCK forbids subsequent PIDFD_MEM_MAP or
>> PIDFD_MEM_UNMAP.  For now I think it should just check that the
>> argument is zero, bells and whistles can be added later.
>>
>> - the memory access fd can be mmap-ed as in the bitdefender patches
>> but also accessed with read/write/pread/pwrite/...  As in the
>> BitDefender patches, MMU notifiers can be used to adjust any mmap-ed
>> regions when the source address space changes.  In this case,
>> PIDFD_MEM_UNMAP could also cause a pre-existing mmap to "disappear".
(it currently doesn't support read/write/pread/pwrite/...)
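
To make the intended flow concrete, here is a minimal user-space sketch.
It assumes a pipe()-style signature for pidfd_mem() (the two fds returned
through an array, as task_remote_map() in patch 4 suggests); the actual
syscall number and signature are defined in patch 5 and may differ:

	/*
	 * Sketch only: error handling omitted; __NR_pidfd_mem and the
	 * exact calling convention come from patch 5.
	 */
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/remote_mapping.h>

	static void *mirror_remote_range(int pidfd, uint64_t remote_va,
					 size_t size)
	{
		int fds[2];	/* fds[0] = control fd, fds[1] = access fd */
		struct pidfd_mem_map map = {
			.address = remote_va,	/* VA in the target process */
			.offset  = 0,		/* position in the access fd */
			.size    = size,
		};

		syscall(__NR_pidfd_mem, pidfd, 0, fds);
		ioctl(fds[0], PIDFD_MEM_MAP, &map);
		ioctl(fds[0], PIDFD_MEM_LOCK, 0);	/* freeze the layout */

		/* fds[1] would normally be passed to the client process */
		return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			    fds[1], 0);
	}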

The main remote mapping patch also contains the legacy implementation which
creates a region the size of the whole process address space by means of the
REMOTE_PROC_MAP ioctl. The user is then free to mmap() any region of the
address space it wishes.
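
A sketch of that legacy flow (assuming the misc device shows up as
/dev/mirror-proc and the target maps anonymous memory at remote_va;
error handling omitted):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/remote_mapping.h>

	static void *mirror_via_device(int target_pid, off_t remote_va,
				       size_t size)
	{
		int fd = open("/dev/mirror-proc", O_RDWR);

		/*
		 * Bind the file to the target process; the identity view
		 * spans the whole address space, so file offset == remote VA.
		 */
		ioctl(fd, REMOTE_PROC_MAP, target_pid);

		return mmap(NULL, size, PROT_READ, MAP_SHARED, fd, remote_va);
	}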

VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
process address space within the specified range. Pages are installed in the
current process page tables at fault time and removed by the mmu_interval_notifier
invalidate callback. No further memory management is involved.
On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
or if the remote process address space was reaped by the OOM killer, the remote
mapping fault handler returns VM_FAULT_SIGBUS.

At Bitdefender we are using remote mapping for virtual machine introspection:
- the QEMU running the introspected machine creates the pair of file descriptors,
  passes the access fd to the introspector QEMU, and uses the control fd to allow
  access to the memslots it creates for its machine
- the QEMU running the introspector machine receives the access fd, mmap()s the
  regions made available, then hotplugs the obtained memory into its machine
This setup results in nested invalidate_range_start/end MMU notifier calls.

Patch organization:
- patch 1 allows unmap_page_range() to run without rescheduling
  Needed by remote mapping to zap the current process' page tables when the
  OOM killer calls mmu_notifier_invalidate_range_start_nonblock(&range)

- patch 2 creates VMA-specific zapping behavior
  A remote mapping VMA does not own the pages it maps, so all it has to do is
  clear the PTEs.

- patch 3 removes the MMU notifier lockdep map
  It was just incompatible with our use case.

- patch 4 is the remote mapping implementation

- patch 5 adds suggested pidfd_mem system call

Mircea Cirjaliu (5):
  mm: add atomic capability to zap_details
  mm: let the VMA decide how zap_pte_range() acts on mapped pages
  mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in
    nested scenarios
  mm/remote_mapping: use a pidfd to access memory belonging to unrelated
    process
  pidfd_mem: implemented remote memory mapping system call

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 include/linux/mm.h                     |   22 +
 include/linux/mmu_notifier.h           |    5 +-
 include/linux/pid.h                    |    1 +
 include/linux/remote_mapping.h         |   22 +
 include/linux/syscalls.h               |    1 +
 include/uapi/asm-generic/unistd.h      |    2 +
 include/uapi/linux/remote_mapping.h    |   36 +
 kernel/exit.c                          |    2 +-
 kernel/pid.c                           |   55 +
 mm/Kconfig                             |   11 +
 mm/Makefile                            |    1 +
 mm/memory.c                            |  193 ++--
 mm/mmu_notifier.c                      |   19 -
 mm/remote_mapping.c                    | 1273 ++++++++++++++++++++++++
 16 files changed, 1535 insertions(+), 110 deletions(-)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c


base-commit: ae83d0b416db002fe95601e7f97f64b59514d936



* [RESEND RFC PATCH 1/5] mm: add atomic capability to zap_details
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
@ 2020-09-04 11:31 ` Adalbert Lazăr
  2020-09-04 11:31 ` [RESEND RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages Adalbert Lazăr
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Adalbert Lazăr @ 2020-09-04 11:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Mihai Donțu, Mircea Cirjaliu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

Allow the zap_*_range() functions to loop without rescheduling.
Useful for unmapping memory in an atomic context, although no
checks for atomic context are made.
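
For context, patch 4 drives this from an MMU interval notifier that may
run in non-blockable context; abridged from zap_remote_range() there:

	struct zap_details details = {
		.atomic = atomic,	/* true when the notifier cannot block */
	};

	if (atomic)
		mmu_notifier_invalidate_range_start_nonblock(&range);
	else
		mmu_notifier_invalidate_range_start(&range);

	unmap_page_range(&tlb, vma, start, end, &details);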

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/mm.h |  6 ++++++
 mm/memory.c        | 11 +++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..1be4482a7b81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1601,8 +1601,14 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool atomic;				/* Do not sleep. */
 };
 
+static inline bool zap_is_atomic(struct zap_details *details)
+{
+	return (unlikely(details) && details->atomic);
+}
+
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..8e78fb151f8f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1056,7 +1056,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (pte_none(ptent))
 			continue;
 
-		if (need_resched())
+		if (!zap_is_atomic(details) && need_resched())
 			break;
 
 		if (pte_present(ptent)) {
@@ -1159,7 +1159,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	}
 
 	if (addr != end) {
-		cond_resched();
+		if (!zap_is_atomic(details))
+			cond_resched();
 		goto again;
 	}
 
@@ -1195,7 +1196,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
 next:
-		cond_resched();
+		if (!zap_is_atomic(details))
+			cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
 	return addr;
@@ -1224,7 +1226,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 			continue;
 		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
 next:
-		cond_resched();
+		if (!zap_is_atomic(details))
+			cond_resched();
 	} while (pud++, addr = next, addr != end);
 
 	return addr;



* [RESEND RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
  2020-09-04 11:31 ` [RESEND RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
@ 2020-09-04 11:31 ` Adalbert Lazăr
  2020-09-04 11:31 ` [RESEND RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Adalbert Lazăr
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Adalbert Lazăr @ 2020-09-04 11:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Mihai Donțu, Mircea Cirjaliu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

Instead of having one big function handle all cases of page unmapping,
allow implementation-defined callbacks, one per VMA type. This way,
future exotic VMA implementations won't have to bloat the common
zapping function with yet another special case of mappings.
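
As an example of such a callback, patch 4 registers a minimal zap_pte
for its mirror VMAs: they do not own the pages they map, so the callback
only clears the PTE (condensed from mirror_zap_pte() there):

	static int mirror_zap_pte(struct vm_area_struct *vma, unsigned long addr,
				  pte_t *pte, int rss[], struct mmu_gather *tlb,
				  struct zap_details *details)
	{
		pte_t ptent = *pte;
		struct page *page = vm_normal_page(vma, addr, ptent);
		int flags = 0;

		ptent = ptep_clear_flush_notify(vma, addr, pte);
		if (pte_dirty(ptent)) {
			flags |= ZAP_PTE_FLUSH;
			set_page_dirty(page);
		}

		return flags;
	}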

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/mm.h |  16 ++++
 mm/memory.c        | 182 +++++++++++++++++++++++++--------------------
 2 files changed, 116 insertions(+), 82 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1be4482a7b81..39e55467aa49 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -36,6 +36,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct bdi_writeback;
+struct zap_details;
 
 void init_mm_internals(void);
 
@@ -601,6 +602,14 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+
+	/*
+	 * Called by zap_pte_range() for use by special VMAs that implement
+	 * custom zapping behavior.
+	 */
+	int (*zap_pte)(struct vm_area_struct *vma, unsigned long addr,
+		       pte_t *pte, int rss[], struct mmu_gather *tlb,
+		       struct zap_details *details);
 };
 
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
@@ -1594,6 +1603,13 @@ static inline bool can_do_mlock(void) { return false; }
 extern int user_shm_lock(size_t, struct user_struct *);
 extern void user_shm_unlock(size_t, struct user_struct *);
 
+/*
+ * Flags returned by zap_pte implementations
+ */
+#define ZAP_PTE_CONTINUE	0
+#define ZAP_PTE_FLUSH		(1 << 0)	/* Ask for TLB flush. */
+#define ZAP_PTE_BREAK		(1 << 1)	/* Break PTE iteration. */
+
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
diff --git a/mm/memory.c b/mm/memory.c
index 8e78fb151f8f..a225bfd01417 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1031,18 +1031,109 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return ret;
 }
 
+static int zap_pte_common(struct vm_area_struct *vma, unsigned long addr,
+			  pte_t *pte, int rss[], struct mmu_gather *tlb,
+			  struct zap_details *details)
+{
+	struct mm_struct *mm = tlb->mm;
+	pte_t ptent = *pte;
+	swp_entry_t entry;
+	int flags = 0;
+
+	if (pte_present(ptent)) {
+		struct page *page;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (unlikely(details) && page) {
+			/*
+			 * unmap_shared_mapping_pages() wants to
+			 * invalidate cache without truncating:
+			 * unmap shared but keep private pages.
+			 */
+			if (details->check_mapping &&
+			    details->check_mapping != page_rmapping(page))
+				return 0;
+		}
+		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+		if (unlikely(!page))
+			return 0;
+
+		if (!PageAnon(page)) {
+			if (pte_dirty(ptent)) {
+				flags |= ZAP_PTE_FLUSH;
+				set_page_dirty(page);
+			}
+			if (pte_young(ptent) &&
+			    likely(!(vma->vm_flags & VM_SEQ_READ)))
+				mark_page_accessed(page);
+		}
+		rss[mm_counter(page)]--;
+		page_remove_rmap(page, false);
+		if (unlikely(page_mapcount(page) < 0))
+			print_bad_pte(vma, addr, ptent, page);
+		if (unlikely(__tlb_remove_page(tlb, page)))
+			flags |= ZAP_PTE_FLUSH | ZAP_PTE_BREAK;
+		return flags;
+	}
+
+	entry = pte_to_swp_entry(ptent);
+	if (non_swap_entry(entry) && is_device_private_entry(entry)) {
+		struct page *page = device_private_entry_to_page(entry);
+
+		if (unlikely(details && details->check_mapping)) {
+			/*
+			 * unmap_shared_mapping_pages() wants to
+			 * invalidate cache without truncating:
+			 * unmap shared but keep private pages.
+			 */
+			if (details->check_mapping != page_rmapping(page))
+				return 0;
+		}
+
+		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		rss[mm_counter(page)]--;
+		page_remove_rmap(page, false);
+		put_page(page);
+		return 0;
+	}
+
+	/* If details->check_mapping, we leave swap entries. */
+	if (unlikely(details))
+		return 0;
+
+	if (!non_swap_entry(entry))
+		rss[MM_SWAPENTS]--;
+	else if (is_migration_entry(entry)) {
+		struct page *page;
+
+		page = migration_entry_to_page(entry);
+		rss[mm_counter(page)]--;
+	}
+	if (unlikely(!free_swap_and_cache(entry)))
+		print_bad_pte(vma, addr, ptent, NULL);
+	pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+
+	return flags;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
-	int force_flush = 0;
+	int flags = 0;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
 	pte_t *start_pte;
 	pte_t *pte;
-	swp_entry_t entry;
+
+	int (*zap_pte)(struct vm_area_struct *vma, unsigned long addr,
+		       pte_t *pte, int rss[], struct mmu_gather *tlb,
+		       struct zap_details *details) = zap_pte_common;
+	if (vma->vm_ops && vma->vm_ops->zap_pte)
+		zap_pte = vma->vm_ops->zap_pte;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1058,92 +1149,19 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 		if (!zap_is_atomic(details) && need_resched())
 			break;
-
-		if (pte_present(ptent)) {
-			struct page *page;
-
-			page = vm_normal_page(vma, addr, ptent);
-			if (unlikely(details) && page) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping &&
-				    details->check_mapping != page_rmapping(page))
-					continue;
-			}
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-			if (unlikely(!page))
-				continue;
-
-			if (!PageAnon(page)) {
-				if (pte_dirty(ptent)) {
-					force_flush = 1;
-					set_page_dirty(page);
-				}
-				if (pte_young(ptent) &&
-				    likely(!(vma->vm_flags & VM_SEQ_READ)))
-					mark_page_accessed(page);
-			}
-			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
-				force_flush = 1;
-				addr += PAGE_SIZE;
-				break;
-			}
-			continue;
-		}
-
-		entry = pte_to_swp_entry(ptent);
-		if (non_swap_entry(entry) && is_device_private_entry(entry)) {
-			struct page *page = device_private_entry_to_page(entry);
-
-			if (unlikely(details && details->check_mapping)) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping !=
-				    page_rmapping(page))
-					continue;
-			}
-
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
-			put_page(page);
-			continue;
+		if (flags & ZAP_PTE_BREAK) {
+			flags &= ~ZAP_PTE_BREAK;
+			break;
 		}
 
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
-			continue;
-
-		if (!non_swap_entry(entry))
-			rss[MM_SWAPENTS]--;
-		else if (is_migration_entry(entry)) {
-			struct page *page;
-
-			page = migration_entry_to_page(entry);
-			rss[mm_counter(page)]--;
-		}
-		if (unlikely(!free_swap_and_cache(entry)))
-			print_bad_pte(vma, addr, ptent, NULL);
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		flags |= zap_pte(vma, addr, pte, rss, tlb, details);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (flags & ZAP_PTE_FLUSH)
 		tlb_flush_mmu_tlbonly(tlb);
 	pte_unmap_unlock(start_pte, ptl);
 
@@ -1153,8 +1171,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	 * entries before releasing the ptl), free the batched
 	 * memory too. Restart if we didn't do everything.
 	 */
-	if (force_flush) {
-		force_flush = 0;
+	if (flags & ZAP_PTE_FLUSH) {
+		flags &= ~ZAP_PTE_FLUSH;
 		tlb_flush_mmu(tlb);
 	}
 



* [RESEND RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
  2020-09-04 11:31 ` [RESEND RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
  2020-09-04 11:31 ` [RESEND RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages Adalbert Lazăr
@ 2020-09-04 11:31 ` Adalbert Lazăr
  2020-09-04 12:03   ` Jason Gunthorpe
  2020-09-04 11:31 ` [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Adalbert Lazăr @ 2020-09-04 11:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Mihai Donțu, Mircea Cirjaliu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

The combination of remote mapping + KVM causes nested range invalidations,
which trigger lockdep warnings.
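
For reference, the nesting arises as follows (call chain from patch 4):

	source mm:  mmu_notifier_invalidate_range_start()           <- outer
	              mmin_invalidate()      [remote mapping notifier]
	                mirror_clear_view()
	                  zap_remote_range()
	destination mm:     mmu_notifier_invalidate_range_start()   <- nested

Both calls acquire the same global __mmu_notifier_invalidate_range_start_map,
so lockdep reports recursion even though two different mms are involved.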

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/mmu_notifier.h |  5 +----
 mm/mmu_notifier.c            | 19 -------------------
 2 files changed, 1 insertion(+), 23 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 736f6918335e..81ea457d41be 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -440,12 +440,10 @@ mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
 	might_sleep();
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (mm_has_notifiers(range->mm)) {
 		range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
 		__mmu_notifier_invalidate_range_start(range);
 	}
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
 static inline int
@@ -453,12 +451,11 @@ mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range)
 {
 	int ret = 0;
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (mm_has_notifiers(range->mm)) {
 		range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
 		ret = __mmu_notifier_invalidate_range_start(range);
 	}
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+
 	return ret;
 }
 
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..928751bd8630 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -22,12 +22,6 @@
 /* global SRCU for all MMs */
 DEFINE_STATIC_SRCU(srcu);
 
-#ifdef CONFIG_LOCKDEP
-struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
-	.name = "mmu_notifier_invalidate_range_start"
-};
-#endif
-
 /*
  * The mmu_notifier_subscriptions structure is allocated and installed in
  * mm->notifier_subscriptions inside the mm_take_all_locks() protected
@@ -242,8 +236,6 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 	 * will always clear the below sleep in some reasonable time as
 	 * subscriptions->invalidate_seq is even in the idle state.
 	 */
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 	if (is_invalidating)
 		wait_event(subscriptions->wq,
 			   READ_ONCE(subscriptions->invalidate_seq) != seq);
@@ -572,13 +564,11 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 	struct mmu_notifier_subscriptions *subscriptions =
 		range->mm->notifier_subscriptions;
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (subscriptions->has_itree)
 		mn_itree_inv_end(subscriptions);
 
 	if (!hlist_empty(&subscriptions->list))
 		mn_hlist_invalidate_end(subscriptions, range, only_end);
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
 void __mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -612,13 +602,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
@@ -1062,8 +1045,6 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *interval_sub)
 	 * The possible sleep on progress in the invalidation requires the
 	 * caller not hold any locks held by invalidation callbacks.
 	 */
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 	if (seq)
 		wait_event(subscriptions->wq,
 			   READ_ONCE(subscriptions->invalidate_seq) != seq);



* [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (2 preceding siblings ...)
  2020-09-04 11:31 ` [RESEND RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Adalbert Lazăr
@ 2020-09-04 11:31 ` Adalbert Lazăr
  2020-09-04 17:55   ` Oleg Nesterov
                     ` (2 more replies)
  2020-09-04 11:31 ` [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 44+ messages in thread
From: Adalbert Lazăr @ 2020-09-04 11:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Mihai Donțu, Mircea Cirjaliu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

Remote mapping creates a mirror VMA that exposes memory owned by another
process in a zero-copy manner. The pages are mapped into the current process'
address space with no memory management involved and little impact on the
remote process' operation. The feature is currently incompatible with THP.
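
In outline, the fault handler below resolves a mirror fault like this
(condensed from mirror_vm_fault()/remote_map_page() in this patch):

	seq = mmu_interval_read_begin(&view->mmin);
	/* drop the current mmap_sem, take the source mmap_sem */
	page = get_user_pages_remote(...);	/* FOLL_ANON | FOLL_MIGRATION */
	/* drop the source mmap_sem, retake the current one */
	if (mmu_interval_read_retry(&view->mmin, seq))
		goto retry;
	/* under view->user_lock: install the PTE */
	set_pte_at_notify(mm, address, pte, mk_pte(page, vma->vm_page_prot));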

Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 include/linux/remote_mapping.h      |   22 +
 include/uapi/linux/remote_mapping.h |   36 +
 mm/Kconfig                          |   11 +
 mm/Makefile                         |    1 +
 mm/remote_mapping.c                 | 1273 +++++++++++++++++++++++++++
 5 files changed, 1343 insertions(+)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c

diff --git a/include/linux/remote_mapping.h b/include/linux/remote_mapping.h
new file mode 100644
index 000000000000..5c1d43e8f669
--- /dev/null
+++ b/include/linux/remote_mapping.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_REMOTE_MAPPING_H
+#define _LINUX_REMOTE_MAPPING_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_REMOTE_MAPPING
+
+extern int task_remote_map(struct task_struct *task, int fds[]);
+
+#else /* CONFIG_REMOTE_MAPPING */
+
+static inline int task_remote_map(struct task_struct *task, int fds[])
+{
+	return -EINVAL;
+}
+
+#endif /* CONFIG_REMOTE_MAPPING */
+
+
+#endif /* _LINUX_REMOTE_MAPPING_H */
diff --git a/include/uapi/linux/remote_mapping.h b/include/uapi/linux/remote_mapping.h
new file mode 100644
index 000000000000..5d2828a6aa47
--- /dev/null
+++ b/include/uapi/linux/remote_mapping.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef __UAPI_REMOTE_MAPPING_H__
+#define __UAPI_REMOTE_MAPPING_H__
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+// device file interface
+#define REMOTE_PROC_MAP	_IOW('r', 0x01, int)
+
+
+// pidfd interface
+#define PIDFD_IO_MAGIC	'p'
+
+struct pidfd_mem_map {
+	uint64_t address;
+	off_t offset;
+	off_t size;
+	int flags;
+	int padding[7];
+};
+
+struct pidfd_mem_unmap {
+	off_t offset;
+	off_t size;
+};
+
+#define PIDFD_MEM_MAP	_IOW(PIDFD_IO_MAGIC, 0x01, struct pidfd_mem_map)
+#define PIDFD_MEM_UNMAP _IOW(PIDFD_IO_MAGIC, 0x02, struct pidfd_mem_unmap)
+#define PIDFD_MEM_LOCK	_IOW(PIDFD_IO_MAGIC, 0x03, int)
+
+#define PIDFD_MEM_REMAP _IOW(PIDFD_IO_MAGIC, 0x04, unsigned long)
+// TODO: actually this is not for pidfd, find better names
+
+#endif /* __UAPI_REMOTE_MAPPING_H__ */
diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c35..0ecc3f41a98e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -804,6 +804,17 @@ config HMM_MIRROR
 	bool
 	depends on MMU
 
+config REMOTE_MAPPING
+	bool "Remote memory mapping"
+	depends on X86_64 && MMU && !TRANSPARENT_HUGEPAGE
+	select MMU_NOTIFIER
+	default n
+	help
+	  Allows a client process to gain access to an unrelated process'
+	  address space on a per-range basis. The feature maps the pages
+	  found at the equivalent remote address into the current process'
+	  page tables in a lightweight manner.
+
 config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ZONE_DEVICE
diff --git a/mm/Makefile b/mm/Makefile
index fccd3756b25f..ce1a00e7bc8c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -112,3 +112,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_REMOTE_MAPPING) += remote_mapping.o
diff --git a/mm/remote_mapping.c b/mm/remote_mapping.c
new file mode 100644
index 000000000000..1dc53992424b
--- /dev/null
+++ b/mm/remote_mapping.c
@@ -0,0 +1,1273 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Remote memory mapping.
+ *
+ * Copyright (C) 2017-2020 Bitdefender S.R.L.
+ *
+ * Author:
+ *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
+ */
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/pid.h>
+#include <linux/file.h>
+#include <linux/mmu_notifier.h>
+#include <linux/sched/task.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/refcount.h>
+#include <linux/miscdevice.h>
+#include <uapi/linux/remote_mapping.h>
+#include <linux/pfn_t.h>
+#include <linux/errno.h>
+#include <linux/limits.h>
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
+#include <asm/tlb.h>
+#include "internal.h"
+
+struct remote_file_context {
+	refcount_t refcount;
+
+	struct srcu_struct fault_srcu;
+	struct mm_struct *mm;
+
+	bool locked;
+	struct rb_root_cached rb_view;		/* view offset tree */
+	struct mutex views_lock;
+
+};
+
+struct remote_view {
+	refcount_t refcount;
+
+	unsigned long address;
+	unsigned long size;
+	unsigned long offset;
+	bool valid;
+
+	struct rb_node target_rb;		/* link for views tree */
+	unsigned long rb_subtree_last;		/* in remote_file_context */
+
+	struct mmu_interval_notifier mmin;
+	spinlock_t user_lock;
+
+	/*
+	 * interval tree for mapped ranges (indexed by source process HVA)
+	 * because of GPA->HVA aliasing, multiple ranges may overlap
+	 */
+	struct rb_root_cached rb_rmap;		/* rmap tree */
+	struct rw_semaphore rmap_lock;
+};
+
+struct remote_vma_context {
+	struct vm_area_struct *vma;		/* link back to VMA */
+	struct remote_view *view;		/* corresponding view */
+
+	struct rb_node rmap_rb;			/* link for rmap tree */
+	unsigned long rb_subtree_last;
+};
+
+/* view offset tree */
+static inline unsigned long view_start(struct remote_view *view)
+{
+	return view->offset + 1;
+}
+
+static inline unsigned long view_last(struct remote_view *view)
+{
+	return view->offset + view->size - 1;
+}
+
+INTERVAL_TREE_DEFINE(struct remote_view, target_rb,
+	unsigned long, rb_subtree_last, view_start, view_last,
+	static inline, view_interval_tree)
+
+#define view_tree_foreach(view, root, start, last)			\
+	for (view = view_interval_tree_iter_first(root, start, last);	\
+	     view; view = view_interval_tree_iter_next(view, start, last))
+
+/* rmap interval tree */
+static inline unsigned long ctx_start(struct remote_vma_context *ctx)
+{
+	struct vm_area_struct *vma = ctx->vma;
+	struct remote_view *view = ctx->view;
+	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	return offset - view->offset + view->address;
+}
+
+static inline unsigned long ctx_last(struct remote_vma_context *ctx)
+{
+	struct vm_area_struct *vma = ctx->vma;
+	struct remote_view *view = ctx->view;
+	unsigned long offset;
+
+	offset = (vma->vm_pgoff << PAGE_SHIFT) + (vma->vm_end - vma->vm_start);
+
+	return offset - view->offset + view->address;
+}
+
+static inline unsigned long ctx_rmap_start(struct remote_vma_context *ctx)
+{
+	return ctx_start(ctx) + 1;
+}
+
+static inline unsigned long ctx_rmap_last(struct remote_vma_context *ctx)
+{
+	return ctx_last(ctx) - 1;
+}
+
+INTERVAL_TREE_DEFINE(struct remote_vma_context, rmap_rb,
+	unsigned long, rb_subtree_last, ctx_rmap_start, ctx_rmap_last,
+	static inline, rmap_interval_tree)
+
+#define rmap_foreach(ctx, root, start, last)				\
+	for (ctx = rmap_interval_tree_iter_first(root, start, last);	\
+	     ctx; ctx = rmap_interval_tree_iter_next(ctx, start, last))
+
+static int mirror_zap_pte(struct vm_area_struct *vma, unsigned long addr,
+			  pte_t *pte, int rss[], struct mmu_gather *tlb,
+			  struct zap_details *details)
+{
+	pte_t ptent = *pte;
+	struct page *page;
+	int flags = 0;
+
+	page = vm_normal_page(vma, addr, ptent);
+	//ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+	ptent = ptep_clear_flush_notify(vma, addr, pte);
+	//tlb_remove_tlb_entry(tlb, pte, addr);
+
+	if (pte_dirty(ptent)) {
+		flags |= ZAP_PTE_FLUSH;
+		set_page_dirty(page);
+	}
+
+	return flags;
+}
+
+static void
+zap_remote_range(struct vm_area_struct *vma,
+		 unsigned long start, unsigned long end,
+		 bool atomic)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+	struct zap_details details = {
+		.atomic = atomic,
+	};
+
+	pr_debug("%s: vma %lx-%lx, zap range %lx-%lx\n",
+		__func__, vma->vm_start, vma->vm_end, start, end);
+
+	tlb_gather_mmu(&tlb, vma->vm_mm, start, end);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
+				vma, vma->vm_mm, start, end);
+	if (atomic)
+		mmu_notifier_invalidate_range_start_nonblock(&range);
+	else
+		mmu_notifier_invalidate_range_start(&range);
+
+	unmap_page_range(&tlb, vma, start, end, &details);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb, start, end);
+}
+
+static bool
+mirror_clear_view(struct remote_view *view,
+		  unsigned long start, unsigned long last, bool atomic)
+{
+	struct remote_vma_context *ctx;
+	unsigned long src_start, src_last;
+	unsigned long vma_start, vma_last;
+
+	pr_debug("%s: view %p [%lx-%lx), range [%lx-%lx)", __func__, view,
+		 view->offset, view->offset + view->size, start, last);
+
+	if (likely(!atomic))
+		down_read(&view->rmap_lock);
+	else if (!down_read_trylock(&view->rmap_lock))
+		return false;
+
+	rmap_foreach(ctx, &view->rb_rmap, start, last) {
+		struct vm_area_struct *vma = ctx->vma;
+
+		// intersect intervals (source process address range)
+		src_start = max(start, ctx_start(ctx));
+		src_last = min(last, ctx_last(ctx));
+
+		// translate to destination process address range
+		vma_start = vma->vm_start + (src_start - ctx_start(ctx));
+		vma_last = vma->vm_end - (ctx_last(ctx) - src_last);
+
+		zap_remote_range(vma, vma_start, vma_last, atomic);
+	}
+
+	up_read(&view->rmap_lock);
+
+	return true;
+}
+
+static bool mmin_invalidate(struct mmu_interval_notifier *interval_sub,
+			    const struct mmu_notifier_range *range,
+			    unsigned long cur_seq)
+{
+	struct remote_view *view =
+		container_of(interval_sub, struct remote_view, mmin);
+
+	pr_debug("%s: reason %d, range [%lx-%lx)\n", __func__,
+		 range->event, range->start, range->end);
+
+	spin_lock(&view->user_lock);
+	mmu_interval_set_seq(interval_sub, cur_seq);
+	spin_unlock(&view->user_lock);
+
+	/* mark view as invalid before zapping the page tables */
+	if (range->event == MMU_NOTIFY_RELEASE)
+		WRITE_ONCE(view->valid, false);
+
+	return mirror_clear_view(view, range->start, range->end,
+				 !mmu_notifier_range_blockable(range));
+}
+
+static const struct mmu_interval_notifier_ops mmin_ops = {
+	.invalidate = mmin_invalidate,
+};
+
+static void view_init(struct remote_view *view)
+{
+	refcount_set(&view->refcount, 1);
+	view->valid = true;
+	RB_CLEAR_NODE(&view->target_rb);
+	view->rb_rmap = RB_ROOT_CACHED;
+	init_rwsem(&view->rmap_lock);
+	spin_lock_init(&view->user_lock);
+}
+
+/* return working view or reason why it failed */
+static struct remote_view *
+view_alloc(struct mm_struct *mm, unsigned long address, unsigned long size, unsigned long offset)
+{
+	struct remote_view *view;
+	int result;
+
+	view = kzalloc(sizeof(*view), GFP_KERNEL);
+	if (!view)
+		return ERR_PTR(-ENOMEM);
+
+	view_init(view);
+
+	view->address = address;
+	view->size = size;
+	view->offset = offset;
+
+	pr_debug("%s: view %p [%lx-%lx)", __func__, view,
+		 view->offset, view->offset + view->size);
+
+	result = mmu_interval_notifier_insert(&view->mmin, mm, address, size, &mmin_ops);
+	if (result) {
+		kfree(view);
+		return ERR_PTR(result);
+	}
+
+	return view;
+}
+
+static void
+view_insert(struct remote_file_context *fctx, struct remote_view *view)
+{
+	view_interval_tree_insert(view, &fctx->rb_view);
+	refcount_inc(&view->refcount);
+}
+
+static struct remote_view *
+view_search_get(struct remote_file_context *fctx,
+	unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	lockdep_assert_held(&fctx->views_lock);
+
+	/*
+	 * loop & return the first view intersecting interval
+	 * further checks will be done down the road
+	 */
+	view_tree_foreach(view, &fctx->rb_view, start, last)
+		break;
+
+	if (view)
+		refcount_inc(&view->refcount);
+
+	return view;
+}
+
+static void
+view_put(struct remote_view *view)
+{
+	if (refcount_dec_and_test(&view->refcount)) {
+		pr_debug("%s: view %p [%lx-%lx) bye bye", __func__, view,
+			 view->offset, view->offset + view->size);
+
+		mmu_interval_notifier_remove(&view->mmin);
+		kfree(view);
+	}
+}
+
+static void
+view_remove(struct remote_file_context *fctx, struct remote_view *view)
+{
+	view_interval_tree_remove(view, &fctx->rb_view);
+	RB_CLEAR_NODE(&view->target_rb);
+	view_put(view);
+}
+
+static bool
+view_overlaps(struct remote_file_context *fctx,
+	unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	view_tree_foreach(view, &fctx->rb_view, start, last)
+		return true;
+
+	return false;
+}
+
+static struct remote_view *
+alloc_identity_view(struct mm_struct *mm)
+{
+	return view_alloc(mm, 0, ULONG_MAX, 0);
+}
+
+static void remote_file_context_init(struct remote_file_context *fctx)
+{
+	refcount_set(&fctx->refcount, 1);
+	init_srcu_struct(&fctx->fault_srcu);
+	fctx->locked = false;
+	fctx->rb_view = RB_ROOT_CACHED;
+	mutex_init(&fctx->views_lock);
+}
+
+static struct remote_file_context *remote_file_context_alloc(void)
+{
+	struct remote_file_context *fctx;
+
+	fctx = kzalloc(sizeof(*fctx), GFP_KERNEL);
+	if (fctx)
+		remote_file_context_init(fctx);
+
+	pr_debug("%s: fctx %p\n", __func__, fctx);
+
+	return fctx;
+}
+
+static void remote_file_context_get(struct remote_file_context *fctx)
+{
+	refcount_inc(&fctx->refcount);
+}
+
+static void remote_file_context_put(struct remote_file_context *fctx)
+{
+	struct remote_view *view, *n;
+
+	if (refcount_dec_and_test(&fctx->refcount)) {
+		pr_debug("%s: fctx %p\n", __func__, fctx);
+
+		rbtree_postorder_for_each_entry_safe(view, n,
+			&fctx->rb_view.rb_root, target_rb)
+			view_put(view);
+
+		if (fctx->mm)
+			mmdrop(fctx->mm);
+
+		kfree(fctx);
+	}
+}
+
+static void remote_vma_context_init(struct remote_vma_context *ctx)
+{
+	RB_CLEAR_NODE(&ctx->rmap_rb);
+}
+
+static struct remote_vma_context *remote_vma_context_alloc(void)
+{
+	struct remote_vma_context *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (ctx)
+		remote_vma_context_init(ctx);
+
+	return ctx;
+}
+
+static void remote_vma_context_free(struct remote_vma_context *ctx)
+{
+	kfree(ctx);
+}
+
+static int mirror_dev_open(struct inode *inode, struct file *file)
+{
+	struct remote_file_context *fctx;
+
+	pr_debug("%s: file %p\n", __func__, file);
+
+	fctx = remote_file_context_alloc();
+	if (!fctx)
+		return -ENOMEM;
+	file->private_data = fctx;
+
+	return 0;
+}
+
+static int do_remote_proc_map(struct file *file, int pid)
+{
+	struct remote_file_context *fctx = file->private_data;
+	struct task_struct *req_task;
+	struct mm_struct *req_mm;
+	struct remote_view *id;
+	int result = 0;
+
+	pr_debug("%s: pid %d\n", __func__, pid);
+
+	req_task = find_get_task_by_vpid(pid);
+	if (!req_task)
+		return -ESRCH;
+
+	req_mm = get_task_mm(req_task);
+	put_task_struct(req_task);
+	if (!req_mm)
+		return -EINVAL;
+
+	/* error on 2nd call or multithreaded race */
+	if (cmpxchg(&fctx->mm, (struct mm_struct *)NULL, req_mm) != NULL) {
+		result = -EALREADY;
+		goto out;
+	} else
+		mmgrab(req_mm);
+
+	id = alloc_identity_view(req_mm);
+	if (IS_ERR(id)) {
+		mmdrop(req_mm);
+		result = PTR_ERR(id);
+		goto out;
+	}
+
+	/* one view only, don't need to take mutex */
+	view_insert(fctx, id);
+	view_put(id);			/* usage reference */
+
+out:
+	mmput(req_mm);
+
+	return result;
+}
+
+static long mirror_dev_ioctl(struct file *file, unsigned int ioctl,
+	unsigned long arg)
+{
+	long result;
+
+	switch (ioctl) {
+	case REMOTE_PROC_MAP: {
+		int pid = (int)arg;
+
+		result = do_remote_proc_map(file, pid);
+		break;
+	}
+
+	default:
+		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+/*
+ * This is called after all reference to the file have been dropped,
+ * including mmap()s, even if the file is close()d first.
+ */
+static int mirror_dev_release(struct inode *inode, struct file *file)
+{
+	struct remote_file_context *fctx = file->private_data;
+
+	pr_debug("%s: file %p\n", __func__, file);
+
+	remote_file_context_put(fctx);
+
+	return 0;
+}
+
+static struct page *mm_remote_get_page(struct mm_struct *req_mm,
+	unsigned long address, unsigned int flags)
+{
+	struct page *req_page = NULL;
+	long nrpages;
+
+	might_sleep();
+
+	flags |= FOLL_ANON | FOLL_MIGRATION;
+
+	/* get host page corresponding to requested address */
+	nrpages = get_user_pages_remote(NULL, req_mm, address, 1,
+		flags, &req_page, NULL, NULL);
+	if (unlikely(nrpages == 0)) {
+		pr_err("no page at %lx\n", address);
+		return ERR_PTR(-ENOENT);
+	}
+	if (IS_ERR_VALUE(nrpages)) {
+		pr_err("get_user_pages_remote() failed: %d\n", (int)nrpages);
+		return ERR_PTR(nrpages);
+	}
+
+	/* limit introspection to anon memory (this also excludes zero-page) */
+	if (!PageAnon(req_page)) {
+		put_page(req_page);
+		pr_err("page at %lx not anon\n", address);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return req_page;
+}
+
+/*
+ * avoid PTE allocation in this function for 2 reasons:
+ * - it runs under user_lock, which is a spinlock and can't sleep
+ *   (user_lock can be a mutex if allocation is needed)
+ * - PTE allocation triggers reclaim, which causes a possible deadlock warning
+ */
+static vm_fault_t remote_map_page(struct vm_fault *vmf, struct page *page)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	pte_t entry;
+
+	if (vmf->prealloc_pte) {
+		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+		if (unlikely(!pmd_none(*vmf->pmd))) {
+			spin_unlock(vmf->ptl);
+			goto map_pte;
+		}
+
+		mm_inc_nr_ptes(vma->vm_mm);
+		pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+		spin_unlock(vmf->ptl);
+		vmf->prealloc_pte = NULL;
+	} else {
+		BUG_ON(pmd_none(*vmf->pmd));
+	}
+
+map_pte:
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+
+	if (!pte_none(*vmf->pte))
+		goto out_unlock;
+
+	entry = mk_pte(page, vma->vm_page_prot);
+	set_pte_at_notify(vma->vm_mm, vmf->address, vmf->pte, entry);
+
+out_unlock:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return VM_FAULT_NOPAGE;
+}
+
+static vm_fault_t mirror_vm_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct remote_vma_context *ctx = vma->vm_private_data;
+	struct remote_view *view = ctx->view;
+	struct file *file = vma->vm_file;
+	struct remote_file_context *fctx = file->private_data;
+	unsigned long req_addr;
+	unsigned int gup_flags;
+	struct page *req_page;
+	vm_fault_t result = VM_FAULT_SIGBUS;
+	struct mm_struct *src_mm = fctx->mm;
+	unsigned long seq;
+	int idx;
+
+fault_retry:
+	seq = mmu_interval_read_begin(&view->mmin);
+
+	idx = srcu_read_lock(&fctx->fault_srcu);
+
+	/* check if view was invalidated */
+	if (unlikely(!READ_ONCE(view->valid))) {
+		pr_debug("%s: region [%lx-%lx) was invalidated!!\n", __func__,
+			view->offset, view->offset + view->size);
+		goto out_invalid;		/* VM_FAULT_SIGBUS */
+	}
+
+	/* drop current mm semaphore */
+	up_read(&current->mm->mmap_sem);
+
+	/* take remote mm semaphore */
+	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+		if (!down_read_trylock(&src_mm->mmap_sem)) {
+			pr_debug("%s: couldn't take source semaphore!!\n", __func__);
+			goto out_retry;
+		}
+	} else
+		down_read(&src_mm->mmap_sem);
+
+	/* set GUP flags depending on the VMA */
+	gup_flags = 0;
+	if (vma->vm_flags & VM_WRITE)
+		gup_flags |= FOLL_WRITE | FOLL_FORCE;
+
+	/* translate file offset to source process HVA */
+	req_addr = (vmf->pgoff << PAGE_SHIFT) - view->offset + view->address;
+	req_page = mm_remote_get_page(src_mm, req_addr, gup_flags);
+
+	/* check for validity of the page */
+	if (IS_ERR_OR_NULL(req_page)) {
+		up_read(&src_mm->mmap_sem);
+
+		if (PTR_ERR(req_page) == -ERESTARTSYS ||
+		    PTR_ERR(req_page) == -EBUSY) {
+			goto out_retry;
+		} else
+			goto out_err;	/* VM_FAULT_SIGBUS */
+	}
+
+	up_read(&src_mm->mmap_sem);
+
+	/* retake current mm semaphore */
+	down_read(&current->mm->mmap_sem);
+
+	/* expedite retry */
+	if (mmu_interval_check_retry(&view->mmin, seq)) {
+		put_page(req_page);
+
+		srcu_read_unlock(&fctx->fault_srcu, idx);
+
+		goto fault_retry;
+	}
+
+	/* make sure the VMA hasn't gone away */
+	vma = find_vma(current->mm, vmf->address);
+	if (vma == vmf->vma) {
+		spin_lock(&view->user_lock);
+
+		if (mmu_interval_read_retry(&view->mmin, seq)) {
+			spin_unlock(&view->user_lock);
+
+			put_page(req_page);
+
+			srcu_read_unlock(&fctx->fault_srcu, idx);
+
+			goto fault_retry;
+		}
+
+		result = remote_map_page(vmf, req_page);  /* install PTE here */
+
+		spin_unlock(&view->user_lock);
+	}
+
+	put_page(req_page);
+
+	srcu_read_unlock(&fctx->fault_srcu, idx);
+
+	return result;
+
+out_err:
+	/* retake current mm semaphore */
+	down_read(&current->mm->mmap_sem);
+out_invalid:
+	srcu_read_unlock(&fctx->fault_srcu, idx);
+
+	return result;
+
+out_retry:
+	/* retake current mm semaphore */
+	down_read(&current->mm->mmap_sem);
+
+	srcu_read_unlock(&fctx->fault_srcu, idx);
+
+	/* TODO: some optimizations work here when we arrive with FAULT_FLAG_ALLOW_RETRY */
+	/* TODO: mmap_sem doesn't need to be taken, then dropped */
+
+	/*
+	 * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+	 * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+	 * not set.
+	 *
+	 * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+	 * set, VM_FAULT_RETRY can still be returned if and only if there are
+	 * fatal_signal_pending()s, and the mmap_sem must be released before
+	 * returning it.
+	 */
+	if (vmf->flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_TRIED)) {
+		if (!(vmf->flags & FAULT_FLAG_KILLABLE))
+			if (current && fatal_signal_pending(current)) {
+				up_read(&current->mm->mmap_sem);
+				return VM_FAULT_RETRY;
+			}
+
+		if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
+			up_read(&mm->mmap_sem);
+
+		return VM_FAULT_RETRY;
+	} else
+		return VM_FAULT_SIGBUS;
+}
+
+/*
+ * This is called in remove_vma() at the end of __do_munmap() after the address
+ * space has been unmapped and the page tables have been freed.
+ */
+static void mirror_vm_close(struct vm_area_struct *vma)
+{
+	struct remote_vma_context *ctx = vma->vm_private_data;
+	struct remote_view *view = ctx->view;
+
+	pr_debug("%s: VMA %lx-%lx (%lu bytes)\n", __func__,
+		vma->vm_start, vma->vm_end, vma->vm_end - vma->vm_start);
+
+	/* will wait for any running invalidate notifiers to finish */
+	down_write(&view->rmap_lock);
+	rmap_interval_tree_remove(ctx, &view->rb_rmap);
+	up_write(&view->rmap_lock);
+	view_put(view);
+
+	remote_vma_context_free(ctx);
+}
+
+/* prevent partial unmap of destination VMA */
+static int mirror_vm_split(struct vm_area_struct *area, unsigned long addr)
+{
+	return -EINVAL;
+}
+
+static const struct vm_operations_struct mirror_vm_ops = {
+	.close = mirror_vm_close,
+	.fault = mirror_vm_fault,
+	.split = mirror_vm_split,
+	.zap_pte = mirror_zap_pte,
+};
+
+static bool is_mirror_vma(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &mirror_vm_ops;
+}
+
+static struct remote_view *
+getme_matching_view(struct remote_file_context *fctx,
+		    unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	/* lookup view for the VMA offset range */
+	view = view_search_get(fctx, start, last);
+	if (!view)
+		return NULL;
+
+	/* make sure the interval we're after is contained in the view */
+	if (start < view->offset || last > view->offset + view->size) {
+		view_put(view);
+		return NULL;
+	}
+
+	return view;
+}
+
+static struct remote_view *
+getme_exact_view(struct remote_file_context *fctx,
+		 unsigned long start, unsigned long last)
+{
+	struct remote_view *view;
+
+	/* lookup view for the VMA offset range */
+	view = view_search_get(fctx, start, last);
+	if (!view)
+		return NULL;
+
+	/* make sure the interval we're after is contained in the view */
+	if (start != view->offset || last != view->offset + view->size) {
+		view_put(view);
+		return NULL;
+	}
+
+	return view;
+}
+
+static int mirror_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct remote_file_context *fctx = file->private_data;
+	struct remote_vma_context *ctx;
+	unsigned long start, length, last;
+	struct remote_view *view;
+
+	start = vma->vm_pgoff << PAGE_SHIFT;
+	length = vma->vm_end - vma->vm_start;
+	last = start + length;
+
+	pr_debug("%s: VMA %lx-%lx (%lu bytes), offsets %lx-%lx\n", __func__,
+		vma->vm_start, vma->vm_end, length, start, last);
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		pr_debug("%s: VMA is not shared\n", __func__);
+		return -EINVAL;
+	}
+
+	/* prepare the context */
+	ctx = remote_vma_context_alloc();
+	if (!ctx)
+		return -ENOMEM;
+
+	/* lookup view for the VMA offset range */
+	mutex_lock(&fctx->views_lock);
+	view = getme_matching_view(fctx, start, last);
+	mutex_unlock(&fctx->views_lock);
+	if (!view) {
+		pr_debug("%s: no view for range %lx-%lx\n", __func__, start, last);
+		remote_vma_context_free(ctx);
+		return -EINVAL;
+	}
+
+	/* VMA must be linked to ctx before adding to rmap tree !! */
+	vma->vm_private_data = ctx;
+	ctx->vma = vma;
+
+	/* view may already be invalidated by the time it's linked */
+	down_write(&view->rmap_lock);
+	ctx->view = view;	/* view reference goes here */
+	rmap_interval_tree_insert(ctx, &view->rb_rmap);
+	up_write(&view->rmap_lock);
+
+	/* set basic VMA properties */
+	vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
+	vma->vm_ops = &mirror_vm_ops;
+
+	return 0;
+}
+
+static const struct file_operations mirror_ops = {
+	.open = mirror_dev_open,
+	.unlocked_ioctl = mirror_dev_ioctl,
+	.compat_ioctl = mirror_dev_ioctl,
+	.llseek = no_llseek,
+	.mmap = mirror_dev_mmap,
+	.release = mirror_dev_release,
+};
+
+static struct miscdevice mirror_dev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mirror-proc",
+	.fops = &mirror_ops,
+};
+
+builtin_misc_device(mirror_dev);
+
+static int pidfd_mem_remap(struct remote_file_context *fctx, unsigned long address)
+{
+	struct vm_area_struct *vma;
+	unsigned long start, last;
+	struct remote_vma_context *ctx;
+	struct remote_view *view, *new_view;
+	int result = 0;
+
+	pr_debug("%s: address %lx\n", __func__, address);
+
+	down_write(&current->mm->mmap_sem);
+
+	vma = find_vma(current->mm, address);
+	if (!vma || !is_mirror_vma(vma)) {
+		result = -EINVAL;
+		goto out_vma;
+	}
+
+	ctx = vma->vm_private_data;
+	view = ctx->view;
+
+	if (view->valid)
+		goto out_vma;
+
+	start = vma->vm_pgoff << PAGE_SHIFT;
+	last = start + (vma->vm_end - vma->vm_start);
+
+	/* lookup view for the VMA offset range */
+	mutex_lock(&fctx->views_lock);
+	new_view = getme_matching_view(fctx, start, last);
+	mutex_unlock(&fctx->views_lock);
+	if (!new_view) {
+		result = -EINVAL;
+		goto out_vma;
+	}
+	/* do not link to another invalid view */
+	if (!new_view->valid) {
+		view_put(new_view);
+		result = -EINVAL;
+		goto out_vma;
+	}
+
+	/* we have current->mm->mmap_sem in write mode, so no faults going on */
+	down_write(&view->rmap_lock);
+	rmap_interval_tree_remove(ctx, &view->rb_rmap);
+	up_write(&view->rmap_lock);
+	view_put(view);		/* ctx reference */
+
+	/* replace with the new view */
+	down_write(&new_view->rmap_lock);
+	ctx->view = new_view;	/* new view reference goes here */
+	rmap_interval_tree_insert(ctx, &new_view->rb_rmap);
+	up_write(&new_view->rmap_lock);
+
+out_vma:
+	up_write(&current->mm->mmap_sem);
+
+	return result;
+}
+
+static long
+pidfd_mem_map_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
+{
+	struct remote_file_context *fctx = file->private_data;
+	long result = 0;
+
+	switch (ioctl) {
+	case PIDFD_MEM_REMAP:
+		result = pidfd_mem_remap(fctx, arg);
+		break;
+
+	default:
+		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static void pidfd_mem_lock(struct remote_file_context *fctx)
+{
+	pr_debug("%s:\n", __func__);
+
+	mutex_lock(&fctx->views_lock);
+	fctx->locked = true;
+	mutex_unlock(&fctx->views_lock);
+}
+
+static int pidfd_mem_map(struct remote_file_context *fctx, struct pidfd_mem_map *map)
+{
+	struct remote_view *view;
+	int result = 0;
+
+	pr_debug("%s: offset %lx, size %lx, address %lx\n",
+		__func__, map->offset, map->size, (long)map->address);
+
+	if (!PAGE_ALIGNED(map->offset))
+		return -EINVAL;
+	if (!PAGE_ALIGNED(map->size))
+		return -EINVAL;
+	if (!PAGE_ALIGNED(map->address))
+		return -EINVAL;
+
+	/* make sure we're creating the view for a valid address space */
+	if (!mmget_not_zero(fctx->mm))
+		return -EINVAL;
+
+	view = view_alloc(fctx->mm, map->address, map->size, map->offset);
+	if (IS_ERR(view)) {
+		result = PTR_ERR(view);
+		goto out_mm;
+	}
+
+	mutex_lock(&fctx->views_lock);
+
+	/* locked ? */
+	if (unlikely(fctx->locked)) {
+		pr_debug("%s: views locked\n", __func__);
+		result = -EINVAL;
+		goto out;
+	}
+
+	/* overlaps another view ? */
+	if (view_overlaps(fctx, map->offset, map->offset + map->size)) {
+		pr_debug("%s: range overlaps\n", __func__);
+		result = -EALREADY;
+		goto out;
+	}
+
+	view_insert(fctx, view);
+
+out:
+	mutex_unlock(&fctx->views_lock);
+
+	view_put(view);			/* usage reference */
+out_mm:
+	mmput(fctx->mm);
+
+	return result;
+}
+
+static int pidfd_mem_unmap(struct remote_file_context *fctx, struct pidfd_mem_unmap *unmap)
+{
+	struct remote_view *view;
+
+	pr_debug("%s: offset %lx, size %lx\n",
+		__func__, unmap->offset, unmap->size);
+
+	if (!PAGE_ALIGNED(unmap->offset))
+		return -EINVAL;
+	if (!PAGE_ALIGNED(unmap->size))
+		return -EINVAL;
+
+	mutex_lock(&fctx->views_lock);
+
+	if (unlikely(fctx->locked)) {
+		mutex_unlock(&fctx->views_lock);
+		return -EINVAL;
+	}
+
+	view = getme_exact_view(fctx, unmap->offset, unmap->offset + unmap->size);
+	if (!view) {
+		mutex_unlock(&fctx->views_lock);
+		return -EINVAL;
+	}
+
+	view_remove(fctx, view);
+
+	mutex_unlock(&fctx->views_lock);
+
+	/*
+	 * The view may still be referenced by a mapping VMA, so dropping
+	 * a reference here may not delete it. The view will be marked as
+	 * invalid, together with all the VMAs linked to it.
+	 */
+	WRITE_ONCE(view->valid, false);
+
+	/* wait until local faults finish */
+	synchronize_srcu(&fctx->fault_srcu);
+
+	/*
+	 * because the view is marked as invalid, faults will not succeed, so
+	 * we don't have to worry about synchronizing invalidations/faults
+	 */
+	mirror_clear_view(view, 0, ULONG_MAX, false);
+
+	view_put(view);			/* usage reference */
+
+	return 0;
+}
+
+static long
+pidfd_mem_ctrl_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
+{
+	struct remote_file_context *fctx = file->private_data;
+	void __user *argp = (void __user *)arg;
+	long result = 0;
+
+	switch (ioctl) {
+	case PIDFD_MEM_MAP: {
+		struct pidfd_mem_map map;
+
+		result = -EINVAL;
+		if (copy_from_user(&map, argp, sizeof(map)))
+			return result;
+
+		result = pidfd_mem_map(fctx, &map);
+		break;
+	}
+
+	case PIDFD_MEM_UNMAP: {
+		struct pidfd_mem_unmap unmap;
+
+		result = -EINVAL;
+		if (copy_from_user(&unmap, argp, sizeof(unmap)))
+			return result;
+
+		result = pidfd_mem_unmap(fctx, &unmap);
+		break;
+	}
+
+	case PIDFD_MEM_LOCK:
+		pidfd_mem_lock(fctx);
+		break;
+
+	default:
+		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
+		result = -ENOTTY;
+	}
+
+	return result;
+}
+
+static int pidfd_mem_ctrl_release(struct inode *inode, struct file *file)
+{
+	struct remote_file_context *fctx = file->private_data;
+
+	pr_debug("%s: file %p\n", __func__, file);
+
+	remote_file_context_put(fctx);
+
+	return 0;
+}
+
+static const struct file_operations pidfd_mem_ctrl_ops = {
+	.owner = THIS_MODULE,
+	.unlocked_ioctl = pidfd_mem_ctrl_ioctl,
+	.compat_ioctl = pidfd_mem_ctrl_ioctl,
+	.llseek = no_llseek,
+	.release = pidfd_mem_ctrl_release,
+};
+
+static unsigned long
+pidfd_mem_get_unmapped_area(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct remote_file_context *fctx = file->private_data;
+	unsigned long start = pgoff << PAGE_SHIFT;
+	unsigned long last = start + len;
+	unsigned long remote_addr, align_offset;
+	struct remote_view *view;
+	struct vm_area_struct *vma;
+	unsigned long result;
+
+	pr_debug("%s: addr %lx, len %lx, pgoff %lx, flags %lx\n",
+		__func__, addr, len, pgoff, flags);
+
+	if (flags & MAP_FIXED) {
+		if (addr == 0)
+			return -ENOMEM;
+		else
+			return addr;
+	}
+
+	// TODO: elaborate on this case; we must still preserve PMD alignment
+	// TODO: only if THP is enabled
+	if (addr == 0)
+		return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+
+	/* use this backing VMA */
+	vma = find_vma(current->mm, addr);
+	if (!vma) {
+		pr_debug("%s: no VMA found at %lx\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	/* a backing VMA must have been mapped with PROT_NONE */
+	if (vma_is_accessible(vma)) {
+		pr_debug("%s: VMA at %lx is not a backing VMA\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	/*
+	 * if the view gets removed after this check, we will create a VMA
+	 * with no backing view and the subsequent mmap() will fail
+	 */
+	mutex_lock(&fctx->views_lock);
+	view = getme_matching_view(fctx, start, last);
+	mutex_unlock(&fctx->views_lock);
+	if (!view) {
+		pr_debug("%s: no view for range %lx-%lx\n", __func__, start, last);
+		return -EINVAL;
+	}
+
+	/*
+	 * mirror the remote address' offset within a PMD so the VMA gets
+	 * the alignment required for transparent huge pages
+	 */
+	remote_addr = start - view->offset + view->address;
+	align_offset = remote_addr % PMD_SIZE;
+
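+	/* lowest address >= addr whose offset within a PMD matches remote_addr */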
+	if (addr % PMD_SIZE <= align_offset)
+		result = (addr & PMD_MASK) + align_offset;
+	else
+		result = (addr & PMD_MASK) + align_offset + PMD_SIZE;
+
+	view_put(view);		/* usage reference */
+
+	return result;
+}
+
+static const struct file_operations pidfd_mem_map_fops = {
+	.owner = THIS_MODULE,
+	.mmap = mirror_dev_mmap,
+	.get_unmapped_area = pidfd_mem_get_unmapped_area,
+	.unlocked_ioctl = pidfd_mem_map_ioctl,
+	.compat_ioctl = pidfd_mem_map_ioctl,
+	.llseek = no_llseek,
+	.release = mirror_dev_release,
+};
+
+int task_remote_map(struct task_struct *task, int fds[])
+{
+	struct mm_struct *mm;
+	struct remote_file_context *fctx;
+	struct file *ctrl, *map;
+	int ret;
+
+	/* allocate the file context shared by the control and access fds */
+	fctx = remote_file_context_alloc();
+	if (!fctx)
+		return -ENOMEM;
+
+	/* create the control fd (fds[0]) and the access fd (fds[1]) */
+	fds[0] = fds[1] = -1;
+
+	fds[0] = anon_inode_getfd("[pidfd_mem.ctrl]", &pidfd_mem_ctrl_ops, fctx,
+				  O_RDWR | O_CLOEXEC);
+	if (fds[0] < 0) {
+		ret = fds[0];
+		goto out;
+	}
+	remote_file_context_get(fctx);
+
+	ctrl = fget(fds[0]);
+	ctrl->f_mode |= FMODE_WRITE_IOCTL;
+	fput(ctrl);
+
+	fds[1] = anon_inode_getfd("[pidfd_mem.map]", &pidfd_mem_map_fops, fctx,
+				  O_RDWR | O_CLOEXEC | O_LARGEFILE);
+	if (fds[1] < 0) {
+		ret = fds[1];
+		goto out;
+	}
+	remote_file_context_get(fctx);
+
+	map = fget(fds[1]);
+	map->f_mode |= FMODE_LSEEK | FMODE_UNSIGNED_OFFSET | FMODE_RANDOM;
+	fput(map);
+
+	mm = get_task_mm(task);
+	if (!mm) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* reference this mm in fctx */
+	mmgrab(mm);
+	fctx->mm = mm;
+
+	mmput(mm);
+	remote_file_context_put(fctx);		/* usage reference */
+
+	return 0;
+
+out:
+	if (fds[0] != -1) {
+		__close_fd(current->files, fds[0]);
+		remote_file_context_put(fctx);
+	}
+
+	if (fds[1] != -1) {
+		__close_fd(current->files, fds[1]);
+		remote_file_context_put(fctx);
+	}
+
+	// TODO: using __close_fd() does not guarantee success, use other means
+	// for file allocation & error recovery
+
+	remote_file_context_put(fctx);
+
+	return ret;
+}

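The teardown ordering in pidfd_mem_unmap() above is a standard SRCU retire
pattern: mark the object invalid, wait for readers already inside the
read-side critical section, then tear down. A minimal self-contained sketch
of the same ordering (illustrative only; mirror_srcu and valid stand in for
fctx->fault_srcu and view->valid):

#include <linux/compiler.h>
#include <linux/errno.h>
#include <linux/srcu.h>
#include <linux/types.h>

DEFINE_STATIC_SRCU(mirror_srcu);	/* stands in for fctx->fault_srcu */

struct object {
	bool valid;
};

/* reader side: the fault handler */
static int reader(struct object *obj)
{
	int idx, ret = -EINVAL;

	idx = srcu_read_lock(&mirror_srcu);
	if (READ_ONCE(obj->valid))
		ret = 0;		/* safe to install the page here */
	srcu_read_unlock(&mirror_srcu, idx);

	return ret;
}

/* writer side: the PIDFD_MEM_UNMAP path */
static void retire(struct object *obj)
{
	WRITE_ONCE(obj->valid, false);	/* new readers bail out */
	synchronize_srcu(&mirror_srcu);	/* drain readers already inside */
	/* from here on, no fault can race with the cleanup */
}

Because the fault path takes srcu_read_lock() before re-checking the valid
flag, any fault that observed valid == true completes before
mirror_clear_view() runs.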


* [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (3 preceding siblings ...)
  2020-09-04 11:31 ` [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
@ 2020-09-04 11:31 ` Adalbert Lazăr
  2020-09-04 19:18   ` Florian Weimer
  2020-09-07 14:55   ` Christian Brauner
  2020-09-04 12:11 ` [RESEND RFC PATCH 0/5] Remote mapping Jason Gunthorpe
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 44+ messages in thread
From: Adalbert Lazăr @ 2020-09-04 11:31 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Paolo Bonzini, Mihai Donțu, Mircea Cirjaliu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner, Christian Brauner, Adalbert Lazăr

From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>

This system call returns two file descriptors for inspecting the address
space of a remote process: one for control and one for access, to be used
according to the remote mapping specification in the cover letter.

Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/pid.h                    |  1 +
 include/linux/syscalls.h               |  1 +
 include/uapi/asm-generic/unistd.h      |  2 +
 kernel/exit.c                          |  2 +-
 kernel/pid.c                           | 55 ++++++++++++++++++++++++++
 7 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 54581ac671b4..ca1b5a32dbc5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,5 +440,6 @@
 433	i386	fspick			sys_fspick
 434	i386	pidfd_open		sys_pidfd_open
 435	i386	clone3			sys_clone3
+436	i386	pidfd_mem		sys_pidfd_mem
 437	i386	openat2			sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37b844f839bc..6138d3d023f8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick			sys_fspick
 434	common	pidfd_open		sys_pidfd_open
 435	common	clone3			sys_clone3
+436	common	pidfd_mem		sys_pidfd_mem
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd		sys_pidfd_getfd
 
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..9ec23ab23fd4 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -76,6 +76,7 @@ extern const struct file_operations pidfd_fops;
 
 struct file;
 
+extern struct pid *pidfd_get_pid(unsigned int fd);
 extern struct pid *pidfd_pid(const struct file *file);
 
 static inline struct pid *get_pid(struct pid *pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..621f3d52ed4e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -934,6 +934,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
 asmlinkage long sys_syncfs(int fd);
 asmlinkage long sys_setns(int fd, int nstype);
 asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
+asmlinkage long sys_pidfd_mem(int pidfd, int __user *fds, unsigned int flags);
 asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
 			     unsigned int vlen, unsigned flags);
 asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 3a3201e4618e..2663afc03c86 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,6 +850,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
+#define __NR_pidfd_mem 436
+__SYSCALL(__NR_pidfd_mem, sys_pidfd_mem)
 
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
diff --git a/kernel/exit.c b/kernel/exit.c
index 389a88cb3081..37cd8949e606 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1464,7 +1464,7 @@ static long do_wait(struct wait_opts *wo)
 	return retval;
 }
 
-static struct pid *pidfd_get_pid(unsigned int fd)
+struct pid *pidfd_get_pid(unsigned int fd)
 {
 	struct fd f;
 	struct pid *pid;
diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..c9c49edf4a8a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/remote_mapping.h>
 
 struct pid init_struct_pid = {
 	.count		= REFCOUNT_INIT(1),
@@ -565,6 +566,60 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	return fd;
 }
 
+/**
+ * pidfd_mem() - Allow access to process address space.
+ *
+ * @pidfd: pid file descriptor for the target process
+ * @fds:   array where the control and access file descriptors are returned
+ * @flags: reserved for future use; must be 0
+ *
+ * This creates a pair of file descriptors used to gain access to the
+ * target process memory. The control fd is used to establish a linear
+ * mapping between an offset range and a userspace address range.
+ * The access fd is used to mmap(offset range) on the client side.
+ *
+ * Return: On success, 0 is returned.
+ *         On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE3(pidfd_mem, int, pidfd, int __user *, fds, unsigned int, flags)
+{
+	struct pid *pid;
+	struct task_struct *task;
+	int ret_fds[2];
+	int ret;
+
+	if (pidfd < 0)
+		return -EINVAL;
+	if (!fds)
+		return -EINVAL;
+	if (flags)
+		return -EINVAL;
+
+	pid = pidfd_get_pid(pidfd);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	put_pid(pid);
+	if (!task)
+		return -ESRCH;	/* get_pid_task() returns NULL, not ERR_PTR() */
+
+	ret = -EPERM;
+	if (unlikely(task == current) || capable(CAP_SYS_PTRACE))
+		ret = task_remote_map(task, ret_fds);
+	put_task_struct(task);
+	if (ret < 0)
+		return ret;
+
+	if (copy_to_user(fds, ret_fds, sizeof(ret_fds))) {
+		put_unused_fd(ret_fds[0]);
+		put_unused_fd(ret_fds[1]);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */

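For reference, a minimal userspace sketch of the intended calling sequence.
The syscall numbers are the ones wired up in the tables above; PIDFD_MEM_MAP
and struct pidfd_mem_map are assumed to come from a uapi export of the
series' <linux/remote_mapping.h>, and error handling is elided:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/remote_mapping.h>	/* assumed uapi export of PIDFD_MEM_* */

static void *mirror_remote(pid_t pid, uint64_t remote_addr, size_t len)
{
	int pidfd, fds[2];
	struct pidfd_mem_map map = {
		.address = remote_addr,		/* page-aligned remote VA */
		.offset  = 0,
		.size    = len,			/* page-aligned length */
	};

	pidfd = syscall(434 /* __NR_pidfd_open */, pid, 0);
	syscall(436 /* __NR_pidfd_mem */, pidfd, fds, 0);

	/* control plane: expose [remote_addr, remote_addr + len) at offset 0 */
	ioctl(fds[0], PIDFD_MEM_MAP, &map);

	/* access fd: the mapping mirrors the remote range; holes raise SIGBUS */
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fds[1], 0);
}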


* Re: [RESEND RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios
  2020-09-04 11:31 ` [RESEND RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Adalbert Lazăr
@ 2020-09-04 12:03   ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 12:03 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 04, 2020 at 02:31:14PM +0300, Adalbert Lazăr wrote:
> From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> 
> The combination of remote mapping + KVM causes nested range invalidations,
> which trigger lockdep warnings.

You need a better solution than this, this lockdep is protecting a lot
of other more common code against various errors.

Jason



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (4 preceding siblings ...)
  2020-09-04 11:31 ` [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
@ 2020-09-04 12:11 ` Jason Gunthorpe
  2020-09-04 13:24   ` Mircea CIRJALIU - MELIU
  2020-09-04 19:41   ` Matthew Wilcox
  2020-09-04 19:19 ` Florian Weimer
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 12:11 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> process address space within the specified range. Pages are installed in the
> current process page tables at fault time and removed by the mmu_interval_notifier
> invalidate callback. No further memory management is involved.
> On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> or if the remote process address space was reaped by OOM, the remote mapping
> fault handler returns VM_FAULT_SIGBUS.

I still think anything along these lines needs to meet the XPMEM use
cases as well, we have to have more general solutions for such MM
stuff:

https://gitlab.com/hjelmn/xpmem

However, I think this fundamentally falls into some of the same bad
direction as xpmem.

I would much rather see this design copy & clone the VMA's than try to
mirror the PTEs inside the VMAs from the remote into a single giant
VMA and somehow split/mirror the VMA ops.

This is just too weird and fragile to be maintaible over a long
term.

For instance, one of the major bugs in things like xpmem was that they
are incompatible with get_user_pages(), largely because of this issue.

I feel like I said this already..

Jason



* RE: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 12:11 ` [RESEND RFC PATCH 0/5] Remote mapping Jason Gunthorpe
@ 2020-09-04 13:24   ` Mircea CIRJALIU - MELIU
  2020-09-04 13:39     ` Jason Gunthorpe
  2020-09-04 19:41   ` Matthew Wilcox
  1 sibling, 1 reply; 44+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2020-09-04 13:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner

> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > VMAs obtained by mmap()ing memory access fds mirror the contents of
> > the remote process address space within the specified range. Pages are
> > installed in the current process page tables at fault time and removed
> > by the mmu_interval_notifier invalidate callback. No further memory
> management is involved.
> > On attempts to access a hole, or if a mapping was removed by
> > PIDFD_MEM_UNMAP, or if the remote process address space was reaped
> by
> > OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> 
> I still think anything along these lines needs to meet the XPMEM use cases as
> well, we have to have more general solutions for such MM
> stuff:
> 
> https://gitlab.com/hjelmn/xpmem
> 
> However, I think this fundamentally falls into some of the same bad direction
> as xpmem.
> 
> I would much rather see this design copy & clone the VMA's than try to
> mirror the PTEs inside the VMAs from the remote into a single giant VMA and
> somehow split/mirror the VMA ops.

This design was made specifically for virtual machine introspection, where we 
care more about the contents of the address space, rather than the remote VMAs
and their vmops. (Right now only anon pages can be mapped, but I guess
we can extend to pagecache pages as well.) I just used what seemed to be the
common denominator to all page-related operations: range invalidation.
This looks like a general solution.

IMO cloning a VMA in an address space that has a completely different layout
will present its own set of caveats: What happens if the VMA resizes/splits? 
Can you replay all the remote VMA vmops on the clone VMA?

> This is just too weird and fragile to be maintaible over a long term.
> 
> For instance, one of the major bugs in things like xpmem was that they are
> incompatible with get_user_pages(), largely because of this issue.

We support get_user_pages(), that's how we integrate with KVM.
The difference is the page returned will not belong to the current process.

> I feel like I said this already..
> 
> Jason


* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 13:24   ` Mircea CIRJALIU - MELIU
@ 2020-09-04 13:39     ` Jason Gunthorpe
  2020-09-04 14:18       ` Mircea CIRJALIU - MELIU
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 13:39 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 04, 2020 at 01:24:43PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of
> > > the remote process address space within the specified range. Pages are
> > > installed in the current process page tables at fault time and removed
> > > by the mmu_interval_notifier invalidate callback. No further memory
> > management is involved.
> > > On attempts to access a hole, or if a mapping was removed by
> > > PIDFD_MEM_UNMAP, or if the remote process address space was reaped
> > by
> > > OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> > 
> > I still think anything along these lines needs to meet the XPMEM use cases as
> > well, we have to have more general solutions for such MM
> > stuff:
> > 
> > https://gitlab.com/hjelmn/xpmem
> > 
> > However, I think this fundamentally falls into some of the same bad direction
> > as xpmem.
> > 
> > I would much rather see this design copy & clone the VMA's than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant VMA and
> > somehow split/mirror the VMA ops.
> 
> This design was made specifically for virtual machine introspection, where we 
> care more about the contents of the address space, rather than the remote VMAs
> and their vmops. (Right now only anon pages can be mapped, but I guess
> we can extend to pagecache pages as well.) I just used what seemed to be the
> common denominator to all page-related operations: range invalidation.
> This looks like a general solution.

The point is that a VMA is how the MM connects its parts together;
cloning the content of a VMA without the rest of the VMA meta-data is
just going to be very fragile in the long run.

Especially if the VMA is presented as a normal VMA with working struct
pages/etc, not a pfn map.

> IMO cloning a VMA in an address space that has a completely different layout
> will present its own set of caveats: What happens if the VMA resizes/splits? 
> Can you replay all the remote VMA vmops on the clone VMA?

The mirror would have to reclone the source VMA every time the source
VMA changes.

> > This is just too weird and fragile to be maintaible over a long term.
> > 
> > For instance, one of the major bugs in things like xpmem was that they are
> > incompatible with get_user_pages(), largely because of this issue.
> 
> We support get_user_pages(), that's how we integrate with KVM.

This seems really sketchy, get_user_pages is sensitive to the VMA,
what happens when VMA flags are different/etc?

Jason



* RE: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 13:39     ` Jason Gunthorpe
@ 2020-09-04 14:18       ` Mircea CIRJALIU - MELIU
  2020-09-04 14:39         ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2020-09-04 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

> On Fri, Sep 04, 2020 at 01:24:43PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > > VMAs obtained by mmap()ing memory access fds mirror the contents
> > > > of the remote process address space within the specified range.
> > > > Pages are installed in the current process page tables at fault
> > > > time and removed by the mmu_interval_notifier invalidate callback.
> > > > No further memory
> > > management is involved.
> > > > On attempts to access a hole, or if a mapping was removed by
> > > > PIDFD_MEM_UNMAP, or if the remote process address space was
> reaped
> > > by
> > > > OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> > >
> > > I still think anything along these lines needs to meet the XPMEM use
> > > cases as well, we have to have more general solutions for such MM
> > > stuff:
> > >
> > > https://gitlab.com/hjelmn/xpmem
> > >
> > > However, I think this fundamentally falls into some of the same bad
> > > direction as xpmem.
> > >
> > > I would much rather see this design copy & clone the VMA's than try
> > > to mirror the PTEs inside the VMAs from the remote into a single
> > > giant VMA and somehow split/mirror the VMA ops.
> >
> > This design was made specifically for virtual machine introspection,
> > where we care more about the contents of the address space, rather
> > than the remote VMAs and their vmops. (Right now only anon pages can
> > be mapped, but I guess we can extend to pagecache pages as well.) I
> > just used what seemed to be the common denominator to all page-related
> operations: range invalidation.
> > This looks like a general solution.
> 
> The point is that a VMA is how the MM connects its parts together, cloning
> the content of a VMA without the rest of the VMA meta-data is just going to
> be very fragile in the long run..
> 
> Especially if the VMA is presented as a normal VMA with working struct
> pages/etc, not a pfn map.
>
> > IMO cloning a VMA in an address space that has a completely different
> > layout will present its own set of caveats: What happens if the VMA
> resizes/splits?
> > Can you replay all the remote VMA vmops on the clone VMA?
> 
> The mirror would have to reclone the source VMA every time the source
> VMA changes.
> 
> > > This is just too weird and fragile to be maintaible over a long term.
> > >
> > > For instance, one of the major bugs in things like xpmem was that
> > > they are incompatible with get_user_pages(), largely because of this
> issue.
> >
> > We support get_user_pages(), that's how we integrate with KVM.
> 
> This seems really sketchy, get_user_pages is sensitive to the VMA, what
> happens when VMA flags are different/etc?

A debugger shouldn't complain if a portion of the debuggee is read-only,
just overwrite the data.

> Jason


* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 14:18       ` Mircea CIRJALIU - MELIU
@ 2020-09-04 14:39         ` Jason Gunthorpe
  2020-09-04 15:40           ` Mircea CIRJALIU - MELIU
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 14:39 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > This seems really sketchy, get_user_pages is sensitive to the VMA, what
> > happens when VMA flags are different/etc?
> 
> A debugger shouldn't complain if a portion of the debuggee is read-only,
> just overwrite the data.

At this point the kernel API here is so incredibly limited you may as
well use a memfd for passing the shared address space instead of
trying to do and maintain this complexity.

Your use case is only qemu, so what is the problem to replace the
allocator backing VM memory in userspace? Other people have been
talking about doing a memfd already for different reasons - and memfd
can already be shared as this scheme desires.

Jason



* RE: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 14:39         ` Jason Gunthorpe
@ 2020-09-04 15:40           ` Mircea CIRJALIU - MELIU
  2020-09-04 16:11             ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2020-09-04 15:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

> On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > This seems really sketchy, get_user_pages is sensitive to the VMA,
> > > what happens when VMA flags are different/etc?
> >
> > A debugger shouldn't complain if a portion of the debuggee is
> > read-only, just overwrite the data.
> 
> At this point the kernel API here is so incredibly limited you may as well use a
> memfd for passing the shared address space instead of trying to do and
> maintain this complexity.
> 
> Your use case is only qemu, so what is the problem to replace the allocator
> backing VM memory in userspace? Other people have been talking about
> doing a memfd already for different reasons - and memfd can already be
> shared as this scheme desires.

KSM doesn't work on shmem.
Once you replace the allocator you render KSM useless.

Besides that, I had a mail once from Paolo Bonzini:
>> Hi,
>>
>> here at FOSDEM we discussed having a way for a parent process to 
>> split parts of an mmap range with one or more child processes.  This 
>> turns out to be a generalization of the remote memory mapping concept 
>> that BitDefender proposed for virtual machine introspection ( 
>> https://patchwork.kernel.org/patch/11284561/).  So far the patches 
>> haven't had a great reception from the MM people, but it shouldn't be 
>> hard to adjust the API according to the sketch below.  I am also 
>> including Mircea who is the author.
>>
>> The proposed API is a new pidfd system call, through which the parent 
>> can map portions of its virtual address space into a file descriptor 
>> and then pass that file descriptor to a child.
(the rest can be found in the cover letter)

Therefore I had to write a module that peeks into anonymous process memory
and is compatible with KSM. These were among the requirements for the first
version of remote mapping, which ended up non-scalable.

It turns out it can peek into any kind of memory involving pages.
Also it doesn't have the overhead associated with mapping a page in a VMA.
And compared to ptrace(), it can keep the pages resident as long as needed.

Mircea


* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 15:40           ` Mircea CIRJALIU - MELIU
@ 2020-09-04 16:11             ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 16:11 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 04, 2020 at 03:40:55PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > > This seems really sketchy, get_user_pages is sensitive to the VMA,
> > > > what happens when VMA flags are different/etc?
> > >
> > > A debugger shouldn't complain if a portion of the debuggee is
> > > read-only, just overwrite the data.
> > 
> > At this point the kernel API here is so incredibly limited you may as well use a
> > memfd for passing the shared address space instead of trying to do and
> > maintain this complexity.
> > 
> > Your use case is only qemu, so what is the problem to replace the allocator
> > backing VM memory in userspace? Other people have been talking about
> > doing a memfd already for different reasons - and memfd can already be
> > shared as this scheme desires.
> 
> KSM doesn't work on shmem.
> Once you replace the allocator you render KSM useless.

I suspect making memfd work with KSM will be much less hacky than
this.

> Figures out it can peek into any kind of memory involving pages.

No, this approach is really limited to anonymous VMAs.

You might make some argument that VMA differences can be ignored if
they are all anonymous to start with, but certainly not once other
types are VMAs are included.

Jason



* Re: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-04 11:31 ` [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
@ 2020-09-04 17:55   ` Oleg Nesterov
  2020-09-07 14:30   ` Oleg Nesterov
  2020-09-07 15:02   ` Christian Brauner
  2 siblings, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2020-09-04 17:55 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner

I didn't read this series. This is not my area and to be honest even
the API doesn't fit my head. I leave this to other reviewers.

Just a couple of nits after very quick glance.

On 09/04, Adalbert Lazăr wrote:
>
> +static vm_fault_t mirror_vm_fault(struct vm_fault *vmf)
> +{
...

> +	up_read(&current->mm->mmap_sem);

...

> +	down_read(&current->mm->mmap_sem);
> +
> +	/* expedite retry */
> +	if (mmu_interval_check_retry(&view->mmin, seq)) {
> +		put_page(req_page);
> +
> +		srcu_read_unlock(&fctx->fault_srcu, idx);
> +
> +		goto fault_retry;
> +	}
> +
> +	/* make sure the VMA hasn't gone away */
> +	vma = find_vma(current->mm, vmf->address);
> +	if (vma == vmf->vma) {

vmf->vma can go away, its memory can be freed and re-allocated as another
vma returned by find_vma() above.

> +int task_remote_map(struct task_struct *task, int fds[])
> +{

...

> +	fds[1] = anon_inode_getfd("[pidfd_mem.map]", &pidfd_mem_map_fops, fctx,
> +				  O_RDWR | O_CLOEXEC | O_LARGEFILE);
> +	if (fds[1] < 0) {
> +		ret = fds[1];
> +		goto out;
> +	}
> +	remote_file_context_get(fctx);
> +
> +	map = fget(fds[1]);

Another thread can close this file right after fd_install(). fget() can return
NULL or another unrelated file.

Oleg.




* Re: [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call
  2020-09-04 11:31 ` [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
@ 2020-09-04 19:18   ` Florian Weimer
  2020-09-07 14:55   ` Christian Brauner
  1 sibling, 0 replies; 44+ messages in thread
From: Florian Weimer @ 2020-09-04 19:18 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner, Christian Brauner

* Adalbert Lazăr:

> +/**
> + * pidfd_mem() - Allow access to process address space.
> + *
> + * @pidfd: pid file descriptor for the target process
> + * @fds:   array where the control and access file descriptors are returned
> + * @flags: flags to pass
> + *
> + * This creates a pair of file descriptors used to gain access to the
> + * target process memory. The control fd is used to establish a linear
> + * mapping between an offset range and a userspace address range.
> + * The access fd is used to mmap(offset range) on the client side.
> + *
> + * Return: On success, 0 is returned.
> + *         On error, a negative errno number will be returned.
> + */
> +SYSCALL_DEFINE3(pidfd_mem, int, pidfd, int __user *, fds, unsigned int, flags)

Structs are cheap.  Can you use a struct pointer for the fds argument?



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (5 preceding siblings ...)
  2020-09-04 12:11 ` [RESEND RFC PATCH 0/5] Remote mapping Jason Gunthorpe
@ 2020-09-04 19:19 ` Florian Weimer
  2020-09-04 20:18   ` Paolo Bonzini
  2020-09-04 19:39 ` Andy Lutomirski
  2020-09-07 15:05 ` Christian Brauner
  8 siblings, 1 reply; 44+ messages in thread
From: Florian Weimer @ 2020-09-04 19:19 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

* Adalbert Lazăr:

>>> - the control plane accepts three ioctls
>>>
>>> PIDFD_MEM_MAP takes a struct like
>>>
>>>     struct pidfd_mem_map {
>>>          uint64_t address;
>>>          off_t offset;
>>>          off_t size;
>>>          int flags;
>>>          int padding[7];
>>>     }
>>>
>>> After this is done, the memory access fd can be mmap-ed at range
>>> [offset,
>>> offset+size), and it will read memory from range [address,
>>> address+size) of the target descriptor.
>>>
>>> PIDFD_MEM_UNMAP takes a struct like
>>>
>>>     struct pidfd_mem_unmap {
>>>          off_t offset;
>>>          off_t size;
>>>     }
>>>
>>> and unmaps the corresponding range of course.

off_t depends on preprocessor macros (_FILE_OFFSET_BITS), so these
types should be uint64_t, too.

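A fixed-width layout along these lines (illustrative only) would pin the
ABI down regardless of _FILE_OFFSET_BITS:

#include <linux/types.h>

struct pidfd_mem_map {
	__u64 address;
	__u64 offset;
	__u64 size;
	__u32 flags;
	__u32 padding[7];
};
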
I'm not sure what the advantage is of returning separate file
descriptors, and not operating directly on the pidfd.



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (6 preceding siblings ...)
  2020-09-04 19:19 ` Florian Weimer
@ 2020-09-04 19:39 ` Andy Lutomirski
  2020-09-04 20:09   ` Paolo Bonzini
  2020-09-07 10:25   ` Mircea CIRJALIU - MELIU
  2020-09-07 15:05 ` Christian Brauner
  8 siblings, 2 replies; 44+ messages in thread
From: Andy Lutomirski @ 2020-09-04 19:39 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: Linux-MM, Linux API, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 4, 2020 at 4:41 AM Adalbert Lazăr <alazar@bitdefender.com> wrote:
>
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.
>

I think this is very clever, but I find myself wondering what happens
if people start trying to abuse this by, for example, setting up a
remote mapping pointing to fun regions like userfaultfd or another
remote mapping.

I'm a little concerned that it's actually too clever and that maybe a
more straightforward solution should be investigated.  I personally
rather dislike the KVM model in which the guest address space mirrors
the host (QEMU) address space rather than being its own thing.  In
particular, the current model means that extra-special-strange
mappings like SEV-encrypted memory are required to be present in the
QEMU page tables in order for the guest to see them.

(If I had noticed that last bit before it went upstream, I would have
NAKked it.  I would still like to see it deprecated and ideally
eventually removed from the kernel.  We have absolutely no business
creating incoherent mappings like this.)

--Andy



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 12:11 ` [RESEND RFC PATCH 0/5] Remote mapping Jason Gunthorpe
  2020-09-04 13:24   ` Mircea CIRJALIU - MELIU
@ 2020-09-04 19:41   ` Matthew Wilcox
  2020-09-04 19:49     ` Jason Gunthorpe
                       ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Matthew Wilcox @ 2020-09-04 19:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Mircea Cirjaliu, Andy Lutomirski,
	Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
	Jann Horn, Kees Cook, Christian Brauner

On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > process address space within the specified range. Pages are installed in the
> > current process page tables at fault time and removed by the mmu_interval_notifier
> > invalidate callback. No further memory management is involved.
> > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > or if the remote process address space was reaped by OOM, the remote mapping
> > fault handler returns VM_FAULT_SIGBUS.
> 
> I still think anything along these lines needs to meet the XPMEM use
> cases as well, we have to have more general solutions for such MM
> stuff:
> 
> https://gitlab.com/hjelmn/xpmem
> 
> However, I think this fundamentally falls into some of the same bad
> direction as xpmem.
> 
> I would much rather see this design copy & clone the VMA's than try to
> mirror the PTEs inside the VMAs from the remote into a single giant
> VMA and somehow split/mirror the VMA ops.

I'm on holiday for the next few days, but does the mshare() API work for
your use case?

Proposal: http://www.wil.cx/~willy/linux/sileby.html
Start at implementation:
http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare




* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 19:41   ` Matthew Wilcox
@ 2020-09-04 19:49     ` Jason Gunthorpe
  2020-09-04 20:08     ` Paolo Bonzini
  2020-12-01 18:01     ` Jason Gunthorpe
  2 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2020-09-04 19:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Mircea Cirjaliu, Andy Lutomirski,
	Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
	Jann Horn, Kees Cook, Christian Brauner

On Fri, Sep 04, 2020 at 08:41:39PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > > process address space within the specified range. Pages are installed in the
> > > current process page tables at fault time and removed by the mmu_interval_notifier
> > > invalidate callback. No further memory management is involved.
> > > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > > or if the remote process address space was reaped by OOM, the remote mapping
> > > fault handler returns VM_FAULT_SIGBUS.
> > 
> > I still think anything along these lines needs to meet the XPMEM use
> > cases as well, we have to have more general solutions for such MM
> > stuff:
> > 
> > https://gitlab.com/hjelmn/xpmem
> > 
> > However, I think this fundamentally falls into some of the same bad
> > direction as xpmem.
> > 
> > I would much rather see this design copy & clone the VMA's than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant
> > VMA and somehow split/mirror the VMA ops.
> 
> I'm on holiday for the next few days, but does the mshare() API work for
> your use case?
>
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start at implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

Let me ask around, it seems in a similar space at least!

Thanks,
Jason
 



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 19:41   ` Matthew Wilcox
  2020-09-04 19:49     ` Jason Gunthorpe
@ 2020-09-04 20:08     ` Paolo Bonzini
  2020-12-01 18:01     ` Jason Gunthorpe
  2 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2020-09-04 20:08 UTC (permalink / raw)
  To: Matthew Wilcox, Jason Gunthorpe
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Christian Brauner

On 04/09/20 21:41, Matthew Wilcox wrote:
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start at implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

The main difference between mshare() and this is that we don't want an
all-or-nothing thing.

Adalbert's introspection thing is rather simple, but what I would like
to be able to do (and the reason why I suggested the multi-pidfd
approach) is actually a bit more complex:

- a parent process creates a range of memory

- there are multiple VMs child processes.  One of this VM is a primary
VM, the others are enclave VMs.  VMs are created by the parent process
and each VM gets a different view of the memory range through pidfd_mem.

- once an enclave VM is created, the primary VM must not be able to
access the memory that has been assigned to the enclave VM.  If the
parent unmaps the memory in the primary VM, the child must SIGBUS when
it's accessed.

- if memory is removed from a VM and assigned to another, this should
not involve any copy at all.

For this usecase the range of memory would be backed by hugetlbfs,
anonymous memory, VFIO, whatever.  Userfaultfd is certainly part of the
picture here on the VM side.  Having userfaultfd on the parent side
would be nice though I don't have a use for it right now.  I'm not sure
about non-anonymous VMAs.

Thanks,

Paolo




* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 19:39 ` Andy Lutomirski
@ 2020-09-04 20:09   ` Paolo Bonzini
  2020-09-04 20:34     ` Andy Lutomirski
  2020-09-07 10:25   ` Mircea CIRJALIU - MELIU
  1 sibling, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2020-09-04 20:09 UTC (permalink / raw)
  To: Andy Lutomirski, Adalbert Lazăr
  Cc: Linux-MM, Linux API, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Mihai Donțu,
	Mircea Cirjaliu, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner

On 04/09/20 21:39, Andy Lutomirski wrote:
> I'm a little concerned that it's actually too clever and that maybe a
> more straightforward solution should be investigated.  I personally
> rather dislike the KVM model in which the guest address space mirrors
> the host (QEMU) address space rather than being its own thing.  In
> particular, the current model means that extra-special-strange
> mappings like SEV-encrypted memory are required to be present in the
> QEMU page tables in order for the guest to see them.
> 
> (If I had noticed that last bit before it went upstream, I would have
> NAKked it.  I would still like to see it deprecated and ideally
> eventually removed from the kernel.  We have absolutely no business
> creating incoherent mappings like this.)

NACK first and ask second, right Andy?  I see that nothing has changed
since Alan Cox left Linux.

Paolo




* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 19:19 ` Florian Weimer
@ 2020-09-04 20:18   ` Paolo Bonzini
  2020-09-07  8:33     ` Christian Brauner
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2020-09-04 20:18 UTC (permalink / raw)
  To: Florian Weimer, Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On 04/09/20 21:19, Florian Weimer wrote:
> I'm not sure what the advantage is of returning separate file
> descriptors, and nit operating directly on the pidfd.

For privilege separation.  So far, the common case of pidfd operations
has been that whoever possesses a pidfd has "power" over the target
process.  Here however we also want to cover the case where one
privileged process wants to set up and manage a memory range for
multiple children.  The privilege process can do so by passing the
access file descriptor via SCM_RIGHTS.

We also want different children to have visibility over different
ranges, which is why there are multiple control fds rather than using
the pidfd itself as control fd.  You could have the map/unmap/lock ioctl
on the pidfd itself and the access fd as an argument of the ioctl, but
it seems cleaner to represent the pidfd-mem control capability as its
own file descriptor.
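
Concretely, the handoff is ordinary fd passing over a unix socket; a
minimal sender sketch (standard SCM_RIGHTS, nothing specific to pidfd-mem):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_access_fd(int sock, int fd)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}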

Paolo




* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 20:09   ` Paolo Bonzini
@ 2020-09-04 20:34     ` Andy Lutomirski
  2020-09-04 21:58       ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2020-09-04 20:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andy Lutomirski, Adalbert Lazăr, Linux-MM, Linux API,
	Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Mihai Donțu, Mircea Cirjaliu, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner


>> On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> 
>> On 04/09/20 21:39, Andy Lutomirski wrote:
>> I'm a little concerned that it's actually too clever and that maybe a
>> more straightforward solution should be investigated.  I personally
>> rather dislike the KVM model in which the guest address space mirrors
>> the host (QEMU) address space rather than being its own thing.  In
>> particular, the current model means that extra-special-strange
>> mappings like SEV-encrypted memory are required to be present in the
>> QEMU page tables in order for the guest to see them.
>> (If I had noticed that last bit before it went upstream, I would have
>> NAKked it.  I would still like to see it deprecated and ideally
>> eventually removed from the kernel.  We have absolutely no business
>> creating incoherent mappings like this.)
> 
> NACK first and ask second, right Andy?  I see that nothing has changed
> since Alan Cox left Linux.

NACKs are negotiable.  And maybe someone can convince me that the SEV mapping scheme is reasonable, but I would be surprised.

Regardless, you seem to be suggesting that you want to have enclave VMs in which the enclave can see some memory that the parent VM can’t see. How does this fit into the KVM mapping model?  How does this remote mapping mechanism help?  Do you want QEMU to have that memory mapped in its own pagetables?

As it stands, the way that KVM memory mappings are created seems to be convenient, but it also seems to be resulting in increasing bizarre userspace mappings.  At what point is the right solution to decouple KVM’s mappings from QEMU’s?


* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 20:34     ` Andy Lutomirski
@ 2020-09-04 21:58       ` Paolo Bonzini
  2020-09-04 23:17         ` Andy Lutomirski
  2020-09-07  7:05         ` Christoph Hellwig
  0 siblings, 2 replies; 44+ messages in thread
From: Paolo Bonzini @ 2020-09-04 21:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Adalbert Lazăr, Linux-MM, Linux API,
	Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Mihai Donțu, Mircea Cirjaliu, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On 04/09/20 22:34, Andy Lutomirski wrote:
> On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 04/09/20 21:39, Andy Lutomirski wrote:
>>> I'm a little concerned
>>> that it's actually too clever and that maybe a more
>>> straightforward solution should be investigated.  I personally 
>>> rather dislike the KVM model in which the guest address space
>>> mirrors the host (QEMU) address space rather than being its own
>>> thing.  In particular, the current model means that
>>> extra-special-strange mappings like SEV-encrypted memory are
>>> required to be present in the QEMU page tables in order for the
>>> guest to see them. (If I had noticed that last bit before it went
>>> upstream, I would have NAKked it.  I would still like to see it
>>> deprecated and ideally eventually removed from the kernel.  We
>>> have absolutely no business creating incoherent mappings like
>>> this.)
>> 
>> NACK first and ask second, right Andy?  I see that nothing has
>> changed since Alan Cox left Linux.
> 
> NACKs are negotiable.  And maybe someone can convince me that the SEV
> mapping scheme is reasonable, but I would be surprised.

So why say NACK?  Any half-decent maintainer would hold off on merging the
patches at least until the discussion is over.  Also I suppose any
deprecation proposal should come with a description of an alternative.

Anyway, for SEV the problem is DMA.  There is no way to know in advance
which memory the guest will use for I/O; it can change at any time and
the same host-physical address can even be mapped both as C=0 and C=1 by
the guest.  There's no communication protocol between the guest and the
host to tell the host _which_ memory should be mapped in QEMU.  (One was
added to support migration, but that doesn't even work with SEV-ES
processors where migration is planned to happen mostly with help from
the guest, either in the firmware or somewhere else).

But this is a digression.  (If you would like to continue the discussion
please trim the recipient list and change the subject).

> Regardless, you seem to be suggesting that you want to have enclave
> VMs in which the enclave can see some memory that the parent VM can’t
> see. How does this fit into the KVM mapping model?  How does this
> remote mapping mechanism help?  Do you want QEMU to have that memory
> mapped in its own pagetables?

There are three processes:

- the manager, which is the parent of the VMs and uses the pidfd_mem
system call

- the primary VM

- the enclave VM(s)

The primary VM and the enclave VM(s) would each get a different memory
access file descriptor.  QEMU would treat them no differently from any
other externally-provided memory backend, say hugetlbfs or memfd, so
yeah they would be mmap-ed to userspace and the host virtual address
passed as usual to KVM.

Enclave VMs could be used to store secrets and perform crypto for
example.  The enclave is measured at boot, any keys or other stuff it
needs can be provided out-of-band from the manager

The manager can decide at any time to hide some memory from the parent
VM (in order to give it to an enclave).  This would actually be done on
 request of the parent VM itself, and QEMU would probably be so kind as
to replace the "hole" left in the guest memory with zeroes.  But QEMU is
untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
 privilege separation model that was implemented here.

Actually Amazon has already created something like that and Andra-Irina
Paraschiv has posted patches on the list for this.  Their implementation
is not open source, but this pidfd-mem concept is something that Andra,
Alexander Graf and I came up with as a way to 1) reimplement the feature
upstream, 2) satisfy Bitdefender's need for memory introspection and 3)
add what seemed a useful interface anyway, for example to replace
PTRACE_{PEEK,POKE}DATA.  Though (3) would only need pread/pwrite, not
mmap which adds a lot of the complexity.

> As it stands, the way that KVM memory mappings are created seems to
> be convenient, but it also seems to be resulting in increasing
> bizarre userspace mappings.  At what point is the right solution to
> decouple KVM’s mappings from QEMU’s?

So what you are suggesting is that KVM manages its own address space
instead of host virtual addresses (and with no relationship to host
virtual addresses, it would be just a "cookie")?  It would then need a
couple ioctls to mmap/munmap (creating and deleting VMAs) into the
address space, and those cookies would be passed to
KVM_SET_USER_MEMORY_REGION.  QEMU would still need access to these VMAs,
would it mmap a file descriptor provided by KVM?  All in all the
implementation seems quite complex, and I don't understand why it would
avoid incoherent SEV mappings; what am I missing?

Paolo




* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 21:58       ` Paolo Bonzini
@ 2020-09-04 23:17         ` Andy Lutomirski
  2020-09-05 18:27           ` Paolo Bonzini
  2020-09-07 12:41           ` Mircea CIRJALIU - MELIU
  2020-09-07  7:05         ` Christoph Hellwig
  1 sibling, 2 replies; 44+ messages in thread
From: Andy Lutomirski @ 2020-09-04 23:17 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andy Lutomirski, Adalbert Lazăr, Linux-MM, Linux API,
	Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Mihai Donțu, Mircea Cirjaliu, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 4, 2020 at 2:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 04/09/20 22:34, Andy Lutomirski wrote:
> > On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> On 04/09/20 21:39, Andy Lutomirski wrote:
> >>> I'm a little concerned
> >>> that it's actually too clever and that maybe a more
> >>> straightforward solution should be investigated.  I personally
> >>> rather dislike the KVM model in which the guest address space
> >>> mirrors the host (QEMU) address space rather than being its own
> >>> thing.  In particular, the current model means that
> >>> extra-special-strange mappings like SEV-encrypted memory are
> >>> required to be present in the QEMU page tables in order for the
> >>> guest to see them. (If I had noticed that last bit before it went
> >>> upstream, I would have NAKked it.  I would still like to see it
> >>> deprecated and ideally eventually removed from the kernel.  We
> >>> have absolutely no business creating incoherent mappings like
> >>> this.)
> >>
> >> NACK first and ask second, right Andy?  I see that nothing has
> >> changed since Alan Cox left Linux.
> >
> > NACKs are negotiable.  And maybe someone can convince me that the SEV
> > mapping scheme is reasonable, but I would be surprised.
>
> So why say NACK?  Any half-decent maintainer would hold on merging the
> patches at least until the discussion is over.  Also I suppose any
> deprecation proposal should come with a description of an alternative.

I suppose that's a fair point.

>
> Anyway, for SEV the problem is DMA.  There is no way to know in advance
> which memory the guest will use for I/O; it can change at any time and
> the same host-physical address can even be mapped both as C=0 and C=1 by
> the guest.  There's no communication protocol between the guest and the
> host to tell the host _which_ memory should be mapped in QEMU.  (One was
> added to support migration, but that doesn't even work with SEV-ES
> processors where migration is planned to happen mostly with help from
> the guest, either in the firmware or somewhere else).

There's sev_pin_memory(), so QEMU must have at least some idea of
which memory could potentially be encrypted.  Is it in fact the case
that QEMU doesn't know that some SEV pinned memory might actually be
used for DMA until the guest tries to do DMA on that memory?  If so,
yuck.

>
> But this is a digression.  (If you would like to continue the discussion
> please trim the recipient list and change the subject).

Fair enough.  And my apologies for being grumpier about SEV than was called for.

>
> > Regardless, you seem to be suggesting that you want to have enclave
> > VMs in which the enclave can see some memory that the parent VM can’t
> > see. How does this fit into the KVM mapping model?  How does this
> > remote mapping mechanism help?  Do you want QEMU to have that memory
> > mapped in its own pagetables?
>
> There are three processes:
>
> - the manager, which is the parent of the VMs and uses the pidfd_mem
> system call
>
> - the primary VM
>
> - the enclave VM(s)
>
> The primary VM and the enclave VM(s) would each get a different memory
> access file descriptor.  QEMU would treat them no differently from any
> other externally-provided memory backend, say hugetlbfs or memfd, so
> yeah they would be mmap-ed to userspace and the host virtual address
> passed as usual to KVM.

Would the VM processes mmap() these descriptors, or would KVM learn
how to handle that memory without it being mapped?

>
> Enclave VMs could be used to store secrets and perform crypto for
> example.  The enclave is measured at boot; any keys or other stuff it
> needs can be provided out-of-band from the manager.
>
> The manager can decide at any time to hide some memory from the parent
> VM (in order to give it to an enclave).  This would actually be done on
>  request of the parent VM itself, and QEMU would probably be so kind as
> to replace the "hole" left in the guest memory with zeroes.  But QEMU is
> untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
>  privilege separation model that was implemented here.

How does this work?  Is there a revoke mechanism, or does the parent
just munmap() the memory itself?

>
> Actually Amazon has already created something like that and Andra-Irina
> Paraschiv has posted patches on the list for this.  Their implementation
> is not open source, but this pidfd-mem concept is something that Andra,
> Alexander Graf and I came up with as a way to 1) reimplement the feature
> upstream, 2) satisfy Bitdefender's need for memory introspection, and 3)
> add what seemed a useful interface anyway, for example to replace
> PTRACE_{PEEK,POKE}DATA.  Though (3) would only need pread/pwrite, not
> mmap which adds a lot of the complexity.
>
> > As it stands, the way that KVM memory mappings are created seems to
> > be convenient, but it also seems to be resulting in increasing
> > bizarre userspace mappings.  At what point is the right solution to
> > decouple KVM’s mappings from QEMU’s?
>
> So what you are suggesting is that KVM manages its own address space
> instead of host virtual addresses (and with no relationship to host
> virtual addresses, it would be just a "cookie")?  It would then need a
> couple ioctls to mmap/munmap (creating and deleting VMAs) into the
> address space, and those cookies would be passed to
> KVM_SET_USER_MEMORY_REGION.  QEMU would still need access to these VMAs,
> would it mmap a file descriptor provided by KVM?  All in all the
> implementation seems quite complex, and I don't understand why it would
> avoid incoherent SEV mappings; what am I missing?

It might not avoid incoherent SEV mappings in particular, but it would
certainly enable other, somewhat related usecases.  For example, QEMU
could have KVM map a memfd without itself mapping that memfd, which
would reduce the extent to which the memory would be exposed to an
attacker who can read QEMU memory.

For this pidfd-mem scheme in particular, it might avoid the nasty
corner case I mentioned.  With pidfd-mem as in this patchset, I'm
concerned about what happens when process A maps some process B
memory, process B maps some of process A's memory, and there's a
recursive mapping that results.  Or when a process maps its own
memory, for that matter.  If KVM could map fd's directly, then there
could be a parallel mechanism for KVM to import portions of more than
one process's address space, and this particular problem would be
avoided.  So a process would create a pidfd-mem-like object and pass
that to KVM (via an intermediary process or directly) and KVM could
map that, but no normal process would be allowed to map it.  This
avoids the recursion problems.

Or memfd could get fancier with operations to split memfds, remove
pages from memfds, etc.  Maybe that's overkill.

(Or a fancy recursion detector could be built, but this has been a
pain point in AF_UNIX, epoll, etc in the past.  It may be solvable,
but it won't be pretty.)

I admit that allowing KVM to map fd's directly without some specific
vm_operations support for this could be challenging, but ISTM kvm
could plausibly own an mm_struct and pagetables at the cost of some
wasted memory.  The result would, under the hood, work more or less
like the current implementation, but the API would be quite different.

>
> Paolo
>



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 23:17         ` Andy Lutomirski
@ 2020-09-05 18:27           ` Paolo Bonzini
  2020-09-07  8:38             ` Christian Brauner
  2020-09-07 12:41           ` Mircea CIRJALIU - MELIU
  1 sibling, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2020-09-05 18:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Adalbert Lazăr, Linux-MM, Linux API, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Mihai Donțu,
	Mircea Cirjaliu, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner

On 05/09/20 01:17, Andy Lutomirski wrote:
> There's sev_pin_memory(), so QEMU must have at least some idea of
> which memory could potentially be encrypted.  Is it in fact the case
> that QEMU doesn't know that some SEV pinned memory might actually be
> used for DMA until the guest tries to do DMA on that memory?  If so,
> yuck.

Yes.  All the memory is pinned, all the memory could potentially be used
for DMA (of garbage if it's encrypted).  And it's the same for pretty
much all protected VM extensions (SEV, POWER, s390, Intel TDX).

>> The primary VM and the enclave VM(s) would each get a different memory
>> access file descriptor.  QEMU would treat them no differently from any
>> other externally-provided memory backend, say hugetlbfs or memfd, so
>> yeah they would be mmap-ed to userspace and the host virtual address
>> passed as usual to KVM.
> 
> Would the VM processes mmap() these descriptors, or would KVM learn
> how to handle that memory without it being mapped?

The idea is that the process mmaps them, QEMU would treat them just the
same as a hugetlbfs file descriptor for example.
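
For example, the wiring would look roughly like this (sketch, error
handling elided; access_fd/offset come from the manager, slot/gpa from
the backend configuration):

	void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 access_fd, offset);
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (unsigned long)hva,
	};

	/* same call QEMU already does for hugetlbfs/memfd backends */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);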

>> The manager can decide at any time to hide some memory from the parent
>> VM (in order to give it to an enclave).  This would actually be done on
>> request of the parent VM itself [...] But QEMU is
>> untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
>> privilege separation model that was implemented here.
> 
> How does this work?  Is there a revoke mechanism, or does the parent
> just munmap() the memory itself?

The parent has ioctls to add and remove memory from the pidfd-mem.  So
unmapping is just calling the ioctl that removes a range.
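
From the manager side that is just (sketch, using the uapi structs from
patch 4):

	struct pidfd_mem_unmap unmap = {
		.offset = offset,
		.size   = size,
	};

	/* existing mmap()s of this range in the children lose their
	 * pages and SIGBUS on the next access */
	if (ioctl(ctl_fd, PIDFD_MEM_UNMAP, &unmap) < 0)
		perror("PIDFD_MEM_UNMAP");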

>> So what you are suggesting is that KVM manages its own address space
>> instead of host virtual addresses (and with no relationship to host
>> virtual addresses, it would be just a "cookie")?
> 
> [...] For this pidfd-mem scheme in particular, it might avoid the nasty
> corner case I mentioned.  With pidfd-mem as in this patchset, I'm
> concerned about what happens when process A maps some process B
> memory, process B maps some of process A's memory, and there's a
> recursive mapping that results.  Or when a process maps its own
> memory, for that matter.
> 
> Or memfd could get fancier with operations to split memfds, remove
> pages from memfds, etc.  Maybe that's overkill.

Doing it directly with memfd is certainly an option, especially since
MFD_HUGE_* exists.  Basically you'd have a system call to create a
secondary view of the memfd, and the syscall interface could still be
very similar to what is in this patch, in particular the control/access
pair.  Probably this could also be used to implement Matthew Wilcox's ideas.
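
With made-up names, just to illustrate the shape (memfd_view() does not
exist, the ioctls are the ones from this patch):

	int fds[2];	/* fds[0] = control, fds[1] = access */

	/* hypothetical syscall: create a secondary view of the memfd */
	memfd_view(memfd, fds, 0);

	/* expose part of the memfd through the view, as in this patch */
	struct pidfd_mem_map map = {
		.address = src_offset,	/* here an offset into the memfd */
		.offset  = view_offset,
		.size    = length,
	};
	ioctl(fds[0], PIDFD_MEM_MAP, &map);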

I still believe that the pidfd-mem concept has merit as a
"capability-like" PTRACE_{PEEK,POKE}DATA replacement, but it would not
need any of the privilege separation or mmap support, only direct read/write.
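
E.g. (sketch; pread() on the access fd is not implemented in this
series, and this assumes the identity view where the file offset equals
the target virtual address):

	uint64_t word;

	/* PTRACE_PEEKDATA, but a capability-like fd instead of a pid */
	if (pread(access_fd, &word, sizeof(word), remote_va) != sizeof(word))
		err(1, "pread");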

So there's two concepts mixed in one interface in this patch, with two
completely different usecases.  Merging them is clever, but perhaps too
clever.  I can say that since it was my idea. :D

Thanks,

Paolo




* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 21:58       ` Paolo Bonzini
  2020-09-04 23:17         ` Andy Lutomirski
@ 2020-09-07  7:05         ` Christoph Hellwig
  2020-09-07  8:44           ` Paolo Bonzini
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2020-09-07  7:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andy Lutomirski, Andy Lutomirski, Adalbert Lazăr, Linux-MM,
	Linux API, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Mihai Donțu, Mircea Cirjaliu, Arnd Bergmann,
	Sargun Dhillon, Aleksa Sarai, Oleg Nesterov, Jann Horn,
	Kees Cook, Matthew Wilcox, Christian Brauner

On Fri, Sep 04, 2020 at 11:58:57PM +0200, Paolo Bonzini wrote:
> So why say NACK?  Any half-decent maintainer would hold on merging the
> patches at least until the discussion is over.  Also I suppose any
> deprecation proposal should come with a description of an alternative.

Please stop these totally pointless and overly aggressive personal
attacks.  A maintainer's prime job is to say no.



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 20:18   ` Paolo Bonzini
@ 2020-09-07  8:33     ` Christian Brauner
  0 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2020-09-07  8:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Florian Weimer, Adalbert Lazăr, linux-mm, linux-api,
	Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Mihai Donțu, Mircea Cirjaliu, Andy Lutomirski,
	Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
	Jann Horn, Kees Cook, Matthew Wilcox

Hey Paolo,

On Fri, Sep 04, 2020 at 10:18:17PM +0200, Paolo Bonzini wrote:
> On 04/09/20 21:19, Florian Weimer wrote:
> > I'm not sure what the advantage is of returning separate file
> descriptors, and not operating directly on the pidfd.
> 
> For privilege separation.  So far, the common case of pidfd operations
> has been that whoever possesses a pidfd has "power" over the target

I may misunderstand you, but that's actually not quite true. Currently,
pidfds are just handles on processes and only convey identity. They
don't guarantee any sort of privilege over the target process. We have
had discussions about treating them more as a capability in the future,
but that needs to be carefully thought out.

> process.  Here however we also want to cover the case where one
> privileged process wants to set up and manage a memory range for
> multiple children.  The privileged process can do so by passing the
> access file descriptor via SCM_RIGHTS.
> 
> We also want different children to have visibility over different
> ranges, which is why there are multiple control fds rather than using
> the pidfd itself as control fd.  You could have the map/unmap/lock ioctl
> on the pidfd itself and the access fd as an argument of the ioctl, but
> it seems cleaner to represent the pidfd-mem control capability as its
> own file descriptor.

We have very much on purpose avoided adding ioctls() on top of pidfds
and I'm not fond of the idea of starting to add them. Supporting
ioctl()s on an fd usually opens up a can of worms and makes sneaking in
questionable features more likely (I'm not saying your patchset does
that!).
If this interface holds up, I would ask you to please either keep this
as a separate fd type or propose system calls only.

Christian



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-05 18:27           ` Paolo Bonzini
@ 2020-09-07  8:38             ` Christian Brauner
  0 siblings, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2020-09-07  8:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andy Lutomirski, Adalbert Lazăr, Linux-MM, Linux API,
	Andrew Morton, Alexander Graf, Stefan Hajnoczi, Jerome Glisse,
	Mihai Donțu, Mircea Cirjaliu, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox

On Sat, Sep 05, 2020 at 08:27:29PM +0200, Paolo Bonzini wrote:
> On 05/09/20 01:17, Andy Lutomirski wrote:
> > There's sev_pin_memory(), so QEMU must have at least some idea of
> > which memory could potentially be encrypted.  Is it in fact the case
> > that QEMU doesn't know that some SEV pinned memory might actually be
> > used for DMA until the guest tries to do DMA on that memory?  If so,
> > yuck.
> 
> Yes.  All the memory is pinned, all the memory could potentially be used
> for DMA (of garbage if it's encrypted).  And it's the same for pretty
> much all protected VM extensions (SEV, POWER, s390, Intel TDX).
> 
> >> The primary VM and the enclave VM(s) would each get a different memory
> >> access file descriptor.  QEMU would treat them no differently from any
> >> other externally-provided memory backend, say hugetlbfs or memfd, so
> >> yeah they would be mmap-ed to userspace and the host virtual address
> >> passed as usual to KVM.
> > 
> > Would the VM processes mmap() these descriptors, or would KVM learn
> > how to handle that memory without it being mapped?
> 
> The idea is that the process mmaps them, QEMU would treat them just the
> same as a hugetlbfs file descriptor for example.
> 
> >> The manager can decide at any time to hide some memory from the parent
> >> VM (in order to give it to an enclave).  This would actually be done on
> >> request of the parent VM itself [...] But QEMU is
> >> untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
> >> privilege separation model that was implemented here.
> > 
> > How does this work?  Is there a revoke mechanism, or does the parent
> > just munmap() the memory itself?
> 
> The parent has ioctls to add and remove memory from the pidfd-mem.  So
> unmapping is just calling the ioctl that removes a range.

I would strongly suggest we move away from ioctl() patterns. If
something like this comes up in the future, just propose them as system
calls.

> 
> >> So what you are suggesting is that KVM manages its own address space
> >> instead of host virtual addresses (and with no relationship to host
> >> virtual addresses, it would be just a "cookie")?
> > 
> > [...] For this pidfd-mem scheme in particular, it might avoid the nasty
> > corner case I mentioned.  With pidfd-mem as in this patchset, I'm
> > concerned about what happens when process A maps some process B
> > memory, process B maps some of process A's memory, and there's a
> > recursive mapping that results.  Or when a process maps its own
> > memory, for that matter.
> > 
> > Or memfd could get fancier with operations to split memfds, remove
> > pages from memfds, etc.  Maybe that's overkill.
> 
> Doing it directly with memfd is certainly an option, especially since
> MFD_HUGE_* exists.  Basically you'd have a system call to create a

I like that idea way better to be honest.

Christian



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-07  7:05         ` Christoph Hellwig
@ 2020-09-07  8:44           ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2020-09-07  8:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Andy Lutomirski, Adalbert Lazăr, Linux-MM,
	Linux API, Andrew Morton, Alexander Graf, Stefan Hajnoczi,
	Jerome Glisse, Mihai Donțu, Mircea Cirjaliu, Arnd Bergmann,
	Sargun Dhillon, Aleksa Sarai, Oleg Nesterov, Jann Horn,
	Kees Cook, Matthew Wilcox, Christian Brauner

On 07/09/20 09:05, Christoph Hellwig wrote:
> On Fri, Sep 04, 2020 at 11:58:57PM +0200, Paolo Bonzini wrote:
>> So why say NACK?  Any half-decent maintainer would hold on merging the
>> patches at least until the discussion is over.  Also I suppose any
>> deprecation proposal should come with a description of an alternative.
>
> Please stop these totally pointless and overly aggressive personal
> attacks.  A maintainers prime job is to say no.

Christoph,

I'm not sure who is attacking whom honestly, especially since Andy later
said "Fair enough.  And my apologies for bring grumpier about SEV than
was called for" and we had a perfectly civil and technical conversation.

Paolo




* RE: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 19:39 ` Andy Lutomirski
  2020-09-04 20:09   ` Paolo Bonzini
@ 2020-09-07 10:25   ` Mircea CIRJALIU - MELIU
  1 sibling, 0 replies; 44+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2020-09-07 10:25 UTC (permalink / raw)
  To: Andy Lutomirski, Adalbert Lazăr
  Cc: Linux-MM, Linux API, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
	Jann Horn, Kees Cook, Matthew Wilcox, Christian Brauner

> From: Andy Lutomirski <luto@kernel.org>
> I think this is very clever, but I find myself wondering what happens if people
> start trying to abuse this by, for example, setting up a remote mapping
> pointing to fun regions like userfaultfd or another remote mapping.

Can ptrace() be used to abuse fun regions of a process address space?
Remote mapping recursiveness can be eliminated by checking the VMA the
remote page is extracted from. (NYI)
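
Something like this in mirror_vm_fault(), at the point where
src_mm->mmap_sem is held (sketch, not part of this series):

	struct vm_area_struct *src_vma;

	/* refuse to mirror pages that are themselves provided by a
	 * mirror VMA in the source process */
	src_vma = find_vma(src_mm, req_addr);
	if (src_vma && is_mirror_vma(src_vma)) {
		up_read(&src_mm->mmap_sem);
		goto out_err;		/* VM_FAULT_SIGBUS */
	}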

> I'm a little concerned that it's actually too clever and that maybe a more
> straightforward solution should be investigated.  I personally rather dislike
> the KVM model in which the guest address space mirrors the host (QEMU)
> address space rather than being its own thing.

I've seen a few internal mmap()s throughout the kernel. Just wondering how
memory accounting is implemented for these cases. Will this be reflected in
the memory usage of the process that controls such a module? Especially in
the case of a virtual machine that needs a few GBs of memory.

Mircea



* RE: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 23:17         ` Andy Lutomirski
  2020-09-05 18:27           ` Paolo Bonzini
@ 2020-09-07 12:41           ` Mircea CIRJALIU - MELIU
  1 sibling, 0 replies; 44+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2020-09-07 12:41 UTC (permalink / raw)
  To: Andy Lutomirski, Paolo Bonzini
  Cc: Adalbert Lazăr, Linux-MM, Linux API, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Mihai Donțu,
	Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
	Jann Horn, Kees Cook, Matthew Wilcox, Christian Brauner

> From: Andy Lutomirski <luto@kernel.org>
> > > As it stands, the way that KVM memory mappings are created seems to
> > > be convenient, but it also seems to be resulting in increasing
> > > bizarre userspace mappings.  At what point is the right solution to
> > > decouple KVM’s mappings from QEMU’s?

Convenience is one of the drivers of code reuse.

> > So what you are suggesting is that KVM manages its own address space
> > instead of host virtual addresses (and with no relationship to host
> > virtual addresses, it would be just a "cookie")?  It would then need a
> > couple ioctls to mmap/munmap (creating and deleting VMAs) into the
> > address space, and those cookies would be passed to
> > KVM_SET_USER_MEMORY_REGION.  QEMU would still need access to
> these
> > VMAs, would it mmap a file descriptor provided by KVM?  All in all the
> > implementation seems quite complex, and I don't understand why it
> > would avoid incoherent SEV mappings; what am I missing?
> 
> It might not avoid incoherent SEV mappings in particular, but it would
> certainly enable other, somewhat related usecases.  For example, QEMU
> could have KVM map a memfd without itself mapping that memfd, which
> would reduce the extent to which the memory would be exposed to an
> attacker who can read QEMU memory.

Isn't this security through obscurity?

> For this pidfd-mem scheme in particular, it might avoid the nasty corner case
> I mentioned.  With pidfd-mem as in this patchset, I'm concerned about what
> happens when process A maps some process B memory, process B maps
> some of process A's memory, and there's a recursive mapping that results.
> Or when a process maps its own memory, for that matter.  If KVM could map
> fd's directly, then there could be a parallel mechanism for KVM to import
> portions of more than one process's address space, and this particular
> problem would be avoided.  So a process would create a pidfd-mem-like
> object and pass that to KVM (via an intermediary process or directly) and
> KVM could map that, but no normal process would be allowed to map it.  This
> avoids the recursion problems.
> 
> Or memfd could get fancier with operations to split memfds, remove pages
> from memfds, etc.  Maybe that's overkill.
> 
> (Or a fancy recursion detector could be built, but this has been a pain point in
> AF_UNIX, epoll, etc in the past.  It may be solvable, but it won't be pretty.)
> 
> I admit that allowing KVM to map fd's directly without some specific
> vm_operations support for this could be challenging, but ISTM kvm could
> plausibly own an mm_struct and pagetables at the cost of some wasted
> memory.  The result would, under the hood, work more or less like the
> current implementation, but the API would be quite different.

This looks like an attempt to pass memory-related concerns to KVM developers.
The userspace mapping mechanism is good as it is. Probably not perfect, just 
good. The problem is that it's stuck to a few VMA models and needs to evolve
towards more bizarre/sketchy/weird/fragile patterns.

Also, the memory code is some of the most tightly coupled code I have
seen, which probably explains the maintainers' fear of trying something new.

Mircea



* Re: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-04 11:31 ` [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
  2020-09-04 17:55   ` Oleg Nesterov
@ 2020-09-07 14:30   ` Oleg Nesterov
  2020-09-07 15:16     ` Adalbert Lazăr
  2020-09-09  8:32     ` Mircea CIRJALIU - MELIU
  2020-09-07 15:02   ` Christian Brauner
  2 siblings, 2 replies; 44+ messages in thread
From: Oleg Nesterov @ 2020-09-07 14:30 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner

it seems that nobody is going to review this patch ;)

So I tried to read mirror_vm_fault() and the usage of mmap_sem doesn't
look right to me. But let me repeat, this is not my area and I can
easily be wrong, so please correct me.

On 09/04, Adalbert Lazăr wrote:
>
> +static vm_fault_t mirror_vm_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct remote_vma_context *ctx = vma->vm_private_data;
> +	struct remote_view *view = ctx->view;
> +	struct file *file = vma->vm_file;
> +	struct remote_file_context *fctx = file->private_data;
> +	unsigned long req_addr;
> +	unsigned int gup_flags;
> +	struct page *req_page;
> +	vm_fault_t result = VM_FAULT_SIGBUS;
> +	struct mm_struct *src_mm = fctx->mm;
> +	unsigned long seq;
> +	int idx;
> +
> +fault_retry:
> +	seq = mmu_interval_read_begin(&view->mmin);
> +
> +	idx = srcu_read_lock(&fctx->fault_srcu);
> +
> +	/* check if view was invalidated */
> +	if (unlikely(!READ_ONCE(view->valid))) {
> +		pr_debug("%s: region [%lx-%lx) was invalidated!!\n", __func__,
> +			view->offset, view->offset + view->size);
> +		goto out_invalid;		/* VM_FAULT_SIGBUS */
> +	}
> +
> +	/* drop current mm semaphore */
> +	up_read(&current->mm->mmap_sem);

Please use mmap_read_lock/unlock(mm) instead of down/up_read(mmap_sem).

But why is it safe to drop ->mmap_sem without checking
FAULT_FLAG_ALLOW_RETRY/RETRY_NOWAIT ?
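
For a handler that has to drop mmap_sem I would expect the canonical
guard first, something like (sketch):

	/* cannot drop mmap_sem at all -> cannot take the remote one */
	if (!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))
		return VM_FAULT_SIGBUS;

	/* caller wants the lock kept; tell it to come back later */
	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
		return VM_FAULT_RETRY;

	up_read(&current->mm->mmap_sem);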

> +	/* take remote mm semaphore */
> +	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> +		if (!down_read_trylock(&src_mm->mmap_sem)) {

I don't understand this... perhaps you meant FAULT_FLAG_RETRY_NOWAIT ?

> +	 * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
> +	 * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
> +	 * not set.

Well, iiuc FAULT_FLAG_ALLOW_RETRY means that ->fault() _can_ drop mmap_sem
and return VM_FAULT_RETRY (unless NOWAIT).

> +	if (vmf->flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_TRIED)) {
> +		if (!(vmf->flags & FAULT_FLAG_KILLABLE))
> +			if (current && fatal_signal_pending(current)) {
> +				up_read(&current->mm->mmap_sem);
> +				return VM_FAULT_RETRY;

I fail to understand this. But in any case, you do not need to check
current != NULL and up_read() looks wrong if RETRY_NOWAIT?

Oleg.




* Re: [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call
  2020-09-04 11:31 ` [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
  2020-09-04 19:18   ` Florian Weimer
@ 2020-09-07 14:55   ` Christian Brauner
  1 sibling, 0 replies; 44+ messages in thread
From: Christian Brauner @ 2020-09-07 14:55 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox, Christian Brauner

On Fri, Sep 04, 2020 at 02:31:16PM +0300, Adalbert Lazăr wrote:
> From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> 
> This system call returns 2 fds for inspecting the address space of a
> remote process: one for control and one for access. Use according to
> remote mapping specifications.
> 
> Cc: Christian Brauner <christian@brauner.io>
> Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/pid.h                    |  1 +
>  include/linux/syscalls.h               |  1 +
>  include/uapi/asm-generic/unistd.h      |  2 +
>  kernel/exit.c                          |  2 +-
>  kernel/pid.c                           | 55 ++++++++++++++++++++++++++
>  7 files changed, 62 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 54581ac671b4..ca1b5a32dbc5 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -440,5 +440,6 @@
>  433	i386	fspick			sys_fspick
>  434	i386	pidfd_open		sys_pidfd_open
>  435	i386	clone3			sys_clone3
> +436     i386    pidfd_mem               sys_pidfd_mem
>  437	i386	openat2			sys_openat2
>  438	i386	pidfd_getfd		sys_pidfd_getfd
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 37b844f839bc..6138d3d023f8 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -357,6 +357,7 @@
>  433	common	fspick			sys_fspick
>  434	common	pidfd_open		sys_pidfd_open
>  435	common	clone3			sys_clone3
> +436     common  pidfd_mem               sys_pidfd_mem
>  437	common	openat2			sys_openat2
>  438	common	pidfd_getfd		sys_pidfd_getfd
>  
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index cc896f0fc4e3..9ec23ab23fd4 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -76,6 +76,7 @@ extern const struct file_operations pidfd_fops;
>  
>  struct file;
>  
> +extern struct pid *pidfd_get_pid(unsigned int fd);
>  extern struct pid *pidfd_pid(const struct file *file);
>  
>  static inline struct pid *get_pid(struct pid *pid)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 1815065d52f3..621f3d52ed4e 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -934,6 +934,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
>  asmlinkage long sys_syncfs(int fd);
>  asmlinkage long sys_setns(int fd, int nstype);
>  asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
> +asmlinkage long sys_pidfd_mem(int pidfd, int __user *fds, unsigned int flags);
>  asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
>  			     unsigned int vlen, unsigned flags);
>  asmlinkage long sys_process_vm_readv(pid_t pid,
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 3a3201e4618e..2663afc03c86 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -850,6 +850,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
>  #define __NR_clone3 435
>  __SYSCALL(__NR_clone3, sys_clone3)
>  #endif
> +#define __NR_pidfd_mem 436
> +__SYSCALL(__NR_pidfd_mem, sys_pidfd_mem)
>  
>  #define __NR_openat2 437
>  __SYSCALL(__NR_openat2, sys_openat2)
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 389a88cb3081..37cd8949e606 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1464,7 +1464,7 @@ static long do_wait(struct wait_opts *wo)
>  	return retval;
>  }
>  
> -static struct pid *pidfd_get_pid(unsigned int fd)
> +struct pid *pidfd_get_pid(unsigned int fd)
>  {
>  	struct fd f;
>  	struct pid *pid;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c835b844aca7..c9c49edf4a8a 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -42,6 +42,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/sched/task.h>
>  #include <linux/idr.h>
> +#include <linux/remote_mapping.h>
>  
>  struct pid init_struct_pid = {
>  	.count		= REFCOUNT_INIT(1),
> @@ -565,6 +566,60 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
>  	return fd;
>  }
>  
> +/**
> + * pidfd_mem() - Allow access to process address space.
> + *
> + * @pidfd: pid file descriptor for the target process
> + * @fds:   array where the control and access file descriptors are returned
> + * @flags: flags to pass
> + *
> + * This creates a pair of file descriptors used to gain access to the
> + * target process memory. The control fd is used to establish a linear
> + * mapping between an offset range and a userspace address range.
> + * The access fd is used to mmap(offset range) on the client side.
> + *
> + * Return: On success, 0 is returned.
> + *         On error, a negative errno number will be returned.
> + */
> +SYSCALL_DEFINE3(pidfd_mem, int, pidfd, int __user *, fds, unsigned int, flags)
> +{
> +	struct pid *pid;
> +	struct task_struct *task;
> +	int ret_fds[2];
> +	int ret;
> +
> +	if (pidfd < 0)
> +		return -EINVAL;

This needs to be EBADF.

> +	if (!fds)
> +		return -EINVAL;

If this API holds up, I think similar to what Florian suggested, a
struct would be nicer for userspace. Something like:

struct rmemfds {
	s32 memfd1;
	s32 memfd2;
};

and passing it a size

pidfd_mem(int pidfd, struct rmemfds *fds, size_t size, unsigned int flags)

and then copy_struct_from_user() will take care to do the right thing
for you.
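
So roughly (sketch; copy_struct_from_user() is the helper for the
copy-in direction, here the result is only copied out):

	SYSCALL_DEFINE4(pidfd_mem, int, pidfd, struct rmemfds __user *, fds,
			size_t, size, unsigned int, flags)
	{
		struct rmemfds kfds = {};

		/* ... pidfd lookup and task_remote_map() as before ... */

		/* size-capped copy-out keeps the struct extensible */
		if (copy_to_user(fds, &kfds, min_t(size_t, size, sizeof(kfds))))
			return -EFAULT;
		return 0;
	}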

> +	if (flags)
> +		return -EINVAL;
> +
> +	pid = pidfd_get_pid(pidfd);
> +	if (IS_ERR(pid))
> +		return PTR_ERR(pid);
> +
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	put_pid(pid);
> +	if (IS_ERR(task))
> +		return PTR_ERR(task);
> +
> +	ret = -EPERM;
> +	if (unlikely(task == current) || capable(CAP_SYS_PTRACE))
> +		ret = task_remote_map(task, ret_fds);
> +	put_task_struct(task);
> +	if (IS_ERR_VALUE((long)ret))
> +		return ret;
> +
> +	if (copy_to_user(fds, ret_fds, sizeof(ret_fds))) {
> +		put_unused_fd(ret_fds[0]);
> +		put_unused_fd(ret_fds[1]);
> +		return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
>  void __init pid_idr_init(void)
>  {
>  	/* Verify no one has done anything silly: */



* Re: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-04 11:31 ` [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
  2020-09-04 17:55   ` Oleg Nesterov
  2020-09-07 14:30   ` Oleg Nesterov
@ 2020-09-07 15:02   ` Christian Brauner
  2020-09-07 16:04     ` Mircea CIRJALIU - MELIU
  2 siblings, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2020-09-07 15:02 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox

On Fri, Sep 04, 2020 at 02:31:15PM +0300, Adalbert Lazăr wrote:
> From: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> 
> Remote mapping creates a mirror VMA that exposes memory owned by another
> process in a zero-copy manner. The pages are mapped in the current process'
> address space with no memory management involved and little impact on the
> remote process operation. Currently incompatible with THP.
> 
> Signed-off-by: Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> Signed-off-by: Adalbert Lazăr <alazar@bitdefender.com>
> ---
>  include/linux/remote_mapping.h      |   22 +
>  include/uapi/linux/remote_mapping.h |   36 +
>  mm/Kconfig                          |   11 +
>  mm/Makefile                         |    1 +
>  mm/remote_mapping.c                 | 1273 +++++++++++++++++++++++++++
>  5 files changed, 1343 insertions(+)
>  create mode 100644 include/linux/remote_mapping.h
>  create mode 100644 include/uapi/linux/remote_mapping.h
>  create mode 100644 mm/remote_mapping.c
> 
> diff --git a/include/linux/remote_mapping.h b/include/linux/remote_mapping.h
> new file mode 100644
> index 000000000000..5c1d43e8f669
> --- /dev/null
> +++ b/include/linux/remote_mapping.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_REMOTE_MAPPING_H
> +#define _LINUX_REMOTE_MAPPING_H
> +
> +#include <linux/sched.h>
> +
> +#ifdef CONFIG_REMOTE_MAPPING
> +
> +extern int task_remote_map(struct task_struct *task, int fds[]);
> +
> +#else /* CONFIG_REMOTE_MAPPING */
> +
> +static inline int task_remote_map(struct task_struct *task, int fds[])
> +{
> +	return -EINVAL;
> +}
> +
> +#endif /* CONFIG_REMOTE_MAPPING */
> +
> +
> +#endif /* _LINUX_REMOTE_MAPPING_H */
> diff --git a/include/uapi/linux/remote_mapping.h b/include/uapi/linux/remote_mapping.h
> new file mode 100644
> index 000000000000..5d2828a6aa47
> --- /dev/null
> +++ b/include/uapi/linux/remote_mapping.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +#ifndef __UAPI_REMOTE_MAPPING_H__
> +#define __UAPI_REMOTE_MAPPING_H__
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +// device file interface
> +#define REMOTE_PROC_MAP	_IOW('r', 0x01, int)
> +
> +
> +// pidfd interface
> +#define PIDFD_IO_MAGIC	'p'
> +
> +struct pidfd_mem_map {
> +	uint64_t address;
> +	off_t offset;
> +	off_t size;
> +	int flags;
> +	int padding[7];
> +};
> +
> +struct pidfd_mem_unmap {
> +	off_t offset;
> +	off_t size;
> +};
> +
> +#define PIDFD_MEM_MAP	_IOW(PIDFD_IO_MAGIC, 0x01, struct pidfd_mem_map)
> +#define PIDFD_MEM_UNMAP _IOW(PIDFD_IO_MAGIC, 0x02, struct pidfd_mem_unmap)
> +#define PIDFD_MEM_LOCK	_IOW(PIDFD_IO_MAGIC, 0x03, int)
> +
> +#define PIDFD_MEM_REMAP _IOW(PIDFD_IO_MAGIC, 0x04, unsigned long)
> +// TODO: actually this is not for pidfd, find better names
> +
> +#endif /* __UAPI_REMOTE_MAPPING_H__ */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c35..0ecc3f41a98e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -804,6 +804,17 @@ config HMM_MIRROR
>  	bool
>  	depends on MMU
>  
> +config REMOTE_MAPPING
> +	bool "Remote memory mapping"
> +	depends on X86_64 && MMU && !TRANSPARENT_HUGEPAGE
> +	select MMU_NOTIFIER
> +	default n
> +	help
> +	  Allows a client process to gain access to an unrelated process'
> +	  address space on a range-basis. The feature maps pages found at
> +	  the remote equivalent address in the current process' page tables
> +	  in a lightweight manner.
> +
>  config DEVICE_PRIVATE
>  	bool "Unaddressable device memory (GPU memory, ...)"
>  	depends on ZONE_DEVICE
> diff --git a/mm/Makefile b/mm/Makefile
> index fccd3756b25f..ce1a00e7bc8c 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -112,3 +112,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
>  obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
>  obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>  obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> +obj-$(CONFIG_REMOTE_MAPPING) += remote_mapping.o
> diff --git a/mm/remote_mapping.c b/mm/remote_mapping.c
> new file mode 100644
> index 000000000000..1dc53992424b
> --- /dev/null
> +++ b/mm/remote_mapping.c
> @@ -0,0 +1,1273 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Remote memory mapping.
> + *
> + * Copyright (C) 2017-2020 Bitdefender S.R.L.
> + *
> + * Author:
> + *   Mircea Cirjaliu <mcirjaliu@bitdefender.com>
> + */
> +#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
> +
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/module.h>
> +#include <linux/printk.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/pid.h>
> +#include <linux/file.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/sched/task.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched/signal.h>
> +#include <linux/interval_tree_generic.h>
> +#include <linux/refcount.h>
> +#include <linux/miscdevice.h>
> +#include <uapi/linux/remote_mapping.h>
> +#include <linux/pfn_t.h>
> +#include <linux/errno.h>
> +#include <linux/limits.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/fdtable.h>
> +#include <asm/tlb.h>
> +#include "internal.h"
> +
> +struct remote_file_context {
> +	refcount_t refcount;
> +
> +	struct srcu_struct fault_srcu;
> +	struct mm_struct *mm;
> +
> +	bool locked;
> +	struct rb_root_cached rb_view;		/* view offset tree */
> +	struct mutex views_lock;
> +
> +};
> +
> +struct remote_view {
> +	refcount_t refcount;
> +
> +	unsigned long address;
> +	unsigned long size;
> +	unsigned long offset;
> +	bool valid;
> +
> +	struct rb_node target_rb;		/* link for views tree */
> +	unsigned long rb_subtree_last;		/* in remote_file_context */
> +
> +	struct mmu_interval_notifier mmin;
> +	spinlock_t user_lock;
> +
> +	/*
> +	 * interval tree for mapped ranges (indexed by source process HVA)
> +	 * because of GPA->HVA aliasing, multiple ranges may overlap
> +	 */
> +	struct rb_root_cached rb_rmap;		/* rmap tree */
> +	struct rw_semaphore rmap_lock;
> +};
> +
> +struct remote_vma_context {
> +	struct vm_area_struct *vma;		/* link back to VMA */
> +	struct remote_view *view;		/* corresponding view */
> +
> +	struct rb_node rmap_rb;			/* link for rmap tree */
> +	unsigned long rb_subtree_last;
> +};
> +
> +/* view offset tree */
> +static inline unsigned long view_start(struct remote_view *view)
> +{
> +	return view->offset + 1;
> +}
> +
> +static inline unsigned long view_last(struct remote_view *view)
> +{
> +	return view->offset + view->size - 1;
> +}
> +
> +INTERVAL_TREE_DEFINE(struct remote_view, target_rb,
> +	unsigned long, rb_subtree_last, view_start, view_last,
> +	static inline, view_interval_tree)
> +
> +#define view_tree_foreach(view, root, start, last)			\
> +	for (view = view_interval_tree_iter_first(root, start, last);	\
> +	     view; view = view_interval_tree_iter_next(view, start, last))
> +
> +/* rmap interval tree */
> +static inline unsigned long ctx_start(struct remote_vma_context *ctx)
> +{
> +	struct vm_area_struct *vma = ctx->vma;
> +	struct remote_view *view = ctx->view;
> +	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
> +
> +	return offset - view->offset + view->address;
> +}
> +
> +static inline unsigned long ctx_last(struct remote_vma_context *ctx)
> +{
> +	struct vm_area_struct *vma = ctx->vma;
> +	struct remote_view *view = ctx->view;
> +	unsigned long offset;
> +
> +	offset = (vma->vm_pgoff << PAGE_SHIFT) + (vma->vm_end - vma->vm_start);
> +
> +	return offset - view->offset + view->address;
> +}
> +
> +static inline unsigned long ctx_rmap_start(struct remote_vma_context *ctx)
> +{
> +	return ctx_start(ctx) + 1;
> +}
> +
> +static inline unsigned long ctx_rmap_last(struct remote_vma_context *ctx)
> +{
> +	return ctx_last(ctx) - 1;
> +}
> +
> +INTERVAL_TREE_DEFINE(struct remote_vma_context, rmap_rb,
> +	unsigned long, rb_subtree_last, ctx_rmap_start, ctx_rmap_last,
> +	static inline, rmap_interval_tree)
> +
> +#define rmap_foreach(ctx, root, start, last)				\
> +	for (ctx = rmap_interval_tree_iter_first(root, start, last);	\
> +	     ctx; ctx = rmap_interval_tree_iter_next(ctx, start, last))
> +
> +static int mirror_zap_pte(struct vm_area_struct *vma, unsigned long addr,
> +			  pte_t *pte, int rss[], struct mmu_gather *tlb,
> +			  struct zap_details *details)
> +{
> +	pte_t ptent = *pte;
> +	struct page *page;
> +	int flags = 0;
> +
> +	page = vm_normal_page(vma, addr, ptent);
> +	//ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> +	ptent = ptep_clear_flush_notify(vma, addr, pte);
> +	//tlb_remove_tlb_entry(tlb, pte, addr);
> +
> +	if (pte_dirty(ptent)) {
> +		flags |= ZAP_PTE_FLUSH;
> +		set_page_dirty(page);
> +	}
> +
> +	return flags;
> +}
> +
> +static void
> +zap_remote_range(struct vm_area_struct *vma,
> +		 unsigned long start, unsigned long end,
> +		 bool atomic)
> +{
> +	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
> +	struct zap_details details = {
> +		.atomic = atomic,
> +	};
> +
> +	pr_debug("%s: vma %lx-%lx, zap range %lx-%lx\n",
> +		__func__, vma->vm_start, vma->vm_end, start, end);
> +
> +	tlb_gather_mmu(&tlb, vma->vm_mm, start, end);
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
> +				vma, vma->vm_mm, start, end);
> +	if (atomic)
> +		mmu_notifier_invalidate_range_start_nonblock(&range);
> +	else
> +		mmu_notifier_invalidate_range_start(&range);
> +
> +	unmap_page_range(&tlb, vma, start, end, &details);
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb, start, end);
> +}
> +
> +static bool
> +mirror_clear_view(struct remote_view *view,
> +		  unsigned long start, unsigned long last, bool atomic)
> +{
> +	struct remote_vma_context *ctx;
> +	unsigned long src_start, src_last;
> +	unsigned long vma_start, vma_last;
> +
> +	pr_debug("%s: view %p [%lx-%lx), range [%lx-%lx)", __func__, view,
> +		 view->offset, view->offset + view->size, start, last);
> +
> +	if (likely(!atomic))
> +		down_read(&view->rmap_lock);
> +	else if (!down_read_trylock(&view->rmap_lock))
> +		return false;
> +
> +	rmap_foreach(ctx, &view->rb_rmap, start, last) {
> +		struct vm_area_struct *vma = ctx->vma;
> +
> +		// intersect intervals (source process address range)
> +		src_start = max(start, ctx_start(ctx));
> +		src_last = min(last, ctx_last(ctx));
> +
> +		// translate to destination process address range
> +		vma_start = vma->vm_start + (src_start - ctx_start(ctx));
> +		vma_last = vma->vm_end - (ctx_last(ctx) - src_last);
> +
> +		zap_remote_range(vma, vma_start, vma_last, atomic);
> +	}
> +
> +	up_read(&view->rmap_lock);
> +
> +	return true;
> +}
> +
> +static bool mmin_invalidate(struct mmu_interval_notifier *interval_sub,
> +			    const struct mmu_notifier_range *range,
> +			    unsigned long cur_seq)
> +{
> +	struct remote_view *view =
> +		container_of(interval_sub, struct remote_view, mmin);
> +
> +	pr_debug("%s: reason %d, range [%lx-%lx)\n", __func__,
> +		 range->event, range->start, range->end);
> +
> +	spin_lock(&view->user_lock);
> +	mmu_interval_set_seq(interval_sub, cur_seq);
> +	spin_unlock(&view->user_lock);
> +
> +	/* mark view as invalid before zapping the page tables */
> +	if (range->event == MMU_NOTIFY_RELEASE)
> +		WRITE_ONCE(view->valid, false);
> +
> +	return mirror_clear_view(view, range->start, range->end,
> +				 !mmu_notifier_range_blockable(range));
> +}
> +
> +static const struct mmu_interval_notifier_ops mmin_ops = {
> +	.invalidate = mmin_invalidate,
> +};
> +
> +static void view_init(struct remote_view *view)
> +{
> +	refcount_set(&view->refcount, 1);
> +	view->valid = true;
> +	RB_CLEAR_NODE(&view->target_rb);
> +	view->rb_rmap = RB_ROOT_CACHED;
> +	init_rwsem(&view->rmap_lock);
> +	spin_lock_init(&view->user_lock);
> +}
> +
> +/* return working view or reason why it failed */
> +static struct remote_view *
> +view_alloc(struct mm_struct *mm, unsigned long address, unsigned long size, unsigned long offset)
> +{
> +	struct remote_view *view;
> +	int result;
> +
> +	view = kzalloc(sizeof(*view), GFP_KERNEL);
> +	if (!view)
> +		return ERR_PTR(-ENOMEM);
> +
> +	view_init(view);
> +
> +	view->address = address;
> +	view->size = size;
> +	view->offset = offset;
> +
> +	pr_debug("%s: view %p [%lx-%lx)", __func__, view,
> +		 view->offset, view->offset + view->size);
> +
> +	result = mmu_interval_notifier_insert(&view->mmin, mm, address, size, &mmin_ops);
> +	if (result) {
> +		kfree(view);
> +		return ERR_PTR(result);
> +	}
> +
> +	return view;
> +}
> +
> +static void
> +view_insert(struct remote_file_context *fctx, struct remote_view *view)
> +{
> +	view_interval_tree_insert(view, &fctx->rb_view);
> +	refcount_inc(&view->refcount);
> +}
> +
> +static struct remote_view *
> +view_search_get(struct remote_file_context *fctx,
> +	unsigned long start, unsigned long last)
> +{
> +	struct remote_view *view;
> +
> +	lockdep_assert_held(&fctx->views_lock);
> +
> +	/*
> +	 * loop & return the first view intersecting interval
> +	 * further checks will be done down the road
> +	 */
> +	view_tree_foreach(view, &fctx->rb_view, start, last)
> +		break;
> +
> +	if (view)
> +		refcount_inc(&view->refcount);
> +
> +	return view;
> +}
> +
> +static void
> +view_put(struct remote_view *view)
> +{
> +	if (refcount_dec_and_test(&view->refcount)) {
> +		pr_debug("%s: view %p [%lx-%lx) bye bye", __func__, view,
> +			 view->offset, view->offset + view->size);
> +
> +		mmu_interval_notifier_remove(&view->mmin);
> +		kfree(view);
> +	}
> +}
> +
> +static void
> +view_remove(struct remote_file_context *fctx, struct remote_view *view)
> +{
> +	view_interval_tree_remove(view, &fctx->rb_view);
> +	RB_CLEAR_NODE(&view->target_rb);
> +	view_put(view);
> +}
> +
> +static bool
> +view_overlaps(struct remote_file_context *fctx,
> +	unsigned long start, unsigned long last)
> +{
> +	struct remote_view *view;
> +
> +	view_tree_foreach(view, &fctx->rb_view, start, last)
> +		return true;
> +
> +	return false;
> +}
> +
> +static struct remote_view *
> +alloc_identity_view(struct mm_struct *mm)
> +{
> +	return view_alloc(mm, 0, ULONG_MAX, 0);
> +}
> +
> +static void remote_file_context_init(struct remote_file_context *fctx)
> +{
> +	refcount_set(&fctx->refcount, 1);
> +	init_srcu_struct(&fctx->fault_srcu);
> +	fctx->locked = false;
> +	fctx->rb_view = RB_ROOT_CACHED;
> +	mutex_init(&fctx->views_lock);
> +}
> +
> +static struct remote_file_context *remote_file_context_alloc(void)
> +{
> +	struct remote_file_context *fctx;
> +
> +	fctx = kzalloc(sizeof(*fctx), GFP_KERNEL);
> +	if (fctx)
> +		remote_file_context_init(fctx);
> +
> +	pr_debug("%s: fctx %p\n", __func__, fctx);
> +
> +	return fctx;
> +}
> +
> +static void remote_file_context_get(struct remote_file_context *fctx)
> +{
> +	refcount_inc(&fctx->refcount);
> +}
> +
> +static void remote_file_context_put(struct remote_file_context *fctx)
> +{
> +	struct remote_view *view, *n;
> +
> +	if (refcount_dec_and_test(&fctx->refcount)) {
> +		pr_debug("%s: fctx %p\n", __func__, fctx);
> +
> +		rbtree_postorder_for_each_entry_safe(view, n,
> +			&fctx->rb_view.rb_root, target_rb)
> +			view_put(view);
> +
> +		if (fctx->mm)
> +			mmdrop(fctx->mm);
> +
> +		kfree(fctx);
> +	}
> +}
> +
> +static void remote_vma_context_init(struct remote_vma_context *ctx)
> +{
> +	RB_CLEAR_NODE(&ctx->rmap_rb);
> +}
> +
> +static struct remote_vma_context *remote_vma_context_alloc(void)
> +{
> +	struct remote_vma_context *ctx;
> +
> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (ctx)
> +		remote_vma_context_init(ctx);
> +
> +	return ctx;
> +}
> +
> +static void remote_vma_context_free(struct remote_vma_context *ctx)
> +{
> +	kfree(ctx);
> +}
> +
> +static int mirror_dev_open(struct inode *inode, struct file *file)
> +{
> +	struct remote_file_context *fctx;
> +
> +	pr_debug("%s: file %p\n", __func__, file);
> +
> +	fctx = remote_file_context_alloc();
> +	if (!fctx)
> +		return -ENOMEM;
> +	file->private_data = fctx;
> +
> +	return 0;
> +}
> +
> +static int do_remote_proc_map(struct file *file, int pid)
> +{
> +	struct remote_file_context *fctx = file->private_data;
> +	struct task_struct *req_task;
> +	struct mm_struct *req_mm;
> +	struct remote_view *id;
> +	int result = 0;
> +
> +	pr_debug("%s: pid %d\n", __func__, pid);
> +
> +	req_task = find_get_task_by_vpid(pid);
> +	if (!req_task)
> +		return -ESRCH;
> +
> +	req_mm = get_task_mm(req_task);
> +	put_task_struct(req_task);
> +	if (!req_mm)
> +		return -EINVAL;
> +
> +	/* error on 2nd call or multithreaded race */
> +	if (cmpxchg(&fctx->mm, (struct mm_struct *)NULL, req_mm) != NULL) {
> +		result = -EALREADY;
> +		goto out;
> +	} else
> +		mmgrab(req_mm);
> +
> +	id = alloc_identity_view(req_mm);
> +	if (IS_ERR(id)) {
> +		mmdrop(req_mm);
> +		result = PTR_ERR(id);
> +		goto out;
> +	}
> +
> +	/* one view only, don't need to take mutex */
> +	view_insert(fctx, id);
> +	view_put(id);			/* usage reference */
> +
> +out:
> +	mmput(req_mm);
> +
> +	return result;
> +}
> +
> +static long mirror_dev_ioctl(struct file *file, unsigned int ioctl,
> +	unsigned long arg)
> +{
> +	long result;
> +
> +	switch (ioctl) {
> +	case REMOTE_PROC_MAP: {
> +		int pid = (int)arg;
> +
> +		result = do_remote_proc_map(file, pid);
> +		break;
> +	}
> +
> +	default:
> +		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
> +		result = -ENOTTY;
> +	}
> +
> +	return result;
> +}
> +
> +/*
> + * This is called after all reference to the file have been dropped,
> + * including mmap()s, even if the file is close()d first.
> + */
> +static int mirror_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct remote_file_context *fctx = file->private_data;
> +
> +	pr_debug("%s: file %p\n", __func__, file);
> +
> +	remote_file_context_put(fctx);
> +
> +	return 0;
> +}
> +
> +static struct page *mm_remote_get_page(struct mm_struct *req_mm,
> +	unsigned long address, unsigned int flags)
> +{
> +	struct page *req_page = NULL;
> +	long nrpages;
> +
> +	might_sleep();
> +
> +	flags |= FOLL_ANON | FOLL_MIGRATION;
> +
> +	/* get host page corresponding to requested address */
> +	nrpages = get_user_pages_remote(NULL, req_mm, address, 1,
> +		flags, &req_page, NULL, NULL);
> +	if (unlikely(nrpages == 0)) {
> +		pr_err("no page at %lx\n", address);
> +		return ERR_PTR(-ENOENT);
> +	}
> +	if (IS_ERR_VALUE(nrpages)) {
> +		pr_err("get_user_pages_remote() failed: %d\n", (int)nrpages);
> +		return ERR_PTR(nrpages);
> +	}
> +
> +	/* limit introspection to anon memory (this also excludes zero-page) */
> +	if (!PageAnon(req_page)) {
> +		put_page(req_page);
> +		pr_err("page at %lx not anon\n", address);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	return req_page;
> +}
> +
> +/*
> + * avoid PTE allocation in this function for 2 reasons:
> + * - it runs under user_lock, which is a spinlock and can't sleep
> + *   (user_lock can be a mutex if allocation is needed)
> + * - PTE allocation triggers reclaim, which causes a possible deadlock warning
> + */
> +static vm_fault_t remote_map_page(struct vm_fault *vmf, struct page *page)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	pte_t entry;
> +
> +	if (vmf->prealloc_pte) {
> +		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> +		if (unlikely(!pmd_none(*vmf->pmd))) {
> +			spin_unlock(vmf->ptl);
> +			goto map_pte;
> +		}
> +
> +		mm_inc_nr_ptes(vma->vm_mm);
> +		pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
> +		spin_unlock(vmf->ptl);
> +		vmf->prealloc_pte = NULL;
> +	} else {
> +		BUG_ON(pmd_none(*vmf->pmd));
> +	}
> +
> +map_pte:
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> +
> +	if (!pte_none(*vmf->pte))
> +		goto out_unlock;
> +
> +	entry = mk_pte(page, vma->vm_page_prot);
> +	set_pte_at_notify(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +out_unlock:
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static vm_fault_t mirror_vm_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct remote_vma_context *ctx = vma->vm_private_data;
> +	struct remote_view *view = ctx->view;
> +	struct file *file = vma->vm_file;
> +	struct remote_file_context *fctx = file->private_data;
> +	unsigned long req_addr;
> +	unsigned int gup_flags;
> +	struct page *req_page;
> +	vm_fault_t result = VM_FAULT_SIGBUS;
> +	struct mm_struct *src_mm = fctx->mm;
> +	unsigned long seq;
> +	int idx;
> +
> +fault_retry:
> +	seq = mmu_interval_read_begin(&view->mmin);
> +
> +	idx = srcu_read_lock(&fctx->fault_srcu);
> +
> +	/* check if view was invalidated */
> +	if (unlikely(!READ_ONCE(view->valid))) {
> +		pr_debug("%s: region [%lx-%lx) was invalidated!!\n", __func__,
> +			view->offset, view->offset + view->size);
> +		goto out_invalid;		/* VM_FAULT_SIGBUS */
> +	}
> +
> +	/* drop current mm semaphore */
> +	up_read(&current->mm->mmap_sem);
> +
> +	/* take remote mm semaphore */
> +	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> +		if (!down_read_trylock(&src_mm->mmap_sem)) {
> +			pr_debug("%s: couldn't take source semaphore!!\n", __func__);
> +			goto out_retry;
> +		}
> +	} else
> +		down_read(&src_mm->mmap_sem);
> +
> +	/* set GUP flags depending on the VMA */
> +	gup_flags = 0;
> +	if (vma->vm_flags & VM_WRITE)
> +		gup_flags |= FOLL_WRITE | FOLL_FORCE;
> +
> +	/* translate file offset to source process HVA */
> +	req_addr = (vmf->pgoff << PAGE_SHIFT) - view->offset + view->address;
> +	req_page = mm_remote_get_page(src_mm, req_addr, gup_flags);
> +
> +	/* check for validity of the page */
> +	if (IS_ERR_OR_NULL(req_page)) {
> +		up_read(&src_mm->mmap_sem);
> +
> +		if (PTR_ERR(req_page) == -ERESTARTSYS ||
> +		    PTR_ERR(req_page) == -EBUSY) {
> +			goto out_retry;
> +		} else
> +			goto out_err;	/* VM_FAULT_SIGBUS */
> +	}
> +
> +	up_read(&src_mm->mmap_sem);
> +
> +	/* retake current mm semaphore */
> +	down_read(&current->mm->mmap_sem);
> +
> +	/* expedite retry */
> +	if (mmu_interval_check_retry(&view->mmin, seq)) {
> +		put_page(req_page);
> +
> +		srcu_read_unlock(&fctx->fault_srcu, idx);
> +
> +		goto fault_retry;
> +	}
> +
> +	/* make sure the VMA hasn't gone away */
> +	vma = find_vma(current->mm, vmf->address);
> +	if (vma == vmf->vma) {
> +		spin_lock(&view->user_lock);
> +
> +		if (mmu_interval_read_retry(&view->mmin, seq)) {
> +			spin_unlock(&view->user_lock);
> +
> +			put_page(req_page);
> +
> +			srcu_read_unlock(&fctx->fault_srcu, idx);
> +
> +			goto fault_retry;
> +		}
> +
> +		result = remote_map_page(vmf, req_page);  /* install PTE here */
> +
> +		spin_unlock(&view->user_lock);
> +	}
> +
> +	put_page(req_page);
> +
> +	srcu_read_unlock(&fctx->fault_srcu, idx);
> +
> +	return result;
> +
> +out_err:
> +	/* retake current mm semaphore */
> +	down_read(&current->mm->mmap_sem);
> +out_invalid:
> +	srcu_read_unlock(&fctx->fault_srcu, idx);
> +
> +	return result;
> +
> +out_retry:
> +	/* retake current mm semaphore */
> +	down_read(&current->mm->mmap_sem);
> +
> +	srcu_read_unlock(&fctx->fault_srcu, idx);
> +
> +	/* TODO: some optimizations work here when we arrive with FAULT_FLAG_ALLOW_RETRY */
> +	/* TODO: mmap_sem doesn't need to be taken, then dropped */
> +
> +	/*
> +	 * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
> +	 * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
> +	 * not set.
> +	 *
> +	 * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
> +	 * set, VM_FAULT_RETRY can still be returned if and only if there are
> +	 * fatal_signal_pending()s, and the mmap_sem must be released before
> +	 * returning it.
> +	 */
> +	if (vmf->flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_TRIED)) {
> +		if (!(vmf->flags & FAULT_FLAG_KILLABLE))
> +			if (current && fatal_signal_pending(current)) {
> +				up_read(&current->mm->mmap_sem);
> +				return VM_FAULT_RETRY;
> +			}
> +
> +		if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> +			up_read(&mm->mmap_sem);
> +
> +		return VM_FAULT_RETRY;
> +	} else
> +		return VM_FAULT_SIGBUS;
> +}
> +
> +/*
> + * This is called in remove_vma() at the end of __do_munmap() after the address
> + * space has been unmapped and the page tables have been freed.
> + */
> +static void mirror_vm_close(struct vm_area_struct *vma)
> +{
> +	struct remote_vma_context *ctx = vma->vm_private_data;
> +	struct remote_view *view = ctx->view;
> +
> +	pr_debug("%s: VMA %lx-%lx (%lu bytes)\n", __func__,
> +		vma->vm_start, vma->vm_end, vma->vm_end - vma->vm_start);
> +
> +	/* will wait for any running invalidate notifiers to finish */
> +	down_write(&view->rmap_lock);
> +	rmap_interval_tree_remove(ctx, &view->rb_rmap);
> +	up_write(&view->rmap_lock);
> +	view_put(view);
> +
> +	remote_vma_context_free(ctx);
> +}
> +
> +/* prevent partial unmap of destination VMA */
> +static int mirror_vm_split(struct vm_area_struct *area, unsigned long addr)
> +{
> +	return -EINVAL;
> +}
> +
> +static const struct vm_operations_struct mirror_vm_ops = {
> +	.close = mirror_vm_close,
> +	.fault = mirror_vm_fault,
> +	.split = mirror_vm_split,
> +	.zap_pte = mirror_zap_pte,
> +};
> +
> +static bool is_mirror_vma(struct vm_area_struct *vma)
> +{
> +	return vma->vm_ops == &mirror_vm_ops;
> +}
> +
> +static struct remote_view *
> +getme_matching_view(struct remote_file_context *fctx,
> +		    unsigned long start, unsigned long last)
> +{
> +	struct remote_view *view;
> +
> +	/* lookup view for the VMA offset range */
> +	view = view_search_get(fctx, start, last);
> +	if (!view)
> +		return NULL;
> +
> +	/* make sure the interval we're after is contained in the view */
> +	if (start < view->offset || last > view->offset + view->size) {
> +		view_put(view);
> +		return NULL;
> +	}
> +
> +	return view;
> +}
> +
> +static struct remote_view *
> +getme_exact_view(struct remote_file_context *fctx,
> +		 unsigned long start, unsigned long last)
> +{
> +	struct remote_view *view;
> +
> +	/* lookup view for the VMA offset range */
> +	view = view_search_get(fctx, start, last);
> +	if (!view)
> +		return NULL;
> +
> +	/* make sure the interval we're after matches the view exactly */
> +	if (start != view->offset || last != view->offset + view->size) {
> +		view_put(view);
> +		return NULL;
> +	}
> +
> +	return view;
> +}
> +
> +static int mirror_dev_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct remote_file_context *fctx = file->private_data;
> +	struct remote_vma_context *ctx;
> +	unsigned long start, length, last;
> +	struct remote_view *view;
> +
> +	start = vma->vm_pgoff << PAGE_SHIFT;
> +	length = vma->vm_end - vma->vm_start;
> +	last = start + length;
> +
> +	pr_debug("%s: VMA %lx-%lx (%lu bytes), offsets %lx-%lx\n", __func__,
> +		vma->vm_start, vma->vm_end, length, start, last);
> +
> +	if (!(vma->vm_flags & VM_SHARED)) {
> +		pr_debug("%s: VMA is not shared\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	/* prepare the context */
> +	ctx = remote_vma_context_alloc();
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	/* lookup view for the VMA offset range */
> +	mutex_lock(&fctx->views_lock);
> +	view = getme_matching_view(fctx, start, last);
> +	mutex_unlock(&fctx->views_lock);
> +	if (!view) {
> +		pr_debug("%s: no view for range %lx-%lx\n", __func__, start, last);
> +		remote_vma_context_free(ctx);
> +		return -EINVAL;
> +	}
> +
> +	/* VMA must be linked to ctx before adding to rmap tree !! */
> +	vma->vm_private_data = ctx;
> +	ctx->vma = vma;
> +
> +	/* view may already be invalidated by the time it's linked */
> +	down_write(&view->rmap_lock);
> +	ctx->view = view;	/* view reference goes here */
> +	rmap_interval_tree_insert(ctx, &view->rb_rmap);
> +	up_write(&view->rmap_lock);
> +
> +	/* set basic VMA properties */
> +	vma->vm_flags |= VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND;
> +	vma->vm_ops = &mirror_vm_ops;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations mirror_ops = {
> +	.open = mirror_dev_open,
> +	.unlocked_ioctl = mirror_dev_ioctl,
> +	.compat_ioctl = mirror_dev_ioctl,
> +	.llseek = no_llseek,
> +	.mmap = mirror_dev_mmap,
> +	.release = mirror_dev_release,
> +};
> +
> +static struct miscdevice mirror_dev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "mirror-proc",
> +	.fops = &mirror_ops,
> +};
> +
> +builtin_misc_device(mirror_dev);
> +
> +static int pidfd_mem_remap(struct remote_file_context *fctx, unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned long start, last;
> +	struct remote_vma_context *ctx;
> +	struct remote_view *view, *new_view;
> +	int result = 0;
> +
> +	pr_debug("%s: address %lx\n", __func__, address);
> +
> +	down_write(&current->mm->mmap_sem);
> +
> +	vma = find_vma(current->mm, address);
> +	if (!vma || !is_mirror_vma(vma)) {
> +		result = -EINVAL;
> +		goto out_vma;
> +	}
> +
> +	ctx = vma->vm_private_data;
> +	view = ctx->view;
> +
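> +	/* only a VMA whose view was invalidated by PIDFD_MEM_UNMAP needs relinking */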
> +	if (view->valid)
> +		goto out_vma;
> +
> +	start = vma->vm_pgoff << PAGE_SHIFT;
> +	last = start + (vma->vm_end - vma->vm_start);
> +
> +	/* lookup view for the VMA offset range */
> +	mutex_lock(&fctx->views_lock);
> +	new_view = getme_matching_view(fctx, start, last);
> +	mutex_unlock(&fctx->views_lock);
> +	if (!new_view) {
> +		result = -EINVAL;
> +		goto out_vma;
> +	}
> +	/* do not link to another invalid view */
> +	if (!new_view->valid) {
> +		view_put(new_view);
> +		result = -EINVAL;
> +		goto out_vma;
> +	}
> +
> +	/* we have current->mm->mmap_sem in write mode, so no faults going on */
> +	down_write(&view->rmap_lock);
> +	rmap_interval_tree_remove(ctx, &view->rb_rmap);
> +	up_write(&view->rmap_lock);
> +	view_put(view);		/* ctx reference */
> +
> +	/* replace with the new view */
> +	down_write(&new_view->rmap_lock);
> +	ctx->view = new_view;	/* new view reference goes here */
> +	rmap_interval_tree_insert(ctx, &new_view->rb_rmap);
> +	up_write(&new_view->rmap_lock);
> +
> +out_vma:
> +	up_write(&current->mm->mmap_sem);
> +
> +	return result;
> +}
> +
> +static long
> +pidfd_mem_map_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> +{
> +	struct remote_file_context *fctx = file->private_data;
> +	long result = 0;
> +
> +	switch (ioctl) {
> +	case PIDFD_MEM_REMAP:
> +		result = pidfd_mem_remap(fctx, arg);
> +		break;
> +
> +	default:
> +		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
> +		result = -ENOTTY;
> +	}
> +
> +	return result;
> +}
> +
> +static void pidfd_mem_lock(struct remote_file_context *fctx)
> +{
> +	pr_debug("%s:\n", __func__);
> +
> +	mutex_lock(&fctx->views_lock);
> +	fctx->locked = true;
> +	mutex_unlock(&fctx->views_lock);
> +}
> +
> +static int pidfd_mem_map(struct remote_file_context *fctx, struct pidfd_mem_map *map)
> +{
> +	struct remote_view *view;
> +	int result = 0;
> +
> +	pr_debug("%s: offset %lx, size %lx, address %lx\n",
> +		__func__, map->offset, map->size, (long)map->address);
> +
> +	if (!PAGE_ALIGNED(map->offset))
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(map->size))
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(map->address))
> +		return -EINVAL;
> +
> +	/* make sure we're creating the view for a valid address space */
> +	if (!mmget_not_zero(fctx->mm))
> +		return -EINVAL;
> +
> +	view = view_alloc(fctx->mm, map->address, map->size, map->offset);
> +	if (IS_ERR(view)) {
> +		result = PTR_ERR(view);
> +		goto out_mm;
> +	}
> +
> +	mutex_lock(&fctx->views_lock);
> +
> +	/* locked ? */
> +	if (unlikely(fctx->locked)) {
> +		pr_debug("%s: views locked\n", __func__);
> +		result = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* overlaps another view ? */
> +	if (view_overlaps(fctx, map->offset, map->offset + map->size)) {
> +		pr_debug("%s: range overlaps\n", __func__);
> +		result = -EALREADY;
> +		goto out;
> +	}
> +
> +	view_insert(fctx, view);
> +
> +out:
> +	mutex_unlock(&fctx->views_lock);
> +
> +	view_put(view);			/* usage reference */
> +out_mm:
> +	mmput(fctx->mm);
> +
> +	return result;
> +}
> +
> +static int pidfd_mem_unmap(struct remote_file_context *fctx, struct pidfd_mem_unmap *unmap)
> +{
> +	struct remote_view *view;
> +
> +	pr_debug("%s: offset %lx, size %lx\n",
> +		__func__, unmap->offset, unmap->size);
> +
> +	if (!PAGE_ALIGNED(unmap->offset))
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(unmap->size))
> +		return -EINVAL;
> +
> +	mutex_lock(&fctx->views_lock);
> +
> +	if (unlikely(fctx->locked)) {
> +		mutex_unlock(&fctx->views_lock);
> +		return -EINVAL;
> +	}
> +
> +	view = getme_exact_view(fctx, unmap->offset, unmap->offset + unmap->size);
> +	if (!view) {
> +		mutex_unlock(&fctx->views_lock);
> +		return -EINVAL;
> +	}
> +
> +	view_remove(fctx, view);
> +
> +	mutex_unlock(&fctx->views_lock);
> +
> +	/*
> +	 * The view may still be referenced by a mapping VMA, so dropping
> +	 * a reference here may not delete it. The view will be marked as
> +	 * invalid, together with all the VMAs linked to it.
> +	 */
> +	WRITE_ONCE(view->valid, false);
> +
> +	/* wait until local faults finish */
> +	synchronize_srcu(&fctx->fault_srcu);
> +
> +	/*
> +	 * because the view is marked as invalid, faults will not succeed, so
> +	 * we don't have to worry about synchronizing invalidations/faults
> +	 */
> +	mirror_clear_view(view, 0, ULONG_MAX, false);
> +
> +	view_put(view);			/* usage reference */
> +
> +	return 0;
> +}
> +
> +static long
> +pidfd_mem_ctrl_ioctl(struct file *file, unsigned int ioctl, unsigned long arg)
> +{
> +	struct remote_file_context *fctx = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	long result = 0;
> +
> +	switch (ioctl) {
> +	case PIDFD_MEM_MAP: {
> +		struct pidfd_mem_map map;
> +
> +		result = -EINVAL;
> +		if (copy_from_user(&map, argp, sizeof(map)))
> +			return result;
> +
> +		result = pidfd_mem_map(fctx, &map);
> +		break;
> +	}
> +
> +	case PIDFD_MEM_UNMAP: {
> +		struct pidfd_mem_unmap unmap;
> +
> +		result = -EINVAL;
> +		if (copy_from_user(&unmap, argp, sizeof(unmap)))
> +			return result;
> +
> +		result = pidfd_mem_unmap(fctx, &unmap);
> +		break;
> +	}
> +
> +	case PIDFD_MEM_LOCK:
> +		pidfd_mem_lock(fctx);
> +		break;
> +
> +	default:
> +		pr_debug("%s: ioctl %x not implemented\n", __func__, ioctl);
> +		result = -ENOTTY;
> +	}
> +
> +	return result;
> +}
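
For orientation, a minimal user-space sequence against this control plane
might look as follows. This is only a sketch: the pidfd_mem() entry point,
its two returned descriptors and the PIDFD_MEM_* ioctls are the ones
proposed in this series and exist in no released kernel, the exact
pidfd_mem() calling convention is an assumption, pidfd and remote_hva are
placeholders, and error handling is elided.

	int fds[2];	/* fds[0]: control fd, fds[1]: memory access fd */
	struct pidfd_mem_map map = {
		.address = remote_hva,	/* page-aligned HVA in the target */
		.offset  = 0,
		.size    = 2 * 1024 * 1024,
	};

	pidfd_mem(pidfd, fds, 0);
	ioctl(fds[0], PIDFD_MEM_MAP, &map);

	/* MAP_SHARED is mandatory, see the VM_SHARED check in mirror_dev_mmap() */
	void *p = mmap(NULL, map.size, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fds[1], map.offset);

	/* revoke: mirror VMAs of this range start raising SIGBUS on access */
	struct pidfd_mem_unmap unmap = { .offset = 0, .size = map.size };
	ioctl(fds[0], PIDFD_MEM_UNMAP, &unmap);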
> +
> +static int pidfd_mem_ctrl_release(struct inode *inode, struct file *file)
> +{
> +	struct remote_file_context *fctx = file->private_data;
> +
> +	pr_debug("%s: file %p\n", __func__, file);
> +
> +	remote_file_context_put(fctx);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations pidfd_mem_ctrl_ops = {
> +	.owner = THIS_MODULE,
> +	.unlocked_ioctl = pidfd_mem_ctrl_ioctl,
> +	.compat_ioctl = pidfd_mem_ctrl_ioctl,

<snip>

> +static const struct file_operations pidfd_mem_map_fops = {
> +	.owner = THIS_MODULE,
> +	.mmap = mirror_dev_mmap,
> +	.get_unmapped_area = pidfd_mem_get_unmapped_area,
> +	.unlocked_ioctl = pidfd_mem_map_ioctl,
> +	.compat_ioctl = pidfd_mem_map_ioctl,

Reading this, it really doesn't feel like a clean API. I think it would
be great if you could investigate whether you can get a more flexible
and cleaner design on top of memfds as suggested elsewhere in the
thread.

Christian



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
                   ` (7 preceding siblings ...)
  2020-09-04 19:39 ` Andy Lutomirski
@ 2020-09-07 15:05 ` Christian Brauner
  2020-09-07 20:43   ` Andy Lutomirski
  8 siblings, 1 reply; 44+ messages in thread
From: Christian Brauner @ 2020-09-07 15:05 UTC (permalink / raw)
  To: Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Oleg Nesterov, Jann Horn, Kees Cook,
	Matthew Wilcox

On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.
> 
> The feature was designed according to a specification suggested by
> Paolo Bonzini:
> >> The proposed API is a new pidfd system call, through which the parent
> >> can map portions of its virtual address space into a file descriptor
> >> and then pass that file descriptor to a child.
> >>
> >> This should be:
> >>
> >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> >> better way to do PTRACE_{PEEK,POKE}DATA

In all honesty, that sentence made me a bit uneasy as it reads like this
is implemented on top of pidfds because it makes it more likely to go
upstream, not because it is the right design. To be clear, I'm not
implying any sort of malicious intent on your part but I would suggest
to phrase this a little better. :)



* Re: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-07 14:30   ` Oleg Nesterov
@ 2020-09-07 15:16     ` Adalbert Lazăr
  2020-09-09  8:32     ` Mircea CIRJALIU - MELIU
  1 sibling, 0 replies; 44+ messages in thread
From: Adalbert Lazăr @ 2020-09-07 15:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner

On Mon, 7 Sep 2020 16:30:08 +0200, Oleg Nesterov <oleg@redhat.com> wrote:
> it seems that nobody is going to review this patch ;)
> 
> So I tried to read mirror_vm_fault() and the usage of mmap_sem doesn't
> look right to me. But let me repeat, this is not my area, I can be easily
> wrong, please correct me.
> 
> On 09/04, Adalbert Lazăr wrote:
> >
> > +static vm_fault_t mirror_vm_fault(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct remote_vma_context *ctx = vma->vm_private_data;
> > +	struct remote_view *view = ctx->view;
> > +	struct file *file = vma->vm_file;
> > +	struct remote_file_context *fctx = file->private_data;
> > +	unsigned long req_addr;
> > +	unsigned int gup_flags;
> > +	struct page *req_page;
> > +	vm_fault_t result = VM_FAULT_SIGBUS;
> > +	struct mm_struct *src_mm = fctx->mm;
> > +	unsigned long seq;
> > +	int idx;
> > +
> > +fault_retry:
> > +	seq = mmu_interval_read_begin(&view->mmin);
> > +
> > +	idx = srcu_read_lock(&fctx->fault_srcu);
> > +
> > +	/* check if view was invalidated */
> > +	if (unlikely(!READ_ONCE(view->valid))) {
> > +		pr_debug("%s: region [%lx-%lx) was invalidated!!\n", __func__,
> > +			view->offset, view->offset + view->size);
> > +		goto out_invalid;		/* VM_FAULT_SIGBUS */
> > +	}
> > +
> > +	/* drop current mm semaphore */
> > +	up_read(&current->mm->mmap_sem);
> 
> Please use mmap_read_lock/unlock(mm) instead of down/up_read(mmap_sem).

This patch series is based on 5.7-rc2.
The cover letter has base-commit: present, but I forgot to mention this
explicitly, sorry.

Adalbert



* RE: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-07 15:02   ` Christian Brauner
@ 2020-09-07 16:04     ` Mircea CIRJALIU - MELIU
  0 siblings, 0 replies; 44+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2020-09-07 16:04 UTC (permalink / raw)
  To: Christian Brauner, Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox

> From: Christian Brauner <christian.brauner@ubuntu.com>
> 
> Reading this, it really doesn't feel like a clean API. I think it would be great if
> you could investigate whether you can get a more flexible and cleaner design
> on top of memfds as suggested elsewhere in the thread.

The API may be dirty, and that can be fixed. Any suggestion is appreciated.
But the implementation underneath has a whole other purpose than just sharing 
memory.

Doing a design on top of memfd will surely make some people happy.
Don't know what to say about replacing the QEMU allocator though.

Mircea


* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-07 15:05 ` Christian Brauner
@ 2020-09-07 20:43   ` Andy Lutomirski
  2020-09-09 11:38     ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2020-09-07 20:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Adalbert Lazăr, Linux-MM, Linux API, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Mircea Cirjaliu, Andy Lutomirski,
	Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
	Jann Horn, Kees Cook, Matthew Wilcox

On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > This patchset adds support for the remote mapping feature.
> > Remote mapping, as its name suggests, is a means for transparent and
> > zero-copy access of a remote process' address space.
> >
> > The feature was designed according to a specification suggested by
> > Paolo Bonzini:
> > >> The proposed API is a new pidfd system call, through which the parent
> > >> can map portions of its virtual address space into a file descriptor
> > >> and then pass that file descriptor to a child.
> > >>
> > >> This should be:
> > >>
> > >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> > >> better way to do PTRACE_{PEEK,POKE}DATA
>
> In all honesty, that sentence made me a bit uneasy as it reads like this
> is implemented on top of pidfds because it makes it more likely to go
> upstream, not because it is the right design. To be clear, I'm not
> implying any sort of malicious intent on your part but I would suggest
> to phrase this a little better. :)


I thought about this whole thing some more, and here are some thoughts.

First, I was nervous about two things.  One was faulting in pages from
the wrong context.  (When a normal page fault or KVM faults in a page,
the mm is loaded.  (In the KVM case, the mm is sort of not loaded when
the actual fault happens, but the mm is loaded when the fault is
handled, I think.  Maybe there are workqueues involved and I'm wrong.)
 When a remote mapping faults in a page, the mm is *not* loaded.)
This ought not to be a problem, though -- get_user_pages_remote() also
faults in pages from a non-current mm, and that's at least supposed to
work correctly, so maybe this is okay.

Second is recursion.  I think this is a genuine problem.

And I think that tying this to pidfds is the wrong approach.  In fact,
tying it to processes at all seems wrong.  There is a lot of demand
for various forms of memory isolation in which memory is mapped only
by its intended user.  Using something tied to a process mm gets in
the way of this in the same way that KVM's current mapping model gets
in the way.

All that being said, I think the whole idea of making fancy address
spaces composed from other mappable objects is neat and possibly quite
useful.  And, if you squint a bit, this is a lot like what KVM does
today.

So I suggest something that may be more generally useful as an
alternative.  This is a sketch and very subject to bikeshedding:

Create an empty address space:

int address_space_create(int flags, etc);

Map an fd into an address space:

int address_space_mmap(int asfd, int fd_to_map, offset, size, prot,
...);  /* might run out of args here */

Unmap from an address space:

int address_space_munmap(int asfd, unsigned long addr, unsigned long len);

Stick an address space into KVM:

ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);  /* or similar */

Maybe some day allow mapping an address space into a process.

mmap(..., asfd, ...);


And at least for now, there's a rule that an address space that is
address_space_mmapped into an address space is disallowed.


Maybe some day we also allow mremap(), madvise(), etc.  And maybe some
day we allow creating a special address_space that represents a real
process's address space.


Under the hood, an address_space could own an mm_struct that is not
used by any tasks.  And we could have special memfds that are bound to
a VM such that all you can do with them is stick them into an
address_space and map that address_space into the VM in question.  For
this to work, we would want a special vm_operation for mapping into a
VM.
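
To make that concrete, a hypothetical VMM-side flow under this sketch
could read as follows (every call below is from the proposal in this mail
and exists nowhere today; guest_ram_fd, ram_size, vmfd, addr and len are
placeholders):

	int asfd = address_space_create(0);

	/* compose the guest address space from mappable objects */
	address_space_mmap(asfd, guest_ram_fd, 0, ram_size,
			   PROT_READ | PROT_WRITE);

	/* hand the composed address space to KVM */
	ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);

	/* later: revoke a range without touching the VMM's own mm */
	address_space_munmap(asfd, addr, len);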


What do you all think?  Is this useful?  Does it solve your problems?
Is it a good approach going forward?



* RE: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-07 14:30   ` Oleg Nesterov
  2020-09-07 15:16     ` Adalbert Lazăr
@ 2020-09-09  8:32     ` Mircea CIRJALIU - MELIU
  2020-09-10 16:43       ` Oleg Nesterov
  1 sibling, 1 reply; 44+ messages in thread
From: Mircea CIRJALIU - MELIU @ 2020-09-09  8:32 UTC (permalink / raw)
  To: Oleg Nesterov, Adalbert Lazăr
  Cc: linux-mm, linux-api, Andrew Morton, Alexander Graf,
	Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Andy Lutomirski, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Jann Horn, Kees Cook, Matthew Wilcox, Christian Brauner

> From: Oleg Nesterov <oleg@redhat.com>
> 
> it seems that nobody is going to review this patch ;)
> 
> So I tried to read mirror_vm_fault() and the usage of mmap_sem doesn't
> look right to me. But let me repeat, this is not my area, I can be easily wrong,
> please correct me.
> 
> On 09/04, Adalbert Lazăr wrote:
> >
> > +static vm_fault_t mirror_vm_fault(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct remote_vma_context *ctx = vma->vm_private_data;
> > +	struct remote_view *view = ctx->view;
> > +	struct file *file = vma->vm_file;
> > +	struct remote_file_context *fctx = file->private_data;
> > +	unsigned long req_addr;
> > +	unsigned int gup_flags;
> > +	struct page *req_page;
> > +	vm_fault_t result = VM_FAULT_SIGBUS;
> > +	struct mm_struct *src_mm = fctx->mm;
> > +	unsigned long seq;
> > +	int idx;
> > +
> > +fault_retry:
> > +	seq = mmu_interval_read_begin(&view->mmin);
> > +
> > +	idx = srcu_read_lock(&fctx->fault_srcu);
> > +
> > +	/* check if view was invalidated */
> > +	if (unlikely(!READ_ONCE(view->valid))) {
> > +		pr_debug("%s: region [%lx-%lx) was invalidated!!\n", __func__,
> > +			view->offset, view->offset + view->size);
> > +		goto out_invalid;		/* VM_FAULT_SIGBUS */
> > +	}
> > +
> > +	/* drop current mm semaphore */
> > +	up_read(&current->mm->mmap_sem);
> 
> Please use mmap_read_lock/unlock(mm) instead of
> down/up_read(mmap_sem).
> 
> But why is it safe to drop ->mmap_sem without checking
> FAULT_FLAG_ALLOW_RETRY/RETRY_NOWAIT ?
> 
Dropping mmap_sem will have the same effects regardless of FAULT_FLAG_ALLOW_RETRY/RETRY_NOWAIT.
Another thread can unmap the VMA from underneath us, or remap another one in its place.
In the end, the VMA has to be revalidated when re-acquiring the mmap_sem.
Or am I wrong?!

The reason I dropped mmap_sem is because I had to acquire the remote mm->mmap_sem
and tried to avoid any nested semaphore scenarios.
AFAIK the faulting rules allow us to return with mmap_sem dropped when FAULT_FLAG_ALLOW_RETRY
is set, but state nothing about dropping and re-acquiring mmap_sem in the fault handler.

> > +	/* take remote mm semaphore */
> > +	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
> > +		if (!down_read_trylock(&src_mm->mmap_sem)) {
> 
> I don't understand this... perhaps you meant FAULT_FLAG_RETRY_NOWAIT ?

Seems you're right. I never understood the definition of FAULT_FLAG_RETRY_NOWAIT.
@FAULT_FLAG_RETRY_NOWAIT: Don't drop mmap_sem and wait when retrying.
Should have been: Don't drop mmap_sem and don't wait when retrying.
And 'Don't drop mmap_sem' is supposed to mean:
/*
  * NOTE! This will make us return with VM_FAULT_RETRY, but with
  * the mmap_sem still held. That's how FAULT_FLAG_RETRY_NOWAIT
  * is supposed to work. We have way too many special cases..
  */
:~(

> > +	 * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
> > +	 * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
> > +	 * not set.
> 
> Well, iiuc FAULT_FLAG_ALLOW_RETRY means that ->fault() _can_ drop
> mmap_sem and return VM_FAULT_RETRY (unless NOWAIT).

That comment is just copied from elsewhere in the code.
My interpretation was that the fault handler _should_ return with mmap_sem
held or not depending on FAULT_FLAG_RETRY_NOWAIT.

> > +	if (vmf->flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_TRIED)) {
> > +		if (!(vmf->flags & FAULT_FLAG_KILLABLE))
> > +			if (current && fatal_signal_pending(current)) {
> > +				up_read(&current->mm->mmap_sem);
> > +				return VM_FAULT_RETRY;
> 
> I fail to understand this. But in any case, you do not need to check current !=
> NULL and up_read() looks wrong if RETRY_NOWAIT?

My bad. I didn't need to check current != NULL. 
There was the case when the fault was invoked from a kthread in KVM's async_pf
and I confused current with current->mm.

Mircea



* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-07 20:43   ` Andy Lutomirski
@ 2020-09-09 11:38     ` Stefan Hajnoczi
  0 siblings, 0 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2020-09-09 11:38 UTC (permalink / raw)
  To: Andy Lutomirski, Adalbert Lazăr
  Cc: Christian Brauner, Linux-MM, Linux API, Andrew Morton,
	Alexander Graf, Jerome Glisse, Paolo Bonzini, Mihai Donțu,
	Mircea Cirjaliu, Arnd Bergmann, Sargun Dhillon, Aleksa Sarai,
	Oleg Nesterov, Jann Horn, Kees Cook, Matthew Wilcox


On Mon, Sep 07, 2020 at 01:43:48PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > This patchset adds support for the remote mapping feature.
> > > Remote mapping, as its name suggests, is a means for transparent and
> > > zero-copy access of a remote process' address space.
> > >
> > > The feature was designed according to a specification suggested by
> > > Paolo Bonzini:
> > > >> The proposed API is a new pidfd system call, through which the parent
> > > >> can map portions of its virtual address space into a file descriptor
> > > >> and then pass that file descriptor to a child.
> > > >>
> > > >> This should be:
> > > >>
> > > >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> > > >> better way to do PTRACE_{PEEK,POKE}DATA
> >
> > In all honesty, that sentence made me a bit uneasy as it reads like this
> > is implemented on top of pidfds because it makes it more likely to go
> > upstream, not because it is the right design. To be clear, I'm not
> > implying any sort of malicious intent on your part but I would suggest
> > to phrase this a little better. :)
> 
> 
> I thought about this whole thing some more, and here are some thoughts.
> 
> First, I was nervous about two things.  One was faulting in pages from
> the wrong context.  (When a normal page fault or KVM faults in a page,
> the mm is loaded.  (In the KVM case, the mm is sort of not loaded when
> the actual fault happens, but the mm is loaded when the fault is
> handled, I think.  Maybe there are workqueues involved and I'm wrong.)
>  When a remote mapping faults in a page, the mm is *not* loaded.)
> This ought not to be a problem, though -- get_user_pages_remote() also
> faults in pages from a non-current mm, and that's at least supposed to
> work correctly, so maybe this is okay.
> 
> Second is recursion.  I think this is a genuine problem.
> 
> And I think that tying this to pidfds is the wrong approach.  In fact,
> tying it to processes at all seems wrong.  There is a lot of demand
> for various forms of memory isolation in which memory is mapped only
> by its intended user.  Using something tied to a process mm gets in
> the way of this in the same way that KVM's current mapping model gets
> in the way.
> 
> All that being said, I think the whole idea of making fancy address
> spaces composed from other mappable objects is neat and possibly quite
> useful.  And, if you squint a bit, this is a lot like what KVM does
> today.
> 
> So I suggest something that may be more generally useful as an
> alternative.  This is a sketch and very subject to bikeshedding:
> 
> Create an empty address space:
> 
> int address_space_create(int flags, etc);
> 
> Map an fd into an address space:
> 
> int address_space_mmap(int asfd, int fd_to_map, offset, size, prot,
> ...);  /* might run out of args here */
> 
> Unmap from an address space:
> 
> int address_space_munmap(int asfd, unsigned long addr, unsigned long len);
> 
> Stick an address space into KVM:
> 
> ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);  /* or similar */
> 
> Maybe some day allow mapping an address space into a process.
> 
> mmap(..., asfd, ...);
> 
> 
> And at least for now, there's a rule that an address space that is
> address_space_mmapped into an address space is disallowed.
> 
> 
> Maybe some day we also allow mremap(), madvise(), etc.  And maybe some
> day we allow creating a special address_space that represents a real
> process's address space.
> 
> 
> Under the hood, an address_space could own an mm_struct that is not
> used by any tasks.  And we could have special memfds that are bound to
> a VM such that all you can do with them is stick them into an
> address_space and map that address_space into the VM in question.  For
> this to work, we would want a special vm_operation for mapping into a
> VM.
> 
> 
> What do you all think?  Is this useful?  Does it solve your problems?
> Is it a good approach going forward?

Hi Adalbert and Andy,
As everyone continues to discuss how the mechanism should look, I want
to share two use cases for something like this. Let me know if you would
like more detail on these use cases.

The requirement in both cases is that process A can map a virtual
memory range from process B so that mmap/munmap operations within the
memory range in process B also affect process A.

An enforcing vIOMMU for vhost-user and vfio-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user, vfio-user, and other out-of-process device emulation
interfaces need a way for the virtual machine manager (VMM) to enforce
the vIOMMU mappings on the device emulation process. The VMM emulates
the vIOMMU and only wants to expose a subset of memory to the device
emulation process. This subset can change as the guest programs the
vIOMMU.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later.

The new mechanism would allow the VMM to add/remove mappings so that the
device emulation process can only access ranges of memory programmed by
the guest vIOMMU. Accesses to unmapped addresses would raise a signal.
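
A rough sketch of that flow, using the PIDFD_MEM_* control ioctls from
this series (an address_space_* based variant would look much the same);
ctrl_fd and iova_to_hva() are hypothetical VMM-side names:

	/* guest programmed a vIOMMU mapping: expose exactly that range */
	void viommu_map(uint64_t iova, uint64_t len)
	{
		struct pidfd_mem_map m = {
			.address = iova_to_hva(iova),
			.offset  = iova,
			.size    = len,
		};
		ioctl(ctrl_fd, PIDFD_MEM_MAP, &m);
	}

	/* guest tore the mapping down: revoke it from the device process */
	void viommu_unmap(uint64_t iova, uint64_t len)
	{
		struct pidfd_mem_unmap u = { .offset = iova, .size = len };
		ioctl(ctrl_fd, PIDFD_MEM_UNMAP, &u);
	}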

Accelerating the virtio-fs DAX window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process handles guest file map/unmap messages.
The map/unmap messages allow the guest to map ranges of files into its
memory space. The guest kernel then uses DAX to access the file pages
without copying their contents into the guest page cache and mmap
MAP_SHARED is coherent when guests access the same file.

Today virtiofsd sends a message to the VMM over a UNIX domain socket
asking for an mmap/munmap. The VMM must perform the mapping on behalf of
virtiofsd. This communication and file descriptor passing is clumsy and
slow.

The new mechanism would allow virtiofsd to map/unmap without extra
coordination with the VMM. The VMM only needs to perform an initial mmap
of the DAX window so that kvm.ko can resolve page faults to that region.

Stefan



* Re: [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process
  2020-09-09  8:32     ` Mircea CIRJALIU - MELIU
@ 2020-09-10 16:43       ` Oleg Nesterov
  0 siblings, 0 replies; 44+ messages in thread
From: Oleg Nesterov @ 2020-09-10 16:43 UTC (permalink / raw)
  To: Mircea CIRJALIU - MELIU
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Andy Lutomirski, Arnd Bergmann, Sargun Dhillon,
	Aleksa Sarai, Jann Horn, Kees Cook, Matthew Wilcox,
	Christian Brauner

On 09/09, Mircea CIRJALIU - MELIU wrote:
>
> > From: Oleg Nesterov <oleg@redhat.com>
> >
> > But why is it safe to drop ->mmap_sem without checking
> > FAULT_FLAG_ALLOW_RETRY/RETRY_NOWAIT ?
> >
> Dropping mmap_sem will have the same effects regardless of FAULT_FLAG_ALLOW_RETRY/RETRY_NOWAIT.
> Another thread can unmap the VMA from underneath us, or remap another one in its place.
> In the end, the VMA has to be revalidated when re-acquiring the mmap_sem.
> Or am I wrong?!

To simplify, lets forget about RETRY_NOWAIT/TRIED.

Again, I can be easily wrong. But iiuc, you simply can't drop mmap_sem
if FAULT_FLAG_ALLOW_RETRY is not set, the caller doesn't expect mmap_sem
can be unlocked.

OTOH, if FAULT_FLAG_ALLOW_RETRY is set and you drop mmap_sem, you can
only return VM_FAULT_RETRY to let the caller know it was dropped.
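
Spelled out as code, the convention described above would look roughly
like this in a 5.7-era ->fault() handler (a sketch of the rule only, not
this patch's code; failing with SIGBUS when waiting is not allowed is
just one possible policy):

	static vm_fault_t example_fault(struct vm_fault *vmf)
	{
		/* caller does not allow dropping mmap_sem: finish or fail with it held */
		if (!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))
			return VM_FAULT_SIGBUS;

		/* allowed to retry, but not to drop mmap_sem or wait */
		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
			return VM_FAULT_RETRY;

		/* drop mmap_sem; VM_FAULT_RETRY tells the caller it was dropped */
		up_read(&vmf->vma->vm_mm->mmap_sem);
		return VM_FAULT_RETRY;
	}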

> > > +	 * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
> > > +	 * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
> > > +	 * not set.
> >
> > Well, iiuc FAULT_FLAG_ALLOW_RETRY means that ->fault() _can_ drop
> > mmap_sem and return VM_FAULT_RETRY (unless NOWAIT).
>
> That comment is just copied from elsewhere in the code.
> My interpretation was that the fault handler _should_ return with mmap_sem
> held or not depending on FAULT_FLAG_RETRY_NOWAIT.

Yes, this depends on FAULT_FLAG_RETRY_NOWAIT.

But your comment above looks as if the mmap_sem must _always_ be released
if FAULT_FLAG_ALLOW_RETRY && !FAULT_FLAG_RETRY_NOWAIT. This is not true.


Nevermind. If you ever resend this patch, please CC mm/ experts. I tried
to look at this code again and I feel it has many more problems, but as
I said this is not my area.

And I think you should split this patch, mirror_vm_fault() should come in
a separate patch to simplify the review.

Oleg.




* Re: [RESEND RFC PATCH 0/5] Remote mapping
  2020-09-04 19:41   ` Matthew Wilcox
  2020-09-04 19:49     ` Jason Gunthorpe
  2020-09-04 20:08     ` Paolo Bonzini
@ 2020-12-01 18:01     ` Jason Gunthorpe
  2 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2020-12-01 18:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Adalbert Lazăr, linux-mm, linux-api, Andrew Morton,
	Alexander Graf, Stefan Hajnoczi, Jerome Glisse, Paolo Bonzini,
	Mihai Donțu, Mircea Cirjaliu, Andy Lutomirski,
	Arnd Bergmann, Sargun Dhillon, Aleksa Sarai, Oleg Nesterov,
	Jann Horn, Kees Cook, Christian Brauner

On Fri, Sep 04, 2020 at 08:41:39PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > > process address space within the specified range. Pages are installed in the
> > > current process page tables at fault time and removed by the mmu_interval_notifier
> > > invalidate callbck. No further memory management is involved.
> > > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > > or if the remote process address space was reaped by OOM, the remote mapping
> > > fault handler returns VM_FAULT_SIGBUS.
> > 
> > I still think anything along these lines needs to meet the XPMEM use
> > cases as well, we have to have more general solutions for such MM
> > stuff:
> > 
> > https://gitlab.com/hjelmn/xpmem
> > 
> > However, I think this fundamentally falls into some of the same bad
> > direction as xpmem.
> > 
> > I would much rather see this design copy & clone the VMA's than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant
> > VMA and somehow split/mirror the VMA ops.
> 
> I'm on holiday for the next few days, but does the mshare() API work for
> your use case?
> 
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start at implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

I found some interest in this project here, with the detail that
pin_user_pages() should keep working, ideally on both sides of the
share, but essentially on the side that calls mshare()

Maybe we can help out, at least cc me if you make progress :)

Thanks,
Jason



end of thread

Thread overview: 44+ messages
2020-09-04 11:31 [RESEND RFC PATCH 0/5] Remote mapping Adalbert Lazăr
2020-09-04 11:31 ` [RESEND RFC PATCH 1/5] mm: add atomic capability to zap_details Adalbert Lazăr
2020-09-04 11:31 ` [RESEND RFC PATCH 2/5] mm: let the VMA decide how zap_pte_range() acts on mapped pages Adalbert Lazăr
2020-09-04 11:31 ` [RESEND RFC PATCH 3/5] mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in nested scenarios Adalbert Lazăr
2020-09-04 12:03   ` Jason Gunthorpe
2020-09-04 11:31 ` [RESEND RFC PATCH 4/5] mm/remote_mapping: use a pidfd to access memory belonging to unrelated process Adalbert Lazăr
2020-09-04 17:55   ` Oleg Nesterov
2020-09-07 14:30   ` Oleg Nesterov
2020-09-07 15:16     ` Adalbert Lazăr
2020-09-09  8:32     ` Mircea CIRJALIU - MELIU
2020-09-10 16:43       ` Oleg Nesterov
2020-09-07 15:02   ` Christian Brauner
2020-09-07 16:04     ` Mircea CIRJALIU - MELIU
2020-09-04 11:31 ` [RESEND RFC PATCH 5/5] pidfd_mem: implemented remote memory mapping system call Adalbert Lazăr
2020-09-04 19:18   ` Florian Weimer
2020-09-07 14:55   ` Christian Brauner
2020-09-04 12:11 ` [RESEND RFC PATCH 0/5] Remote mapping Jason Gunthorpe
2020-09-04 13:24   ` Mircea CIRJALIU - MELIU
2020-09-04 13:39     ` Jason Gunthorpe
2020-09-04 14:18       ` Mircea CIRJALIU - MELIU
2020-09-04 14:39         ` Jason Gunthorpe
2020-09-04 15:40           ` Mircea CIRJALIU - MELIU
2020-09-04 16:11             ` Jason Gunthorpe
2020-09-04 19:41   ` Matthew Wilcox
2020-09-04 19:49     ` Jason Gunthorpe
2020-09-04 20:08     ` Paolo Bonzini
2020-12-01 18:01     ` Jason Gunthorpe
2020-09-04 19:19 ` Florian Weimer
2020-09-04 20:18   ` Paolo Bonzini
2020-09-07  8:33     ` Christian Brauner
2020-09-04 19:39 ` Andy Lutomirski
2020-09-04 20:09   ` Paolo Bonzini
2020-09-04 20:34     ` Andy Lutomirski
2020-09-04 21:58       ` Paolo Bonzini
2020-09-04 23:17         ` Andy Lutomirski
2020-09-05 18:27           ` Paolo Bonzini
2020-09-07  8:38             ` Christian Brauner
2020-09-07 12:41           ` Mircea CIRJALIU - MELIU
2020-09-07  7:05         ` Christoph Hellwig
2020-09-07  8:44           ` Paolo Bonzini
2020-09-07 10:25   ` Mircea CIRJALIU - MELIU
2020-09-07 15:05 ` Christian Brauner
2020-09-07 20:43   ` Andy Lutomirski
2020-09-09 11:38     ` Stefan Hajnoczi
