linux-mm.kvack.org archive mirror
* [RFC 0/6] mm, x86: New special mapping ops
@ 2014-10-30  0:42 Andy Lutomirski
  2014-10-30  0:42 ` [RFC 1/6] mm: Add a mechanism to track the current address of a special mapping Andy Lutomirski
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:42 UTC (permalink / raw)
  To: akpm, linux-mm, x86; +Cc: linux-kernel, Andy Lutomirski

This is an attempt to make the core special mapping infrastructure
track what arch vdso code needs better than it currently does.  It
adds:

.start_addr_set: A callback to notify arch code that a special mapping
was mremapped.  (CRIU does this.  Without something like this, it's
somewhat broken for 64-bit userspace and completely broken for 32-bit
userspace on Intel hardware.  Apparently no one has noticed the 64-bit
breakage, and no one ever ported CRIU to 32-bit in the first place.)
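
As a rough illustration, arch code could wire up the new callback like
this (the callback signature is the one added in patch 1; the
vdso_pages array and the mm->context field written here are only
hypothetical placeholders):

static void vdso_start_addr_set(struct vm_special_mapping *sm,
				struct mm_struct *mm,
				unsigned long start_addr)
{
	/* Hypothetical per-mm field: remember where the vdso is now. */
	mm->context.vdso_base = start_addr;
}

static struct vm_special_mapping vdso_mapping = {
	.name = "[vdso]",
	.pages = vdso_pages,	/* hypothetical page array */
	.start_addr_set = vdso_start_addr_set,
};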

.fault: Direct fault handling on the vdso.  Imagine that!  It turns
out that storing a list of struct page pointers in the special mapping
data is awkward for pretty much everyone and completely precludes
mapping things that aren't pages without dirty hacks.  (x86 uses dirty
hacks for the HPET mapping.  See below.)
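
A minimal sketch of a handler using the new hook (the signature is the
one added in patch 3; my_lookup_page is a made-up helper):

static int my_special_fault(struct vm_special_mapping *sm,
			    struct vm_area_struct *vma,
			    struct vm_fault *vmf)
{
	/* vmf->pgoff is already relative to the start of the mapping. */
	struct page *page = my_lookup_page(vmf->pgoff);	/* hypothetical */

	if (!page)
		return VM_FAULT_SIGBUS;

	get_page(page);
	vmf->page = page;
	return 0;
}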

vm_insert_pfn_prot: The only way to support VMAs with different
protections on different pages right now is to either use
(io_)remap_pfn_range or to twiddle the ptes directly.  This is annoying.

One might ask why anyone would ever want different prot values in the
same VMA.  It turns out that x86 maps the HPET into the vvar area, and
the HPET needs to be uncached.

I think that this kind of trick makes no sense on a COW-able mapping or
on any mapping that isn't a pure PFN mapping.  The new interface
enforces this.
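
Concretely, a .fault handler can insert one uncached HPET page at the
faulting address with something like the following (this is essentially
what patch 5 does; hpet_address comes from the existing HPET driver):

	ret = vm_insert_pfn_prot(vma, (unsigned long)vmf->virtual_address,
				 hpet_address >> PAGE_SHIFT,
				 pgprot_noncached(PAGE_READONLY));
	return ret == 0 ? VM_FAULT_NOPAGE : VM_FAULT_SIGBUS;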

The x86 parts are in here mainly as examples for how the new core
interfaces would be used.  I don't know of anything wrong with them,
but I would not go so far as to pretend that I've tested them adequately.

Andy Lutomirski (6):
  mm: Add a mechanism to track the current address of a special mapping
  x86,vdso: Use special mapping tracking for the vdso
  mm: Add a vm_special_mapping .fault method
  mm: Add vm_insert_pfn_prot
  x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping
  x86,vdso: Use .fault for the vdso text mapping

 arch/x86/ia32/ia32_signal.c |  11 ++--
 arch/x86/include/asm/elf.h  |  26 +++-----
 arch/x86/include/asm/mmu.h  |   4 +-
 arch/x86/include/asm/vdso.h |  19 +++++-
 arch/x86/kernel/signal.c    |   9 +--
 arch/x86/vdso/vdso2c.h      |   7 ---
 arch/x86/vdso/vma.c         | 141 +++++++++++++++++++++++++++++++-------------
 include/linux/mm.h          |   5 ++
 include/linux/mm_types.h    |  26 +++++++-
 mm/memory.c                 |  25 +++++++-
 mm/mmap.c                   |  38 +++++++++---
 mm/mremap.c                 |   2 +
 12 files changed, 221 insertions(+), 92 deletions(-)

-- 
1.9.3


* [RFC 1/6] mm: Add a mechanism to track the current address of a special mapping
  2014-10-30  0:42 [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
@ 2014-10-30  0:42 ` Andy Lutomirski
  2014-10-30  0:42 ` [RFC 2/6] x86,vdso: Use special mapping tracking for the vdso Andy Lutomirski
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:42 UTC (permalink / raw)
  To: akpm, linux-mm, x86; +Cc: linux-kernel, Andy Lutomirski

This adds code to record the start address of a special mapping in
mm->context.  Something like this is needed to enable arch code to
find the vdso or another special mapping if that mapping has been
mremapped.

CRIU remaps special mappings, so this isn't just hypothetical.

Most vdso-using architectures already record the vdso address in
mm->context.  Some of them do so only for arch_vma_name, which is no
longer necessary.  Others need it for real:

 - x86_32 (native and compat) need it for the sigreturn,
   rt_sigreturn, and sysenter return thunks.

 - ARM could, in principle, use this to make its kuser helpers
   relocatable.  (I don't think it will, but it *could*.)

 - x86 may, in the near future, want to change vvar context, per-mm,
   in response to a prctl or other request.  This could, for
   example, be used to turn off RDTSC (using CR4.TSD) without
   crashing the target process.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/mm.h       |  3 +++
 include/linux/mm_types.h |  8 ++++++++
 mm/mmap.c                | 24 +++++++++++++++++++++---
 mm/mremap.c              |  2 ++
 4 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc882ed2..66bc9a37ae17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1796,6 +1796,9 @@ extern int install_special_mapping(struct mm_struct *mm,
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
+/* Internal helper to update mm context after the vma is moved. */
+extern void update_special_mapping_addr(struct vm_area_struct *vma);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286649f1..ad6652fe3671 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -515,6 +515,14 @@ struct vm_special_mapping
 {
 	const char *name;
 	struct page **pages;
+
+	/*
+	 * If non-NULL, this is called when installed and when mremap
+	 * moves the first page of the mapping.
+	 */
+	void (*start_addr_set)(struct vm_special_mapping *sm,
+			       struct mm_struct *mm,
+			       unsigned long start_addr);
 };
 
 enum tlb_flush_reason {
diff --git a/mm/mmap.c b/mm/mmap.c
index c0a3637cdb64..8c398b9ee225 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2923,8 +2923,21 @@ static const struct vm_operations_struct legacy_special_mapping_vmops = {
 	.fault = special_mapping_fault,
 };
 
+void update_special_mapping_addr(struct vm_area_struct *vma)
+{
+	struct vm_special_mapping *sm;
+
+	if (vma->vm_ops != &special_mapping_vmops)
+		return;
+
+	sm = vma->vm_private_data;
+	if (sm->start_addr_set &&
+	    vma->vm_start == (vma->vm_pgoff << PAGE_SHIFT))
+		sm->start_addr_set(sm, vma->vm_mm, vma->vm_start);
+}
+
 static int special_mapping_fault(struct vm_area_struct *vma,
-				struct vm_fault *vmf)
+				 struct vm_fault *vmf)
 {
 	pgoff_t pgoff;
 	struct page **pages;
@@ -3009,8 +3022,13 @@ struct vm_area_struct *_install_special_mapping(
 	unsigned long addr, unsigned long len,
 	unsigned long vm_flags, const struct vm_special_mapping *spec)
 {
-	return __install_special_mapping(mm, addr, len, vm_flags,
-					 &special_mapping_vmops, (void *)spec);
+	struct vm_area_struct *vma;
+
+	vma = __install_special_mapping(mm, addr, len, vm_flags,
+					&special_mapping_vmops, (void *)spec);
+	if (!IS_ERR(vma))
+		update_special_mapping_addr(vma);
+	return vma;
 }
 
 int install_special_mapping(struct mm_struct *mm,
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180e9f21..7a0b79fdf60f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -287,6 +287,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		old_len = new_len;
 		old_addr = new_addr;
 		new_addr = -ENOMEM;
+	} else {
+		update_special_mapping_addr(new_vma);
 	}
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
-- 
1.9.3


* [RFC 2/6] x86,vdso: Use special mapping tracking for the vdso
  2014-10-30  0:42 [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
  2014-10-30  0:42 ` [RFC 1/6] mm: Add a mechanism to track the current address of a special mapping Andy Lutomirski
@ 2014-10-30  0:42 ` Andy Lutomirski
  2014-10-30  0:42 ` [RFC 3/6] mm: Add a vm_special_mapping .fault method Andy Lutomirski
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:42 UTC (permalink / raw)
  To: akpm, linux-mm, x86; +Cc: linux-kernel, Andy Lutomirski

This should give full support for mremap on the vdso except for
sysenter return.  It will also enable future vvar twiddling on
already-started processes.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/ia32/ia32_signal.c | 11 ++++-------
 arch/x86/include/asm/elf.h  | 26 ++++++++-----------------
 arch/x86/include/asm/mmu.h  |  4 +++-
 arch/x86/include/asm/vdso.h | 16 +++++++++++++++
 arch/x86/kernel/signal.c    |  9 +++------
 arch/x86/vdso/vma.c         | 47 ++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 74 insertions(+), 39 deletions(-)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index f9e181aaba97..3b335c674059 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -381,11 +381,8 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,
 	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
 		restorer = ksig->ka.sa.sa_restorer;
 	} else {
-		/* Return stub is in 32bit vsyscall page */
-		if (current->mm->context.vdso)
-			restorer = current->mm->context.vdso +
-				selected_vdso32->sym___kernel_sigreturn;
-		else
+		restorer = VDSO_SYM_ADDR(current->mm, __kernel_sigreturn);
+		if (!restorer)
 			restorer = &frame->retcode;
 	}
 
@@ -462,8 +459,8 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig,
 		if (ksig->ka.sa.sa_flags & SA_RESTORER)
 			restorer = ksig->ka.sa.sa_restorer;
 		else
-			restorer = current->mm->context.vdso +
-				selected_vdso32->sym___kernel_rt_sigreturn;
+			restorer = VDSO_SYM_ADDR(current->mm,
+						 __kernel_rt_sigreturn);
 		put_user_ex(ptr_to_compat(restorer), &frame->pretcode);
 
 		/*
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 1a055c81d864..05df8f03faa5 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -276,7 +276,7 @@ struct task_struct;
 
 #define	ARCH_DLINFO_IA32						\
 do {									\
-	if (vdso32_enabled) {						\
+	if (current->mm->context.vdso_image) {				\
 		NEW_AUX_ENT(AT_SYSINFO,	VDSO_ENTRY);			\
 		NEW_AUX_ENT(AT_SYSINFO_EHDR, VDSO_CURRENT_BASE);	\
 	}								\
@@ -295,26 +295,19 @@ do {									\
 /* 1GB for 64bit, 8MB for 32bit */
 #define STACK_RND_MASK (test_thread_flag(TIF_ADDR32) ? 0x7ff : 0x3fffff)
 
-#define ARCH_DLINFO							\
+#define ARCH_DLINFO_X86_64						\
 do {									\
-	if (vdso64_enabled)						\
-		NEW_AUX_ENT(AT_SYSINFO_EHDR,				\
-			    (unsigned long __force)current->mm->context.vdso); \
+	if (current->mm->context.vdso_image)				\
+		NEW_AUX_ENT(AT_SYSINFO_EHDR, VDSO_CURRENT_BASE);	\
 } while (0)
 
-/* As a historical oddity, the x32 and x86_64 vDSOs are controlled together. */
-#define ARCH_DLINFO_X32							\
-do {									\
-	if (vdso64_enabled)						\
-		NEW_AUX_ENT(AT_SYSINFO_EHDR,				\
-			    (unsigned long __force)current->mm->context.vdso); \
-} while (0)
+#define ARCH_DLINFO ARCH_DLINFO_X86_64
 
 #define AT_SYSINFO		32
 
 #define COMPAT_ARCH_DLINFO						\
 if (test_thread_flag(TIF_X32))						\
-	ARCH_DLINFO_X32;						\
+	ARCH_DLINFO_X86_64;						\
 else									\
 	ARCH_DLINFO_IA32
 
@@ -322,11 +315,8 @@ else									\
 
 #endif /* !CONFIG_X86_32 */
 
-#define VDSO_CURRENT_BASE	((unsigned long)current->mm->context.vdso)
-
-#define VDSO_ENTRY							\
-	((unsigned long)current->mm->context.vdso +			\
-	 selected_vdso32->sym___kernel_vsyscall)
+#define VDSO_CURRENT_BASE	((unsigned long)vdso_text_start(current->mm))
+#define VDSO_ENTRY ((unsigned long)VDSO_SYM_ADDR(current->mm, __kernel_vsyscall))
 
 struct linux_binprm;
 
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 876e74e8eec7..bbba90ebd2c8 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -18,7 +18,9 @@ typedef struct {
 #endif
 
 	struct mutex lock;
-	void __user *vdso;
+
+	unsigned long vvar_vma_start;
+	const struct vdso_image *vdso_image;
 } mm_context_t;
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 8021bd28c0f1..3aa1f830c551 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -49,6 +49,22 @@ extern const struct vdso_image *selected_vdso32;
 
 extern void __init init_vdso_image(const struct vdso_image *image);
 
+static inline void __user *vdso_text_start(const struct mm_struct *mm)
+{
+	if (!mm->context.vdso_image)
+		return NULL;
+
+	return (void __user *)ACCESS_ONCE(mm->context.vvar_vma_start) -
+		mm->context.vdso_image->sym_vvar_start;
+}
+
+#define VDSO_SYM_ADDR(mm, sym) (					\
+		(mm)->context.vdso_image ?				\
+		vdso_text_start((mm)) +					\
+			(mm)->context.vdso_image->sym_ ## sym		\
+		: NULL							\
+	)
+
 #endif /* __ASSEMBLER__ */
 
 #endif /* _ASM_X86_VDSO_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 2851d63c1202..d8b21e37e292 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -297,10 +297,8 @@ __setup_frame(int sig, struct ksignal *ksig, sigset_t *set,
 			return -EFAULT;
 	}
 
-	if (current->mm->context.vdso)
-		restorer = current->mm->context.vdso +
-			selected_vdso32->sym___kernel_sigreturn;
-	else
+	restorer = VDSO_SYM_ADDR(current->mm, __kernel_sigreturn);
+	if (!restorer)
 		restorer = &frame->retcode;
 	if (ksig->ka.sa.sa_flags & SA_RESTORER)
 		restorer = ksig->ka.sa.sa_restorer;
@@ -362,8 +360,7 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
 		save_altstack_ex(&frame->uc.uc_stack, regs->sp);
 
 		/* Set up to return from userspace.  */
-		restorer = current->mm->context.vdso +
-			selected_vdso32->sym___kernel_rt_sigreturn;
+		restorer = VDSO_SYM_ADDR(current->mm, __kernel_rt_sigreturn);
 		if (ksig->ka.sa.sa_flags & SA_RESTORER)
 			restorer = ksig->ka.sa.sa_restorer;
 		put_user_ex(restorer, &frame->pretcode);
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 970463b566cf..7f99c2ed1a3e 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -89,6 +89,38 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)
 #endif
 }
 
+static void vvar_start_set(struct vm_special_mapping *sm,
+			   struct mm_struct *mm, unsigned long start_addr)
+{
+	if (start_addr >= TASK_SIZE_MAX - mm->context.vdso_image->size) {
+		/*
+		 * We were just relocated out of bounds.  Malicious
+		 * user code can cause this by mremapping only the
+		 * first page of a multi-page vdso.
+		 *
+		 * We can't actually fail here, but it's not safe to
+		 * allow vdso symbols to resolve to potentially
+		 * non-canonical addresses.  Instead, just ignore
+		 * the update.
+		 */
+
+		return;
+	}
+
+	mm->context.vvar_vma_start = start_addr;
+
+	/*
+	 * If we're here as a result of an mremap call, there are two
+	 * major gotchas.  First, if that call came from the vdso, we're
+	 * about to crash (i.e. don't do that).  Second, if we have more
+	 * than one thread, this won't update the other threads.
+	 */
+	if (mm->context.vdso_image->sym_VDSO32_SYSENTER_RETURN)
+		current_thread_info()->sysenter_return =
+			VDSO_SYM_ADDR(current->mm, VDSO32_SYSENTER_RETURN);
+
+}
+
 static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 {
 	struct mm_struct *mm = current->mm;
@@ -99,6 +131,12 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	static struct vm_special_mapping vvar_mapping = {
 		.name = "[vvar]",
 		.pages = no_pages,
+
+		/*
+		 * Tracking the vdso is roughly equivalent to tracking the
+		 * vvar area, and the latter is slightly easier.
+		 */
+		.start_addr_set = vvar_start_set,
 	};
 
 	if (calculate_addr) {
@@ -118,7 +156,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	}
 
 	text_start = addr - image->sym_vvar_start;
-	current->mm->context.vdso = (void __user *)text_start;
+	current->mm->context.vdso_image = image;
 
 	/*
 	 * MAYWRITE to allow gdb to COW and set breakpoints
@@ -171,7 +209,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 
 up_fail:
 	if (ret)
-		current->mm->context.vdso = NULL;
+		current->mm->context.vdso_image = NULL;
 
 	up_write(&mm->mmap_sem);
 	return ret;
@@ -189,11 +227,6 @@ static int load_vdso32(void)
 	if (ret)
 		return ret;
 
-	if (selected_vdso32->sym_VDSO32_SYSENTER_RETURN)
-		current_thread_info()->sysenter_return =
-			current->mm->context.vdso +
-			selected_vdso32->sym_VDSO32_SYSENTER_RETURN;
-
 	return 0;
 }
 #endif
-- 
1.9.3


* [RFC 3/6] mm: Add a vm_special_mapping .fault method
  2014-10-30  0:42 [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
  2014-10-30  0:42 ` [RFC 1/6] mm: Add a mechanism to track the current address of a special mapping Andy Lutomirski
  2014-10-30  0:42 ` [RFC 2/6] x86,vdso: Use special mapping tracking for the vdso Andy Lutomirski
@ 2014-10-30  0:42 ` Andy Lutomirski
  2014-10-30  0:42 ` [RFC 4/6] mm: Add vm_insert_pfn_prot Andy Lutomirski
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:42 UTC (permalink / raw)
  To: akpm, linux-mm, x86; +Cc: linux-kernel, Andy Lutomirski

Requiring special mappings to give a list of struct pages is
inflexible: it prevents sane use of IO memory in a special mapping,
it's inefficient (it requires arch code to initialize a list of
struct pages, and it requires the mm core to walk the entire list
just to figure out how long it is), and it prevents arch code from
doing anything fancy when a special mapping fault occurs.

Add a .fault method as an alternative to filling in a .pages array.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/mm_types.h | 18 +++++++++++++++++-
 mm/mmap.c                | 14 ++++++++++----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ad6652fe3671..cc96c63b1002 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -511,12 +511,28 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm)
 }
 #endif
 
+struct vm_fault;
+
 struct vm_special_mapping
 {
-	const char *name;
+	const char *name;	/* The name, e.g. "[vdso]". */
+
+	/*
+	 * If .fault is not provided, this points to a
+	 * NULL-terminated array of pages that back the special mapping.
+	 *
+	 * This must not be NULL unless .fault is provided.
+	 */
 	struct page **pages;
 
 	/*
+	 * If non-NULL, then this is called to resolve page faults
+	 * on the special mapping.  If used, .pages is not checked.
+	 */
+	int (*fault)(struct vm_special_mapping *sm, struct vm_area_struct *vma,
+		     struct vm_fault *vmf);
+
+	/*
 	 * If non-NULL, this is called when installed and when mremap
 	 * moves the first page of the mapping.
 	 */
diff --git a/mm/mmap.c b/mm/mmap.c
index 8c398b9ee225..d27572e3e4f4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2950,11 +2950,17 @@ static int special_mapping_fault(struct vm_area_struct *vma,
 	 */
 	pgoff = vmf->pgoff - vma->vm_pgoff;
 
-	if (vma->vm_ops == &legacy_special_mapping_vmops)
+	if (vma->vm_ops == &legacy_special_mapping_vmops) {
 		pages = vma->vm_private_data;
-	else
-		pages = ((struct vm_special_mapping *)vma->vm_private_data)->
-			pages;
+	} else {
+		struct vm_special_mapping *sm = vma->vm_private_data;
+		if (sm->fault) {
+			vmf->pgoff = pgoff;
+			return sm->fault(sm, vma, vmf);
+		} else {
+			pages = sm->pages;
+		}
+	}
 
 	for (; pgoff && *pages; ++pages)
 		pgoff--;
-- 
1.9.3


* [RFC 4/6] mm: Add vm_insert_pfn_prot
  2014-10-30  0:42 [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
                   ` (2 preceding siblings ...)
  2014-10-30  0:42 ` [RFC 3/6] mm: Add a vm_special_mapping .fault method Andy Lutomirski
@ 2014-10-30  0:42 ` Andy Lutomirski
  2014-10-30  0:42 ` [RFC 5/6] x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping Andy Lutomirski
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:42 UTC (permalink / raw)
  To: akpm, linux-mm, x86; +Cc: linux-kernel, Andy Lutomirski

The x86 vvar mapping contains pages with differing cacheability
flags.  This is currently only supported using (io_)remap_pfn_range,
but those functions can't be used inside page faults.

Add vm_insert_pfn_prot to support varying cacheability within the
same non-COW VMA in a more sane manner.

x86 needs this to avoid a CRIU-breaking and memory-wasting explosion
of VMAs when supporting userspace access to the HPET.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 25 +++++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 66bc9a37ae17..8f1fa43cf615 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1960,6 +1960,8 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
+int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index adeac306610f..f80cea300729 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1598,8 +1598,29 @@ out:
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn)
 {
+	return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
+}
+EXPORT_SYMBOL(vm_insert_pfn);
+
+/**
+ * vm_insert_pfn_prot - insert single pfn into user vma with specified pgprot
+ * @vma: user vma to map to
+ * @addr: target user address of this page
+ * @pfn: source kernel pfn
+ * @pgprot: pgprot flags for the inserted page
+ *
+ * This is exactly like vm_insert_pfn, except that it allows drivers
+ * to override pgprot on a per-page basis.
+ *
+ * This only makes sense for IO mappings, and it makes no sense for
+ * cow mappings.  In general, using multiple vmas is preferable;
+ * vm_insert_pfn_prot should only be used if using multiple VMAs is
+ * impractical.
+ */
+int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
+			unsigned long pfn, pgprot_t pgprot)
+{
 	int ret;
-	pgprot_t pgprot = vma->vm_page_prot;
 	/*
 	 * Technically, architectures with pte_special can avoid all these
 	 * restrictions (same for remap_pfn_range).  However we would like
@@ -1621,7 +1642,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 
 	return ret;
 }
-EXPORT_SYMBOL(vm_insert_pfn);
+EXPORT_SYMBOL(vm_insert_pfn_prot);
 
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn)
-- 
1.9.3


* [RFC 5/6] x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping
  2014-10-30  0:42 [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
                   ` (3 preceding siblings ...)
  2014-10-30  0:42 ` [RFC 4/6] mm: Add vm_insert_pfn_prot Andy Lutomirski
@ 2014-10-30  0:42 ` Andy Lutomirski
  2014-10-30  0:42 ` [RFC 6/6] x86,vdso: Use .fault for the vdso text mapping Andy Lutomirski
  2014-10-30  0:57 ` [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
  6 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:42 UTC (permalink / raw)
  To: akpm, linux-mm, x86; +Cc: linux-kernel, Andy Lutomirski

This is IMO much less ugly, and it also opens the door to
disallowing unprivileged userspace HPET access on systems with
usable TSCs.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/vdso/vma.c | 68 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 26 deletions(-)

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 7f99c2ed1a3e..5cde3b82d1e9 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -121,16 +121,54 @@ static void vvar_start_set(struct vm_special_mapping *sm,
 
 }
 
+static int vvar_fault(struct vm_special_mapping *sm,
+		      struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	const struct vdso_image *image = vma->vm_mm->context.vdso_image;
+	long sym_offset;
+	int ret = -EFAULT;
+
+	if (!image)
+		return VM_FAULT_SIGBUS;
+	sym_offset = (long)(vmf->pgoff << PAGE_SHIFT) +
+		image->sym_vvar_start;
+
+	/*
+	 * Sanity check: a symbol offset of zero means that the page
+	 * does not exist for this vdso image, not that the page is at
+	 * offset zero relative to the text mapping.  This should be
+	 * impossible here, because sym_offset should only be zero for
+	 * the page past the end of the vvar mapping.
+	 */
+	if (sym_offset == 0)
+		return VM_FAULT_SIGBUS;
+
+	if (sym_offset == image->sym_vvar_page)
+		ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
+				    __pa_symbol(&__vvar_page) >> PAGE_SHIFT);
+#ifdef CONFIG_HPET_TIMER
+	else if (hpet_address && sym_offset == image->sym_hpet_page)
+		ret = vm_insert_pfn_prot(vma,
+					 (unsigned long)vmf->virtual_address,
+					 hpet_address >> PAGE_SHIFT,
+					 pgprot_noncached(PAGE_READONLY));
+#endif
+
+	if (ret == 0)
+		return VM_FAULT_NOPAGE;
+
+	return VM_FAULT_SIGBUS;
+}
+
 static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long addr, text_start;
 	int ret = 0;
-	static struct page *no_pages[] = {NULL};
 	static struct vm_special_mapping vvar_mapping = {
 		.name = "[vvar]",
-		.pages = no_pages,
+		.fault = vvar_fault,
 
 		/*
 		 * Tracking the vdso is roughly equivalent to tracking the
@@ -176,7 +214,8 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	vma = _install_special_mapping(mm,
 				       addr,
 				       -image->sym_vvar_start,
-				       VM_READ|VM_MAYREAD,
+				       VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
+				       VM_PFNMAP,
 				       &vvar_mapping);
 
 	if (IS_ERR(vma)) {
@@ -184,29 +223,6 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 		goto up_fail;
 	}
 
-	if (image->sym_vvar_page)
-		ret = remap_pfn_range(vma,
-				      text_start + image->sym_vvar_page,
-				      __pa_symbol(&__vvar_page) >> PAGE_SHIFT,
-				      PAGE_SIZE,
-				      PAGE_READONLY);
-
-	if (ret)
-		goto up_fail;
-
-#ifdef CONFIG_HPET_TIMER
-	if (hpet_address && image->sym_hpet_page) {
-		ret = io_remap_pfn_range(vma,
-			text_start + image->sym_hpet_page,
-			hpet_address >> PAGE_SHIFT,
-			PAGE_SIZE,
-			pgprot_noncached(PAGE_READONLY));
-
-		if (ret)
-			goto up_fail;
-	}
-#endif
-
 up_fail:
 	if (ret)
 		current->mm->context.vdso_image = NULL;
-- 
1.9.3


* [RFC 6/6] x86,vdso: Use .fault for the vdso text mapping
  2014-10-30  0:42 [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
                   ` (4 preceding siblings ...)
  2014-10-30  0:42 ` [RFC 5/6] x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping Andy Lutomirski
@ 2014-10-30  0:42 ` Andy Lutomirski
  2014-10-30  0:57 ` [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
  6 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:42 UTC (permalink / raw)
  To: akpm, linux-mm, x86; +Cc: linux-kernel, Andy Lutomirski

The old scheme for mapping the vdso text is rather complicated.  vdso2c
generates a struct vm_special_mapping and a blank .pages array of the
correct size for each vdso image.  Init code in vdso/vma.c populates
the .pages array for each vdso image, and the mapping code selects
the appropriate struct vm_special_mapping.

With .fault, we can use a less roundabout approach: vdso_fault
just returns the appropriate page for the selected vdso image.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/vdso.h |  3 ---
 arch/x86/vdso/vdso2c.h      |  7 -------
 arch/x86/vdso/vma.c         | 26 +++++++++++++++++++-------
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 3aa1f830c551..b730e7a74323 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -13,9 +13,6 @@ struct vdso_image {
 	void *data;
 	unsigned long size;   /* Always a multiple of PAGE_SIZE */
 
-	/* text_mapping.pages is big enough for data/size page pointers */
-	struct vm_special_mapping text_mapping;
-
 	unsigned long alt, alt_len;
 
 	long sym_vvar_start;  /* Negative offset to the vvar area */
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index fd57829b30d8..279f7af7cf5e 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -148,16 +148,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
 	}
 	fprintf(outfile, "\n};\n\n");
 
-	fprintf(outfile, "static struct page *pages[%lu];\n\n",
-		mapping_size / 4096);
-
 	fprintf(outfile, "const struct vdso_image %s = {\n", name);
 	fprintf(outfile, "\t.data = raw_data,\n");
 	fprintf(outfile, "\t.size = %lu,\n", mapping_size);
-	fprintf(outfile, "\t.text_mapping = {\n");
-	fprintf(outfile, "\t\t.name = \"[vdso]\",\n");
-	fprintf(outfile, "\t\t.pages = pages,\n");
-	fprintf(outfile, "\t},\n");
 	if (alt_sec) {
 		fprintf(outfile, "\t.alt = %lu,\n",
 			(unsigned long)GET_LE(&alt_sec->sh_offset));
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 5cde3b82d1e9..0ae947eb7433 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -25,13 +25,7 @@ extern unsigned short vdso_sync_cpuid;
 
 void __init init_vdso_image(const struct vdso_image *image)
 {
-	int i;
-	int npages = (image->size) / PAGE_SIZE;
-
 	BUG_ON(image->size % PAGE_SIZE != 0);
-	for (i = 0; i < npages; i++)
-		image->text_mapping.pages[i] =
-			virt_to_page(image->data + i*PAGE_SIZE);
 
 	apply_alternatives((struct alt_instr *)(image->data + image->alt),
 			   (struct alt_instr *)(image->data + image->alt +
@@ -160,6 +154,24 @@ static int vvar_fault(struct vm_special_mapping *sm,
 	return VM_FAULT_SIGBUS;
 }
 
+static int vdso_fault(struct vm_special_mapping *sm,
+		      struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	const struct vdso_image *image = vma->vm_mm->context.vdso_image;
+
+	if (!image || (vmf->pgoff << PAGE_SHIFT) >= image->size)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = virt_to_page(image->data + (vmf->pgoff << PAGE_SHIFT));
+	get_page(vmf->page);
+	return 0;
+}
+
+static struct vm_special_mapping text_mapping = {
+	.name = "[vdso]",
+	.fault = vdso_fault,
+};
+
 static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 {
 	struct mm_struct *mm = current->mm;
@@ -204,7 +216,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 				       image->size,
 				       VM_READ|VM_EXEC|
 				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-				       &image->text_mapping);
+				       &text_mapping);
 
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
-- 
1.9.3


* Re: [RFC 0/6] mm, x86: New special mapping ops
  2014-10-30  0:42 [RFC 0/6] mm, x86: New special mapping ops Andy Lutomirski
                   ` (5 preceding siblings ...)
  2014-10-30  0:42 ` [RFC 6/6] x86,vdso: Use .fault for the vdso text mapping Andy Lutomirski
@ 2014-10-30  0:57 ` Andy Lutomirski
  6 siblings, 0 replies; 8+ messages in thread
From: Andy Lutomirski @ 2014-10-30  0:57 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, X86 ML; +Cc: linux-kernel, Andy Lutomirski

On Wed, Oct 29, 2014 at 5:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> This is an attempt to make the core special mapping infrastructure
> track what arch vdso code needs better than it currently does.  It
> adds:
>
> .start_addr_set: A callback to notify arch code that a special mapping
> was mremapped.  (CRIU does this.  Without something like this, it's
> somewhat broken for 64-bit userspace and completely broken for 32-bit
> userspace on Intel hardware.  Apparently no one has noticed the 64-bit
> breakage, and no one ever ported CRIU to 32-bit in the first place.)
>
> .fault: Direct fault handling on the vdso.  Imagine that!  It turns
> out that storing a list of struct page pointers in the special mapping
> data is awkward for pretty much everyone and completely precludes
> mapping things that aren't pages without dirty hacks.  (x86 uses dirty
> hacks for the HPET mapping.  See below.)

I should add that there's further motivation for this.  I want to change the x86
vdso code so that the HPET is only mapped if it's actually in use.  Getting
this right is delicate, but it's almost impossible without this change.

In particular, if the HPET gets selected due to TSC instability after
boot, then there's no good way to start allowing access right now.
I'd have to remap_pfn_range on all mms at (egads!) an unknown address,
whereas now I can just start accepting the reference in .fault.
Getting the other direction right is tricky, but it's doable in a
number of ways.
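
As a sketch of how the .fault approach could gate HPET access at
runtime (hpet_allow_userspace is purely hypothetical; the rest follows
the vvar_fault handler from patch 5):

	if (hpet_address && ACCESS_ONCE(hpet_allow_userspace) &&
	    sym_offset == image->sym_hpet_page)
		ret = vm_insert_pfn_prot(vma,
					 (unsigned long)vmf->virtual_address,
					 hpet_address >> PAGE_SHIFT,
					 pgprot_noncached(PAGE_READONLY));

	/*
	 * Clearing the flag later only affects pages that haven't been
	 * faulted in yet; revoking existing mappings is the tricky
	 * direction mentioned above.
	 */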

--Andy

