[PATCH 0/2] MIPS: Move signal return trampolines off the stack.

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] MIPS: Move signal return trampolines off the stack.
@ 2009-04-21 21:30 David Daney
  2009-04-21 21:33 ` [PATCH 1/2] MIPS: Preliminary vdso David Daney
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: David Daney @ 2009-04-21 21:30 UTC (permalink / raw)
  To: linux-mips, Ralf Baechle

This patch set (against 2.6.29.1) creates a vdso and moves the signal
trampolines to it from their previous home on the stack.

Tested with a 64-bit kernel on a Cavium Octeon cn3860 where I have the
following results from lmbench2:

Before:
n64 - Signal handler overhead: 14.517 microseconds
n32 - Signal handler overhead: 14.497 microseconds
o32 - Signal handler overhead: 16.637 microseconds

After:

n64 - Signal handler overhead: 7.935 microseconds
n32 - Signal handler overhead: 7.334 microseconds
o32 - Signal handler overhead: 8.628 microseconds

Comments encourged.

I will reply with two patches.

David Daney (2):
  MIPS: Preliminary vdso.
  MIPS: Move signal trampolines off of the stack.

 arch/mips/include/asm/abi.h         |    6 +-
 arch/mips/include/asm/elf.h         |    4 +
 arch/mips/include/asm/mmu.h         |    5 +-
 arch/mips/include/asm/mmu_context.h |    2 +-
 arch/mips/include/asm/processor.h   |    5 +-
 arch/mips/include/asm/vdso.h        |   29 ++++++++++
 arch/mips/kernel/Makefile           |    2 +-
 arch/mips/kernel/signal-common.h    |    5 --
 arch/mips/kernel/signal.c           |   86 ++++++----------------------
 arch/mips/kernel/signal32.c         |   55 +++++--------------
 arch/mips/kernel/signal_n32.c       |   26 ++-------
 arch/mips/kernel/syscall.c          |    2 +-
 arch/mips/kernel/vdso.c             |  104 +++++++++++++++++++++++++++++++++++
 13 files changed, 190 insertions(+), 141 deletions(-)
 create mode 100644 arch/mips/include/asm/vdso.h
 create mode 100644 arch/mips/kernel/vdso.c

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-21 21:30 [PATCH 0/2] MIPS: Move signal return trampolines off the stack David Daney
@ 2009-04-21 21:33 ` David Daney
  2009-04-22  5:24   ` Shane McDonald
                     ` (2 more replies)
  2009-04-21 21:33 ` [PATCH 2/2] MIPS: Move signal trampolines off of the stack David Daney
  2009-04-22 18:04 ` [PATCH 0/2] MIPS: Move signal return trampolines off " David VomLehn
  2 siblings, 3 replies; 22+ messages in thread
From: David Daney @ 2009-04-21 21:33 UTC (permalink / raw)
  To: linux-mips, ralf; +Cc: David Daney

This is a preliminary patch to add a vdso to all user processes.
Still missing are ELF headers and .eh_frame information.  But it is
enough to allow us to move signal trampolines off of the stack.

We allocate a single page (the vdso) and write all possible signal
trampolines into it.  The stack is moved down by one page and the vdso
is mapped into this space.

Signed-off-by: David Daney <ddaney@caviumnetworks.com>
---
 arch/mips/include/asm/elf.h         |    4 +
 arch/mips/include/asm/mmu.h         |    5 +-
 arch/mips/include/asm/mmu_context.h |    2 +-
 arch/mips/include/asm/processor.h   |    5 +-
 arch/mips/include/asm/vdso.h        |   29 ++++++++++
 arch/mips/kernel/Makefile           |    2 +-
 arch/mips/kernel/syscall.c          |    2 +-
 arch/mips/kernel/vdso.c             |  104 +++++++++++++++++++++++++++++++++++
 8 files changed, 147 insertions(+), 6 deletions(-)
 create mode 100644 arch/mips/include/asm/vdso.h
 create mode 100644 arch/mips/kernel/vdso.c

diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index d58f128..08b3bc5 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -364,4 +364,8 @@ extern int dump_task_fpu(struct task_struct *, elf_fpregset_t *);
 #define ELF_ET_DYN_BASE         (TASK_SIZE / 3 * 2)
 #endif
 
+#define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
+struct linux_binprm;
+extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       int uses_interp);
 #endif /* _ASM_ELF_H */
diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
index 4063edd..c436138 100644
--- a/arch/mips/include/asm/mmu.h
+++ b/arch/mips/include/asm/mmu.h
@@ -1,6 +1,9 @@
 #ifndef __ASM_MMU_H
 #define __ASM_MMU_H
 
-typedef unsigned long mm_context_t[NR_CPUS];
+typedef struct {
+	unsigned long asid[NR_CPUS];
+	void *vdso;
+} mm_context_t;
 
 #endif /* __ASM_MMU_H */
diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
index d7f3eb0..62bcf13 100644
--- a/arch/mips/include/asm/mmu_context.h
+++ b/arch/mips/include/asm/mmu_context.h
@@ -73,7 +73,7 @@ extern unsigned long smtc_asid_mask;
 
 #endif
 
-#define cpu_context(cpu, mm)	((mm)->context[cpu])
+#define cpu_context(cpu, mm)	((mm)->context.asid[cpu])
 #define cpu_asid(cpu, mm)	(cpu_context((cpu), (mm)) & ASID_MASK)
 #define asid_cache(cpu)		(cpu_data[cpu].asid_cache)
 
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index 0f926aa..530b01f 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -39,7 +39,7 @@ extern unsigned int vced_count, vcei_count;
  * so don't change it unless you know what you are doing.
  */
 #define TASK_SIZE	0x7fff8000UL
-#define STACK_TOP	TASK_SIZE
+#define STACK_TOP	(TASK_SIZE - PAGE_SIZE)
 
 /*
  * This decides where the kernel will search for a free chunk of vm
@@ -59,7 +59,8 @@ extern unsigned int vced_count, vcei_count;
 #define TASK_SIZE32	0x7fff8000UL
 #define TASK_SIZE	0x10000000000UL
 #define STACK_TOP	\
-      (test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
+	((test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE) - \
+	 PAGE_SIZE)
 
 /*
  * This decides where the kernel will search for a free chunk of vm
diff --git a/arch/mips/include/asm/vdso.h b/arch/mips/include/asm/vdso.h
new file mode 100644
index 0000000..cca56aa
--- /dev/null
+++ b/arch/mips/include/asm/vdso.h
@@ -0,0 +1,29 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2009 Cavium Networks
+ */
+
+#ifndef __ASM_VDSO_H
+#define __ASM_VDSO_H
+
+#include <linux/types.h>
+
+
+#ifdef CONFIG_32BIT
+struct mips_vdso {
+	u32 signal_trampoline[2];
+	u32 rt_signal_trampoline[2];
+};
+#else  /* !CONFIG_32BIT */
+struct mips_vdso {
+	u32 o32_signal_trampoline[2];
+	u32 o32_rt_signal_trampoline[2];
+	u32 rt_signal_trampoline[2];
+	u32 n32_rt_signal_trampoline[2];
+};
+#endif /* CONFIG_32BIT */
+
+#endif /* __ASM_VDSO_H */
diff --git a/arch/mips/kernel/Makefile b/arch/mips/kernel/Makefile
index e961221..b8158b5 100644
--- a/arch/mips/kernel/Makefile
+++ b/arch/mips/kernel/Makefile
@@ -6,7 +6,7 @@ extra-y		:= head.o init_task.o vmlinux.lds
 
 obj-y		+= cpu-probe.o branch.o entry.o genex.o irq.o process.o \
 		   ptrace.o reset.o setup.o signal.o syscall.o \
-		   time.o topology.o traps.o unaligned.o watch.o
+		   time.o topology.o traps.o unaligned.o watch.o vdso.o
 
 obj-$(CONFIG_CEVT_BCM1480)	+= cevt-bcm1480.o
 obj-$(CONFIG_CEVT_R4K_LIB)	+= cevt-r4k.o
diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
index 955c5f0..491e5be 100644
--- a/arch/mips/kernel/syscall.c
+++ b/arch/mips/kernel/syscall.c
@@ -77,7 +77,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	int do_color_align;
 	unsigned long task_size;
 
-	task_size = STACK_TOP;
+	task_size = test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE;
 
 	if (len > task_size)
 		return -ENOMEM;
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
new file mode 100644
index 0000000..a1f38a6
--- /dev/null
+++ b/arch/mips/kernel/vdso.c
@@ -0,0 +1,104 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2009 Cavium Networks
+ */
+
+
+#include <linux/kernel.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/binfmts.h>
+#include <linux/elf.h>
+#include <linux/vmalloc.h>
+#include <linux/unistd.h>
+
+#include <asm/vdso.h>
+
+/*
+ * Including <asm/unistd.h> would give use the 64-bit syscall numbers ...
+ */
+#define __NR_O32_sigreturn		4119
+#define __NR_O32_rt_sigreturn		4193
+#define __NR_N32_rt_sigreturn		6211
+
+static struct page *vdso_page;
+
+static void __init install_trampoline(u32 *tramp, unsigned int sigreturn)
+{
+	tramp[0] = 0x24020000 + sigreturn;	/* li v0, sigreturn */
+	tramp[1] = 0x0000000c;			/* syscall */
+}
+
+static int __init init_vdso(void)
+{
+	struct mips_vdso *vdso;
+
+	vdso_page = alloc_page(GFP_KERNEL);
+	if (!vdso_page)
+		panic("Cannot allocate vdso");
+
+	vdso = vmap(&vdso_page, 1, 0, PAGE_KERNEL);
+	if (!vdso)
+		panic("Cannot map vdso");
+	clear_page(vdso);
+
+	install_trampoline(vdso->rt_signal_trampoline, __NR_rt_sigreturn);
+#ifdef CONFIG_32BIT
+	install_trampoline(vdso->signal_trampoline, __NR_sigreturn);
+#else
+	install_trampoline(vdso->n32_rt_signal_trampoline,
+			   __NR_N32_rt_sigreturn);
+	install_trampoline(vdso->o32_signal_trampoline, __NR_O32_sigreturn);
+	install_trampoline(vdso->o32_rt_signal_trampoline,
+			   __NR_O32_rt_sigreturn);
+#endif
+
+	vunmap(vdso);
+
+	pr_notice("init_vdso successfull\n");
+
+	return 0;
+}
+device_initcall(init_vdso);
+
+static unsigned long vdso_addr(unsigned long start)
+{
+	return STACK_TOP;
+}
+
+int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+{
+	int ret;
+	unsigned long addr;
+	struct mm_struct *mm = current->mm;
+
+	down_write(&mm->mmap_sem);
+
+	addr = vdso_addr(mm->start_stack);
+
+	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
+	if (IS_ERR_VALUE(addr)) {
+		ret = addr;
+		goto up_fail;
+	}
+
+	ret = install_special_mapping(mm, addr, PAGE_SIZE,
+				      VM_READ|VM_EXEC|
+				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
+				      VM_ALWAYSDUMP,
+				      &vdso_page);
+
+	if (ret)
+		goto up_fail;
+
+	mm->context.vdso = (void *)addr;
+
+up_fail:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2] MIPS: Move signal trampolines off of the stack.
  2009-04-21 21:30 [PATCH 0/2] MIPS: Move signal return trampolines off the stack David Daney
  2009-04-21 21:33 ` [PATCH 1/2] MIPS: Preliminary vdso David Daney
@ 2009-04-21 21:33 ` David Daney
  2009-04-22 17:57   ` David VomLehn
  2009-04-22 18:04 ` [PATCH 0/2] MIPS: Move signal return trampolines off " David VomLehn
  2 siblings, 1 reply; 22+ messages in thread
From: David Daney @ 2009-04-21 21:33 UTC (permalink / raw)
  To: linux-mips, ralf; +Cc: David Daney

This is a follow on to the vdso patch.

Since all processes now have signal trampolines permanently mapped, we
can use those instead of putting the trampoline on the stack and
invalidating the corresponding icache across all CPUs.  We also get
rid of a bunch of ICACHE_REFILLS_WORKAROUND_WAR code.

Signed-off-by: David Daney <ddaney@caviumnetworks.com>
---
 arch/mips/include/asm/abi.h      |    6 ++-
 arch/mips/kernel/signal-common.h |    5 --
 arch/mips/kernel/signal.c        |   86 ++++++++-----------------------------
 arch/mips/kernel/signal32.c      |   55 ++++++------------------
 arch/mips/kernel/signal_n32.c    |   26 +++---------
 5 files changed, 43 insertions(+), 135 deletions(-)

diff --git a/arch/mips/include/asm/abi.h b/arch/mips/include/asm/abi.h
index 1dd74fb..9252d9b 100644
--- a/arch/mips/include/asm/abi.h
+++ b/arch/mips/include/asm/abi.h
@@ -13,12 +13,14 @@
 #include <asm/siginfo.h>
 
 struct mips_abi {
-	int (* const setup_frame)(struct k_sigaction * ka,
+	int (* const setup_frame)(void *sig_return, struct k_sigaction *ka,
 	                          struct pt_regs *regs, int signr,
 	                          sigset_t *set);
-	int (* const setup_rt_frame)(struct k_sigaction * ka,
+	const unsigned long	signal_return_offset;
+	int (* const setup_rt_frame)(void *sig_return, struct k_sigaction *ka,
 	                       struct pt_regs *regs, int signr,
 	                       sigset_t *set, siginfo_t *info);
+	const unsigned long	rt_signal_return_offset;
 	const unsigned long	restart;
 };
 
diff --git a/arch/mips/kernel/signal-common.h b/arch/mips/kernel/signal-common.h
index 6c8e8c4..10263b4 100644
--- a/arch/mips/kernel/signal-common.h
+++ b/arch/mips/kernel/signal-common.h
@@ -26,11 +26,6 @@
  */
 extern void __user *get_sigframe(struct k_sigaction *ka, struct pt_regs *regs,
 				 size_t frame_size);
-/*
- * install trampoline code to get back from the sig handler
- */
-extern int install_sigtramp(unsigned int __user *tramp, unsigned int syscall);
-
 /* Check and clear pending FPU exceptions in saved CSR */
 extern int fpcsr_pending(unsigned int __user *fpcsr);
 
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 830c5ef..2d83902 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -31,50 +31,24 @@
 #include <asm/ucontext.h>
 #include <asm/cpu-features.h>
 #include <asm/war.h>
+#include <asm/vdso.h>
 
 #include "signal-common.h"
 
-/*
- * Horribly complicated - with the bloody RM9000 workarounds enabled
- * the signal trampolines is moving to the end of the structure so we can
- * increase the alignment without breaking software compatibility.
- */
-#if ICACHE_REFILLS_WORKAROUND_WAR == 0
-
 struct sigframe {
 	u32 sf_ass[4];		/* argument save space for o32 */
-	u32 sf_code[2];		/* signal trampoline */
+	u32 sf_pad[2];		/*  Was: signal trampoline */
 	struct sigcontext sf_sc;
 	sigset_t sf_mask;
 };
 
 struct rt_sigframe {
 	u32 rs_ass[4];		/* argument save space for o32 */
-	u32 rs_code[2];		/* signal trampoline */
+	u32 rs_pad[2];		/*  Was: signal trampoline */
 	struct siginfo rs_info;
 	struct ucontext rs_uc;
 };
 
-#else
-
-struct sigframe {
-	u32 sf_ass[4];			/* argument save space for o32 */
-	u32 sf_pad[2];
-	struct sigcontext sf_sc;	/* hw context */
-	sigset_t sf_mask;
-	u32 sf_code[8] ____cacheline_aligned;	/* signal trampoline */
-};
-
-struct rt_sigframe {
-	u32 rs_ass[4];			/* argument save space for o32 */
-	u32 rs_pad[2];
-	struct siginfo rs_info;
-	struct ucontext rs_uc;
-	u32 rs_code[8] ____cacheline_aligned;	/* signal trampoline */
-};
-
-#endif
-
 /*
  * Helper routines
  */
@@ -256,32 +230,6 @@ void __user *get_sigframe(struct k_sigaction *ka, struct pt_regs *regs,
 	return (void __user *)((sp - frame_size) & (ICACHE_REFILLS_WORKAROUND_WAR ? ~(cpu_icache_line_size()-1) : ALMASK));
 }
 
-int install_sigtramp(unsigned int __user *tramp, unsigned int syscall)
-{
-	int err;
-
-	/*
-	 * Set up the return code ...
-	 *
-	 *         li      v0, __NR__foo_sigreturn
-	 *         syscall
-	 */
-
-	err = __put_user(0x24020000 + syscall, tramp + 0);
-	err |= __put_user(0x0000000c         , tramp + 1);
-	if (ICACHE_REFILLS_WORKAROUND_WAR) {
-		err |= __put_user(0, tramp + 2);
-		err |= __put_user(0, tramp + 3);
-		err |= __put_user(0, tramp + 4);
-		err |= __put_user(0, tramp + 5);
-		err |= __put_user(0, tramp + 6);
-		err |= __put_user(0, tramp + 7);
-	}
-	flush_cache_sigtramp((unsigned long) tramp);
-
-	return err;
-}
-
 /*
  * Atomically swap in the new signal mask, and wait for a signal.
  */
@@ -474,8 +422,8 @@ badframe:
 }
 
 #ifdef CONFIG_TRAD_SIGNALS
-static int setup_frame(struct k_sigaction * ka, struct pt_regs *regs,
-	int signr, sigset_t *set)
+static int setup_frame(void *sig_return, struct k_sigaction *ka,
+		       struct pt_regs *regs, int signr, sigset_t *set)
 {
 	struct sigframe __user *frame;
 	int err = 0;
@@ -484,8 +432,6 @@ static int setup_frame(struct k_sigaction * ka, struct pt_regs *regs,
 	if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame)))
 		goto give_sigsegv;
 
-	err |= install_sigtramp(frame->sf_code, __NR_sigreturn);
-
 	err |= setup_sigcontext(regs, &frame->sf_sc);
 	err |= __copy_to_user(&frame->sf_mask, set, sizeof(*set));
 	if (err)
@@ -505,7 +451,7 @@ static int setup_frame(struct k_sigaction * ka, struct pt_regs *regs,
 	regs->regs[ 5] = 0;
 	regs->regs[ 6] = (unsigned long) &frame->sf_sc;
 	regs->regs[29] = (unsigned long) frame;
-	regs->regs[31] = (unsigned long) frame->sf_code;
+	regs->regs[31] = (unsigned long) sig_return;
 	regs->cp0_epc = regs->regs[25] = (unsigned long) ka->sa.sa_handler;
 
 	DEBUGP("SIG deliver (%s:%d): sp=0x%p pc=0x%lx ra=0x%lx\n",
@@ -519,8 +465,9 @@ give_sigsegv:
 }
 #endif
 
-static int setup_rt_frame(struct k_sigaction * ka, struct pt_regs *regs,
-	int signr, sigset_t *set, siginfo_t *info)
+static int setup_rt_frame(void *sig_return, struct k_sigaction *ka,
+			  struct pt_regs *regs,	int signr, sigset_t *set,
+			  siginfo_t *info)
 {
 	struct rt_sigframe __user *frame;
 	int err = 0;
@@ -529,8 +476,6 @@ static int setup_rt_frame(struct k_sigaction * ka, struct pt_regs *regs,
 	if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame)))
 		goto give_sigsegv;
 
-	err |= install_sigtramp(frame->rs_code, __NR_rt_sigreturn);
-
 	/* Create siginfo.  */
 	err |= copy_siginfo_to_user(&frame->rs_info, info);
 
@@ -563,7 +508,7 @@ static int setup_rt_frame(struct k_sigaction * ka, struct pt_regs *regs,
 	regs->regs[ 5] = (unsigned long) &frame->rs_info;
 	regs->regs[ 6] = (unsigned long) &frame->rs_uc;
 	regs->regs[29] = (unsigned long) frame;
-	regs->regs[31] = (unsigned long) frame->rs_code;
+	regs->regs[31] = (unsigned long) sig_return;
 	regs->cp0_epc = regs->regs[25] = (unsigned long) ka->sa.sa_handler;
 
 	DEBUGP("SIG deliver (%s:%d): sp=0x%p pc=0x%lx ra=0x%lx\n",
@@ -580,8 +525,11 @@ give_sigsegv:
 struct mips_abi mips_abi = {
 #ifdef CONFIG_TRAD_SIGNALS
 	.setup_frame	= setup_frame,
+	.signal_return_offset = offsetof(struct mips_vdso, signal_trampoline),
 #endif
 	.setup_rt_frame	= setup_rt_frame,
+	.rt_signal_return_offset =
+		offsetof(struct mips_vdso, rt_signal_trampoline),
 	.restart	= __NR_restart_syscall
 };
 
@@ -589,6 +537,8 @@ static int handle_signal(unsigned long sig, siginfo_t *info,
 	struct k_sigaction *ka, sigset_t *oldset, struct pt_regs *regs)
 {
 	int ret;
+	struct mips_abi *abi = current->thread.abi;
+	void *vdso = current->mm->context.vdso;
 
 	switch(regs->regs[0]) {
 	case ERESTART_RESTARTBLOCK:
@@ -609,9 +559,11 @@ static int handle_signal(unsigned long sig, siginfo_t *info,
 	regs->regs[0] = 0;		/* Don't deal with this again.  */
 
 	if (sig_uses_siginfo(ka))
-		ret = current->thread.abi->setup_rt_frame(ka, regs, sig, oldset, info);
+		ret = abi->setup_rt_frame(vdso + abi->rt_signal_return_offset,
+					  ka, regs, sig, oldset, info);
 	else
-		ret = current->thread.abi->setup_frame(ka, regs, sig, oldset);
+		ret = abi->setup_frame(vdso + abi->signal_return_offset,
+				       ka, regs, sig, oldset);
 
 	spin_lock_irq(&current->sighand->siglock);
 	sigorsets(&current->blocked, &current->blocked, &ka->sa.sa_mask);
diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index 2e74075..0507608 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -32,14 +32,13 @@
 #include <asm/system.h>
 #include <asm/fpu.h>
 #include <asm/war.h>
+#include <asm/vdso.h>
 
 #include "signal-common.h"
 
 /*
  * Including <asm/unistd.h> would give use the 64-bit syscall numbers ...
  */
-#define __NR_O32_sigreturn		4119
-#define __NR_O32_rt_sigreturn		4193
 #define __NR_O32_restart_syscall        4253
 
 /* 32-bit compatibility types */
@@ -68,47 +67,20 @@ struct ucontext32 {
 	compat_sigset_t     uc_sigmask;   /* mask last for extensibility */
 };
 
-/*
- * Horribly complicated - with the bloody RM9000 workarounds enabled
- * the signal trampolines is moving to the end of the structure so we can
- * increase the alignment without breaking software compatibility.
- */
-#if ICACHE_REFILLS_WORKAROUND_WAR == 0
-
 struct sigframe32 {
 	u32 sf_ass[4];		/* argument save space for o32 */
-	u32 sf_code[2];		/* signal trampoline */
+	u32 sf_pad[2];		/* Was: signal trampoline */
 	struct sigcontext32 sf_sc;
 	compat_sigset_t sf_mask;
 };
 
 struct rt_sigframe32 {
 	u32 rs_ass[4];			/* argument save space for o32 */
-	u32 rs_code[2];			/* signal trampoline */
+	u32 rs_pad[2];			/* Was: signal trampoline */
 	compat_siginfo_t rs_info;
 	struct ucontext32 rs_uc;
 };
 
-#else  /* ICACHE_REFILLS_WORKAROUND_WAR */
-
-struct sigframe32 {
-	u32 sf_ass[4];			/* argument save space for o32 */
-	u32 sf_pad[2];
-	struct sigcontext32 sf_sc;	/* hw context */
-	compat_sigset_t sf_mask;
-	u32 sf_code[8] ____cacheline_aligned;	/* signal trampoline */
-};
-
-struct rt_sigframe32 {
-	u32 rs_ass[4];			/* argument save space for o32 */
-	u32 rs_pad[2];
-	compat_siginfo_t rs_info;
-	struct ucontext32 rs_uc;
-	u32 rs_code[8] __attribute__((aligned(32)));	/* signal trampoline */
-};
-
-#endif	/* !ICACHE_REFILLS_WORKAROUND_WAR */
-
 /*
  * sigcontext handlers
  */
@@ -589,8 +561,8 @@ badframe:
 	force_sig(SIGSEGV, current);
 }
 
-static int setup_frame_32(struct k_sigaction * ka, struct pt_regs *regs,
-	int signr, sigset_t *set)
+static int setup_frame_32(void *sig_return, struct k_sigaction *ka,
+			  struct pt_regs *regs, int signr, sigset_t *set)
 {
 	struct sigframe32 __user *frame;
 	int err = 0;
@@ -599,8 +571,6 @@ static int setup_frame_32(struct k_sigaction * ka, struct pt_regs *regs,
 	if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame)))
 		goto give_sigsegv;
 
-	err |= install_sigtramp(frame->sf_code, __NR_O32_sigreturn);
-
 	err |= setup_sigcontext32(regs, &frame->sf_sc);
 	err |= __copy_conv_sigset_to_user(&frame->sf_mask, set);
 
@@ -621,7 +591,7 @@ static int setup_frame_32(struct k_sigaction * ka, struct pt_regs *regs,
 	regs->regs[ 5] = 0;
 	regs->regs[ 6] = (unsigned long) &frame->sf_sc;
 	regs->regs[29] = (unsigned long) frame;
-	regs->regs[31] = (unsigned long) frame->sf_code;
+	regs->regs[31] = (unsigned long) sig_return;
 	regs->cp0_epc = regs->regs[25] = (unsigned long) ka->sa.sa_handler;
 
 	DEBUGP("SIG deliver (%s:%d): sp=0x%p pc=0x%lx ra=0x%lx\n",
@@ -635,8 +605,9 @@ give_sigsegv:
 	return -EFAULT;
 }
 
-static int setup_rt_frame_32(struct k_sigaction * ka, struct pt_regs *regs,
-	int signr, sigset_t *set, siginfo_t *info)
+static int setup_rt_frame_32(void *sig_return, struct k_sigaction *ka,
+			     struct pt_regs *regs, int signr, sigset_t *set,
+			     siginfo_t *info)
 {
 	struct rt_sigframe32 __user *frame;
 	int err = 0;
@@ -646,8 +617,6 @@ static int setup_rt_frame_32(struct k_sigaction * ka, struct pt_regs *regs,
 	if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame)))
 		goto give_sigsegv;
 
-	err |= install_sigtramp(frame->rs_code, __NR_O32_rt_sigreturn);
-
 	/* Convert (siginfo_t -> compat_siginfo_t) and copy to user. */
 	err |= copy_siginfo_to_user32(&frame->rs_info, info);
 
@@ -681,7 +650,7 @@ static int setup_rt_frame_32(struct k_sigaction * ka, struct pt_regs *regs,
 	regs->regs[ 5] = (unsigned long) &frame->rs_info;
 	regs->regs[ 6] = (unsigned long) &frame->rs_uc;
 	regs->regs[29] = (unsigned long) frame;
-	regs->regs[31] = (unsigned long) frame->rs_code;
+	regs->regs[31] = (unsigned long) sig_return;
 	regs->cp0_epc = regs->regs[25] = (unsigned long) ka->sa.sa_handler;
 
 	DEBUGP("SIG deliver (%s:%d): sp=0x%p pc=0x%lx ra=0x%lx\n",
@@ -700,7 +669,11 @@ give_sigsegv:
  */
 struct mips_abi mips_abi_32 = {
 	.setup_frame	= setup_frame_32,
+	.signal_return_offset =
+		offsetof(struct mips_vdso, o32_signal_trampoline),
 	.setup_rt_frame	= setup_rt_frame_32,
+	.rt_signal_return_offset =
+		offsetof(struct mips_vdso, o32_rt_signal_trampoline),
 	.restart	= __NR_O32_restart_syscall
 };
 
diff --git a/arch/mips/kernel/signal_n32.c b/arch/mips/kernel/signal_n32.c
index bb277e8..2c5df81 100644
--- a/arch/mips/kernel/signal_n32.c
+++ b/arch/mips/kernel/signal_n32.c
@@ -39,13 +39,13 @@
 #include <asm/fpu.h>
 #include <asm/cpu-features.h>
 #include <asm/war.h>
+#include <asm/vdso.h>
 
 #include "signal-common.h"
 
 /*
  * Including <asm/unistd.h> would give use the 64-bit syscall numbers ...
  */
-#define __NR_N32_rt_sigreturn		6211
 #define __NR_N32_restart_syscall	6214
 
 extern int setup_sigcontext(struct pt_regs *, struct sigcontext __user *);
@@ -67,27 +67,13 @@ struct ucontextn32 {
 	compat_sigset_t     uc_sigmask;   /* mask last for extensibility */
 };
 
-#if ICACHE_REFILLS_WORKAROUND_WAR == 0
-
-struct rt_sigframe_n32 {
-	u32 rs_ass[4];			/* argument save space for o32 */
-	u32 rs_code[2];			/* signal trampoline */
-	struct compat_siginfo rs_info;
-	struct ucontextn32 rs_uc;
-};
-
-#else  /* ICACHE_REFILLS_WORKAROUND_WAR */
-
 struct rt_sigframe_n32 {
 	u32 rs_ass[4];			/* argument save space for o32 */
-	u32 rs_pad[2];
+	u32 rs_pad[2];			/* Was: signal trampoline */
 	struct compat_siginfo rs_info;
 	struct ucontextn32 rs_uc;
-	u32 rs_code[8] ____cacheline_aligned;		/* signal trampoline */
 };
 
-#endif	/* !ICACHE_REFILLS_WORKAROUND_WAR */
-
 extern void sigset_from_compat(sigset_t *set, compat_sigset_t *compat);
 
 asmlinkage int sysn32_rt_sigsuspend(nabi_no_regargs struct pt_regs regs)
@@ -173,7 +159,7 @@ badframe:
 	force_sig(SIGSEGV, current);
 }
 
-static int setup_rt_frame_n32(struct k_sigaction * ka,
+static int setup_rt_frame_n32(void *sig_return, struct k_sigaction *ka,
 	struct pt_regs *regs, int signr, sigset_t *set, siginfo_t *info)
 {
 	struct rt_sigframe_n32 __user *frame;
@@ -184,8 +170,6 @@ static int setup_rt_frame_n32(struct k_sigaction * ka,
 	if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame)))
 		goto give_sigsegv;
 
-	install_sigtramp(frame->rs_code, __NR_N32_rt_sigreturn);
-
 	/* Create siginfo.  */
 	err |= copy_siginfo_to_user32(&frame->rs_info, info);
 
@@ -219,7 +203,7 @@ static int setup_rt_frame_n32(struct k_sigaction * ka,
 	regs->regs[ 5] = (unsigned long) &frame->rs_info;
 	regs->regs[ 6] = (unsigned long) &frame->rs_uc;
 	regs->regs[29] = (unsigned long) frame;
-	regs->regs[31] = (unsigned long) frame->rs_code;
+	regs->regs[31] = (unsigned long) sig_return;
 	regs->cp0_epc = regs->regs[25] = (unsigned long) ka->sa.sa_handler;
 
 	DEBUGP("SIG deliver (%s:%d): sp=0x%p pc=0x%lx ra=0x%lx\n",
@@ -235,5 +219,7 @@ give_sigsegv:
 
 struct mips_abi mips_abi_n32 = {
 	.setup_rt_frame	= setup_rt_frame_n32,
+	.rt_signal_return_offset =
+		offsetof(struct mips_vdso, n32_rt_signal_trampoline),
 	.restart	= __NR_N32_restart_syscall
 };
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-21 21:33 ` [PATCH 1/2] MIPS: Preliminary vdso David Daney
@ 2009-04-22  5:24   ` Shane McDonald
  2009-04-22 15:18     ` David Daney
  2009-04-22  9:35   ` Kevin D. Kissell
  2009-04-22 17:50   ` David VomLehn
  2 siblings, 1 reply; 22+ messages in thread
From: Shane McDonald @ 2009-04-22  5:24 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips, ralf

[-- Attachment #1: Type: text/plain, Size: 1770 bytes --]

Hello David:

On Tue, Apr 21, 2009 at 3:33 PM, David Daney <ddaney@caviumnetworks.com>wrote:

> This is a preliminary patch to add a vdso to all user processes.
> Still missing are ELF headers and .eh_frame information.  But it is
> enough to allow us to move signal trampolines off of the stack.
>
> We allocate a single page (the vdso) and write all possible signal
> trampolines into it.  The stack is moved down by one page and the vdso
> is mapped into this space.


This patch fails to compile for me with an RM7035C-based system (out of
tree, sadly).  The error I see is:

  CC      arch/mips/kernel/syscall.o
arch/mips/kernel/syscall.c: In function 'arch_get_unmapped_area':
arch/mips/kernel/syscall.c:80: error: 'TASK_SIZE32' undeclared (first use in
this function)
arch/mips/kernel/syscall.c:80: error: (Each undeclared identifier is
reported only once
arch/mips/kernel/syscall.c:80: error: for each function it appears in.)
make[1]: *** [arch/mips/kernel/syscall.o] Error 1

I believe it is related to the following portion of the patch:

diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
> index 955c5f0..491e5be 100644
> --- a/arch/mips/kernel/syscall.c
> +++ b/arch/mips/kernel/syscall.c
> @@ -77,7 +77,7 @@ unsigned long arch_get_unmapped_area(struct file *filp,
> unsigned long addr,
>        int do_color_align;
>        unsigned long task_size;
>
> -       task_size = STACK_TOP;
> +       task_size = test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 :
> TASK_SIZE;
>
>        if (len > task_size)
>                return -ENOMEM;
>

On my system, CONFIG_32BIT is defined and CONFIG_64BIT is not -- looking at
arch/mips/include/asm/processor.h, it appears that TASK_SIZE32 is only
defined when CONFIG_64BIT is defined.

Shane McDonald

[-- Attachment #2: Type: text/html, Size: 2316 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-21 21:33 ` [PATCH 1/2] MIPS: Preliminary vdso David Daney
  2009-04-22  5:24   ` Shane McDonald
@ 2009-04-22  9:35   ` Kevin D. Kissell
  2009-04-22 18:01     ` David Daney
  2009-04-22 17:50   ` David VomLehn
  2 siblings, 1 reply; 22+ messages in thread
From: Kevin D. Kissell @ 2009-04-22  9:35 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips, ralf

David Daney wrote:
> This is a preliminary patch to add a vdso to all user processes.
> Still missing are ELF headers and .eh_frame information.  But it is
> enough to allow us to move signal trampolines off of the stack.
>
> We allocate a single page (the vdso) and write all possible signal
> trampolines into it.  The stack is moved down by one page and the vdso
> is mapped into this space.
>
> Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Note that for FPU-less CPUs, the kernel FP emulator also uses a user
stack trampoline to execute instructions in the delay slots of emulated
FP branches.  I didn't see any of the math-emu modules being tweaked in
either part of your patch.  Presumably, one would want to move that
operation into the vdso as well.  With the proposed patch, I'm not sure
whether things would continue working normally as before, still using
the user stack, or whether the dsemulret code depends on something that
is changed by the patch, and will now implode.  Probably the former, but
paranoia is not a character defect in OS kernel work. I don't have a
test case handy (nor a test system any more), but compiling something
like whetstone or linpack in C with a high degree of optimization will
*probably* generate FP branches with live delay slots.

          Regards,

          Kevin K.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-22  5:24   ` Shane McDonald
@ 2009-04-22 15:18     ` David Daney
  0 siblings, 0 replies; 22+ messages in thread
From: David Daney @ 2009-04-22 15:18 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips, ralf

Shane McDonald wrote:
> Hello David:
> 
> On Tue, Apr 21, 2009 at 3:33 PM, David Daney <ddaney@caviumnetworks.com 
> <mailto:ddaney@caviumnetworks.com>> wrote:
> 
>     This is a preliminary patch to add a vdso to all user processes.
>     Still missing are ELF headers and .eh_frame information.  But it is
>     enough to allow us to move signal trampolines off of the stack.
> 
>     We allocate a single page (the vdso) and write all possible signal
>     trampolines into it.  The stack is moved down by one page and the vdso
>     is mapped into this space.
> 
> 
> This patch fails to compile for me with an RM7035C-based system (out of 
> tree, sadly).  The error I see is:
> 
>   CC      arch/mips/kernel/syscall.o
> arch/mips/kernel/syscall.c: In function 'arch_get_unmapped_area':
> arch/mips/kernel/syscall.c:80: error: 'TASK_SIZE32' undeclared (first 
> use in this function)
> arch/mips/kernel/syscall.c:80: error: (Each undeclared identifier is 
> reported only once
> arch/mips/kernel/syscall.c:80: error: for each function it appears in.)
> make[1]: *** [arch/mips/kernel/syscall.o] Error 1
> 

I never built a 32-bit kernel with the patch.  I will endeavor to fix this.

Thanks,
David Daney

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-21 21:33 ` [PATCH 1/2] MIPS: Preliminary vdso David Daney
  2009-04-22  5:24   ` Shane McDonald
  2009-04-22  9:35   ` Kevin D. Kissell
@ 2009-04-22 17:50   ` David VomLehn
  2009-04-22 18:05     ` David Daney
  2 siblings, 1 reply; 22+ messages in thread
From: David VomLehn @ 2009-04-22 17:50 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips, ralf

On Tue, Apr 21, 2009 at 02:33:24PM -0700, David Daney wrote:
> This is a preliminary patch to add a vdso to all user processes.
...
> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
> new file mode 100644
> index 0000000..a1f38a6
...
> +static void __init install_trampoline(u32 *tramp, unsigned int sigreturn)
> +{
> +	tramp[0] = 0x24020000 + sigreturn;	/* li v0, sigreturn */
> +	tramp[1] = 0x0000000c;			/* syscall */
> +}

We can do away with magic constants by building on asm/inst.h:

static void __init install_trampoline(union mips_instruction *tramp,
	short sigreturn)
{
	static struct i_format li_sigreturn = {	/* li, v0, sigreturn */
		.opcode = addiu_op,
		.rs = 0,			/* zero */
		.rt = 2,			/* v0 */
	};
	static struct r_format syscall = {		/* syscall */
		.opcode = spec_op,
		.func = syscall_op
	};
	tramp[0].i_format = li_sigreturn;
	tramp[0].i_format.simmediate = sigreturn;
	tramp[1].r_format = syscall;
}

Admittedly, this isn't really a proper use of the r_format structure. The
right way is to add another structure to asm/inst.h and use that:

#ifdef __MIPSEB__
...
	struct syscall_format {
		unsigned int opcode : 6;
		unsigned int code : 20;
		unsigned int func : 6;
	};
...
#else
...
	struct syscall_format {
		unsigned int func : 6;
		unsigned int code : 20;
		unsigned int opcode : 6;
	};
...
#endif

Then we could do the syscall the right way by using syscall_format
instead of r_format...

...

> +int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> +{
> +	int ret;
> +	unsigned long addr;
> +	struct mm_struct *mm = current->mm;
> +
> +	down_write(&mm->mmap_sem);
> +
> +	addr = vdso_addr(mm->start_stack);
> +
> +	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
> +	if (IS_ERR_VALUE(addr)) {
> +		ret = addr;
> +		goto up_fail;
> +	}
> +
> +	ret = install_special_mapping(mm, addr, PAGE_SIZE,
> +				      VM_READ|VM_EXEC|
> +				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
> +				      VM_ALWAYSDUMP,
> +				      &vdso_page);
> +
> +	if (ret)
> +		goto up_fail;
> +
> +	mm->context.vdso = (void *)addr;
> +
> +up_fail:

It seems that this is an unexpected condition that probably indicates
a failure of the expected state of the process at this point. Perhaps
a pr_err() or pr_warning() would be appropriate?

> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] MIPS: Move signal trampolines off of the stack.
  2009-04-21 21:33 ` [PATCH 2/2] MIPS: Move signal trampolines off of the stack David Daney
@ 2009-04-22 17:57   ` David VomLehn
  0 siblings, 0 replies; 22+ messages in thread
From: David VomLehn @ 2009-04-22 17:57 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips, ralf

On Tue, Apr 21, 2009 at 02:33:25PM -0700, David Daney wrote:
> This is a follow on to the vdso patch.
> 
> Since all processes now have signal trampolines permanently mapped, we
> can use those instead of putting the trampoline on the stack and
> invalidating the corresponding icache across all CPUs.  We also get
> rid of a bunch of ICACHE_REFILLS_WORKAROUND_WAR code.

This sure gets rid of some ugly code!

> Signed-off-by: David Daney <ddaney@caviumnetworks.com>

Reviewed by: David VomLehn <dvomlehn@cisco.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-22  9:35   ` Kevin D. Kissell
@ 2009-04-22 18:01     ` David Daney
  2009-04-24  7:20       ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: David Daney @ 2009-04-22 18:01 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: linux-mips, ralf

Kevin D. Kissell wrote:
> David Daney wrote:
>> This is a preliminary patch to add a vdso to all user processes.
>> Still missing are ELF headers and .eh_frame information.  But it is
>> enough to allow us to move signal trampolines off of the stack.
>>
>> We allocate a single page (the vdso) and write all possible signal
>> trampolines into it.  The stack is moved down by one page and the vdso
>> is mapped into this space.
>>
>> Signed-off-by: David Daney <ddaney@caviumnetworks.com>
> Note that for FPU-less CPUs, the kernel FP emulator also uses a user
> stack trampoline to execute instructions in the delay slots of emulated
> FP branches.  I didn't see any of the math-emu modules being tweaked in
> either part of your patch.  Presumably, one would want to move that
> operation into the vdso as well.  With the proposed patch, I'm not sure
> whether things would continue working normally as before, still using
> the user stack, or whether the dsemulret code depends on something that
> is changed by the patch, and will now implode.  Probably the former, but
> paranoia is not a character defect in OS kernel work. I don't have a
> test case handy (nor a test system any more), but compiling something
> like whetstone or linpack in C with a high degree of optimization will
> *probably* generate FP branches with live delay slots.
> 

It is an ugly problem.  I am trying to hack something up to fix it.

David Daney

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] MIPS: Move signal return trampolines off the stack.
  2009-04-21 21:30 [PATCH 0/2] MIPS: Move signal return trampolines off the stack David Daney
  2009-04-21 21:33 ` [PATCH 1/2] MIPS: Preliminary vdso David Daney
  2009-04-21 21:33 ` [PATCH 2/2] MIPS: Move signal trampolines off of the stack David Daney
@ 2009-04-22 18:04 ` David VomLehn
  2009-04-22 18:13   ` David Daney
  2 siblings, 1 reply; 22+ messages in thread
From: David VomLehn @ 2009-04-22 18:04 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips, Ralf Baechle

On Tue, Apr 21, 2009 at 02:30:55PM -0700, David Daney wrote:
> This patch set (against 2.6.29.1) creates a vdso and moves the signal
> trampolines to it from their previous home on the stack.
>
> Tested with a 64-bit kernel on a Cavium Octeon cn3860 where I have the
> following results from lmbench2:
>
> Before:
> n64 - Signal handler overhead: 14.517 microseconds
> n32 - Signal handler overhead: 14.497 microseconds
> o32 - Signal handler overhead: 16.637 microseconds
>
> After:
>
> n64 - Signal handler overhead: 7.935 microseconds
> n32 - Signal handler overhead: 7.334 microseconds
> o32 - Signal handler overhead: 8.628 microseconds

Nice numbers, and something that will be even more critical as real-time
features are added and used!

> Comments encourged.

Only one comment, which I would not want to hold up acceptance:
based on some numbers sent out recently, it looks like the kernel is
experiencing some performance issues with exec() and I think this change will
make it slightly slower. You could avoid this by deferring installation of
the trampoline to the first use of a system call that registers a signal
handler.

David VomLehn

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-22 17:50   ` David VomLehn
@ 2009-04-22 18:05     ` David Daney
  2009-04-22 18:28       ` David VomLehn
  0 siblings, 1 reply; 22+ messages in thread
From: David Daney @ 2009-04-22 18:05 UTC (permalink / raw)
  To: David VomLehn; +Cc: linux-mips, ralf

David VomLehn wrote:
> On Tue, Apr 21, 2009 at 02:33:24PM -0700, David Daney wrote:
>> This is a preliminary patch to add a vdso to all user processes.
> ...
>> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
>> new file mode 100644
>> index 0000000..a1f38a6
> ...
>> +static void __init install_trampoline(u32 *tramp, unsigned int sigreturn)
>> +{
>> +	tramp[0] = 0x24020000 + sigreturn;	/* li v0, sigreturn */
>> +	tramp[1] = 0x0000000c;			/* syscall */
>> +}
> 
> We can do away with magic constants by building on asm/inst.h:
> 
> static void __init install_trampoline(union mips_instruction *tramp,
> 	short sigreturn)
> {
> 	static struct i_format li_sigreturn = {	/* li, v0, sigreturn */
> 		.opcode = addiu_op,
> 		.rs = 0,			/* zero */
> 		.rt = 2,			/* v0 */
> 	};
> 	static struct r_format syscall = {		/* syscall */
> 		.opcode = spec_op,
> 		.func = syscall_op
> 	};
> 	tramp[0].i_format = li_sigreturn;
> 	tramp[0].i_format.simmediate = sigreturn;
> 	tramp[1].r_format = syscall;
> }
> 
> Admittedly, this isn't really a proper use of the r_format structure. The
> right way is to add another structure to asm/inst.h and use that:

Perhaps using the uasm would be even better.

[...]
> 
>> +int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>> +{
>> +	int ret;
>> +	unsigned long addr;
>> +	struct mm_struct *mm = current->mm;
>> +
>> +	down_write(&mm->mmap_sem);
>> +
>> +	addr = vdso_addr(mm->start_stack);
>> +
>> +	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
>> +	if (IS_ERR_VALUE(addr)) {
>> +		ret = addr;
>> +		goto up_fail;
>> +	}
>> +
>> +	ret = install_special_mapping(mm, addr, PAGE_SIZE,
>> +				      VM_READ|VM_EXEC|
>> +				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
>> +				      VM_ALWAYSDUMP,
>> +				      &vdso_page);
>> +
>> +	if (ret)
>> +		goto up_fail;
>> +
>> +	mm->context.vdso = (void *)addr;
>> +
>> +up_fail:
> 
> It seems that this is an unexpected condition that probably indicates
> a failure of the expected state of the process at this point. Perhaps
> a pr_err() or pr_warning() would be appropriate?
> 
>> +	up_write(&mm->mmap_sem);
>> +	return ret;
>> +}

Really it should always succeed.  Something is seriously wrong if you 
cannot map that page and we should probably panic().

David Daney

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] MIPS: Move signal return trampolines off the stack.
  2009-04-22 18:04 ` [PATCH 0/2] MIPS: Move signal return trampolines off " David VomLehn
@ 2009-04-22 18:13   ` David Daney
  2009-04-22 18:31     ` David VomLehn
  0 siblings, 1 reply; 22+ messages in thread
From: David Daney @ 2009-04-22 18:13 UTC (permalink / raw)
  To: David VomLehn; +Cc: linux-mips, Ralf Baechle

David VomLehn wrote:
> On Tue, Apr 21, 2009 at 02:30:55PM -0700, David Daney wrote:
>> This patch set (against 2.6.29.1) creates a vdso and moves the signal
>> trampolines to it from their previous home on the stack.
>>
>> Tested with a 64-bit kernel on a Cavium Octeon cn3860 where I have the
>> following results from lmbench2:
>>
>> Before:
>> n64 - Signal handler overhead: 14.517 microseconds
>> n32 - Signal handler overhead: 14.497 microseconds
>> o32 - Signal handler overhead: 16.637 microseconds
>>
>> After:
>>
>> n64 - Signal handler overhead: 7.935 microseconds
>> n32 - Signal handler overhead: 7.334 microseconds
>> o32 - Signal handler overhead: 8.628 microseconds
> 
> Nice numbers, and something that will be even more critical as real-time
> features are added and used!
> 

non-SMP systems will probably not see so much improvement.

Although the numbers are nice, they are not the primary motivation 
behind the patch.  The real gains are in not having to interrupt all 
cores to invalidate their caches, and the possibility of eXecute Inhibit 
on the stack.

>> Comments encourged.
> 
> Only one comment, which I would not want to hold up acceptance:
> based on some numbers sent out recently, it looks like the kernel is
> experiencing some performance issues with exec() and I think this change will
> make it slightly slower. You could avoid this by deferring installation of
> the trampoline to the first use of a system call that registers a signal
> handler.
> 

I should try to measure this too.  Although this is what x86 et al. do. 
  It is by far much simpler and less prone to bugs that trying to hook 
into the system calls.  After an executable has had the chance to start 
additional threads and establish arbitrary mappings things get complicated.

David Daney

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-22 18:05     ` David Daney
@ 2009-04-22 18:28       ` David VomLehn
  0 siblings, 0 replies; 22+ messages in thread
From: David VomLehn @ 2009-04-22 18:28 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips, ralf

>>> +int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>>> +{
>>> +	int ret;
>>> +	unsigned long addr;
>>> +	struct mm_struct *mm = current->mm;
>>> +
>>> +	down_write(&mm->mmap_sem);
>>> +
>>> +	addr = vdso_addr(mm->start_stack);
>>> +
>>> +	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
>>> +	if (IS_ERR_VALUE(addr)) {
>>> +		ret = addr;
>>> +		goto up_fail;
>>> +	}
>>> +
>>> +	ret = install_special_mapping(mm, addr, PAGE_SIZE,
>>> +				      VM_READ|VM_EXEC|
>>> +				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
>>> +				      VM_ALWAYSDUMP,
>>> +				      &vdso_page);
>>> +
>>> +	if (ret)
>>> +		goto up_fail;
>>> +
>>> +	mm->context.vdso = (void *)addr;
>>> +
>>> +up_fail:
>>
>> It seems that this is an unexpected condition that probably indicates
>> a failure of the expected state of the process at this point. Perhaps
>> a pr_err() or pr_warning() would be appropriate?
>>
>>> +	up_write(&mm->mmap_sem);
>>> +	return ret;
>>> +}
>
> Really it should always succeed.  Something is seriously wrong if you  
> cannot map that page and we should probably panic().

It seems like it may be recoverable, so perhaps BUG() is better.

> David Daney

David VomLehn

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] MIPS: Move signal return trampolines off the stack.
  2009-04-22 18:13   ` David Daney
@ 2009-04-22 18:31     ` David VomLehn
  0 siblings, 0 replies; 22+ messages in thread
From: David VomLehn @ 2009-04-22 18:31 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips, Ralf Baechle

On Wed, Apr 22, 2009 at 11:13:44AM -0700, David Daney wrote:
> David VomLehn wrote:
...
>> Only one comment, which I would not want to hold up acceptance:
>> based on some numbers sent out recently, it looks like the kernel is
>> experiencing some performance issues with exec() and I think this change will
>> make it slightly slower. You could avoid this by deferring installation of
>> the trampoline to the first use of a system call that registers a signal
>> handler.
>
> I should try to measure this too.  Although this is what x86 et al. do.  
> It is by far much simpler and less prone to bugs that trying to hook  
> into the system calls.  After an executable has had the chance to start  
> additional threads and establish arbitrary mappings things get 
> complicated.

I suspect the overhead is quite small and agree that hooking into the
system calls a bit risky. This might be better done as a phase two, if
at all.

> David Daney

David VomLehn

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-22 18:01     ` David Daney
@ 2009-04-24  7:20       ` Brian Foster
  2009-04-24  7:50         ` Kevin D. Kissell
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2009-04-24  7:20 UTC (permalink / raw)
  To: David Daney, Kevin D. Kissell; +Cc: linux-mips

On Wednesday 22 April 2009 20:01:44 David Daney wrote:
> Kevin D. Kissell wrote:
> > David Daney wrote:
> >> This is a preliminary patch to add a vdso to all user processes.
> >> Still missing are ELF headers and .eh_frame information.  But it is
> >> enough to allow us to move signal trampolines off of the stack.
> >>
> >> We allocate a single page (the vdso) and write all possible signal
> >> trampolines into it.  The stack is moved down by one page and the vdso
> >> is mapped into this space.
> >>
> >> Signed-off-by: David Daney <ddaney@caviumnetworks.com>
> > 
> > Note that for FPU-less CPUs, the kernel FP emulator also uses a user
> > stack trampoline to execute instructions in the delay slots of emulated
> > FP branches.  I didn't see any of the math-emu modules being tweaked in
> > either part of your patch.  Presumably, one would want to move that
> > operation into the vdso as well.

Kevin,

   As David says, this is a Very Ugly Problem.  Each FP trampoline
  is effectively per-(runtime-)instance per-thread, i.e., there is
  a unique FP trampoline for every dynamic instance of (non-trivial
  non-FP) instruction in an FP delay slot.  This is essentially the
  complete opposite of the signal-return trampoline, which is fixed
  (constant text) for all instances in all threads.

   As such, David's vdso (assuming it's similar to those on other
  architectures (I've not looked at it closely yet)) may not have
  any obvious role to play in moving the FP trampoline('s code?)
  off the user's stack.

>[ ... ]
> It is an ugly problem.  I am trying to hack something up to fix it.
> 
> David Daney

David,

   Since we are massively interested in what all this is leading
  to (no-execute stacks using the XI bit), I'm very happy to help
  (time permitting — I've got an overflowing list of other stuff).
  We make a 32-bit 4KSd-based SoC (Innova Card (now Maxim) USIP),
  albeit the latest kernel we have running ATM is 2.6.25 (soon, I
  hope, to be 2.6.26(.1 at least?)).  I'll be trying your patches
  (presumably by backporting) as soon as I can (see: overflowing
  list....  ;-\  ).

cheers!
	-blf-

-- 
“How many surrealists does it take to   | Brian Foster
 change a lightbulb? Three. One calms   | somewhere in south of France
 the warthog, and two fill the bathtub  |   Stop E$$o (ExxonMobil)!
 with brightly-coloured machine tools.” |      http://www.stopesso.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-24  7:20       ` Brian Foster
@ 2009-04-24  7:50         ` Kevin D. Kissell
  2009-04-24 15:30           ` David Daney
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin D. Kissell @ 2009-04-24  7:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: David Daney, linux-mips

[-- Attachment #1: Type: text/plain, Size: 3181 bytes --]

Brian Foster wrote:
> On Wednesday 22 April 2009 20:01:44 David Daney wrote:
>   
>> Kevin D. Kissell wrote:
>>     
>>> David Daney wrote:
>>>       
>>>> This is a preliminary patch to add a vdso to all user processes.
>>>> Still missing are ELF headers and .eh_frame information.  But it is
>>>> enough to allow us to move signal trampolines off of the stack.
>>>>
>>>> We allocate a single page (the vdso) and write all possible signal
>>>> trampolines into it.  The stack is moved down by one page and the vdso
>>>> is mapped into this space.
>>>>
>>>> Signed-off-by: David Daney <ddaney@caviumnetworks.com>
>>>>         
>>> Note that for FPU-less CPUs, the kernel FP emulator also uses a user
>>> stack trampoline to execute instructions in the delay slots of emulated
>>> FP branches.  I didn't see any of the math-emu modules being tweaked in
>>> either part of your patch.  Presumably, one would want to move that
>>> operation into the vdso as well.
>>>       
>
> Kevin,
>
>    As David says, this is a Very Ugly Problem.  Each FP trampoline
>   is effectively per-(runtime-)instance per-thread, i.e., there is
>   a unique FP trampoline for every dynamic instance of (non-trivial
>   non-FP) instruction in an FP delay slot.  This is essentially the
>   complete opposite of the signal-return trampoline, which is fixed
>   (constant text) for all instances in all threads.
>
>    As such, David's vdso (assuming it's similar to those on other
>   architectures (I've not looked at it closely yet)) may not have
>   any obvious role to play in moving the FP trampoline('s code?)
>   off the user's stack.
>   
I haven't reviewed David's code in detail, but from his description, I
thought that there was a vdso page per task/thread.  If there's only one
per processor, then, yes, that poses a challenge to porting the FPU
emulation code to use it, since, as you observe, the instruction
sequence to be executed may differ for each delay slot emulation.  It
should still be possible, though.  FP emulation is in itself expensive,
and FP branches with live delay slots are a smallish subset of the
overall FP instructions to be emulated, so a dynamic scheme to
allocate/free slots in a vdso page wouldn't have that dramatic a
performance impact, overall.  As the instructions aren't constant, the
I-caches would need to be flushed after each dsemul setup, even using a
vdso page, but that shouldn't break the fact that one could avoid it for
signals, so long as a different cache line within the vdso page is used
for signal versus dsemul trampolines.

I'm no longer paid to worry about this stuff - I participate in the
mailing list out of habit, as time permits. I don't have any MIPS
hardware handy to work with, even if I wasn't busy with totally
unrelated stuff.  So my talk is cheap.  You guys can do whatever you
want.  I'm just pointing out that, if you want to get rid of executable
user stacks, you either have to re-implement FP branch delay slot
emulation, or eliminate FPU emulation in the kernel.  If your motivation
is really only signal dispatch performance, you can just leave the
dsemul stuff on the user stack.

          Regards,

          Kevin K.

[-- Attachment #2: Type: text/html, Size: 3815 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-24  7:50         ` Kevin D. Kissell
@ 2009-04-24 15:30           ` David Daney
  2009-04-27  7:19             ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: David Daney @ 2009-04-24 15:30 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: Brian Foster, linux-mips

Kevin D. Kissell wrote:
> Brian Foster wrote:
>> On Wednesday 22 April 2009 20:01:44 David Daney wrote:
>>   
>>> Kevin D. Kissell wrote:
>>>     
>>>> David Daney wrote:
>>>>       
>>>>> This is a preliminary patch to add a vdso to all user processes.
>>>>> Still missing are ELF headers and .eh_frame information.  But it is
>>>>> enough to allow us to move signal trampolines off of the stack.
>>>>>
>>>>> We allocate a single page (the vdso) and write all possible signal
>>>>> trampolines into it.  The stack is moved down by one page and the vdso
>>>>> is mapped into this space.
>>>>>
>>>>> Signed-off-by: David Daney <ddaney@caviumnetworks.com>
>>>>>         
>>>> Note that for FPU-less CPUs, the kernel FP emulator also uses a user
>>>> stack trampoline to execute instructions in the delay slots of emulated
>>>> FP branches.  I didn't see any of the math-emu modules being tweaked in
>>>> either part of your patch.  Presumably, one would want to move that
>>>> operation into the vdso as well.
>>>>       
>>
>> Kevin,
>>
>>    As David says, this is a Very Ugly Problem.  Each FP trampoline
>>   is effectively per-(runtime-)instance per-thread, i.e., there is
>>   a unique FP trampoline for every dynamic instance of (non-trivial
>>   non-FP) instruction in an FP delay slot.  This is essentially the
>>   complete opposite of the signal-return trampoline, which is fixed
>>   (constant text) for all instances in all threads.
>>
>>    As such, David's vdso (assuming it's similar to those on other
>>   architectures (I've not looked at it closely yet)) may not have
>>   any obvious role to play in moving the FP trampoline('s code?)
>>   off the user's stack.
>>   
> I haven't reviewed David's code in detail, but from his description, I 
> thought that there was a vdso page per task/thread.  If there's only one 
> per processor, then, yes, that poses a challenge to porting the FPU 
> emulation code to use it, since, as you observe, the instruction 
> sequence to be executed may differ for each delay slot emulation.  It 
> should still be possible, though.  FP emulation is in itself expensive, 
> and FP branches with live delay slots are a smallish subset of the 
> overall FP instructions to be emulated, so a dynamic scheme to 
> allocate/free slots in a vdso page wouldn't have that dramatic a 
> performance impact, overall.  As the instructions aren't constant, the 
> I-caches would need to be flushed after each dsemul setup, even using a 
> vdso page, but that shouldn't break the fact that one could avoid it for 
> signals, so long as a different cache line within the vdso page is used 
> for signal versus dsemul trampolines.
> 

Kevin is right, this is ugly.

My current plan is to map an anonymous page with execute permission for 
each vma (process) and place all FP trampolines there.  Each thread that 
needs a trampoline will allocate a piece of this page and write the 
trampoline.  We can arrange it so that the only way a thread can exit 
the trampoline is by taking some sort of fault (currently this is true 
for the normal case), or exiting.  Then we free the trampoline for other 
threads to use.  If all the slots in the trampoline page are in use, a 
thread would block until there is a free one, rarely, if ever would this 
happen.

Doing the memory management for the trampoline area and the blocking 
would add quite a bit of complexity.  I am not really concerned about 
performance because people that want good FP performance should use a 
CPU with FP hardware (like my R5000 O2).

David Daney

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-24 15:30           ` David Daney
@ 2009-04-27  7:19             ` Brian Foster
  2009-04-27 12:51               ` Kevin D. Kissell
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2009-04-27  7:19 UTC (permalink / raw)
  To: David Daney; +Cc: Kevin D. Kissell, linux-mips

On Friday 24 April 2009 17:30:35 David Daney wrote:
> Kevin D. Kissell wrote:
> > Brian Foster wrote:
> >> On Wednesday 22 April 2009 20:01:44 David Daney wrote:
> >>> Kevin D. Kissell wrote:
> >>>> David Daney wrote:
> >>>>> This is a preliminary patch to add a vdso to all user processes.
> >>>>>[ ... ]
> >>>> Note that for FPU-less CPUs, the kernel FP emulator also uses a user
> >>>> stack trampoline to execute instructions in the delay slots of emulated
> >>>> FP branches.  [ ... ]
> >>
> >>    As David says, this is a Very Ugly Problem.  Each FP trampoline
> >>   is effectively per-(runtime-)instance per-thread [ ... ]
> > 
> > I haven't reviewed David's code in detail, but from his description, I 
> > thought that there was a vdso page per task/thread.  If there's only one 
> > per processor, then, yes, that poses a challenge to porting the FPU 
> > emulation code to use it, since, as you observe, the instruction 
> > sequence to be executed may differ for each delay slot emulation.  It 
> > should still be possible, though.  [ ... ]
> 
> Kevin is right, this is ugly.
> 
> My current plan is to map an anonymous page with execute permission for 
> each vma (process) and place all FP trampolines there.  Each thread that 
> needs a trampoline will allocate a piece of this page and write the 
> trampoline.  We can arrange it so that the only way a thread can exit 
> the trampoline is by taking some sort of fault (currently this is true 
> for the normal case), or exiting.

David,

   The above is the bit which has always stumped me.
  Having a per-process(or similar) page for the FP
  trampoline(s) is the “obvious” approach, but what
  has had me going around in circles is how to know
  when an allocated slot/trampoline can be freed.
  As you imply, in the normal case, it seems trivial.
  It's the not-normal cases which aren't clear (or at
  least aren't clear to me!).

   You say (EMPHASIS added) “We can arrange it so
  that the ONLY way a thread can exit the trampoline
  is by taking some sort of fault ... or exiting”,
  which if true, could solve the issue.  Could you
  elucidate on this point, please?

  Thanks for your time, effort, and patience.
cheers!
	-blf-

>                                    Then we free the trampoline for other 
> threads to use.  If all the slots in the trampoline page are in use, a 
> thread would block until there is a free one, rarely, if ever would this 
> happen.
>[ ... ]

-- 
“How many surrealists does it take to   | Brian Foster
 change a lightbulb? Three. One calms   | somewhere in south of France
 the warthog, and two fill the bathtub  |   Stop E$$o (ExxonMobil)!
 with brightly-coloured machine tools.” |      http://www.stopesso.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-27  7:19             ` Brian Foster
@ 2009-04-27 12:51               ` Kevin D. Kissell
  2009-04-27 15:54                 ` David Daney
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin D. Kissell @ 2009-04-27 12:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: David Daney, linux-mips

[-- Attachment #1: Type: text/plain, Size: 5051 bytes --]

Brian Foster wrote:
> On Friday 24 April 2009 17:30:35 David Daney wrote:
>   
>> Kevin D. Kissell wrote:
>>     
>>> Brian Foster wrote:
>>>       
>>>> On Wednesday 22 April 2009 20:01:44 David Daney wrote:
>>>>         
>>>>> Kevin D. Kissell wrote:
>>>>>           
>>>>>> David Daney wrote:
>>>>>>             
>>>>>>> This is a preliminary patch to add a vdso to all user processes.
>>>>>>> [ ... ]
>>>>>>>               
>>>>>> Note that for FPU-less CPUs, the kernel FP emulator also uses a user
>>>>>> stack trampoline to execute instructions in the delay slots of emulated
>>>>>> FP branches.  [ ... ]
>>>>>>             
>>>>    As David says, this is a Very Ugly Problem.  Each FP trampoline
>>>>   is effectively per-(runtime-)instance per-thread [ ... ]
>>>>         
>>> I haven't reviewed David's code in detail, but from his description, I 
>>> thought that there was a vdso page per task/thread.  If there's only one 
>>> per processor, then, yes, that poses a challenge to porting the FPU 
>>> emulation code to use it, since, as you observe, the instruction 
>>> sequence to be executed may differ for each delay slot emulation.  It 
>>> should still be possible, though.  [ ... ]
>>>       
>> Kevin is right, this is ugly.
>>
>> My current plan is to map an anonymous page with execute permission for 
>> each vma (process) and place all FP trampolines there.  Each thread that 
>> needs a trampoline will allocate a piece of this page and write the 
>> trampoline.  We can arrange it so that the only way a thread can exit 
>> the trampoline is by taking some sort of fault (currently this is true 
>> for the normal case), or exiting.
>>     
>
> David,
>
>    The above is the bit which has always stumped me.
>   Having a per-process(or similar) page for the FP
>   trampoline(s) is the “obvious” approach, but what
>   has had me going around in circles is how to know
>   when an allocated slot/trampoline can be freed.
>   As you imply, in the normal case, it seems trivial.
>   It's the not-normal cases which aren't clear (or at
>   least aren't clear to me!).
>
>    You say (EMPHASIS added) “We can arrange it so
>   that the ONLY way a thread can exit the trampoline
>   is by taking some sort of fault ... or exiting”,
>   which if true, could solve the issue.  Could you
>   elucidate on this point, please?
>
>   
Well, he's *almost* right about that. The delay slot emulation function
executes a single instruction off the user stack/vdso slot, which is
followed in memory by an instruction that provokes an address
exception.  The address exception handler detects the special case (and
it should be noted that detecting the special case could be made simpler
and more reliable if a vdso-type region were used), cleans up, and
restores normal stack behavior.  That "clean up" could, of course,
include any necessary vdso slot management.  But what about cases that
won't get to the magic alignment trap?

As the instruction being executed is extracted from a branch delay slot,
we know it's not legal for it to be any sort of branch or jump
instruction.  But it *could* be a trap or system call instruction, or a
load/store that would provoke a TLB exception.  In the usual cases,
however, as I believe David was alluding, either the exception will
ultimately unwind to return to execute the magic alignment trap, or the
thread will exit, and could free the emulation slot as part of general
cleanup.

But there's a case that isn't handled in this model, and that's the case
of an exception (or interrupt that falls in the 2-instruction window)
resulting in a signal that is caught and dispatched, and where either
the signal handler does a longjmp and restarts FP computation, or where
the signal handler itself contains a FP branch with yet another delay
slot to be emulated. One *could* get alarm signal before the original
delay slot instruction is executed, so recycling the same vdso cache
line would be premature.  It's hard to get away from something
distinctly stack-like if one wants to cover these cases.

My short-term suggestion would be to leave FP emulator delay slot
handling on the (executable) user stack, even if signal trampolines use
the vdso.  Longer term, we might consider what sorts of crockery would
be necessary to deal with delay slot abandonment and recursion.  That
might mean adding cruft to the signal dispatch logic to detect that
we're in mid-delay-slot-emulation and defer the signal until after the
alignment trap cleanup is done (adds annoying run-time overhead, but is
probably the smallest increase in footprint and complexity), or it might
mean changing the delay slot emulation paradigm completely and bolting a
full instruction set emulator into the FP emulator, so that the delay
slot instruction is simulated in kernel mode, rather than requiring
execution in user mode.  I rejected that idea out-of-hand when I first
did the FP emulator integration with the kernel, years ago, but maybe
the constraints have changed...

          Regards,

          Kevin K.

[-- Attachment #2: Type: text/html, Size: 5878 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-27 12:51               ` Kevin D. Kissell
@ 2009-04-27 15:54                 ` David Daney
  2009-04-27 17:27                   ` Kevin D. Kissell
  0 siblings, 1 reply; 22+ messages in thread
From: David Daney @ 2009-04-27 15:54 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: Brian Foster, linux-mips

Kevin D. Kissell wrote:

> Well, he's *almost* right about that. The delay slot emulation function 
> executes a single instruction off the user stack/vdso slot, which is 
> followed in memory by an instruction that provokes an address 
> exception.  The address exception handler detects the special case (and 
> it should be noted that detecting the special case could be made simpler 
> and more reliable if a vdso-type region were used),

Ralf recently changed this to a 'break' instruction, but the logic 
remains the same.

> cleans up, and 
> restores normal stack behavior.  That "clean up" could, of course, 
> include any necessary vdso slot management.  But what about cases that 
> won't get to the magic alignment trap?
> 
> As the instruction being executed is extracted from a branch delay slot, 
> we know it's not legal for it to be any sort of branch or jump 
> instruction.

These we would detect and since the behavior is 'UNPREDICTABLE' we can 
treat them as a nop and remain within the specified behavior.

>  But it *could* be a trap or system call instruction, or a 
> load/store that would provoke a TLB exception.  In the usual cases, 
> however, as I believe David was alluding, either the exception will 
> ultimately unwind to return to execute the magic alignment trap, or the 
> thread will exit, and could free the emulation slot as part of general 
> cleanup.
> 
> But there's a case that isn't handled in this model, and that's the case 
> of an exception (or interrupt that falls in the 2-instruction window) 
> resulting in a signal that is caught and dispatched, and where either 
> the signal handler does a longjmp and restarts FP computation, or where 
> the signal handler itself contains a FP branch with yet another delay 
> slot to be emulated. One *could* get alarm signal before the original 
> delay slot instruction is executed, so recycling the same vdso cache 
> line would be premature.  It's hard to get away from something 
> distinctly stack-like if one wants to cover these cases.
> 

System calls we don't have to handle, they will eventually return to the 
break instruction following the delay slot instruction and be handled by 
the normal processing.

I am thinking that all other exceptions will result in one of three cases:

1) They will work like system calls and return to the 'break'.

2) The thread will exit.

3) They result in a signal being sent to the thread.  We can handle it 
in force_signal().  In this case we would adjust the eip to point at the 
  original location of the instruction and clean things up.  If the 
signal handler tries to restart the instruction, the FP emulator will 
re-run the emulation.


> My short-term suggestion would be to leave FP emulator delay slot 
> handling on the (executable) user stack, even if signal trampolines use 
> the vdso.

They are really two seperate (but related) problems.  If we want 
eXecute-Inhibit for the stack we need to solve it.

> Longer term, we might consider what sorts of crockery would 
> be necessary to deal with delay slot abandonment and recursion.  That 
> might mean adding cruft to the signal dispatch logic to detect that 
> we're in mid-delay-slot-emulation and defer the signal until after the 
> alignment trap cleanup is done (adds annoying run-time overhead, but is 
> probably the smallest increase in footprint and complexity), or it might 
> mean changing the delay slot emulation paradigm completely and bolting a 
> full instruction set emulator into the FP emulator, so that the delay 
> slot instruction is simulated in kernel mode, rather than requiring 
> execution in user mode.  I rejected that idea out-of-hand when I first 
> did the FP emulator integration with the kernel, years ago, but maybe 
> the constraints have changed...
> 

I think full instruction set emulation is not so easy.  How would you 
emulate COP2 instructions?

David Daney

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-27 15:54                 ` David Daney
@ 2009-04-27 17:27                   ` Kevin D. Kissell
  2009-04-27 18:26                     ` David Daney
  0 siblings, 1 reply; 22+ messages in thread
From: Kevin D. Kissell @ 2009-04-27 17:27 UTC (permalink / raw)
  To: David Daney; +Cc: Brian Foster, linux-mips

David Daney wrote:
> Kevin D. Kissell wrote:
>
>
>>  But it *could* be a trap or system call instruction, or a load/store
>> that would provoke a TLB exception.  In the usual cases, however, as
>> I believe David was alluding, either the exception will ultimately
>> unwind to return to execute the magic alignment trap, or the thread
>> will exit, and could free the emulation slot as part of general cleanup.
>>
>> But there's a case that isn't handled in this model, and that's the
>> case of an exception (or interrupt that falls in the 2-instruction
>> window) resulting in a signal that is caught and dispatched, and
>> where either the signal handler does a longjmp and restarts FP
>> computation, or where the signal handler itself contains a FP branch
>> with yet another delay slot to be emulated. One *could* get alarm
>> signal before the original delay slot instruction is executed, so
>> recycling the same vdso cache line would be premature.  It's hard to
>> get away from something distinctly stack-like if one wants to cover
>> these cases.
>>
>
> System calls we don't have to handle, they will eventually return to
> the break instruction following the delay slot instruction and be
> handled by the normal processing.
>
> I am thinking that all other exceptions will result in one of three
> cases:
>
> 1) They will work like system calls and return to the 'break'.
>
> 2) The thread will exit.
>
> 3) They result in a signal being sent to the thread.  We can handle it
> in force_signal().  In this case we would adjust the eip to point at
> the  original location of the instruction and clean things up.  If the
> signal handler tries to restart the instruction, the FP emulator will
> re-run the emulation.
That's presumably OK if we *know* that the delay slot instruction has
*not* executed prior to the signal being taken.  But if it has, it may
have had side-effects, i.e. imagine if it's an "ADD.S f4, f4, f6". We
can't re-run the emulation without generating erroneous processor
state.  What do we do if, between the ADD.S and the
BREAK-that-replaces-my-old-alignment-trap, a timer interrupt comes in,
causing a SIGALRM which is caught and which executes another FP branch? 
When I first wrote the delay slot handling code, I dreamed of disabling
interrupts during the delay slot instruction emulation - I think I even
coded it that way at one point - but it's fundamentally uncool to go off
into user mode with interrupts off.

I really think that once the delay slot emulation machinery has been put
into motion, that fact needs to become a part of the thread state. 
Currently, it's encoded, in a sense, in the stack pointer and the stack
contents.  If we no longer stack it, and use a more static instruction
buffer as a trampoline, then I think it needs to be tagged as part of
the kernel's thread state.

That knowledge might be used in various ways.  I still think it would
work to check that state and cause any signals to that thread need to be
deferred until it has processed the BREAK instruction to restore the
pre-emulation instruction flow.  Once the state has been restored from
the dsemul record, the signal can be dispatched rather than returning
directly to the branch target.  I don't like putting another check into
the context restore code, though. 

A cruftier, but less inefficient-in-the-expected-case approach would be
a variant on what you suggest above.  Signal dispatch could check the
EPC, and *if* EPC shows that the delay slot exception hasn't yet
executed, it could roll back the EPC to re-execute the FP branch after
the signal.  If the delay slot instruction *has* executed, but not the
following BREAK, the signal dispatch code could preemptively do the
dsemulret cleanup and restore the pre-emulation stack and post-emulation
EPC before doing the signal dispatch.

          Regards,

          Kevin K.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] MIPS: Preliminary vdso.
  2009-04-27 17:27                   ` Kevin D. Kissell
@ 2009-04-27 18:26                     ` David Daney
  0 siblings, 0 replies; 22+ messages in thread
From: David Daney @ 2009-04-27 18:26 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: Brian Foster, linux-mips

Kevin D. Kissell wrote:
> David Daney wrote:
>> Kevin D. Kissell wrote:
>>
>>
>>>  But it *could* be a trap or system call instruction, or a load/store
>>> that would provoke a TLB exception.  In the usual cases, however, as
>>> I believe David was alluding, either the exception will ultimately
>>> unwind to return to execute the magic alignment trap, or the thread
>>> will exit, and could free the emulation slot as part of general cleanup.
>>>
>>> But there's a case that isn't handled in this model, and that's the
>>> case of an exception (or interrupt that falls in the 2-instruction
>>> window) resulting in a signal that is caught and dispatched, and
>>> where either the signal handler does a longjmp and restarts FP
>>> computation, or where the signal handler itself contains a FP branch
>>> with yet another delay slot to be emulated. One *could* get alarm
>>> signal before the original delay slot instruction is executed, so
>>> recycling the same vdso cache line would be premature.  It's hard to
>>> get away from something distinctly stack-like if one wants to cover
>>> these cases.
>>>
>> System calls we don't have to handle, they will eventually return to
>> the break instruction following the delay slot instruction and be
>> handled by the normal processing.
>>
>> I am thinking that all other exceptions will result in one of three
>> cases:
>>
>> 1) They will work like system calls and return to the 'break'.
>>
>> 2) The thread will exit.
>>
>> 3) They result in a signal being sent to the thread.  We can handle it
>> in force_signal().  In this case we would adjust the eip to point at
>> the  original location of the instruction and clean things up.  If the
>> signal handler tries to restart the instruction, the FP emulator will
>> re-run the emulation.
> That's presumably OK if we *know* that the delay slot instruction has
> *not* executed prior to the signal being taken.  But if it has, it may
> have had side-effects, i.e. imagine if it's an "ADD.S f4, f4, f6". We
> can't re-run the emulation without generating erroneous processor
> state.  What do we do if, between the ADD.S and the

FP instructions are always emulated, so they don't do this delay slot 
processing thing.

> BREAK-that-replaces-my-old-alignment-trap, a timer interrupt comes in,
> causing a SIGALRM which is caught and which executes another FP branch? 
> When I first wrote the delay slot handling code, I dreamed of disabling
> interrupts during the delay slot instruction emulation - I think I even
> coded it that way at one point - but it's fundamentally uncool to go off
> into user mode with interrupts off.

Asynchronous signals could be a problem.  I would have to think about 
that more.


> 
> I really think that once the delay slot emulation machinery has been put
> into motion, that fact needs to become a part of the thread state. 
> Currently, it's encoded, in a sense, in the stack pointer and the stack
> contents.  If we no longer stack it, and use a more static instruction
> buffer as a trampoline, then I think it needs to be tagged as part of
> the kernel's thread state.
> 

Exactly.  I had planned on augmenting the thread state to handle all of 
this.



> That knowledge might be used in various ways.  I still think it would
> work to check that state and cause any signals to that thread need to be
> deferred until it has processed the BREAK instruction to restore the
> pre-emulation instruction flow.  Once the state has been restored from
> the dsemul record, the signal can be dispatched rather than returning
> directly to the branch target.  I don't like putting another check into
> the context restore code, though. 
> 
> A cruftier, but less inefficient-in-the-expected-case approach would be
> a variant on what you suggest above.  Signal dispatch could check the
> EPC, and *if* EPC shows that the delay slot exception hasn't yet
> executed, it could roll back the EPC to re-execute the FP branch after
> the signal.  If the delay slot instruction *has* executed, but not the
> following BREAK, the signal dispatch code could preemptively do the
> dsemulret cleanup and restore the pre-emulation stack and post-emulation
> EPC before doing the signal dispatch.
> 
>           Regards,
> 
>           Kevin K.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-04-27 18:26 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-21 21:30 [PATCH 0/2] MIPS: Move signal return trampolines off the stack David Daney
2009-04-21 21:33 ` [PATCH 1/2] MIPS: Preliminary vdso David Daney
2009-04-22  5:24   ` Shane McDonald
2009-04-22 15:18     ` David Daney
2009-04-22  9:35   ` Kevin D. Kissell
2009-04-22 18:01     ` David Daney
2009-04-24  7:20       ` Brian Foster
2009-04-24  7:50         ` Kevin D. Kissell
2009-04-24 15:30           ` David Daney
2009-04-27  7:19             ` Brian Foster
2009-04-27 12:51               ` Kevin D. Kissell
2009-04-27 15:54                 ` David Daney
2009-04-27 17:27                   ` Kevin D. Kissell
2009-04-27 18:26                     ` David Daney
2009-04-22 17:50   ` David VomLehn
2009-04-22 18:05     ` David Daney
2009-04-22 18:28       ` David VomLehn
2009-04-21 21:33 ` [PATCH 2/2] MIPS: Move signal trampolines off of the stack David Daney
2009-04-22 17:57   ` David VomLehn
2009-04-22 18:04 ` [PATCH 0/2] MIPS: Move signal return trampolines off " David VomLehn
2009-04-22 18:13   ` David Daney
2009-04-22 18:31     ` David VomLehn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.