* [PATCHES] alpha asm glue cleanups and fixes
@ 2021-09-25  2:54 Al Viro
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:54 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

	Fallout from asm glue review on alpha:

1) TIF_NOTIFY_SIGNAL support is broken; do_work_pending() handles
it, but the logic *calling* do_work_pending() ignores that flag
completely.  If it's called for other reasons - fine, but
TIF_NOTIFY_SIGNAL alone will not suffice for that (see the sketch
below the list).  Bug from the last cycle (5.11).

2) _TIF_ALLWORK_MASK is junk - it has never been used.

3) !AUDIT_SYSCALL configs have buggered logic for going into the
straced syscall path.  Any thread flag (including TIF_SIGPENDING)
will suffice to send us there.  3.14 bug.

4) on straced syscalls we have force_successful_syscall_return() broken -
it ends up with a3 *not* set to 0.

5) on non-straced syscalls force_successful_syscall_return() handling is
suboptimal - it duplicates code from the normal syscall return path for
no good reason; instead of branching to the copy, it might branch to the
original just fine.

6) ret_from_fork could just as well go to ret_to_user - it's not going
to be hit when returning to kernel mode.
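
To make (1) concrete, the return-to-user glue boils down to roughly the
following, rendered in C (illustrative only - the real thing is the asm
in entry.S, and ret_to_user_sketch() is just a made-up name for it):

	static void ret_to_user_sketch(struct pt_regs *regs,
				       unsigned long r0, unsigned long r19)
	{
		unsigned long flags = current_thread_info()->flags;

		if (flags & _TIF_WORK_MASK)	/* _TIF_NOTIFY_SIGNAL missing from the mask */
			do_work_pending(regs, flags, r0, r19);
		/* otherwise we go straight to RESTORE_ALL + rti, so a pending
		   TIF_NOTIFY_SIGNAL alone never reaches do_work_pending(),
		   even though that function knows how to handle it */
	}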

Patchset lives in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
#work.alpha; individual patches in followups.


* [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling
  2021-09-25  2:54 [PATCHES] alpha asm glue cleanups and fixes Al Viro
@ 2021-09-25  2:55 ` Al Viro
  2021-09-25  2:55   ` [PATCH 2/6] alpha: _TIF_ALLWORK_MASK is unused Al Viro
                     ` (5 more replies)
  2021-09-25  2:59 ` [PATCHES] alpha asm glue cleanups and fixes Al Viro
  2022-09-02  1:48 ` Al Viro
  2 siblings, 6 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:55 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

it needs to be added to _TIF_WORK_MASK, or we might not reach
do_work_pending() in the first place...

Fixes: 5a9a8897c253a ("alpha: add support for TIF_NOTIFY_SIGNAL")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/include/asm/thread_info.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 2592356e32154..0ce1eee0924b1 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -77,7 +77,7 @@ register struct thread_info *__current_thread_info __asm__("$8");
 
 /* Work to do on interrupt/exception return.  */
 #define _TIF_WORK_MASK		(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
-				 _TIF_NOTIFY_RESUME)
+				 _TIF_NOTIFY_RESUME | _TIF_NOTIFY_SIGNAL)
 
 /* Work to do on any return to userspace.  */
 #define _TIF_ALLWORK_MASK	(_TIF_WORK_MASK		\
-- 
2.11.0



* [PATCH 2/6] alpha: _TIF_ALLWORK_MASK is unused
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
@ 2021-09-25  2:55   ` Al Viro
  2021-09-25  2:55   ` [PATCH 3/6] alpha: fix syscall entry in !AUDIT_SYSCALL case Al Viro
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:55 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

... and has never been used, actually

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/include/asm/thread_info.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 0ce1eee0924b1..9b99fece40af9 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -79,10 +79,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define _TIF_WORK_MASK		(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 				 _TIF_NOTIFY_RESUME | _TIF_NOTIFY_SIGNAL)
 
-/* Work to do on any return to userspace.  */
-#define _TIF_ALLWORK_MASK	(_TIF_WORK_MASK		\
-				 | _TIF_SYSCALL_TRACE)
-
 #define TS_UAC_NOPRINT		0x0001	/* ! Preserve the following three */
 #define TS_UAC_NOFIX		0x0002	/* ! flags as they match          */
 #define TS_UAC_SIGBUS		0x0004	/* ! userspace part of 'osf_sysinfo' */
-- 
2.11.0



* [PATCH 3/6] alpha: fix syscall entry in !AUDIT_SYSCALL case
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
  2021-09-25  2:55   ` [PATCH 2/6] alpha: _TIF_ALLWORK_MASK is unused Al Viro
@ 2021-09-25  2:55   ` Al Viro
  2021-09-25  2:55   ` [PATCH 4/6] alpha: fix handling of a3 on straced syscalls Al Viro
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:55 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

We only want to take the slow path if SYSCALL_TRACE or SYSCALL_AUDIT is
set; on !AUDIT_SYSCALL configs the current tree hits it whenever _any_
thread flag (including NEED_RESCHED, NOTIFY_SIGNAL, etc.) happens to
be set.  In the !AUDIT_SYSCALL case the only thing that needs checking
is TIF_SYSCALL_TRACE, which happens to be bit 0 - hence the blbs in
the fix.

Fixes: a9302e843944 ("alpha: Enable system-call auditing support")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43c..c41a5a9c3b9f2 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -469,8 +469,10 @@ entSys:
 #ifdef CONFIG_AUDITSYSCALL
 	lda     $6, _TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT
 	and     $3, $6, $3
-#endif
 	bne     $3, strace
+#else
+	blbs    $3, strace		/* check for SYSCALL_TRACE in disguise */
+#endif
 	beq	$4, 1f
 	ldq	$27, 0($5)
 1:	jsr	$26, ($27), sys_ni_syscall
-- 
2.11.0



* [PATCH 4/6] alpha: fix handling of a3 on straced syscalls
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
  2021-09-25  2:55   ` [PATCH 2/6] alpha: _TIF_ALLWORK_MASK is unused Al Viro
  2021-09-25  2:55   ` [PATCH 3/6] alpha: fix syscall entry in !AUDIT_SYSCALL case Al Viro
@ 2021-09-25  2:55   ` Al Viro
  2021-09-25  2:55   ` [PATCH 5/6] alpha: syscall exit cleanup Al Viro
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:55 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

For a successful syscall that happens to return a negative value, we
want a3 set to 0, no matter whether it's straced or not.  As it is,
for the straced case we leave whatever value a3 had on syscall
entry.  Easily fixed, fortunately...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index c41a5a9c3b9f2..78fe7ee254250 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -600,8 +600,8 @@ ret_from_straced:
 
 	/* check return.. */
 	blt	$0, $strace_error	/* the call failed */
-	stq	$31, 72($sp)		/* a3=0 => no error */
 $strace_success:
+	stq	$31, 72($sp)		/* a3=0 => no error */
 	stq	$0, 0($sp)		/* save return value */
 
 	DO_SWITCH_STACK
-- 
2.11.0



* [PATCH 5/6] alpha: syscall exit cleanup
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
                     ` (2 preceding siblings ...)
  2021-09-25  2:55   ` [PATCH 4/6] alpha: fix handling of a3 on straced syscalls Al Viro
@ 2021-09-25  2:55   ` Al Viro
  2021-09-25  2:55   ` [PATCH 6/6] alpha: ret_from_fork can go straight to ret_to_user Al Viro
  2021-09-25  2:55   ` [PATCH 7/7] alpha: lazy FPU switching Al Viro
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:55 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

$ret_success consists of two insns + a branch to ret_from_sys_call.
The thing is, those insns are identical to the ones immediately
preceding ret_from_sys_call...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index 78fe7ee254250..43380fbf600dd 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -478,6 +478,7 @@ entSys:
 1:	jsr	$26, ($27), sys_ni_syscall
 	ldgp	$gp, 0($26)
 	blt	$0, $syscall_error	/* the call failed */
+$ret_success:
 	stq	$0, 0($sp)
 	stq	$31, 72($sp)		/* a3=0 => no error */
 
@@ -527,11 +528,6 @@ $syscall_error:
 	stq	$1, 72($sp)	/* a3 for return */
 	br	ret_from_sys_call
 
-$ret_success:
-	stq	$0, 0($sp)
-	stq	$31, 72($sp)	/* a3=0 => no error */
-	br	ret_from_sys_call
-
 /*
  * Do all cleanup when returning from all interrupts and system calls.
  *
-- 
2.11.0



* [PATCH 6/6] alpha: ret_from_fork can go straight to ret_to_user
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
                     ` (3 preceding siblings ...)
  2021-09-25  2:55   ` [PATCH 5/6] alpha: syscall exit cleanup Al Viro
@ 2021-09-25  2:55   ` Al Viro
  2021-09-25  2:55   ` [PATCH 7/7] alpha: lazy FPU switching Al Viro
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:55 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

We only hit ret_from_fork when the child is meant to return to
userland (since 2012 or so).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index 43380fbf600dd..a6207c47f0894 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -766,7 +766,7 @@ alpha_switch_to:
 	.align	4
 	.ent	ret_from_fork
 ret_from_fork:
-	lda	$26, ret_from_sys_call
+	lda	$26, ret_to_user
 	mov	$17, $16
 	jmp	$31, schedule_tail
 .end ret_from_fork
-- 
2.11.0



* [PATCH 7/7] alpha: lazy FPU switching
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
                     ` (4 preceding siblings ...)
  2021-09-25  2:55   ` [PATCH 6/6] alpha: ret_from_fork can go straight to ret_to_user Al Viro
@ 2021-09-25  2:55   ` Al Viro
  2021-09-25 19:07     ` Linus Torvalds
  5 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:55 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

	On each context switch we save the FPU registers on stack
of old process and restore FPU registers from the stack of new one.
That allows us to avoid doing that each time we enter/leave the
kernel mode; however, that can get suboptimal in some cases.

	For one thing, we don't need to bother saving anything
for kernel threads.  For another, if between entering and leaving
the kernel a thread gives CPU up more than once, it will do
useless work, saving the same values every time, only to discard
the saved copy as soon as it returns from switch_to().

	Alternative solution:

* move the array we save into from switch_stack to thread_info
* have a (thread-synchronous) flag set when we save them
* do *NOT* save/restore them in do_switch_stack()/undo_switch_stack().
* restore on the exit to user mode (and clear the flag) if the flag had
been set.
* on context switch, entry to fork()/clone()/vfork() and on entry into
straced syscall save (and set the flag) if the flag had not been set.
* have copy_thread() set the flag for child, so they would be restored
once the child returns to userland.
* save (again, conditionally and setting the flag) before do_signal(),
use the saved data in setup_sigcontext()
* have restore_sigcontext() set the flag and copy from sigframe to
save area.
* teach ptrace to look for FPU registers in thread_info instead of
switch_stack.
* teach isolated accesses to FPU registers (rdfpcr, wrfpcr, etc.)
to check the flag (under preempt_disable()) and work with the save area
if it's been set.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/include/asm/fpu.h         |  60 +++++++-----
 arch/alpha/include/asm/switch_to.h   |   1 +
 arch/alpha/include/asm/thread_info.h |  15 +++
 arch/alpha/include/uapi/asm/ptrace.h |   2 +
 arch/alpha/kernel/asm-offsets.c      |   2 +
 arch/alpha/kernel/entry.S            | 184 +++++++++++++++++++++--------------
 arch/alpha/kernel/process.c          |   5 +-
 arch/alpha/kernel/ptrace.c           |  18 ++--
 arch/alpha/kernel/signal.c           |  20 ++--
 arch/alpha/lib/fpreg.c               |  41 ++++++--
 10 files changed, 229 insertions(+), 119 deletions(-)

diff --git a/arch/alpha/include/asm/fpu.h b/arch/alpha/include/asm/fpu.h
index b9691405e56b3..4de001bf2811a 100644
--- a/arch/alpha/include/asm/fpu.h
+++ b/arch/alpha/include/asm/fpu.h
@@ -15,21 +15,27 @@ rdfpcr(void)
 {
 	unsigned long tmp, ret;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		ret = current_thread_info()->fp[31];
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"ftoit $f0,%1\n\t"
-		"itoft %0,$f0"
-		: "=r"(tmp), "=r"(ret));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"ftoit $f0,%1\n\t"
+			"itoft %0,$f0"
+			: "=r"(tmp), "=r"(ret));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"stt $f0,%1\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp), "=m"(ret));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"stt $f0,%1\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp), "=m"(ret));
 #endif
+	}
+	preempt_enable();
 
 	return ret;
 }
@@ -39,21 +45,27 @@ wrfpcr(unsigned long val)
 {
 	unsigned long tmp;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		current_thread_info()->fp[31] = val;
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"itoft %1,$f0\n\t"
-		"mt_fpcr $f0\n\t"
-		"itoft %0,$f0"
-		: "=&r"(tmp) : "r"(val));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"itoft %1,$f0\n\t"
+			"mt_fpcr $f0\n\t"
+			"itoft %0,$f0"
+			: "=&r"(tmp) : "r"(val));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"ldt $f0,%1\n\t"
-		"mt_fpcr $f0\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp) : "m"(val));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"ldt $f0,%1\n\t"
+			"mt_fpcr $f0\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp) : "m"(val));
 #endif
+	}
+	preempt_enable();
 }
 
 static inline unsigned long
diff --git a/arch/alpha/include/asm/switch_to.h b/arch/alpha/include/asm/switch_to.h
index 762b7f975310c..32863581a2975 100644
--- a/arch/alpha/include/asm/switch_to.h
+++ b/arch/alpha/include/asm/switch_to.h
@@ -8,6 +8,7 @@ extern struct task_struct *alpha_switch_to(unsigned long, struct task_struct *);
 
 #define switch_to(P,N,L)						 \
   do {									 \
+    save_fpu();								 \
     (L) = alpha_switch_to(virt_to_phys(&task_thread_info(N)->pcb), (P)); \
     check_mmu_context();						 \
   } while (0)
diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 9b99fece40af9..58faec89cc881 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -27,6 +27,7 @@ struct thread_info {
 	int bpt_nsaved;
 	unsigned long bpt_addr[2];		/* breakpoint handling  */
 	unsigned int bpt_insn[2];
+	unsigned long fp[32];
 };
 
 /*
@@ -83,6 +84,8 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TS_UAC_NOFIX		0x0002	/* ! flags as they match          */
 #define TS_UAC_SIGBUS		0x0004	/* ! userspace part of 'osf_sysinfo' */
 
+#define TS_SAVED_FP		0x0008
+
 #define SET_UNALIGN_CTL(task,value)	({				\
 	__u32 status = task_thread_info(task)->status & ~UAC_BITMASK;	\
 	if (value & PR_UNALIGN_NOPRINT)					\
@@ -106,5 +109,17 @@ register struct thread_info *__current_thread_info __asm__("$8");
 	put_user(res, (int __user *)(value));				\
 	})
 
+#ifndef __ASSEMBLY__
+extern void __save_fpu(void);
+
+static inline void save_fpu(void)
+{
+	if (!(current_thread_info()->status & TS_SAVED_FP)) {
+		current_thread_info()->status |= TS_SAVED_FP;
+		__save_fpu();
+	}
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ALPHA_THREAD_INFO_H */
diff --git a/arch/alpha/include/uapi/asm/ptrace.h b/arch/alpha/include/uapi/asm/ptrace.h
index c29194181025f..5ca45934fcbb8 100644
--- a/arch/alpha/include/uapi/asm/ptrace.h
+++ b/arch/alpha/include/uapi/asm/ptrace.h
@@ -64,7 +64,9 @@ struct switch_stack {
 	unsigned long r14;
 	unsigned long r15;
 	unsigned long r26;
+#ifndef __KERNEL__
 	unsigned long fp[32];	/* fp[31] is fpcr */
+#endif
 };
 
 
diff --git a/arch/alpha/kernel/asm-offsets.c b/arch/alpha/kernel/asm-offsets.c
index 2e125e5c1508c..b121294bee266 100644
--- a/arch/alpha/kernel/asm-offsets.c
+++ b/arch/alpha/kernel/asm-offsets.c
@@ -17,6 +17,8 @@ void foo(void)
 	DEFINE(TI_TASK, offsetof(struct thread_info, task));
 	DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
 	DEFINE(TI_CPU, offsetof(struct thread_info, cpu));
+	DEFINE(TI_FP, offsetof(struct thread_info, fp));
+	DEFINE(TI_STATUS, offsetof(struct thread_info, status));
 	BLANK();
 
         DEFINE(TASK_BLOCKED, offsetof(struct task_struct, blocked));
diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index a6207c47f0894..397254fb93cfe 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -17,7 +17,7 @@
 
 /* Stack offsets.  */
 #define SP_OFF			184
-#define SWITCH_STACK_SIZE	320
+#define SWITCH_STACK_SIZE	64
 
 .macro	CFI_START_OSF_FRAME	func
 	.align	4
@@ -159,7 +159,6 @@
 	.cfi_rel_offset	$13, 32
 	.cfi_rel_offset	$14, 40
 	.cfi_rel_offset	$15, 48
-	/* We don't really care about the FP registers for debugging.  */
 .endm
 
 .macro	UNDO_SWITCH_STACK
@@ -498,6 +497,10 @@ ret_to_user:
 	and	$17, _TIF_WORK_MASK, $2
 	bne	$2, work_pending
 restore_all:
+	ldl	$2, TI_STATUS($8)
+	and	$2, TS_SAVED_FP, $3
+	bne	$3, restore_fpu
+restore_other:
 	.cfi_remember_state
 	RESTORE_ALL
 	call_pal PAL_rti
@@ -506,7 +509,7 @@ ret_to_kernel:
 	.cfi_restore_state
 	lda	$16, 7
 	call_pal PAL_swpipl
-	br restore_all
+	br restore_other
 
 	.align 3
 $syscall_error:
@@ -570,6 +573,14 @@ $work_notifysig:
 	.type	strace, @function
 strace:
 	/* set up signal stack, call syscall_trace */
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	jsr	$26, __save_fpu
+1:
 	DO_SWITCH_STACK
 	jsr	$26, syscall_trace_enter /* returns the syscall number */
 	UNDO_SWITCH_STACK
@@ -649,40 +660,6 @@ do_switch_stack:
 	stq	$14, 40($sp)
 	stq	$15, 48($sp)
 	stq	$26, 56($sp)
-	stt	$f0, 64($sp)
-	stt	$f1, 72($sp)
-	stt	$f2, 80($sp)
-	stt	$f3, 88($sp)
-	stt	$f4, 96($sp)
-	stt	$f5, 104($sp)
-	stt	$f6, 112($sp)
-	stt	$f7, 120($sp)
-	stt	$f8, 128($sp)
-	stt	$f9, 136($sp)
-	stt	$f10, 144($sp)
-	stt	$f11, 152($sp)
-	stt	$f12, 160($sp)
-	stt	$f13, 168($sp)
-	stt	$f14, 176($sp)
-	stt	$f15, 184($sp)
-	stt	$f16, 192($sp)
-	stt	$f17, 200($sp)
-	stt	$f18, 208($sp)
-	stt	$f19, 216($sp)
-	stt	$f20, 224($sp)
-	stt	$f21, 232($sp)
-	stt	$f22, 240($sp)
-	stt	$f23, 248($sp)
-	stt	$f24, 256($sp)
-	stt	$f25, 264($sp)
-	stt	$f26, 272($sp)
-	stt	$f27, 280($sp)
-	mf_fpcr	$f0		# get fpcr
-	stt	$f28, 288($sp)
-	stt	$f29, 296($sp)
-	stt	$f30, 304($sp)
-	stt	$f0, 312($sp)	# save fpcr in slot of $f31
-	ldt	$f0, 64($sp)	# dont let "do_switch_stack" change fp state.
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	do_switch_stack, .-do_switch_stack
@@ -701,48 +678,105 @@ undo_switch_stack:
 	ldq	$14, 40($sp)
 	ldq	$15, 48($sp)
 	ldq	$26, 56($sp)
-	ldt	$f30, 312($sp)	# get saved fpcr
-	ldt	$f0, 64($sp)
-	ldt	$f1, 72($sp)
-	ldt	$f2, 80($sp)
-	ldt	$f3, 88($sp)
-	mt_fpcr	$f30		# install saved fpcr
-	ldt	$f4, 96($sp)
-	ldt	$f5, 104($sp)
-	ldt	$f6, 112($sp)
-	ldt	$f7, 120($sp)
-	ldt	$f8, 128($sp)
-	ldt	$f9, 136($sp)
-	ldt	$f10, 144($sp)
-	ldt	$f11, 152($sp)
-	ldt	$f12, 160($sp)
-	ldt	$f13, 168($sp)
-	ldt	$f14, 176($sp)
-	ldt	$f15, 184($sp)
-	ldt	$f16, 192($sp)
-	ldt	$f17, 200($sp)
-	ldt	$f18, 208($sp)
-	ldt	$f19, 216($sp)
-	ldt	$f20, 224($sp)
-	ldt	$f21, 232($sp)
-	ldt	$f22, 240($sp)
-	ldt	$f23, 248($sp)
-	ldt	$f24, 256($sp)
-	ldt	$f25, 264($sp)
-	ldt	$f26, 272($sp)
-	ldt	$f27, 280($sp)
-	ldt	$f28, 288($sp)
-	ldt	$f29, 296($sp)
-	ldt	$f30, 304($sp)
 	lda	$sp, SWITCH_STACK_SIZE($sp)
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	undo_switch_stack, .-undo_switch_stack
+
+	.align	4
+	.globl	__save_fpu
+	.type	__save_fpu, @function
+__save_fpu:
+.macro V n
+	stt	$f\n, \n * 8 + TI_FP($8)
+.endm
+	V 0
+	V 1
+	V 2
+	V 3
+	V 4
+	V 5
+	V 6
+	V 7
+	V 8
+	V 9
+	V 10
+	V 11
+	V 12
+	V 13
+	V 14
+	V 15
+	V 16
+	V 17
+	V 18
+	V 19
+	V 20
+	V 21
+	V 21
+	V 22
+	V 23
+	V 24
+	V 25
+	V 26
+	V 27
+	mf_fpcr	$f0		# get fpcr
+	V 28
+	V 29
+	V 30
+	stt	$f0, 31 * 8 + TI_FP($8)	# save fpcr in slot of $f31
+	ldt	$f0, TI_FP($8)	# don't let "__save_fpu" change fp state.
+	ret
+.purgem V
+	.size	__save_fpu, .-__save_fpu
+
+	.align	4
+restore_fpu:
+	bic	$2, TS_SAVED_FP, $2
+.macro V n
+	ldt	$f\n, \n * 8 + TI_FP($8)
+.endm
+	ldt	$f30, 31 * 8 + TI_FP($8)	# get saved fpcr
+	V 0
+	V 1
+	V 2
+	V 3
+	mt_fpcr	$f30		# install saved fpcr
+	V 4
+	V 5
+	V 6
+	V 7
+	V 8
+	V 9
+	V 10
+	V 11
+	V 12
+	V 13
+	V 14
+	V 15
+	V 16
+	V 17
+	V 18
+	V 19
+	V 20
+	V 21
+	V 21
+	V 22
+	V 23
+	V 24
+	V 25
+	V 26
+	V 27
+	V 28
+	V 29
+	V 30
+	stl $2, TI_STATUS($8)
+	br restore_other
+.purgem V
+
 \f
 /*
  * The meat of the context switch code.
  */
-
 	.align	4
 	.globl	alpha_switch_to
 	.type	alpha_switch_to, @function
@@ -798,6 +832,14 @@ ret_from_kernel_thread:
 	.ent	alpha_\name
 alpha_\name:
 	.prologue 0
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	jsr	$26, __save_fpu
+1:
 	bsr	$1, do_switch_stack
 	jsr	$26, sys_\name
 	ldq	$26, 56($sp)
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index a5123ea426ce5..e45df572d42cd 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -248,6 +248,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childstack = ((struct switch_stack *) childregs) - 1;
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
+	childti->status |= TS_SAVED_FP;
 
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
@@ -257,6 +258,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 		childstack->r9 = usp;	/* function */
 		childstack->r10 = kthread_arg;
 		childregs->hae = alpha_mv.hae_cache;
+		memset(childti->fp, '\0', sizeof(childti->fp));
 		childti->pcb.usp = 0;
 		return 0;
 	}
@@ -340,8 +342,7 @@ EXPORT_SYMBOL(dump_elf_task);
 int
 dump_elf_task_fp(elf_fpreg_t *dest, struct task_struct *task)
 {
-	struct switch_stack *sw = (struct switch_stack *)task_pt_regs(task) - 1;
-	memcpy(dest, sw->fp, 32 * 8);
+	memcpy(dest, task_thread_info(task)->fp, 32 * 8);
 	return 1;
 }
 EXPORT_SYMBOL(dump_elf_task_fp);
diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index 8c43212ae38e6..1abb03c912d96 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -79,6 +79,8 @@ enum {
  (PAGE_SIZE*2 - sizeof(struct pt_regs) - sizeof(struct switch_stack) \
   + offsetof(struct switch_stack, reg))
 
+#define FP_REG(reg) (offsetof(struct thread_info, reg))
+
 static int regoff[] = {
 	PT_REG(	   r0), PT_REG(	   r1), PT_REG(	   r2), PT_REG(	  r3),
 	PT_REG(	   r4), PT_REG(	   r5), PT_REG(	   r6), PT_REG(	  r7),
@@ -88,14 +90,14 @@ static int regoff[] = {
 	PT_REG(	  r20), PT_REG(	  r21), PT_REG(	  r22), PT_REG(	 r23),
 	PT_REG(	  r24), PT_REG(	  r25), PT_REG(	  r26), PT_REG(	 r27),
 	PT_REG(	  r28), PT_REG(	   gp),		   -1,		   -1,
-	SW_REG(fp[ 0]), SW_REG(fp[ 1]), SW_REG(fp[ 2]), SW_REG(fp[ 3]),
-	SW_REG(fp[ 4]), SW_REG(fp[ 5]), SW_REG(fp[ 6]), SW_REG(fp[ 7]),
-	SW_REG(fp[ 8]), SW_REG(fp[ 9]), SW_REG(fp[10]), SW_REG(fp[11]),
-	SW_REG(fp[12]), SW_REG(fp[13]), SW_REG(fp[14]), SW_REG(fp[15]),
-	SW_REG(fp[16]), SW_REG(fp[17]), SW_REG(fp[18]), SW_REG(fp[19]),
-	SW_REG(fp[20]), SW_REG(fp[21]), SW_REG(fp[22]), SW_REG(fp[23]),
-	SW_REG(fp[24]), SW_REG(fp[25]), SW_REG(fp[26]), SW_REG(fp[27]),
-	SW_REG(fp[28]), SW_REG(fp[29]), SW_REG(fp[30]), SW_REG(fp[31]),
+	FP_REG(fp[ 0]), FP_REG(fp[ 1]), FP_REG(fp[ 2]), FP_REG(fp[ 3]),
+	FP_REG(fp[ 4]), FP_REG(fp[ 5]), FP_REG(fp[ 6]), FP_REG(fp[ 7]),
+	FP_REG(fp[ 8]), FP_REG(fp[ 9]), FP_REG(fp[10]), FP_REG(fp[11]),
+	FP_REG(fp[12]), FP_REG(fp[13]), FP_REG(fp[14]), FP_REG(fp[15]),
+	FP_REG(fp[16]), FP_REG(fp[17]), FP_REG(fp[18]), FP_REG(fp[19]),
+	FP_REG(fp[20]), FP_REG(fp[21]), FP_REG(fp[22]), FP_REG(fp[23]),
+	FP_REG(fp[24]), FP_REG(fp[25]), FP_REG(fp[26]), FP_REG(fp[27]),
+	FP_REG(fp[28]), FP_REG(fp[29]), FP_REG(fp[30]), FP_REG(fp[31]),
 	PT_REG(	   pc)
 };
 
diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c
index bc077babafab5..6968b3a2273f0 100644
--- a/arch/alpha/kernel/signal.c
+++ b/arch/alpha/kernel/signal.c
@@ -150,9 +150,10 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 {
 	unsigned long usp;
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = __get_user(regs->pc, &sc->sc_pc);
+	long err = __get_user(regs->pc, &sc->sc_pc);
 
 	current->restart_block.fn = do_no_restart_syscall;
+	current_thread_info()->status |= TS_SAVED_FP;
 
 	sw->r26 = (unsigned long) ret_from_sys_call;
 
@@ -189,9 +190,9 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 	err |= __get_user(usp, sc->sc_regs+30);
 	wrusp(usp);
 
-	for (i = 0; i < 31; i++)
-		err |= __get_user(sw->fp[i], sc->sc_fpregs+i);
-	err |= __get_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __copy_from_user(current_thread_info()->fp,
+				sc->sc_fpregs, 31 * 8);
+	err |= __get_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	return err;
 }
@@ -272,7 +273,7 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 		 unsigned long mask, unsigned long sp)
 {
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = 0;
+	long err = 0;
 
 	err |= __put_user(on_sig_stack((unsigned long)sc), &sc->sc_onstack);
 	err |= __put_user(mask, &sc->sc_mask);
@@ -312,10 +313,10 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 	err |= __put_user(sp, sc->sc_regs+30);
 	err |= __put_user(0, sc->sc_regs+31);
 
-	for (i = 0; i < 31; i++)
-		err |= __put_user(sw->fp[i], sc->sc_fpregs+i);
+	err |= __copy_to_user(sc->sc_fpregs,
+			      current_thread_info()->fp, 31 * 8);
 	err |= __put_user(0, sc->sc_fpregs+31);
-	err |= __put_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __put_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	err |= __put_user(regs->trap_a0, &sc->sc_traparg_a0);
 	err |= __put_user(regs->trap_a1, &sc->sc_traparg_a1);
@@ -528,6 +529,9 @@ do_work_pending(struct pt_regs *regs, unsigned long thread_flags,
 		} else {
 			local_irq_enable();
 			if (thread_flags & (_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL)) {
+				preempt_disable();
+				save_fpu();
+				preempt_enable();
 				do_signal(regs, r0, r19);
 				r0 = 0;
 			} else {
diff --git a/arch/alpha/lib/fpreg.c b/arch/alpha/lib/fpreg.c
index 34fea465645ba..41830c95fd8bc 100644
--- a/arch/alpha/lib/fpreg.c
+++ b/arch/alpha/lib/fpreg.c
@@ -7,6 +7,8 @@
 
 #include <linux/compiler.h>
 #include <linux/export.h>
+#include <linux/preempt.h>
+#include <asm/thread_info.h>
 
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
 #define STT(reg,val)  asm volatile ("ftoit $f"#reg",%0" : "=r"(val));
@@ -19,7 +21,12 @@ alpha_read_fp_reg (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP)
+		val = current_thread_info()->fp[reg];
+	else switch (reg) {
 	      case  0: STT( 0, val); break;
 	      case  1: STT( 1, val); break;
 	      case  2: STT( 2, val); break;
@@ -52,8 +59,8 @@ alpha_read_fp_reg (unsigned long reg)
 	      case 29: STT(29, val); break;
 	      case 30: STT(30, val); break;
 	      case 31: STT(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg);
@@ -67,7 +74,13 @@ EXPORT_SYMBOL(alpha_read_fp_reg);
 void
 alpha_write_fp_reg (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP)
+		current_thread_info()->fp[reg] = val;
+	else switch (reg) {
 	      case  0: LDT( 0, val); break;
 	      case  1: LDT( 1, val); break;
 	      case  2: LDT( 2, val); break;
@@ -101,6 +114,7 @@ alpha_write_fp_reg (unsigned long reg, unsigned long val)
 	      case 30: LDT(30, val); break;
 	      case 31: LDT(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg);
 
@@ -115,7 +129,14 @@ alpha_read_fp_reg_s (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		LDT(0, current_thread_info()->fp[reg]);
+		STS(0, val);
+	} else switch (reg) {
 	      case  0: STS( 0, val); break;
 	      case  1: STS( 1, val); break;
 	      case  2: STS( 2, val); break;
@@ -148,8 +169,8 @@ alpha_read_fp_reg_s (unsigned long reg)
 	      case 29: STS(29, val); break;
 	      case 30: STS(30, val); break;
 	      case 31: STS(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg_s);
@@ -163,7 +184,14 @@ EXPORT_SYMBOL(alpha_read_fp_reg_s);
 void
 alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		LDS(0, val);
+		STT(0, current_thread_info()->fp[reg]);
+	} else switch (reg) {
 	      case  0: LDS( 0, val); break;
 	      case  1: LDS( 1, val); break;
 	      case  2: LDS( 2, val); break;
@@ -197,5 +225,6 @@ alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 	      case 30: LDS(30, val); break;
 	      case 31: LDS(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg_s);
-- 
2.11.0



* Re: [PATCHES] alpha asm glue cleanups and fixes
  2021-09-25  2:54 [PATCHES] alpha asm glue cleanups and fixes Al Viro
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
@ 2021-09-25  2:59 ` Al Viro
  2022-09-02  1:48 ` Al Viro
  2 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25  2:59 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

On Sat, Sep 25, 2021 at 02:54:13AM +0000, Al Viro wrote:
> 	Fallout from asm glue review on alpha:
> 
> 1) TIF_NOTIFY_SIGNAL support is broken; do_work_pending() handles
> it, but the logic *calling* do_work_pending() ignores that flag
> completely.  If it's called for other reasons - fine, but
> TIF_NOTIFY_SIGNAL alone will not suffice for that.  Bug from the
> last cycle (5.11).
> 
> 2) _TIF_ALLWORK_MASK is junk - it has never been used.
> 
> 3) !AUDIT_SYSCALL configs have buggered logic for going into the
> straced syscall path.  Any thread flag (including TIF_SIGPENDING)
> will suffice to send us there.  3.14 bug.
> 
> 4) on straced syscalls we have force_successful_syscall_return() broken -
> it ends up with a3 *not* set to 0.
> 
> 5) on non-straced syscalls force_successful_syscall_return() handling is
> suboptimal - it duplicates code from the normal syscall return path for
> no good reason; instead of branching to the copy, it might branch to the
> original just fine.
> 
> 6) ret_from_fork could just as well go to ret_to_user - it's not going
> to be hit when returning to kernel mode.
> 
> Patchset lives in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
> #work.alpha; individual patches in followups.

... and as a followup to that (pretty much untested), the following (vfs.git
#untested.alpha); review and testing (especially on ev4 boxen) would be very
welcome.

commit fa9de0e4325e86401e4e70ce839a5d3a75dae5cc
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed Sep 22 14:12:39 2021 -0400

    alpha: lazy FPU switching
    
            On each context switch we save the FPU registers on stack
    of old process and restore FPU registers from the stack of new one.
    That allows us to avoid doing that each time we enter/leave the
    kernel mode; however, that can get suboptimal in some cases.
    
            For one thing, we don't need to bother saving anything
    for kernel threads.  For another, if between entering and leaving
    the kernel a thread gives CPU up more than once, it will do
    useless work, saving the same values every time, only to discard
    the saved copy as soon as it returns from switch_to().
    
            Alternative solution:
    
    * move the array we save into from switch_stack to thread_info
    * have a (thread-synchronous) flag set when we save them
    * do *NOT* save/restore them in do_switch_stack()/undo_switch_stack().
    * restore on the exit to user mode (and clear the flag) if the flag had
    been set.
    * on context switch, entry to fork()/clone()/vfork() and on entry into
    straced syscall save (and set the flag) if the flag had not been set.
    * have copy_thread() set the flag for child, so they would be restored
    once the child returns to userland.
    * save (again, conditionally and setting the flag) before do_signal(),
    use the saved data in setup_sigcontext()
    * have restore_sigcontext() set the flag and copy from sigframe to
    save area.
    * teach ptrace to look for FPU registers in thread_info instead of
    switch_stack.
    * teach isolated accesses to FPU registers (rdfpcr, wrfpcr, etc.)
    to check the flag (under preempt_disable()) and work with the save area
    if it's been set.
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/arch/alpha/include/asm/fpu.h b/arch/alpha/include/asm/fpu.h
index b9691405e56b3..4de001bf2811a 100644
--- a/arch/alpha/include/asm/fpu.h
+++ b/arch/alpha/include/asm/fpu.h
@@ -15,21 +15,27 @@ rdfpcr(void)
 {
 	unsigned long tmp, ret;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		ret = current_thread_info()->fp[31];
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"ftoit $f0,%1\n\t"
-		"itoft %0,$f0"
-		: "=r"(tmp), "=r"(ret));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"ftoit $f0,%1\n\t"
+			"itoft %0,$f0"
+			: "=r"(tmp), "=r"(ret));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"stt $f0,%1\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp), "=m"(ret));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"stt $f0,%1\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp), "=m"(ret));
 #endif
+	}
+	preempt_enable();
 
 	return ret;
 }
@@ -39,21 +45,27 @@ wrfpcr(unsigned long val)
 {
 	unsigned long tmp;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		current_thread_info()->fp[31] = val;
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"itoft %1,$f0\n\t"
-		"mt_fpcr $f0\n\t"
-		"itoft %0,$f0"
-		: "=&r"(tmp) : "r"(val));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"itoft %1,$f0\n\t"
+			"mt_fpcr $f0\n\t"
+			"itoft %0,$f0"
+			: "=&r"(tmp) : "r"(val));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"ldt $f0,%1\n\t"
-		"mt_fpcr $f0\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp) : "m"(val));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"ldt $f0,%1\n\t"
+			"mt_fpcr $f0\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp) : "m"(val));
 #endif
+	}
+	preempt_enable();
 }
 
 static inline unsigned long
diff --git a/arch/alpha/include/asm/switch_to.h b/arch/alpha/include/asm/switch_to.h
index 762b7f975310c..32863581a2975 100644
--- a/arch/alpha/include/asm/switch_to.h
+++ b/arch/alpha/include/asm/switch_to.h
@@ -8,6 +8,7 @@ extern struct task_struct *alpha_switch_to(unsigned long, struct task_struct *);
 
 #define switch_to(P,N,L)						 \
   do {									 \
+    save_fpu();								 \
     (L) = alpha_switch_to(virt_to_phys(&task_thread_info(N)->pcb), (P)); \
     check_mmu_context();						 \
   } while (0)
diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 9b99fece40af9..58faec89cc881 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -27,6 +27,7 @@ struct thread_info {
 	int bpt_nsaved;
 	unsigned long bpt_addr[2];		/* breakpoint handling  */
 	unsigned int bpt_insn[2];
+	unsigned long fp[32];
 };
 
 /*
@@ -83,6 +84,8 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TS_UAC_NOFIX		0x0002	/* ! flags as they match          */
 #define TS_UAC_SIGBUS		0x0004	/* ! userspace part of 'osf_sysinfo' */
 
+#define TS_SAVED_FP		0x0008
+
 #define SET_UNALIGN_CTL(task,value)	({				\
 	__u32 status = task_thread_info(task)->status & ~UAC_BITMASK;	\
 	if (value & PR_UNALIGN_NOPRINT)					\
@@ -106,5 +109,17 @@ register struct thread_info *__current_thread_info __asm__("$8");
 	put_user(res, (int __user *)(value));				\
 	})
 
+#ifndef __ASSEMBLY__
+extern void __save_fpu(void);
+
+static inline void save_fpu(void)
+{
+	if (!(current_thread_info()->status & TS_SAVED_FP)) {
+		current_thread_info()->status |= TS_SAVED_FP;
+		__save_fpu();
+	}
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ALPHA_THREAD_INFO_H */
diff --git a/arch/alpha/include/uapi/asm/ptrace.h b/arch/alpha/include/uapi/asm/ptrace.h
index c29194181025f..5ca45934fcbb8 100644
--- a/arch/alpha/include/uapi/asm/ptrace.h
+++ b/arch/alpha/include/uapi/asm/ptrace.h
@@ -64,7 +64,9 @@ struct switch_stack {
 	unsigned long r14;
 	unsigned long r15;
 	unsigned long r26;
+#ifndef __KERNEL__
 	unsigned long fp[32];	/* fp[31] is fpcr */
+#endif
 };
 
 
diff --git a/arch/alpha/kernel/asm-offsets.c b/arch/alpha/kernel/asm-offsets.c
index 2e125e5c1508c..b121294bee266 100644
--- a/arch/alpha/kernel/asm-offsets.c
+++ b/arch/alpha/kernel/asm-offsets.c
@@ -17,6 +17,8 @@ void foo(void)
 	DEFINE(TI_TASK, offsetof(struct thread_info, task));
 	DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
 	DEFINE(TI_CPU, offsetof(struct thread_info, cpu));
+	DEFINE(TI_FP, offsetof(struct thread_info, fp));
+	DEFINE(TI_STATUS, offsetof(struct thread_info, status));
 	BLANK();
 
         DEFINE(TASK_BLOCKED, offsetof(struct task_struct, blocked));
diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index a6207c47f0894..397254fb93cfe 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -17,7 +17,7 @@
 
 /* Stack offsets.  */
 #define SP_OFF			184
-#define SWITCH_STACK_SIZE	320
+#define SWITCH_STACK_SIZE	64
 
 .macro	CFI_START_OSF_FRAME	func
 	.align	4
@@ -159,7 +159,6 @@
 	.cfi_rel_offset	$13, 32
 	.cfi_rel_offset	$14, 40
 	.cfi_rel_offset	$15, 48
-	/* We don't really care about the FP registers for debugging.  */
 .endm
 
 .macro	UNDO_SWITCH_STACK
@@ -498,6 +497,10 @@ ret_to_user:
 	and	$17, _TIF_WORK_MASK, $2
 	bne	$2, work_pending
 restore_all:
+	ldl	$2, TI_STATUS($8)
+	and	$2, TS_SAVED_FP, $3
+	bne	$3, restore_fpu
+restore_other:
 	.cfi_remember_state
 	RESTORE_ALL
 	call_pal PAL_rti
@@ -506,7 +509,7 @@ ret_to_kernel:
 	.cfi_restore_state
 	lda	$16, 7
 	call_pal PAL_swpipl
-	br restore_all
+	br restore_other
 
 	.align 3
 $syscall_error:
@@ -570,6 +573,14 @@ $work_notifysig:
 	.type	strace, @function
 strace:
 	/* set up signal stack, call syscall_trace */
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	jsr	$26, __save_fpu
+1:
 	DO_SWITCH_STACK
 	jsr	$26, syscall_trace_enter /* returns the syscall number */
 	UNDO_SWITCH_STACK
@@ -649,40 +660,6 @@ do_switch_stack:
 	stq	$14, 40($sp)
 	stq	$15, 48($sp)
 	stq	$26, 56($sp)
-	stt	$f0, 64($sp)
-	stt	$f1, 72($sp)
-	stt	$f2, 80($sp)
-	stt	$f3, 88($sp)
-	stt	$f4, 96($sp)
-	stt	$f5, 104($sp)
-	stt	$f6, 112($sp)
-	stt	$f7, 120($sp)
-	stt	$f8, 128($sp)
-	stt	$f9, 136($sp)
-	stt	$f10, 144($sp)
-	stt	$f11, 152($sp)
-	stt	$f12, 160($sp)
-	stt	$f13, 168($sp)
-	stt	$f14, 176($sp)
-	stt	$f15, 184($sp)
-	stt	$f16, 192($sp)
-	stt	$f17, 200($sp)
-	stt	$f18, 208($sp)
-	stt	$f19, 216($sp)
-	stt	$f20, 224($sp)
-	stt	$f21, 232($sp)
-	stt	$f22, 240($sp)
-	stt	$f23, 248($sp)
-	stt	$f24, 256($sp)
-	stt	$f25, 264($sp)
-	stt	$f26, 272($sp)
-	stt	$f27, 280($sp)
-	mf_fpcr	$f0		# get fpcr
-	stt	$f28, 288($sp)
-	stt	$f29, 296($sp)
-	stt	$f30, 304($sp)
-	stt	$f0, 312($sp)	# save fpcr in slot of $f31
-	ldt	$f0, 64($sp)	# dont let "do_switch_stack" change fp state.
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	do_switch_stack, .-do_switch_stack
@@ -701,48 +678,105 @@ undo_switch_stack:
 	ldq	$14, 40($sp)
 	ldq	$15, 48($sp)
 	ldq	$26, 56($sp)
-	ldt	$f30, 312($sp)	# get saved fpcr
-	ldt	$f0, 64($sp)
-	ldt	$f1, 72($sp)
-	ldt	$f2, 80($sp)
-	ldt	$f3, 88($sp)
-	mt_fpcr	$f30		# install saved fpcr
-	ldt	$f4, 96($sp)
-	ldt	$f5, 104($sp)
-	ldt	$f6, 112($sp)
-	ldt	$f7, 120($sp)
-	ldt	$f8, 128($sp)
-	ldt	$f9, 136($sp)
-	ldt	$f10, 144($sp)
-	ldt	$f11, 152($sp)
-	ldt	$f12, 160($sp)
-	ldt	$f13, 168($sp)
-	ldt	$f14, 176($sp)
-	ldt	$f15, 184($sp)
-	ldt	$f16, 192($sp)
-	ldt	$f17, 200($sp)
-	ldt	$f18, 208($sp)
-	ldt	$f19, 216($sp)
-	ldt	$f20, 224($sp)
-	ldt	$f21, 232($sp)
-	ldt	$f22, 240($sp)
-	ldt	$f23, 248($sp)
-	ldt	$f24, 256($sp)
-	ldt	$f25, 264($sp)
-	ldt	$f26, 272($sp)
-	ldt	$f27, 280($sp)
-	ldt	$f28, 288($sp)
-	ldt	$f29, 296($sp)
-	ldt	$f30, 304($sp)
 	lda	$sp, SWITCH_STACK_SIZE($sp)
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	undo_switch_stack, .-undo_switch_stack
+
+	.align	4
+	.globl	__save_fpu
+	.type	__save_fpu, @function
+__save_fpu:
+.macro V n
+	stt	$f\n, \n * 8 + TI_FP($8)
+.endm
+	V 0
+	V 1
+	V 2
+	V 3
+	V 4
+	V 5
+	V 6
+	V 7
+	V 8
+	V 9
+	V 10
+	V 11
+	V 12
+	V 13
+	V 14
+	V 15
+	V 16
+	V 17
+	V 18
+	V 19
+	V 20
+	V 21
+	V 21
+	V 22
+	V 23
+	V 24
+	V 25
+	V 26
+	V 27
+	mf_fpcr	$f0		# get fpcr
+	V 28
+	V 29
+	V 30
+	stt	$f0, 31 * 8 + TI_FP($8)	# save fpcr in slot of $f31
+	ldt	$f0, TI_FP($8)	# don't let "__save_fpu" change fp state.
+	ret
+.purgem V
+	.size	__save_fpu, .-__save_fpu
+
+	.align	4
+restore_fpu:
+	bic	$2, TS_SAVED_FP, $2
+.macro V n
+	ldt	$f\n, \n * 8 + TI_FP($8)
+.endm
+	ldt	$f30, 31 * 8 + TI_FP($8)	# get saved fpcr
+	V 0
+	V 1
+	V 2
+	V 3
+	mt_fpcr	$f30		# install saved fpcr
+	V 4
+	V 5
+	V 6
+	V 7
+	V 8
+	V 9
+	V 10
+	V 11
+	V 12
+	V 13
+	V 14
+	V 15
+	V 16
+	V 17
+	V 18
+	V 19
+	V 20
+	V 21
+	V 21
+	V 22
+	V 23
+	V 24
+	V 25
+	V 26
+	V 27
+	V 28
+	V 29
+	V 30
+	stl $2, TI_STATUS($8)
+	br restore_other
+.purgem V
+
 \f
 /*
  * The meat of the context switch code.
  */
-
 	.align	4
 	.globl	alpha_switch_to
 	.type	alpha_switch_to, @function
@@ -798,6 +832,14 @@ ret_from_kernel_thread:
 	.ent	alpha_\name
 alpha_\name:
 	.prologue 0
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	jsr	$26, __save_fpu
+1:
 	bsr	$1, do_switch_stack
 	jsr	$26, sys_\name
 	ldq	$26, 56($sp)
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index a5123ea426ce5..e45df572d42cd 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -248,6 +248,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childstack = ((struct switch_stack *) childregs) - 1;
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
+	childti->status |= TS_SAVED_FP;
 
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
@@ -257,6 +258,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 		childstack->r9 = usp;	/* function */
 		childstack->r10 = kthread_arg;
 		childregs->hae = alpha_mv.hae_cache;
+		memset(childti->fp, '\0', sizeof(childti->fp));
 		childti->pcb.usp = 0;
 		return 0;
 	}
@@ -340,8 +342,7 @@ EXPORT_SYMBOL(dump_elf_task);
 int
 dump_elf_task_fp(elf_fpreg_t *dest, struct task_struct *task)
 {
-	struct switch_stack *sw = (struct switch_stack *)task_pt_regs(task) - 1;
-	memcpy(dest, sw->fp, 32 * 8);
+	memcpy(dest, task_thread_info(task)->fp, 32 * 8);
 	return 1;
 }
 EXPORT_SYMBOL(dump_elf_task_fp);
diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index 8c43212ae38e6..1abb03c912d96 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -79,6 +79,8 @@ enum {
  (PAGE_SIZE*2 - sizeof(struct pt_regs) - sizeof(struct switch_stack) \
   + offsetof(struct switch_stack, reg))
 
+#define FP_REG(reg) (offsetof(struct thread_info, reg))
+
 static int regoff[] = {
 	PT_REG(	   r0), PT_REG(	   r1), PT_REG(	   r2), PT_REG(	  r3),
 	PT_REG(	   r4), PT_REG(	   r5), PT_REG(	   r6), PT_REG(	  r7),
@@ -88,14 +90,14 @@ static int regoff[] = {
 	PT_REG(	  r20), PT_REG(	  r21), PT_REG(	  r22), PT_REG(	 r23),
 	PT_REG(	  r24), PT_REG(	  r25), PT_REG(	  r26), PT_REG(	 r27),
 	PT_REG(	  r28), PT_REG(	   gp),		   -1,		   -1,
-	SW_REG(fp[ 0]), SW_REG(fp[ 1]), SW_REG(fp[ 2]), SW_REG(fp[ 3]),
-	SW_REG(fp[ 4]), SW_REG(fp[ 5]), SW_REG(fp[ 6]), SW_REG(fp[ 7]),
-	SW_REG(fp[ 8]), SW_REG(fp[ 9]), SW_REG(fp[10]), SW_REG(fp[11]),
-	SW_REG(fp[12]), SW_REG(fp[13]), SW_REG(fp[14]), SW_REG(fp[15]),
-	SW_REG(fp[16]), SW_REG(fp[17]), SW_REG(fp[18]), SW_REG(fp[19]),
-	SW_REG(fp[20]), SW_REG(fp[21]), SW_REG(fp[22]), SW_REG(fp[23]),
-	SW_REG(fp[24]), SW_REG(fp[25]), SW_REG(fp[26]), SW_REG(fp[27]),
-	SW_REG(fp[28]), SW_REG(fp[29]), SW_REG(fp[30]), SW_REG(fp[31]),
+	FP_REG(fp[ 0]), FP_REG(fp[ 1]), FP_REG(fp[ 2]), FP_REG(fp[ 3]),
+	FP_REG(fp[ 4]), FP_REG(fp[ 5]), FP_REG(fp[ 6]), FP_REG(fp[ 7]),
+	FP_REG(fp[ 8]), FP_REG(fp[ 9]), FP_REG(fp[10]), FP_REG(fp[11]),
+	FP_REG(fp[12]), FP_REG(fp[13]), FP_REG(fp[14]), FP_REG(fp[15]),
+	FP_REG(fp[16]), FP_REG(fp[17]), FP_REG(fp[18]), FP_REG(fp[19]),
+	FP_REG(fp[20]), FP_REG(fp[21]), FP_REG(fp[22]), FP_REG(fp[23]),
+	FP_REG(fp[24]), FP_REG(fp[25]), FP_REG(fp[26]), FP_REG(fp[27]),
+	FP_REG(fp[28]), FP_REG(fp[29]), FP_REG(fp[30]), FP_REG(fp[31]),
 	PT_REG(	   pc)
 };
 
diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c
index bc077babafab5..6968b3a2273f0 100644
--- a/arch/alpha/kernel/signal.c
+++ b/arch/alpha/kernel/signal.c
@@ -150,9 +150,10 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 {
 	unsigned long usp;
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = __get_user(regs->pc, &sc->sc_pc);
+	long err = __get_user(regs->pc, &sc->sc_pc);
 
 	current->restart_block.fn = do_no_restart_syscall;
+	current_thread_info()->status |= TS_SAVED_FP;
 
 	sw->r26 = (unsigned long) ret_from_sys_call;
 
@@ -189,9 +190,9 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 	err |= __get_user(usp, sc->sc_regs+30);
 	wrusp(usp);
 
-	for (i = 0; i < 31; i++)
-		err |= __get_user(sw->fp[i], sc->sc_fpregs+i);
-	err |= __get_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __copy_from_user(current_thread_info()->fp,
+				sc->sc_fpregs, 31 * 8);
+	err |= __get_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	return err;
 }
@@ -272,7 +273,7 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 		 unsigned long mask, unsigned long sp)
 {
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = 0;
+	long err = 0;
 
 	err |= __put_user(on_sig_stack((unsigned long)sc), &sc->sc_onstack);
 	err |= __put_user(mask, &sc->sc_mask);
@@ -312,10 +313,10 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 	err |= __put_user(sp, sc->sc_regs+30);
 	err |= __put_user(0, sc->sc_regs+31);
 
-	for (i = 0; i < 31; i++)
-		err |= __put_user(sw->fp[i], sc->sc_fpregs+i);
+	err |= __copy_to_user(sc->sc_fpregs,
+			      current_thread_info()->fp, 31 * 8);
 	err |= __put_user(0, sc->sc_fpregs+31);
-	err |= __put_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __put_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	err |= __put_user(regs->trap_a0, &sc->sc_traparg_a0);
 	err |= __put_user(regs->trap_a1, &sc->sc_traparg_a1);
@@ -528,6 +529,9 @@ do_work_pending(struct pt_regs *regs, unsigned long thread_flags,
 		} else {
 			local_irq_enable();
 			if (thread_flags & (_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL)) {
+				preempt_disable();
+				save_fpu();
+				preempt_enable();
 				do_signal(regs, r0, r19);
 				r0 = 0;
 			} else {
diff --git a/arch/alpha/lib/fpreg.c b/arch/alpha/lib/fpreg.c
index 34fea465645ba..41830c95fd8bc 100644
--- a/arch/alpha/lib/fpreg.c
+++ b/arch/alpha/lib/fpreg.c
@@ -7,6 +7,8 @@
 
 #include <linux/compiler.h>
 #include <linux/export.h>
+#include <linux/preempt.h>
+#include <asm/thread_info.h>
 
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
 #define STT(reg,val)  asm volatile ("ftoit $f"#reg",%0" : "=r"(val));
@@ -19,7 +21,12 @@ alpha_read_fp_reg (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP)
+		val = current_thread_info()->fp[reg];
+	else switch (reg) {
 	      case  0: STT( 0, val); break;
 	      case  1: STT( 1, val); break;
 	      case  2: STT( 2, val); break;
@@ -52,8 +59,8 @@ alpha_read_fp_reg (unsigned long reg)
 	      case 29: STT(29, val); break;
 	      case 30: STT(30, val); break;
 	      case 31: STT(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg);
@@ -67,7 +74,13 @@ EXPORT_SYMBOL(alpha_read_fp_reg);
 void
 alpha_write_fp_reg (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP)
+		current_thread_info()->fp[reg] = val;
+	else switch (reg) {
 	      case  0: LDT( 0, val); break;
 	      case  1: LDT( 1, val); break;
 	      case  2: LDT( 2, val); break;
@@ -101,6 +114,7 @@ alpha_write_fp_reg (unsigned long reg, unsigned long val)
 	      case 30: LDT(30, val); break;
 	      case 31: LDT(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg);
 
@@ -115,7 +129,14 @@ alpha_read_fp_reg_s (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		LDT(0, current_thread_info()->fp[reg]);
+		STS(0, val);
+	} else switch (reg) {
 	      case  0: STS( 0, val); break;
 	      case  1: STS( 1, val); break;
 	      case  2: STS( 2, val); break;
@@ -148,8 +169,8 @@ alpha_read_fp_reg_s (unsigned long reg)
 	      case 29: STS(29, val); break;
 	      case 30: STS(30, val); break;
 	      case 31: STS(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg_s);
@@ -163,7 +184,14 @@ EXPORT_SYMBOL(alpha_read_fp_reg_s);
 void
 alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		LDS(0, val);
+		STT(0, current_thread_info()->fp[reg]);
+	} else switch (reg) {
 	      case  0: LDS( 0, val); break;
 	      case  1: LDS( 1, val); break;
 	      case  2: LDS( 2, val); break;
@@ -197,5 +225,6 @@ alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 	      case 30: LDS(30, val); break;
 	      case 31: LDS(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg_s);


* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-09-25  2:55   ` [PATCH 7/7] alpha: lazy FPU switching Al Viro
@ 2021-09-25 19:07     ` Linus Torvalds
  2021-09-25 20:43       ` Al Viro
  2021-09-26  9:08       ` John Paul Adrian Glaubitz
  0 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2021-09-25 19:07 UTC (permalink / raw)
  To: Al Viro; +Cc: alpha

On Fri, Sep 24, 2021 at 7:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         On each context switch we save the FPU registers on stack
> of old process and restore FPU registers from the stack of new one.
> That allows us to avoid doing that each time we enter/leave the
> kernel mode; however, that can get suboptimal in some cases.

Do you actually have a system or virtual image to test this all out on?

I'm not saying this doesn't look like an improvement, I'm more
questioning whether it's worth it...

          Linus


* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-09-25 19:07     ` Linus Torvalds
@ 2021-09-25 20:43       ` Al Viro
  2021-09-25 23:18         ` Linus Torvalds
  2021-10-30 20:46         ` Al Viro
  2021-09-26  9:08       ` John Paul Adrian Glaubitz
  1 sibling, 2 replies; 29+ messages in thread
From: Al Viro @ 2021-09-25 20:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: alpha

On Sat, Sep 25, 2021 at 12:07:17PM -0700, Linus Torvalds wrote:
> On Fri, Sep 24, 2021 at 7:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         On each context switch we save the FPU registers on stack
> > of old process and restore FPU registers from the stack of new one.
> > That allows us to avoid doing that each time we enter/leave the
> > kernel mode; however, that can get suboptimal in some cases.
> 
> Do you actually have a system or virtual image to test this all out on?
> 
> I'm not saying this doesn't look like an improvement, I'm more
> questioning whether it's worth it...

Umm...  Bootable AS200 (EV45), bootable DS10 (EV6), theoretically
resurrectable UP1000 (EV67, but the fans on the CPU module are in horrible
shape and the southbridge is unreliable, so keeping it alive is more
trouble than it's worth), working qemu-system-alpha (EV67).  No SMP boxen,
and I've no idea if qemu can do SMP alpha these days...

Whether it's worth it... beginning of the series or this one?  If it's about
the former, the stuff in the series is pretty straightforward bug fixes and
equally straightforward cleanups.  If it's the latter... hell knows;
it would be tempting to see if we could
	* make FPU saves/restores lazy, evicting that stuff from switch_stack
	* add r9..r15 to pt_regs, saving on each kernel entry and restoring
if we have a flag set (note that entMM() and entUnaUser() already save/restore
those - unconditionally).  That would kill the need to play with
switch_stack in straced syscalls/do_signal/etc.  switch_stack (trimmed down
to r9..r15,r26 - the callee-saved registers; sketched below) would be used
by switch_to(), but that would be it.
	* take the entire ret_from_syscall et.al. out into C side of things.

This patch is basically a "let's see how awful the FPU-related part would
be" experiment.
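
For illustration, the trimmed-down structure mentioned above might look
something like this (hypothetical, not part of this series):

	/* callee-saved registers only; would be used by switch_to() and
	   nothing else, with r9..r15 also saved into pt_regs on kernel
	   entry and the FP state living in thread_info */
	struct switch_stack {
		unsigned long r9, r10, r11, r12, r13, r14, r15;
		unsigned long r26;
	};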


* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-09-25 20:43       ` Al Viro
@ 2021-09-25 23:18         ` Linus Torvalds
  2021-09-26  0:31           ` Al Viro
  2021-10-30 20:46         ` Al Viro
  1 sibling, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2021-09-25 23:18 UTC (permalink / raw)
  To: Al Viro; +Cc: alpha

On Sat, Sep 25, 2021 at 1:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  Bootable AS200 (EV45), bootable DS10 (EV6), theoretically
> resurrectable UP1000 (EV67, fans on CPU module are in horrible state
> and southbridge is unreliable, so the life is more interesting than
> it's worth), working qemu-system-alpha (EV67).  No SMP boxen and
> I've no idea if qemu can do SMP alpha these days...

Well, the way we traditionally did lazy x87 state save/restore on x86
was very much smp-sensitive, so that lack of test coverage is a bit
sad.

That said, this approach doesn't really seem to have _those_ kinds of
issues, since you always save things at 'switch_to()', so I guess it
doesn't matter.

> Whether it's worth it... beginning of the series or this one?

This last one was the one I reacted to.

I don't think it's wrong (although please, use a more descriptive name
than "V" for that asm macro shorthand), but it does strike me as
somewhat special.

And if we do want to do this (I'm open to it, I just want to make sure
it's tested), please just make those alpha_{read|write}_fp_reg()
functions always do the save_fpu() thing, and then always just access
the array.

IOW, something like

        preempt_disable();
        save_fpu(current);
        preempt_enable();
        .. now access the array that is easy to index ..

and just remove the silly "switch (reg)" things that access the raw
registers. We couldn't do that before, but with that save state area
it's trivial and much cleaner.

            Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-09-25 23:18         ` Linus Torvalds
@ 2021-09-26  0:31           ` Al Viro
  0 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-09-26  0:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: alpha

On Sat, Sep 25, 2021 at 04:18:15PM -0700, Linus Torvalds wrote:
> This last one was the one I reacted to.
> 
> I don't think it's wrong (although please, use a more descriptive name
> than "V" for that asm macro shorthand), but it does strike me as
> somewhat special.

Ended up with the following (see updated #untested.alpha):

#define FR(n) n * 8 + TI_FP($8)
        .align  4
        .globl  __save_fpu
        .type   __save_fpu, @function
__save_fpu:
#define V(n) stt        $f##n, FR(n)
        V( 0); V( 1); V( 2); V( 3)
        V( 4); V( 5); V( 6); V( 7)
        V( 8); V( 9); V(10); V(11)  
        V(12); V(13); V(14); V(15)
        V(16); V(17); V(18); V(19)
        V(20); V(21); V(22); V(23)
        V(24); V(25); V(26); V(27)
        mf_fpcr $f0             # get fpcr
        V(28); V(29); V(30)  
        stt     $f0, FR(31)     # save fpcr in slot of $f31
        ldt     $f0, FR(0)      # don't let "__save_fpu" change fp state.
        ret
#undef V
        .size   __save_fpu, .-__save_fpu
        .align  4
restore_fpu:
        bic     $2, TS_SAVED_FP, $2 
#define V(n) ldt        $f##n, FR(n)
        ldt     $f30, FR(31)    # get saved fpcr
        V( 0); V( 1); V( 2); V( 3)
        mt_fpcr $f30            # install saved fpcr
        V( 4); V( 5); V( 6); V( 7)
        V( 8); V( 9); V(10); V(11)
        V(12); V(13); V(14); V(15)
        V(16); V(17); V(18); V(19)
        V(20); V(21); V(22); V(23)
        V(24); V(25); V(26); V(27)
        V(28); V(29); V(30)
        stl $2, TI_STATUS($8)
        br restore_other
#undef V

More readable that way, I think.  FWIW, asm macros were leftovers
of an attempt to do something like save_regs 0,27 that would expand
into the series of stores; no hope to do that with C macros, but
gas(1) ones do allow kinda-sorta loops, so I hoped to get it done.
No luck - AFAICS, no way to force evaluation of expressions as part
of macro expansion there, so you end up with something like $f(1+1+1).
The gas documentation claims that .altmacro has something of that sort, and it might
be even possible to abuse a mix of C and asm macros to get there,
but it would be _way_ too opaque.  Not worth bothering, IMO, so
I ended up with the variant above...

> And if we do want to do this (I'm open to it, I just want to make sure
> it's tested), please just make those alpha_{read|write}_fp_reg()
> functions always do the save_fpu() thing, and then always just access
> the array.
> 
> IOW, something like
> 
>         preempt_disable();
>         save_fpu(current);
>         preempt_enable();
>         .. now access the array that is easy to index ..
> 
> and just remove the silly "switch (reg)" things that access the raw
> registers. We couldn't do that before, but with that save state area
> it's trivial and much cleaner.

Umm...  It is, but if you look at the callers... it's used only
by the math emulator, and usually you end up with reading a register
or two + writing one.  Full FPU save + full FPU load on return to
userland might be too much overhead on that codepath.  Not sure;
it certainly simplifies things, but I'd like to see that tested on
EV4 hardware - it's already slow as it is, and an extra couple of
cachelines worth of loads and stores on each floating point arithmetical
instruction...  Sure, we do that on each context switch there, but
the frequency of those is much lower.
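
FWIW, the shape being suggested would look something like this - a rough,
untested sketch, assuming the fp[] save area in thread_info and the
save_fpu() helper added by this patch:

	/* sketch: math-emu register read going through the save area */
	unsigned long alpha_read_fp_reg(unsigned long reg)
	{
		unsigned long val;

		if (reg >= 32)
			return 0;
		preempt_disable();
		save_fpu();	/* spill live FP state into current_thread_info()->fp[] */
		val = current_thread_info()->fp[reg];
		preempt_enable();
		return val;
	}

The cost in question is that unconditional save_fpu(): once TS_SAVED_FP is
set, the whole register set also gets reloaded on the way back to userland,
which is a lot of extra memory traffic per emulated FP instruction on
something like EV4.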

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-09-25 19:07     ` Linus Torvalds
  2021-09-25 20:43       ` Al Viro
@ 2021-09-26  9:08       ` John Paul Adrian Glaubitz
  1 sibling, 0 replies; 29+ messages in thread
From: John Paul Adrian Glaubitz @ 2021-09-26  9:08 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro; +Cc: alpha, Michael Cree, Matt Turner

Hi Linus!

On 9/25/21 21:07, Linus Torvalds wrote:
> On Fri, Sep 24, 2021 at 7:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>         On each context switch we save the FPU registers on stack
>> of old process and restore FPU registers from the stack of new one.
>> That allows us to avoid doing that each time we enter/leave the
>> kernel mode; however, that can get suboptimal in some cases.
> 
> Do you actually have a system or virtual image to test this all out on?

I have a system for testing and I can also create a QEMU VM image for testing
using the Debian Ports installation ISO for Alpha [1].

I assume Matt and Michael (CC'ed) would be able to test these improvements
as well as they're working on Alpha support in the kernel.

> I'm not saying this doesn't look like an improvement, I'm more
> questioning whether it's worth it...

Since we're still maintaining Alpha ports in Debian and Gentoo, it should
be worth it, I would say.

Adrian

> [1] https://cdimage.debian.org/cdimage/ports/snapshots/2021-09-23/

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaubitz@debian.org
`. `'   Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-09-25 20:43       ` Al Viro
  2021-09-25 23:18         ` Linus Torvalds
@ 2021-10-30 20:46         ` Al Viro
  2021-10-30 20:46           ` Al Viro
  2021-10-30 21:25           ` Maciej W. Rozycki
  1 sibling, 2 replies; 29+ messages in thread
From: Al Viro @ 2021-10-30 20:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: alpha

[-- Attachment #1: Type: text/plain, Size: 5424 bytes --]

On Sat, Sep 25, 2021 at 08:43:01PM +0000, Al Viro wrote:
> On Sat, Sep 25, 2021 at 12:07:17PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 24, 2021 at 7:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > >         On each context switch we save the FPU registers on stack
> > > of old process and restore FPU registers from the stack of new one.
> > > That allows us to avoid doing that each time we enter/leave the
> > > kernel mode; however, that can get suboptimal in some cases.
> > 
> > Do you actually have a system or virtual image to test this all out on?
> > 
> > I'm not saying this doesn't look like an improvement, I'm more
> > questioning whether it's worth it...
> 
> Umm...  Bootable AS200 (EV45), bootable DS10 (EV6), theoretically
> resurrectable UP1000 (EV67, fans on CPU module are in horrible state
> and southbridge is unreliable, so the life is more interesting than
> it's worth), working qemu-system-alpha (EV67).  No SMP boxen and
> I've no idea if qemu can do SMP alpha these days...
> 
> Whether it's worth it... beginning of the series or this one?  If it's about
> the former, the stuff in the series is pretty straightforward bug fixes and
> equally straightforward cleanups.  If it's the latter... hell knows;
> it would be tempting to see if we could
> 	* make FPU saves/restores lazy, evicting that stuff from switch_stack
> 	* add r9..r15 to pt_regs, saving on each kernel entry and restoring
> if we have a flag set (note that entMM() and entUnaUser() already save/restore
> those - unconditionally).  That would've killed the need to play with
> switch_stack in straced syscalls/do_signal/etc.  switch_stack (trimmed down
> to r9..r15,r26 - the callee-saved registers) would be used by switch_to(),
> but that would be it.
> 	* take the entire ret_from_syscall et.al. out into C side of things.
> 
> This patch is basically "let's see how awful FPU-related part would be"
> experiment.

OK, here's what I've got:

1) lazy FPU part has a braino in it; __save_fpu() in alpha_fork() et.al. should
be called *after* do_switch_stack(), not before it.  Another (minor) problem is
that use of jsr for calls to functions in the same object file is stupid -
should be bsr instead.  Not a bug, per se, but it's clearly suboptimal.  Both
fixes folded.

2) resulting branch rebased on top of -rc3 and tested (glibc build, with its
testsuite to hit the floating point hard enough).  Rebase due to posix cpu
timers regression that got fixed in -rc3 (breaks gmon/* tests in glibc).
Branch is vfs.git #work.alpha-lazy_FPU (or #work.alpha, for the same sans the
last commit).

3) there's an oddity in qemu - handling of SQRTT/SU et.al. matches the hardware
manual, but not the actual hardware.  Details: on the real hw the absence of
/I modifier suppresses the inexact exception, but it does *not* suppress
setting FPCR.INE flag.  IOW, when trying to calculate sqrt(2),
SQRTT/SUI on EV6:	exception raised (when not disabled), FPCR.INE set
SQRTT/SUI on qemu:	exception raised (when not disabled), FPCR.INE set
SQRTT/SU on EV6:	no exception, FPCR.INE set
SQRTT/SU on qemu:	no exception, FPCR.INE not set.
Behaviour of qemu matches the alpha hardware manual, but
	* it does not match the real hardware (at least EV6)
	* it does not match the expectation of gcc and glibc
	* it makes less sense than what the real hw does.
I don't know what EV67 and later variants are doing; the only EV67 box I have
would be a royal PITA to resurrect (radiator and fans in CPU module are clogged,
so it overheats very fast, and southbridge on the motherboard is flaky).
They might've changed the hardware behaviour to match the manual in later
variants, but I think it's unlikely - we would've seen glibc builds failing
very loudly on such boxen.  It's not just sqrt - all IEEE floating point insns
behave that way, so one gets hundreds of tests failing.

I've done a trivial patch to qemu (attached) getting it to match the EV6
behaviour in that area; with that applied glibc tests pass as on the real hw.
That's completely orthogonal to the kernel patches in this series - behaviour
is identical for patched kernel and for mainline (as well as debian-ports
5.14.0-3-alpha-generic kernel image).

4) qemu in buster (1:3.1+dfsg-8+deb10u8) apparently has a bug in x86 backend,
fixed at some point in their git tree.  Coredump on debian-ports netinst
alpha image, just before it starts to install actual packages.  Host coredump,
that is.  Might be worth bisecting at some point; that kind of crap is very
likely to mean guest-to-host escalation...  Their current git tree works fine,
so I'd been using that for testing.

5) qemu -smp 4 breaks; csum mismatches on ext4 iget are the first visible
symptoms in logs.  Hell knows how to debug that; I suspected ldl_l/stl_c
breakage, but there's nothing obvious on the inspection.  Might be palcode
image, might be anything.  Ought to look into that someday; not now, though...
UP seems to work.

	I would really appreciate
* more testing on real hw (most of mine had been on qemu); the branch in
question is vfs.git #work.alpha-lazy_FPU.  Everything but the last commit
is identical to posted upthread; the last commit has a couple of fixes
folded in (in followup)

* somebody with EV67 or later taking a look at the behaviour of FPCR.INE on
SQRTT/SU et.al.; compile the attached sqrtt.c with -mcpu=ev67 -lm, run it and
compare with the attached expected output (res-ev6).

[-- Attachment #2: sqrtt.c --]
[-- Type: text/plain, Size: 2403 bytes --]

#include <stdio.h>
#include <signal.h>
#define __USE_GNU
#include <fenv.h>
#include <float.h>
#include <setjmp.h>

double v;
long r;

void dump(char *s)
{
	unsigned long tmp, ret;
	static char buf[25];
	static char *names[] = {"IOV", "INE", "UNF", "OVF", "DZE", "INV"};
	int i;

	__asm__ __volatile__ (
		"stt $f0,%0\n\t"
		"trapb\n\t"
		"mf_fpcr $f0\n\t"
		"trapb\n\t"
		"stt $f0,%1\n\t"
		"ldt $f0,%0"
		: "=m"(tmp), "=m"(ret));
	for (i = 0; i < 6; i++)
		sprintf(buf + 4 * i,
			" %s", (ret >> (57-i)) & 1 ? names[i] : "   ");
	printf("%s%s %08x %lx\n", s, buf, fetestexcept(FE_ALL_EXCEPT), r);
}

void __attribute__((noinline)) sqrtt(void)
{
	asm __volatile__(
		"ldt		$f10,%1\n\t"
		"sqrtt		$f10,$f11\n\t"
		"trapb\n\t"
		"stt		$f11,%0\n\t"
		: "=m"(r) :"m"(v));
}

void __attribute__((noinline)) sqrtt_u(void)
{
	asm __volatile__(
		"ldt		$f10,%1\n\t"
		"sqrtt/u	$f10,$f11\n\t"
		"trapb\n\t"
		"stt		$f11,%0\n\t"
		: "=m"(r) :"m"(v));
}

void __attribute__((noinline)) sqrtt_su(void)
{
	asm __volatile__(
		"ldt		$f10,%1\n\t"
		"sqrtt/su	$f10,$f11\n\t"
		"trapb\n\t"
		"stt		$f11,%0\n\t"
		: "=m"(r) :"m"(v));
}

void __attribute__((noinline)) sqrtt_sui(void)
{
	asm __volatile__(
		"ldt		$f10,%1\n\t"
		"sqrtt/sui	$f10,$f11\n\t"
		"trapb\n\t"
		"stt		$f11,%0\n\t"
		: "=m"(r) :"m"(v));
}

static sigjmp_buf buf;

void handler(int n)
{
	dump("SIGFPE ");
	siglongjmp(buf, 0);
}

void try(void (*f)(void))
{
	printf("disabled:");
	fedisableexcept(FE_ALL_EXCEPT);
	feclearexcept(FE_ALL_EXCEPT);
	if (!sigsetjmp(buf, 1)) {
		f();
		dump("       ");
	}
	printf("enabled: ");
	feenableexcept(FE_ALL_EXCEPT);
	feclearexcept(FE_ALL_EXCEPT);
	if (!sigsetjmp(buf, 1)) {
		f();
		dump("       ");
	}
}

void __for_value(double x, char *s, char *str)
{
	v = x;
	printf("%s - %s\n", s, str);
	printf("%s:\n", s);
	try(sqrtt);
	printf("%s/U:\n", s);
	try(sqrtt_u);
	printf("%s/SU:\n", s);
	try(sqrtt_su);
	printf("%s/SUI:\n", s);
	try(sqrtt_sui);
}

#define for_value(v, s) __for_value(v, s, #v)

int main()
{
	volatile union {
		unsigned long l;
		double d;
	} x;
	signal(SIGFPE, handler);
	for_value(4, "normal");
	for_value(1.5, "inexact");
	for_value(-1, "neg");
	for_value(DBL_MIN/2, "denorm");
	for_value(-DBL_MIN/2, "-denorm");
	x.l = 0x7ff0000000000000ULL;
	for_value(x.d, "inf");
	x.l = 0x7ff8000000000000ULL;
	for_value(x.d, "nan");
	x.l = 0x7ff4000000000000ULL;
	for_value(x.d, "snan");
	return 0;
}

[-- Attachment #3: res-ev6 --]
[-- Type: text/plain, Size: 4685 bytes --]

normal - 4
normal:
disabled:                                00000000 4000000000000000
enabled:                                 00000000 4000000000000000
normal/U:
disabled:                                00000000 4000000000000000
enabled:                                 00000000 4000000000000000
normal/SU:
disabled:                                00000000 4000000000000000
enabled:                                 00000000 4000000000000000
normal/SUI:
disabled:                                00000000 4000000000000000
enabled:                                 00000000 4000000000000000
inexact - 1.5
inexact:
disabled:            INE                 00200000 3ff3988e1409212e
enabled:             INE                 00200000 3ff3988e1409212e
inexact/U:
disabled:            INE                 00200000 3ff3988e1409212e
enabled:             INE                 00200000 3ff3988e1409212e
inexact/SU:
disabled:            INE                 00200000 3ff3988e1409212e
enabled:             INE                 00200000 3ff3988e1409212e
inexact/SUI:
disabled:            INE                 00200000 3ff3988e1409212e
enabled: SIGFPE      INE                 00200000 3ff3988e1409212e
neg - -1
neg:
disabled:SIGFPE                      INV 00020000 3ff3988e1409212e
enabled: SIGFPE                      INV 00020000 3ff3988e1409212e
neg/U:
disabled:SIGFPE                      INV 00020000 3ff3988e1409212e
enabled: SIGFPE                      INV 00020000 3ff3988e1409212e
neg/SU:
disabled:                            INV 00020000 fff8000000000000
enabled: SIGFPE                      INV 00020000 fff8000000000000
neg/SUI:
disabled:                            INV 00020000 fff8000000000000
enabled: SIGFPE                      INV 00020000 fff8000000000000
denorm - DBL_MIN/2
denorm:
disabled:SIGFPE                          00000000 fff8000000000000
enabled: SIGFPE                          00000000 fff8000000000000
denorm/U:
disabled:SIGFPE                          00000000 fff8000000000000
enabled: SIGFPE                          00000000 fff8000000000000
denorm/SU:
disabled:        IOV INE                 00600000 1ff6a09e667f3bcd
enabled: SIGFPE  IOV INE                 00600000 1ff6a09e667f3bcd
denorm/SUI:
disabled:        IOV INE                 00600000 1ff6a09e667f3bcd
enabled: SIGFPE  IOV INE                 00600000 1ff6a09e667f3bcd
-denorm - -DBL_MIN/2
-denorm:
disabled:SIGFPE                          00000000 1ff6a09e667f3bcd
enabled: SIGFPE                          00000000 1ff6a09e667f3bcd
-denorm/U:
disabled:SIGFPE                          00000000 1ff6a09e667f3bcd
enabled: SIGFPE                          00000000 1ff6a09e667f3bcd
-denorm/SU:
disabled:        IOV                 INV 00420000 fff8000000000000
enabled: SIGFPE  IOV                 INV 00420000 fff8000000000000
-denorm/SUI:
disabled:        IOV                 INV 00420000 fff8000000000000
enabled: SIGFPE  IOV                 INV 00420000 fff8000000000000
inf - x.d
inf:
disabled:SIGFPE                      INV 00020000 fff8000000000000
enabled: SIGFPE                      INV 00020000 fff8000000000000
inf/U:
disabled:SIGFPE                      INV 00020000 fff8000000000000
enabled: SIGFPE                      INV 00020000 fff8000000000000
inf/SU:
disabled:                                00000000 7ff0000000000000
enabled:                                 00000000 7ff0000000000000
inf/SUI:
disabled:                                00000000 7ff0000000000000
enabled:                                 00000000 7ff0000000000000
nan - x.d
nan:
disabled:SIGFPE                      INV 00020000 7ff0000000000000
enabled: SIGFPE                      INV 00020000 7ff0000000000000
nan/U:
disabled:SIGFPE                      INV 00020000 7ff0000000000000
enabled: SIGFPE                      INV 00020000 7ff0000000000000
nan/SU:
disabled:                                00000000 7ff8000000000000
enabled:                                 00000000 7ff8000000000000
nan/SUI:
disabled:                                00000000 7ff8000000000000
enabled:                                 00000000 7ff8000000000000
snan - x.d
snan:
disabled:SIGFPE                      INV 00020000 7ff8000000000000
enabled: SIGFPE                      INV 00020000 7ff8000000000000
snan/U:
disabled:SIGFPE                      INV 00020000 7ff8000000000000
enabled: SIGFPE                      INV 00020000 7ff8000000000000
snan/SU:
disabled:                            INV 00020000 7ffc000000000000
enabled: SIGFPE                      INV 00020000 7ffc000000000000
snan/SUI:
disabled:                            INV 00020000 7ffc000000000000
enabled: SIGFPE                      INV 00020000 7ffc000000000000

[-- Attachment #4: patch-alpha-qemu --]
[-- Type: text/plain, Size: 1280 bytes --]

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 6e769f990c..7cd061bee5 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -220,7 +220,7 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64)
  * the use of hardfloat, since hardfloat relies on the inexact flag being
  * already set.
  */
-#if defined(TARGET_PPC) || defined(__FAST_MATH__)
+#if defined(TARGET_PPC) || defined(TARGET_ALPHA) || defined(__FAST_MATH__)
 # if defined(__FAST_MATH__)
 #  warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \
     IEEE implementation
diff --git a/target/alpha/fpu_helper.c b/target/alpha/fpu_helper.c
index 3ff8bb456d..083b805b1e 100644
--- a/target/alpha/fpu_helper.c
+++ b/target/alpha/fpu_helper.c
@@ -87,9 +87,12 @@ void helper_fp_exc_raise(CPUAlphaState *env, uint32_t ignore, uint32_t regno)
 /* Raise exceptions for ieee fp insns with software completion.  */
 void helper_fp_exc_raise_s(CPUAlphaState *env, uint32_t ignore, uint32_t regno)
 {
-    uint32_t exc = env->error_code & ~ignore;
+    uint32_t exc = env->error_code;
+    if (!exc)
+	return;
+    env->fpcr |= exc;
+    exc &= ~ignore;
     if (exc) {
-        env->fpcr |= exc;
         exc &= env->fpcr_exc_enable;
         /*
          * In system mode, the software handler gets invoked

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-10-30 20:46         ` Al Viro
@ 2021-10-30 20:46           ` Al Viro
  2021-10-30 21:25           ` Maciej W. Rozycki
  1 sibling, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-10-30 20:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: alpha

	On each context switch we save the FPU registers on stack
of old process and restore FPU registers from the stack of new one.
That allows us to avoid doing that each time we enter/leave the
kernel mode; however, that can get suboptimal in some cases.

	For one thing, we don't need to bother saving anything
for kernel threads.  For another, if between entering and leaving
the kernel a thread gives CPU up more than once, it will do
useless work, saving the same values every time, only to discard
the saved copy as soon as it returns from switch_to().

	Alternative solution:

* move the array we save into from switch_stack to thread_info
* have a (thread-synchronous) flag set when we save them
* do *NOT* save/restore them in do_switch_stack()/undo_switch_stack().
* restore on the exit to user mode (and clear the flag) if the flag had
been set.
* on context switch, entry to fork()/clone()/vfork() and on entry into
straced syscall save (and set the flag) if the flag had not been set.
* have copy_thread() set the flag for child, so they would be restored
once the child returns to userland.
* save (again, conditionally and setting the flag) before do_signal(),
use the saved data in setup_sigcontext()
* have restore_sigcontext() set the flag and copy from sigframe to
save area.
* teach ptrace to look for FPU registers in thread_info instead of
switch_stack.
* teach isolated accesses to FPU registers (rdfpcr, wrfpcr, etc.)
to check the flag (under preempt_disable()) and work with the save area
if it's been set.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/include/asm/fpu.h         |  60 +++++++++-------
 arch/alpha/include/asm/switch_to.h   |   1 +
 arch/alpha/include/asm/thread_info.h |  15 ++++
 arch/alpha/include/uapi/asm/ptrace.h |   2 +
 arch/alpha/kernel/asm-offsets.c      |   2 +
 arch/alpha/kernel/entry.S            | 133 ++++++++++++++++-------------------
 arch/alpha/kernel/process.c          |   5 +-
 arch/alpha/kernel/ptrace.c           |  18 ++---
 arch/alpha/kernel/signal.c           |  20 +++---
 arch/alpha/lib/fpreg.c               |  41 +++++++++--
 10 files changed, 178 insertions(+), 119 deletions(-)

diff --git a/arch/alpha/include/asm/fpu.h b/arch/alpha/include/asm/fpu.h
index b9691405e56b3..4de001bf2811a 100644
--- a/arch/alpha/include/asm/fpu.h
+++ b/arch/alpha/include/asm/fpu.h
@@ -15,21 +15,27 @@ rdfpcr(void)
 {
 	unsigned long tmp, ret;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		ret = current_thread_info()->fp[31];
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"ftoit $f0,%1\n\t"
-		"itoft %0,$f0"
-		: "=r"(tmp), "=r"(ret));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"ftoit $f0,%1\n\t"
+			"itoft %0,$f0"
+			: "=r"(tmp), "=r"(ret));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"stt $f0,%1\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp), "=m"(ret));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"stt $f0,%1\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp), "=m"(ret));
 #endif
+	}
+	preempt_enable();
 
 	return ret;
 }
@@ -39,21 +45,27 @@ wrfpcr(unsigned long val)
 {
 	unsigned long tmp;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		current_thread_info()->fp[31] = val;
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"itoft %1,$f0\n\t"
-		"mt_fpcr $f0\n\t"
-		"itoft %0,$f0"
-		: "=&r"(tmp) : "r"(val));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"itoft %1,$f0\n\t"
+			"mt_fpcr $f0\n\t"
+			"itoft %0,$f0"
+			: "=&r"(tmp) : "r"(val));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"ldt $f0,%1\n\t"
-		"mt_fpcr $f0\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp) : "m"(val));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"ldt $f0,%1\n\t"
+			"mt_fpcr $f0\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp) : "m"(val));
 #endif
+	}
+	preempt_enable();
 }
 
 static inline unsigned long
diff --git a/arch/alpha/include/asm/switch_to.h b/arch/alpha/include/asm/switch_to.h
index 762b7f975310c..32863581a2975 100644
--- a/arch/alpha/include/asm/switch_to.h
+++ b/arch/alpha/include/asm/switch_to.h
@@ -8,6 +8,7 @@ extern struct task_struct *alpha_switch_to(unsigned long, struct task_struct *);
 
 #define switch_to(P,N,L)						 \
   do {									 \
+    save_fpu();								 \
     (L) = alpha_switch_to(virt_to_phys(&task_thread_info(N)->pcb), (P)); \
     check_mmu_context();						 \
   } while (0)
diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 9b99fece40af9..58faec89cc881 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -27,6 +27,7 @@ struct thread_info {
 	int bpt_nsaved;
 	unsigned long bpt_addr[2];		/* breakpoint handling  */
 	unsigned int bpt_insn[2];
+	unsigned long fp[32];
 };
 
 /*
@@ -83,6 +84,8 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TS_UAC_NOFIX		0x0002	/* ! flags as they match          */
 #define TS_UAC_SIGBUS		0x0004	/* ! userspace part of 'osf_sysinfo' */
 
+#define TS_SAVED_FP		0x0008
+
 #define SET_UNALIGN_CTL(task,value)	({				\
 	__u32 status = task_thread_info(task)->status & ~UAC_BITMASK;	\
 	if (value & PR_UNALIGN_NOPRINT)					\
@@ -106,5 +109,17 @@ register struct thread_info *__current_thread_info __asm__("$8");
 	put_user(res, (int __user *)(value));				\
 	})
 
+#ifndef __ASSEMBLY__
+extern void __save_fpu(void);
+
+static inline void save_fpu(void)
+{
+	if (!(current_thread_info()->status & TS_SAVED_FP)) {
+		current_thread_info()->status |= TS_SAVED_FP;
+		__save_fpu();
+	}
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ALPHA_THREAD_INFO_H */
diff --git a/arch/alpha/include/uapi/asm/ptrace.h b/arch/alpha/include/uapi/asm/ptrace.h
index c29194181025f..5ca45934fcbb8 100644
--- a/arch/alpha/include/uapi/asm/ptrace.h
+++ b/arch/alpha/include/uapi/asm/ptrace.h
@@ -64,7 +64,9 @@ struct switch_stack {
 	unsigned long r14;
 	unsigned long r15;
 	unsigned long r26;
+#ifndef __KERNEL__
 	unsigned long fp[32];	/* fp[31] is fpcr */
+#endif
 };
 
 
diff --git a/arch/alpha/kernel/asm-offsets.c b/arch/alpha/kernel/asm-offsets.c
index 2e125e5c1508c..b121294bee266 100644
--- a/arch/alpha/kernel/asm-offsets.c
+++ b/arch/alpha/kernel/asm-offsets.c
@@ -17,6 +17,8 @@ void foo(void)
 	DEFINE(TI_TASK, offsetof(struct thread_info, task));
 	DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
 	DEFINE(TI_CPU, offsetof(struct thread_info, cpu));
+	DEFINE(TI_FP, offsetof(struct thread_info, fp));
+	DEFINE(TI_STATUS, offsetof(struct thread_info, status));
 	BLANK();
 
         DEFINE(TASK_BLOCKED, offsetof(struct task_struct, blocked));
diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index a6207c47f0894..c0b04d5404852 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -17,7 +17,7 @@
 
 /* Stack offsets.  */
 #define SP_OFF			184
-#define SWITCH_STACK_SIZE	320
+#define SWITCH_STACK_SIZE	64
 
 .macro	CFI_START_OSF_FRAME	func
 	.align	4
@@ -159,7 +159,6 @@
 	.cfi_rel_offset	$13, 32
 	.cfi_rel_offset	$14, 40
 	.cfi_rel_offset	$15, 48
-	/* We don't really care about the FP registers for debugging.  */
 .endm
 
 .macro	UNDO_SWITCH_STACK
@@ -498,6 +497,10 @@ ret_to_user:
 	and	$17, _TIF_WORK_MASK, $2
 	bne	$2, work_pending
 restore_all:
+	ldl	$2, TI_STATUS($8)
+	and	$2, TS_SAVED_FP, $3
+	bne	$3, restore_fpu
+restore_other:
 	.cfi_remember_state
 	RESTORE_ALL
 	call_pal PAL_rti
@@ -506,7 +509,7 @@ ret_to_kernel:
 	.cfi_restore_state
 	lda	$16, 7
 	call_pal PAL_swpipl
-	br restore_all
+	br restore_other
 
 	.align 3
 $syscall_error:
@@ -570,6 +573,14 @@ $work_notifysig:
 	.type	strace, @function
 strace:
 	/* set up signal stack, call syscall_trace */
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	bsr	$26, __save_fpu
+1:
 	DO_SWITCH_STACK
 	jsr	$26, syscall_trace_enter /* returns the syscall number */
 	UNDO_SWITCH_STACK
@@ -649,40 +660,6 @@ do_switch_stack:
 	stq	$14, 40($sp)
 	stq	$15, 48($sp)
 	stq	$26, 56($sp)
-	stt	$f0, 64($sp)
-	stt	$f1, 72($sp)
-	stt	$f2, 80($sp)
-	stt	$f3, 88($sp)
-	stt	$f4, 96($sp)
-	stt	$f5, 104($sp)
-	stt	$f6, 112($sp)
-	stt	$f7, 120($sp)
-	stt	$f8, 128($sp)
-	stt	$f9, 136($sp)
-	stt	$f10, 144($sp)
-	stt	$f11, 152($sp)
-	stt	$f12, 160($sp)
-	stt	$f13, 168($sp)
-	stt	$f14, 176($sp)
-	stt	$f15, 184($sp)
-	stt	$f16, 192($sp)
-	stt	$f17, 200($sp)
-	stt	$f18, 208($sp)
-	stt	$f19, 216($sp)
-	stt	$f20, 224($sp)
-	stt	$f21, 232($sp)
-	stt	$f22, 240($sp)
-	stt	$f23, 248($sp)
-	stt	$f24, 256($sp)
-	stt	$f25, 264($sp)
-	stt	$f26, 272($sp)
-	stt	$f27, 280($sp)
-	mf_fpcr	$f0		# get fpcr
-	stt	$f28, 288($sp)
-	stt	$f29, 296($sp)
-	stt	$f30, 304($sp)
-	stt	$f0, 312($sp)	# save fpcr in slot of $f31
-	ldt	$f0, 64($sp)	# dont let "do_switch_stack" change fp state.
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	do_switch_stack, .-do_switch_stack
@@ -701,48 +678,54 @@ undo_switch_stack:
 	ldq	$14, 40($sp)
 	ldq	$15, 48($sp)
 	ldq	$26, 56($sp)
-	ldt	$f30, 312($sp)	# get saved fpcr
-	ldt	$f0, 64($sp)
-	ldt	$f1, 72($sp)
-	ldt	$f2, 80($sp)
-	ldt	$f3, 88($sp)
-	mt_fpcr	$f30		# install saved fpcr
-	ldt	$f4, 96($sp)
-	ldt	$f5, 104($sp)
-	ldt	$f6, 112($sp)
-	ldt	$f7, 120($sp)
-	ldt	$f8, 128($sp)
-	ldt	$f9, 136($sp)
-	ldt	$f10, 144($sp)
-	ldt	$f11, 152($sp)
-	ldt	$f12, 160($sp)
-	ldt	$f13, 168($sp)
-	ldt	$f14, 176($sp)
-	ldt	$f15, 184($sp)
-	ldt	$f16, 192($sp)
-	ldt	$f17, 200($sp)
-	ldt	$f18, 208($sp)
-	ldt	$f19, 216($sp)
-	ldt	$f20, 224($sp)
-	ldt	$f21, 232($sp)
-	ldt	$f22, 240($sp)
-	ldt	$f23, 248($sp)
-	ldt	$f24, 256($sp)
-	ldt	$f25, 264($sp)
-	ldt	$f26, 272($sp)
-	ldt	$f27, 280($sp)
-	ldt	$f28, 288($sp)
-	ldt	$f29, 296($sp)
-	ldt	$f30, 304($sp)
 	lda	$sp, SWITCH_STACK_SIZE($sp)
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	undo_switch_stack, .-undo_switch_stack
+
+#define FR(n) n * 8 + TI_FP($8)
+	.align	4
+	.globl	__save_fpu
+	.type	__save_fpu, @function
+__save_fpu:
+#define V(n) stt	$f##n, FR(n)
+	V( 0); V( 1); V( 2); V( 3)
+	V( 4); V( 5); V( 6); V( 7)
+	V( 8); V( 9); V(10); V(11)
+	V(12); V(13); V(14); V(15)
+	V(16); V(17); V(18); V(19)
+	V(20); V(21); V(22); V(23)
+	V(24); V(25); V(26); V(27)
+	mf_fpcr	$f0		# get fpcr
+	V(28); V(29); V(30)
+	stt	$f0, FR(31)	# save fpcr in slot of $f31
+	ldt	$f0, FR(0)	# don't let "__save_fpu" change fp state.
+	ret
+#undef V
+	.size	__save_fpu, .-__save_fpu
+
+	.align	4
+restore_fpu:
+	bic	$2, TS_SAVED_FP, $2
+#define V(n) ldt	$f##n, FR(n)
+	ldt	$f30, FR(31)	# get saved fpcr
+	V( 0); V( 1); V( 2); V( 3)
+	mt_fpcr	$f30		# install saved fpcr
+	V( 4); V( 5); V( 6); V( 7)
+	V( 8); V( 9); V(10); V(11)
+	V(12); V(13); V(14); V(15)
+	V(16); V(17); V(18); V(19)
+	V(20); V(21); V(22); V(23)
+	V(24); V(25); V(26); V(27)
+	V(28); V(29); V(30)
+	stl $2, TI_STATUS($8)
+	br restore_other
+#undef V
+
 \f
 /*
  * The meat of the context switch code.
  */
-
 	.align	4
 	.globl	alpha_switch_to
 	.type	alpha_switch_to, @function
@@ -799,6 +782,14 @@ ret_from_kernel_thread:
 alpha_\name:
 	.prologue 0
 	bsr	$1, do_switch_stack
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	bsr	$26, __save_fpu
+1:
 	jsr	$26, sys_\name
 	ldq	$26, 56($sp)
 	lda	$sp, SWITCH_STACK_SIZE($sp)
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index a5123ea426ce5..e45df572d42cd 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -248,6 +248,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 	childstack = ((struct switch_stack *) childregs) - 1;
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
+	childti->status |= TS_SAVED_FP;
 
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		/* kernel thread */
@@ -257,6 +258,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 		childstack->r9 = usp;	/* function */
 		childstack->r10 = kthread_arg;
 		childregs->hae = alpha_mv.hae_cache;
+		memset(childti->fp, '\0', sizeof(childti->fp));
 		childti->pcb.usp = 0;
 		return 0;
 	}
@@ -340,8 +342,7 @@ EXPORT_SYMBOL(dump_elf_task);
 int
 dump_elf_task_fp(elf_fpreg_t *dest, struct task_struct *task)
 {
-	struct switch_stack *sw = (struct switch_stack *)task_pt_regs(task) - 1;
-	memcpy(dest, sw->fp, 32 * 8);
+	memcpy(dest, current_thread_info()->fp, 32 * 8);
 	return 1;
 }
 EXPORT_SYMBOL(dump_elf_task_fp);
diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index 8c43212ae38e6..1abb03c912d96 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -79,6 +79,8 @@ enum {
  (PAGE_SIZE*2 - sizeof(struct pt_regs) - sizeof(struct switch_stack) \
   + offsetof(struct switch_stack, reg))
 
+#define FP_REG(reg) (offsetof(struct thread_info, reg))
+
 static int regoff[] = {
 	PT_REG(	   r0), PT_REG(	   r1), PT_REG(	   r2), PT_REG(	  r3),
 	PT_REG(	   r4), PT_REG(	   r5), PT_REG(	   r6), PT_REG(	  r7),
@@ -88,14 +90,14 @@ static int regoff[] = {
 	PT_REG(	  r20), PT_REG(	  r21), PT_REG(	  r22), PT_REG(	 r23),
 	PT_REG(	  r24), PT_REG(	  r25), PT_REG(	  r26), PT_REG(	 r27),
 	PT_REG(	  r28), PT_REG(	   gp),		   -1,		   -1,
-	SW_REG(fp[ 0]), SW_REG(fp[ 1]), SW_REG(fp[ 2]), SW_REG(fp[ 3]),
-	SW_REG(fp[ 4]), SW_REG(fp[ 5]), SW_REG(fp[ 6]), SW_REG(fp[ 7]),
-	SW_REG(fp[ 8]), SW_REG(fp[ 9]), SW_REG(fp[10]), SW_REG(fp[11]),
-	SW_REG(fp[12]), SW_REG(fp[13]), SW_REG(fp[14]), SW_REG(fp[15]),
-	SW_REG(fp[16]), SW_REG(fp[17]), SW_REG(fp[18]), SW_REG(fp[19]),
-	SW_REG(fp[20]), SW_REG(fp[21]), SW_REG(fp[22]), SW_REG(fp[23]),
-	SW_REG(fp[24]), SW_REG(fp[25]), SW_REG(fp[26]), SW_REG(fp[27]),
-	SW_REG(fp[28]), SW_REG(fp[29]), SW_REG(fp[30]), SW_REG(fp[31]),
+	FP_REG(fp[ 0]), FP_REG(fp[ 1]), FP_REG(fp[ 2]), FP_REG(fp[ 3]),
+	FP_REG(fp[ 4]), FP_REG(fp[ 5]), FP_REG(fp[ 6]), FP_REG(fp[ 7]),
+	FP_REG(fp[ 8]), FP_REG(fp[ 9]), FP_REG(fp[10]), FP_REG(fp[11]),
+	FP_REG(fp[12]), FP_REG(fp[13]), FP_REG(fp[14]), FP_REG(fp[15]),
+	FP_REG(fp[16]), FP_REG(fp[17]), FP_REG(fp[18]), FP_REG(fp[19]),
+	FP_REG(fp[20]), FP_REG(fp[21]), FP_REG(fp[22]), FP_REG(fp[23]),
+	FP_REG(fp[24]), FP_REG(fp[25]), FP_REG(fp[26]), FP_REG(fp[27]),
+	FP_REG(fp[28]), FP_REG(fp[29]), FP_REG(fp[30]), FP_REG(fp[31]),
 	PT_REG(	   pc)
 };
 
diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c
index bc077babafab5..6968b3a2273f0 100644
--- a/arch/alpha/kernel/signal.c
+++ b/arch/alpha/kernel/signal.c
@@ -150,9 +150,10 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 {
 	unsigned long usp;
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = __get_user(regs->pc, &sc->sc_pc);
+	long err = __get_user(regs->pc, &sc->sc_pc);
 
 	current->restart_block.fn = do_no_restart_syscall;
+	current_thread_info()->status |= TS_SAVED_FP;
 
 	sw->r26 = (unsigned long) ret_from_sys_call;
 
@@ -189,9 +190,9 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 	err |= __get_user(usp, sc->sc_regs+30);
 	wrusp(usp);
 
-	for (i = 0; i < 31; i++)
-		err |= __get_user(sw->fp[i], sc->sc_fpregs+i);
-	err |= __get_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __copy_from_user(current_thread_info()->fp,
+				sc->sc_fpregs, 31 * 8);
+	err |= __get_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	return err;
 }
@@ -272,7 +273,7 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 		 unsigned long mask, unsigned long sp)
 {
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = 0;
+	long err = 0;
 
 	err |= __put_user(on_sig_stack((unsigned long)sc), &sc->sc_onstack);
 	err |= __put_user(mask, &sc->sc_mask);
@@ -312,10 +313,10 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 	err |= __put_user(sp, sc->sc_regs+30);
 	err |= __put_user(0, sc->sc_regs+31);
 
-	for (i = 0; i < 31; i++)
-		err |= __put_user(sw->fp[i], sc->sc_fpregs+i);
+	err |= __copy_to_user(sc->sc_fpregs,
+			      current_thread_info()->fp, 31 * 8);
 	err |= __put_user(0, sc->sc_fpregs+31);
-	err |= __put_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __put_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	err |= __put_user(regs->trap_a0, &sc->sc_traparg_a0);
 	err |= __put_user(regs->trap_a1, &sc->sc_traparg_a1);
@@ -528,6 +529,9 @@ do_work_pending(struct pt_regs *regs, unsigned long thread_flags,
 		} else {
 			local_irq_enable();
 			if (thread_flags & (_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL)) {
+				preempt_disable();
+				save_fpu();
+				preempt_enable();
 				do_signal(regs, r0, r19);
 				r0 = 0;
 			} else {
diff --git a/arch/alpha/lib/fpreg.c b/arch/alpha/lib/fpreg.c
index 34fea465645ba..41830c95fd8bc 100644
--- a/arch/alpha/lib/fpreg.c
+++ b/arch/alpha/lib/fpreg.c
@@ -7,6 +7,8 @@
 
 #include <linux/compiler.h>
 #include <linux/export.h>
+#include <linux/preempt.h>
+#include <asm/thread_info.h>
 
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
 #define STT(reg,val)  asm volatile ("ftoit $f"#reg",%0" : "=r"(val));
@@ -19,7 +21,12 @@ alpha_read_fp_reg (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP)
+		val = current_thread_info()->fp[reg];
+	else switch (reg) {
 	      case  0: STT( 0, val); break;
 	      case  1: STT( 1, val); break;
 	      case  2: STT( 2, val); break;
@@ -52,8 +59,8 @@ alpha_read_fp_reg (unsigned long reg)
 	      case 29: STT(29, val); break;
 	      case 30: STT(30, val); break;
 	      case 31: STT(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg);
@@ -67,7 +74,13 @@ EXPORT_SYMBOL(alpha_read_fp_reg);
 void
 alpha_write_fp_reg (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP)
+		current_thread_info()->fp[reg] = val;
+	else switch (reg) {
 	      case  0: LDT( 0, val); break;
 	      case  1: LDT( 1, val); break;
 	      case  2: LDT( 2, val); break;
@@ -101,6 +114,7 @@ alpha_write_fp_reg (unsigned long reg, unsigned long val)
 	      case 30: LDT(30, val); break;
 	      case 31: LDT(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg);
 
@@ -115,7 +129,14 @@ alpha_read_fp_reg_s (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		LDT(0, current_thread_info()->fp[reg]);
+		STS(0, val);
+	} else switch (reg) {
 	      case  0: STS( 0, val); break;
 	      case  1: STS( 1, val); break;
 	      case  2: STS( 2, val); break;
@@ -148,8 +169,8 @@ alpha_read_fp_reg_s (unsigned long reg)
 	      case 29: STS(29, val); break;
 	      case 30: STS(30, val); break;
 	      case 31: STS(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg_s);
@@ -163,7 +184,14 @@ EXPORT_SYMBOL(alpha_read_fp_reg_s);
 void
 alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		LDS(0, val);
+		STT(0, current_thread_info()->fp[reg]);
+	} else switch (reg) {
 	      case  0: LDS( 0, val); break;
 	      case  1: LDS( 1, val); break;
 	      case  2: LDS( 2, val); break;
@@ -197,5 +225,6 @@ alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 	      case 30: LDS(30, val); break;
 	      case 31: LDS(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg_s);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-10-30 20:46         ` Al Viro
  2021-10-30 20:46           ` Al Viro
@ 2021-10-30 21:25           ` Maciej W. Rozycki
  2021-10-30 22:13             ` Al Viro
  1 sibling, 1 reply; 29+ messages in thread
From: Maciej W. Rozycki @ 2021-10-30 21:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, alpha

On Sat, 30 Oct 2021, Al Viro wrote:

> 1) lazy FPU part has a braino in it; __save_fpu() in alpha_fork() et.al. should
> be called *after* do_switch_stack(), not before it.  Another (minor) problem is
> that use of jsr for calls to functions in the same object file is stupid -
> should be bsr instead.  Not a bug, per se, but it's clearly suboptimal.  Both
> fixes folded.

 The linker is supposed to relax any eligible JSR to BSR (same with JMP vs 
BR) so it shouldn't really matter, and writing it down as JSR is surely 
more flexible as you don't have to track which caller/callee is where.

  Maciej

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] alpha: lazy FPU switching
  2021-10-30 21:25           ` Maciej W. Rozycki
@ 2021-10-30 22:13             ` Al Viro
  0 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2021-10-30 22:13 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Linus Torvalds, alpha

On Sat, Oct 30, 2021 at 10:25:34PM +0100, Maciej W. Rozycki wrote:
> On Sat, 30 Oct 2021, Al Viro wrote:
> 
> > 1) lazy FPU part has a braino in it; __save_fpu() in alpha_fork() et.al. should
> > be called *after* do_switch_stack(), not before it.  Another (minor) problem is
> > that use of jsr for calls to functions in the same object file is stupid -
> > should be bsr instead.  Not a bug, per se, but it's clearly suboptimal.  Both
> > fixes folded.
> 
>  The linker is supposed to relax any eligible JSR to BSR (same with JMP vs 
> BR) so it shouldn't really matter, and writing it down as JSR is surely 
> more flexible as you don't have to track which caller/callee is where.

All within arch/alpha/kernel/entry.S.  If that ever grows past 1M insns, we
have much worse problems...  Other callers are from C, so they all end up
with jsr, obviously.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCHES] alpha asm glue cleanups and fixes
  2021-09-25  2:54 [PATCHES] alpha asm glue cleanups and fixes Al Viro
  2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
  2021-09-25  2:59 ` [PATCHES] alpha asm glue cleanups and fixes Al Viro
@ 2022-09-02  1:48 ` Al Viro
  2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
  2 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:48 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

Resurrecting old series:

 	Fallout from asm glue review on alpha:

1) TIF_NOTIFY_SIGNAL support is broken; do_work_pending() handles
it, but the logics *calling* do_work_pending() ignores that flag
completely.  If it's called for other reasons - fine, but
TIF_NOTIFY_SIGNAL alone will not suffice for that.  Bug from the
last cycle.  5.11 bug.

2) _TIF_ALLWORK_MASK is junk - never had been used.
 
3) !AUDIT_SYSCALL configs have buggered logics for going into
straced syscall path.  Any thread flag (including TIF_SIGPENDING)
will suffice to send us there.  3.14 bug.
 
4) on straced syscalls we have force_successful_syscall_return() broken -
it ends up with a3 *not* set to 0.

5) on non-straced syscalls force_successful_syscall_return() handling is
suboptimal - it duplicates code from the normal syscall return path for
no good reason; instead of branching to the copy, it might branch to the
original just fine.

6) ret_from_fork could just as well go to ret_from_user - it's not going
to be hit when returning to kernel mode.

7) lazy FPU switching.  We save/restore all FPU registers a lot more than
we have to; the following reduces the amount quite a bit:
	* move the array we save into from switch_stack to thread_info
	* have a (thread-synchronous) flag set when we save them
	* have another flag set when they should be restored on return to
userland.
	* do *NOT* save/restore them in do_switch_stack()/undo_switch_stack().
	* restore on the exit to user mode if the restore flag had been set.
Clear both flags.
	* on context switch, entry to fork/clone/vfork, before entry into
do_signal() and on entry into straced syscall save the registers and set
the 'saved' flag unless it had been already set.
	* on context switch set the 'restore' flag as well.
	* have copy_thread() set both flags for child, so the registers would
be restored once the child returns to userland.
	* use the saved data in setup_sigcontext(); have restore_sigcontext()
set both flags and copy from sigframe to save area.
	* teach ptrace to look for FPU registers in thread_info instead of
switch_stack.
	* teach isolated accesses to FPU registers (rdfpcr, wrfpcr, etc.)
to check the 'saved' flag (under preempt_disable()) and work with the save area
if it's been set; if 'saved' flag is found upon write access, set 'restore' flag
as well.  NOTE: it's tempting to just force register saving in those - it
would simplify the code quite a bit.  Unfortunately, it would also force the
full FPU save/restore in situations where we really don't want the overhead
of that ;-/
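
For reference, the 'saved' flag check boils down to a helper along these
lines (sketch following the v1 posting upthread; where the 'restore' flag
gets set is as described in the list above):

	/* sketch: save FP state into thread_info at most once per kernel entry */
	static inline void save_fpu(void)
	{
		if (!(current_thread_info()->status & TS_SAVED_FP)) {
			current_thread_info()->status |= TS_SAVED_FP;
			__save_fpu();	/* asm helper: stt $f0..$f30 + fpcr into ->fp[] */
		}
	}

The exit path then checks the 'restore' flag and, if set, reloads the
registers from the same array and clears both flags.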

Tested on qemu and on real hw (older one - the only EV67 box I have is not
in good condition).  Seems to work; benefits depend upon the load.

Patchset lives in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
#next.alpha; individual patches in followups.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling
  2022-09-02  1:48 ` Al Viro
@ 2022-09-02  1:50   ` Al Viro
  2022-09-02  1:50     ` [PATCH v2 2/7] alpha: _TIF_ALLWORK_MASK is unused Al Viro
                       ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:50 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

it needs to be added to _TIF_WORK_MASK, or we might not reach
do_work_pending() in the first place...

Fixes: 5a9a8897c253a "alpha: add support for TIF_NOTIFY_SIGNAL"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/include/asm/thread_info.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index fdc485d7787a..084c27cb0c70 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -75,7 +75,7 @@ register struct thread_info *__current_thread_info __asm__("$8");
 
 /* Work to do on interrupt/exception return.  */
 #define _TIF_WORK_MASK		(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
-				 _TIF_NOTIFY_RESUME)
+				 _TIF_NOTIFY_RESUME | _TIF_NOTIFY_SIGNAL)
 
 /* Work to do on any return to userspace.  */
 #define _TIF_ALLWORK_MASK	(_TIF_WORK_MASK		\
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 2/7] alpha: _TIF_ALLWORK_MASK is unused
  2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
@ 2022-09-02  1:50     ` Al Viro
  2022-09-02  1:50     ` [PATCH v2 3/7] alpha: fix syscall entry in !AUDIT_SYSCALL case Al Viro
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:50 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

... and never had been used, actually

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/include/asm/thread_info.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 084c27cb0c70..082631465074 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -77,10 +77,6 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define _TIF_WORK_MASK		(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 				 _TIF_NOTIFY_RESUME | _TIF_NOTIFY_SIGNAL)
 
-/* Work to do on any return to userspace.  */
-#define _TIF_ALLWORK_MASK	(_TIF_WORK_MASK		\
-				 | _TIF_SYSCALL_TRACE)
-
 #define TS_UAC_NOPRINT		0x0001	/* ! Preserve the following three */
 #define TS_UAC_NOFIX		0x0002	/* ! flags as they match          */
 #define TS_UAC_SIGBUS		0x0004	/* ! userspace part of 'osf_sysinfo' */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 3/7] alpha: fix syscall entry in !AUDIT_SYSCALL case
  2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
  2022-09-02  1:50     ` [PATCH v2 2/7] alpha: _TIF_ALLWORK_MASK is unused Al Viro
@ 2022-09-02  1:50     ` Al Viro
  2022-09-02  1:50     ` [PATCH v2 4/7] alpha: fix handling of a3 on straced syscalls Al Viro
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:50 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

We only want to take the slow path if SYSCALL_TRACE or SYSCALL_AUDIT is
set; on !AUDIT_SYSCALL configs the current tree hits it whenever _any_
thread flag (including NEED_RESCHED, NOTIFY_SIGNAL, etc.) happens to
be set.
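
In C terms, the check the syscall entry path is supposed to make is simply
the following (sketch; TIF_SYSCALL_TRACE is bit 0 on alpha, which is what
lets the blbs in the patch below test it without building a mask):

	/* sketch: take the strace slow path only for tracing/audit work */
	static inline bool syscall_trace_needed(unsigned long ti_flags)
	{
		return (ti_flags & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT)) != 0;
	}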

Fixes: a9302e843944 "alpha: Enable system-call auditing support"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..c41a5a9c3b9f 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -469,8 +469,10 @@ entSys:
 #ifdef CONFIG_AUDITSYSCALL
 	lda     $6, _TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT
 	and     $3, $6, $3
-#endif
 	bne     $3, strace
+#else
+	blbs    $3, strace		/* check for SYSCALL_TRACE in disguise */
+#endif
 	beq	$4, 1f
 	ldq	$27, 0($5)
 1:	jsr	$26, ($27), sys_ni_syscall
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 4/7] alpha: fix handling of a3 on straced syscalls
  2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
  2022-09-02  1:50     ` [PATCH v2 2/7] alpha: _TIF_ALLWORK_MASK is unused Al Viro
  2022-09-02  1:50     ` [PATCH v2 3/7] alpha: fix syscall entry in !AUDIT_SYSCALL case Al Viro
@ 2022-09-02  1:50     ` Al Viro
  2022-09-02  1:50     ` [PATCH v2 5/7] alpha: syscall exit cleanup Al Viro
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:50 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

For a successful syscall that happens to return a negative value, we want
a3 set to 0, no matter whether it's straced or not.  As it is, in the
straced case we leave the value it used to have on syscall
entry.  Easily fixed, fortunately...
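
For context: alpha reports syscall errors to userland via a3 rather than
via a negative return value, so the libc side does roughly this (userspace
sketch, not kernel code; assumes <errno.h>):

	/* sketch: libc's handling of v0/a3 after the callsys PALcode call */
	static long syscall_return(long v0, long a3)
	{
		if (a3 != 0) {		/* kernel flagged an error */
			errno = v0;	/* v0 carries the errno value */
			return -1;
		}
		return v0;		/* success, even if v0 has the sign bit set */
	}

force_successful_syscall_return() exists exactly for syscalls whose
successful result can look negative; that only works if a3 really ends up
zero on the straced path as well.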

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index c41a5a9c3b9f..78fe7ee25425 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -600,8 +600,8 @@ ret_from_straced:
 
 	/* check return.. */
 	blt	$0, $strace_error	/* the call failed */
-	stq	$31, 72($sp)		/* a3=0 => no error */
 $strace_success:
+	stq	$31, 72($sp)		/* a3=0 => no error */
 	stq	$0, 0($sp)		/* save return value */
 
 	DO_SWITCH_STACK
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 5/7] alpha: syscall exit cleanup
  2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
                       ` (2 preceding siblings ...)
  2022-09-02  1:50     ` [PATCH v2 4/7] alpha: fix handling of a3 on straced syscalls Al Viro
@ 2022-09-02  1:50     ` Al Viro
  2022-09-02  1:50     ` [PATCH v2 6/7] alpha: ret_from_fork can go straight to ret_to_user Al Viro
  2022-09-02  1:50     ` [PATCH v2 7/7] alpha: lazy FPU switching Al Viro
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:50 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

$ret_success consists of two insns + a branch to ret_from_syscall.
The thing is, those insns are identical to the ones immediately
preceding ret_from_syscall...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index 78fe7ee25425..43380fbf600d 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -478,6 +478,7 @@ entSys:
 1:	jsr	$26, ($27), sys_ni_syscall
 	ldgp	$gp, 0($26)
 	blt	$0, $syscall_error	/* the call failed */
+$ret_success:
 	stq	$0, 0($sp)
 	stq	$31, 72($sp)		/* a3=0 => no error */
 
@@ -527,11 +528,6 @@ $syscall_error:
 	stq	$1, 72($sp)	/* a3 for return */
 	br	ret_from_sys_call
 
-$ret_success:
-	stq	$0, 0($sp)
-	stq	$31, 72($sp)	/* a3=0 => no error */
-	br	ret_from_sys_call
-
 /*
  * Do all cleanup when returning from all interrupts and system calls.
  *
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 6/7] alpha: ret_from_fork can go straight to ret_to_user
  2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
                       ` (3 preceding siblings ...)
  2022-09-02  1:50     ` [PATCH v2 5/7] alpha: syscall exit cleanup Al Viro
@ 2022-09-02  1:50     ` Al Viro
  2022-09-02  1:50     ` [PATCH v2 7/7] alpha: lazy FPU switching Al Viro
  5 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:50 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

We only hit ret_from_fork when the child is meant to return to
userland (since 2012 or so).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/kernel/entry.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index 43380fbf600d..a6207c47f089 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -766,7 +766,7 @@ alpha_switch_to:
 	.align	4
 	.ent	ret_from_fork
 ret_from_fork:
-	lda	$26, ret_from_sys_call
+	lda	$26, ret_to_user
 	mov	$17, $16
 	jmp	$31, schedule_tail
 .end ret_from_fork
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 7/7] alpha: lazy FPU switching
  2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
                       ` (4 preceding siblings ...)
  2022-09-02  1:50     ` [PATCH v2 6/7] alpha: ret_from_fork can go straight to ret_to_user Al Viro
@ 2022-09-02  1:50     ` Al Viro
  2022-09-02  4:24       ` Linus Torvalds
  5 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2022-09-02  1:50 UTC (permalink / raw)
  To: linux-alpha; +Cc: Linus Torvalds

	On each context switch we save the FPU registers on stack
of old process and restore FPU registers from the stack of new one.
That allows us to avoid doing that each time we enter/leave the
kernel mode; however, that can get suboptimal in some cases.

	For one thing, we don't need to bother saving anything
for kernel threads.  For another, if between entering and leaving
the kernel a thread gives CPU up more than once, it will do
useless work, saving the same values every time, only to discard
the saved copy as soon as it returns from switch_to().

	Alternative solution:

* move the array we save into from switch_stack to thread_info
* have a (thread-synchronous) flag set when we save them
* have another flag set when they should be restored on return to userland.
* do *NOT* save/restore them in do_switch_stack()/undo_switch_stack().
* restore on exit to user mode if the 'restore' flag has been set.  Clear
both flags.
* on context switch, on entry to fork/clone/vfork, before entry into do_signal()
and on entry into a straced syscall, save the registers and set the 'saved' flag
unless it has already been set.
* on context switch set the 'restore' flag as well.
* have copy_thread() set both flags for the child, so the registers will be
restored once the child returns to userland.
* use the saved data in setup_sigcontext(); have restore_sigcontext() set both
flags and copy from the sigframe to the save area.
* teach ptrace to look for FPU registers in thread_info instead of
switch_stack.
* teach isolated accesses to FPU registers (rdfpcr, wrfpcr, etc.) to check
the 'saved' flag (under preempt_disable()) and work with the save area if it
is set; on a write access with the 'saved' flag set, set the 'restore' flag
as well (see the sketch below).
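
	To make the flag protocol above easier to follow, here is a rough
C sketch of how an isolated FPU register access behaves under these rules.
It is illustrative only and not part of the patch: read_fp_reg_hw() and
write_fp_reg_hw() are hypothetical stand-ins for the real inline-asm
accessors; the actual code (rdfpcr/wrfpcr, alpha_read_fp_reg() and friends)
is in the diff below.

	/* illustrative sketch only -- the real code is in the patch below */
	static unsigned long fpu_read_reg_sketch(int reg)
	{
		unsigned long val;

		preempt_disable();
		if (current_thread_info()->status & TS_SAVED_FP)
			/* registers already dumped into thread_info */
			val = current_thread_info()->fp[reg];
		else
			val = read_fp_reg_hw(reg);  /* hypothetical: read the live register */
		preempt_enable();
		return val;
	}

	static void fpu_write_reg_sketch(int reg, unsigned long val)
	{
		preempt_disable();
		if (current_thread_info()->status & TS_SAVED_FP) {
			/* save area modified => reload it on return to userland */
			current_thread_info()->status |= TS_RESTORE_FP;
			current_thread_info()->fp[reg] = val;
		} else {
			write_fp_reg_hw(reg, val);  /* hypothetical: write the live register */
		}
		preempt_enable();
	}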

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/alpha/include/asm/fpu.h         |  61 +++++++-----
 arch/alpha/include/asm/thread_info.h |  16 +++
 arch/alpha/include/uapi/asm/ptrace.h |   2 +
 arch/alpha/kernel/asm-offsets.c      |   2 +
 arch/alpha/kernel/entry.S            | 144 ++++++++++++++-------------
 arch/alpha/kernel/process.c          |   5 +-
 arch/alpha/kernel/ptrace.c           |  18 ++--
 arch/alpha/kernel/signal.c           |  20 ++--
 arch/alpha/lib/fpreg.c               |  43 ++++++--
 9 files changed, 192 insertions(+), 119 deletions(-)

diff --git a/arch/alpha/include/asm/fpu.h b/arch/alpha/include/asm/fpu.h
index b9691405e56b..30b24135dd7a 100644
--- a/arch/alpha/include/asm/fpu.h
+++ b/arch/alpha/include/asm/fpu.h
@@ -15,21 +15,27 @@ rdfpcr(void)
 {
 	unsigned long tmp, ret;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		ret = current_thread_info()->fp[31];
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"ftoit $f0,%1\n\t"
-		"itoft %0,$f0"
-		: "=r"(tmp), "=r"(ret));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"ftoit $f0,%1\n\t"
+			"itoft %0,$f0"
+			: "=r"(tmp), "=r"(ret));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"mf_fpcr $f0\n\t"
-		"stt $f0,%1\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp), "=m"(ret));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"mf_fpcr $f0\n\t"
+			"stt $f0,%1\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp), "=m"(ret));
 #endif
+	}
+	preempt_enable();
 
 	return ret;
 }
@@ -39,21 +45,28 @@ wrfpcr(unsigned long val)
 {
 	unsigned long tmp;
 
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		current_thread_info()->status |= TS_RESTORE_FP;
+		current_thread_info()->fp[31] = val;
+	} else {
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
-	__asm__ __volatile__ (
-		"ftoit $f0,%0\n\t"
-		"itoft %1,$f0\n\t"
-		"mt_fpcr $f0\n\t"
-		"itoft %0,$f0"
-		: "=&r"(tmp) : "r"(val));
+		__asm__ __volatile__ (
+			"ftoit $f0,%0\n\t"
+			"itoft %1,$f0\n\t"
+			"mt_fpcr $f0\n\t"
+			"itoft %0,$f0"
+			: "=&r"(tmp) : "r"(val));
 #else
-	__asm__ __volatile__ (
-		"stt $f0,%0\n\t"
-		"ldt $f0,%1\n\t"
-		"mt_fpcr $f0\n\t"
-		"ldt $f0,%0"
-		: "=m"(tmp) : "m"(val));
+		__asm__ __volatile__ (
+			"stt $f0,%0\n\t"
+			"ldt $f0,%1\n\t"
+			"mt_fpcr $f0\n\t"
+			"ldt $f0,%0"
+			: "=m"(tmp) : "m"(val));
 #endif
+	}
+	preempt_enable();
 }
 
 static inline unsigned long
diff --git a/arch/alpha/include/asm/thread_info.h b/arch/alpha/include/asm/thread_info.h
index 082631465074..531855caa50b 100644
--- a/arch/alpha/include/asm/thread_info.h
+++ b/arch/alpha/include/asm/thread_info.h
@@ -26,6 +26,7 @@ struct thread_info {
 	int bpt_nsaved;
 	unsigned long bpt_addr[2];		/* breakpoint handling  */
 	unsigned int bpt_insn[2];
+	unsigned long fp[32];
 };
 
 /*
@@ -81,6 +82,9 @@ register struct thread_info *__current_thread_info __asm__("$8");
 #define TS_UAC_NOFIX		0x0002	/* ! flags as they match          */
 #define TS_UAC_SIGBUS		0x0004	/* ! userspace part of 'osf_sysinfo' */
 
+#define TS_SAVED_FP		0x0008
+#define TS_RESTORE_FP		0x0010
+
 #define SET_UNALIGN_CTL(task,value)	({				\
 	__u32 status = task_thread_info(task)->status & ~UAC_BITMASK;	\
 	if (value & PR_UNALIGN_NOPRINT)					\
@@ -104,5 +108,17 @@ register struct thread_info *__current_thread_info __asm__("$8");
 	put_user(res, (int __user *)(value));				\
 	})
 
+#ifndef __ASSEMBLY__
+extern void __save_fpu(void);
+
+static inline void save_fpu(void)
+{
+	if (!(current_thread_info()->status & TS_SAVED_FP)) {
+		current_thread_info()->status |= TS_SAVED_FP;
+		__save_fpu();
+	}
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ALPHA_THREAD_INFO_H */
diff --git a/arch/alpha/include/uapi/asm/ptrace.h b/arch/alpha/include/uapi/asm/ptrace.h
index c29194181025..5ca45934fcbb 100644
--- a/arch/alpha/include/uapi/asm/ptrace.h
+++ b/arch/alpha/include/uapi/asm/ptrace.h
@@ -64,7 +64,9 @@ struct switch_stack {
 	unsigned long r14;
 	unsigned long r15;
 	unsigned long r26;
+#ifndef __KERNEL__
 	unsigned long fp[32];	/* fp[31] is fpcr */
+#endif
 };
 
 
diff --git a/arch/alpha/kernel/asm-offsets.c b/arch/alpha/kernel/asm-offsets.c
index 2e125e5c1508..b121294bee26 100644
--- a/arch/alpha/kernel/asm-offsets.c
+++ b/arch/alpha/kernel/asm-offsets.c
@@ -17,6 +17,8 @@ void foo(void)
 	DEFINE(TI_TASK, offsetof(struct thread_info, task));
 	DEFINE(TI_FLAGS, offsetof(struct thread_info, flags));
 	DEFINE(TI_CPU, offsetof(struct thread_info, cpu));
+	DEFINE(TI_FP, offsetof(struct thread_info, fp));
+	DEFINE(TI_STATUS, offsetof(struct thread_info, status));
 	BLANK();
 
         DEFINE(TASK_BLOCKED, offsetof(struct task_struct, blocked));
diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index a6207c47f089..e4e142cb9a82 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -17,7 +17,7 @@
 
 /* Stack offsets.  */
 #define SP_OFF			184
-#define SWITCH_STACK_SIZE	320
+#define SWITCH_STACK_SIZE	64
 
 .macro	CFI_START_OSF_FRAME	func
 	.align	4
@@ -159,7 +159,6 @@
 	.cfi_rel_offset	$13, 32
 	.cfi_rel_offset	$14, 40
 	.cfi_rel_offset	$15, 48
-	/* We don't really care about the FP registers for debugging.  */
 .endm
 
 .macro	UNDO_SWITCH_STACK
@@ -498,6 +497,10 @@ ret_to_user:
 	and	$17, _TIF_WORK_MASK, $2
 	bne	$2, work_pending
 restore_all:
+	ldl	$2, TI_STATUS($8)
+	and	$2, TS_SAVED_FP | TS_RESTORE_FP, $3
+	bne	$3, restore_fpu
+restore_other:
 	.cfi_remember_state
 	RESTORE_ALL
 	call_pal PAL_rti
@@ -506,7 +509,7 @@ ret_to_kernel:
 	.cfi_restore_state
 	lda	$16, 7
 	call_pal PAL_swpipl
-	br restore_all
+	br restore_other
 
 	.align 3
 $syscall_error:
@@ -570,6 +573,14 @@ $work_notifysig:
 	.type	strace, @function
 strace:
 	/* set up signal stack, call syscall_trace */
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	bsr	$26, __save_fpu
+1:
 	DO_SWITCH_STACK
 	jsr	$26, syscall_trace_enter /* returns the syscall number */
 	UNDO_SWITCH_STACK
@@ -649,40 +660,6 @@ do_switch_stack:
 	stq	$14, 40($sp)
 	stq	$15, 48($sp)
 	stq	$26, 56($sp)
-	stt	$f0, 64($sp)
-	stt	$f1, 72($sp)
-	stt	$f2, 80($sp)
-	stt	$f3, 88($sp)
-	stt	$f4, 96($sp)
-	stt	$f5, 104($sp)
-	stt	$f6, 112($sp)
-	stt	$f7, 120($sp)
-	stt	$f8, 128($sp)
-	stt	$f9, 136($sp)
-	stt	$f10, 144($sp)
-	stt	$f11, 152($sp)
-	stt	$f12, 160($sp)
-	stt	$f13, 168($sp)
-	stt	$f14, 176($sp)
-	stt	$f15, 184($sp)
-	stt	$f16, 192($sp)
-	stt	$f17, 200($sp)
-	stt	$f18, 208($sp)
-	stt	$f19, 216($sp)
-	stt	$f20, 224($sp)
-	stt	$f21, 232($sp)
-	stt	$f22, 240($sp)
-	stt	$f23, 248($sp)
-	stt	$f24, 256($sp)
-	stt	$f25, 264($sp)
-	stt	$f26, 272($sp)
-	stt	$f27, 280($sp)
-	mf_fpcr	$f0		# get fpcr
-	stt	$f28, 288($sp)
-	stt	$f29, 296($sp)
-	stt	$f30, 304($sp)
-	stt	$f0, 312($sp)	# save fpcr in slot of $f31
-	ldt	$f0, 64($sp)	# dont let "do_switch_stack" change fp state.
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	do_switch_stack, .-do_switch_stack
@@ -701,54 +678,71 @@ undo_switch_stack:
 	ldq	$14, 40($sp)
 	ldq	$15, 48($sp)
 	ldq	$26, 56($sp)
-	ldt	$f30, 312($sp)	# get saved fpcr
-	ldt	$f0, 64($sp)
-	ldt	$f1, 72($sp)
-	ldt	$f2, 80($sp)
-	ldt	$f3, 88($sp)
-	mt_fpcr	$f30		# install saved fpcr
-	ldt	$f4, 96($sp)
-	ldt	$f5, 104($sp)
-	ldt	$f6, 112($sp)
-	ldt	$f7, 120($sp)
-	ldt	$f8, 128($sp)
-	ldt	$f9, 136($sp)
-	ldt	$f10, 144($sp)
-	ldt	$f11, 152($sp)
-	ldt	$f12, 160($sp)
-	ldt	$f13, 168($sp)
-	ldt	$f14, 176($sp)
-	ldt	$f15, 184($sp)
-	ldt	$f16, 192($sp)
-	ldt	$f17, 200($sp)
-	ldt	$f18, 208($sp)
-	ldt	$f19, 216($sp)
-	ldt	$f20, 224($sp)
-	ldt	$f21, 232($sp)
-	ldt	$f22, 240($sp)
-	ldt	$f23, 248($sp)
-	ldt	$f24, 256($sp)
-	ldt	$f25, 264($sp)
-	ldt	$f26, 272($sp)
-	ldt	$f27, 280($sp)
-	ldt	$f28, 288($sp)
-	ldt	$f29, 296($sp)
-	ldt	$f30, 304($sp)
 	lda	$sp, SWITCH_STACK_SIZE($sp)
 	ret	$31, ($1), 1
 	.cfi_endproc
 	.size	undo_switch_stack, .-undo_switch_stack
+
+#define FR(n) n * 8 + TI_FP($8)
+	.align	4
+	.globl	__save_fpu
+	.type	__save_fpu, @function
+__save_fpu:
+#define V(n) stt	$f##n, FR(n)
+	V( 0); V( 1); V( 2); V( 3)
+	V( 4); V( 5); V( 6); V( 7)
+	V( 8); V( 9); V(10); V(11)
+	V(12); V(13); V(14); V(15)
+	V(16); V(17); V(18); V(19)
+	V(20); V(21); V(22); V(23)
+	V(24); V(25); V(26); V(27)
+	mf_fpcr	$f0		# get fpcr
+	V(28); V(29); V(30)
+	stt	$f0, FR(31)	# save fpcr in slot of $f31
+	ldt	$f0, FR(0)	# don't let "__save_fpu" change fp state.
+	ret
+#undef V
+	.size	__save_fpu, .-__save_fpu
+
+	.align	4
+restore_fpu:
+	and	$3, TS_RESTORE_FP, $3
+	bic	$2, TS_SAVED_FP | TS_RESTORE_FP, $2
+	beq	$3, 1f
+#define V(n) ldt	$f##n, FR(n)
+	ldt	$f30, FR(31)	# get saved fpcr
+	V( 0); V( 1); V( 2); V( 3)
+	mt_fpcr	$f30		# install saved fpcr
+	V( 4); V( 5); V( 6); V( 7)
+	V( 8); V( 9); V(10); V(11)
+	V(12); V(13); V(14); V(15)
+	V(16); V(17); V(18); V(19)
+	V(20); V(21); V(22); V(23)
+	V(24); V(25); V(26); V(27)
+	V(28); V(29); V(30)
+1:	stl $2, TI_STATUS($8)
+	br restore_other
+#undef V
+
 \f
 /*
  * The meat of the context switch code.
  */
-
 	.align	4
 	.globl	alpha_switch_to
 	.type	alpha_switch_to, @function
 	.cfi_startproc
 alpha_switch_to:
 	DO_SWITCH_STACK
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_RESTORE_FP, $3
+	bne	$3, 1f
+	or	$1, TS_RESTORE_FP | TS_SAVED_FP, $2
+	and	$1, TS_SAVED_FP, $3
+	stl	$2, TI_STATUS($8)
+	bne	$3, 1f
+	bsr	$26, __save_fpu
+1:
 	call_pal PAL_swpctx
 	lda	$8, 0x3fff
 	UNDO_SWITCH_STACK
@@ -799,6 +793,14 @@ ret_from_kernel_thread:
 alpha_\name:
 	.prologue 0
 	bsr	$1, do_switch_stack
+	// NB: if anyone adds preemption, this block will need to be protected
+	ldl	$1, TI_STATUS($8)
+	and	$1, TS_SAVED_FP, $3
+	or	$1, TS_SAVED_FP, $2
+	bne	$3, 1f
+	stl	$2, TI_STATUS($8)
+	bsr	$26, __save_fpu
+1:
 	jsr	$26, sys_\name
 	ldq	$26, 56($sp)
 	lda	$sp, SWITCH_STACK_SIZE($sp)
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index e2e25f8b5e76..ae89739226c2 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,6 +249,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	childstack = ((struct switch_stack *) childregs) - 1;
 	childti->pcb.ksp = (unsigned long) childstack;
 	childti->pcb.flags = 1;	/* set FEN, clear everything else */
+	childti->status |= TS_SAVED_FP | TS_RESTORE_FP;
 
 	if (unlikely(args->fn)) {
 		/* kernel thread */
@@ -258,6 +259,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 		childstack->r9 = (unsigned long) args->fn;
 		childstack->r10 = (unsigned long) args->fn_arg;
 		childregs->hae = alpha_mv.hae_cache;
+		memset(childti->fp, '\0', sizeof(childti->fp));
 		childti->pcb.usp = 0;
 		return 0;
 	}
@@ -341,8 +343,7 @@ EXPORT_SYMBOL(dump_elf_task);
 int
 dump_elf_task_fp(elf_fpreg_t *dest, struct task_struct *task)
 {
-	struct switch_stack *sw = (struct switch_stack *)task_pt_regs(task) - 1;
-	memcpy(dest, sw->fp, 32 * 8);
+	memcpy(dest, task_thread_info(task)->fp, 32 * 8);
 	return 1;
 }
 EXPORT_SYMBOL(dump_elf_task_fp);
diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index a1a239ea002d..fde4c68e7a0b 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -78,6 +78,8 @@ enum {
  (PAGE_SIZE*2 - sizeof(struct pt_regs) - sizeof(struct switch_stack) \
   + offsetof(struct switch_stack, reg))
 
+#define FP_REG(reg) (offsetof(struct thread_info, reg))
+
 static int regoff[] = {
 	PT_REG(	   r0), PT_REG(	   r1), PT_REG(	   r2), PT_REG(	  r3),
 	PT_REG(	   r4), PT_REG(	   r5), PT_REG(	   r6), PT_REG(	  r7),
@@ -87,14 +89,14 @@ static int regoff[] = {
 	PT_REG(	  r20), PT_REG(	  r21), PT_REG(	  r22), PT_REG(	 r23),
 	PT_REG(	  r24), PT_REG(	  r25), PT_REG(	  r26), PT_REG(	 r27),
 	PT_REG(	  r28), PT_REG(	   gp),		   -1,		   -1,
-	SW_REG(fp[ 0]), SW_REG(fp[ 1]), SW_REG(fp[ 2]), SW_REG(fp[ 3]),
-	SW_REG(fp[ 4]), SW_REG(fp[ 5]), SW_REG(fp[ 6]), SW_REG(fp[ 7]),
-	SW_REG(fp[ 8]), SW_REG(fp[ 9]), SW_REG(fp[10]), SW_REG(fp[11]),
-	SW_REG(fp[12]), SW_REG(fp[13]), SW_REG(fp[14]), SW_REG(fp[15]),
-	SW_REG(fp[16]), SW_REG(fp[17]), SW_REG(fp[18]), SW_REG(fp[19]),
-	SW_REG(fp[20]), SW_REG(fp[21]), SW_REG(fp[22]), SW_REG(fp[23]),
-	SW_REG(fp[24]), SW_REG(fp[25]), SW_REG(fp[26]), SW_REG(fp[27]),
-	SW_REG(fp[28]), SW_REG(fp[29]), SW_REG(fp[30]), SW_REG(fp[31]),
+	FP_REG(fp[ 0]), FP_REG(fp[ 1]), FP_REG(fp[ 2]), FP_REG(fp[ 3]),
+	FP_REG(fp[ 4]), FP_REG(fp[ 5]), FP_REG(fp[ 6]), FP_REG(fp[ 7]),
+	FP_REG(fp[ 8]), FP_REG(fp[ 9]), FP_REG(fp[10]), FP_REG(fp[11]),
+	FP_REG(fp[12]), FP_REG(fp[13]), FP_REG(fp[14]), FP_REG(fp[15]),
+	FP_REG(fp[16]), FP_REG(fp[17]), FP_REG(fp[18]), FP_REG(fp[19]),
+	FP_REG(fp[20]), FP_REG(fp[21]), FP_REG(fp[22]), FP_REG(fp[23]),
+	FP_REG(fp[24]), FP_REG(fp[25]), FP_REG(fp[26]), FP_REG(fp[27]),
+	FP_REG(fp[28]), FP_REG(fp[29]), FP_REG(fp[30]), FP_REG(fp[31]),
 	PT_REG(	   pc)
 };
 
diff --git a/arch/alpha/kernel/signal.c b/arch/alpha/kernel/signal.c
index 6f47f256fe80..e62d1d461b1f 100644
--- a/arch/alpha/kernel/signal.c
+++ b/arch/alpha/kernel/signal.c
@@ -150,9 +150,10 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 {
 	unsigned long usp;
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = __get_user(regs->pc, &sc->sc_pc);
+	long err = __get_user(regs->pc, &sc->sc_pc);
 
 	current->restart_block.fn = do_no_restart_syscall;
+	current_thread_info()->status |= TS_SAVED_FP | TS_RESTORE_FP;
 
 	sw->r26 = (unsigned long) ret_from_sys_call;
 
@@ -189,9 +190,9 @@ restore_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs)
 	err |= __get_user(usp, sc->sc_regs+30);
 	wrusp(usp);
 
-	for (i = 0; i < 31; i++)
-		err |= __get_user(sw->fp[i], sc->sc_fpregs+i);
-	err |= __get_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __copy_from_user(current_thread_info()->fp,
+				sc->sc_fpregs, 31 * 8);
+	err |= __get_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	return err;
 }
@@ -272,7 +273,7 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 		 unsigned long mask, unsigned long sp)
 {
 	struct switch_stack *sw = (struct switch_stack *)regs - 1;
-	long i, err = 0;
+	long err = 0;
 
 	err |= __put_user(on_sig_stack((unsigned long)sc), &sc->sc_onstack);
 	err |= __put_user(mask, &sc->sc_mask);
@@ -312,10 +313,10 @@ setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
 	err |= __put_user(sp, sc->sc_regs+30);
 	err |= __put_user(0, sc->sc_regs+31);
 
-	for (i = 0; i < 31; i++)
-		err |= __put_user(sw->fp[i], sc->sc_fpregs+i);
+	err |= __copy_to_user(sc->sc_fpregs,
+			      current_thread_info()->fp, 31 * 8);
 	err |= __put_user(0, sc->sc_fpregs+31);
-	err |= __put_user(sw->fp[31], &sc->sc_fpcr);
+	err |= __put_user(current_thread_info()->fp[31], &sc->sc_fpcr);
 
 	err |= __put_user(regs->trap_a0, &sc->sc_traparg_a0);
 	err |= __put_user(regs->trap_a1, &sc->sc_traparg_a1);
@@ -528,6 +529,9 @@ do_work_pending(struct pt_regs *regs, unsigned long thread_flags,
 		} else {
 			local_irq_enable();
 			if (thread_flags & (_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL)) {
+				preempt_disable();
+				save_fpu();
+				preempt_enable();
 				do_signal(regs, r0, r19);
 				r0 = 0;
 			} else {
diff --git a/arch/alpha/lib/fpreg.c b/arch/alpha/lib/fpreg.c
index 34fea465645b..612c5eca71bc 100644
--- a/arch/alpha/lib/fpreg.c
+++ b/arch/alpha/lib/fpreg.c
@@ -7,6 +7,8 @@
 
 #include <linux/compiler.h>
 #include <linux/export.h>
+#include <linux/preempt.h>
+#include <asm/thread_info.h>
 
 #if defined(CONFIG_ALPHA_EV6) || defined(CONFIG_ALPHA_EV67)
 #define STT(reg,val)  asm volatile ("ftoit $f"#reg",%0" : "=r"(val));
@@ -19,7 +21,12 @@ alpha_read_fp_reg (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+	preempt_enable();
+	if (current_thread_info()->status & TS_SAVED_FP)
+		val = current_thread_info()->fp[reg];
+	else switch (reg) {
 	      case  0: STT( 0, val); break;
 	      case  1: STT( 1, val); break;
 	      case  2: STT( 2, val); break;
@@ -52,8 +59,8 @@ alpha_read_fp_reg (unsigned long reg)
 	      case 29: STT(29, val); break;
 	      case 30: STT(30, val); break;
 	      case 31: STT(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg);
@@ -67,7 +74,14 @@ EXPORT_SYMBOL(alpha_read_fp_reg);
 void
 alpha_write_fp_reg (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		current_thread_info()->status |= TS_RESTORE_FP;
+		current_thread_info()->fp[reg] = val;
+	} else switch (reg) {
 	      case  0: LDT( 0, val); break;
 	      case  1: LDT( 1, val); break;
 	      case  2: LDT( 2, val); break;
@@ -101,6 +115,7 @@ alpha_write_fp_reg (unsigned long reg, unsigned long val)
 	      case 30: LDT(30, val); break;
 	      case 31: LDT(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg);
 
@@ -115,7 +130,14 @@ alpha_read_fp_reg_s (unsigned long reg)
 {
 	unsigned long val;
 
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return 0;
+
+	preempt_enable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		LDT(0, current_thread_info()->fp[reg]);
+		STS(0, val);
+	} else switch (reg) {
 	      case  0: STS( 0, val); break;
 	      case  1: STS( 1, val); break;
 	      case  2: STS( 2, val); break;
@@ -148,8 +170,8 @@ alpha_read_fp_reg_s (unsigned long reg)
 	      case 29: STS(29, val); break;
 	      case 30: STS(30, val); break;
 	      case 31: STS(31, val); break;
-	      default: return 0;
 	}
+	preempt_enable();
 	return val;
 }
 EXPORT_SYMBOL(alpha_read_fp_reg_s);
@@ -163,7 +185,15 @@ EXPORT_SYMBOL(alpha_read_fp_reg_s);
 void
 alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 {
-	switch (reg) {
+	if (unlikely(reg >= 32))
+		return;
+
+	preempt_disable();
+	if (current_thread_info()->status & TS_SAVED_FP) {
+		current_thread_info()->status |= TS_RESTORE_FP;
+		LDS(0, val);
+		STT(0, current_thread_info()->fp[reg]);
+	} else switch (reg) {
 	      case  0: LDS( 0, val); break;
 	      case  1: LDS( 1, val); break;
 	      case  2: LDS( 2, val); break;
@@ -197,5 +227,6 @@ alpha_write_fp_reg_s (unsigned long reg, unsigned long val)
 	      case 30: LDS(30, val); break;
 	      case 31: LDS(31, val); break;
 	}
+	preempt_enable();
 }
 EXPORT_SYMBOL(alpha_write_fp_reg_s);
-- 
2.30.2


* Re: [PATCH v2 7/7] alpha: lazy FPU switching
  2022-09-02  1:50     ` [PATCH v2 7/7] alpha: lazy FPU switching Al Viro
@ 2022-09-02  4:24       ` Linus Torvalds
  2022-09-02  5:07         ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2022-09-02  4:24 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-alpha

On Thu, Sep 1, 2022 at 6:50 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         On each context switch we save the FPU registers on the stack
> of the old process and restore the FPU registers from the stack of the
> new one.  That allows us to avoid doing it each time we enter/leave
> kernel mode; however, it can get suboptimal in some cases.

Do we really care, for what is effectively a dead architecture?

This patch feels like something that might have made sense 25 years
ago. Does it make sense today?

I guess I don't care (for the same reason), but just how much testing
has this gotten, and what subtle bugs might this have?

With the asm even having a comment about how it only works because
alpha doesn't do preemption (ARCH_NO_PREEMPT), but then the C code
does do those preempt_disable/enable pairs, and I see an actual bug in
there too:

Both alpha_read_fp_reg() and alpha_read_fp_reg_s() do a
preempt_enable() -> preempt_enable() pair (ie the first one should be
a preempt_disable()).
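
For reference, this is the shape of the code in question as posted
(condensed excerpt from alpha_read_fp_reg() in the patch above;
alpha_read_fp_reg_s() has the same structure):

	if (unlikely(reg >= 32))
		return 0;
	preempt_enable();	/* should be preempt_disable() */
	if (current_thread_info()->status & TS_SAVED_FP)
		val = current_thread_info()->fp[reg];
	else switch (reg) {
		...
	}
	preempt_enable();	/* the matching enable at the end */
	return val;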

Does that bug matter? No. ARCH_NO_PREEMPT means that it's all no-ops
anyway. But it's wrong, and I think it shows the status of this patch -
well-meaning, but maybe not really fully thought out.

           Linus

* Re: [PATCH v2 7/7] alpha: lazy FPU switching
  2022-09-02  4:24       ` Linus Torvalds
@ 2022-09-02  5:07         ` Al Viro
  2022-09-02  5:14           ` Al Viro
  0 siblings, 1 reply; 29+ messages in thread
From: Al Viro @ 2022-09-02  5:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-alpha

On Thu, Sep 01, 2022 at 09:24:52PM -0700, Linus Torvalds wrote:
> On Thu, Sep 1, 2022 at 6:50 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         On each context switch we save the FPU registers on the stack
> > of the old process and restore the FPU registers from the stack of the
> > new one.  That allows us to avoid doing it each time we enter/leave
> > kernel mode; however, it can get suboptimal in some cases.
> 
> Do we really care, for what is effectively a dead architecture?

Umm...  To an extent we do - remember the fun bugs Eric had caught
wrt kernel threads that end up running with an unusual stack layout?
That's where this series came from - alpha is the worst offender
in that respect; it has a batshit crazy amount of extras on top of
pt_regs, and while the rest of that stuff could be dealt with, the
full set of FP registers is well beyond anything we could reasonably
save on each syscall entry.  And that also happens to be a killer
for ever switching to generic syscall glue.

So I wanted to see if such stuff could be dealt with; alpha FPU registers
were the worst example in the entire tree...
 
> This patch feels like something that might have made sense 25 years
> ago. Does it make sense today?
> 
> I guess I don't care (for the same reason), but just how much testing
> has this gotten, and what subtle bugs might this have?

Umm... kernel builds, libc builds (and self-tests), xfstests (qemu only;
sorry, but doing that on a DS10 with an IDE disk is just fucking awful).
Debian updates, to an extent...
 
> With the asm even having a comment about how it only works because
> alpha doesn't do preemption (ARCH_NO_PREEMPT), but then the C code
> does do those preempt_disable/enable pairs, and I see an actual bug in
> there too:
> 
> Both alpha_read_fp_reg() and alpha_read_fp_reg_s() do a
> preempt_enable() -> preempt_enable() pair (ie the first one should be
> a preempt_disable()).

Will fix.

> Does that bug matter? No. ARCH_NO_PREEMPT means that it's all no-ops
> anyway. But it's wrong, and I think it shows the status of this patch -
> well-meaning, but maybe not really fully thought out.

Any review would obviously be welcome.  Again, as far as I'm concerned,
it's more about figuring out how painful that kind of work ends up
being.

The beginning of the series is a different story (and a good example of
the reasons for taking as much as possible out of asm glue into generic
C helpers - look at the first patch and note that TIF_NOTIFY_SIGNAL
is going to grow more uses in the generic kernel).  TBH, I'm really sick
and tired of crawling through asm glue every year or so and coming
up with new piles of fun bugs ;-/  And it's not as if this only
affected dead and stillborn architectures - riscv development is quite
alive...

* Re: [PATCH v2 7/7] alpha: lazy FPU switching
  2022-09-02  5:07         ` Al Viro
@ 2022-09-02  5:14           ` Al Viro
  0 siblings, 0 replies; 29+ messages in thread
From: Al Viro @ 2022-09-02  5:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-alpha

On Fri, Sep 02, 2022 at 06:07:31AM +0100, Al Viro wrote:

> > With the asm even having a comment about how it only works because
> > alpha doesn't do preemption (ARCH_NO_PREEMPT), but then the C code
> > does do those preempt_disable/enable pairs, and I see an actual bug in
> > there too:
> > 
> > Both alpha_read_fp_reg() and alpha_read_fp_reg_s() do a
> > preempt_enable() -> preempt_enable() pair (ie the first one should be
> > a preempt_disable()).
> 
> Will fix.

	Done and pushed.  IIRC, that started as a similar comment re
"we'd need to disable preemption here if we ever grow one on alpha",
but I ended up looking at it and deciding that it's easier to just
go ahead and call preempt_disable()/preempt_enable() instead of
relying on comments.
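
	I.e. those two helpers now start with preempt_disable() paired
with the preempt_enable() at the end.  As a sketch of the fix (not a
quote of the actual commit), the prologue of alpha_read_fp_reg() /
alpha_read_fp_reg_s() becomes:

	if (unlikely(reg >= 32))
		return 0;
	preempt_disable();
	if (current_thread_info()->status & TS_SAVED_FP)
		...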

Thread overview: 29+ messages
2021-09-25  2:54 [PATCHES] alpha asm glue cleanups and fixes Al Viro
2021-09-25  2:55 ` [PATCH 1/6] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
2021-09-25  2:55   ` [PATCH 2/6] alpha: _TIF_ALLWORK_MASK is unused Al Viro
2021-09-25  2:55   ` [PATCH 3/6] alpha: fix syscall entry in !AUDIT_SYSCALL case Al Viro
2021-09-25  2:55   ` [PATCH 4/6] alpha: fix handling of a3 on straced syscalls Al Viro
2021-09-25  2:55   ` [PATCH 5/6] alpha: syscall exit cleanup Al Viro
2021-09-25  2:55   ` [PATCH 6/6] alpha: ret_from_fork can go straight to ret_to_user Al Viro
2021-09-25  2:55   ` [PATCH 7/7] alpha: lazy FPU switching Al Viro
2021-09-25 19:07     ` Linus Torvalds
2021-09-25 20:43       ` Al Viro
2021-09-25 23:18         ` Linus Torvalds
2021-09-26  0:31           ` Al Viro
2021-10-30 20:46         ` Al Viro
2021-10-30 20:46           ` Al Viro
2021-10-30 21:25           ` Maciej W. Rozycki
2021-10-30 22:13             ` Al Viro
2021-09-26  9:08       ` John Paul Adrian Glaubitz
2021-09-25  2:59 ` [PATCHES] alpha asm glue cleanups and fixes Al Viro
2022-09-02  1:48 ` Al Viro
2022-09-02  1:50   ` [PATCH v2 1/7] alpha: fix TIF_NOTIFY_SIGNAL handling Al Viro
2022-09-02  1:50     ` [PATCH v2 2/7] alpha: _TIF_ALLWORK_MASK is unused Al Viro
2022-09-02  1:50     ` [PATCH v2 3/7] alpha: fix syscall entry in !AUDIT_SYSCALL case Al Viro
2022-09-02  1:50     ` [PATCH v2 4/7] alpha: fix handling of a3 on straced syscalls Al Viro
2022-09-02  1:50     ` [PATCH v2 5/7] alpha: syscall exit cleanup Al Viro
2022-09-02  1:50     ` [PATCH v2 6/7] alpha: ret_from_fork can go straight to ret_to_user Al Viro
2022-09-02  1:50     ` [PATCH v2 7/7] alpha: lazy FPU switching Al Viro
2022-09-02  4:24       ` Linus Torvalds
2022-09-02  5:07         ` Al Viro
2022-09-02  5:14           ` Al Viro
