* [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps
@ 2021-02-01 12:28 Mark Brown
  2021-02-01 12:29 ` [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags Mark Brown
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Mark Brown @ 2021-02-01 12:28 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Julien Grall, Julien Grall, Zhang Lei, Mark Brown, Dave Martin,
	linux-arm-kernel, Daniel Kiss

This patch series aims to improve the performance of handling SVE access
traps.  Earlier versions were originally written by Julien Grall, but
based on discussions on previous versions the patches have been
substantially reworked to use a different approach.  The patches are now
different enough that I set myself as the author; hopefully that's OK
for Julien.

I've marked this as RFC since it's not quite ready yet, but I'd really
like feedback on the overall approach as it is a big change in
implementation.  It needs at least one more pass for polish, and while
it's holding up in my testing thus far I've not done as much as I'd like
yet.

Per the syscall ABI, SVE registers will be unknown after a syscall.  In
practice, the kernel will disable SVE and the registers will be zeroed
(except the first 128 bits of each vector) on the next SVE instruction.
Currently we do this by saving the FPSIMD state to memory, converting it
to the matching SVE state and then reloading the registers on return to
userspace.  This requires a lot of memory accesses that we shouldn't
need.  Improve this by reworking the SVE state tracking so that we track
whether we should trap on executing SVE instructions separately from
whether we need to save the full register state.  This allows us to
avoid tracking the full SVE state until we need to return to userspace,
and to convert directly in registers in the common case where the FPSIMD
state is still in registers at that point.

As with current mainline we disable SVE on every syscall.  This may not
be ideal for applications that mix SVE and syscall usage; strategies
such as SH's fpu_counter may perform better, but we need to assess the
performance on a wider range of systems than are currently available
before implementing anything.

It is also possible to optimize the case where the SVE vector length
is 128 bits (i.e. the same size as the FPSIMD vectors).  This could be
explored in the future; it becomes a lot easier to do with this
implementation.

v7:
 - A few minor cosmetic updates and one bugfix for
   fpsimd_update_current_state().
v6:
 - Substantially rework the patch so that TIF_SVE is now replaced by
   two flags TIF_SVE_EXEC and TIF_SVE_FULL_REGS.
 - Return to disabling SVE after every syscall as for current
   mainline rather than leaving it enabled unless reset via ptrace.
v5:
 - Rebase onto v5.10-rc2.
 - Explicitly support the case where TIF_SVE and TIF_SVE_NEEDS_FLUSH are
   set simultaneously, though this is not currently expected to happen.
 - Extensively revised the documentation for TIF_SVE and
   TIF_SVE_NEEDS_FLUSH to hopefully make things clearer together with
   the above; I hope this addresses the comments on the prior version,
   but it really needs fresh eyes to tell if that's actually the case.
 - Make comments in ptrace.c more precise.
 - Remove some redundant checks for system_has_sve().
v4:
 - Rebase onto v5.9-rc2
 - Address review comments from Dave Martin, mostly documentation but
   also some refactorings to ensure we don't check capabilities multiple
   times, and the addition of some WARN_ONs to verify our assumptions
   about which TIF_ flags can be set when.
v3:
 - Rebased to current kernels.
 - Addressed review comments from v2, mostly around tweaks in the
   documentation.

Mark Brown (2):
  arm64/sve: Split TIF_SVE into separate execute and register state
    flags
  arm64/sve: Rework SVE trap access to minimise memory access

 arch/arm64/include/asm/fpsimd.h      |   2 +
 arch/arm64/include/asm/thread_info.h |   3 +-
 arch/arm64/kernel/entry-fpsimd.S     |   5 +
 arch/arm64/kernel/fpsimd.c           | 203 +++++++++++++++++++--------
 arch/arm64/kernel/process.c          |   7 +-
 arch/arm64/kernel/ptrace.c           |   8 +-
 arch/arm64/kernel/signal.c           |  15 +-
 arch/arm64/kernel/syscall.c          |   3 +-
 arch/arm64/kvm/fpsimd.c              |   6 +-
 9 files changed, 179 insertions(+), 73 deletions(-)


base-commit: 7c53f6b671f4aba70ff15e1b05148b10d58c2837
-- 
2.20.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-01 12:28 [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Mark Brown
@ 2021-02-01 12:29 ` Mark Brown
  2021-02-01 15:35   ` Dave Martin
                     ` (2 more replies)
  2021-02-01 12:29 ` [PATCH v7 2/2] arm64/sve: Rework SVE trap access to minimise memory access Mark Brown
  2021-02-08 17:26 ` [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Dave Martin
  2 siblings, 3 replies; 17+ messages in thread
From: Mark Brown @ 2021-02-01 12:29 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Julien Grall, Julien Grall, Zhang Lei, Mark Brown, Dave Martin,
	linux-arm-kernel, Daniel Kiss

Currently we have a single flag, TIF_SVE, which says that a task is
allowed to execute SVE instructions without trapping and also that full
SVE register state is stored for the task.  This results in us doing
extra work storing and restoring the full SVE register state even in
those cases where the ABI is that only the first 128 bits of the Z0-Z31
registers, which are shared with the FPSIMD V0-V31 registers, are valid.

In order to allow us to avoid these overheads, split TIF_SVE into two
separate flags: TIF_SVE_EXEC, which allows execution of SVE instructions
without trapping, and TIF_SVE_FULL_REGS, which indicates that the full
SVE register state is stored.  If both are set the behaviour is
unchanged from the present.  If TIF_SVE_EXEC is set without
TIF_SVE_FULL_REGS then we save and restore only the FPSIMD registers
until we return to userspace with TIF_SVE_EXEC enabled, at which point
we convert the FPSIMD registers to SVE.  It is not meaningful to have
TIF_SVE_FULL_REGS set without TIF_SVE_EXEC.

This patch is intended only to split the flags; it does not take
advantage of the ability to set the flags independently, and the new
state with only TIF_SVE_EXEC set should not be observed.

This is based on earlier work by Julien Grall implementing a slightly
different approach.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/include/asm/thread_info.h |   3 +-
 arch/arm64/kernel/fpsimd.c           | 174 +++++++++++++++++++--------
 arch/arm64/kernel/process.c          |   7 +-
 arch/arm64/kernel/ptrace.c           |   8 +-
 arch/arm64/kernel/signal.c           |  15 ++-
 arch/arm64/kernel/syscall.c          |   3 +-
 arch/arm64/kvm/fpsimd.c              |   6 +-
 7 files changed, 152 insertions(+), 64 deletions(-)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 9f4e3b266f21..c856159e071c 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -65,6 +65,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_MTE_ASYNC_FAULT	5	/* MTE Asynchronous Tag Check Fault */
 #define TIF_NOTIFY_SIGNAL	6	/* signal notifications exist */
+#define TIF_SVE_EXEC		7	/* SVE instructions don't trap */
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
@@ -75,7 +76,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_RESTORE_SIGMASK	20
 #define TIF_SINGLESTEP		21
 #define TIF_32BIT		22	/* 32bit process */
-#define TIF_SVE			23	/* Scalable Vector Extension in use */
+#define TIF_SVE_FULL_REGS	23	/* Full SVE register set stored */
 #define TIF_SVE_VL_INHERIT	24	/* Inherit sve_vl_onexec across exec */
 #define TIF_SSBD		25	/* Wants SSB mitigation */
 #define TIF_TAGGED_ADDR		26	/* Allow tagged user addresses */
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 062b21f30f94..58c749ef04c4 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -215,48 +215,70 @@ static bool have_cpu_fpsimd_context(void)
  */
 static void __sve_free(struct task_struct *task)
 {
+	/* SVE context will be zeroed when allocated. */
+	clear_tsk_thread_flag(task, TIF_SVE_FULL_REGS);
 	kfree(task->thread.sve_state);
 	task->thread.sve_state = NULL;
 }
 
 static void sve_free(struct task_struct *task)
 {
-	WARN_ON(test_tsk_thread_flag(task, TIF_SVE));
+	WARN_ON(test_tsk_thread_flag(task, TIF_SVE_EXEC));
 
 	__sve_free(task);
 }
 
 /*
- * TIF_SVE controls whether a task can use SVE without trapping while
- * in userspace, and also the way a task's FPSIMD/SVE state is stored
- * in thread_struct.
+ * In order to avoid the expense of storing the SVE registers when not
+ * in active use by tasks we keep track of the task's SVE usage and
+ * only allocate space for SVE registers for tasks that need it.  In
+ * addition since on first use and after every syscall only the portion
+ * of the SVE registers shared with FPSIMD are used we separately track
+ * if we need to actually save all that state.
  *
- * The kernel uses this flag to track whether a user task is actively
- * using SVE, and therefore whether full SVE register state needs to
- * be tracked.  If not, the cheaper FPSIMD context handling code can
- * be used instead of the more costly SVE equivalents.
+ * TIF_SVE_EXEC controls whether a task can use SVE without trapping
+ * while in userspace.  TIF_SVE_FULL_REGS controls the way a task's
+ * FPSIMD/SVE state is stored in thread_struct. The kernel uses this
+ * flag to track whether a user task has active SVE state, and
+ * therefore whether full SVE register state needs to be tracked.  If
+ * not, the cheaper FPSIMD context handling code can be used instead
+ * of the more costly SVE equivalents.
  *
- *  * TIF_SVE set:
+ *  * FPSR and FPCR are always stored in task->thread.uw.fpsimd_state
+ *    irrespective of any flags, since these are not vector length
+ *    dependent.
  *
- *    The task can execute SVE instructions while in userspace without
- *    trapping to the kernel.
+ *  * TIF_SVE_EXEC is not set:
  *
- *    When stored, Z0-Z31 (incorporating Vn in bits[127:0] or the
- *    corresponding Zn), P0-P15 and FFR are encoded in in
- *    task->thread.sve_state, formatted appropriately for vector
- *    length task->thread.sve_vl.
+ *    An attempt by the user task to execute an SVE instruction causes
+ *    do_sve_acc() to be called, which does some preparation and sets
+ *    TIF_SVE_EXEC.
+ *
+ *    When stored, FPSIMD registers V0-V31 are encoded in
+ *    task->thread.uw.fpsimd_state; bits [max : 128] for each of Z0-Z31 are
+ *    logically zero but not stored anywhere; P0-P15 and FFR are not
+ *    stored and have unspecified values from userspace's point of
+ *    view.  For hygiene purposes, the kernel zeroes them on next use,
+ *    but userspace is discouraged from relying on this.
+ *
+ *    task->thread.sve_state does not need to be non-NULL, valid or any
+ *    particular size: it must not be dereferenced.  TIF_SVE_FULL_REGS
+ *    will have no effect and should never be set.
+ *
+ *  * TIF_SVE_EXEC set:
+ *
+ *    The task can execute SVE instructions while in userspace without
+ *    trapping to the kernel.  Storage of Z0-Z31 (incorporating Vn in
+ *    bits[0-127]) is determined by TIF_SVE_FULL_REGS.
  *
  *    task->thread.sve_state must point to a valid buffer at least
  *    sve_state_size(task) bytes in size.
  *
- *    During any syscall, the kernel may optionally clear TIF_SVE and
- *    discard the vector state except for the FPSIMD subset.
- *
- *  * TIF_SVE clear:
+ *    During any syscall the ABI allows the kernel to discard the
+ *    vector state other than the FPSIMD subset.  When this is done
+ *    both TIF_SVE_EXEC and TIF_SVE_FULL_REGS will be cleared.
  *
- *    An attempt by the user task to execute an SVE instruction causes
- *    do_sve_acc() to be called, which does some preparation and then
- *    sets TIF_SVE.
+ *  * TIF_SVE_FULL_REGS is not set:
  *
  *    When stored, FPSIMD registers V0-V31 are encoded in
  *    task->thread.uw.fpsimd_state; bits [max : 128] for each of Z0-Z31 are
@@ -265,12 +287,38 @@ static void sve_free(struct task_struct *task)
  *    view.  For hygiene purposes, the kernel zeroes them on next use,
  *    but userspace is discouraged from relying on this.
  *
- *    task->thread.sve_state does not need to be non-NULL, valid or any
- *    particular size: it must not be dereferenced.
+ *    On entry to the kernel other than from a syscall the kernel must
+ *    set TIF_SVE_FULL_REGS and save the full register state.
  *
- *  * FPSR and FPCR are always stored in task->thread.uw.fpsimd_state
- *    irrespective of whether TIF_SVE is clear or set, since these are
- *    not vector length dependent.
+ *  * TIF_SVE_FULL_REGS is set:
+ *
+ *    This flag is only valid when TIF_SVE_EXEC is set.
+ *
+ *    When stored, Z0-Z31 (incorporating Vn in bits[127:0] or the
+ *    corresponding Zn), P0-P15 and FFR are encoded in in
+ *    task->thread.sve_state, formatted appropriately for vector
+ *    length task->thread.sve_vl.
+ *
+ *    On entry to the kernel from a syscall this flag and TIF_SVE_EXEC
+ *    are cleared and only the FPSIMD subset of the register state is
+ *    stored.
+ *
+ * In summary, combined with TIF_FOREIGN_FPSTATE:
+ *
+ *          !SVE           _EXEC           _EXEC+_FULL_REGS
+ *        +---------------+---------------+---------------+
+ *        | Valid: FPSIMD | Valid: FPSIMD | Valid: SVE    |
+ * !FFP   | Trap:  Yes    | Trap:  No     | Trap:  No     |
+ *        | Where: regs   | Where: regs   | Where: regs   |
+ *        +---------------+---------------+---------------+
+ *        | Valid: FPSIMD | Valid: FPSIMD | Valid: SVE    |
+ * FFP    | Trap:  Yes    | Trap:  No     | Trap:  No     |
+ *        | Where: memory | Where: memory | Where: memory |
+ *        +---------------+---------------+---------------+
+ *
+ * Where valid indicates what state is valid, trap indicates if we
+ * should trap on executing a SVE instruction and where indicates
+ * where the current copy of the register state is.
  */
 
 /*
@@ -279,18 +327,37 @@ static void sve_free(struct task_struct *task)
  * This function should be called only when the FPSIMD/SVE state in
  * thread_struct is known to be up to date, when preparing to enter
  * userspace.
+ *
+ * When TIF_SVE_EXEC is set but TIF_SVE_FULL_REGS is not set the SVE
+ * state will be restored from the FPSIMD state.
  */
 static void task_fpsimd_load(void)
 {
+	unsigned int vl;
+
 	WARN_ON(!system_supports_fpsimd());
 	WARN_ON(!have_cpu_fpsimd_context());
 
-	if (system_supports_sve() && test_thread_flag(TIF_SVE))
-		sve_load_state(sve_pffr(&current->thread),
-			       &current->thread.uw.fpsimd_state.fpsr,
-			       sve_vq_from_vl(current->thread.sve_vl) - 1);
-	else
-		fpsimd_load_state(&current->thread.uw.fpsimd_state);
+	if (test_thread_flag(TIF_SVE_EXEC)) {
+		vl = sve_vq_from_vl(current->thread.sve_vl) - 1;
+
+		/*
+		 * We always return with the full register state, if
+		 * there is no explicit SVE state load from the FPSIMD
+		 * state instead.
+		 */
+		if (test_and_set_thread_flag(TIF_SVE_FULL_REGS))
+			sve_load_state(sve_pffr(&current->thread),
+				       &current->thread.uw.fpsimd_state.fpsr,
+				       vl);
+		else
+			sve_load_from_fpsimd_state(&current->thread.uw.fpsimd_state,
+						   vl);
+
+		return;
+	}
+
+	fpsimd_load_state(&current->thread.uw.fpsimd_state);
 }
 
 /*
@@ -307,7 +374,7 @@ static void fpsimd_save(void)
 	WARN_ON(!have_cpu_fpsimd_context());
 
 	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
-		if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
+		if (system_supports_sve() && test_thread_flag(TIF_SVE_EXEC)) {
 			if (WARN_ON(sve_get_vl() != last->sve_vl)) {
 				/*
 				 * Can't save the user regs, so current would
@@ -318,6 +385,7 @@ static void fpsimd_save(void)
 				return;
 			}
 
+			set_thread_flag(TIF_SVE_FULL_REGS);
 			sve_save_state((char *)last->sve_state +
 						sve_ffr_offset(last->sve_vl),
 				       &last->st->fpsr);
@@ -536,7 +604,7 @@ void sve_alloc(struct task_struct *task)
  */
 void fpsimd_sync_to_sve(struct task_struct *task)
 {
-	if (!test_tsk_thread_flag(task, TIF_SVE))
+	if (!test_tsk_thread_flag(task, TIF_SVE_FULL_REGS))
 		fpsimd_to_sve(task);
 }
 
@@ -550,7 +618,7 @@ void fpsimd_sync_to_sve(struct task_struct *task)
  */
 void sve_sync_to_fpsimd(struct task_struct *task)
 {
-	if (test_tsk_thread_flag(task, TIF_SVE))
+	if (test_tsk_thread_flag(task, TIF_SVE_FULL_REGS))
 		sve_to_fpsimd(task);
 }
 
@@ -572,7 +640,7 @@ void sve_sync_from_fpsimd_zeropad(struct task_struct *task)
 	void *sst = task->thread.sve_state;
 	struct user_fpsimd_state const *fst = &task->thread.uw.fpsimd_state;
 
-	if (!test_tsk_thread_flag(task, TIF_SVE))
+	if (!test_tsk_thread_flag(task, TIF_SVE_EXEC))
 		return;
 
 	vq = sve_vq_from_vl(task->thread.sve_vl);
@@ -627,8 +695,9 @@ int sve_set_vector_length(struct task_struct *task,
 	}
 
 	fpsimd_flush_task_state(task);
-	if (test_and_clear_tsk_thread_flag(task, TIF_SVE))
+	if (test_and_clear_tsk_thread_flag(task, TIF_SVE_FULL_REGS))
 		sve_to_fpsimd(task);
+	clear_thread_flag(TIF_SVE_EXEC);
 
 	if (task == current)
 		put_cpu_fpsimd_context();
@@ -926,13 +995,14 @@ void fpsimd_release_task(struct task_struct *dead_task)
  * Trapped SVE access
  *
  * Storage is allocated for the full SVE state, the current FPSIMD
- * register contents are migrated across, and TIF_SVE is set so that
+ * register contents are migrated across, and TIF_SVE_EXEC is set so that
  * the SVE access trap will be disabled the next time this task
  * reaches ret_to_user.
  *
- * TIF_SVE should be clear on entry: otherwise, fpsimd_restore_current_state()
- * would have disabled the SVE access trap for userspace during
- * ret_to_user, making an SVE access trap impossible in that case.
+ * TIF_SVE_EXEC should be clear on entry: otherwise,
+ * fpsimd_restore_current_state() would have disabled the SVE access
+ * trap for userspace during ret_to_user, making an SVE access trap
+ * impossible in that case.
  */
 void do_sve_acc(unsigned int esr, struct pt_regs *regs)
 {
@@ -952,8 +1022,9 @@ void do_sve_acc(unsigned int esr, struct pt_regs *regs)
 	fpsimd_flush_task_state(current);
 
 	fpsimd_to_sve(current);
-	if (test_and_set_thread_flag(TIF_SVE))
+	if (test_and_set_thread_flag(TIF_SVE_EXEC))
 		WARN_ON(1); /* SVE access shouldn't have trapped */
+	set_thread_flag(TIF_SVE_FULL_REGS);
 
 	put_cpu_fpsimd_context();
 }
@@ -1033,7 +1104,8 @@ void fpsimd_flush_thread(void)
 	       sizeof(current->thread.uw.fpsimd_state));
 
 	if (system_supports_sve()) {
-		clear_thread_flag(TIF_SVE);
+		clear_thread_flag(TIF_SVE_EXEC);
+		clear_thread_flag(TIF_SVE_FULL_REGS);
 		sve_free(current);
 
 		/*
@@ -1092,7 +1164,7 @@ void fpsimd_preserve_current_state(void)
 void fpsimd_signal_preserve_current_state(void)
 {
 	fpsimd_preserve_current_state();
-	if (system_supports_sve() && test_thread_flag(TIF_SVE))
+	if (system_supports_sve() && test_thread_flag(TIF_SVE_FULL_REGS))
 		sve_to_fpsimd(current);
 }
 
@@ -1114,7 +1186,7 @@ void fpsimd_bind_task_to_cpu(void)
 
 	if (system_supports_sve()) {
 		/* Toggle SVE trapping for userspace if needed */
-		if (test_thread_flag(TIF_SVE))
+		if (test_thread_flag(TIF_SVE_EXEC))
 			sve_user_enable();
 		else
 			sve_user_disable();
@@ -1163,6 +1235,14 @@ void fpsimd_restore_current_state(void)
 	if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
 		task_fpsimd_load();
 		fpsimd_bind_task_to_cpu();
+	} else {
+		/*
+		 * Convert FPSIMD state to SVE if userspace can execute SVE
+		 * but we have no explicit SVE state.
+		 */
+		if (test_thread_flag(TIF_SVE_EXEC) &&
+		    !test_and_set_thread_flag(TIF_SVE_FULL_REGS))
+			sve_flush_live();
 	}
 
 	put_cpu_fpsimd_context();
@@ -1181,7 +1261,7 @@ void fpsimd_update_current_state(struct user_fpsimd_state const *state)
 	get_cpu_fpsimd_context();
 
 	current->thread.uw.fpsimd_state = *state;
-	if (system_supports_sve() && test_thread_flag(TIF_SVE))
+	if (test_thread_flag(TIF_SVE_FULL_REGS))
 		fpsimd_to_sve(current);
 
 	task_fpsimd_load();
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6616486a58fe..71c8265b9139 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -364,13 +364,14 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	 * Detach src's sve_state (if any) from dst so that it does not
 	 * get erroneously used or freed prematurely.  dst's sve_state
 	 * will be allocated on demand later on if dst uses SVE.
-	 * For consistency, also clear TIF_SVE here: this could be done
+	 * For consistency, also clear TIF_SVE_* here: this could be done
 	 * later in copy_process(), but to avoid tripping up future
-	 * maintainers it is best not to leave TIF_SVE and sve_state in
+	 * maintainers it is best not to leave TIF_SVE_* and sve_state in
 	 * an inconsistent state, even temporarily.
 	 */
 	dst->thread.sve_state = NULL;
-	clear_tsk_thread_flag(dst, TIF_SVE);
+	clear_tsk_thread_flag(dst, TIF_SVE_EXEC);
+	clear_tsk_thread_flag(dst, TIF_SVE_FULL_REGS);
 
 	/* clear any pending asynchronous tag fault raised by the parent */
 	clear_tsk_thread_flag(dst, TIF_MTE_ASYNC_FAULT);
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 8ac487c84e37..f0406b3dc389 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -719,7 +719,7 @@ static void sve_init_header_from_task(struct user_sve_header *header,
 
 	memset(header, 0, sizeof(*header));
 
-	header->flags = test_tsk_thread_flag(target, TIF_SVE) ?
+	header->flags = test_tsk_thread_flag(target, TIF_SVE_FULL_REGS) ?
 		SVE_PT_REGS_SVE : SVE_PT_REGS_FPSIMD;
 	if (test_tsk_thread_flag(target, TIF_SVE_VL_INHERIT))
 		header->flags |= SVE_PT_VL_INHERIT;
@@ -827,7 +827,8 @@ static int sve_set(struct task_struct *target,
 	if ((header.flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_FPSIMD) {
 		ret = __fpr_set(target, regset, pos, count, kbuf, ubuf,
 				SVE_PT_FPSIMD_OFFSET);
-		clear_tsk_thread_flag(target, TIF_SVE);
+		clear_tsk_thread_flag(target, TIF_SVE_EXEC);
+		clear_tsk_thread_flag(target, TIF_SVE_FULL_REGS);
 		goto out;
 	}
 
@@ -851,7 +852,8 @@ static int sve_set(struct task_struct *target,
 	 * unmodified.
 	 */
 	fpsimd_sync_to_sve(target);
-	set_tsk_thread_flag(target, TIF_SVE);
+	set_tsk_thread_flag(target, TIF_SVE_EXEC);
+	set_tsk_thread_flag(target, TIF_SVE_FULL_REGS);
 
 	BUILD_BUG_ON(SVE_PT_SVE_OFFSET != sizeof(header));
 	start = SVE_PT_SVE_OFFSET;
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index f71d6ce4673f..5d5610af7ea3 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -205,7 +205,8 @@ static int restore_fpsimd_context(struct fpsimd_context __user *ctx)
 	__get_user_error(fpsimd.fpsr, &ctx->fpsr, err);
 	__get_user_error(fpsimd.fpcr, &ctx->fpcr, err);
 
-	clear_thread_flag(TIF_SVE);
+	clear_thread_flag(TIF_SVE_EXEC);
+	clear_thread_flag(TIF_SVE_FULL_REGS);
 
 	/* load the hardware registers from the fpsimd_state structure */
 	if (!err)
@@ -229,7 +230,7 @@ static int preserve_sve_context(struct sve_context __user *ctx)
 	unsigned int vl = current->thread.sve_vl;
 	unsigned int vq = 0;
 
-	if (test_thread_flag(TIF_SVE))
+	if (test_thread_flag(TIF_SVE_EXEC))
 		vq = sve_vq_from_vl(vl);
 
 	memset(reserved, 0, sizeof(reserved));
@@ -241,7 +242,7 @@ static int preserve_sve_context(struct sve_context __user *ctx)
 	BUILD_BUG_ON(sizeof(ctx->__reserved) != sizeof(reserved));
 	err |= __copy_to_user(&ctx->__reserved, reserved, sizeof(reserved));
 
-	if (vq) {
+	if (vq && test_thread_flag(TIF_SVE_FULL_REGS)) {
 		/*
 		 * This assumes that the SVE state has already been saved to
 		 * the task struct by calling the function
@@ -269,7 +270,8 @@ static int restore_sve_fpsimd_context(struct user_ctxs *user)
 		return -EINVAL;
 
 	if (sve.head.size <= sizeof(*user->sve)) {
-		clear_thread_flag(TIF_SVE);
+		clear_thread_flag(TIF_SVE_EXEC);
+		clear_thread_flag(TIF_SVE_FULL_REGS);
 		goto fpsimd_only;
 	}
 
@@ -296,7 +298,8 @@ static int restore_sve_fpsimd_context(struct user_ctxs *user)
 	if (err)
 		return -EFAULT;
 
-	set_thread_flag(TIF_SVE);
+	set_thread_flag(TIF_SVE_EXEC);
+	set_thread_flag(TIF_SVE_FULL_REGS);
 
 fpsimd_only:
 	/* copy the FP and status/control registers */
@@ -587,7 +590,7 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user,
 	if (system_supports_sve()) {
 		unsigned int vq = 0;
 
-		if (add_all || test_thread_flag(TIF_SVE)) {
+		if (add_all || test_thread_flag(TIF_SVE_EXEC)) {
 			int vl = sve_max_vl;
 
 			if (!add_all)
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index f61e9d8cc55a..f8a2598730c2 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -186,7 +186,8 @@ static inline void sve_user_discard(void)
 	if (!system_supports_sve())
 		return;
 
-	clear_thread_flag(TIF_SVE);
+	clear_thread_flag(TIF_SVE_EXEC);
+	clear_thread_flag(TIF_SVE_FULL_REGS);
 
 	/*
 	 * task_fpsimd_load() won't be called to update CPACR_EL1 in
diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
index 3e081d556e81..1b7c0d03581b 100644
--- a/arch/arm64/kvm/fpsimd.c
+++ b/arch/arm64/kvm/fpsimd.c
@@ -67,7 +67,7 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
 			      KVM_ARM64_HOST_SVE_ENABLED);
 	vcpu->arch.flags |= KVM_ARM64_FP_HOST;
 
-	if (test_thread_flag(TIF_SVE))
+	if (test_thread_flag(TIF_SVE_EXEC))
 		vcpu->arch.flags |= KVM_ARM64_HOST_SVE_IN_USE;
 
 	if (read_sysreg(cpacr_el1) & CPACR_EL1_ZEN_EL0EN)
@@ -90,7 +90,7 @@ void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu)
 					 vcpu->arch.sve_max_vl);
 
 		clear_thread_flag(TIF_FOREIGN_FPSTATE);
-		update_thread_flag(TIF_SVE, vcpu_has_sve(vcpu));
+		update_thread_flag(TIF_SVE_EXEC, vcpu_has_sve(vcpu));
 	}
 }
 
@@ -127,7 +127,7 @@ void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu)
 			sysreg_clear_set(CPACR_EL1, CPACR_EL1_ZEN_EL0EN, 0);
 	}
 
-	update_thread_flag(TIF_SVE,
+	update_thread_flag(TIF_SVE_EXEC,
 			   vcpu->arch.flags & KVM_ARM64_HOST_SVE_IN_USE);
 
 	local_irq_restore(flags);
-- 
2.20.1




* [PATCH v7 2/2] arm64/sve: Rework SVE trap access to minimise memory access
  2021-02-01 12:28 [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Mark Brown
  2021-02-01 12:29 ` [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags Mark Brown
@ 2021-02-01 12:29 ` Mark Brown
  2021-02-10 11:09   ` Dave Martin
  2021-02-08 17:26 ` [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Dave Martin
  2 siblings, 1 reply; 17+ messages in thread
From: Mark Brown @ 2021-02-01 12:29 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Julien Grall, Julien Grall, Zhang Lei, Mark Brown, Dave Martin,
	linux-arm-kernel, Daniel Kiss

When we take an SVE access trap, only the subset of the SVE Z0-Z31
registers shared with the FPSIMD V0-V31 registers is valid; the rest
of the bits in the SVE registers must be cleared before returning to
userspace.  Currently we do this by saving the current FPSIMD register
state to the task struct and then using that to initialize the copy of
the SVE registers in the task struct so they can be loaded from there
into the registers.  This requires a lot more memory access than we
need.

The newly added TIF_SVE_FULL_REGS can be used to reduce this overhead:
instead of doing the conversion immediately we can set only TIF_SVE_EXEC
and not TIF_SVE_FULL_REGS.  This means that until we return to userspace
we only need to store the FPSIMD registers, and if (as should be the
common case) the hardware still has the task state and does not need it
to be reloaded from the task struct, we can do the initialization of
the SVE state entirely in registers.  In the event that we do need to
reload the registers from the task struct, only the FPSIMD subset needs
to be loaded from memory.

If the FPSIMD state is loaded then we need to set the vector length;
the vector length is otherwise only set when loading from memory, and
the expectation is that the vector length is set whenever TIF_SVE_EXEC
is set.  We also need to rebind the task to the CPU so that the newly
allocated SVE state is used when the task is saved.

This is based on earlier work by Julien Grall implementing a similar idea.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/include/asm/fpsimd.h  |  2 ++
 arch/arm64/kernel/entry-fpsimd.S |  5 +++++
 arch/arm64/kernel/fpsimd.c       | 35 +++++++++++++++++++++-----------
 3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index bec5f14b622a..e60aa4ebb351 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -74,6 +74,8 @@ extern void sve_load_from_fpsimd_state(struct user_fpsimd_state const *state,
 				       unsigned long vq_minus_1);
 extern unsigned int sve_get_vl(void);
 
+extern void sve_set_vq(unsigned long vq_minus_1);
+
 struct arm64_cpu_capabilities;
 extern void sve_kernel_enable(const struct arm64_cpu_capabilities *__unused);
 
diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S
index 2ca395c25448..3ecec60d3295 100644
--- a/arch/arm64/kernel/entry-fpsimd.S
+++ b/arch/arm64/kernel/entry-fpsimd.S
@@ -48,6 +48,11 @@ SYM_FUNC_START(sve_get_vl)
 	ret
 SYM_FUNC_END(sve_get_vl)
 
+SYM_FUNC_START(sve_set_vq)
+	sve_load_vq x0, x1, x2
+	ret
+SYM_FUNC_END(sve_set_vq)
+
 /*
  * Load SVE state from FPSIMD state.
  *
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 58c749ef04c4..05caf207e2ce 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -994,10 +994,10 @@ void fpsimd_release_task(struct task_struct *dead_task)
 /*
  * Trapped SVE access
  *
- * Storage is allocated for the full SVE state, the current FPSIMD
- * register contents are migrated across, and TIF_SVE_EXEC is set so that
- * the SVE access trap will be disabled the next time this task
- * reaches ret_to_user.
+ * Storage is allocated for the full SVE state so that the code
+ * running subsequently has somewhere to save the SVE registers to. We
+ * then rely on ret_to_user to actually convert the FPSIMD registers
+ * to SVE state by flushing as required.
  *
  * TIF_SVE_EXEC should be clear on entry: otherwise,
  * fpsimd_restore_current_state() would have disabled the SVE access
@@ -1016,15 +1016,26 @@ void do_sve_acc(unsigned int esr, struct pt_regs *regs)
 
 	get_cpu_fpsimd_context();
 
-	fpsimd_save();
-
-	/* Force ret_to_user to reload the registers: */
-	fpsimd_flush_task_state(current);
-
-	fpsimd_to_sve(current);
+	/*
+	 * We shouldn't trap if we can execute SVE instructions and
+	 * there should be no SVE state if that is the case.
+	 */
 	if (test_and_set_thread_flag(TIF_SVE_EXEC))
-		WARN_ON(1); /* SVE access shouldn't have trapped */
-	set_thread_flag(TIF_SVE_FULL_REGS);
+		WARN_ON(1);
+	if (test_and_clear_thread_flag(TIF_SVE_FULL_REGS))
+		WARN_ON(1);
+
+	/*
+	 * When the FPSIMD state is loaded:
+	 *      - The return path (see fpsimd_restore_current_state) requires
+	 *        the vector length to be loaded beforehand.
+	 *      - We need to rebind the task to the CPU so the newly allocated
+	 *        SVE state is used when the task is saved.
+	 */
+	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
+		sve_set_vq(sve_vq_from_vl(current->thread.sve_vl) - 1);
+		fpsimd_bind_task_to_cpu();
+	}
 
 	put_cpu_fpsimd_context();
 }
-- 
2.20.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-01 12:29 ` [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags Mark Brown
@ 2021-02-01 15:35   ` Dave Martin
  2021-02-01 15:45     ` Mark Brown
  2021-02-09 17:59   ` Dave Martin
  2021-02-10 10:56   ` Dave Martin
  2 siblings, 1 reply; 17+ messages in thread
From: Dave Martin @ 2021-02-01 15:35 UTC (permalink / raw)
  To: Mark Brown
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss

On Mon, Feb 01, 2021 at 12:29:00PM +0000, Mark Brown wrote:
> Currently we have a single flag TIF_SVE which says that a task is
> allowed to execute SVE instructions without trapping and also that full
> SVE register state is stored for the task.  This results in us doing
> extra work storing and restoring the full SVE register state even in
> those cases where the ABI is that only the first 128 bits of the Z0-Z31
> registers which are shared with the FPSIMD V0-V31 are valid.
> 
> In order to allow us to avoid these overheads split TIF_SVE up so that
> we have two separate flags, TIF_SVE_EXEC which allows execution of SVE
> instructions without trapping and TIF_SVE_FULL_REGS which indicates that
> the full SVE register state is stored.  If both are set the behaviour is
> as currently, if TIF_SVE_EXEC is set without TIF_SVE_FULL_REGS then we
> save and restore only the FPSIMD registers until we return to userspace
> with TIF_SVE_EXEC enabled at which point we convert the FPSIMD registers
> to SVE.  It is not meaningful to have TIF_SVE_FULL_REGS set without
> TIF_SVE_EXEC.
> 
> This patch is intended only to split the flags; it does not take
> advantage of the ability to set the flags independently and the new
> state with TIF_SVE_EXEC only should not be observed.
> 
> This is based on earlier work by Julien Grall implementing a slightly
> different approach.
> 
> Signed-off-by: Mark Brown <broonie@kernel.org>

While I'm trying to get my head around this, can you sketch out the
mapping between the old flag combinations (including Julien's
TIF_SVE_NEEDS_FLUSH) and yours?

[...]

Cheers
---Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-01 15:35   ` Dave Martin
@ 2021-02-01 15:45     ` Mark Brown
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Brown @ 2021-02-01 15:45 UTC (permalink / raw)
  To: Dave Martin
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss



On Mon, Feb 01, 2021 at 03:35:53PM +0000, Dave Martin wrote:

> While I'm trying to get my head around this, can you sketch out the
> mapping between the old flag combinations (including Julien's
> TIF_SVE_NEEDS_FLUSH) and yours?

TIF_SVE => TIF_SVE_EXEC+TIF_SVE_FULL_REGS
TIF_SVE_NEEDS_FLUSH => TIF_SVE_EXEC

more or less.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps
  2021-02-01 12:28 [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Mark Brown
  2021-02-01 12:29 ` [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags Mark Brown
  2021-02-01 12:29 ` [PATCH v7 2/2] arm64/sve: Rework SVE trap access to minimise memory access Mark Brown
@ 2021-02-08 17:26 ` Dave Martin
  2021-02-09 18:22   ` Mark Brown
  2 siblings, 1 reply; 17+ messages in thread
From: Dave Martin @ 2021-02-08 17:26 UTC (permalink / raw)
  To: Mark Brown
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss

On Mon, Feb 01, 2021 at 12:28:59PM +0000, Mark Brown wrote:
> This patch series aims to improve the performance of handling SVE access
> traps; earlier versions were originally written by Julien Grall, but based
> on discussions on previous versions the patches have been substantially
> reworked to use a different approach.  The patches are now different
> enough that I set myself as the author; hopefully that's OK for Julien.
> 
> I've marked this as RFC since it's not quite ready yet but I'd really
> like feedback on the overall approach; it's a big change in
> implementation.  It needs at least one more pass for polish and while
> it's holding up in my testing thus far I've not done as much as I'd like
> yet.
> 
> Per the syscall ABI, SVE registers will be unknown after a syscall.  In
> practice, the kernel will disable SVE and the registers will be zeroed
> (except the first 128 bits of each vector) on the next SVE instruction.
> Currently we do this by saving the FPSIMD state to memory, converting to
> the matching SVE state and then reloading the registers on return to
> userspace.  This requires a lot of memory accesses that we shouldn't
> need; improve this by reworking the SVE state tracking so we track
> whether we should trap on executing SVE instructions separately from
> whether we need to save the full register state.  This allows us to
> avoid tracking the full SVE state until we need to return to userspace
> and to convert directly in registers in the common case where the
> FPSIMD state is still in registers at that point.
> 
> As with current mainline we disable SVE on every syscall.  This may not
> be ideal for applications that mix SVE and syscall usage, strategies
> such as SH's fpu_counter may perform better but we need to assess the
> performance on a wider range of systems than are currently available
> before implementing anything.
> 
> It is also possible to optimize the case when the SVE vector length
> is 128-bit (ie the same size as the FPSIMD vectors).  This could be
> explored in the future, it becomes a lot easier to do with this
> implementation.
> 
> v7:
>  - A few minor cosmetic updates and one bugfix for
>    fpsimd_update_current_state().

I see the following delta in this function:

@@ -1272,7 +1272,8 @@ void fpsimd_update_current_state(struct user_fpsimd_state const *state)
 	get_cpu_fpsimd_context();
 
 	current->thread.uw.fpsimd_state = *state;
-	clear_thread_flag(TIF_SVE_FULL_REGS);
+	if (test_thread_flag(TIF_SVE_FULL_REGS))
+		fpsimd_to_sve(current);
 
 	task_fpsimd_load();
 	fpsimd_bind_task_to_cpu();


Can you elaborate on that?

> v6:
>  - Substantially rework the patch so that TIF_SVE is now replaced by
>    two flags TIF_SVE_EXEC and TIF_SVE_FULL_REGS.
>  - Return to disabling SVE after every syscall as for current
>    mainline rather than leaving it enabled unless reset via ptrace.

Some general thoughts before I get into the detail:

I think the implementation flow in this series is roughly:

 a) Split TIF_SVE into two flags, but keep the flags in lockstep, and
    don't pretend to handle cases where they are unequal (no functional
    change).

 b) Add handling for cases where the flags are unequal (the only
    meaningful case being TIF_SVE_EXEC && !TIF_SVE_FULL_REGS) (purely
    dead code addition; no functional change).

 c) Add code that causes / resolves inequality of the flags in order to
    achieve the goal of the series (functional change).

Does that sound about right?  Patch 2 just does (c), broadly speaking,
but I couldn't convince myself whether patch 1 is introducing functional
changes or not.

The KVM parts of (b) could maybe benefit from their own commit message.
If (b) were all in a separate patch by itself, it would be
straightforward to split the host and KVM changes up.

Maybe splitting patch 1 into two would help clarify the situation.


Second, TIF_SVE_EXEC is advertised as being a pure indication of whether
the task is to be allowed to use SVE in userspace.  I find that this
jars a little with the fact that we leave TIF_SVE_EXEC set in the
!FULL_REGS case -- where we cannot enter userspace without doing some
additional work.  This is not necessarily a problem, but it was one
extra thing to get my head around.

There seems to be a preexisting convention that a set thread flag
implies a "dirtier" condition than when the flag is clear, though
obviously it works either way.  TIF_SVE_FULL_REGS here is really a work-
not-needed flag (whereas TIF_FOREIGN_FPSTATE and various others are
work-needed flags).  This also is not necessarily a problem, but I find
it a bit confusing.  Inverting the sense of TIF_SVE_FULL_REGS, or
a more explicit name such as TIF_SVE_FULL_REGS_VALID might make it
clearer that there is work to do in the "not valid" case and we cannot
blindly enter userspace or save the task state without taking any notice
of the flag.

These are not intended as criticisms -- it's hard to slice up the states
we need into nice orthogonal concepts, and I'm still waiting for this
refactored view to settle properly in my head.


Ultimately, we want a decision point somewhere about whether to fall
back to FPSIMD-only mode for a task or not, even if the decision is
trivial for now.  Where do you think that sits?  I'd generally assumed
that it would go in fpsimd_save() since that's where we take the hit for
saving (and later restoring) the state.  That function isn't preemptible
though, so it might also make sense to make the decision elsewhere if
there are suitable places to do it.

I say this because it is important for this series to make a foundation
that such a decision can be slotted naturally into -- that's part of the
motivation of making these changes in the first place (IIUC, anyway).

[...]

Cheers
---Dave


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-01 12:29 ` [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags Mark Brown
  2021-02-01 15:35   ` Dave Martin
@ 2021-02-09 17:59   ` Dave Martin
  2021-02-09 22:16     ` Mark Brown
  2021-02-10 10:56   ` Dave Martin
  2 siblings, 1 reply; 17+ messages in thread
From: Dave Martin @ 2021-02-09 17:59 UTC (permalink / raw)
  To: Mark Brown
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss

On Mon, Feb 01, 2021 at 12:29:00PM +0000, Mark Brown wrote:
> Currently we have a single flag TIF_SVE which says that a task is
> allowed to execute SVE instructions without trapping and also that full
> SVE register state is stored for the task.  This results in us doing
> extra work storing and restoring the full SVE register state even in
> those cases where the ABI is that only the first 128 bits of the Z0-Z31
> registers which are shared with the FPSIMD V0-V31 are valid.
> 
> In order to allow us to avoid these overheads split TIF_SVE up so that
> we have two separate flags, TIF_SVE_EXEC which allows execution of SVE
> instructions without trapping and TIF_SVE_FULL_REGS which indicates that
> the full SVE register state is stored.  If both are set the behaviour is
> as currently, if TIF_SVE_EXEC is set without TIF_SVE_FULL_REGS then we
> save and restore only the FPSIMD registers until we return to userspace
> with TIF_SVE_EXEC enabled at which point we convert the FPSIMD registers
> to SVE.  It is not meaningful to have TIF_SVE_FULL_REGS set without
> TIF_SVE_EXEC.
> 
> This patch is intended only to split the flags; it does not take
> advantage of the ability to set the flags independently and the new
> state with TIF_SVE_EXEC only should not be observed.
> 
> This is based on earlier work by Julien Grall implementing a slightly
> different approach.

General thoughts:

I'm wondering if TIF_SVE_FULL_REGS is actually conflating two things
here:
 a) whether the SVE regs can be discarded
 b) how the FPSIMD/SVE regs are encoded in thread_struct.

When implementing a trivial policy for discarding the SVE regs, we
discard the regs at the earliest opportunity, so (a) and (b) are not
very distinct.  But if we try to be cleverer later on then this would
break down.  If we separate the two meanings from the outset, would
it help to steer any future policy stuff in a more maintainable
direction.

This might mean having two flags instead of one.

> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/include/asm/thread_info.h |   3 +-
>  arch/arm64/kernel/fpsimd.c           | 174 +++++++++++++++++++--------
>  arch/arm64/kernel/process.c          |   7 +-
>  arch/arm64/kernel/ptrace.c           |   8 +-
>  arch/arm64/kernel/signal.c           |  15 ++-
>  arch/arm64/kernel/syscall.c          |   3 +-
>  arch/arm64/kvm/fpsimd.c              |   6 +-
>  7 files changed, 152 insertions(+), 64 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index 9f4e3b266f21..c856159e071c 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -65,6 +65,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
>  #define TIF_MTE_ASYNC_FAULT	5	/* MTE Asynchronous Tag Check Fault */
>  #define TIF_NOTIFY_SIGNAL	6	/* signal notifications exist */
> +#define TIF_SVE_EXEC		7	/* SVE instructions don't trap */
>  #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
>  #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
>  #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
> @@ -75,7 +76,7 @@ void arch_release_task_struct(struct task_struct *tsk);
>  #define TIF_RESTORE_SIGMASK	20
>  #define TIF_SINGLESTEP		21
>  #define TIF_32BIT		22	/* 32bit process */
> -#define TIF_SVE			23	/* Scalable Vector Extension in use */
> +#define TIF_SVE_FULL_REGS	23	/* Full SVE register set stored */
>  #define TIF_SVE_VL_INHERIT	24	/* Inherit sve_vl_onexec across exec */
>  #define TIF_SSBD		25	/* Wants SSB mitigation */
>  #define TIF_TAGGED_ADDR		26	/* Allow tagged user addresses */
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 062b21f30f94..58c749ef04c4 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -215,48 +215,70 @@ static bool have_cpu_fpsimd_context(void)
>   */
>  static void __sve_free(struct task_struct *task)
>  {
> +	/* SVE context will be zeroed when allocated. */
> +	clear_tsk_thread_flag(task, TIF_SVE_FULL_REGS);

Can you elaborate on why this is here?

I'm not sure this is the right place to clear flags.  This code is
really just for freeing the buffer, along with whatever checks are
necessary to confirm that this is safe / sane to do (i.e.,
TIF_SVE(_EXEC) set).

>  	kfree(task->thread.sve_state);
>  	task->thread.sve_state = NULL;
>  }
>  
>  static void sve_free(struct task_struct *task)
>  {
> -	WARN_ON(test_tsk_thread_flag(task, TIF_SVE));
> +	WARN_ON(test_tsk_thread_flag(task, TIF_SVE_EXEC));
>  
>  	__sve_free(task);
>  }
>  
>  /*
> - * TIF_SVE controls whether a task can use SVE without trapping while
> - * in userspace, and also the way a task's FPSIMD/SVE state is stored
> - * in thread_struct.
> + * In order to avoid the expense of storing the SVE registers when not
> + * in active use by tasks we keep track of the task's SVE usage and
> + * only allocate space for SVE registers for tasks that need it.  In
> + * addition since on first use and after every syscall only the portion
> + * of the SVE registers shared with FPSIMD are used we separately track
> + * if we need to actually save all that state.
>   *
> - * The kernel uses this flag to track whether a user task is actively
> - * using SVE, and therefore whether full SVE register state needs to
> - * be tracked.  If not, the cheaper FPSIMD context handling code can
> - * be used instead of the more costly SVE equivalents.
> + * TIF_SVE_EXEC controls whether a task can use SVE without trapping
> + * while in userspace.  TIF_SVE_FULL_REGS controls the way a task's
> + * FPSIMD/SVE state is stored in thread_struct. The kernel uses this
> + * flag to track whether a user task has active SVE state, and
> + * therefore whether full SVE register state needs to be tracked.  If
> + * not, the cheaper FPSIMD context handling code can be used instead
> + * of the more costly SVE equivalents.
>   *
> - *  * TIF_SVE set:
> + *  * FPSR and FPCR are always stored in task->thread.uw.fpsimd_state
> + *    irrespective of any flags, since these are not vector length
> + *    dependent.
>   *
> - *    The task can execute SVE instructions while in userspace without
> - *    trapping to the kernel.
> + *  * TIF_SVE_EXEC is not set:
>   *
> - *    When stored, Z0-Z31 (incorporating Vn in bits[127:0] or the
> - *    corresponding Zn), P0-P15 and FFR are encoded in in
> - *    task->thread.sve_state, formatted appropriately for vector
> - *    length task->thread.sve_vl.
> + *    An attempt by the user task to execute an SVE instruction causes
> + *    do_sve_acc() to be called, which does some preparation and sets
> + *    TIF_SVE_EXEC.
> + *
> + *    When stored, FPSIMD registers V0-V31 are encoded in
> + *    task->thread.uw.fpsimd_state; bits [max : 128] for each of Z0-Z31 are
> + *    logically zero but not stored anywhere; P0-P15 and FFR are not
> + *    stored and have unspecified values from userspace's point of
> + *    view.  For hygiene purposes, the kernel zeroes them on next use,
> + *    but userspace is discouraged from relying on this.
> + *
> + *    task->thread.sve_state does not need to be non-NULL, valid or any
> + *    particular size: it must not be dereferenced.  TIF_SVE_FULL_REGS
> + *    will have no effect and should never be set.
> + *
> + *  * TIF_SVE_EXEC set:
> + *
> + *    The task can execute SVE instructions while in userspace without
> + *    trapping to the kernel.  Storage of Z0-Z31 (incorporating Vn in
> + *    bits[0-127]) is determined by TIF_SVE_FULL_REGS.
>   *
>   *    task->thread.sve_state must point to a valid buffer at least
>   *    sve_state_size(task) bytes in size.
>   *
> - *    During any syscall, the kernel may optionally clear TIF_SVE and
> - *    discard the vector state except for the FPSIMD subset.
> - *
> - *  * TIF_SVE clear:
> + *    During any syscall the ABI allows the kernel to discard the
> + *    vector state other than the FPSIMD subset.  When this is done
> + *    both TIF_SVE_EXEC and TIF_SVE_FULL_REGS will be cleared.

Can this occur if TIF_SVE_FULL_REGS is initially set?  I thought
!TIF_SVE_FULL_REGS is our way of telling ourselves that we can discard the
SVE state at all...

>   *
> - *    An attempt by the user task to execute an SVE instruction causes
> - *    do_sve_acc() to be called, which does some preparation and then
> - *    sets TIF_SVE.
> + *  * TIF_SVE_FULL_REGS is not set:
>   *
>   *    When stored, FPSIMD registers V0-V31 are encoded in
>   *    task->thread.uw.fpsimd_state; bits [max : 128] for each of Z0-Z31 are
> @@ -265,12 +287,38 @@ static void sve_free(struct task_struct *task)
>   *    view.  For hygiene purposes, the kernel zeroes them on next use,
>   *    but userspace is discouraged from relying on this.
>   *
> - *    task->thread.sve_state does not need to be non-NULL, valid or any
> - *    particular size: it must not be dereferenced.
> + *    On entry to the kernel other than from a syscall the kernel must
> + *    set TIF_SVE_FULL_REGS and save the full register state.

Only if TIF_SVE_EXEC is set though?

I had thought that it is more natural to set TIF_SVE_FULL_REGS in
preparation for entering userspace.  This does mean that if we are
preempted between fpsimd_restore_current_state() and exception return to
userspace, then we would have to save the full regs unnecessarily.  This
can only happen during a small window though, so it should be very rare
unless the system is already thrashing.  This is the situation prior to
your series IIUC.

Alternatively, we set TIF_SVE_FULL_REGS when entering the kernel on a
non-syscall path with TIF_SVE_EXEC set - but that adds a small overhead
to every kernel entry and feels slightly messier.

> - *  * FPSR and FPCR are always stored in task->thread.uw.fpsimd_state
> - *    irrespective of whether TIF_SVE is clear or set, since these are
> - *    not vector length dependent.
> + *  * TIF_SVE_FULL_REGS is set:
> + *
> + *    This flag is only valid when TIF_SVE_EXEC is set.
> + *
> + *    When stored, Z0-Z31 (incorporating Vn in bits[127:0] or the
> + *    corresponding Zn), P0-P15 and FFR are encoded in in
> + *    task->thread.sve_state, formatted appropriately for vector
> + *    length task->thread.sve_vl.
> + *
> + *    On entry to the kernel from a syscall this flag and TIF_SVE_EXEC
> + *    are cleared and only the FPSIMD subset of the register state is
> + *    stored.
> + *
> + * In summary, combined with TIF_FOREIGN_FPSTATE:
> + *
> + *          !SVE           _EXEC           _EXEC+_FULL_REGS
> + *        +---------------+---------------+---------------+
> + *        | Valid: FPSIMD | Valid: FPSIMD | Valid: SVE    |
> + * !FFP   | Trap:  Yes    | Trap:  No     | Trap:  No     |
> + *        | Where: regs   | Where: regs   | Where: regs   |
> + *        +---------------+---------------+---------------+
> + *        | Valid: FPSIMD | Valid: FPSIMD | Valid: SVE    |
> + * FFP    | Trap:  Yes    | Trap:  No     | Trap:  No     |
> + *        | Where: memory | Where: memory | Where: memory |
> + *        +---------------+---------------+---------------+
> + *
> + * Where valid indicates what state is valid, trap indicates if we
> + * should trap on executing a SVE instruction and where indicates
> + * where the current copy of the register state is.

I wonder if it would help to describe TIF_SVE_FULL_REGS orthogonally to
TIF_SVE_EXEC.

Although we don't intend to implement the combination !TIF_SVE_EXEC &&
TIF_SVE_FULL_REGS, it does have a logically consistent meaning (SVE
disabled for userspace, but Vn nonetheless stored in sve_state, laid out
according to thread->sve_vl).

When flipping the flags, there may be a hazard where this combination
temporarily appears, but we could hide that in a helper that runs under
get_cpu_fpsimd_context() -- just like we already do for various other
things.

(We could avoid this by adding atomic thread_flags manipulators that
flip multiple flags, but that's overkill just for this.)


This makes no functional difference, but it might help simplify the
description a bit.

>   */
>  
>  /*
> @@ -279,18 +327,37 @@ static void sve_free(struct task_struct *task)
>   * This function should be called only when the FPSIMD/SVE state in
>   * thread_struct is known to be up to date, when preparing to enter
>   * userspace.
> + *
> + * When TIF_SVE_EXEC is set but TIF_SVE_FULL_REGS is not set the SVE
> + * state will be restored from the FPSIMD state.
>   */
>  static void task_fpsimd_load(void)
>  {
> +	unsigned int vl;
> +
>  	WARN_ON(!system_supports_fpsimd());
>  	WARN_ON(!have_cpu_fpsimd_context());
>  
> -	if (system_supports_sve() && test_thread_flag(TIF_SVE))
> -		sve_load_state(sve_pffr(&current->thread),
> -			       &current->thread.uw.fpsimd_state.fpsr,
> -			       sve_vq_from_vl(current->thread.sve_vl) - 1);
> -	else
> -		fpsimd_load_state(&current->thread.uw.fpsimd_state);
> +	if (test_thread_flag(TIF_SVE_EXEC)) {

Was system_supports_sve() dropped on purpose?  And why?

> +		vl = sve_vq_from_vl(current->thread.sve_vl) - 1;
> +
> +		/*
> +		 * We always return with the full register state, if

. If

(Otherwise this misparses as "We always return with the full register
state [] if there is no explicit SVE state []".)

> +		 * there is no explicit SVE state load from the FPSIMD
> +		 * state instead.
> +		 */

To keep line length down with less impact to readability, can we take a
pointer to current->thread.uw.fpsimd_state in advance here?  We use it
multiple times anyway.

> +		if (test_and_set_thread_flag(TIF_SVE_FULL_REGS))
> +			sve_load_state(sve_pffr(&current->thread),
> +				       &current->thread.uw.fpsimd_state.fpsr,
> +				       vl);
> +		else
> +			sve_load_from_fpsimd_state(&current->thread.uw.fpsimd_state,
> +						   vl);

As observed above, there could be an argument for setting
TIF_SVE_FULL_REGS here, since the only reason to actually load the regs
is in preparation for entering userspace, which can freely dirty all the
SVE bits undetected since trapping for EL0 will be disabled due to
TIF_SVE_EXEC being set.

> +
> +		return;
> +	}
> +
> +	fpsimd_load_state(&current->thread.uw.fpsimd_state);

Maybe this could be more readable, as well as reducing indentation on
the more complex branch of the if, as:

	if (!system_supports_sve() || !test_thread_flag(TIF_SVE_EXEC)) {
		fpsimd_load_state(&current->thread.uw.fpsimd_state);

		return;
	}

	/* handle TIF_SVE_EXEC case */

>  }
>  
>  /*
> @@ -307,7 +374,7 @@ static void fpsimd_save(void)
>  	WARN_ON(!have_cpu_fpsimd_context());
>  
>  	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
> -		if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
> +		if (system_supports_sve() && test_thread_flag(TIF_SVE_EXEC)) {
>  			if (WARN_ON(sve_get_vl() != last->sve_vl)) {
>  				/*
>  				 * Can't save the user regs, so current would
> @@ -318,6 +385,7 @@ static void fpsimd_save(void)
>  				return;
>  			}
>  
> +			set_thread_flag(TIF_SVE_FULL_REGS);

Why do this here?  What if we're in a syscall?

>  			sve_save_state((char *)last->sve_state +
>  						sve_ffr_offset(last->sve_vl),
>  				       &last->st->fpsr);
> @@ -536,7 +604,7 @@ void sve_alloc(struct task_struct *task)
>   */
>  void fpsimd_sync_to_sve(struct task_struct *task)
>  {
> -	if (!test_tsk_thread_flag(task, TIF_SVE))
> +	if (!test_tsk_thread_flag(task, TIF_SVE_FULL_REGS))
>  		fpsimd_to_sve(task);
>  }
>  
> @@ -550,7 +618,7 @@ void fpsimd_sync_to_sve(struct task_struct *task)
>   */
>  void sve_sync_to_fpsimd(struct task_struct *task)
>  {
> -	if (test_tsk_thread_flag(task, TIF_SVE))
> +	if (test_tsk_thread_flag(task, TIF_SVE_FULL_REGS))
>  		sve_to_fpsimd(task);
>  }
>  
> @@ -572,7 +640,7 @@ void sve_sync_from_fpsimd_zeropad(struct task_struct *task)
>  	void *sst = task->thread.sve_state;
>  	struct user_fpsimd_state const *fst = &task->thread.uw.fpsimd_state;
>  
> -	if (!test_tsk_thread_flag(task, TIF_SVE))
> +	if (!test_tsk_thread_flag(task, TIF_SVE_EXEC))
>  		return;
>  
>  	vq = sve_vq_from_vl(task->thread.sve_vl);
> @@ -627,8 +695,9 @@ int sve_set_vector_length(struct task_struct *task,
>  	}
>  
>  	fpsimd_flush_task_state(task);
> -	if (test_and_clear_tsk_thread_flag(task, TIF_SVE))
> +	if (test_and_clear_tsk_thread_flag(task, TIF_SVE_FULL_REGS))
>  		sve_to_fpsimd(task);

Won't this mean that we dereference thread->sve_state even if
TIF_SVE_EXEC was clear?

For the ptrace case, I think we can probably hit this, IIUC.

It might not apply to the prctl() case if TIF_SVE_EXEC was already
cleared during syscall entry (?), but as observed above this still
assumes that the SVE regs are discarded at the earliest opportunity,
which might not be true in future.

> +	clear_thread_flag(TIF_SVE_EXEC);
>  
>  	if (task == current)
>  		put_cpu_fpsimd_context();
> @@ -926,13 +995,14 @@ void fpsimd_release_task(struct task_struct *dead_task)
>   * Trapped SVE access
>   *
>   * Storage is allocated for the full SVE state, the current FPSIMD
> - * register contents are migrated across, and TIF_SVE is set so that
> + * register contents are migrated across, and TIF_SVE_EXEC is set so that
>   * the SVE access trap will be disabled the next time this task
>   * reaches ret_to_user.
>   *
> - * TIF_SVE should be clear on entry: otherwise, fpsimd_restore_current_state()
> - * would have disabled the SVE access trap for userspace during
> - * ret_to_user, making an SVE access trap impossible in that case.
> + * TIF_SVE_EXEC should be clear on entry: otherwise,
> + * fpsimd_restore_current_state() would have disabled the SVE access
> + * trap for userspace during ret_to_user, making an SVE access trap
> + * impossible in that case.
>   */
>  void do_sve_acc(unsigned int esr, struct pt_regs *regs)
>  {
> @@ -952,8 +1022,9 @@ void do_sve_acc(unsigned int esr, struct pt_regs *regs)
>  	fpsimd_flush_task_state(current);
>  
>  	fpsimd_to_sve(current);

Hmmm, there's a latent bug upstream here: if the WARN() fires then
sve_state is not safe to dereference.  But we already did.

So perhaps this should have been something like:

	if (!WARN_ON(test_and_set_thread_flag(TIF_SVE)))
		fpsimd_to_sve(current);

This might make sense as a separate Fixes patch to precede the series.


> -	if (test_and_set_thread_flag(TIF_SVE))
> +	if (test_and_set_thread_flag(TIF_SVE_EXEC))
>  		WARN_ON(1); /* SVE access shouldn't have trapped */
> +	set_thread_flag(TIF_SVE_FULL_REGS);
>  
>  	put_cpu_fpsimd_context();
>  }
> @@ -1033,7 +1104,8 @@ void fpsimd_flush_thread(void)
>  	       sizeof(current->thread.uw.fpsimd_state));
>  
>  	if (system_supports_sve()) {
> -		clear_thread_flag(TIF_SVE);
> +		clear_thread_flag(TIF_SVE_EXEC);
> +		clear_thread_flag(TIF_SVE_FULL_REGS);
>  		sve_free(current);
>  
>  		/*
> @@ -1092,7 +1164,7 @@ void fpsimd_preserve_current_state(void)
>  void fpsimd_signal_preserve_current_state(void)
>  {
>  	fpsimd_preserve_current_state();
> -	if (system_supports_sve() && test_thread_flag(TIF_SVE))
> +	if (system_supports_sve() && test_thread_flag(TIF_SVE_FULL_REGS))

So, here TIF_SVE_FULL_REGS is assumed to be meaningful even in the
absence of TIF_SVE_EXEC.  While that's not necessarily wrong, it's a bit
hard to prove it.  If we go down this route, then the condition for
sve_state validity should probably be TIF_SVE_FULL_REGS and not
TIF_SVE_EXEC, etc.

In the comments, you still say that if TIF_SVE_EXEC is set, then
sve_state must be valid.  If TIF_SVE_FULL_REGS is never set when
TIF_SVE_EXEC is clear then we get away with it, but I wonder whether
this would bitrot and break under future maintenance.

>  		sve_to_fpsimd(current);
>  }
>  
> @@ -1114,7 +1186,7 @@ void fpsimd_bind_task_to_cpu(void)
>  
>  	if (system_supports_sve()) {
>  		/* Toggle SVE trapping for userspace if needed */
> -		if (test_thread_flag(TIF_SVE))
> +		if (test_thread_flag(TIF_SVE_EXEC))
>  			sve_user_enable();
>  		else
>  			sve_user_disable();
> @@ -1163,6 +1235,14 @@ void fpsimd_restore_current_state(void)
>  	if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
>  		task_fpsimd_load();
>  		fpsimd_bind_task_to_cpu();
> +	} else {
> +		/*
> +		 * Convert FPSIMD state to SVE if userspace can execute SVE
> +		 * but we have no explicit SVE state.
> +		 */
> +		if (test_thread_flag(TIF_SVE_EXEC) &&
> +		    !test_and_set_thread_flag(TIF_SVE_FULL_REGS))
> +			sve_flush_live();
>  	}
>  
>  	put_cpu_fpsimd_context();
> @@ -1181,7 +1261,7 @@ void fpsimd_update_current_state(struct user_fpsimd_state const *state)
>  	get_cpu_fpsimd_context();
>  
>  	current->thread.uw.fpsimd_state = *state;
> -	if (system_supports_sve() && test_thread_flag(TIF_SVE))
> +	if (test_thread_flag(TIF_SVE_FULL_REGS))

Unintentionally dropped system_supports_sve()?

>  		fpsimd_to_sve(current);
>  
>  	task_fpsimd_load();
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6616486a58fe..71c8265b9139 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -364,13 +364,14 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
>  	 * Detach src's sve_state (if any) from dst so that it does not
>  	 * get erroneously used or freed prematurely.  dst's sve_state
>  	 * will be allocated on demand later on if dst uses SVE.
> -	 * For consistency, also clear TIF_SVE here: this could be done
> +	 * For consistency, also clear TIF_SVE_* here: this could be done
>  	 * later in copy_process(), but to avoid tripping up future
> -	 * maintainers it is best not to leave TIF_SVE and sve_state in
> +	 * maintainers it is best not to leave TIF_SVE_* and sve_state in
>  	 * an inconsistent state, even temporarily.
>  	 */
>  	dst->thread.sve_state = NULL;
> -	clear_tsk_thread_flag(dst, TIF_SVE);
> +	clear_tsk_thread_flag(dst, TIF_SVE_EXEC);
> +	clear_tsk_thread_flag(dst, TIF_SVE_FULL_REGS);
>  
>  	/* clear any pending asynchronous tag fault raised by the parent */
>  	clear_tsk_thread_flag(dst, TIF_MTE_ASYNC_FAULT);
> diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> index 8ac487c84e37..f0406b3dc389 100644
> --- a/arch/arm64/kernel/ptrace.c
> +++ b/arch/arm64/kernel/ptrace.c
> @@ -719,7 +719,7 @@ static void sve_init_header_from_task(struct user_sve_header *header,
>  
>  	memset(header, 0, sizeof(*header));
>  
> -	header->flags = test_tsk_thread_flag(target, TIF_SVE) ?
> +	header->flags = test_tsk_thread_flag(target, TIF_SVE_FULL_REGS) ?

Same observations as for fpsimd_signal_preserve_current_state().

>  		SVE_PT_REGS_SVE : SVE_PT_REGS_FPSIMD;
>  	if (test_tsk_thread_flag(target, TIF_SVE_VL_INHERIT))
>  		header->flags |= SVE_PT_VL_INHERIT;
> @@ -827,7 +827,8 @@ static int sve_set(struct task_struct *target,
>  	if ((header.flags & SVE_PT_REGS_MASK) == SVE_PT_REGS_FPSIMD) {
>  		ret = __fpr_set(target, regset, pos, count, kbuf, ubuf,
>  				SVE_PT_FPSIMD_OFFSET);
> -		clear_tsk_thread_flag(target, TIF_SVE);
> +		clear_tsk_thread_flag(target, TIF_SVE_EXEC);
> +		clear_tsk_thread_flag(target, TIF_SVE_FULL_REGS);
>  		goto out;
>  	}
>  
> @@ -851,7 +852,8 @@ static int sve_set(struct task_struct *target,
>  	 * unmodified.
>  	 */
>  	fpsimd_sync_to_sve(target);
> -	set_tsk_thread_flag(target, TIF_SVE);
> +	set_tsk_thread_flag(target, TIF_SVE_EXEC);
> +	set_tsk_thread_flag(target, TIF_SVE_FULL_REGS);
>  
>  	BUILD_BUG_ON(SVE_PT_SVE_OFFSET != sizeof(header));
>  	start = SVE_PT_SVE_OFFSET;
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index f71d6ce4673f..5d5610af7ea3 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -205,7 +205,8 @@ static int restore_fpsimd_context(struct fpsimd_context __user *ctx)
>  	__get_user_error(fpsimd.fpsr, &ctx->fpsr, err);
>  	__get_user_error(fpsimd.fpcr, &ctx->fpcr, err);
>  
> -	clear_thread_flag(TIF_SVE);
> +	clear_thread_flag(TIF_SVE_EXEC);
> +	clear_thread_flag(TIF_SVE_FULL_REGS);
>  
>  	/* load the hardware registers from the fpsimd_state structure */
>  	if (!err)
> @@ -229,7 +230,7 @@ static int preserve_sve_context(struct sve_context __user *ctx)
>  	unsigned int vl = current->thread.sve_vl;
>  	unsigned int vq = 0;
>  
> -	if (test_thread_flag(TIF_SVE))
> +	if (test_thread_flag(TIF_SVE_EXEC))
>  		vq = sve_vq_from_vl(vl);
>  
>  	memset(reserved, 0, sizeof(reserved));
> @@ -241,7 +242,7 @@ static int preserve_sve_context(struct sve_context __user *ctx)
>  	BUILD_BUG_ON(sizeof(ctx->__reserved) != sizeof(reserved));
>  	err |= __copy_to_user(&ctx->__reserved, reserved, sizeof(reserved));
>  
> -	if (vq) {
> +	if (vq && test_thread_flag(TIF_SVE_FULL_REGS)) {

Hmm, in theory we should be able to mark the regs as discardable once
they have been saved in the frame, though we never did that in the
past.  Since we never zeroed the extra bits on signal handler entry
anyway, we could actually skip the zeroing for this specific case.  Any
change of this sort should go in a separate patch though.

(Strictly speaking it could be an ABI break, though it would be pretty
insane -- and hard work -- for userspace to rely somehow on the full SVE
regs showing through to the signal handler.)

>  		/*
>  		 * This assumes that the SVE state has already been saved to
>  		 * the task struct by calling the function
> @@ -269,7 +270,8 @@ static int restore_sve_fpsimd_context(struct user_ctxs *user)
>  		return -EINVAL;
>  
>  	if (sve.head.size <= sizeof(*user->sve)) {
> -		clear_thread_flag(TIF_SVE);
> +		clear_thread_flag(TIF_SVE_EXEC);
> +		clear_thread_flag(TIF_SVE_FULL_REGS);
>  		goto fpsimd_only;
>  	}
>  
> @@ -296,7 +298,8 @@ static int restore_sve_fpsimd_context(struct user_ctxs *user)
>  	if (err)
>  		return -EFAULT;
>  
> -	set_thread_flag(TIF_SVE);
> +	set_thread_flag(TIF_SVE_EXEC);
> +	set_thread_flag(TIF_SVE_FULL_REGS);
>  
>  fpsimd_only:
>  	/* copy the FP and status/control registers */
> @@ -587,7 +590,7 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user,
>  	if (system_supports_sve()) {
>  		unsigned int vq = 0;
>  
> -		if (add_all || test_thread_flag(TIF_SVE)) {
> +		if (add_all || test_thread_flag(TIF_SVE_EXEC)) {

Doesn't this need to match the SVE reg dumping condition in
preserve_sve_context()?

>  			int vl = sve_max_vl;
>  
>  			if (!add_all)
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index f61e9d8cc55a..f8a2598730c2 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -186,7 +186,8 @@ static inline void sve_user_discard(void)
>  	if (!system_supports_sve())
>  		return;
>  
> -	clear_thread_flag(TIF_SVE);
> +	clear_thread_flag(TIF_SVE_EXEC);
> +	clear_thread_flag(TIF_SVE_FULL_REGS);

(Assuming this will be refined in the next patch.)

>  	/*
>  	 * task_fpsimd_load() won't be called to update CPACR_EL1 in
> diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
> index 3e081d556e81..1b7c0d03581b 100644
> --- a/arch/arm64/kvm/fpsimd.c
> +++ b/arch/arm64/kvm/fpsimd.c
> @@ -67,7 +67,7 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
>  			      KVM_ARM64_HOST_SVE_ENABLED);
>  	vcpu->arch.flags |= KVM_ARM64_FP_HOST;
>  
> -	if (test_thread_flag(TIF_SVE))
> +	if (test_thread_flag(TIF_SVE_EXEC))
>  		vcpu->arch.flags |= KVM_ARM64_HOST_SVE_IN_USE;

Here we're setting TIF_SVE(_EXEC) to describe the guest context, while
stashing off the host's TIF_SVE(_EXEC) flag so we can get it back later.

Don't we need to do a similar thing for TIF_SVE_FULL_REGS though?

For the vcpu context, TIF_SVE_FULL_REGS needs to be set for SVE-enabled
vcpus (where we don't have a !FULL_REGS case).  For the host, this flag
might always be clear if we have to be in the KVM_RUN ioctl to get
here, but if we're relying on that we should at least stick a WARN_ON()
here.

>  
>  	if (read_sysreg(cpacr_el1) & CPACR_EL1_ZEN_EL0EN)
> @@ -90,7 +90,7 @@ void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu)
>  					 vcpu->arch.sve_max_vl);
>  
>  		clear_thread_flag(TIF_FOREIGN_FPSTATE);
> -		update_thread_flag(TIF_SVE, vcpu_has_sve(vcpu));
> +		update_thread_flag(TIF_SVE_EXEC, vcpu_has_sve(vcpu));

Does TIF_SVE_FULL_REGS need to be set similarly?

>  	}
>  }
>  
> @@ -127,7 +127,7 @@ void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu)
>  			sysreg_clear_set(CPACR_EL1, CPACR_EL1_ZEN_EL0EN, 0);
>  	}
>  
> -	update_thread_flag(TIF_SVE,
> +	update_thread_flag(TIF_SVE_EXEC,
>  			   vcpu->arch.flags & KVM_ARM64_HOST_SVE_IN_USE);

And here does TIF_SVE_FULL_REGS need to be restored (or perhaps
unconditionally cleared if we believe we can't get here while the host
task has full regs?)

[...]

Cheers
---Dave

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps
  2021-02-08 17:26 ` [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Dave Martin
@ 2021-02-09 18:22   ` Mark Brown
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Brown @ 2021-02-09 18:22 UTC (permalink / raw)
  To: Dave Martin
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss



On Mon, Feb 08, 2021 at 05:26:10PM +0000, Dave Martin wrote:
> On Mon, Feb 01, 2021 at 12:28:59PM +0000, Mark Brown wrote:

> > v7:
> >  - A few minor cosmetic updates and one bugfix for
> >    fpsimd_update_current_state().

> I see the following delta in this function:

> @@ -1272,7 +1272,8 @@ void fpsimd_update_current_state(struct user_fpsimd_state const *state)
>  	get_cpu_fpsimd_context();

>  	current->thread.uw.fpsimd_state = *state;
> -	clear_thread_flag(TIF_SVE_FULL_REGS);
> +	if (test_thread_flag(TIF_SVE_FULL_REGS))
> +		fpsimd_to_sve(current);

>  	task_fpsimd_load();
>  	fpsimd_bind_task_to_cpu();

> Can you elaborate on that?

This restores the behaviour we have in mainline, I'd messed up the
update here when going through doing the changes to split the flags.
The function is called by the signal code to load both the SVE and
FPSIMD state, it can enable SVE.

> I think the implementation flow in this series is roughly:

>  a) Split TIF_SVE into two flags, but keep the flags in lockstep, and
>     don't pretend to handle cases where they are unequal (no functional
>     change).

>  b) Add handling for cases where the flags are unequal (the only
>     meaningful case being TIF_SVE_EXEC && !TIF_SVE_FULL_REGS) (purely
>     dead code addition; no functional change).

>  c) Add code that causes / resolves inequality of the flags in order to
>     achieve the goal of the series (functional change).

> Does that sound about right?  Patch 2 just does (c), broadly speaking,
> I couldn't convince myself whether patch 1 is introducing functional
> changes or not.

Yes, that's roughly the idea.

> The KVM parts of (b) could maybe benefit from their own commit message.
> If (b) were all in a separate patch by itself, it would be
> straightforward to split the host and KVM changes up.

> Maybe splitting patch 1 into two would help clarify the situation.

I did start off splitting it up more but doing it in a way that doesn't
result in messier changes due to needing to work with two flags and
manages to preserve bisectability ended up pushing me towards the
smaller set here.  It'd be easy to add a bit more verbiage about KVM to
the commit log though.

> Second, TIF_SVE_EXEC is advertised as being a pure indication of whether
> the task is to be allowed to use SVE in userspace.  I find that this
> jars a little with the fact that we leave TIF_SVE_EXEC set in the
> !FULL_REGS case -- where we cannot enter userspace without doing some
> additional work.  This is not necessarily a problem, but it was one
> extra thing to get my head around.

Right.

> There seems to be a preexisting convention that a set thread flag
> implies a "dirtier" condition than when the flag is clear, though
> obviously it works either way.  TIF_SVE_FULL_REGS here is really a work-
> not-needed flag (whereas TIF_FOREIGN_FPSTATE and various others are

I have to say I was completely unaware of that convention - it didn't
jump out at me from the code :/ .  I suppose you could also say it's a
work-needed flag in that it means that the code has to work with the
full register set rather than just the FPSIMD subset, but I guess it
really just depends on which particular bit of code you're looking at
at a given moment - on the save paths _FULL_REGS
means more work as you have to save all the SVE state rather than just
the FPSIMD subset.  As you say it doesn't *really* matter.

> work-needed flags).  This also is not necessarily a problem, but I find
> it a bit confusing.  Inverting the sense of TIF_SVE_FULL_REGS, or
> a more explicit name such as TIF_SVE_FULL_REGS_VALID might make it
> clearer that there is work to do in the "not valid" case and we cannot
> blindly enter userspace or save the task state without taking any notice
> of the flag.

I'm not at all attached to the name, if people find _FULL_REGS_VALID
or some other name clearer that works for me.

> Ultimately, we want a decision point somewhere about whether to fall
> back to FPSIMD-only mode for a task or not, even if the decision is
> trivial for now.  Where do you think that sits?  I'd generally assumed
> that it would go in fpsimd_save() since that's where we take the hit for
> saving (and later restoring) the state.  That function isn't preemptible
> though, so it might also make sense to make the decision elsewhere if
> there are suitable places to do it.

To me fpsimd_save() is more of a "do the thing we already decided to do"
function - it doesn't have enough context to make any decisions, and one
of the areas where we can benefit is when we never need to save at all,
for example when we take a SVE access trap and can initialize the SVE
state from the FPSIMD state already in registers and return to userspace
without saving to memory.

We definitely should optimize fpsimd_save() here, it shouldn't need to
set TIF_SVE_FULL_REGS or spill to the SVE state, but I wouldn't call it
a *decision* exactly.

> I say this because it is important for this series to make a foundation
> that such a decision can be slotted naturally into -- that's part of the
> motivation of making these changes in the first place (IIUC, anyway).

To me the goal was mainly about making the access trap case more
efficient and potentially also enabling further work like having an
equivalent to fpu_counter instead of disabling the SVE access trap on
every syscall.

The way I was thinking about this was more that the decision points were
on kernel entry and exit.  We need to generate the full register state
any time it's observed by something outside the kernel, and there are a
set of defined kernel entry points where the SVE state can be discarded,
but these are all individual points where an individual thing happens.

I'm therefore not 100% sure I'm fully understanding what you're thinking
of here.  It doesn't really seem like a single point thing to me but
I've got a feeling that I'm not exactly talking about the same thing as
you are.


* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-09 17:59   ` Dave Martin
@ 2021-02-09 22:16     ` Mark Brown
  2021-02-10 19:52       ` Mark Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Brown @ 2021-02-09 22:16 UTC (permalink / raw)
  To: Dave Martin
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss



On Tue, Feb 09, 2021 at 05:59:46PM +0000, Dave Martin wrote:

> I'm wondering if TIF_SVE_FULL_REGS is actually conflating two things
> here:
>  a) whether the SVE regs can be discarded
>  b) how the FPSIMD/SVE regs are encoded in thread_struct.

The way I was thinking about this was that at any point where we can
discard the SVE register state we should actually do so - I wasn't
considering a state where we could potentially discard them but didn't
actually want to do so.

> When implementing a trivial policy for discarding the SVE regs, we
> discard the regs at the earliest opportunity, so (a) and (b) are not
> very distinct.  But if we try to be cleverer later on then this would
> break down.  If we separate the two meanings from the outset, would
> it help to steer any future policy stuff in a more maintainable
> direction.

> This might mean having two flags instead of one.

I think if we want to track potential actions here then rather than
tracking a further "potentially discardable" state we should be tracking
the things that let us decide to discard the state such as being in a
syscall, it seems likely that things that allow us to have such a state
could also have similar implications for other areas of the ABI at some
point and that we'd be able to get more usage out of that tracking than
something which only applies to SVE.

> >  static void __sve_free(struct task_struct *task)
> >  {
> > +	/* SVE context will be zeroed when allocated. */
> > +	clear_tsk_thread_flag(task, TIF_SVE_FULL_REGS);

> Can you elaborate on why this is here?

> I'm not sure this is the right place to clear flags.  This code is
> really just for freeing the buffer, along with whatever checks are
> necessary to confirm that this is safe / sane to do. (i.e.,
> TIF_SVE(_EXEC) set.).

It felt like it lined up with the assignment of sve_state to NULL
immediately below.  It could equally go a level up in sve_free() I
think, the logic in splitting the two free functions isn't super clear
from the code TBH - it seems to be due to fpsimd_release_task() and
wanting to avoid the state check on SVE_EXEC in the existing code which
I guess people thought was valuable.

> > + *  * TIF_SVE_EXEC set:
> > + *
> > + *    The task can execute SVE instructions while in userspace without
> > + *    trapping to the kernel.  Storage of Z0-Z31 (incorporating Vn in
> > + *    bits[0-127]) is determined by TIF_SVE_FULL_REGS.
> >   *
> >   *    task->thread.sve_state must point to a valid buffer at least
> >   *    sve_state_size(task) bytes in size.
> >   *
> > - *    During any syscall, the kernel may optionally clear TIF_SVE and
> > - *    discard the vector state except for the FPSIMD subset.
> > - *
> > - *  * TIF_SVE clear:
> > + *    During any syscall the ABI allows the kernel to discard the
> > + *    vector state other than the FPSIMD subset.  When this is done
> > + *    both TIF_SVE_EXEC and TIF_SVE_FULL_REGS will be cleared.

> Can this occur if TIF_SVE_FULL_REGS is initially set?  I thought
> !TIF_SVE_FULL_REGS our way of telling ourselves that we can discard the
> SVE state at all...

I'm not sure what you mean here - what exactly is the "this" you're
referring to, and what do you mean by "initially set"?  The intent of
the above is to document the fact that when we do a syscall we both
disable SVE execution (so an access trap is generated) and discard the
SVE state.

As I said in the other mail, the earlier part of this comment is
intended to say that the intent of !TIF_SVE_FULL_REGS is that the state
for the SVE registers (if required) should be taken from the FPSIMD
registers, wherever they are stored at the time the SVE state is needed.

This documentation is attempting to cover the intended behaviour after
further changes to allow the two flags to be set separately, it didn't
seem sensible to try to document the behaviour when they're set in lock
step since otherwise there will be questions about what the flags are
supposed to mean separately when they're being introduced.

> 
> > + *  * TIF_SVE_FULL_REGS is not set:
> >   *
> >   *    When stored, FPSIMD registers V0-V31 are encoded in
> >   *    task->thread.uw.fpsimd_state; bits [max : 128] for each of Z0-Z31 are
> > @@ -265,12 +287,38 @@ static void sve_free(struct task_struct *task)
> >   *    view.  For hygiene purposes, the kernel zeroes them on next use,
> >   *    but userspace is discouraged from relying on this.
> >   *
> > - *    task->thread.sve_state does not need to be non-NULL, valid or any
> > - *    particular size: it must not be dereferenced.
> > + *    On entry to the kernel other than from a syscall the kernel must
> > + *    set TIF_SVE_FULL_REGS and save the full register state.

> Only if TIF_SVE_EXEC is set though?

Yes, I'll clarify this.

> I had thought that it is more natural to set TIF_SVE_FULL_REGS in
> preparation for entering userspace.  This does mean that if we are
> preempted between fpsimd_restore_current_state() and exception return to
> userspace, then we would have to save the full regs unnecessarily.  This
> can only happen during a small window though, so it should be very rare
> unless the system is already thrashing.  This is the situation prior to
> your series IIUC.

We need to set TIF_SVE_FULL_REGS on entry to the kernel so that if we
come to save the register state we know we need to save the full SVE
state (which userspace may have changed) and so that if we return to
userspace without saving we know if we need to zero the bits of SVE
state that are not shared with the FPSIMD state.  It is true that there
is a race after doing a conversion from FPSIMD in the exit path where we
might end up saving the full state again, as you say that's not changed
here and is pretty thin.  

> Alternatively, we set TIF_SVE_FULL_REGS when entering the kernel on a
> non-syscall path with TIF_SVE_EXEC set - but that adds a small overhead
> to every kernel entry and feels slightly messier.

The goal is to avoid that overhead in practice by ensuring that we never
leave the kernel without having both flags set, changing the check on
entry to the inverse condition.  I'll explicitly call that out in the
comment.

> I wonder if it would help to describe TIF_SVE_FULL_REGS orthogonally to
> TIF_SVE_EXEC.

> Although we don't intend to implement the combination !TIF_SVE_EXEC &&
> TIF_SVE_FULL_REGS, it does have a logically consistent meaning (SVE
> disabled for userspace, but Vn nonetheless stored in sve_state, laid out
> according to thread->sve_vl).

My worry with doing that was that as it's not a meaningful state it
would create confusion regarding how we could get into that situation
and what it would actually mean.

> When flipping the flags, there may be a hazard where this combination
> temporarily appears, but we could hide that in a helper that runs under
> get_cpu_fpsimd_context() -- just like we already do for various other
> things.

> (We could avoid this by adding atomic thread_flags manipulators that
> flip multiple flags, but that's overkill just for this.)

How about explicitly saying it's not meaningful and should never be
allowed to happen?  To me it's just undefined behaviour and anything
that is expedient is fine.

> > -	if (system_supports_sve() && test_thread_flag(TIF_SVE))
> > -		sve_load_state(sve_pffr(&current->thread),
> > -			       &current->thread.uw.fpsimd_state.fpsr,
> > -			       sve_vq_from_vl(current->thread.sve_vl) - 1);
> > -	else
> > -		fpsimd_load_state(&current->thread.uw.fpsimd_state);
> > +	if (test_thread_flag(TIF_SVE_EXEC)) {

> Was system_supports_sve() dropped on purpose?  And why?

There were concerns on earlier versions of the series about the
system_supports_sve() checks being redundant, mainly raised by new
checks being added in this pattern - if we have TIF_SVE and no SVE
hardware we've got a pretty bad bug so I was cutting them out.  I had
started going through and doing that but it looks like I got distracted
somewhere before actually finishing it and noting it in the revision
list.

> > +		if (test_and_set_thread_flag(TIF_SVE_FULL_REGS))
> > +			sve_load_state(sve_pffr(&current->thread),
> > +				       &current->thread.uw.fpsimd_state.fpsr,
> > +				       vl);
> > +		else
> > +			sve_load_from_fpsimd_state(&current->thread.uw.fpsimd_state,
> > +						   vl);

> As observed above, there could be an argument for setting
> TIF_SVE_FULL_REGS here, since the only reason to actually load the regs
> is in preparation for entering userspace, which can freely dirty all the
> SVE bits undetected since trapping for EL0 will be disabled due to
> TIF_SVE_EXEC being set.

Yes, we could do that - I guess it's OK to squash into this change.

> > +	fpsimd_load_state(&current->thread.uw.fpsimd_state);

> Maybe this could be more readable, as well as reducing indentation on
> the more complex branch of the if, as:

> 	if (!system_supports_sve() || !test_thread_flag(TIF_SVE_EXEC)) {
> 		fpsimd_load_state(&current->thread.uw.fpsimd_state);
> 
> 		return;
> 	}

> 	/* handle TIF_SVE_EXEC case */

Sure.

> > @@ -318,6 +385,7 @@ static void fpsimd_save(void)
> >  				return;
> >  			}
> >  
> > +			set_thread_flag(TIF_SVE_FULL_REGS);

> Why do this here?  What if we're in a syscall?

There's no problem if we're in a syscall since we'll have already
cleared TIF_SVE_EXEC but yes, we shouldn't do this here - I think it's
just left over from my initial pass through where I was trying to
explictly handle both flags at once which just ended up being too
tortuous.

> > @@ -627,8 +695,9 @@ int sve_set_vector_length(struct task_struct *task,
> >  	}
> >  
> >  	fpsimd_flush_task_state(task);
> > -	if (test_and_clear_tsk_thread_flag(task, TIF_SVE))
> > +	if (test_and_clear_tsk_thread_flag(task, TIF_SVE_FULL_REGS))
> >  		sve_to_fpsimd(task);

> Won't this mean that we dereference thread->sve_state even if
> TIF_SVE_EXEC was clear?

Yes, potentially if we let the flag be set.  I'll add an extra check
here and in other similar places you mentioned.

> For the ptrace case, I think we can probably can hit this, IIUC.

> It might not apply to the prctl() case if TIF_SVE_EXEC was already
> cleared during syscall entry (?), but as observed above this still
> assumes that the SVE regs are discarded at the earliest opportunity,
> which might not be true in future.

I would hope that if the SVE registers become actually invalid that'd
also involve clearing TIF_SVE_FULL_REGS but yeah.

> > @@ -952,8 +1022,9 @@ void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> >  	fpsimd_flush_task_state(current);
> >  
> >  	fpsimd_to_sve(current);

> Hmmm, there's a latent bug upstream here: if the WARN() fires then
> sve_state is not safe to dereference.  But we already did.

> So perhaps this should have been something like:

> 	if (!WARN_ON(test_and_set_thread_flag(TIF_SVE)))
> 		fpsimd_to_sve(current);

> This might make sense as a separate Fixes patch to precede the series.

Yes, that's definitely a separate fix I think.

> > @@ -1181,7 +1261,7 @@ void fpsimd_update_current_state(struct user_fpsimd_state const *state)
> >  	get_cpu_fpsimd_context();
> >  
> >  	current->thread.uw.fpsimd_state = *state;
> > -	if (system_supports_sve() && test_thread_flag(TIF_SVE))
> > +	if (test_thread_flag(TIF_SVE_FULL_REGS))

> Unintentionally dropped system_supports_sve()?

As above, discussion on previous revisions suggested removing these - I
actually spotted a few more that I'd missed here.

> > @@ -241,7 +242,7 @@ static int preserve_sve_context(struct sve_context __user *ctx)
> >  	BUILD_BUG_ON(sizeof(ctx->__reserved) != sizeof(reserved));
> >  	err |= __copy_to_user(&ctx->__reserved, reserved, sizeof(reserved));
> >  
> > -	if (vq) {
> > +	if (vq && test_thread_flag(TIF_SVE_FULL_REGS)) {
> 
> Hmm, in theory we should be able to mark the regs as discardable once
> they have been saved in the frame, though we never did that in the
> past.  Since we never zeroed the extra bits on signal handler entry
> anyway, we could actually skip the zeroing for this specific case.  Any
> change of this sort should go in a separate patch though.

Definitely a separate change, and probably sensible to split it from
this series.

> > @@ -587,7 +590,7 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user,
> >  	if (system_supports_sve()) {
> >  		unsigned int vq = 0;
> >  
> > -		if (add_all || test_thread_flag(TIF_SVE)) {
> > +		if (add_all || test_thread_flag(TIF_SVE_EXEC)) {
> 
> Doesn't this need to match the SVE reg dumping condition in
> preserve_sve_context()?

Possibly, yeah.

> > --- a/arch/arm64/kernel/syscall.c
> > +++ b/arch/arm64/kernel/syscall.c
> > @@ -186,7 +186,8 @@ static inline void sve_user_discard(void)
> >  	if (!system_supports_sve())
> >  		return;

> > -	clear_thread_flag(TIF_SVE);
> > +	clear_thread_flag(TIF_SVE_EXEC);
> > +	clear_thread_flag(TIF_SVE_FULL_REGS);

> (Assuming this will be refined in the next patch.)

No - could you elaborate on what refinement you were expecting here?

> > -	if (test_thread_flag(TIF_SVE))
> > +	if (test_thread_flag(TIF_SVE_EXEC))
> >  		vcpu->arch.flags |= KVM_ARM64_HOST_SVE_IN_USE;

> Here we're setting TIF_SVE(_EXEC) to describe the guest context, while
> stashing off the host's TIF_SVE(_EXEC) flag so we can get it back later.

> Don't we need to do a similar thing for TIF_SVE_FULL_REGS though?

Yes, I think this needs redoing...


* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-01 12:29 ` [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags Mark Brown
  2021-02-01 15:35   ` Dave Martin
  2021-02-09 17:59   ` Dave Martin
@ 2021-02-10 10:56   ` Dave Martin
  2021-02-10 14:54     ` Mark Brown
  2 siblings, 1 reply; 17+ messages in thread
From: Dave Martin @ 2021-02-10 10:56 UTC (permalink / raw)
  To: Mark Brown
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss

On Mon, Feb 01, 2021 at 12:29:00PM +0000, Mark Brown wrote:
> Currently we have a single flag TIF_SVE which says that a task is
> allowed to execute SVE instructions without trapping and also that full
> SVE register state is stored for the task.  This results in us doing
> extra work storing and restoring the full SVE register state even in
> those cases where the ABI is that only the first 128 bits of the Z0-V31
> registers which are shared with the FPSIMD V0-V31 are valid.
> 
> In order to allow us to avoid these overheads split TIF_SVE up so that
> we have two separate flags, TIF_SVE_EXEC which allows execution of SVE
> instructions without trapping and TIF_SVE_FULL_REGS which indicates that
> the full SVE register state is stored.  If both are set the behaviour is
> as currently, if TIF_SVE_EXEC is set without TIF_SVE_FULL_REGS then we
> save and restore only the FPSIMD registers until we return to userspace
> with TIF_SVE_EXEC enabled at which point we convert the FPSIMD registers
> to SVE.  It is not meaningful to have TIF_SVE_FULL_REGS set without
> TIF_SVE_EXEC.
> 
> This patch is intended only to split the flags, it does not take
> advantage of the ability to set the flags independently and the new
> state with TIF_SVE_EXEC only should not be observed.
> 
> This is based on earlier work by Julien Grall implementing a slightly
> different approach.
> 
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---

[...]

> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c

[...]

> @@ -279,18 +327,37 @@ static void sve_free(struct task_struct *task)
>   * This function should be called only when the FPSIMD/SVE state in
>   * thread_struct is known to be up to date, when preparing to enter
>   * userspace.
> + *
> + * When TIF_SVE_EXEC is set but TIF_SVE_FULL_REGS is not set the SVE
> + * state will be restored from the FPSIMD state.
>   */
>  static void task_fpsimd_load(void)
>  {
> +	unsigned int vl;
> +
>  	WARN_ON(!system_supports_fpsimd());
>  	WARN_ON(!have_cpu_fpsimd_context());
>  
> -	if (system_supports_sve() && test_thread_flag(TIF_SVE))
> -		sve_load_state(sve_pffr(&current->thread),
> -			       &current->thread.uw.fpsimd_state.fpsr,
> -			       sve_vq_from_vl(current->thread.sve_vl) - 1);
> -	else
> -		fpsimd_load_state(&current->thread.uw.fpsimd_state);
> +	if (test_thread_flag(TIF_SVE_EXEC)) {
> +		vl = sve_vq_from_vl(current->thread.sve_vl) - 1;

One more nit: because of the confusion that can arise from "vl" being a
somewhat overloaded term in the architecture, I was trying to avoid
using the name "vl" for anything that isn't the vector length in bytes.

Can this instead be renamed to vq_minus_1 to match the function
arguments it's passed to?

(You could save a couple of lines by moving the declaration here and
combining it with this assignment too.)

[...]

Cheers
---Dave

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v7 2/2] arm64/sve: Rework SVE trap access to minimise memory access
  2021-02-01 12:29 ` [PATCH v7 2/2] arm64/sve: Rework SVE trap access to minimise memory access Mark Brown
@ 2021-02-10 11:09   ` Dave Martin
  2021-02-10 17:54     ` Mark Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Martin @ 2021-02-10 11:09 UTC (permalink / raw)
  To: Mark Brown
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss

On Mon, Feb 01, 2021 at 12:29:01PM +0000, Mark Brown wrote:
> When we take a SVE access trap only the subset of the SVE Z0-Z31
> registers shared with the FPSIMD V0-V31 registers is valid, the rest
> of the bits in the SVE registers must be cleared before returning to
> userspace.  Currently we do this by saving the current FPSIMD register
> state to the task struct and then using that to initialize the copy of
> the SVE registers in the task struct so they can be loaded from there
> into the registers.  This requires a lot more memory access than we
> need.
> 
> The newly added TIF_SVE_FULL_REGS can be used to reduce this overhead -
> instead of doing the conversion immediately we can set only TIF_SVE_EXEC
> and not TIF_SVE_FULL_REGS.  This means that until we return to userspace
> we only need to store the FPSIMD registers and if (as should be the
> common case) the hardware still has the task state and does not need
> that to be reloaded from the task struct we can do the initialization of
> the SVE state entirely in registers.  In the event that we do need to
> reload the registers from the task struct only the FPSIMD subset needs
> to be loaded from memory.
> 
> If the FPSIMD state is loaded then we need to set the vector length.
> This is because the vector length is only set when loading from memory,
> the expectation is that the vector length is set when TIF_SVE_EXEC is
> set.  We also need to rebind the task to the CPU so the newly allocated
> SVE state is used when the task is saved.
> 
> This is based on earlier work by Julien Grall implementing a similar idea.
> 
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/include/asm/fpsimd.h  |  2 ++
>  arch/arm64/kernel/entry-fpsimd.S |  5 +++++
>  arch/arm64/kernel/fpsimd.c       | 35 +++++++++++++++++++++-----------
>  3 files changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
> index bec5f14b622a..e60aa4ebb351 100644
> --- a/arch/arm64/include/asm/fpsimd.h
> +++ b/arch/arm64/include/asm/fpsimd.h
> @@ -74,6 +74,8 @@ extern void sve_load_from_fpsimd_state(struct user_fpsimd_state const *state,
>  				       unsigned long vq_minus_1);
>  extern unsigned int sve_get_vl(void);
>  
> +extern void sve_set_vq(unsigned long vq_minus_1);
> +
>  struct arm64_cpu_capabilities;
>  extern void sve_kernel_enable(const struct arm64_cpu_capabilities *__unused);
>  
> diff --git a/arch/arm64/kernel/entry-fpsimd.S b/arch/arm64/kernel/entry-fpsimd.S
> index 2ca395c25448..3ecec60d3295 100644
> --- a/arch/arm64/kernel/entry-fpsimd.S
> +++ b/arch/arm64/kernel/entry-fpsimd.S
> @@ -48,6 +48,11 @@ SYM_FUNC_START(sve_get_vl)
>  	ret
>  SYM_FUNC_END(sve_get_vl)
>  
> +SYM_FUNC_START(sve_set_vq)
> +	sve_load_vq x0, x1, x2
> +	ret
> +SYM_FUNC_END(sve_set_vq)
> +
>  /*
>   * Load SVE state from FPSIMD state.
>   *
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 58c749ef04c4..05caf207e2ce 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -994,10 +994,10 @@ void fpsimd_release_task(struct task_struct *dead_task)
>  /*
>   * Trapped SVE access
>   *
> - * Storage is allocated for the full SVE state, the current FPSIMD
> - * register contents are migrated across, and TIF_SVE_EXEC is set so that
> - * the SVE access trap will be disabled the next time this task
> - * reaches ret_to_user.
> + * Storage is allocated for the full SVE state so that the code
> + * running subsequently has somewhere to save the SVE registers to. We
> + * then rely on ret_to_user to actually convert the FPSIMD registers
> + * to SVE state by flushing as required.
>   *
>   * TIF_SVE_EXEC should be clear on entry: otherwise,
>   * fpsimd_restore_current_state() would have disabled the SVE access
> @@ -1016,15 +1016,26 @@ void do_sve_acc(unsigned int esr, struct pt_regs *regs)
>  
>  	get_cpu_fpsimd_context();
>  
> -	fpsimd_save();
> -
> -	/* Force ret_to_user to reload the registers: */
> -	fpsimd_flush_task_state(current);
> -
> -	fpsimd_to_sve(current);
> +	/*
> +	 * We shouldn't trap if we can execute SVE instructions and
> +	 * there should be no SVE state if that is the case.
> +	 */
>  	if (test_and_set_thread_flag(TIF_SVE_EXEC))
> -		WARN_ON(1); /* SVE access shouldn't have trapped */
> -	set_thread_flag(TIF_SVE_FULL_REGS);
> +		WARN_ON(1);
> +	if (test_and_clear_thread_flag(TIF_SVE_FULL_REGS))
> +		WARN_ON(1);
> +
> +	/*
> +	 * When the FPSIMD state is loaded:
> +	 *      - The return path (see fpsimd_restore_current_state) requires
> +	 *        the vector length to be loaded beforehand.
> +	 *      - We need to rebind the task to the CPU so the newly allocated
> +	 *        SVE state is used when the task is saved.
> +	 */
> +	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
> +		sve_set_vq(sve_vq_from_vl(current->thread.sve_vl) - 1);

Hmmm, I can see why we need this here, but it feels slightly odd.
Still, I don't have a better idea.

Logically, this is all part of a single state change, where we
transition from live FPSIMD-only state in the registers to live SVE
state with a pending flush.  Although we could wrap that up in a helper,
we only do this particular transition here so I guess factoring it out
may not be worth it.

> +		fpsimd_bind_task_to_cpu();
> +	}
>  
>  	put_cpu_fpsimd_context();

From here, can things go wrong if we get preempted and scheduled out?

I think fpsimd_save() would just set TIF_SVE_FULL_REGS and save out the
full register data, which may contain stale data in the non-FPSIMD bits
because we haven't flushed them yet.

Assuming I've not confused myself here, the same thing could probably
happen with Ard's changes if a softirq uses kernel_neon_begin(), causing
fpsimd_save() to get called.

Cheers
---Dave


* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-10 10:56   ` Dave Martin
@ 2021-02-10 14:54     ` Mark Brown
  2021-02-10 15:42       ` Dave Martin
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Brown @ 2021-02-10 14:54 UTC (permalink / raw)
  To: Dave Martin
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss


On Wed, Feb 10, 2021 at 10:56:55AM +0000, Dave Martin wrote:
> On Mon, Feb 01, 2021 at 12:29:00PM +0000, Mark Brown wrote:

> > +	if (test_thread_flag(TIF_SVE_EXEC)) {
> > +		vl = sve_vq_from_vl(current->thread.sve_vl) - 1;

> One more nit: because of the confusion that can arise from "vl" being a
> somewhat overloaded term in the architecture, I was trying to avoid
> using the name "vl" for anything that isn't the vector length in bytes.

> Can this instead be renamed to vq_minus_1 to match the function
> arguments it's passed to?

Oh, *that's* what that's all about.  I spent quite a bit of time trying
to figure out why we were sometimes using vq_minus_1 but never managed
to get to the bottom of it - it's an awkward name and there's nothing in
the code that explains the logic behind when we use it so it was really
confusing.  We can do the rename but I'm not sure it's achieving the
goal of comprehensibility.

> (You could save a couple of lines by moving the declaration here and
> combining it with this assignment too.)

Not really the coding style in the file though.


* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-10 14:54     ` Mark Brown
@ 2021-02-10 15:42       ` Dave Martin
  2021-02-10 17:14         ` Mark Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Martin @ 2021-02-10 15:42 UTC (permalink / raw)
  To: Mark Brown
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss

On Wed, Feb 10, 2021 at 02:54:52PM +0000, Mark Brown wrote:
> On Wed, Feb 10, 2021 at 10:56:55AM +0000, Dave Martin wrote:
> > On Mon, Feb 01, 2021 at 12:29:00PM +0000, Mark Brown wrote:
> 
> > > +	if (test_thread_flag(TIF_SVE_EXEC)) {
> > > +		vl = sve_vq_from_vl(current->thread.sve_vl) - 1;
> 
> > One more nit: because of the confusion that can arise from "vl" being a
> > somewhat overloaded term in the architecture, I was trying to avoid
> > using the name "vl" for anything that isn't the vector length in bytes.
> 
> > Can this instead be renamed to vq_minus_1 to match the function
> > arguments it's passed to?
> 
> Oh, *that's* what that's all about.  I spent quite a bit of time trying
> to figure out why we were sometimes using vq_minus_1 but never managed
> to get to the bottom of it - it's an awkward name and there's nothing in
> the code that explains the logic behind when we use it so it was really
> confusing.  We can do the rename but I'm not sure it's achieving the
> goal of comprehensibility.

Ah, I see.  The reason for the difference is that the vector length is
encoded in ZCR_ELx.LEN as the vector length in quadwords ("vq" -- see
Documentation/arm64/sve.rst) minus one.  It seemed poor practice to do
the conversion in asm where the compiler can't see or optimise it, plus
I didn't want the possibility of passing meaningless values at that
level.  So the caller has to validate the vector length with
sve_vl_valid() where deemed necessary, and then convert explicitly.
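[The convention can be sketched in C; SVE_VQ_BYTES mirrors the kernel's
uapi definition (one quadword is 16 bytes), but the exact bounds check
below is an illustrative assumption based on the architectural VL range,
not a copy of the kernel's macro:]

```c
#include <assert.h>
#include <stdbool.h>

#define SVE_VQ_BYTES 16u	/* one quadword = 128 bits */

/* Illustrative bounds: architectural VL range is 16..256 bytes in
 * 16-byte increments (the kernel headers allow extra headroom). */
static bool sve_vl_valid(unsigned int vl)
{
	return vl % SVE_VQ_BYTES == 0 && vl >= 16 && vl <= 256;
}

/* "vl" is always bytes, "vq" always quadwords; ZCR_ELx.LEN takes vq - 1 */
static unsigned int sve_vq_from_vl(unsigned int vl)
{
	return vl / SVE_VQ_BYTES;
}
```

[So a 256-bit (32-byte) vector length validates, converts to vq = 2, and
is programmed into ZCR_ELx.LEN as 1.]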

Either way, calling this "vl" is breaking a useful convention that's
followed throughout the rest of the kernel, so I'd prefer we call it
something else -- but within reason, I don't mind what name is used.

> 
> > (You could save a couple of lines by moving the declaration here and
> > combining it with this assignment too.)
> 
> Not really the coding style in the file though.

I'm not sure there's really a rigid convention in this file, other than
keeping declarations at the start of blocks.  I tend to push
declarations down when it doesn't harm readability -- i.e., when the
function is more than a screenful and/or the relevant block already has
braces enclosing several lines.  But it's a moot point.

Cheers
---Dave


* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-10 15:42       ` Dave Martin
@ 2021-02-10 17:14         ` Mark Brown
  2021-02-10 18:15           ` Dave Martin
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Brown @ 2021-02-10 17:14 UTC (permalink / raw)
  To: Dave Martin
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss


On Wed, Feb 10, 2021 at 03:42:51PM +0000, Dave Martin wrote:
> On Wed, Feb 10, 2021 at 02:54:52PM +0000, Mark Brown wrote:

> > Oh, *that's* what that's all about.  I spent quite a bit of time trying
> > to figure out why we were sometimes using vq_minus_1 but never managed
> > to get to the bottom of it - it's an awkward name and there's nothing in
> > the code that explains the logic behind when we use it so it was really
> > confusing.  We can do the rename but I'm not sure it's achieving the
> > goal of comprehensibility.

> Ah, I see.  The reason for the difference is that the vector length is
> encoded in ZCR_ELx.LEN as the vector length in quadwords ("vq" -- see
> Documentation/arm64/sve.rst) minus one.  It seemed poor practice to do
> the conversion in asm where the compiler can't see or optimise it, plus
> I didn't want the possibility of passing meaningless values at that
> level.  So the caller has to validate the vector length with
> sve_vl_valid() where deemed necessary, and then convert explicitly.

Yeah, it's relatively clear to get to the fact that it's due to the
ZCR_ELx.LEN - what was not at all clear was what the rule for choosing
between the two representations was, my instinct would've been to hide
the different representation, something like a static inline wrapper for
the assembly function would still let the compiler see what's going on.  
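[A minimal sketch of that idea follows; the wrapper name is made up, and
sve_set_vq() is stubbed here so the example is self-contained (the real
one is the asm helper from the patch that writes ZCR_EL1.LEN):]

```c
#include <assert.h>

#define SVE_VQ_BYTES 16u

/* Stub standing in for the asm-level helper from the patch. */
static unsigned long last_vq_minus_1;
static void sve_set_vq(unsigned long vq_minus_1)
{
	last_vq_minus_1 = vq_minus_1;
}

/* Hypothetical inline wrapper: callers pass the vector length in bytes
 * and the quadword encoding stays private to the asm boundary, while
 * the compiler can still see (and fold) the conversion. */
static inline void sve_set_vl(unsigned int vl)
{
	sve_set_vq(vl / SVE_VQ_BYTES - 1);
}
```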

> Either way, calling this "vl" is breaking a useful convention that's
> followed throughout the rest of the kernel, so I'd prefer we call it
> something else -- but within reason, I don't mind what name is used.

That's the convention in the arm64 FP code or something else?


* Re: [PATCH v7 2/2] arm64/sve: Rework SVE trap access to minimise memory access
  2021-02-10 11:09   ` Dave Martin
@ 2021-02-10 17:54     ` Mark Brown
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Brown @ 2021-02-10 17:54 UTC (permalink / raw)
  To: Dave Martin
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss


On Wed, Feb 10, 2021 at 11:09:56AM +0000, Dave Martin wrote:
> On Mon, Feb 01, 2021 at 12:29:01PM +0000, Mark Brown wrote:

> > +	/*
> > +	 * When the FPSIMD state is loaded:
> > +	 *      - The return path (see fpsimd_restore_current_state) requires
> > +	 *        the vector length to be loaded beforehand.
> > +	 *      - We need to rebind the task to the CPU so the newly allocated
> > +	 *        SVE state is used when the task is saved.
> > +	 */
> > +	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
> > +		sve_set_vq(sve_vq_from_vl(current->thread.sve_vl) - 1);

> Hmmm, I can see why we need this here, but it feels slightly odd.
> Still, I don't have a better idea.

> Logically, this is all part of a single state change, where we
> transition from live FPSIMD-only state in the registers to live SVE
> state with a pending flush.  Although we could wrap that up in a helper,
> we only do this particular transition here so I guess factoring it out
> may not be worth it.

Yes, such a helper would have exactly one user so it's a bit unclear if
it's sensible to factor it out.

> > +		fpsimd_bind_task_to_cpu();
> > +	}
> >  
> >  	put_cpu_fpsimd_context();

> From here, can things go wrong if we get preempted and scheduled out?

> I think fpsimd_save() would just set TIF_SVE_FULL_REGS and save out the
> full register data, which may contain stale data in the non-FPSIMD bits
> because we haven't flushed them yet.

> Assuming I've not confused myself here, the same thing could probably
> happen with Ard's changes if a softirq uses kernel_neon_begin(), causing
> fpsimd_save() to get called.

I think the issue here is actually in fpsimd_save() which as you said in
one of your other mails shouldn't be forcing on TIF_SVE_FULL_REGS, it
should just be saving whatever is specified by TIF_SVE_FULL_REGS.  That
way if we get scheduled between do_sve_acc() and returning to userspace
only the FPSIMD registers will be saved and we'll set TIF_SVE_FULL_REGS
and do the flush when reloading the registers prior to returning to
userspace.
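[As a toy model of that distinction (purely illustrative, not the kernel
code): a save path that forces a full SVE save whenever SVE execution is
enabled can write out stale high bits, whereas one that honours
TIF_SVE_FULL_REGS saves only what is actually valid:]

```c
#include <assert.h>
#include <stdbool.h>

enum saved { SAVED_FPSIMD, SAVED_FULL_SVE };

/* Problematic behaviour: any task with SVE execution enabled gets a
 * full save, even if the non-FPSIMD bits were never flushed. */
static enum saved save_forcing(bool sve_exec, bool sve_full_regs)
{
	(void)sve_full_regs;
	return sve_exec ? SAVED_FULL_SVE : SAVED_FPSIMD;
}

/* Behaviour described above: save full state only when the flag says
 * the full register state actually exists. */
static enum saved save_honouring(bool sve_exec, bool sve_full_regs)
{
	return (sve_exec && sve_full_regs) ? SAVED_FULL_SVE : SAVED_FPSIMD;
}
```

[The interesting case is a task preempted between do_sve_acc() and
ret_to_user, i.e. TIF_SVE_EXEC set but TIF_SVE_FULL_REGS clear.]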


* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-10 17:14         ` Mark Brown
@ 2021-02-10 18:15           ` Dave Martin
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Martin @ 2021-02-10 18:15 UTC (permalink / raw)
  To: Mark Brown
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss

On Wed, Feb 10, 2021 at 05:14:42PM +0000, Mark Brown wrote:
> On Wed, Feb 10, 2021 at 03:42:51PM +0000, Dave Martin wrote:
> > On Wed, Feb 10, 2021 at 02:54:52PM +0000, Mark Brown wrote:
> 
> > > Oh, *that's* what that's all about.  I spent quite a bit of time trying
> > > to figure out why we were sometimes using vq_minus_1 but never managed
> > > to get to the bottom of it - it's an awkward name and there's nothing in
> > > the code that explains the logic behind when we use it so it was really
> > > confusing.  We can do the rename but I'm not sure it's achieving the
> > > goal of comprehensibility.
> 
> > Ah, I see.  The reason for the difference is that the vector length is
> > encoded in ZCR_ELx.LEN as the vector length in quadwords ("vq" -- see
> > Documentation/arm64/sve.rst) minus one.  It seemed poor practice to do
> > the conversion in asm where the compiler can't see or optimise it, plus
> > I didn't want the possibility of passing meaningless values at that
> > level.  So the caller has to validate the vector length with
> > sve_vl_valid() where deemed necessary, and then convert explicitly.
> 
> Yeah, it's relatively clear to get to the fact that it's due to the
> ZCR_ELx.LEN - what was not at all clear was what the rule for choosing
> between the two representations was, my instinct would've been to hide
> the different representation, something like a static inline wrapper for
> the assembly function would still let the compiler see what's going on.  

We could certainly do that.  One reason why I didn't go for that
initially is that the ZCR setting is done in low-level places where we
don't really want to have to BUG().  Requiring the caller to do
something explicit reduces the chance of someone passing in an unchecked
garbage value for the vl.  But provided the caller checks with
sve_vl_valid(), or we are confident by construction that the value is
valid, this doesn't really matter.

Due to a lot of painful debugging, I became pretty paranoid when
upstreaming the initial code.  We should keep some of that paranoia, but
we probably don't need quite so much of it now.

> > Either way, calling this "vl" is breaking a useful convention that's
> > followed throughout the rest of the kernel, so I'd prefer we call it
> > something else -- but within reason, I don't mind what name is used.
> 
> That's the convention in the arm64 FP code or something else?

It's the convention I made up for the arm64 SVE code and user/kernel
API -- so when I say "the rest of the kernel", I just mean arch/arm64/.
The rest of the kernel doesn't contain any SVE code that doesn't follow
this convention, so I can claim it is followed everywhere ;)

In the architecture, "VL" is a looser concept that in most contexts
means something like "the size of a vector", but it can be a bit
unexpected -- as in the ", MUL VL" addressing mode syntax for example.

Cheers
---Dave


* Re: [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags
  2021-02-09 22:16     ` Mark Brown
@ 2021-02-10 19:52       ` Mark Brown
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Brown @ 2021-02-10 19:52 UTC (permalink / raw)
  To: Dave Martin
  Cc: Julien Grall, Julien Grall, Catalin Marinas, Zhang Lei,
	Will Deacon, linux-arm-kernel, Daniel Kiss


On Tue, Feb 09, 2021 at 10:16:27PM +0000, Mark Brown wrote:
> On Tue, Feb 09, 2021 at 05:59:46PM +0000, Dave Martin wrote:

> > > @@ -952,8 +1022,9 @@ void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> > >  	fpsimd_flush_task_state(current);
> > >  
> > >  	fpsimd_to_sve(current);

> > Hmmm, there's a latent bug upstream here: if the WARN() fires then
> > sve_state is not safe to dereference.  But we already did.

> > So perhaps this should have been something like:

> > 	if (!WARN_ON(test_and_set_thread_flag(TIF_SVE)))
> > 		fpsimd_to_sve();

> > This might make sense as a separate Fixes patch to precede the series.

> Yes, that's definitely a separate fix I think.

Actually now I look at this properly I think that we're sufficiently
confused if this happens that it's not worth worrying about avoiding the
conversion.  We already did a sve_alloc() which will have either
allocated SVE state or dereferenced a SVE state pointer that was there
and pointing to freed memory so we *might* be OK and if we're not then
it's too late to prevent anything and we'll exit with TIF_SVE which will
have a high likelihood of leading to future dereferences even if we stop
this one.  It's not clear to me that we're helping by potentially
leaving stale data around so I'm inclined to leave things as they are.


end of thread, other threads:[~2021-02-10 19:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-01 12:28 [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Mark Brown
2021-02-01 12:29 ` [PATCH v7 1/2] arm64/sve: Split TIF_SVE into separate execute and register state flags Mark Brown
2021-02-01 15:35   ` Dave Martin
2021-02-01 15:45     ` Mark Brown
2021-02-09 17:59   ` Dave Martin
2021-02-09 22:16     ` Mark Brown
2021-02-10 19:52       ` Mark Brown
2021-02-10 10:56   ` Dave Martin
2021-02-10 14:54     ` Mark Brown
2021-02-10 15:42       ` Dave Martin
2021-02-10 17:14         ` Mark Brown
2021-02-10 18:15           ` Dave Martin
2021-02-01 12:29 ` [PATCH v7 2/2] arm64/sve: Rework SVE trap access to minimise memory access Mark Brown
2021-02-10 11:09   ` Dave Martin
2021-02-10 17:54     ` Mark Brown
2021-02-08 17:26 ` [PATCH v7 0/2] arm64/sve: Improve performance when handling SVE access traps Dave Martin
2021-02-09 18:22   ` Mark Brown
