* [PATCH v4] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state
@ 2024-01-22 19:42 Mark Brown
  2024-03-06 18:54 ` Catalin Marinas
From: Mark Brown @ 2024-01-22 19:42 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Benjamin Herrenschmidt, Dave Martin, linux-arm-kernel,
	linux-kernel, Mark Brown

When we are in a syscall we take the opportunity to discard the SVE state,
saving only the FPSIMD subset of the register state. When we reload the
state from memory we re-enable SVE access traps, stopping tracking SVE
until the task takes another SVE access trap. This means that for a task
which is actively using SVE, many blocking system calls will incur the
additional overhead of an SVE access trap.
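
For reference, this is the reload path in task_fpsimd_load() which
currently re-enables the trap when loading FPSIMD only state, and which
the patch below removes:

	case FP_STATE_FPSIMD:
		/* Stop tracking SVE for this task until next use. */
		if (test_and_clear_thread_flag(TIF_SVE))
			sve_user_disable();
		break;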

As SVE deployment is progressing we are seeing much wider use of the SVE
instruction set, including performance optimised implementations of
operations like memset() and memcpy(), meaning that even tasks which are
not obviously floating point based can end up with substantial SVE usage.

It does not, however, make sense to just unconditionally use the full SVE
register state all the time. It is larger than the FPSIMD register state,
so there is overhead in saving and restoring it on context switch, and our
requirement to flush the register state not shared with FPSIMD on syscall
also creates a noticeable overhead on system call.

I did some instrumentation which counted the number of SVE access traps
and the number of times we loaded FPSIMD only register state for each task.
Testing with Debian Bookworm showed that during boot the overwhelming
majority of tasks triggered another SVE access trap more than 50% of the
time after loading FPSIMD only state, with a substantial number near 100%,
though some programs had a very small number of SVE accesses, most likely
from startup. There were few tasks in the range 5-45%; most tasks either
used SVE frequently or used it only a tiny proportion of the time. As
expected, older distributions which do not have the SVE performance work
available showed no SVE usage in general applications.

This indicates that there should be some useful benefit from reducing the
number of SVE access traps for blocking system calls, as we did for
non-blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE
enabled on syscall if we don't context switch"). Let's do this by counting
the number of times we have loaded FPSIMD only register state for SVE tasks
and only re-enabling traps after some number of times, otherwise leaving
traps disabled and flushing the non-shared register state as we would on
a trap.

I pulled 64 out of thin air for the number of flushes to do; there is
doubtless room for tuning here. Ideally we would be able to tell if the
task is actually using SVE, but without using performance counters (which
would be substantial work) we can't currently tell. I picked the number
because so many of the tasks using SVE used it so frequently.

This means that for a task which is actively using SVE the number of SVE
access traps will be substantially reduced, while applications which use
SVE only very infrequently will still avoid the overheads associated with
tracking SVE state once the counter expires.

There should be no functional change resulting from this; it is purely a
performance optimisation.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v4:
- Rebase onto v6.8-rc1.
- Link to v3: https://lore.kernel.org/r/20231113-arm64-sve-trap-mitigation-v3-1-4779c9382483@kernel.org

Changes in v3:
- Rebase onto v6.7-rc1.
- Link to v2: https://lore.kernel.org/r/20230913-arm64-sve-trap-mitigation-v2-1-1bdeff382171@kernel.org

Changes in v2:
- Rebase onto v6.6-rc1.
- Link to v1: https://lore.kernel.org/r/20230807-arm64-sve-trap-mitigation-v1-1-d92eed1d2855@kernel.org
---
 arch/arm64/include/asm/processor.h |  1 +
 arch/arm64/kernel/fpsimd.c         | 43 ++++++++++++++++++++++++++++++++------
 2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 5b0a04810b23..a99ea422f744 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -162,6 +162,7 @@ struct thread_struct {
 	unsigned int		fpsimd_cpu;
 	void			*sve_state;	/* SVE registers, if any */
 	void			*sme_state;	/* ZA and ZT state, if any */
+	unsigned int		sve_timeout;    /* For SVE trap suppression */
 	unsigned int		vl[ARM64_VEC_MAX];	/* vector length */
 	unsigned int		vl_onexec[ARM64_VEC_MAX]; /* vl after next exec */
 	unsigned long		fault_address;	/* fault info */
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index a5dc6f764195..50770ea13db7 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -354,6 +354,7 @@ static void task_fpsimd_load(void)
 {
 	bool restore_sve_regs = false;
 	bool restore_ffr;
+	unsigned long sve_vq_minus_one;
 
 	WARN_ON(!system_supports_fpsimd());
 	WARN_ON(preemptible());
@@ -362,18 +363,12 @@ static void task_fpsimd_load(void)
 	if (system_supports_sve() || system_supports_sme()) {
 		switch (current->thread.fp_type) {
 		case FP_STATE_FPSIMD:
-			/* Stop tracking SVE for this task until next use. */
-			if (test_and_clear_thread_flag(TIF_SVE))
-				sve_user_disable();
 			break;
 		case FP_STATE_SVE:
 			if (!thread_sm_enabled(&current->thread) &&
 			    !WARN_ON_ONCE(!test_and_set_thread_flag(TIF_SVE)))
 				sve_user_enable();
 
-			if (test_thread_flag(TIF_SVE))
-				sve_set_vq(sve_vq_from_vl(task_get_sve_vl(current)) - 1);
-
 			restore_sve_regs = true;
 			restore_ffr = true;
 			break;
@@ -392,6 +387,15 @@ static void task_fpsimd_load(void)
 		}
 	}
 
+	/*
+	 * If SVE has been enabled we may keep it enabled even if
+	 * loading only FPSIMD state, so always set the VL.
+	 */
+	if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
+		sve_vq_minus_one = sve_vq_from_vl(task_get_sve_vl(current)) - 1;
+		sve_set_vq(sve_vq_minus_one);
+	}
+
 	/* Restore SME, override SVE register configuration if needed */
 	if (system_supports_sme()) {
 		unsigned long sme_vl = task_get_sme_vl(current);
@@ -418,6 +422,25 @@ static void task_fpsimd_load(void)
 	} else {
 		WARN_ON_ONCE(current->thread.fp_type != FP_STATE_FPSIMD);
 		fpsimd_load_state(&current->thread.uw.fpsimd_state);
+
+		/*
+		 * If the task had been using SVE we keep it enabled
+		 * when loading FPSIMD only state a number of times
+		 * to minimise overhead for tasks actively using SVE,
+		 * disabling it periodically to ensure that tasks that
+		 * use SVE intermittently do eventually avoid the
+		 * overhead of carrying SVE state.  The timeout is
+		 * initialised when we take an SVE trap in do_sve_acc().
+		 */
+		if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
+			if (!current->thread.sve_timeout) {
+				clear_thread_flag(TIF_SVE);
+				sve_user_disable();
+			} else {
+				current->thread.sve_timeout--;
+				sve_flush_live(true, sve_vq_minus_one);
+			}
+		}
 	}
 }
 
@@ -1364,6 +1387,13 @@ void do_sve_acc(unsigned long esr, struct pt_regs *regs)
 
 	get_cpu_fpsimd_context();
 
+	/*
+	 * We will keep SVE enabled when loading FPSIMD only state
+	 * this many times to minimise traps when userspace is
+	 * actively using SVE.
+	 */
+	current->thread.sve_timeout = 64;
+
 	if (test_and_set_thread_flag(TIF_SVE))
 		WARN_ON(1); /* SVE access shouldn't have trapped */
 
@@ -1591,6 +1621,7 @@ void fpsimd_flush_thread(void)
 		/* Defer kfree() while in atomic context */
 		sve_state = current->thread.sve_state;
 		current->thread.sve_state = NULL;
+		current->thread.sve_timeout = 0;
 
 		fpsimd_flush_thread_vl(ARM64_VEC_SVE);
 	}

---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20230807-arm64-sve-trap-mitigation-2e7e2663c849

Best regards,
-- 
Mark Brown <broonie@kernel.org>



* Re: [PATCH v4] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state
  2024-01-22 19:42 [PATCH v4] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state Mark Brown
@ 2024-03-06 18:54 ` Catalin Marinas
  2024-03-06 22:44   ` Mark Brown
From: Catalin Marinas @ 2024-03-06 18:54 UTC (permalink / raw)
  To: Mark Brown
  Cc: Will Deacon, Benjamin Herrenschmidt, Dave Martin,
	linux-arm-kernel, linux-kernel

On Mon, Jan 22, 2024 at 07:42:14PM +0000, Mark Brown wrote:
> This indicates that there should be some useful benefit from reducing the
> number of SVE access traps for blocking system calls, as we did for
> non-blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE
> enabled on syscall if we don't context switch"). Let's do this by counting
> the number of times we have loaded FPSIMD only register state for SVE tasks
> and only re-enabling traps after some number of times, otherwise leaving
> traps disabled and flushing the non-shared register state as we would on
> a trap.

It looks like some people complain about SVE being disabled, though I
assume this is for kernels prior to 6.2 and commit 8c845e273104
("arm64/sve: Leave SVE enabled on syscall if we don't context switch"):

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1999551/comments/52

I assume we may see the other camp complaining about the additional
state saving on context switch.

Anyway, I don't see why we should treat blocking syscalls differently
from non-blocking ones (addressed by the commit above). I guess the
difference is that one goes through a context switch but, from a user
perspective, it's still a syscall. The SVE state is expected to be
discarded and there may be a preference for avoiding the subsequent
fault.

> I pulled 64 out of thin air for the number of flushes to do; there is
> doubtless room for tuning here. Ideally we would be able to tell if the
> task is actually using SVE, but without using performance counters (which
> would be substantial work) we can't currently tell. I picked the number
> because so many of the tasks using SVE used it so frequently.

So I wonder whether we should make the timeout disabling behaviour the
same for both blocking and non-blocking syscalls. IOW, ignore the
context switching aspect. Every X number of returns, disable SVE
irrespective of whether it was context switched or not. Or, if the
number of returns varies too much in time, just use a jiffy (or other
time based figure) so that it would time out in the same way for all
types of workloads.
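
For illustration, a jiffies based variant of the check in
task_fpsimd_load() might look something like the below, where
sve_timeout_at is a hypothetical per-thread field initialised from
do_sve_acc(), not anything in the posted patch:

	/*
	 * Hypothetical time based timeout: stop tracking SVE once
	 * some interval has passed since the last SVE access trap,
	 * however many times the state was reloaded in between.
	 */
	if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
		if (time_after(jiffies, current->thread.sve_timeout_at)) {
			clear_thread_flag(TIF_SVE);
			sve_user_disable();
		} else {
			sve_flush_live(true, sve_vq_minus_one);
		}
	}

with do_sve_acc() setting, say:

	current->thread.sve_timeout_at = jiffies + HZ / 100;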

-- 
Catalin


* Re: [PATCH v4] arm64/fpsimd: Suppress SVE access traps when loading FPSIMD state
  2024-03-06 18:54 ` Catalin Marinas
@ 2024-03-06 22:44   ` Mark Brown
From: Mark Brown @ 2024-03-06 22:44 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Benjamin Herrenschmidt, Dave Martin,
	linux-arm-kernel, linux-kernel


On Wed, Mar 06, 2024 at 06:54:51PM +0000, Catalin Marinas wrote:
> On Mon, Jan 22, 2024 at 07:42:14PM +0000, Mark Brown wrote:

> > This indicates that there should be some useful benefit from reducing the
> > number of SVE access traps for blocking system calls, as we did for
> > non-blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE
> > enabled on syscall if we don't context switch"). Let's do this by counting
> > the number of times we have loaded FPSIMD only register state for SVE tasks
> > and only re-enabling traps after some number of times, otherwise leaving
> > traps disabled and flushing the non-shared register state as we would on
> > a trap.

> It looks like some people complain about SVE being disabled, though I
> assume this is for kernels prior to 6.2 and commit 8c845e273104
> ("arm64/sve: Leave SVE enabled on syscall if we don't context switch"):

> https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1999551/comments/52

Yes, from a user perspective this is a case where they're running what
they see as a perfectly standard instruction and finding that it can
take thousands of cycles longer than expected while we take a trip
through EL1.  If this happens frequently then it becomes much harder to
justify using SVE, especially for things that run for a short time or
which don't have an overwhelming performance advantage over a non-SVE
implementation.

> I assume we may see the other camp complaining about the additional
> state saving on context switch.

Yes, in the case where things get preempted that's an issue for tasks
that are mostly FPSIMD only; otherwise we'd be better off just never
disabling SVE after it's been enabled and we've had to allocate the
storage.

> Anyway, I don't see why we should treat blocking syscalls differently
> from non-blocking ones (addressed by the commit above). I guess the
> difference is that one goes through a context switch but, from a user
> perspective, it's still a syscall. The SVE state is expected to be
> discarded and there may be a preference for avoiding the subsequent
> fault.

To me this is purely about the fault behaviour; the fact that this is
related to the state saving/loading is more of an implementation detail
than anything else.

> > I pulled 64 out of thin air for the number of flushes to do; there is
> > doubtless room for tuning here. Ideally we would be able to tell if the
> > task is actually using SVE, but without using performance counters (which
> > would be substantial work) we can't currently tell. I picked the number
> > because so many of the tasks using SVE used it so frequently.

> So I wonder whether we should make the timeout disabling behaviour the
> same for both blocking and non-blocking syscalls. IOW, ignore the
> context switching aspect. Every X number of returns, disable SVE
> irrespective of whether it was context switched or not. Or, if the
> number of returns has a variation in time, just use a jiffy (or other
> time based figure), it would time out in the same way for all types of
> workloads.

Of those approaches I think a time based one seems more appealing -
we're basically just using the number of times we needed to reload the
state as a proxy for "has been running for a while" here, since it's
information that's readily to hand.  The reload count does penalise
tasks that spend a lot of time blocked.

I'd still be inclined to only actually check when loading the state,
simply because otherwise you can't avoid the overhead by doing something
like pinning your sensitive task to a CPU so it never gets scheduled
away, which seems like a reasonable thing to optimise for.  That would
mean a task that is only ever context switched through preemption would
never drop SVE state, but I think that if your task is using that much
CPU the cost of saving the extra state on context switch is kind of in
the noise.


