All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org,
	"Andy Lutomirski" <luto@kernel.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	kvm@vger.kernel.org, "Jason A. Donenfeld" <Jason@zx2c4.com>,
	"Rik van Riel" <riel@surriel.com>,
	"Dave Hansen" <dave.hansen@linux.intel.com>
Subject: Re: [PATCH v6] x86: load FPU registers on return to userland
Date: Fri, 8 Feb 2019 14:12:33 +0100	[thread overview]
Message-ID: <20190208131233.5g5guefglq6ik2ow@linutronix.de> (raw)
In-Reply-To: <20190130122713.GG18383@zn.tnic>

On 2019-01-30 13:27:13 [+0100], Borislav Petkov wrote:
> On Wed, Jan 30, 2019 at 01:06:47PM +0100, Sebastian Andrzej Siewior wrote:
> > I don't know if hackbench would show anything besides noise.
> 
> Yeah, if a sensible benchmark (dunno if hackbench is among them :))
> shows no difference, is also saying something.

"hackbench -g80 -l 1000 -s 255" shows just noise. I don't see any
reasonable difference with or without the series.

Tracing. The following patch

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c5a6edd92de4f..aa1914e5bf5c0 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -292,6 +292,7 @@ struct fpu {
 	 * FPU state should be reloaded next time the task is run.
 	 */
 	unsigned int			last_cpu;
+	unsigned int avoided_loads;
 
 	/*
 	 * @state:
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c98c54e796186..7560942a550ed 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -358,9 +358,11 @@ void fpu__clear(struct fpu *fpu)
  */
 void switch_fpu_return(void)
 {
+	struct fpu *fpu = &current->thread.fpu;
+
 	if (!static_cpu_has(X86_FEATURE_FPU))
 		return;
-
+	fpu->avoided_loads = 0;
 	__fpregs_load_activate();
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 37b2ecef041e6..875f74b1e8779 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -522,6 +522,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_prepare(prev_fpu, cpu);
+	else if (current->mm) {
+		prev_fpu->avoided_loads++;
+		trace_printk("skipped save %d\n", prev_fpu->avoided_loads);
+	}
 
 	/* We must save %fs and %gs before load_TLS() because
 	 * %fs and %gs may be cleared by load_TLS().

should help to spot the optimization. So if TIF_NEED_FPU_LOAD is set at
this point then between this and the last invocation of schedule() we
haven't been in userland and so we avoided loading + storing of FPU
registers. I saw things like:

|  http-1935  [001] d..2   223.460434: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
|   apt-1931  [001] d..2   223.460680: sched_switch: prev_comm=apt prev_pid=1931 prev_prio=120 prev_state=D ==> next_comm=http next_pid=1935 next_prio=120
|  http-1935  [001] d..2   223.460729: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
|  http-1935  [001] d..2   223.460732: __switch_to: skipped save 1
|   apt-1931  [001] d..2   223.461076: sched_switch: prev_comm=apt prev_pid=1931 prev_prio=120 prev_state=D ==> next_comm=http next_pid=1935 next_prio=120
|  http-1935  [001] d..2   223.461111: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
|  http-1935  [001] d..2   223.461112: __switch_to: skipped save 2

which means we avoided loading FPU registers for `http' because for some
reason it was not required. Here we switched between two user tasks so
without the patches we would have to save and restore them.

I captured a few instances of something like:

|  rcu_preempt-10    [000] d..2  1032.867293: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=98 prev_state=I ==> next_comm=kswapd0 next_pid=536 next_prio=120
|          apt-1954  [001] d..2  1032.867435: sched_switch: prev_comm=apt prev_pid=1954 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:0 next_pid=1943 next_prio=120
|          apt-1954  [001] d..2  1032.867436: __switch_to: skipped save 30
|  kworker/1:0-1943  [001] d..2  1032.867455: sched_switch: prev_comm=kworker/1:0 prev_pid=1943 prev_prio=120 prev_state=I ==> next_comm=apt next_pid=1954 next_prio=120
|          apt-1954  [001] d..2  1032.867459: sched_switch: prev_comm=apt prev_pid=1954 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120
|          apt-1954  [001] d..2  1032.867460: __switch_to: skipped save 31

It has been avoided to restore and save the FPU register of `apt' 31
times (in a row). This isn't 100% true. We switched to and from a kernel
thread to `apt' so switch_fpu_finish() wouldn't load the registers
because the switch to the kernel thread (switch_fpu_prepare()) would not
destroy them. *However* the switch away from `apt' would save the FPU
registers so we avoid this (current code always saves FPU registers on
context switch, see switch_fpu_prepare()).
My understanding is that if the CPU supports `xsaves' then it wouldn't
save anything in this scenario because the CPU would notice that its FPU
state didn't change since last time so no need to save anything.

Then we have lat_sig [0]. Without the series 64bit:
|Signal handler overhead: 2.6839 microseconds
|Signal handler overhead: 2.6996 microseconds
|Signal handler overhead: 2.6821 microseconds

with the series:
|Signal handler overhead: 3.2976 microseconds
|Signal handler overhead: 3.3033 microseconds
|Signal handler overhead: 3.2980 microseconds

that is approximately 22% worse. Without the series 64bit kernel with
32bit binary:
| Signal handler overhead: 3.8139 microseconds
| Signal handler overhead: 3.8035 microseconds
| Signal handler overhead: 3.8127 microseconds

with the series:
| Signal handler overhead: 4.0434 microseconds
| Signal handler overhead: 4.0438 microseconds
| Signal handler overhead: 4.0408 microseconds

approximately 6% worse. I'm a little surprised in the 32bit case because
it did save+copy earlier (while the 64bit saved it directly to signal
stack).

If we restore directly from signal stack (instead the copy_from_user())
we get to (64bit only):
| Signal handler overhead: 3.0376 microseconds
| Signal handler overhead: 3.0687 microseconds
| Signal handler overhead: 3.0510 microseconds

and if additionally save the registers to the signal stack:
| Signal handler overhead: 2.7835 microseconds
| Signal handler overhead: 2.7850 microseconds
| Signal handler overhead: 2.7766 microseconds

then we get almost to where we started. I will fire a commit per commit
bench to see if I notice something.
Ach and this was PREEMPT on a 
|x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.

machine. So those with AVX-512 might be worse but I don't have any of
those.

[0] Part of lmbench, test
    taskset 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_sig -P 1 -W 64 -N 5000 catch

Sebastian

  reply	other threads:[~2019-02-08 13:12 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-09 11:47 [PATCH v6] x86: load FPU registers on return to userland Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 01/22] x86/fpu: Remove fpu->initialized usage in __fpu__restore_sig() Sebastian Andrzej Siewior
2019-01-14 16:24   ` Borislav Petkov
2019-02-05 10:08     ` Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 02/22] x86/fpu: Remove fpu__restore() Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 03/22] x86/fpu: Remove preempt_disable() in fpu__clear() Sebastian Andrzej Siewior
2019-01-14 18:55   ` Borislav Petkov
2019-01-09 11:47 ` [PATCH 04/22] x86/fpu: Always init the `state' " Sebastian Andrzej Siewior
2019-01-14 19:32   ` Borislav Petkov
2019-01-09 11:47 ` [PATCH 05/22] x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe() Sebastian Andrzej Siewior
2019-01-16 19:36   ` Borislav Petkov
2019-01-16 22:40     ` Sebastian Andrzej Siewior
2019-01-17 12:22       ` Borislav Petkov
2019-01-18 21:14         ` Sebastian Andrzej Siewior
2019-01-18 21:17           ` Dave Hansen
2019-01-18 21:37             ` Sebastian Andrzej Siewior
2019-01-18 21:43               ` Dave Hansen
2019-01-21 11:21             ` Oleg Nesterov
2019-01-22 13:40               ` Borislav Petkov
2019-01-22 16:15                 ` Oleg Nesterov
2019-01-22 17:00                   ` Borislav Petkov
2019-02-05 11:34                     ` Sebastian Andrzej Siewior
2019-02-05 11:17               ` Sebastian Andrzej Siewior
2019-02-26 16:38                 ` Oleg Nesterov
2019-03-08 18:12                   ` Sebastian Andrzej Siewior
2019-02-05 14:37         ` [PATCH 05/22 v2] " Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 06/22] x86/fpu: Don't save fxregs for ia32 frames " Sebastian Andrzej Siewior
2019-01-24 11:17   ` Borislav Petkov
2019-02-05 16:43     ` [PATCH 06/22 v2] x86/fpu: Don't save fxregs for ia32 frames in Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 07/22] x86/fpu: Remove fpu->initialized Sebastian Andrzej Siewior
2019-01-24 13:34   ` Borislav Petkov
2019-02-05 18:03     ` Sebastian Andrzej Siewior
2019-02-06 14:01       ` Borislav Petkov
2019-02-07 10:13         ` Sebastian Andrzej Siewior
2019-02-07 10:37           ` Borislav Petkov
2019-02-05 18:06     ` [PATCH 07/22 v2] " Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 08/22] x86/fpu: Remove user_fpu_begin() Sebastian Andrzej Siewior
2019-01-25 15:18   ` Borislav Petkov
2019-02-05 18:16     ` Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 09/22] x86/fpu: Add (__)make_fpregs_active helpers Sebastian Andrzej Siewior
2019-01-28 18:23   ` Borislav Petkov
2019-02-07 10:43     ` Sebastian Andrzej Siewior
2019-02-13  9:30       ` Borislav Petkov
2019-02-14 14:51         ` Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 10/22] x86/fpu: Make __raw_xsave_addr() use feature number instead of mask Sebastian Andrzej Siewior
2019-01-28 18:30   ` Borislav Petkov
2019-01-09 11:47 ` [PATCH 11/22] x86/fpu: Make get_xsave_field_ptr() and get_xsave_addr() " Sebastian Andrzej Siewior
2019-01-28 18:49   ` Borislav Petkov
2019-02-07 11:13     ` Sebastian Andrzej Siewior
2019-02-13  9:31       ` Borislav Petkov
2019-01-09 11:47 ` [PATCH 12/22] x86/fpu: Only write PKRU if it is different from current Sebastian Andrzej Siewior
2019-01-23 18:09   ` Dave Hansen
2019-02-07 11:27     ` Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 13/22] x86/pkeys: Don't check if PKRU is zero before writting it Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 14/22] x86/fpu: Eager switch PKRU state Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 15/22] x86/entry: Add TIF_NEED_FPU_LOAD Sebastian Andrzej Siewior
2019-01-30 11:55   ` Borislav Petkov
2019-02-07 11:49     ` Sebastian Andrzej Siewior
2019-02-13  9:35       ` Borislav Petkov
2019-02-14 15:28         ` Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 16/22] x86/fpu: Always store the registers in copy_fpstate_to_sigframe() Sebastian Andrzej Siewior
2019-01-30 11:43   ` Borislav Petkov
2019-02-07 13:28     ` Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 17/22] x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD Sebastian Andrzej Siewior
2019-01-30 11:56   ` Borislav Petkov
2019-01-30 12:28     ` Sebastian Andrzej Siewior
2019-01-30 12:53       ` Borislav Petkov
2019-02-07 14:10         ` Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 18/22] x86/fpu: Update xstate's PKRU value on write_pkru() Sebastian Andrzej Siewior
2019-01-23 17:28   ` Dave Hansen
2019-01-09 11:47 ` [PATCH 19/22] x86/fpu: Inline copy_user_to_fpregs_zeroing() Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 20/22] x86/fpu: Let __fpu__restore_sig() restore the !32bit+fxsr frame from kernel memory Sebastian Andrzej Siewior
2019-01-30 21:29   ` Borislav Petkov
2019-01-09 11:47 ` [PATCH 21/22] x86/fpu: Merge the two code paths in __fpu__restore_sig() Sebastian Andrzej Siewior
2019-01-09 11:47 ` [PATCH 22/22] x86/fpu: Defer FPU state load until return to userspace Sebastian Andrzej Siewior
2019-01-31  9:16   ` Borislav Petkov
2019-01-15 12:44 ` [PATCH v6] x86: load FPU registers on return to userland David Laight
2019-01-15 13:15   ` 'Sebastian Andrzej Siewior'
2019-01-15 14:33     ` David Laight
2019-01-15 19:46   ` Dave Hansen
2019-01-15 20:26     ` Andy Lutomirski
2019-01-15 20:54       ` Dave Hansen
2019-01-15 21:11         ` Andy Lutomirski
2019-01-16 10:31           ` David Laight
2019-01-16 10:18       ` David Laight
2019-01-30 11:35 ` Borislav Petkov
2019-01-30 12:06   ` Sebastian Andrzej Siewior
2019-01-30 12:27     ` Borislav Petkov
2019-02-08 13:12       ` Sebastian Andrzej Siewior [this message]
2019-02-13 15:54         ` Sebastian Andrzej Siewior

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190208131233.5g5guefglq6ik2ow@linutronix.de \
    --to=bigeasy@linutronix.de \
    --cc=Jason@zx2c4.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=riel@surriel.com \
    --cc=rkrcmar@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.