linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: "Liu, Jing2" <jing2.liu@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>
Cc: "x86@kernel.org" <x86@kernel.org>,
	"Bae, Chang Seok" <chang.seok.bae@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Arjan van de Ven <arjan@linux.intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	Jing Liu <jing2.liu@linux.intel.com>,
	"seanjc@google.com" <seanjc@google.com>,
	"Cooper, Andrew" <andrew.cooper3@citrix.com>
Subject: RE: [patch 13/31] x86/fpu: Move KVMs FPU swapping to FPU core
Date: Sat, 16 Oct 2021 16:45:55 +0200	[thread overview]
Message-ID: <87mtn93u58.ffs@tglx> (raw)
In-Reply-To: <BYAPR11MB3256B3120DEB5FE0DD53B5B9A9B99@BYAPR11MB3256.namprd11.prod.outlook.com>

Jing,

On Fri, Oct 15 2021 at 14:24, Jing2 Liu wrote:
> On 10/15/2021 5:36 PM, Thomas Gleixner wrote:
>> If you carefully look at part 2 of the rework, then you might notice that there
>> is a fundamental change which allows to do a real simplification for KVM FPU
>> handling:
>> 
>>    current->thread.fpu.fpstate
>> 
>> is now a pointer. So you can spare one FPU allocation because we can now
>> do:
>
> Trying to understand your point, seems struct fpu will add a guest_fpstate
> pointer. And this will be allocated when vcpu_create() by the following
> function. Swap between the two pointers in load/put. What I was thinking 
> is that vcpu keeps having guest_fpu and delete user_fpu.

unfortunately we can't do that in vcpu_create() because the thread doing
that is not necessarily the vCPU thread which invokes vcpu_run()
later. But that does not matter much.

So vcpu_create() will do

   vcpu->arch.guest_fpstate = fpu_alloc_guest_fpstate();

and in vcpu_run() invoke

    fpu_swap_kvm_fpstate(guest_fpstate, ...)

which in turn does:

int fpu_swap_kvm_fpstate(struct fpstate *guest_fps, bool enter_guest,
			 u64 restore_mask)
{
	struct fpu *fpu = &current->thread.fpu;
	struct fpstate *fps = fpu->fpstate;

	fpregs_lock();
	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
		save_fpregs_to_fpstate(fpu);

	/* Swap fpstate */
	if (enter_guest) {
		fpu->__task_fpstate = fps;
		fpu->fpstate = guest_fps;
	} else {
		fpu->fpstate = fpu->__task_fpstate;
		fpu->__task_fpstate = NULL;
	}

	fps = fpu->fpstate;

	/*
	 * Once XFD support is added, XFP switching happens here
	 * right before the restore.
	 */
	restore_mask &= XFEATURE_MASK_FPSTATE;
	restore_fpregs_from_fpstate(fps, restore_mask);

	fpregs_mark_activate();
	fpregs_unlock();
	return 0;
}

That's a simplified version of what I have already running on top of the
FPU rework part 3, but you get the idea.

If you are curious:

  https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/log/?h=x86/fpu-3-kvm

If you compare that to the current KVM FPU swap handling then you'll
notice that there is only _one_ buffer required and in case that
TIF_NEED_FPU_RELOAD is set there is no memcpy() required either because
the state is already saved in the to be swapped out buffer.

That's a valuable cleanup and improvement independent of AMX.

See?

This also makes the confidential computing case less awkward because we
can do:

	if (!fpstate->is-scratch && !test_thread_flag(TIF_NEED_FPU_LOAD))
		save_fpregs_to_fpstate(fpu);

instead of the current hack of freeing guest_fpu. See the git tree.

Though I'm not sure whether the logic for this "is_scratch" optimization
is correct as I implemnted it, but I'm neither sure that the current
logic in KVM is correct. But that's just a implementation detail which
needs to be looked at.

XFD support will be also fully consistently integrated:

  XFD will be switched before the restore and this will be fully
  consistent with everything we are doing vs. host side support because
  current->thread.fpu.fpstate->xfd will always be the authoritive
  answer. No need to copy any information from one place to another.

Ergo: No 4 copies of XFD.

>> In a second step:
>> 
>> fpu_swap_kvm_fpu(bool enter_guest, u64 guest_needs_features) {
>>         possibly_reallocate(enter_guest, guest_needs_features);
>
> When KVM traps guest wrmsr XFD in #NM, I think KVM need allocate
> fpstate buffer for full features.
> Because in next vmexit, guest might have dynamic state and KVM
> can be preempted before running fpu_swap_kvm_fpu().
> Thus, here the current->thread.fpu.fpstate already has enough space
> for saving guest.

I think we are talking past each other.

You are looking at this from the point of view of what you have been
doing so far and I am looking at it from a design and abstraction point
of view.

That explains why we have different excpectations vs. XCR0/XFD/#NM.

So the regular boring case will be:

H   vcpu_run()
H   	fpu_swap_kvm_fpstate() <- Sets guest_fpstate->xfd
H
H        while (..) {
H           vmenter()

G              ....
G              ....     -> vmexit (unrelated to XCR0/XFD)

H           ...
H        }
H
H  	fpu_swap_kvm_fpstate() <- Sets user (task) XFD

Now let's look at the XFD/XCR0 intercept case:

H   vcpu_run()
H   	fpu_swap_kvm_fpstate() <- Sets guest_fpstate->xfd
H
H        while (..) {
H           vmenter()

G              ....
G              write to XFD/XCR0;        -> intercept

H           ...
H           if (reason == write to XFD/XCR0)) {
H                if (needs_action(guest_fpstate, $XFDVAL, $XCR0VAL)) {
H                        fpstate_set_realloc_request(guest_fpstate);
H
H                        break;
H
H                }
H           }
H           .....
H        }
H
H  	fpu_swap_kvm_fpstate()

fpu_swap_kvm_fpstate() will see the reallocation request in
guest_fpstate and act accordingly.

Both user and guest state are fully consistent at that point. Why?

It does not matter at all whether the wrmsrl(XFD) or XSETBV affecting
XCR0 in the guest happens because the guest decided it is cool to enable
it just for fun or because the guest took a #NM and wrote to XFD.

In both cases the XFD controlled component is in init state at that
point. So there is _nothing_ to save and _nothing_ which can be lost and
no buffer size problem at all.

Therefore it does also not matter whether the vCPU thread gets preempted
or not on the way out to fpu_swap_kvm_fpstate(). It's all still
consistent.

So fpu_swap_kvm_fpstate() will do in case of a reallocation request:

  1) Allocate new guest fpstate

     If that fails, it does a save and restore with the existing
     fpstates and return an error code which makes KVM drop out to user
     space to decide what to do.

     On success initialize the state properly including the new XFD
     value.

  2) Save guest state into new guest fpstate

  3) Restore host (user) state from the existing fpstate

See?

It does not even have to allocate a new host (user) fpstate to do
that. Why?

  Because once the fpstate pointer is flipped the state is consistent in
  both directions including XFD.

See?

Now if you think about the other way round then the same principle
applies:

  If the host (user) side of the vCPU thread used a dynamic state it has
  a large buffer, but that does not require the guest side buffer to be
  large as well.

  So this is what Paolo wanted, right? I was fighting that because with
  the existing three buffer scheme this cannot not work.

See?

The point is that: 

  - the simple state switching was impossible because the proposed host
    side infrastructure had the wrong abstraction:

    It just reallocated the register buffer, but did not give
    it a container which carries the other relevant information,
    i.e. features, sizes and importantly xfd.

  - the simple state switching was impossible because the user/guest FPU
    concept of KVM was preventing that.

  - it was tried to plug the reallocation into the wrong place:

    There is no point to do that from inside the vcpu_run() loop. It's a
    one off event and that extra overhead of going out to the place
    where this can be handled sanely does not matter at all.

Just to be clear: I'm not blamning you for any of this at all.

There have been enough senior people involved who should have seen the
obvious instead of misguiding you.

So please just forget the horrors which you went through due to lack of
proper guidance, sit back and think about it.

The code in that git branch is just a first step and requires a few
tweaks to get the reallocation handled correctly, but if you look at the
above then you might realize that there are two related but largely
independent problems to solve:

  1) What needs to be intercepted and analyzed in the intercept handlers
     of XCR0 and XFD

  2) Handling the reallocation problem

#1 is a KVM problem and #2 is a host FPU problem

As you are a KVM wizard, I let you sort out #1 with the KVM folks and
I'm looking at the #2 part together with Chang and Dave. Why?

  #1 is not my area of expertise, but I surely might have opinions.
  
  #2 is not any different from handling the host side lazy reallocation.

Can you spot the difference between the original approach and the approach
I have taken?

Maybe you understand now why I was explaining over and over that we need
consistent state and asked everyone to look at the AMX host series.

Just for the record. When I looked at that KVM FPU switching exactly two
weeks ago while I was thinking about the right abstraction for AMX, it
was bloody obvious that just reallocating the register state buffer is
wrong. And it was bloody obvious that the user/guest FPU concept of KVM
is nonsense to begin with and going to be in the way of doing a clean
integration.

Why?

Because when you switch buffers, you need to switch state information
which belongs to the buffer, i.e. features, sizes and xfd, as well
because otherwise you create inconsistent state. Sure you can copy tons
of stuff back and forth, but why would you do that if you just can
switch the full state by swapping the pointer to a container which
contains all the information which is needed and makes everything else
(KVM trap bits aside) just work.

So you can rightfully ask why I did not tell that plan right away?

The reason is that I wanted all of you look at the AMX host series and I
desperately hoped that my insisting on state consistency will make at
least one of the involved people come back and say:

  "Thomas, why don't you do the obvious in fpu_swap_kvm_fpu() and switch
   the fpstate pointers? That FPU switching with the three buffers you
   kept around is bloody stupid."

My answer would have been: "Doh, of course, it's obvious. Stupid me."

But that did not happen. Even when I brought up the

    vcpu_create() -> alloc_and_attach()
    vcpu_run() -> swap() -> vmenter_loop() -> swap()
    vcpu_destroy() -> detach_and_free()

proposal nobody told me:

     "Thomas, this can't work because the thread which creates the  vCPU
      is not necessarily the same as the one which runs it."

No, I had to ask the question myself because I had second thoughts when
I was starting to implement that scheme. I had not thought about that
when I wrote it up in mail simply because I'm not a KVM expert. But it
did not matter because the concept stayed the same, just the
implementation details changed:

    vcpu_create() -> alloc()
    vcpu_run() -> attach() -> vmenter_loop() -> detach()
    vcpu_destroy() -> free()

Why? Because everyone was busy trying to cram their hacks into the code
I just changed instead of rethinking the situation.

See?

Jing, as I said before, I'm not blaming you personally. What I blame is
the general approach to add new features to the kernel:

    Hack it into the existing mess until it works by some definition
    of works.

That simply cannot go anywhere because it makes the code slow and
unmaintainable in the long run.

If a new feature does not fit nicely into the existing code, then the
only viable approach is to sit back, look at the requirements of the new
feature and come up with proper abstractions and a plan how to refactor
the code so that the feature falls into place at the very end and does
not create mess and overhead all over the place.

If you look at the three series I posted, then you see not a single bit
which does not make sense on it's own, except for the last series which
adds the fpu config data structure with the pairs of default_* and
max_*.

Even the fpstate pointer makes sense on it's own because the resulting
cleanup of the KVM FPU switch code is already worthwhile even w/o AMX
and XFD in terms of memory consumption, performance and robustness.

See?

The real AMX stuff which still needs to be posted is just building upon
this refactoring. It adds the necessary infrastructure for AMX, which is
all slow path code.

In the hotpath it adds only the XFD update at exactly 4 places where
state is saved or restored. IOW, all hotpath operations are exactly the
same. If XFD is not available on a CPU then the overhead of the XFD
update code is a few extra NOPs due to the patched out static branch.
If enabled then yes, it has an extra conditional and when the XFD value
really changes then a wrmsrl, but that's inevitable.

See?

Now if you sit back and look at the KVM concepts I explained above then
you surely can see that the overhead for the KVM case is going to
exactly a few extra NOPs in the hotpath when XFD is not available.

When XFD is enabled then yes, it needs the extra setup for XFD, but the
common case in the vmenter_loop() will have only a minimalistic overhead
if at all. The common case in fpu_swap_kvm_fpstate() will only grow a
single conditional in the hotpath:

       if (unlikely(guest_fpstate->need_realloc)) {

       }

but that's not even measurable.

See?

Thanks,

        Thomas

  parent reply	other threads:[~2021-10-16 14:46 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-11 23:59 [patch 00/31] x86/fpu: Preparatory cleanups for AMX support (part 1) Thomas Gleixner
2021-10-11 23:59 ` [patch 01/31] x86/fpu: Remove pointless argument from switch_fpu_finish() Thomas Gleixner
2021-10-12  0:00 ` [patch 02/31] x86/fpu: Update stale comments Thomas Gleixner
2021-10-12  0:00 ` [patch 03/31] x86/pkru: Remove useless include Thomas Gleixner
2021-10-12  0:00 ` [patch 04/31] x86/fpu: Restrict xsaves()/xrstors() to independent states Thomas Gleixner
2021-10-12 14:24   ` Borislav Petkov
2021-10-12  0:00 ` [patch 05/31] x86/fpu: Cleanup the on_boot_cpu clutter Thomas Gleixner
2021-10-12  0:00 ` [patch 06/31] x86/fpu: Remove pointless memset in fpu_clone() Thomas Gleixner
2021-10-12  0:00 ` [patch 07/31] x86/process: Clone FPU in copy_thread() Thomas Gleixner
2021-10-12  0:00 ` [patch 08/31] x86/fpu: Do not inherit FPU context for kernel and IO worker threads Thomas Gleixner
2021-10-12  0:00 ` [patch 09/31] x86/fpu: Do not inherit FPU context for CLONE_THREAD Thomas Gleixner
2021-10-12 16:10   ` Borislav Petkov
2021-10-12 18:52     ` Thomas Gleixner
2021-10-12 19:01       ` Thomas Gleixner
2021-10-12  0:00 ` [patch 10/31] x86/fpu: Cleanup xstate xcomp_bv initialization Thomas Gleixner
2021-10-12  0:00 ` [patch 11/31] x86/fpu/xstate: Provide and use for_each_xfeature() Thomas Gleixner
2021-10-12 16:45   ` Borislav Petkov
2021-10-12  0:00 ` [patch 12/31] x86/fpu/xstate: Mark all init only functions __init Thomas Gleixner
2021-10-12  0:00 ` [patch 13/31] x86/fpu: Move KVMs FPU swapping to FPU core Thomas Gleixner
2021-10-12 16:53   ` Borislav Petkov
2021-10-12 18:25     ` Thomas Gleixner
2021-10-12 18:26       ` Thomas Gleixner
2021-10-12 17:22   ` Paolo Bonzini
2021-10-13  6:15     ` Liu, Jing2
2021-10-13  6:26       ` Paolo Bonzini
2021-10-13  7:46         ` Liu, Jing2
2021-10-13  8:42           ` Paolo Bonzini
2021-10-13 10:14             ` Andy Lutomirski
2021-10-13 12:26               ` Paolo Bonzini
2021-10-13 14:14                 ` Thomas Gleixner
2021-10-13 14:24                   ` Thomas Gleixner
2021-10-13 14:59                 ` Andy Lutomirski
2021-10-13 15:05                   ` Paolo Bonzini
2021-10-13 10:25             ` Liu, Jing2
2021-10-13 12:37               ` Paolo Bonzini
2021-10-13 14:06             ` Thomas Gleixner
2021-10-14  6:50               ` Paolo Bonzini
2021-10-14  8:02                 ` Liu, Jing2
2021-10-14  9:01                   ` Paolo Bonzini
2021-10-14 11:21                     ` Liu, Jing2
2021-10-14 11:33                       ` Paolo Bonzini
2021-10-14 11:30                     ` Liu, Jing2
2021-10-14 11:39                       ` Paolo Bonzini
2021-11-22  8:50                         ` Liu, Jing2
2021-10-14 14:09                     ` Thomas Gleixner
2021-10-14 14:37                       ` Thomas Gleixner
2021-10-14 15:01                       ` Paolo Bonzini
2021-10-14 19:14                         ` Thomas Gleixner
2021-10-15  9:20                           ` Liu, Jing2
2021-10-15  9:36                           ` Thomas Gleixner
2021-10-15 14:24                             ` Liu, Jing2
2021-10-15 15:53                               ` Paolo Bonzini
2021-10-16 14:45                               ` Thomas Gleixner [this message]
2021-10-15  9:00                         ` Liu, Jing2
2021-10-15 10:50                           ` Thomas Gleixner
2021-10-15 11:17                             ` Paolo Bonzini
2021-10-15 13:01                             ` Liu, Jing2
2021-10-14 12:23                 ` Thomas Gleixner
2021-10-14 12:26                   ` Paolo Bonzini
2021-10-14 14:23                     ` Thomas Gleixner
2021-10-13 15:12       ` Thomas Gleixner
2021-10-14  8:21         ` Liu, Jing2
2021-10-14 13:08           ` Thomas Gleixner
2021-10-12  0:00 ` [patch 14/31] x86/fpu: Replace KVMs homebrewn FPU copy from user Thomas Gleixner
2021-10-12 17:00   ` Borislav Petkov
2021-10-13 14:57     ` Sean Christopherson
2021-10-13 15:12       ` Paolo Bonzini
2021-10-13 15:16       ` Thomas Gleixner
2021-10-12 17:30   ` Paolo Bonzini
2021-10-12  0:00 ` [patch 15/31] x86/fpu: Rework copy_xstate_to_uabi_buf() Thomas Gleixner
2021-10-12 17:30   ` Paolo Bonzini
2021-10-12  0:00 ` [patch 16/31] x86/fpu: Replace KVMs homebrewn FPU copy to user Thomas Gleixner
2021-10-12 17:10   ` Borislav Petkov
2021-10-12 17:36   ` Paolo Bonzini
2021-10-12 17:47     ` Thomas Gleixner
2021-10-12 18:40       ` [patch V2 16/31] x86/fpu: Replace KVMs home brewed " Thomas Gleixner
2021-10-13  5:34       ` [patch 16/31] x86/fpu: Replace KVMs homebrewn " Paolo Bonzini
2021-10-12  0:00 ` [patch 17/31] x86/fpu: Mark fpu__init_prepare_fx_sw_frame() as __init Thomas Gleixner
2021-10-12  0:00 ` [patch 18/31] x86/fpu: Move context switch and exit to user inlines into sched.h Thomas Gleixner
2021-10-12  0:00 ` [patch 19/31] x86/fpu: Clean up cpu feature tests Thomas Gleixner
2021-10-12  0:00 ` [patch 20/31] x86/fpu: Make os_xrstor_booting() private Thomas Gleixner
2021-10-12  0:00 ` [patch 21/31] x86/fpu: Move os_xsave() and os_xrstor() to core Thomas Gleixner
2021-10-12  0:00 ` [patch 22/31] x86/fpu: Move legacy ASM wrappers " Thomas Gleixner
2021-10-12  0:00 ` [patch 23/31] x86/fpu: Make WARN_ON_FPU() private Thomas Gleixner
2021-10-12  0:00 ` [patch 24/31] x86/fpu: Move fpregs_restore_userregs() to core Thomas Gleixner
2021-10-12 17:32   ` Borislav Petkov
2021-10-12  0:00 ` [patch 25/31] x86/fpu: Move mxcsr related code " Thomas Gleixner
2021-10-12  0:00 ` [patch 26/31] x86/fpu: Move fpstate functions to api.h Thomas Gleixner
2021-10-12 17:46   ` Borislav Petkov
2021-10-12  0:00 ` [patch 27/31] x86/fpu: Remove internal.h dependency from fpu/signal.h Thomas Gleixner
2021-10-12  0:00 ` [patch 28/31] x86/sev: Include fpu/xcr.h Thomas Gleixner
2021-10-12  7:24   ` Xiaoyao Li
2021-10-12  0:00 ` [patch 29/31] x86/fpu: Mop up the internal.h leftovers Thomas Gleixner
2021-10-12  0:00 ` [patch 30/31] x86/fpu: Replace the includes of fpu/internal.h Thomas Gleixner
2021-10-12  0:00 ` [patch 31/31] x86/fpu: Provide a proper function for ex_handler_fprestore() Thomas Gleixner
2021-10-12 21:15 ` [patch 00/31] x86/fpu: Preparatory cleanups for AMX support (part 1) Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mtn93u58.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=andrew.cooper3@citrix.com \
    --cc=arjan@linux.intel.com \
    --cc=chang.seok.bae@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=jing2.liu@intel.com \
    --cc=jing2.liu@linux.intel.com \
    --cc=jun.nakajima@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).