Re: [PATCH 1/2] x86: notify hypervisor about guest entering s2idle state

From: Grzegorz Jaszczyk <jaz@semihalf.com>
To: "Limonciello, Mario" <mario.limonciello@amd.com>,
	Sean Christopherson <seanjc@google.com>
Cc: linux-kernel@vger.kernel.org, Dmytro Maluka <dmy@semihalf.com>,
	Zide Chen <zide.chen@intel.corp-partner.google.com>,
	Peter Fang <peter.fang@intel.corp-partner.google.com>,
	Tomasz Nowicki <tn@semihalf.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
	<x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Len Brown <lenb@kernel.org>, Pavel Machek <pavel@ucw.cz>,
	Ashish Kalra <ashish.kalra@amd.com>,
	Hans de Goede <hdegoede@redhat.com>,
	Sachi King <nakato@nakato.io>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	David Dunn <daviddunn@google.com>,
	Wei Wang <wei.w.wang@intel.com>,
	Nicholas Piggin <npiggin@gmail.com>,
	"open list:KERNEL VIRTUAL MACHINE (KVM)" <kvm@vger.kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	"open list:ACPI" <linux-acpi@vger.kernel.org>,
	"open list:HIBERNATION (aka Software Suspend,
	aka swsusp)"  <linux-pm@vger.kernel.org>,
	Dominik Behr <dbehr@google.com>,
	Dmitry Torokhov <dtor@google.com>
Subject: Re: [PATCH 1/2] x86: notify hypervisor about guest entering s2idle state
Date: Thu, 23 Jun 2022 18:50:13 +0200
Message-ID: <CAH76GKNB0V+-Ky6bfhX6Kzudyn6zJW42iSWfRkfbo9C-eKdo-w@mail.gmail.com>
In-Reply-To: <7c428b03-261f-78cb-4ce3-5949ac93f028@amd.com>

On Wed, 22 Jun 2022 at 23:50, Limonciello, Mario
<mario.limonciello@amd.com> wrote:
>
> On 6/22/2022 04:53, Grzegorz Jaszczyk wrote:
> > On Mon, 20 Jun 2022 at 18:32, Limonciello, Mario
> > <mario.limonciello@amd.com> wrote:
> >>
> >> On 6/20/2022 10:43, Grzegorz Jaszczyk wrote:
> >>> On Thu, 16 Jun 2022 at 18:58, Limonciello, Mario
> >>> <mario.limonciello@amd.com> wrote:
> >>>>
> >>>> On 6/16/2022 11:48, Sean Christopherson wrote:
> >>>>> On Wed, Jun 15, 2022, Grzegorz Jaszczyk wrote:
> >>>>>> On Fri, 10 Jun 2022 at 16:30, Sean Christopherson <seanjc@google.com> wrote:
> >>>>>>> MMIO or PIO for the actual exit, there's nothing special about hypercalls.  As for
> >>>>>>> enumerating to the guest that it should do something, why not add a new ACPI_LPS0_*
> >>>>>>> function?  E.g. something like
> >>>>>>>
> >>>>>>> static void s2idle_hypervisor_notify(void)
> >>>>>>> {
> >>>>>>>            if (lps0_dsm_func_mask > 0)
> >>>>>>>                    acpi_sleep_run_lps0_dsm(ACPI_LPS0_EXIT_HYPERVISOR_NOTIFY,
> >>>>>>>                                            lps0_dsm_func_mask, lps0_dsm_guid);
> >>>>>>> }
> >>>>>>
> >>>>>> Great, thank you for your suggestion! I will try this approach and
> >>>>>> come back. Since this will be the main change in the next version,
> >>>>>> will it be ok for you to add Suggested-by: Sean Christopherson
> >>>>>> <seanjc@google.com> tag?
> >>>>>
> >>>>> If you want, but there's certainly no need to do so.  But I assume you or someone
> >>>>> at Intel will need to get formal approval for adding another ACPI LPS0 function?
> >>>>> I.e. isn't there work to be done outside of the kernel before any patches can be
> >>>>> merged?
> >>>>
> >>>> There are 3 different LPS0 GUIDs in use.  An Intel one, an AMD (legacy)
> >>>> one, and a Microsoft one.  They all have their own specs, and so if this
> >>>> was to be added I think all 3 need to be updated.
> >>>
> >>> Yes this will not be easy to achieve I think.
> >>>
> >>>>
> >>>> As this is Linux specific hypervisor behavior, I don't know that you would be
> >>>> able to convince Microsoft to update theirs either.
> >>>>
> >>>> How about using s2idle_devops?  There is a prepare() call and a
> >>>> restore() call that is set for each handler.  The only consumer of this
> >>>> ATM I'm aware of is the amd-pmc driver, but it's done like a
> >>>> notification chain so that a bunch of drivers can hook in if they need to.
> >>>>
> >>>> Then you can have this notification path and the associated ACPI device
> >>>> it calls out to be its own driver.
> >>>
> >>> Thank you for your suggestion. Just to be sure that I've understood
> >>> your idea correctly:
> >>> 1) it will require extending acpi_s2idle_dev_ops with something like a
> >>> hypervisor_notify() call, since the existing prepare() is called from the
> >>> end of acpi_s2idle_prepare_late, which is too early, as described in
> >>> one of the previous messages (between acpi_s2idle_prepare_late and the
> >>> place where we use the hypercall there are several places where the suspend
> >>> could be canceled; otherwise we could probably try to trap on another
> >>> acpi_sleep_run_lps0_dsm occurrence from acpi_s2idle_prepare_late).
> >>>
> >>
> >> The idea for prepare() was it would be the absolute last thing before
> >> the s2idle loop was run.  You're sure that's too early?  It's basically
> >> the same thing as having a last stage new _DSM call.
> >>
> >> What about adding a new abort() extension to acpi_s2idle_dev_ops?  Then
> >> you could catch the cancelled suspend case still and take corrective
> >> action (if that action is different than what restore() would do).
> >
> > It will be problematic since the abort/restore notification could
> > arrive too late, and the whole system would then go to suspend
> > thinking that the guest is in the desired s2idle state. Also, in this case
> > it would be impossible to prevent races and to make sure whether
> > the guest is actually suspended or not. We already had a similar discussion
> > with Sean earlier in this thread about why the notification has to be sent
> > just before swait_event_exclusive(s2idle_wait_head, s2idle_state ==
> > S2IDLE_STATE_WAKE) and why the VMM has to have control over guest
> > resumption.
> >
> > Nevertheless, if extending acpi_s2idle_dev_ops is possible, why not
> > extend it with hypervisor_notify() and use it in the same place
> > where the hypercall is used in this patch? Do you see any issue with
> > that?
>
> If this needs to be a hypercall and the hypercall needs to go at that
> specific time, I wouldn't bother with extending acpi_s2idle_dev_ops.
> The whole idea there was that this would be less custom and could follow
> a spec.

Just to clarify - it probably doesn't need to be a hypercall. I
probably misled you by copy-pasting a handler name from the current
patch while aiming for the ACPI-like approach that you and Sean
suggested. What I meant is something like:
- extend acpi_s2idle_dev_ops with notify()
- implement a notify() handler for acpi_s2idle_dev_ops in the HYPE0001
driver (without a hypercall):
static void s2idle_notify(void)
{
        acpi_evaluate_dsm(acpi_handle, guid_of_HYPE0001, 0,
                          ACPI_HYPE_NOTIFY, NULL);
}

- register it via acpi_register_lps0_dev() from HYPE0001 driver
- use it just before swait_event_exclusive(s2idle_wait_head..) as it
is with original patch (the name of the function will be different):
static void s2idle_hypervisor_notify(void)
{
        struct acpi_s2idle_dev_ops *handler;
...
        list_for_each_entry(handler, &lps0_s2idle_devops_head, list_node) {
                if (handler->notify)
                        handler->notify();
        }
}

so it will be like:
-> s2idle_enter (just before swait_event_exclusive(s2idle_wait_head,.. )
--> s2idle_hypervisor_notify (as platform_s2idle_ops)
---> notify (as acpi_s2idle_dev_ops)
----> HYPE0001 device driver's notify () routine
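
To make the call chain above a bit more concrete before the real
implementation exists, here is a minimal userspace sketch of the dispatch.
It is only a mock-up under assumptions: struct s2idle_dev_ops,
register_lps0_dev(), hype_notify(), pmc_prepare() and run_demo() are
hypothetical stand-ins for acpi_s2idle_dev_ops, acpi_register_lps0_dev() and
the driver callbacks, a plain singly linked list replaces the kernel's
list_head, and the _DSM evaluation is replaced by a counter and a printf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for acpi_s2idle_dev_ops; notify() is the proposed hook. */
struct s2idle_dev_ops {
	struct s2idle_dev_ops *next;	/* the kernel uses struct list_head list_node */
	void (*prepare)(void);
	void (*notify)(void);		/* proposed: called right before the s2idle wait */
	void (*restore)(void);
};

static struct s2idle_dev_ops *devops_head;
static int notify_calls;

/* Mock of acpi_register_lps0_dev(): push the ops onto the handler list. */
static void register_lps0_dev(struct s2idle_dev_ops *ops)
{
	assert(ops != NULL);
	ops->next = devops_head;
	devops_head = ops;
}

/* Mock HYPE0001 handler; the real one would evaluate the device's _DSM. */
static void hype_notify(void)
{
	notify_calls++;
	printf("HYPE0001: notify VMM (mock _DSM evaluation)\n");
}

/* Mock PMC handler that only implements prepare(), so notify stays NULL. */
static void pmc_prepare(void)
{
	printf("pmc: prepare\n");
}

/* Walk every registered handler and call notify() where one is provided. */
static void s2idle_hypervisor_notify(void)
{
	struct s2idle_dev_ops *handler;

	for (handler = devops_head; handler; handler = handler->next)
		if (handler->notify)
			handler->notify();
}

static struct s2idle_dev_ops pmc_ops = { .prepare = pmc_prepare };
static struct s2idle_dev_ops hype_ops = { .notify = hype_notify };

/* Registers both mock drivers, fires the chain once, returns notify count. */
int run_demo(void)
{
	devops_head = NULL;
	notify_calls = 0;
	register_lps0_dev(&pmc_ops);
	register_lps0_dev(&hype_ops);
	/* In the kernel this call would sit just before swait_event_exclusive(). */
	s2idle_hypervisor_notify();
	return notify_calls;
}
```

Handlers that leave notify unset (like the mock PMC one) are simply skipped,
so existing prepare()/restore() users would not need to change.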

It will probably be easier to understand once I actually implement
it. Nevertheless, this way we ensure that:
- the notification is triggered as the very last step before actually
entering s2idle
- we can trap on MMIO/PIO by implementing a HYPE0001-specific _DSM
method, so this implementation will not become hypervisor
specific and will not use KVM as a "dumb pipe out to userspace", as Sean
suggested
- we will not have to change the existing Intel/AMD/Microsoft specs (3
different LPS0 GUIDs); thanks to HYPE0001's acpi_s2idle_dev_ops
involvement, we only need to care about the new HYPE0001 spec

>
> TBH - given the strong dependency on being the very last command and
> this being all Linux specific (you won't need to do something similar
> with Windows) - I think the way you already did it makes the most sense.
> It seems to me the ACPI device model doesn't really work well for this
> scenario.
>
> >
> >>
> >>> 2) using the newly introduced acpi_s2idle_dev_ops hypervisor_notify() call
> >>> will allow registering a handler from the Intel x86/intel/pmc/core.c driver
> >>> and/or the AMD x86/amd-pmc.c driver. Therefore we will need to get only
> >>> Intel and/or AMD approval for extending the ACPI LPS0 _DSM method,
> >>> correct?
> >>>
> >>
> >> Right now the only thing that hooks prepare()/restore() is the amd-pmc
> >> driver (unless Intel's PMC had a change I didn't catch yet).
> >>
> >> I don't think you should be changing any existing drivers but rather
> >> introduce another platform driver for this specific case.
> >>
> >> So it would be something like this:
> >>
> >> acpi_s2idle_prepare_late
> >> -> prepare()
> >> --> AMD: amd_pmc handler for prepare()
> >> --> Intel: intel_pmc handler for prepare() (conceptual)
> >> --> HYPE0001 device: new driver's prepare() routine
> >>
> >> So the platform driver would match the HYPE0001 device to load, and it
> >> wouldn't do anything other than provide a prepare()/restore() handler
> >> for your case.
> >>
> >> You don't need to change any existing specs.  If anything a new spec to
> >> go with this new ACPI device would be made.  Someone would need to
> >> reserve the ID and such for it, but I think you can mock it up in advance.
> >
> > Thank you for your explanation. This means that I should register
> > "HYPE" through https://uefi.org/PNP_ACPI_Registry before introducing
> > this new driver to Linux.
> > I have no experience with the above, so I wonder who should be
> > responsible for maintaining such ACPI ID since it will not belong to
> > any specific vendor? There is an example of e.g. COREBOOT PROJECT
> > using "BOOT" ACPI ID [1], which seems similar in terms of not
> > specifying any vendor but rather the project as a responsible entity.
> > Maybe you have some recommendations?
>
> Maybe LF could own a namespace and ID?  But I would suggest you make a
> mockup that everything works this way before you go explore too much.

Yeah, sure.

>
> Also make sure Rafael is aligned with your mockup.

Agree.

>
> >
> > I am also not sure if and where a specification describing such a
> > device has to be maintained. Since "HYPE0001" will have its own _DSM,
> > will it be required to document it somewhere, rather than just using
> > it in the driver and preparing proper ACPI tables for the guest?
> >
> >>
> >>> I wonder if this will be feasible, so I am just thinking out loud whether
> >>> there is some other mechanism that could be suggested and used upstream
> >>> so we could notify the hypervisor/VMM about the guest entering the s2idle
> >>> state. Especially since such a _DSM function would be introduced only to
> >>> trap on some fake MMIO/PIO access and would be useful only for guest ACPI
> >>> tables.
> >>>
> >>
> >> Do you need to worry about Microsoft guests using Modern Standby too or
> >> is that out of the scope of your problem set?  I think you'll be a lot
> >> more limited in how this can behave and where you can modify things if so.
> >>
> >
> > I do not need to worry about Microsoft guests.
>
> Makes life a lot easier :)

Agree :) and thank you for all your feedback,
Grzegorz