linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 1/1] x86: Skip WBINVD instruction for VM guest
@ 2021-11-16  0:50 Kuppuswamy Sathyanarayanan
  2021-11-16 16:24 ` Borislav Petkov
  0 siblings, 1 reply; 32+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-11-16  0:50 UTC
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Tony Luck, Dan Williams, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Kuppuswamy Sathyanarayanan, linux-kernel

ACPI mandates that CPU caches be flushed before entering any sleep
state. This ensures that the CPU and its caches can be powered down
without losing data.

ACPI-based VMs have maintained this sleep-state-entry behavior.
However, cache flushing for VM sleep state entry is useless. Unlike on
bare metal, guest sleep states are not correlated with potential data
loss of any kind; the host is responsible for data preservation. In
fact, some KVM configurations simply skip the cache flushing
instruction (see need_emulate_wbinvd()).

Further, on TDX systems, the WBINVD instruction causes an
unconditional #VE exception.  If this cache flushing remained, it would
need extra code in the form of a #VE handler.

All use of ACPI_FLUSH_CPU_CACHE() appears to be in sleep-state-related
code.

This means that the ACPI use of WBINVD is at *best* superfluous.

Disable ACPI CPU cache flushing on all X86_FEATURE_HYPERVISOR systems,
which includes TDX.

Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-acpi@vger.kernel.org
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/acenv.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..d4162e94bee8 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
 #define _ASM_X86_ACENV_H
 
 #include <asm/special_insns.h>
+#include <asm/cpu.h>
 
 /* Asm macros */
 
-#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
+#define ACPI_FLUSH_CPU_CACHE()				\
+do {							\
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))	\
+		wbinvd();				\
+} while (0)
 
 int __acpi_acquire_global_lock(unsigned int *lock);
 int __acpi_release_global_lock(unsigned int *lock);
-- 
2.25.1



* Re: [PATCH v1 1/1] x86: Skip WBINVD instruction for VM guest
  2021-11-16  0:50 [PATCH v1 1/1] x86: Skip WBINVD instruction for VM guest Kuppuswamy Sathyanarayanan
@ 2021-11-16 16:24 ` Borislav Petkov
  2021-11-16 16:36   ` Sathyanarayanan Kuppuswamy
  2021-11-19  4:03   ` [PATCH v2] " Kuppuswamy Sathyanarayanan
  0 siblings, 2 replies; 32+ messages in thread
From: Borislav Petkov @ 2021-11-16 16:24 UTC
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Dan Williams, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, linux-kernel

On Mon, Nov 15, 2021 at 04:50:27PM -0800, Kuppuswamy Sathyanarayanan wrote:
> -#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
> +#define ACPI_FLUSH_CPU_CACHE()				\
> +do {							\
> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))	\

cpu_feature_enabled()

If you wanna query an X86_FEATURE_* bit, from now on, use only this
function.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
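
For readers following the thread, here is a minimal user-space sketch of why the two helpers can give different answers. It assumes that cpu_feature_enabled() consults a build-time mask of disabled features before looking at the detected CPU bits, while boot_cpu_has() is modelled as a plain lookup of the detected bits. All model_* names, the feature numbers and the mask value are hypothetical and exist only for this illustration; none of this is the kernel's actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature numbers, for this model only. */
enum { MODEL_FEAT_HYPERVISOR = 0, MODEL_FEAT_COMPILED_OUT = 1 };

/* Features the (imaginary) build configuration has switched off. */
#define MODEL_DISABLED_MASK (1u << MODEL_FEAT_COMPILED_OUT)

/* Modelled as a plain lookup of the bits detected at boot. */
bool model_boot_cpu_has(uint32_t detected, unsigned int bit)
{
	return (detected >> bit) & 1u;
}

/* Modelled as: build-time mask first, detected bits second. */
bool model_cpu_feature_enabled(uint32_t detected, unsigned int bit)
{
	if (MODEL_DISABLED_MASK & (1u << bit))
		return false;	/* compiled out: never reported as enabled */
	return model_boot_cpu_has(detected, bit);
}
```

In this model a feature that the hardware reports but the build has disabled is visible through model_boot_cpu_has() and hidden by model_cpu_feature_enabled(), which is one plausible reason to prefer the latter everywhere.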


* Re: [PATCH v1 1/1] x86: Skip WBINVD instruction for VM guest
  2021-11-16 16:24 ` Borislav Petkov
@ 2021-11-16 16:36   ` Sathyanarayanan Kuppuswamy
  2021-11-19  4:03   ` [PATCH v2] " Kuppuswamy Sathyanarayanan
  1 sibling, 0 replies; 32+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2021-11-16 16:36 UTC
  To: Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Dan Williams, Andi Kleen, Kirill Shutemov,
	Kuppuswamy Sathyanarayanan, linux-kernel

Hi,

On 11/16/21 8:24 AM, Borislav Petkov wrote:
> On Mon, Nov 15, 2021 at 04:50:27PM -0800, Kuppuswamy Sathyanarayanan wrote:
>> -#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
>> +#define ACPI_FLUSH_CPU_CACHE()				\
>> +do {							\
>> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))	\
> 
> cpu_feature_enabled()
> 
> If you wanna query a X86_FEATURE_* bit, from now on, use only this
> function.
> 

Ok. I will change it in the next version.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-11-16 16:24 ` Borislav Petkov
  2021-11-16 16:36   ` Sathyanarayanan Kuppuswamy
@ 2021-11-19  4:03   ` Kuppuswamy Sathyanarayanan
  2021-11-25  0:40     ` Thomas Gleixner
  1 sibling, 1 reply; 32+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2021-11-19  4:03 UTC
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Rafael J . Wysocki
  Cc: H . Peter Anvin, Tony Luck, Dan Williams, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Kuppuswamy Sathyanarayanan, linux-kernel, linux-acpi

ACPI mandates that CPU caches be flushed before entering any sleep
state. This ensures that the CPU and its caches can be powered down
without losing data.

ACPI-based VMs have maintained this sleep-state-entry behavior.
However, cache flushing for VM sleep state entry is useless. Unlike on
bare metal, guest sleep states are not correlated with potential data
loss of any kind; the host is responsible for data preservation. In
fact, some KVM configurations simply skip the cache flushing
instruction (see need_emulate_wbinvd()).

Further, on TDX systems, the WBINVD instruction causes an
unconditional #VE exception.  If this cache flushing remained, it would
need extra code in the form of a #VE handler.

All use of ACPI_FLUSH_CPU_CACHE() appears to be in sleep-state-related
code.

This means that the ACPI use of WBINVD is at *best* superfluous.

Disable ACPI CPU cache flushing on all X86_FEATURE_HYPERVISOR systems,
which includes TDX.

Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-acpi@vger.kernel.org
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---

Changes since v1:
 * Used cpu_feature_enabled() instead of boot_cpu_has().

 arch/x86/include/asm/acenv.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/acenv.h b/arch/x86/include/asm/acenv.h
index 9aff97f0de7f..dba05c74bd7e 100644
--- a/arch/x86/include/asm/acenv.h
+++ b/arch/x86/include/asm/acenv.h
@@ -10,10 +10,15 @@
 #define _ASM_X86_ACENV_H
 
 #include <asm/special_insns.h>
+#include <asm/cpu.h>
 
 /* Asm macros */
 
-#define ACPI_FLUSH_CPU_CACHE()	wbinvd()
+#define ACPI_FLUSH_CPU_CACHE()					\
+do {								\
+	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))	\
+		wbinvd();					\
+} while (0)
 
 int __acpi_acquire_global_lock(unsigned int *lock);
 int __acpi_release_global_lock(unsigned int *lock);
-- 
2.25.1
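
A note on the macro shape used in the patch: wrapping the conditional in do { } while (0) lets the macro be used as a single statement, including in front of an else. The following user-space sketch illustrates that under stated assumptions. model_is_guest stands in for cpu_feature_enabled(X86_FEATURE_HYPERVISOR), model_wbinvd() merely counts calls instead of executing WBINVD, and model_enter_sleep() is a hypothetical caller invented for the example.

```c
#include <stdbool.h>

/* Stand-in for cpu_feature_enabled(X86_FEATURE_HYPERVISOR); hypothetical. */
bool model_is_guest;

/* Number of modelled cache flushes; the real WBINVD is privileged. */
int model_flush_count;

void model_wbinvd(void)
{
	model_flush_count++;
}

/* Same shape as the patched ACPI_FLUSH_CPU_CACHE(). */
#define MODEL_FLUSH_CPU_CACHE()		\
do {					\
	if (!model_is_guest)		\
		model_wbinvd();		\
} while (0)

/* Hypothetical caller: returns the flush count, or -1 if no flush was asked for. */
int model_enter_sleep(bool is_guest, bool want_flush)
{
	model_is_guest = is_guest;
	model_flush_count = 0;

	if (want_flush)
		MODEL_FLUSH_CPU_CACHE();	/* expands to one statement */
	else
		return -1;			/* binds to the outer if, as intended */

	return model_flush_count;
}
```

Without the do/while wrapper, the else above would pair with the if inside the macro, so the wrapper is about statement hygiene rather than behaviour.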



* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-11-19  4:03   ` [PATCH v2] " Kuppuswamy Sathyanarayanan
@ 2021-11-25  0:40     ` Thomas Gleixner
  2021-12-02 22:21       ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2021-11-25  0:40 UTC
  To: Kuppuswamy Sathyanarayanan, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J . Wysocki
  Cc: H . Peter Anvin, Tony Luck, Dan Williams, Andi Kleen,
	Kirill Shutemov, Kuppuswamy Sathyanarayanan,
	Kuppuswamy Sathyanarayanan, linux-kernel, linux-acpi

Kuppuswamy,

On Thu, Nov 18 2021 at 20:03, Kuppuswamy Sathyanarayanan wrote:
> ACPI mandates that CPU caches be flushed before entering any sleep
> state. This ensures that the CPU and its caches can be powered down
> without losing data.
>
> ACPI-based VMs have maintained this sleep-state-entry behavior.
> However, cache flushing for VM sleep state entry is useless. Unlike on
> bare metal, guest sleep states are not correlated with potential data
> loss of any kind; the host is responsible for data preservation. In
> fact, some KVM configurations simply skip the cache flushing
> instruction (see need_emulate_wbinvd()).

KVM starts out with kvm->arch.noncoherent_dma_count = 0 which makes
need_emulate_wbinvd() skip WBINVD emulation. So far so good.

VFIO has code to invoke kvm_arch_register_noncoherent_dma() which
increments the count which will subsequently cause WBINVD emulation to
be enabled. What now?

> Further, on TDX systems, the WBINVD instruction causes an
> unconditional #VE exception.  If this cache flushing remained, it would
> need extra code in the form of a #VE handler.
>
> All use of ACPI_FLUSH_CPU_CACHE() appears to be in sleep-state-related
> code.

C3 is considered a sleep state nowadays? Also ACPI_FLUSH_CPU_CACHE() is
used in other places which have nothing to do with sleep states.

git grep is not rocket science to use.

> This means that the ACPI use of WBINVD is at *best* superfluous.

Really? You probably meant to say:

  This means that the ACPI usage of WBINVD from within a guest is at
  best superfluous.

No?

But aside of that this does not give any reasonable answers why
disabling WBINVD for guests unconditionally in ACPI_FLUSH_CPU_CACHE()
and the argumentation vs. need_emulate_wbinvd() are actually correct
under all circumstances.

I'm neither going to do that analysis nor am I going to accept a patch
which comes with 'appears' based arguments and some handwavy references
to disabled WBINVD emulation code which can obviously be enabled for a
reason.

The even more interesting question for me is how a TDX guest is dealing
with all other potential invocations of WBINVD all over the place. Are
they all going to get the same treatment or are those magically going to
be never executed in TDX guests?

I really have to ask why SEV can deal with WBINVD and other things just
nicely by implementing trivial #VC handler functions, but TDX has to
prematurely optimize the kernel tree based on half-baked arguments?

Having a few trivial #VE handlers is not the end of the world. You can
revisit that once basic support for TDX is merged in order to gain
performance or whatever.

Either that or you provide patches with arguments which are based on
proper analysis and not on 'appears to' observations.

Thanks,

        tglx
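
The KVM behaviour described in the first part of this reply can be sketched in user space as follows. This is a simplified model, not KVM code: struct model_vm and the model_* functions are hypothetical stand-ins for kvm->arch.noncoherent_dma_count, kvm_arch_register_noncoherent_dma() and need_emulate_wbinvd(), and the unregister helper is assumed for symmetry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the per-VM state KVM keeps. */
struct model_vm {
	int noncoherent_dma_count;	/* 0 for a freshly created VM */
};

/* Models what VFIO triggers when a non-coherent DMA device is attached. */
void model_register_noncoherent_dma(struct model_vm *vm)
{
	vm->noncoherent_dma_count++;
}

/* Assumed counterpart for device removal. */
void model_unregister_noncoherent_dma(struct model_vm *vm)
{
	assert(vm->noncoherent_dma_count > 0);
	vm->noncoherent_dma_count--;
}

/* WBINVD is emulated only while at least one such device is registered. */
bool model_need_emulate_wbinvd(const struct model_vm *vm)
{
	return vm->noncoherent_dma_count > 0;
}
```

The point of the model is that "KVM skips WBINVD" is a property of the current VM configuration, not of guests in general, which is why it is a weak argument for dropping the flush unconditionally.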


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-11-25  0:40     ` Thomas Gleixner
@ 2021-12-02 22:21       ` Kirill A. Shutemov
  2021-12-02 22:38         ` Dave Hansen
  2021-12-02 23:48         ` Thomas Gleixner
  0 siblings, 2 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-02 22:21 UTC
  To: Thomas Gleixner
  Cc: Kuppuswamy Sathyanarayanan, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J . Wysocki, H . Peter Anvin, Tony Luck,
	Dan Williams, Andi Kleen, Kuppuswamy Sathyanarayanan,
	linux-kernel, linux-acpi

On Thu, Nov 25, 2021 at 01:40:24AM +0100, Thomas Gleixner wrote:
> Kuppuswamy,
> 
> On Thu, Nov 18 2021 at 20:03, Kuppuswamy Sathyanarayanan wrote:
> > ACPI mandates that CPU caches be flushed before entering any sleep
> > state. This ensures that the CPU and its caches can be powered down
> > without losing data.
> >
> > ACPI-based VMs have maintained this sleep-state-entry behavior.
> > However, cache flushing for VM sleep state entry is useless. Unlike on
> > bare metal, guest sleep states are not correlated with potential data
> > loss of any kind; the host is responsible for data preservation. In
> > fact, some KVM configurations simply skip the cache flushing
> > instruction (see need_emulate_wbinvd()).
> 
> KVM starts out with kvm->arch.noncoherent_dma_count = 0 which makes
> need_emulate_wbinvd() skip WBINVD emulation. So far so good.
> 
> VFIO has code to invoke kvm_arch_register_noncoherent_dma() which
> increments the count which will subsequently cause WBINVD emulation to
> be enabled. What now?
> 
> > Further, on TDX systems, the WBINVD instruction causes an
> > unconditional #VE exception.  If this cache flushing remained, it would
> > need extra code in the form of a #VE handler.
> >
> > All use of ACPI_FLUSH_CPU_CACHE() appears to be in sleep-state-related
> > code.
> 
> C3 is considered a sleep state nowadays? Also ACPI_FLUSH_CPU_CACHE() is
> used in other places which have nothing to do with sleep states.
> 
> git grep is not rocket science to use.
> 
> > This means that the ACPI use of WBINVD is at *best* superfluous.
> 
> Really? You probably meant to say:
> 
>   This means that the ACPI usage of WBINVD from within a guest is at
>   best superfluous.
> 
> No?
> 
> But aside of that this does not give any reasonable answers why
> disabling WBINVD for guests unconditionally in ACPI_FLUSH_CPU_CACHE()
> and the argumentation vs. need_emulate_wbinvd() are actually correct
> under all circumstances.
> 
> I'm neither going to do that analysis nor am I going to accept a patch
> which comes with 'appears' based arguments and some handwavy references
> to disabled WBINVD emulation code which can obviously be enabled for a
> reason.
> 
> The even more interesting question for me is how a TDX guest is dealing
> with all other potential invocations of WBINVD all over the place. Are
> they all going to get the same treatment or are those magically going to
> be never executed in TDX guests?
> 
> I really have to ask why SEV can deal with WBINVD and other things just
> nicely by implementing trivial #VC handler functions, but TDX has to
> prematurely optimize the kernel tree based on half-baked arguments?
> 
> Having a few trivial #VE handlers is not the end of the world. You can
> revisit that once basic support for TDX is merged in order to gain
> performance or whatever.
> 
> Either that or you provide patches with arguments which are based on
> proper analysis and not on 'appears to' observations.

I think the right solution to the WBINVD would be to add a #VE handler
that does nothing. We don't have a reasonable way to handle it from within
the guest. We can call the VMM in hope that it would handle it, but VMM is
untrusted and it can ignore the request.

Dave suggested that we need to do a code audit to make sure that there's no
user inside the TDX guest environment that relies on WBINVD to work correctly.

Below is the full call tree of WBINVD. It is substantially larger than I
anticipated from the initial grep.

Conclusions:

  - Most of the callers are in ACPI code on changing S-states. Ignoring the
    cache flush for an S-state change on a virtual machine should be safe.

  - The only WBINVD I was able to trigger is on poweroff from ACPI code.
    Reboot should also trigger it, but for some reason I don't see it.

  - A few callers are in CPU offline code. TDX does not allow offlining a
    CPU, as we cannot bring it back -- we don't have SIPI. And even if
    offlining worked for a vCPU, it should be safe to ignore WBINVD there.

  - NVDIMMs are not supported inside TDX. If that changes, we would need
    to deal with cache flushing for this case. Hopefully, we would be able
    to avoid WBINVD.

  - Cache QoS and MTRR use WBINVD. They are disabled in TDX, but whether
    the feature is advertised is controlled by the VMM. We would need to
    filter CPUID/MSRs to make sure the VMM cannot mess with them.

Is this good enough justification for a do-nothing #VE WBINVD handler?

WBINVD
  native_wbinvd()
    wbinvd()
      ACPI_FLUSH_CPU_CACHE()
        acpi_hw_extended_sleep()
          acpi_enter_sleep_state()
            x86_acpi_enter_sleep_state()
              do_suspend_lowlevel()
                x86_acpi_suspend_lowlevel()
                  acpi_suspend_enter()
                    >>> On S3: No suspend-to-ram -- no problem
            acpi_db_do_one_sleep_state()
              acpi_db_sleep()
                acpi_db_command_dispatch()
                  >>> "SLEEP" command of the ACPI debugger. I guess it can trigger poweroff. WBINVD doesn't make any difference in TDX.
            acpi_hibernation_enter()
              >>> On S4. No hibernate -- no problem.
            acpi_power_off()
              >>> On S5. Triggerable on poweroff, but safe to ignore WBINVD here on TDX.
            acpi_suspend_enter()
              >>> On S1. No S1 -- no problem.
            xen_acpi_suspend_lowlevel()
              >>> N/A to TDX.
        acpi_hw_legacy_sleep()
          acpi_enter_sleep_state()
            >>> See above. For ACPI_REDUCED_HARDWARE.
        acpi_enter_sleep_state_s4bios()
          >>> No users? Or I failed to decipher the ACPI code.
        acpi_idle_enter()
          acpi_processor_setup_cstates()
            acpi_processor_setup_cpuidle_states()
              acpi_processor_power_state_has_changed()
                acpi_processor_notify()
                  >>> Looks like the driver gets an event when the number of power states changes, but I may be mistaken. Anyway, skipping WBINVD is safe.
              acpi_processor_power_init()
                >>> Only applicable if acpi_idle_driver is in use. N/A to TDX.
        acpi_idle_enter_s2idle()
          acpi_processor_setup_cstates()
            >>> See above.
        acpi_idle_play_dead()
          acpi_processor_setup_cstates()
            >>> See above.
        acpi_sleep_prepare()
          >>> On the way to S3/S4/S5. Safe to ignore WBINVD
        acpi_suspend_enter()
          >>> On the way to S3/S4/S5. Safe to ignore WBINVD
        acpi_hibernation_enter()
          >>> On S4, No S4 -- no problem.
        <Bunch of callers in cpufreq/longhaul.c>
          >>> CPU frequency driver for VIA Cyrix CPU. N/A to TDX.
      flush_agp_cache()
        ipi_handler()
          global_cache_flush()
            >>> Used by a bunch of random AGP drivers. N/A to TDX: device passthrough is not supported.
      wbinvd_on_cpu()
        amd_l3_disable_index()
          >>> N/A to TDX
      gart_iommu_init()
        >>> N/A to TDX
      init_amd_k6()
        >>> N/A to TDX
      amd_set_mtrr()
        >>> N/A to TDX
      prepare_set() in mtrr/cyrix.c
        >>> N/A to TDX
      post_set() in mtrr/cyrix.c
        >>> N/A to TDX
      prepare_set() mtrr/generic.c
        >>> MTRR is disabled, but that is under the VMM's control.
      mwait_play_dead()
        native_play_dead()
          sev_es_play_dead()
            >>> N/A to TDX.
          play_dead()
            arch_cpu_idle_enter()
              do_idle()
                >>> Only for offline CPUs. Offlining is disabled on TDX.
      hlt_play_dead()
        native_play_dead()
          >>> See above
        resume_play_dead()
          hibernate_resume_nonboot_cpu_disable()
            >>> No hibernate -- no problem.
      pseudo_lock_fn()
        rdtgroup_pseudo_lock_create()
          rdtgroup_schemata_write()
            res_common_files[]
              rdtgroup_init()
                resctrl_late_init()
                  >>> Depends on Cache QoS features that are configured by the VMM.
      wbinvd_ipi() in kvm/x86.c
        >>> KVM emulation of WBINVD. N/A for TDX guest.
      __wbinvd()
        wbinvd_on_cpu()
          >>> See above
        wbinvd_on_all_cpus()
          sev_flush_asids() and other users in kvm/svm/sev.c
            >>> N/A to TDX
          nvdimm_invalidate_cache()
            >>> No NVDIMMs in TDX
          i830_chipset_flush()
            >>> N/A to TDX
          __sev_platform_init_locked()
            >>> N/A to TDX
          drm_clflush_virt_range(), drm_clflush_pages(), drm_clflush_sg()
            >>> Only for !X86_FEATURE_CLFLUSH, N/A to TDX.
          Few callers in i915
            >>> N/A to TDX
      __sme_early_enc_dec()
        >>> N/A to TDX
      __cpa_flush_all()
        cpa_flush_all()
          cpa_flush()
            >>> Only for !X86_FEATURE_CLFLUSH. N/A to TDX.
      powernow_k6_set_cpu_multiplier()
        >>> N/A to TDX
      disable_caches()
        inject_write_store() in amd64_edac.c
          >>> N/A to TDX
      drm_ati_pcigart_init()
        >>> N/A to TDX
      nettel_init() and other nettel users
        >>> N/A to TDX
      atomisp_acc_start() and other atomisp users
        >>> N/A to TDX
    apply_microcode_intel()/apply_microcode_early()
      >>> N/A to TDX
  identity_mapped()
    >>> Only for AMD SME
  __enc_copy in /mem_encrypt_boot.S
    >>> N/A to TDX
  wakeup_start in platform/olpc/xo1-wakeup.S
    >>> N/A to TDX
  machine_real_restart_asm16 in realmode/rm/reboot.S
    >>> Safe to ignore WBINVD on TDX
  trampoline_start in realmode/rm/trampoline_64.S
    >>> TDX doesn't use realmode trampoline
  flush_cache() in i810_main.h
    >>> N/A to TDX
-- 
 Kirill A. Shutemov


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-02 22:21       ` Kirill A. Shutemov
@ 2021-12-02 22:38         ` Dave Hansen
  2021-12-02 23:48         ` Thomas Gleixner
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2021-12-02 22:38 UTC
  To: Kirill A. Shutemov, Thomas Gleixner
  Cc: Kuppuswamy Sathyanarayanan, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J . Wysocki, H . Peter Anvin, Tony Luck,
	Dan Williams, Andi Kleen, Kuppuswamy Sathyanarayanan,
	linux-kernel, linux-acpi

On 12/2/21 2:21 PM, Kirill A. Shutemov wrote:
>   - NVDIMMs are not supported inside TDX. If it will change we would need
>     to deal with cache flushing for this case. Hopefully, we would be able
>     to avoid WBINVD.

Maybe we can use this as an example since we have our friendly NVDIMM
developers on cc already.

Let's say that tomorrow Intel decides that NVDIMMs are OK to use in TDX.
It might not be a good idea, but Intel could arbitrarily start
supporting them immediately.  Further, someone could take today's kernel
and stick it on some future, fancy platform which does support TDX and
NVDIMMs.  In other words, there are multiple reasons we can't just say
"TDX doesn't support NVDIMMs" and forget about it.

If either of those happened, we'd have an NVDIMM driver which uses
WBINVD, expects cache flushing and subsequently loses data.  I think we
can all agree that's a bad idea.

So, we've got two different cases that land in the #VE handler:

	1. Silly ACPI code that doesn't need WBINVD behavior

	2. Less silly NVDIMM code that badly needs WBINVD behavior

... but we have a #VE handler that can't tell the difference.

To me, that says we need to do _something_ different than just papering
over the WBINVD in the #VE handler.

Does anyone have a different take on it?
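
The dilemma above can be put into a small user-space model. It is only a sketch: the two caller kinds, the do-nothing handler and every model_* name are hypothetical, and "data at risk" simply means that a caller which depends on the flush did not get one.

```c
#include <stdbool.h>

/* The two kinds of WBINVD users from the list above (hypothetical labels). */
enum model_caller {
	MODEL_CALLER_ACPI_SLEEP,	/* does not need WBINVD behaviour */
	MODEL_CALLER_NVDIMM,		/* badly needs WBINVD behaviour */
};

/* True when the caller depends on the flush actually happening. */
bool model_caller_needs_flush(enum model_caller who)
{
	return who == MODEL_CALLER_NVDIMM;
}

/*
 * Do-nothing handler: it takes no argument because all it learns is
 * that WBINVD trapped, not who issued it. Returns whether a flush ran.
 */
bool model_ve_handle_wbinvd(void)
{
	return false;
}

/* True if the caller's data is at risk after its WBINVD was swallowed. */
bool model_data_at_risk(enum model_caller who)
{
	bool flushed = model_ve_handle_wbinvd();

	return model_caller_needs_flush(who) && !flushed;
}
```

The handler's empty parameter list is the whole argument: any distinction between the two cases has to be made at the call site, before the instruction is executed.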


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-02 22:21       ` Kirill A. Shutemov
  2021-12-02 22:38         ` Dave Hansen
@ 2021-12-02 23:48         ` Thomas Gleixner
  2021-12-03 23:49           ` Kirill A. Shutemov
  1 sibling, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2021-12-02 23:48 UTC
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J . Wysocki, H . Peter Anvin, Tony Luck,
	Dan Williams, Andi Kleen, Kuppuswamy Sathyanarayanan,
	linux-kernel, linux-acpi

Kirill,

On Fri, Dec 03 2021 at 01:21, Kirill A. Shutemov wrote:
> On Thu, Nov 25, 2021 at 01:40:24AM +0100, Thomas Gleixner wrote:
>> Kuppuswamy,
>> Either that or you provide patches with arguments which are based on
>> proper analysis and not on 'appears to' observations.
>
> I think the right solution to the WBINVD would be to add a #VE handler
> that does nothing. We don't have a reasonable way to handle it from within
> the guest. We can call the VMM in hope that it would handle it, but VMM is
> untrusted and it can ignore the request.
>
> Dave suggested that we need to do a code audit to make sure that there's no
> user inside the TDX guest environment that relies on WBINVD to work correctly.
>
> Below is the full call tree of WBINVD. It is substantially larger than I
> anticipated from the initial grep.
>
> Conclusions:
>
>   - Most of the callers are in ACPI code on changing S-states. Ignoring the
>     cache flush for an S-state change on a virtual machine should be safe.
>
>   - The only WBINVD I was able to trigger is on poweroff from ACPI code.
>     Reboot should also trigger it, but for some reason I don't see it.
>
>   - A few callers are in CPU offline code. TDX does not allow offlining a
>     CPU, as we cannot bring it back -- we don't have SIPI. And even if
>     offlining worked for a vCPU, it should be safe to ignore WBINVD there.
>
>   - NVDIMMs are not supported inside TDX. If that changes, we would need
>     to deal with cache flushing for this case. Hopefully, we would be able
>     to avoid WBINVD.
>
>   - Cache QoS and MTRR use WBINVD. They are disabled in TDX, but whether
>     the feature is advertised is controlled by the VMM. We would need to
>     filter CPUID/MSRs to make sure the VMM cannot mess with them.
>
> Is this good enough justification for a do-nothing #VE WBINVD handler?

First of all, thank you very much for this very profound analysis.

This is really what I was asking for and you probably went even a step
deeper than that. Much appreciated.

What we should do instead of a wholesale 'let's ignore WBINVD' is to
have a separate function/macro:

 ACPI_FLUSH_CPU_CACHE_PHYS()

and invoke that from the functions which are considered to be safe.

That would default to ACPI_FLUSH_CPU_CACHE() for other architectures
obviously.

Then you can rightfully do:

#define ACPI_FLUSH_CPU_CACHE_PHYS()		\
do {						\
	if (!cpu_feature_enabled(XXX))		\
		wbinvd();			\
} while (0)
                
where $XXX might be FEATURE_TDX_GUEST for paranoia sake and then
extended to X86_FEATURE_HYPERVISOR if everyone agrees.

Then you have the #VE handler which just acts on any other wbinvd
invocation via warn, panic, whatever, no?

Thanks,

        tglx
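
A user-space sketch of the split proposed in this reply, under stated assumptions. Only ACPI_FLUSH_CPU_CACHE_PHYS() is named in the thread; the MODEL_ prefix, model_tdx_guest (standing in for cpu_feature_enabled(XXX)), the counters and model_run() are hypothetical. A WBINVD that reaches the #VE handler is modelled as a counted warning rather than a real trap.

```c
#include <stdbool.h>

bool model_tdx_guest;	/* stands in for cpu_feature_enabled(XXX) */
int model_flushes;	/* WBINVDs that really ran */
int model_ve_warnings;	/* WBINVDs that landed in the #VE handler */

/* In the modelled TDX guest every WBINVD traps and the handler complains. */
void model_wbinvd(void)
{
	if (model_tdx_guest)
		model_ve_warnings++;
	else
		model_flushes++;
}

/* Unchanged macro for call sites that were not audited. */
#define MODEL_ACPI_FLUSH_CPU_CACHE()	model_wbinvd()

/* Arch override, used only at call sites audited as safe to skip. */
#define MODEL_ACPI_FLUSH_CPU_CACHE_PHYS()	\
do {						\
	if (!model_tdx_guest)			\
		model_wbinvd();			\
} while (0)

/* Fallback for architectures that do not provide the override. */
#ifndef MODEL_ACPI_FLUSH_CPU_CACHE_PHYS
#define MODEL_ACPI_FLUSH_CPU_CACHE_PHYS()	MODEL_ACPI_FLUSH_CPU_CACHE()
#endif

/* Runs one call site and returns how many warnings the handler raised. */
int model_run(bool tdx_guest, bool audited_site)
{
	model_tdx_guest = tdx_guest;
	model_flushes = 0;
	model_ve_warnings = 0;

	if (audited_site)
		MODEL_ACPI_FLUSH_CPU_CACHE_PHYS();
	else
		MODEL_ACPI_FLUSH_CPU_CACHE();

	return model_ve_warnings;
}
```

In this model, audited call sites stay silent in a guest, while any WBINVD user that nobody looked at still shows up as a warning, which is the behaviour the reply asks for.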



* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-02 23:48         ` Thomas Gleixner
@ 2021-12-03 23:49           ` Kirill A. Shutemov
  2021-12-04  0:20             ` Dave Hansen
  2021-12-04 20:27             ` Rafael J. Wysocki
  0 siblings, 2 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-03 23:49 UTC
  To: Thomas Gleixner
  Cc: Kuppuswamy Sathyanarayanan, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J . Wysocki, H . Peter Anvin, Tony Luck,
	Dan Williams, Andi Kleen, Kuppuswamy Sathyanarayanan,
	linux-kernel, linux-acpi

On Fri, Dec 03, 2021 at 12:48:43AM +0100, Thomas Gleixner wrote:
> Kirill,
> 
> On Fri, Dec 03 2021 at 01:21, Kirill A. Shutemov wrote:
> > On Thu, Nov 25, 2021 at 01:40:24AM +0100, Thomas Gleixner wrote:
> >> Kuppuswamy,
> >> Either that or you provide patches with arguments which are based on
> >> proper analysis and not on 'appears to' observations.
> >
> > I think the right solution to the WBINVD would be to add a #VE handler
> > that does nothing. We don't have a reasonable way to handle it from within
> > the guest. We can call the VMM in hope that it would handle it, but VMM is
> > untrusted and it can ignore the request.
> >
> > Dave suggested that we need to do a code audit to make sure that there's no
> > user inside the TDX guest environment that relies on WBINVD to work correctly.
> >
> > Below is the full call tree of WBINVD. It is substantially larger than I
> > anticipated from the initial grep.
> >
> > Conclusions:
> >
> >   - Most of the callers are in ACPI code on changing S-states. Ignoring the
> >     cache flush for an S-state change on a virtual machine should be safe.
> >
> >   - The only WBINVD I was able to trigger is on poweroff from ACPI code.
> >     Reboot should also trigger it, but for some reason I don't see it.
> >
> >   - A few callers are in CPU offline code. TDX does not allow offlining a
> >     CPU, as we cannot bring it back -- we don't have SIPI. And even if
> >     offlining worked for a vCPU, it should be safe to ignore WBINVD there.
> >
> >   - NVDIMMs are not supported inside TDX. If that changes, we would need
> >     to deal with cache flushing for this case. Hopefully, we would be able
> >     to avoid WBINVD.
> >
> >   - Cache QoS and MTRR use WBINVD. They are disabled in TDX, but whether
> >     the feature is advertised is controlled by the VMM. We would need to
> >     filter CPUID/MSRs to make sure the VMM cannot mess with them.
> >
> > Is this good enough justification for a do-nothing #VE WBINVD handler?
> 
> first of all thank you very much for this very profound analysis.
> 
> This is really what I was asking for and you probably went even a step
> deeper than that. Very appreciated.
> 
> What we should do instead of doing a wholesale let's ignore WBINVD is to
> have a separate function/macro:
> 
>  ACPI_FLUSH_CPU_CACHE_PHYS()
> 
> and invoke that from the functions which are considered to be safe.
> 
> That would default to ACPI_FLUSH_CPU_CACHE() for other architectures
> obviously.
> 
> Then you can rightfully do:
> 
> #define ACPI_FLUSH_CPU_CACHE_PHYS()     \
>         if (!cpu_feature_enabled(XXX))	\
>         	wbinvd();               \              
>                 
> where $XXX might be FEATURE_TDX_GUEST for paranoia sake and then
> extended to X86_FEATURE_HYPERVISOR if everyone agrees.
> 
> Then you have the #VE handler which just acts on any other wbinvd
> invocation via warn, panic, whatever, no?

I found another angle on the problem. According to the ACPI spec v6.4,
section 16.2, cache flushing is required on the way to S1, S2 and S3.
And according to section 8.2, it is also required on the way to C3.

TDX doesn't support these S- and C-states. TDX supports only S0 and S5.

Adjusting the code to match the spec would make TDX work automagically.

Any opinions on the patch below?

I didn't touch the ACPI_FLUSH_CPU_CACHE() users in cpufreq/longhaul.c because
they might be outside of the ACPI spec; I don't know.

diff --git a/drivers/acpi/acpica/hwesleep.c b/drivers/acpi/acpica/hwesleep.c
index 808fdf54aeeb..b004a72a426e 100644
--- a/drivers/acpi/acpica/hwesleep.c
+++ b/drivers/acpi/acpica/hwesleep.c
@@ -104,7 +104,8 @@ acpi_status acpi_hw_extended_sleep(u8 sleep_state)
 
 	/* Flush caches, as per ACPI specification */
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (sleep_state >= ACPI_STATE_S1 && sleep_state <= ACPI_STATE_S3)
+		ACPI_FLUSH_CPU_CACHE();
 
 	status = acpi_os_enter_sleep(sleep_state, sleep_control, 0);
 	if (status == AE_CTRL_TERMINATE) {
diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
index 34a3825f25d3..bfcd66efeb48 100644
--- a/drivers/acpi/acpica/hwsleep.c
+++ b/drivers/acpi/acpica/hwsleep.c
@@ -110,7 +110,8 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
 
 	/* Flush caches, as per ACPI specification */
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (sleep_state >= ACPI_STATE_S1 && sleep_state <= ACPI_STATE_S3)
+		ACPI_FLUSH_CPU_CACHE();
 
 	status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
 	if (status == AE_CTRL_TERMINATE) {
diff --git a/drivers/acpi/acpica/hwxfsleep.c b/drivers/acpi/acpica/hwxfsleep.c
index e4cde23a2906..ba77598ee43e 100644
--- a/drivers/acpi/acpica/hwxfsleep.c
+++ b/drivers/acpi/acpica/hwxfsleep.c
@@ -162,8 +162,6 @@ acpi_status acpi_enter_sleep_state_s4bios(void)
 		return_ACPI_STATUS(status);
 	}
 
-	ACPI_FLUSH_CPU_CACHE();
-
 	status = acpi_hw_write_port(acpi_gbl_FADT.smi_command,
 				    (u32)acpi_gbl_FADT.s4_bios_request, 8);
 	if (ACPI_FAILURE(status)) {
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 76ef1bcc8848..01495aca850e 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -567,7 +567,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 {
 	struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (cx->type == ACPI_STATE_C3)
+		ACPI_FLUSH_CPU_CACHE();
 
 	while (1) {
 
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index eaa47753b758..a81d08b762c2 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -73,7 +73,9 @@ static int acpi_sleep_prepare(u32 acpi_state)
 		acpi_set_waking_vector(acpi_wakeup_address);
 
 	}
-	ACPI_FLUSH_CPU_CACHE();
+
+	if (acpi_state >= ACPI_STATE_S1 && acpi_state <= ACPI_STATE_S3)
+		ACPI_FLUSH_CPU_CACHE();
 #endif
 	pr_info("Preparing to enter system sleep state S%d\n", acpi_state);
 	acpi_enable_wakeup_devices(acpi_state);
@@ -566,7 +568,8 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
 	u32 acpi_state = acpi_target_sleep_state;
 	int error;
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (acpi_state >= ACPI_STATE_S1 && acpi_state <= ACPI_STATE_S3)
+		ACPI_FLUSH_CPU_CACHE();
 
 	trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
 	switch (acpi_state) {
@@ -903,8 +906,6 @@ static int acpi_hibernation_enter(void)
 {
 	acpi_status status = AE_OK;
 
-	ACPI_FLUSH_CPU_CACHE();
-
 	/* This shouldn't return.  If it returns, we have a problem */
 	status = acpi_enter_sleep_state(ACPI_STATE_S4);
 	/* Reprogram control registers */
-- 
 Kirill A. Shutemov


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-03 23:49           ` Kirill A. Shutemov
@ 2021-12-04  0:20             ` Dave Hansen
  2021-12-04  0:54               ` Kirill A. Shutemov
  2021-12-04 20:27             ` Rafael J. Wysocki
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2021-12-04  0:20 UTC (permalink / raw)
  To: Kirill A. Shutemov, Thomas Gleixner
  Cc: Kuppuswamy Sathyanarayanan, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Rafael J . Wysocki, H . Peter Anvin, Tony Luck,
	Dan Williams, Andi Kleen, Kuppuswamy Sathyanarayanan,
	linux-kernel, linux-acpi

On 12/3/21 3:49 PM, Kirill A. Shutemov wrote:
> -	ACPI_FLUSH_CPU_CACHE();
> +	if (acpi_state >= ACPI_STATE_S1 && acpi_state <= ACPI_STATE_S3)
> +		ACPI_FLUSH_CPU_CACHE();

It's a bit of a bummer that this per-sleep-state logic has to be
repeated so many times.

If you pass acpi_state into ACPI_FLUSH_CPU_CACHE() can you centralize
the set of places where that knowledge about which sleep states require
flushing?
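
As a rough sketch of what that centralization might look like (not a
proposal for the actual interface): one predicate encodes which S-states
need a flush, and every call site goes through it. The helper names and
the counting stub are invented for illustration; the state constants are
local stand-ins that mirror ACPICA's values (S0 = 0 through S5 = 5).

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins mirroring ACPICA's sleep state values (S0 = 0 ... S5 = 5). */
#define ACPI_STATE_S0 0u
#define ACPI_STATE_S1 1u
#define ACPI_STATE_S2 2u
#define ACPI_STATE_S3 3u
#define ACPI_STATE_S4 4u
#define ACPI_STATE_S5 5u

/* Hypothetical helper: the one place that knows which states need a flush. */
static bool acpi_sleep_state_needs_flush(unsigned int state)
{
	return state >= ACPI_STATE_S1 && state <= ACPI_STATE_S3;
}

/* Stub standing in for the real cache flush; it only counts calls. */
static unsigned int flush_count;

/* Hypothetical state-aware wrapper that call sites would use. */
static void flush_cpu_cache_for_state(unsigned int state)
{
	if (acpi_sleep_state_needs_flush(state))
		flush_count++;	/* kernel code would flush the cache here */
}
```

With this shape, a call site only passes the state it already has and
never repeats the S1..S3 range check itself.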

> TDX doesn't support these S- and C-states. TDX is only supports S0 and S5.

This makes me a bit nervous.  Is this "the first TDX implementation
supports..." or "the TDX architecture *prohibits* supporting S1 (or
whatever"?

I really think we need some kind of architecture guarantee.  Without
that, we risk breaking things if someone at our employer simply changes
their mind.

The:

> #define ACPI_FLUSH_CPU_CACHE_PHYS()     \
>         if (!cpu_feature_enabled(XXX))	\
>         	wbinvd();               \  

does seem simpler and less error-prone than this, though.


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-04  0:20             ` Dave Hansen
@ 2021-12-04  0:54               ` Kirill A. Shutemov
  2021-12-06 15:35                 ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-04  0:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Kuppuswamy Sathyanarayanan, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Rafael J . Wysocki,
	H . Peter Anvin, Tony Luck, Dan Williams, Andi Kleen,
	Kuppuswamy Sathyanarayanan, linux-kernel, linux-acpi

On Fri, Dec 03, 2021 at 04:20:34PM -0800, Dave Hansen wrote:
> On 12/3/21 3:49 PM, Kirill A. Shutemov wrote:
> > -	ACPI_FLUSH_CPU_CACHE();
> > +	if (acpi_state >= ACPI_STATE_S1 && acpi_state <= ACPI_STATE_S3)
> > +		ACPI_FLUSH_CPU_CACHE();
> 
> It's a bit of a bummer that this per-sleep-state logic has to be
> repeated so many time.
> 
> If you pass acpi_state into ACPI_FLUSH_CPU_CACHE() can you centralize
> the set of places where that knowledge about which sleep states require
> flushing?

Yes, sure, it is doable, if we decide that it is the way to go.

> > TDX doesn't support these S- and C-states. TDX is only supports S0 and S5.
> 
> This makes me a bit nervous.  Is this "the first TDX implementation
> supports..." or "the TDX architecture *prohibits* supporting S1 (or
> whatever"?

TDX Virtual Firmware Design Guide only states that "ACPI S3 (not supported
by TDX guests)".

Kernel reports in dmesg "ACPI: PM: (supports S0 S5)".

But I don't see how any state beyond S0 and S5 makes sense in TDX context.
Do you?

I find it neat that adjusting ACPI code to conform to the spec makes TDX
work.

> I really think we need some kind of architecture guarantee.  Without
> that, we risk breaking things if someone at our employer simply changes
> their mind.

Guarantees are hard.

If somebody changes their mind, we will get an unexpected #VE and crash.
I think that is an acceptable way to handle an unexpected change in a
confidential computing environment.

> The:
> 
> > #define ACPI_FLUSH_CPU_CACHE_PHYS()     \
> >         if (!cpu_feature_enabled(XXX))	\
> >         	wbinvd();               \  
> 
> does seem simpler and less error-prone than this, though.

If it is the way to go, I can make a patch.

But there's no reason to have ACPI_FLUSH_CPU_CACHE_PHYS in addition to
ACPI_FLUSH_CPU_CACHE. All ACPI_FLUSH_CPU_CACHE can skip cache flush on
TDX.
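
A rough sketch of that single-macro variant, for illustration only. The
quoted macro would need a do { } while (0) wrapper to behave as one
statement under if/else. Everything here is a userspace stand-in:
cpu_feature_enabled() and wbinvd() are stubs, and FEATURE_PLACEHOLDER is
a made-up name standing in for the XXX feature bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stubs so the sketch compiles outside the kernel. */
#define FEATURE_PLACEHOLDER 1		/* stands in for the XXX feature bit */
static bool feature_set;		/* pretend feature detection result */
static unsigned int wbinvd_calls;	/* counts simulated WBINVD executions */

static bool cpu_feature_enabled(int feature)
{
	return feature == FEATURE_PLACEHOLDER && feature_set;
}

static void wbinvd(void)
{
	wbinvd_calls++;
}

/* do/while(0) keeps the macro a single statement, safe under if/else. */
#define ACPI_FLUSH_CPU_CACHE()					\
	do {							\
		if (!cpu_feature_enabled(FEATURE_PLACEHOLDER))	\
			wbinvd();				\
	} while (0)
```

Callers stay unchanged in this variant; only the macro body decides
whether the flush instruction is issued.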

-- 
 Kirill A. Shutemov


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-03 23:49           ` Kirill A. Shutemov
  2021-12-04  0:20             ` Dave Hansen
@ 2021-12-04 20:27             ` Rafael J. Wysocki
  2021-12-06 12:29               ` [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3 Kirill A. Shutemov
  1 sibling, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-04 20:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Kuppuswamy Sathyanarayanan, Ingo Molnar,
	Borislav Petkov, Dave Hansen, the arch/x86 maintainers,
	Rafael J . Wysocki, H . Peter Anvin, Tony Luck, Dan Williams,
	Andi Kleen, Kuppuswamy Sathyanarayanan,
	Linux Kernel Mailing List, ACPI Devel Maling List

On Sat, Dec 4, 2021 at 12:49 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Dec 03, 2021 at 12:48:43AM +0100, Thomas Gleixner wrote:
> > Kirill,
> >
> > On Fri, Dec 03 2021 at 01:21, Kirill A. Shutemov wrote:
> > > On Thu, Nov 25, 2021 at 01:40:24AM +0100, Thomas Gleixner wrote:
> > >> Kuppuswamy,
> > >> Either that or you provide patches with arguments which are based on
> > >> proper analysis and not on 'appears to' observations.
> > >
> > > I think the right solution to the WBINVD would be to add a #VE handler
> > > that does nothing. We don't have a reasonable way to handle it from within
> > > the guest. We can call the VMM in hope that it would handle it, but VMM is
> > > untrusted and it can ignore the request.
> > >
> > > Dave suggested that we need to do code audit to make sure that there's no
> > > user inside TDX guest environment that relies on WBINVD to work correctly.
> > >
> > > Below is full call tree of WBINVD. It is substantially larger than I
> > > anticipated from initial grep.
> > >
> > > Conclusions:
> > >
> > >   - Most of callers are in ACPI code on changing S-states. Ignoring cache
> > >     flush for S-state change on virtual machine should be safe.
> > >
> > >   - The only WBINVD I was able to trigger is on poweroff from ACPI code.
> > >     Reboot also should trigger it, but for some reason I don't see it.
> > >
> > >   - Few caller in CPU offline code. TDX does not allowed to offline CPU as
> > >     we cannot bring it back -- we don't have SIPI. And even if offline
> > >     works for vCPU it should be safe to ignore WBINVD there.
> > >
> > >   - NVDIMMs are not supported inside TDX. If it will change we would need
> > >     to deal with cache flushing for this case. Hopefully, we would be able
> > >     to avoid WBINVD.
> > >
> > >   - Cache QoS and MTRR use WBINVD. They are disabled in TDX, but it is
> > >     controlled by VMM if the feature is advertised. We would need to
> > >     filter CPUID/MSRs to make sure VMM would not mess with them.
> > >
> > > Is it good enough justification for do-nothing #VE WBINVD handler?
> >
> > first of all thank you very much for this very profound analysis.
> >
> > This is really what I was asking for and you probably went even a step
> > deeper than that. Very appreciated.
> >
> > What we should do instead of doing a wholesale let's ignore WBINVD is to
> > have a separate function/macro:
> >
> >  ACPI_FLUSH_CPU_CACHE_PHYS()
> >
> > and invoke that from the functions which are considered to be safe.
> >
> > That would default to ACPI_FLUSH_CPU_CACHE() for other architecures
> > obviously.
> >
> > Then you can rightfully do:
> >
> > #define ACPI_FLUSH_CPU_CACHE_PHYS()     \
> >         if (!cpu_feature_enabled(XXX))        \
> >               wbinvd();               \
> >
> > where $XXX might be FEATURE_TDX_GUEST for paranoia sake and then
> > extended to X86_FEATURE_HYPERVISOR if everyone agrees.
> >
> > Then you have the #VE handler which just acts on any other wbinvd
> > invocation via warn, panic, whatever, no?
>
> I found another angle at the problem. According to the ACPI spec v6.4
> section 16.2 cache flushing is required on the way to S1, S2 and S3.
> And according to 8.2 it also is required on the way to C3.
>
> TDX doesn't support these S- and C-states. TDX is only supports S0 and S5.
>
> Adjusting code to match the spec would make TDX work automagically.
>
> Any opinions on the patch below?
>
> I didn't touch ACPI_FLUSH_CPU_CACHE() users in cpufreq/longhaul.c because
> it might be outside of ACPI spec, I donno.
>
> diff --git a/drivers/acpi/acpica/hwesleep.c b/drivers/acpi/acpica/hwesleep.c
> index 808fdf54aeeb..b004a72a426e 100644
> --- a/drivers/acpi/acpica/hwesleep.c
> +++ b/drivers/acpi/acpica/hwesleep.c
> @@ -104,7 +104,8 @@ acpi_status acpi_hw_extended_sleep(u8 sleep_state)
>
>         /* Flush caches, as per ACPI specification */
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       if (sleep_state >= ACPI_STATE_S1 && sleep_state <= ACPI_STATE_S3)
> +               ACPI_FLUSH_CPU_CACHE();

So this basically means

if (sleep_state < ACPI_STATE_S4)
        ACPI_FLUSH_CPU_CACHE();

and analogously below.
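
A quick standalone check of that simplification, as a sketch: the range
test from the patch and a single upper-bound comparison against S4 agree
for S1 through S5 and differ only for S0, which these sleep-entry paths
are presumably never asked to enter. The constants are local copies of
the ACPICA values and the function names are made up for the comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the ACPICA sleep state values (S0 = 0 ... S5 = 5). */
#define ACPI_STATE_S0 0u
#define ACPI_STATE_S1 1u
#define ACPI_STATE_S3 3u
#define ACPI_STATE_S4 4u
#define ACPI_STATE_S5 5u

/* Condition as written in the patch. */
static bool flush_range_check(unsigned int s)
{
	return s >= ACPI_STATE_S1 && s <= ACPI_STATE_S3;
}

/* Simplified single-comparison form. */
static bool flush_upper_bound(unsigned int s)
{
	return s < ACPI_STATE_S4;
}

/* Returns a bitmask with bit N set if the two forms disagree for state SN. */
static unsigned int disagreement_mask(void)
{
	unsigned int mask = 0;

	for (unsigned int s = ACPI_STATE_S0; s <= ACPI_STATE_S5; s++)
		if (flush_range_check(s) != flush_upper_bound(s))
			mask |= 1u << s;
	return mask;
}
```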

This is fine with me, but it is an ACPICA patch, so it needs to be
submitted to the upstream project.  I think I can take care of this,
but not urgently.

>
>         status = acpi_os_enter_sleep(sleep_state, sleep_control, 0);
>         if (status == AE_CTRL_TERMINATE) {
> diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
> index 34a3825f25d3..bfcd66efeb48 100644
> --- a/drivers/acpi/acpica/hwsleep.c
> +++ b/drivers/acpi/acpica/hwsleep.c
> @@ -110,7 +110,8 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
>
>         /* Flush caches, as per ACPI specification */
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       if (sleep_state >= ACPI_STATE_S1 && sleep_state <= ACPI_STATE_S3)
> +               ACPI_FLUSH_CPU_CACHE();
>
>         status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
>         if (status == AE_CTRL_TERMINATE) {
> diff --git a/drivers/acpi/acpica/hwxfsleep.c b/drivers/acpi/acpica/hwxfsleep.c
> index e4cde23a2906..ba77598ee43e 100644
> --- a/drivers/acpi/acpica/hwxfsleep.c
> +++ b/drivers/acpi/acpica/hwxfsleep.c
> @@ -162,8 +162,6 @@ acpi_status acpi_enter_sleep_state_s4bios(void)
>                 return_ACPI_STATUS(status);
>         }
>
> -       ACPI_FLUSH_CPU_CACHE();
> -
>         status = acpi_hw_write_port(acpi_gbl_FADT.smi_command,
>                                     (u32)acpi_gbl_FADT.s4_bios_request, 8);
>         if (ACPI_FAILURE(status)) {
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 76ef1bcc8848..01495aca850e 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -567,7 +567,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
>  {
>         struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       if (cx->type == ACPI_STATE_C3)
> +               ACPI_FLUSH_CPU_CACHE();

And this is independent of the ACPICA changes above, so it can be made
in a separate patch.

This one is somewhat risky, though, because there is no guarantee that
all of the platforms in the field follow the spec.

>
>         while (1) {
>
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index eaa47753b758..a81d08b762c2 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -73,7 +73,9 @@ static int acpi_sleep_prepare(u32 acpi_state)
>                 acpi_set_waking_vector(acpi_wakeup_address);
>
>         }
> -       ACPI_FLUSH_CPU_CACHE();
> +
> +       if (acpi_state >= ACPI_STATE_S1 && acpi_state <= ACPI_STATE_S3)
> +               ACPI_FLUSH_CPU_CACHE();

This flushing and the one below looks like it may be redundant,
because the cache will be flushed again in the ACPICA code above
anyway.

However, this needs to be double-checked.

>  #endif
>         pr_info("Preparing to enter system sleep state S%d\n", acpi_state);
>         acpi_enable_wakeup_devices(acpi_state);
> @@ -566,7 +568,8 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
>         u32 acpi_state = acpi_target_sleep_state;
>         int error;
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       if (acpi_state >= ACPI_STATE_S1 && acpi_state <= ACPI_STATE_S3)
> +               ACPI_FLUSH_CPU_CACHE();
>
>         trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
>         switch (acpi_state) {
> @@ -903,8 +906,6 @@ static int acpi_hibernation_enter(void)
>  {
>         acpi_status status = AE_OK;
>
> -       ACPI_FLUSH_CPU_CACHE();
> -
>         /* This shouldn't return.  If it returns, we have a problem */
>         status = acpi_enter_sleep_state(ACPI_STATE_S4);
>         /* Reprogram control registers */

This one is OK and can be done in a separate patch.


* [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3
  2021-12-04 20:27             ` Rafael J. Wysocki
@ 2021-12-06 12:29               ` Kirill A. Shutemov
  2021-12-06 12:29                 ` [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5 Kirill A. Shutemov
                                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-06 12:29 UTC (permalink / raw)
  To: rafael
  Cc: ak, bp, dan.j.williams, dave.hansen, hpa, kirill.shutemov,
	knsathya, linux-acpi, linux-kernel, mingo, rjw,
	sathyanarayanan.kuppuswamy, tglx, tony.luck, x86

Does it look like what you want?

Kirill A. Shutemov (4):
  ACPICA: Do not flush cache for on entering S4 and S5
  ACPI: PM: Remove redundant cache flushing
  ACPI: processor idle: Only flush cache on entering C3
  ACPI: PM: Avoid cache flush on entering S4

 drivers/acpi/acpica/hwesleep.c  | 3 ++-
 drivers/acpi/acpica/hwsleep.c   | 3 ++-
 drivers/acpi/acpica/hwxfsleep.c | 2 --
 drivers/acpi/processor_idle.c   | 3 ++-
 drivers/acpi/sleep.c            | 9 +++------
 5 files changed, 9 insertions(+), 11 deletions(-)

-- 
2.32.0



* [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5
  2021-12-06 12:29               ` [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3 Kirill A. Shutemov
@ 2021-12-06 12:29                 ` Kirill A. Shutemov
  2021-12-08 14:58                   ` Rafael J. Wysocki
  2021-12-06 12:29                 ` [PATCH 2/4] ACPI: PM: Remove redundant cache flushing Kirill A. Shutemov
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-06 12:29 UTC (permalink / raw)
  To: rafael
  Cc: ak, bp, dan.j.williams, dave.hansen, hpa, kirill.shutemov,
	knsathya, linux-acpi, linux-kernel, mingo, rjw,
	sathyanarayanan.kuppuswamy, tglx, tony.luck, x86

According to the ACPI spec v6.4, section 16.2, cache flushing is
required on entering S1, S2, and S3. ACPICA code flushes the cache
regardless of the sleep state.

The unconditional cache flush on entering S5 causes problems for TDX:
flushing is done with WBINVD, which is not supported in the TDX environment.

TDX only supports S5 and adjusting ACPICA code to conform to the spec
fixes the issue.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/acpi/acpica/hwesleep.c  | 3 ++-
 drivers/acpi/acpica/hwsleep.c   | 3 ++-
 drivers/acpi/acpica/hwxfsleep.c | 2 --
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/acpica/hwesleep.c b/drivers/acpi/acpica/hwesleep.c
index 808fdf54aeeb..ceb5a4292efa 100644
--- a/drivers/acpi/acpica/hwesleep.c
+++ b/drivers/acpi/acpica/hwesleep.c
@@ -104,7 +104,8 @@ acpi_status acpi_hw_extended_sleep(u8 sleep_state)
 
 	/* Flush caches, as per ACPI specification */
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (sleep_state < ACPI_STATE_S4)
+		ACPI_FLUSH_CPU_CACHE();
 
 	status = acpi_os_enter_sleep(sleep_state, sleep_control, 0);
 	if (status == AE_CTRL_TERMINATE) {
diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
index 34a3825f25d3..ee094a3aaaab 100644
--- a/drivers/acpi/acpica/hwsleep.c
+++ b/drivers/acpi/acpica/hwsleep.c
@@ -110,7 +110,8 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
 
 	/* Flush caches, as per ACPI specification */
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (sleep_state < ACPI_STATE_S4)
+		ACPI_FLUSH_CPU_CACHE();
 
 	status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
 	if (status == AE_CTRL_TERMINATE) {
diff --git a/drivers/acpi/acpica/hwxfsleep.c b/drivers/acpi/acpica/hwxfsleep.c
index e4cde23a2906..ba77598ee43e 100644
--- a/drivers/acpi/acpica/hwxfsleep.c
+++ b/drivers/acpi/acpica/hwxfsleep.c
@@ -162,8 +162,6 @@ acpi_status acpi_enter_sleep_state_s4bios(void)
 		return_ACPI_STATUS(status);
 	}
 
-	ACPI_FLUSH_CPU_CACHE();
-
 	status = acpi_hw_write_port(acpi_gbl_FADT.smi_command,
 				    (u32)acpi_gbl_FADT.s4_bios_request, 8);
 	if (ACPI_FAILURE(status)) {
-- 
2.32.0



* [PATCH 2/4] ACPI: PM: Remove redundant cache flushing
  2021-12-06 12:29               ` [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3 Kirill A. Shutemov
  2021-12-06 12:29                 ` [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5 Kirill A. Shutemov
@ 2021-12-06 12:29                 ` Kirill A. Shutemov
  2021-12-07 16:35                   ` Rafael J. Wysocki
  2021-12-06 12:29                 ` [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3 Kirill A. Shutemov
  2021-12-06 12:29                 ` [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4 Kirill A. Shutemov
  3 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-06 12:29 UTC (permalink / raw)
  To: rafael
  Cc: ak, bp, dan.j.williams, dave.hansen, hpa, kirill.shutemov,
	knsathya, linux-acpi, linux-kernel, mingo, rjw,
	sathyanarayanan.kuppuswamy, tglx, tony.luck, x86

ACPICA code takes care of cache flushing on S1/S2/S3 in
acpi_hw_extended_sleep() and acpi_hw_legacy_sleep().

acpi_suspend_enter() calls into ACPICA code via acpi_enter_sleep_state()
for S1 or x86_acpi_suspend_lowlevel() for S3. It only needs to flush
the cache for S2 (not sure if this call path is ever used for S2).

acpi_sleep_prepare() call tree:
  __acpi_pm_prepare()
    acpi_pm_prepare()
      acpi_suspend_ops::prepare_late()
      acpi_hibernation_ops::pre_snapshot()
      acpi_hibernation_ops::prepare()
    acpi_suspend_begin_old()
      acpi_suspend_begin_old::begin()
  acpi_hibernation_begin_old()
    acpi_hibernation_ops_old::acpi_hibernation_begin_old()
  acpi_power_off_prepare()
    pm_power_off_prepare()

Hibernation (S4) and Power Off (S5) don't require cache flushing. So,
the only interesting callsites are acpi_suspend_ops::prepare_late() and
acpi_suspend_begin_old::begin(). Both of them have cache flush on
->enter() operation in acpi_suspend_enter().

Remove redundant ACPI_FLUSH_CPU_CACHE() in acpi_sleep_prepare() and
acpi_suspend_enter().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/acpi/sleep.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index eaa47753b758..14e8df0ac762 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -73,7 +73,6 @@ static int acpi_sleep_prepare(u32 acpi_state)
 		acpi_set_waking_vector(acpi_wakeup_address);
 
 	}
-	ACPI_FLUSH_CPU_CACHE();
 #endif
 	pr_info("Preparing to enter system sleep state S%d\n", acpi_state);
 	acpi_enable_wakeup_devices(acpi_state);
@@ -566,15 +565,15 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
 	u32 acpi_state = acpi_target_sleep_state;
 	int error;
 
-	ACPI_FLUSH_CPU_CACHE();
-
 	trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
 	switch (acpi_state) {
 	case ACPI_STATE_S1:
 		barrier();
 		status = acpi_enter_sleep_state(acpi_state);
 		break;
-
+	case ACPI_STATE_S2:
+		ACPI_FLUSH_CPU_CACHE();
+		break;
 	case ACPI_STATE_S3:
 		if (!acpi_suspend_lowlevel)
 			return -ENOSYS;
-- 
2.32.0



* [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3
  2021-12-06 12:29               ` [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3 Kirill A. Shutemov
  2021-12-06 12:29                 ` [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5 Kirill A. Shutemov
  2021-12-06 12:29                 ` [PATCH 2/4] ACPI: PM: Remove redundant cache flushing Kirill A. Shutemov
@ 2021-12-06 12:29                 ` Kirill A. Shutemov
  2021-12-06 15:03                   ` Peter Zijlstra
  2021-12-06 12:29                 ` [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4 Kirill A. Shutemov
  3 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-06 12:29 UTC (permalink / raw)
  To: rafael
  Cc: ak, bp, dan.j.williams, dave.hansen, hpa, kirill.shutemov,
	knsathya, linux-acpi, linux-kernel, mingo, rjw,
	sathyanarayanan.kuppuswamy, tglx, tony.luck, x86

According to the ACPI spec v6.4, section 8.2, cache flushing is required
on entering the C3 power state.

Avoid flushing cache on entering other power states.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/acpi/processor_idle.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 76ef1bcc8848..01495aca850e 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -567,7 +567,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 {
 	struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (cx->type == ACPI_STATE_C3)
+		ACPI_FLUSH_CPU_CACHE();
 
 	while (1) {
 
-- 
2.32.0



* [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4
  2021-12-06 12:29               ` [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3 Kirill A. Shutemov
                                   ` (2 preceding siblings ...)
  2021-12-06 12:29                 ` [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3 Kirill A. Shutemov
@ 2021-12-06 12:29                 ` Kirill A. Shutemov
  2021-12-08 15:10                   ` Rafael J. Wysocki
  3 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-06 12:29 UTC (permalink / raw)
  To: rafael
  Cc: ak, bp, dan.j.williams, dave.hansen, hpa, kirill.shutemov,
	knsathya, linux-acpi, linux-kernel, mingo, rjw,
	sathyanarayanan.kuppuswamy, tglx, tony.luck, x86

According to the ACPI spec v6.4, section 16.2, cache flushing is
required on entering S1, S2, and S3.

No need to flush caches on hibernation (S4).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/acpi/sleep.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 14e8df0ac762..8166d863ed6b 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -902,8 +902,6 @@ static int acpi_hibernation_enter(void)
 {
 	acpi_status status = AE_OK;
 
-	ACPI_FLUSH_CPU_CACHE();
-
 	/* This shouldn't return.  If it returns, we have a problem */
 	status = acpi_enter_sleep_state(ACPI_STATE_S4);
 	/* Reprogram control registers */
-- 
2.32.0



* Re: [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3
  2021-12-06 12:29                 ` [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3 Kirill A. Shutemov
@ 2021-12-06 15:03                   ` Peter Zijlstra
  2021-12-08 16:26                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2021-12-06 15:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: rafael, ak, bp, dan.j.williams, dave.hansen, hpa, knsathya,
	linux-acpi, linux-kernel, mingo, rjw, sathyanarayanan.kuppuswamy,
	tglx, tony.luck, x86

On Mon, Dec 06, 2021 at 03:29:51PM +0300, Kirill A. Shutemov wrote:
> According to the ACPI spec v6.4, section 8.2, cache flushing required
> on entering C3 power state.
> 
> Avoid flushing cache on entering other power states.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/acpi/processor_idle.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 76ef1bcc8848..01495aca850e 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -567,7 +567,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
>  {
>  	struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
>  
> -	ACPI_FLUSH_CPU_CACHE();
> +	if (cx->type == ACPI_STATE_C3)
> +		ACPI_FLUSH_CPU_CACHE();
>  

acpi_idle_enter() already does this, acpi_idle_enter_s2idle() has it
confused again.

Also, I think acpi_idle_enter() does it too late; consider
acpi_idle_enter_bm(). Either that or the BM crud needs more comments.


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-04  0:54               ` Kirill A. Shutemov
@ 2021-12-06 15:35                 ` Dave Hansen
  2021-12-06 16:39                   ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2021-12-06 15:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Kuppuswamy Sathyanarayanan, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Rafael J . Wysocki,
	H . Peter Anvin, Tony Luck, Dan Williams, Andi Kleen,
	Kuppuswamy Sathyanarayanan, linux-kernel, linux-acpi

On 12/3/21 4:54 PM, Kirill A. Shutemov wrote:
> On Fri, Dec 03, 2021 at 04:20:34PM -0800, Dave Hansen wrote:
>>> TDX doesn't support these S- and C-states. TDX is only supports S0 and S5.
>>
>> This makes me a bit nervous.  Is this "the first TDX implementation
>> supports..." or "the TDX architecture *prohibits* supporting S1 (or
>> whatever"?
> 
> TDX Virtual Firmware Design Guide only states that "ACPI S3 (not supported
> by TDX guests)".
> 
> Kernel reports in dmesg "ACPI: PM: (supports S0 S5)".

Those describe the current firmware implementation, not a guarantee
provided by the TDX architecture forever.

> But I don't see how any state beyond S0 and S5 make sense in TDX context.
> Do you?

Do existing (non-TDX) VMs use anything other than S0 and S5?  If so, I'd
say yes.

>> I really think we need some kind of architecture guarantee.  Without
>> that, we risk breaking things if someone at our employer simply changes
>> their mind.
> 
> Guarantees are hard.
> 
> If somebody change their mind we will get unexpected #VE and crash.
> I think it is acceptable way to handle unexpected change in confidential
> computing environment.

Architectural guarantees are quite easy, actually.  They're just a
contract that two parties agree to.  In this case, the contract would be
that TDX firmware *PROMISES* not to enumerate support for additional
sleep states over what the implementation does today.  If future
firmware breaks that promise (and the kernel crashes) we get to come
after them with torches and pitchforks to fix the firmware.

The contract lets us do things in the OS like:

	WARN_ON(sleep_states[ACPI_STATE_S3]);
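
Spelled out a little further, as a sketch only: with such a contract in
place, a guest could check the enumerated states once and complain if
anything other than S0 and S5 shows up. The array layout follows the
sleep_states[] indexing used above; the checker function, its name, and
the ACPI_S_STATE_COUNT stand-in are invented for illustration, and the
warning is reduced to a return value.

```c
#include <assert.h>
#include <stdbool.h>

#define ACPI_S_STATE_COUNT 6	/* S0..S5, local stand-in */

/*
 * Hypothetical check: returns true if any state other than S0 or S5
 * is enumerated, i.e. if the "only S0 and S5" contract is broken.
 */
static bool tdx_unexpected_sleep_state(const bool states[ACPI_S_STATE_COUNT])
{
	for (unsigned int s = 1; s <= 4; s++)
		if (states[s])
			return true;	/* kernel code would WARN_ON() here */
	return false;
}
```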

We also don't need *formal* documentation of such things.  We really
just need to have a chat.

It would be perfectly sufficient if we go bug Intel's TDX architecture
folks and say, "Hey, Linux is going to crash if you ever implement any
actual sleep states.  The current implementation is fine here, but is it
OK if future implementations are restricted from doing this?"

But, the trick is that we need a contract.  A contract requires a
"meeting of the minds" first.


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-06 15:35                 ` Dave Hansen
@ 2021-12-06 16:39                   ` Dan Williams
  2021-12-06 16:53                     ` Dave Hansen
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2021-12-06 16:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Thomas Gleixner, Kuppuswamy Sathyanarayanan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	Rafael J . Wysocki, H . Peter Anvin, Tony Luck, Andi Kleen,
	Kuppuswamy Sathyanarayanan, Linux Kernel Mailing List,
	Linux ACPI

On Mon, Dec 6, 2021 at 7:35 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 12/3/21 4:54 PM, Kirill A. Shutemov wrote:
> > On Fri, Dec 03, 2021 at 04:20:34PM -0800, Dave Hansen wrote:
> >>> TDX doesn't support these S- and C-states. TDX is only supports S0 and S5.
> >>
> >> This makes me a bit nervous.  Is this "the first TDX implementation
> >> supports..." or "the TDX architecture *prohibits* supporting S1 (or
> >> whatever"?
> >
> > TDX Virtual Firmware Design Guide only states that "ACPI S3 (not supported
> > by TDX guests)".
> >
> > Kernel reports in dmesg "ACPI: PM: (supports S0 S5)".
>
> Those describe the current firmware implementation, not a guarantee
> provided by the TDX architecture forever.
>
> > But I don't see how any state beyond S0 and S5 make sense in TDX context.
> > Do you?
>
> Do existing (non-TDX) VMs use anything other than S0 and S5?  If so, I'd
> say yes.
>
> >> I really think we need some kind of architecture guarantee.  Without
> >> that, we risk breaking things if someone at our employer simply changes
> >> their mind.
> >
> > Guarantees are hard.
> >
> > If somebody change their mind we will get unexpected #VE and crash.
> > I think it is acceptable way to handle unexpected change in confidential
> > computing environment.
>
> Architectural guarantees are quite easy, actually.  They're just a
> contract that two parties agree to.  In this case, the contract would be
> that TDX firmware *PROMISES* not to enumerate support for additional
> sleep states over what the implementation does today.  If future
> firmware breaks that promise (and the kernel crashes) we get to come
> after them with torches and pitchforks to fix the firmware.
>
> The contract let's us do things in the OS like:
>
>         WARN_ON(sleep_states[ACPI_STATE_S3]);
>
> We also don't need *formal* documentation of such things.  We really
> just need to have a chat.
>
> It would be perfectly sufficient if we go bug Intel's TDX architecture
> folks and say, "Hey, Linux is going to crash if you ever implement any
> actual sleep states.  The current implementation is fine here, but is it
> OK if future implementations are restricted from doing this?"
>
> But, the trick is that we need a contract.  A contract requires a
> "meeting of the minds" first.

The WBINVD requirement in sleep states is about getting cache contents
out to the power-preserved domain before the CPU turns off. The bare
metal host handles that requirement. The conversation that needs to be
had is with the ACPI specification committee to clarify that virtual
machines have no responsibility to flush caches. We can do that as a
Code First proposal to the ACPI Specification Working Group.


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-06 16:39                   ` Dan Williams
@ 2021-12-06 16:53                     ` Dave Hansen
  2021-12-06 17:51                       ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Hansen @ 2021-12-06 16:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kirill A. Shutemov, Thomas Gleixner, Kuppuswamy Sathyanarayanan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	Rafael J . Wysocki, H . Peter Anvin, Tony Luck, Andi Kleen,
	Kuppuswamy Sathyanarayanan, Linux Kernel Mailing List,
	Linux ACPI

On 12/6/21 8:39 AM, Dan Williams wrote:
>> But, the trick is that we need a contract.  A contract requires a
>> "meeting of the minds" first.
> The WBINVD requirement in sleep states is about getting cache contents
> out to the power-preserved domain before the CPU turns off. The bare
> metal host handles that requirement. The conversation that needs to be
> had is with the ACPI specification committee to clarify that virtual
> machines have no responsibility to flush caches. We can do that as a
> Code First proposal to the ACPI Specification Working Group.

Sounds sane to me.  So, we effectively go to the ACPI folks and say that
Linux isn't going to do WBINVD in virtualized environments any more.
That was effectively the approach that the first patch in this thread took:

> https://lore.kernel.org/all/20211116005027.2929297-1-sathyanarayanan.kuppuswamy@linux.intel.com/

Right?


* Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
  2021-12-06 16:53                     ` Dave Hansen
@ 2021-12-06 17:51                       ` Dan Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Dan Williams @ 2021-12-06 17:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Thomas Gleixner, Kuppuswamy Sathyanarayanan,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	Rafael J . Wysocki, H . Peter Anvin, Tony Luck, Andi Kleen,
	Kuppuswamy Sathyanarayanan, Linux Kernel Mailing List,
	Linux ACPI

On Mon, Dec 6, 2021 at 8:54 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 12/6/21 8:39 AM, Dan Williams wrote:
> >> But, the trick is that we need a contract.  A contract requires a
> >> "meeting of the minds" first.
> > The WBINVD requirement in sleep states is about getting cache contents
> > out to the power-preserved domain before the CPU turns off. The bare
> > metal host handles that requirement. The conversation that needs to be
> > had is with the ACPI specification committee to clarify that virtual
> > machines have no responsibility to flush caches. We can do that as a
> > Code First proposal to the ACPI Specification Working Group.
>
> Sounds sane to me.  So, we effectively go to the ACPI folks and say that
> Linux isn't going to do WBINVD in virtualized environments any more.
> That was effectively the approach that the first patch in this thread took:
>
> > https://lore.kernel.org/all/20211116005027.2929297-1-sathyanarayanan.kuppuswamy@linux.intel.com/
>
> Right?

Correct, my reviewed-by was based on that observation, and now we can
close the loop by proposing the specification change.
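
As a rough illustration of the approach agreed on above (this is not the
actual patch), a guest-aware flush could look like the sketch below. The
`platform` struct, its `is_guest` flag, and the `flushes` counter are
hypothetical stand-ins: per the v1 changelog the kernel keys this off
X86_FEATURE_HYPERVISOR, and the real flush is a WBINVD rather than a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a platform; not a kernel structure. */
struct platform {
	bool is_guest;        /* stands in for X86_FEATURE_HYPERVISOR */
	unsigned int flushes; /* counts simulated cache flushes */
};

/*
 * Simulated ACPI cache flush. On bare metal the OS must write the
 * caches back before power is removed. In a guest the host owns that
 * duty, so the flush is skipped.
 */
static void acpi_flush_cpu_cache_sketch(struct platform *p)
{
	assert(p != NULL);
	if (p->is_guest)
		return;
	p->flushes++;
}
```

Calling the helper once on a bare-metal model and once on a guest model
should leave the counters at 1 and 0 respectively.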


* Re: [PATCH 2/4] ACPI: PM: Remove redundant cache flushing
  2021-12-06 12:29                 ` [PATCH 2/4] ACPI: PM: Remove redundant cache flushing Kirill A. Shutemov
@ 2021-12-07 16:35                   ` Rafael J. Wysocki
  2021-12-09 13:32                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-07 16:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Rafael J. Wysocki, Andi Kleen, Borislav Petkov, Dan Williams,
	Dave Hansen, H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Mon, Dec 6, 2021 at 1:30 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> ACPICA code takes care of cache flushing on S1/S2/S3 in
> acpi_hw_extended_sleep() and acpi_hw_legacy_sleep().
>
> acpi_suspend_enter() calls into ACPICA code via acpi_enter_sleep_state()
> for S1 or x86_acpi_suspend_lowlevel() for S3. It only needs to flush
> the cache for S2 (not sure if this call path is ever used for S2).
>
> acpi_sleep_prepare() call tree:
>   __acpi_pm_prepare()
>     acpi_pm_prepare()
>       acpi_suspend_ops::prepare_late()
>       acpi_hibernation_ops::pre_snapshot()
>       acpi_hibernation_ops::prepare()
>     acpi_suspend_begin_old()
>       acpi_suspend_begin_old::begin()
>   acpi_hibernation_begin_old()
>     acpi_hibernation_ops_old::acpi_hibernation_begin_old()
>   acpi_power_off_prepare()
>     pm_power_off_prepare()
>
> Hibernation (S4) and Power Off (S5) don't require cache flushing. So,
> the only interesting callsites are acpi_suspend_ops::prepare_late() and
> acpi_suspend_begin_old::begin(). Both of them have a cache flush in the
> ->enter() operation, acpi_suspend_enter().
>
> Remove redundant ACPI_FLUSH_CPU_CACHE() in acpi_sleep_prepare() and
> acpi_suspend_enter().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/acpi/sleep.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index eaa47753b758..14e8df0ac762 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -73,7 +73,6 @@ static int acpi_sleep_prepare(u32 acpi_state)
>                 acpi_set_waking_vector(acpi_wakeup_address);
>
>         }
> -       ACPI_FLUSH_CPU_CACHE();
>  #endif
>         pr_info("Preparing to enter system sleep state S%d\n", acpi_state);
>         acpi_enable_wakeup_devices(acpi_state);
> @@ -566,15 +565,15 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
>         u32 acpi_state = acpi_target_sleep_state;
>         int error;
>
> -       ACPI_FLUSH_CPU_CACHE();
> -
>         trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
>         switch (acpi_state) {
>         case ACPI_STATE_S1:
>                 barrier();
>                 status = acpi_enter_sleep_state(acpi_state);
>                 break;
> -
> +       case ACPI_STATE_S2:
> +               ACPI_FLUSH_CPU_CACHE();
> +               break;

I don't think this is needed for S2, because the function doesn't do
anything low-level in that case and simply returns (IOW, S2 isn't
really supported).

>         case ACPI_STATE_S3:
>                 if (!acpi_suspend_lowlevel)
>                         return -ENOSYS;
> --
> 2.32.0
>


* Re: [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5
  2021-12-06 12:29                 ` [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5 Kirill A. Shutemov
@ 2021-12-08 14:58                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-08 14:58 UTC (permalink / raw)
  To: Kirill A. Shutemov, Robert Moore
  Cc: Rafael J. Wysocki, Andi Kleen, Borislav Petkov, Dan Williams,
	Dave Hansen, H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Mon, Dec 6, 2021 at 1:30 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> According to the ACPI spec v6.4, section 16.2, cache flushing is
> required on entering S1, S2, and S3. ACPICA code flushes the cache
> regardless of the sleep state.
>
> A blind cache flush on entering S5 causes problems for TDX. Flushing
> is done with WBINVD, which is not supported in the TDX environment.
>
> TDX only supports S5, so adjusting the ACPICA code to conform to the
> spec fixes the issue.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

I've converted this patch to the upstream ACPICA code base format and
submitted a pull request with it to the upstream project.

Thanks!

> ---
>  drivers/acpi/acpica/hwesleep.c  | 3 ++-
>  drivers/acpi/acpica/hwsleep.c   | 3 ++-
>  drivers/acpi/acpica/hwxfsleep.c | 2 --
>  3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/acpi/acpica/hwesleep.c b/drivers/acpi/acpica/hwesleep.c
> index 808fdf54aeeb..ceb5a4292efa 100644
> --- a/drivers/acpi/acpica/hwesleep.c
> +++ b/drivers/acpi/acpica/hwesleep.c
> @@ -104,7 +104,8 @@ acpi_status acpi_hw_extended_sleep(u8 sleep_state)
>
>         /* Flush caches, as per ACPI specification */
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       if (sleep_state < ACPI_STATE_S4)
> +               ACPI_FLUSH_CPU_CACHE();
>
>         status = acpi_os_enter_sleep(sleep_state, sleep_control, 0);
>         if (status == AE_CTRL_TERMINATE) {
> diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
> index 34a3825f25d3..ee094a3aaaab 100644
> --- a/drivers/acpi/acpica/hwsleep.c
> +++ b/drivers/acpi/acpica/hwsleep.c
> @@ -110,7 +110,8 @@ acpi_status acpi_hw_legacy_sleep(u8 sleep_state)
>
>         /* Flush caches, as per ACPI specification */
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       if (sleep_state < ACPI_STATE_S4)
> +               ACPI_FLUSH_CPU_CACHE();
>
>         status = acpi_os_enter_sleep(sleep_state, pm1a_control, pm1b_control);
>         if (status == AE_CTRL_TERMINATE) {
> diff --git a/drivers/acpi/acpica/hwxfsleep.c b/drivers/acpi/acpica/hwxfsleep.c
> index e4cde23a2906..ba77598ee43e 100644
> --- a/drivers/acpi/acpica/hwxfsleep.c
> +++ b/drivers/acpi/acpica/hwxfsleep.c
> @@ -162,8 +162,6 @@ acpi_status acpi_enter_sleep_state_s4bios(void)
>                 return_ACPI_STATUS(status);
>         }
>
> -       ACPI_FLUSH_CPU_CACHE();
> -
>         status = acpi_hw_write_port(acpi_gbl_FADT.smi_command,
>                                     (u32)acpi_gbl_FADT.s4_bios_request, 8);
>         if (ACPI_FAILURE(status)) {
> --
> 2.32.0
>
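
The condition introduced by the quoted hunks can be read as a small
predicate. The sketch below is only an illustration: the `STATE_S*` names and
`sleep_state_needs_flush()` are hypothetical, and the numbering (S0 = 0
through S5 = 5) is assumed to match the ACPICA constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to follow the ACPICA numbering: S0 = 0 ... S5 = 5. */
enum { STATE_S0, STATE_S1, STATE_S2, STATE_S3, STATE_S4, STATE_S5 };

/* Mirrors the quoted condition `sleep_state < ACPI_STATE_S4`. */
static bool sleep_state_needs_flush(unsigned char sleep_state)
{
	assert(sleep_state <= STATE_S5);
	return sleep_state < STATE_S4;
}
```

With this predicate S1, S2, and S3 request a flush while S4 and S5 do not.
Note that the condition as written is also true for S0.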


* Re: [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4
  2021-12-06 12:29                 ` [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4 Kirill A. Shutemov
@ 2021-12-08 15:10                   ` Rafael J. Wysocki
  2021-12-08 16:04                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-08 15:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Rafael J. Wysocki, Andi Kleen, Borislav Petkov, Dan Williams,
	Dave Hansen, H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Mon, Dec 6, 2021 at 1:30 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> According to the ACPI spec v6.4, section 16.2, cache flushing is
> required on entering S1, S2, and S3.
>
> No need to flush caches on hibernation (S4).
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/acpi/sleep.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index 14e8df0ac762..8166d863ed6b 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -902,8 +902,6 @@ static int acpi_hibernation_enter(void)
>  {
>         acpi_status status = AE_OK;
>
> -       ACPI_FLUSH_CPU_CACHE();
> -
>         /* This shouldn't return.  If it returns, we have a problem */
>         status = acpi_enter_sleep_state(ACPI_STATE_S4);
>         /* Reprogram control registers */
> --

Applied (with some edits in the subject and changelog) as 5.17 material, thanks!


* Re: [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4
  2021-12-08 15:10                   ` Rafael J. Wysocki
@ 2021-12-08 16:04                     ` Kirill A. Shutemov
  2021-12-08 16:16                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-08 16:04 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andi Kleen, Borislav Petkov, Dan Williams, Dave Hansen,
	H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Wed, Dec 08, 2021 at 04:10:52PM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 6, 2021 at 1:30 PM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > According to the ACPI spec v6.4, section 16.2, cache flushing is
> > required on entering S1, S2, and S3.
> >
> > No need to flush caches on hibernation (S4).
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  drivers/acpi/sleep.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> > index 14e8df0ac762..8166d863ed6b 100644
> > --- a/drivers/acpi/sleep.c
> > +++ b/drivers/acpi/sleep.c
> > @@ -902,8 +902,6 @@ static int acpi_hibernation_enter(void)
> >  {
> >         acpi_status status = AE_OK;
> >
> > -       ACPI_FLUSH_CPU_CACHE();
> > -
> >         /* This shouldn't return.  If it returns, we have a problem */
> >         status = acpi_enter_sleep_state(ACPI_STATE_S4);
> >         /* Reprogram control registers */
> > --
> 
> Applied (with some edits in the subject and changelog) as 5.17 material, thanks!

Is it for the series or only 4/4? Do I need to do something for 2/4 and
3/4?

-- 
 Kirill A. Shutemov


* Re: [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4
  2021-12-08 16:04                     ` Kirill A. Shutemov
@ 2021-12-08 16:16                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-08 16:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Rafael J. Wysocki, Andi Kleen, Borislav Petkov, Dan Williams,
	Dave Hansen, H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Wed, Dec 8, 2021 at 5:04 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Wed, Dec 08, 2021 at 04:10:52PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 6, 2021 at 1:30 PM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > According to the ACPI spec v6.4, section 16.2, cache flushing is
> > > required on entering S1, S2, and S3.
> > >
> > > No need to flush caches on hibernation (S4).
> > >
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  drivers/acpi/sleep.c | 2 --
> > >  1 file changed, 2 deletions(-)
> > >
> > > diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> > > index 14e8df0ac762..8166d863ed6b 100644
> > > --- a/drivers/acpi/sleep.c
> > > +++ b/drivers/acpi/sleep.c
> > > @@ -902,8 +902,6 @@ static int acpi_hibernation_enter(void)
> > >  {
> > >         acpi_status status = AE_OK;
> > >
> > > -       ACPI_FLUSH_CPU_CACHE();
> > > -
> > >         /* This shouldn't return.  If it returns, we have a problem */
> > >         status = acpi_enter_sleep_state(ACPI_STATE_S4);
> > >         /* Reprogram control registers */
> > > --
> >
> > Applied (with some edits in the subject and changelog) as 5.17 material, thanks!
>
> Is it for the series or only 4/4?

Just for the [4/4].

> Do I need to do something for 2/4 and 3/4?

For [2/4], please follow my comment on that patch; let me reply to [3/4].


* Re: [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3
  2021-12-06 15:03                   ` Peter Zijlstra
@ 2021-12-08 16:26                     ` Rafael J. Wysocki
  2021-12-09 13:33                       ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-08 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Kirill A. Shutemov, Rafael J. Wysocki, Andi Kleen,
	Borislav Petkov, Dan Williams, Dave Hansen, H. Peter Anvin,
	Kuppuswamy Sathyanarayanan, ACPI Devel Maling List,
	Linux Kernel Mailing List, Ingo Molnar, Rafael J. Wysocki,
	Kuppuswamy Sathyanarayanan, Thomas Gleixner, Tony Luck,
	the arch/x86 maintainers

On Mon, Dec 6, 2021 at 4:03 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Dec 06, 2021 at 03:29:51PM +0300, Kirill A. Shutemov wrote:
> > According to the ACPI spec v6.4, section 8.2, cache flushing is required
> > on entering the C3 power state.
> >
> > Avoid flushing cache on entering other power states.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  drivers/acpi/processor_idle.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> > index 76ef1bcc8848..01495aca850e 100644
> > --- a/drivers/acpi/processor_idle.c
> > +++ b/drivers/acpi/processor_idle.c
> > @@ -567,7 +567,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
> >  {
> >       struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
> >
> > -     ACPI_FLUSH_CPU_CACHE();
> > +     if (cx->type == ACPI_STATE_C3)
> > +             ACPI_FLUSH_CPU_CACHE();
> >
>
> acpi_idle_enter() already does this, acpi_idle_enter_s2idle() has it
> confused again,

No, they do the same thing: acpi_idle_enter_bm() if flags.bm_check is set.

> Also, I think acpi_idle_enter() does it too late; consider
> acpi_idle_enter_bm(). Either that or the BM crud needs more comments.

I think the latter.

Evidently, acpi_idle_play_dead() doesn't support FFH and the BM
thing, so it is only necessary to flush the cache when using
ACPI_CSTATE_SYSTEMIO and when cx->type is C3.
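
The condition described above can be sketched as a standalone predicate. This
is a minimal illustration under stated assumptions, not kernel code: the
enums, `struct cstate`, and `play_dead_needs_flush()` are hypothetical
stand-ins for `struct acpi_processor_cx` and the ACPI_CSTATE_* /
ACPI_STATE_C* constants, and their numeric values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's C-state constants. */
enum cstate_type { CSTATE_C1 = 1, CSTATE_C2, CSTATE_C3 };
enum cstate_entry { ENTRY_SYSTEMIO, ENTRY_FFH, ENTRY_HALT };

/* Hypothetical reduction of struct acpi_processor_cx. */
struct cstate {
	enum cstate_type type;
	enum cstate_entry entry_method;
};

/* Flush only when entering C3 through the system I/O entry method. */
static bool play_dead_needs_flush(const struct cstate *cx)
{
	assert(cx != NULL);
	return cx->entry_method == ENTRY_SYSTEMIO && cx->type == CSTATE_C3;
}
```

Only the combination of C3 and the system I/O entry method yields true; C2
via system I/O, or C3 via another entry method, skips the flush.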


* Re: [PATCH 2/4] ACPI: PM: Remove redundant cache flushing
  2021-12-07 16:35                   ` Rafael J. Wysocki
@ 2021-12-09 13:32                     ` Kirill A. Shutemov
  2021-12-17 18:04                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-09 13:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andi Kleen, Borislav Petkov, Dan Williams, Dave Hansen,
	H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Tue, Dec 07, 2021 at 05:35:38PM +0100, Rafael J. Wysocki wrote:
> I don't think this is needed for S2, because the function doesn't do
> anything low-level in that case and simply returns (IOW, S2 isn't
> really supported).

Updated patch is below. Does it look good?

From 5eb4ec7d8dd463ba186b779dcef2a802d999c59c Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 9 Dec 2021 16:08:02 +0300
Subject: [PATCH 1/2] ACPI: PM: Remove redundant cache flushing

ACPICA code takes care of cache flushing on S1/S2/S3 in
acpi_hw_extended_sleep() and acpi_hw_legacy_sleep().

acpi_suspend_enter() calls into ACPICA code via acpi_enter_sleep_state()
for S1 or x86_acpi_suspend_lowlevel() for S3.

acpi_sleep_prepare() call tree:
  __acpi_pm_prepare()
    acpi_pm_prepare()
      acpi_suspend_ops::prepare_late()
      acpi_hibernation_ops::pre_snapshot()
      acpi_hibernation_ops::prepare()
    acpi_suspend_begin_old()
      acpi_suspend_begin_old::begin()
  acpi_hibernation_begin_old()
    acpi_hibernation_ops_old::acpi_hibernation_begin_old()
  acpi_power_off_prepare()
    pm_power_off_prepare()

Hibernation (S4) and Power Off (S5) don't require cache flushing. So,
the only interesting callsites are acpi_suspend_ops::prepare_late() and
acpi_suspend_begin_old::begin(). Both of them have a cache flush in the
->enter() operation, acpi_suspend_enter().

Remove redundant ACPI_FLUSH_CPU_CACHE() in acpi_sleep_prepare() and
acpi_suspend_enter().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/acpi/sleep.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index eaa47753b758..5ca6c223ba3d 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -73,7 +73,6 @@ static int acpi_sleep_prepare(u32 acpi_state)
 		acpi_set_waking_vector(acpi_wakeup_address);
 
 	}
-	ACPI_FLUSH_CPU_CACHE();
 #endif
 	pr_info("Preparing to enter system sleep state S%d\n", acpi_state);
 	acpi_enable_wakeup_devices(acpi_state);
@@ -566,8 +565,6 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
 	u32 acpi_state = acpi_target_sleep_state;
 	int error;
 
-	ACPI_FLUSH_CPU_CACHE();
-
 	trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
 	switch (acpi_state) {
 	case ACPI_STATE_S1:
-- 
 Kirill A. Shutemov


* Re: [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3
  2021-12-08 16:26                     ` Rafael J. Wysocki
@ 2021-12-09 13:33                       ` Kirill A. Shutemov
  2021-12-17 17:58                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2021-12-09 13:33 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Peter Zijlstra, Andi Kleen, Borislav Petkov, Dan Williams,
	Dave Hansen, H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Wed, Dec 08, 2021 at 05:26:12PM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 6, 2021 at 4:03 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Dec 06, 2021 at 03:29:51PM +0300, Kirill A. Shutemov wrote:
> > > According to the ACPI spec v6.4, section 8.2, cache flushing is required
> > > on entering the C3 power state.
> > >
> > > Avoid flushing cache on entering other power states.
> > >
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  drivers/acpi/processor_idle.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> > > index 76ef1bcc8848..01495aca850e 100644
> > > --- a/drivers/acpi/processor_idle.c
> > > +++ b/drivers/acpi/processor_idle.c
> > > @@ -567,7 +567,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
> > >  {
> > >       struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
> > >
> > > -     ACPI_FLUSH_CPU_CACHE();
> > > +     if (cx->type == ACPI_STATE_C3)
> > > +             ACPI_FLUSH_CPU_CACHE();
> > >
> >
> > acpi_idle_enter() already does this, acpi_idle_enter_s2idle() has it
> > confused again,
> 
> No, they do the same thing: acpi_idle_enter_bm() if flags.bm_check is set.
> 
> > Also, I think acpi_idle_enter() does it too late; consider
> > acpi_idle_enter_bm(). Either that or the BM crud needs more comments.
>
> I think the latter.
>
> Evidently, acpi_idle_play_dead() doesn't support FFH and the BM
> thing, so it is only necessary to flush the cache when using
> ACPI_CSTATE_SYSTEMIO and when cx->type is C3.

I'm new to this and don't completely follow what I need to change.

Does it look correct?

From 3c544bc95a16d6a23dcb0aa50ee905d5e97c9ce5 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 9 Dec 2021 16:24:44 +0300
Subject: [PATCH] ACPI: processor idle: Only flush cache on entering C3

According to the ACPI spec v6.4, section 8.2, cache flushing is required
on entering the C3 power state.

Avoid flushing cache on entering other power states.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/acpi/processor_idle.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 76ef1bcc8848..d2a4d4446eff 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -567,7 +567,9 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 {
 	struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
 
-	ACPI_FLUSH_CPU_CACHE();
+	if (cx->entry_method == ACPI_CSTATE_SYSTEMIO &&
+	    cx->type == ACPI_STATE_C3)
+		ACPI_FLUSH_CPU_CACHE();
 
 	while (1) {
 
-- 
 Kirill A. Shutemov


* Re: [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3
  2021-12-09 13:33                       ` Kirill A. Shutemov
@ 2021-12-17 17:58                         ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-17 17:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Rafael J. Wysocki, Peter Zijlstra, Andi Kleen, Borislav Petkov,
	Dan Williams, Dave Hansen, H. Peter Anvin,
	Kuppuswamy Sathyanarayanan, ACPI Devel Maling List,
	Linux Kernel Mailing List, Ingo Molnar, Rafael J. Wysocki,
	Kuppuswamy Sathyanarayanan, Thomas Gleixner, Tony Luck,
	the arch/x86 maintainers

On Thu, Dec 9, 2021 at 2:33 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Wed, Dec 08, 2021 at 05:26:12PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 6, 2021 at 4:03 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Mon, Dec 06, 2021 at 03:29:51PM +0300, Kirill A. Shutemov wrote:
> > > > According to the ACPI spec v6.4, section 8.2, cache flushing is required
> > > > on entering the C3 power state.
> > > >
> > > > Avoid flushing cache on entering other power states.
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > ---
> > > >  drivers/acpi/processor_idle.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> > > > index 76ef1bcc8848..01495aca850e 100644
> > > > --- a/drivers/acpi/processor_idle.c
> > > > +++ b/drivers/acpi/processor_idle.c
> > > > @@ -567,7 +567,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
> > > >  {
> > > >       struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
> > > >
> > > > -     ACPI_FLUSH_CPU_CACHE();
> > > > +     if (cx->type == ACPI_STATE_C3)
> > > > +             ACPI_FLUSH_CPU_CACHE();
> > > >
> > >
> > > acpi_idle_enter() already does this, acpi_idle_enter_s2idle() has it
> > > confused again,
> >
> > No, they do the same thing: acpi_idle_enter_bm() if flags.bm_check is set.
> >
> > > Also, I think acpi_idle_enter() does it too late; consider
> > > acpi_idle_enter_bm(). Either that or the BM crud needs more comments.
> >
> > I think the latter.
> >
> > Evidently, acpi_idle_play_dead() doesn't support FFH and the BM
> > thing, so it is only necessary to flush the cache when using
> > ACPI_CSTATE_SYSTEMIO and when cx->type is C3.
>
> I'm new to this and don't completely follow what I need to change.
>
> Does it look correct?

It does, but I liked the original one more and so that one has been
applied as 5.17 material (with some edits in the changelog).

Thanks!

> From 3c544bc95a16d6a23dcb0aa50ee905d5e97c9ce5 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Thu, 9 Dec 2021 16:24:44 +0300
> Subject: [PATCH] ACPI: processor idle: Only flush cache on entering C3
>
> According to the ACPI spec v6.4, section 8.2, cache flushing is required
> on entering the C3 power state.
>
> Avoid flushing cache on entering other power states.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/acpi/processor_idle.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 76ef1bcc8848..d2a4d4446eff 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -567,7 +567,9 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
>  {
>         struct acpi_processor_cx *cx = per_cpu(acpi_cstate[index], dev->cpu);
>
> -       ACPI_FLUSH_CPU_CACHE();
> +       if (cx->entry_method == ACPI_CSTATE_SYSTEMIO &&
> +           cx->type == ACPI_STATE_C3)
> +               ACPI_FLUSH_CPU_CACHE();
>
>         while (1) {
>
> --
>  Kirill A. Shutemov


* Re: [PATCH 2/4] ACPI: PM: Remove redundant cache flushing
  2021-12-09 13:32                     ` Kirill A. Shutemov
@ 2021-12-17 18:04                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2021-12-17 18:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Rafael J. Wysocki, Andi Kleen, Borislav Petkov, Dan Williams,
	Dave Hansen, H. Peter Anvin, Kuppuswamy Sathyanarayanan,
	ACPI Devel Maling List, Linux Kernel Mailing List, Ingo Molnar,
	Rafael J. Wysocki, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Tony Luck, the arch/x86 maintainers

On Thu, Dec 9, 2021 at 2:32 PM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Tue, Dec 07, 2021 at 05:35:38PM +0100, Rafael J. Wysocki wrote:
> > I don't think this is needed for S2, because the function doesn't do
> > anything low-level in that case and simply returns (IOW, S2 isn't
> > really supported).
>
> Updated patch is below. Does it look good?

It does, and so applied as 5.17 material with some minor edits in the changelog.

Thanks!

> From 5eb4ec7d8dd463ba186b779dcef2a802d999c59c Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Thu, 9 Dec 2021 16:08:02 +0300
> Subject: [PATCH 1/2] ACPI: PM: Remove redundant cache flushing
>
> ACPICA code takes care of cache flushing on S1/S2/S3 in
> acpi_hw_extended_sleep() and acpi_hw_legacy_sleep().
>
> acpi_suspend_enter() calls into ACPICA code via acpi_enter_sleep_state()
> for S1 or x86_acpi_suspend_lowlevel() for S3.
>
> acpi_sleep_prepare() call tree:
>   __acpi_pm_prepare()
>     acpi_pm_prepare()
>       acpi_suspend_ops::prepare_late()
>       acpi_hibernation_ops::pre_snapshot()
>       acpi_hibernation_ops::prepare()
>     acpi_suspend_begin_old()
>       acpi_suspend_begin_old::begin()
>   acpi_hibernation_begin_old()
>     acpi_hibernation_ops_old::acpi_hibernation_begin_old()
>   acpi_power_off_prepare()
>     pm_power_off_prepare()
>
> Hibernation (S4) and Power Off (S5) don't require cache flushing. So,
> the only interesting callsites are acpi_suspend_ops::prepare_late() and
> acpi_suspend_begin_old::begin(). Both of them have a cache flush in the
> ->enter() operation, acpi_suspend_enter().
>
> Remove redundant ACPI_FLUSH_CPU_CACHE() in acpi_sleep_prepare() and
> acpi_suspend_enter().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  drivers/acpi/sleep.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
> index eaa47753b758..5ca6c223ba3d 100644
> --- a/drivers/acpi/sleep.c
> +++ b/drivers/acpi/sleep.c
> @@ -73,7 +73,6 @@ static int acpi_sleep_prepare(u32 acpi_state)
>                 acpi_set_waking_vector(acpi_wakeup_address);
>
>         }
> -       ACPI_FLUSH_CPU_CACHE();
>  #endif
>         pr_info("Preparing to enter system sleep state S%d\n", acpi_state);
>         acpi_enable_wakeup_devices(acpi_state);
> @@ -566,8 +565,6 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
>         u32 acpi_state = acpi_target_sleep_state;
>         int error;
>
> -       ACPI_FLUSH_CPU_CACHE();
> -
>         trace_suspend_resume(TPS("acpi_suspend"), acpi_state, true);
>         switch (acpi_state) {
>         case ACPI_STATE_S1:
> --
>  Kirill A. Shutemov
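
The redundancy argument in the changelog above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions: every
model_* name is hypothetical and none of it is kernel code. It assumes, as the
changelog states, that ACPICA flushes the cache itself when entering S1 or S3,
and it merely counts how many flushes one suspend cycle would issue before and
after the two ACPI_FLUSH_CPU_CACHE() calls are removed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model only: none of these names exist in the kernel. */
enum model_sleep_state { MODEL_S1 = 1, MODEL_S3 = 3 };

static int model_flush_count;

/* Stands in for ACPI_FLUSH_CPU_CACHE(); it only counts invocations. */
static void model_flush_cpu_cache(void)
{
	model_flush_count++;
}

/*
 * Stands in for the ACPICA sleep entry path. Assumption taken from the
 * changelog: ACPICA flushes the cache on its own for S1 and S3.
 */
static void model_acpica_enter_sleep(enum model_sleep_state state)
{
	if (state == MODEL_S1 || state == MODEL_S3)
		model_flush_cpu_cache();
}

/* Models acpi_sleep_prepare(): the flush is present only before the patch. */
static void model_sleep_prepare(bool patched)
{
	if (!patched)
		model_flush_cpu_cache();
}

/* Models acpi_suspend_enter(): the extra flush is present only before the patch. */
static void model_suspend_enter(enum model_sleep_state state, bool patched)
{
	if (!patched)
		model_flush_cpu_cache();
	model_acpica_enter_sleep(state);
}

/* Returns the number of modelled flushes for one prepare + enter cycle. */
static int model_count_flushes(enum model_sleep_state state, bool patched)
{
	assert(state == MODEL_S1 || state == MODEL_S3);
	model_flush_count = 0;
	model_sleep_prepare(patched);
	model_suspend_enter(state, patched);
	return model_flush_count;
}
```

In this model, model_count_flushes(MODEL_S3, false) counts three flushes and
model_count_flushes(MODEL_S3, true) counts one: the flush issued by the
modelled ACPICA path remains, which is the point the changelog makes.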


end of thread, other threads:[~2021-12-17 18:04 UTC | newest]

Thread overview: 32+ messages
2021-11-16  0:50 [PATCH v1 1/1] x86: Skip WBINVD instruction for VM guest Kuppuswamy Sathyanarayanan
2021-11-16 16:24 ` Borislav Petkov
2021-11-16 16:36   ` Sathyanarayanan Kuppuswamy
2021-11-19  4:03   ` [PATCH v2] " Kuppuswamy Sathyanarayanan
2021-11-25  0:40     ` Thomas Gleixner
2021-12-02 22:21       ` Kirill A. Shutemov
2021-12-02 22:38         ` Dave Hansen
2021-12-02 23:48         ` Thomas Gleixner
2021-12-03 23:49           ` Kirill A. Shutemov
2021-12-04  0:20             ` Dave Hansen
2021-12-04  0:54               ` Kirill A. Shutemov
2021-12-06 15:35                 ` Dave Hansen
2021-12-06 16:39                   ` Dan Williams
2021-12-06 16:53                     ` Dave Hansen
2021-12-06 17:51                       ` Dan Williams
2021-12-04 20:27             ` Rafael J. Wysocki
2021-12-06 12:29               ` [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3 Kirill A. Shutemov
2021-12-06 12:29                 ` [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5 Kirill A. Shutemov
2021-12-08 14:58                   ` Rafael J. Wysocki
2021-12-06 12:29                 ` [PATCH 2/4] ACPI: PM: Remove redundant cache flushing Kirill A. Shutemov
2021-12-07 16:35                   ` Rafael J. Wysocki
2021-12-09 13:32                     ` Kirill A. Shutemov
2021-12-17 18:04                       ` Rafael J. Wysocki
2021-12-06 12:29                 ` [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3 Kirill A. Shutemov
2021-12-06 15:03                   ` Peter Zijlstra
2021-12-08 16:26                     ` Rafael J. Wysocki
2021-12-09 13:33                       ` Kirill A. Shutemov
2021-12-17 17:58                         ` Rafael J. Wysocki
2021-12-06 12:29                 ` [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4 Kirill A. Shutemov
2021-12-08 15:10                   ` Rafael J. Wysocki
2021-12-08 16:04                     ` Kirill A. Shutemov
2021-12-08 16:16                       ` Rafael J. Wysocki
