On 02.12.22 15:33, Kirill A. Shutemov wrote: > On Fri, Dec 02, 2022 at 02:39:58PM +0100, Juergen Gross wrote: >> On 02.12.22 14:27, Kirill A. Shutemov wrote: >>> On Fri, Dec 02, 2022 at 06:56:47AM +0100, Juergen Gross wrote: >>>> On 02.12.22 00:57, Kirill A. Shutemov wrote: >>>>> On Thu, Dec 01, 2022 at 05:33:28PM +0100, Juergen Gross wrote: >>>>>> On 01.12.22 17:26, Kirill A. Shutemov wrote: >>>>>>> On Wed, Nov 02, 2022 at 08:47:10AM +0100, Juergen Gross wrote: >>>>>>>> Today PAT is usable only with MTRR being active, with some nasty tweaks >>>>>>>> to make PAT usable when running as Xen PV guest, which doesn't support >>>>>>>> MTRR. >>>>>>>> >>>>>>>> The reason for this coupling is, that both, PAT MSR changes and MTRR >>>>>>>> changes, require a similar sequence and so full PAT support was added >>>>>>>> using the already available MTRR handling. >>>>>>>> >>>>>>>> Xen PV PAT handling can work without MTRR, as it just needs to consume >>>>>>>> the PAT MSR setting done by the hypervisor without the ability and need >>>>>>>> to change it. This in turn has resulted in a convoluted initialization >>>>>>>> sequence and wrong decisions regarding cache mode availability due to >>>>>>>> misguiding PAT availability flags. >>>>>>>> >>>>>>>> Fix all of that by allowing to use PAT without MTRR and by reworking >>>>>>>> the current PAT initialization sequence to match better with the newly >>>>>>>> introduced generic cache initialization. >>>>>>>> >>>>>>>> This removes the need of the recently added pat_force_disabled flag, so >>>>>>>> remove the remnants of the patch adding it. >>>>>>>> >>>>>>>> Signed-off-by: Juergen Gross >>>>>>> >>>>>>> This patch breaks boot for TDX guest. >>>>>>> >>>>>>> Kernel now tries to set CR0.CD which is forbidden in TDX guest[1] and >>>>>>> causes #VE: >>>>>>> >>>>>>> tdx: Unexpected #VE: 28 >>>>>>> VE fault: 0000 [#1] PREEMPT SMP NOPTI >>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-00015-gadfe7512e1d0 #2646 >>>>>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 >>>>>>> RIP: 0010:native_write_cr0 (arch/x86/kernel/cpu/common.c:427) >>>>>>> Call Trace: >>>>>>> >>>>>>> ? cache_disable (arch/x86/include/asm/cpufeature.h:173 arch/x86/kernel/cpu/cacheinfo.c:1085) >>>>>>> ? cache_cpu_init (arch/x86/kernel/cpu/cacheinfo.c:1132 (discriminator 3)) >>>>>>> ? setup_arch (arch/x86/kernel/setup.c:1079) >>>>>>> ? start_kernel (init/main.c:279 (discriminator 3) init/main.c:477 (discriminator 3) init/main.c:960 (discriminator 3)) >>>>>>> ? load_ucode_bsp (arch/x86/kernel/cpu/microcode/core.c:155) >>>>>>> ? secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358) >>>>>>> >>>>>>> >>>>>>> Any suggestion how to fix it? >>>>>>> >>>>>>> [1] Section 10.6.1. "CR0", https://cdrdv2.intel.com/v1/dl/getContent/733568 >>>>>> >>>>>> What was the solution before? >>>>>> >>>>>> I guess MTRR was disabled, so there was no PAT, too? >>>>> >>>>> Right: >>>>> >>>>> Linus' tree: >>>>> >>>>> [ 0.002589] last_pfn = 0x480000 max_arch_pfn = 0x10000000000 >>>>> [ 0.003976] Disabled >>>>> [ 0.004452] x86/PAT: MTRRs disabled, skipping PAT initialization too. >>>>> [ 0.005856] CPU MTRRs all blank - virtualized system. >>>>> [ 0.006915] x86/PAT: Configuration [0-7]: WB WT UC- UC WB WT UC- UC >>>>> >>>>> tip/master: >>>>> >>>>> [ 0.003443] last_pfn = 0x20b8e max_arch_pfn = 0x10000000000 >>>>> [ 0.005220] Disabled >>>>> [ 0.005818] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT >>>>> [ 0.007752] tdx: Unexpected #VE: 28 >>>>> >>>>> The dangling "Disabled" comes mtrr_bp_init(). >>>>> >>>>> >>>>>> If this is the case, you can go the same route as Xen PV guests do. >>>>> >>>>> Any reason X86_FEATURE_HYPERVISOR cannot be used instead of >>>>> X86_FEATURE_XENPV there? >>>>> >>>>> Do we have any virtualized platform that supports it? >>>> >>>> Yes, of course. Any hardware virtualized guest should be able to use it, >>>> obviously TDX guests are the first ones not being able to do so. >>>> >>>> And above dmesg snipplets are showing rather nicely that not disabling >>>> PAT completely should be a benefit for TDX guests, as all caching modes >>>> would be usable (the PAT MSR seems to be initialized quite fine). >>>> >>>> Instead of X86_FEATURE_XENPV we could introduce something like >>>> X86_FEATURE_PAT_READONLY, which could be set for Xen PV guests and for >>>> TDX guests. >>> >>> Technically, the MSR is writable on TDX. But it seems there's no way to >>> properly change it, following the protocol of changing on MP systems. >> >> Why not? I don't see why it is possible in a non-TDX system, but not within >> a TDX guest. > > Because the protocol you described below requires setting CR0.CD which is > not allowed in TDX guest and causes #VE. Hmm, yes, seems to be a valid reason. :-) > >>> Although, I don't quite follow what role cache disabling playing on system >>> with self-snoop support. Hm? >> >> It is the recommended way to do it. See SDM Vol. 3 Chapter 11 ("Memory Cache >> Control"): >> >> The operating system is responsible for insuring that changes to a PAT entry >> occur in a manner that maintains the consistency of the processor caches and >> translation lookaside buffers (TLB). This is accomplished by following the >> procedure as specified in Section 11.11.8, “MTRR Considerations in MP Systems, >> ”for changing the value of an MTRR in a multiple processor system. It requires >> a specific sequence of operations that includes flushing the processors caches >> and TLBs. >> >> And the sequence for the MTRRs is: >> >> 1. Broadcast to all processors to execute the following code sequence. >> 2. Disable interrupts. >> 3. Wait for all processors to reach this point. >> 4. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1 >> and the NW flag to 0.) >> 5. Flush all caches using the WBINVD instructions. Note on a processor that >> supports self-snooping, CPUID feature flag bit 27, this step is unnecessary. >> 6. If the PGE flag is set in control register CR4, flush all TLBs by clearing >> that flag. >> 7. If the PGE flag is clear in control register CR4, flush all TLBs by executing >> a MOV from control register CR3 to another register and then a MOV from that >> register back to CR3. >> 8. Disable all range registers (by clearing the E flag in register MTRRdefType). >> If only variable ranges are being modified, software may clear the valid bits >> for the affected register pairs instead. >> 9. Update the MTRRs. >> 10. Enable all range registers (by setting the E flag in register MTRRdefType). >> If only variable-range registers were modified and their individual valid >> bits were cleared, then set the valid bits for the affected ranges instead. >> 11. Flush all caches and all TLBs a second time. (The TLB flush is required for >> Pentium 4, Intel Xeon, and P6 family processors. Executing the WBINVD >> instruction is not needed when using Pentium 4, Intel Xeon, and P6 family >> processors, but it may be needed in future systems.) >> 12. Enter the normal cache mode to re-enable caching. (Set the CD and NW flags >> in control register CR0 to 0.) >> 13. Set PGE flag in control register CR4, if cleared in Step 6 (above). >> 14. Wait for all processors to reach this point. >> 15. Enable interrupts. >> >> So cache disabling is recommended. > > Yeah, I read that. > > But the question is what kind of scenario cache disabling is actually > prevents if self-snoop is supported? In this case cache stays intact (no > WBINVD). The next time a cache line gets accessed with different caching > mode the old line gets snooped, right? > > Would it be valid to avoid touching CR0.CD if X86_FEATURE_SELFSNOOP? > That's a question for the Intel architects, I guess. I'd just ask them how to setup PAT in TDX guests. Either they need to change the recommended setup sequence, or the PAT support bit needs to be cleared IMO. Juergen