On 12.07.22 17:09, Chuck Zmudzinski wrote: > On 7/12/2022 9:32 AM, Juergen Gross wrote: >> On 12.07.22 15:22, Chuck Zmudzinski wrote: >>> On 7/12/2022 2:04 AM, Jan Beulich wrote: >>>> On 11.07.2022 19:41, Chuck Zmudzinski wrote: >>>>> Moreover... (please move to the bottom of the code snippet >>>>> for more information about my tests in the Xen PV environment...) >>>>> >>>>> void init_cache_modes(void) >>>>> { >>>>>     u64 pat = 0; >>>>> >>>>>     if (pat_cm_initialized) >>>>>         return; >>>>> >>>>>     if (boot_cpu_has(X86_FEATURE_PAT)) { >>>>>         /* >>>>>          * CPU supports PAT. Set PAT table to be consistent with >>>>>          * PAT MSR. This case supports "nopat" boot option, and >>>>>          * virtual machine environments which support PAT without >>>>>          * MTRRs. In specific, Xen has unique setup to PAT MSR. >>>>>          * >>>>>          * If PAT MSR returns 0, it is considered invalid and emulates >>>>>          * as No PAT. >>>>>          */ >>>>>         rdmsrl(MSR_IA32_CR_PAT, pat); >>>>>     } >>>>> >>>>>     if (!pat) { >>>>>         /* >>>>>          * No PAT. Emulate the PAT table that corresponds to the two >>>>>          * cache bits, PWT (Write Through) and PCD (Cache Disable). >>>>>          * This setup is also the same as the BIOS default setup. >>>>>          * >>>>>          * PTE encoding: >>>>>          * >>>>>          *       PCD >>>>>          *       |PWT  PAT >>>>>          *       ||    slot >>>>>          *       00    0    WB : _PAGE_CACHE_MODE_WB >>>>>          *       01    1    WT : _PAGE_CACHE_MODE_WT >>>>>          *       10    2    UC-: _PAGE_CACHE_MODE_UC_MINUS >>>>>          *       11    3    UC : _PAGE_CACHE_MODE_UC >>>>>          * >>>>>          * NOTE: When WC or WP is used, it is redirected to UC- per >>>>>          * the default setup in __cachemode2pte_tbl[]. >>>>>          */ >>>>>         pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) | >>>>>               PAT(4, WB) | PAT(5, WT) | PAT(6, UC_MINUS) | PAT(7, UC); >>>>>     } >>>>> >>>>>     else if (!pat_bp_enabled) { >>>>>         /* >>>>>          * In some environments, specifically Xen PV, PAT >>>>>          * initialization is skipped because MTRRs are >>>>>          * disabled even though PAT is available. In such >>>>>          * environments, set PAT to initialized and enabled to >>>>>          * correctly indicate to callers of pat_enabled() that >>>>>          * PAT is available and prevent PAT from being disabled. >>>>>          */ >>>>>         pat_bp_enabled = true; >>>>>         pr_info("x86/PAT: PAT enabled by init_cache_modes\n"); >>>>>     } >>>>> >>>>>     __init_cache_modes(pat); >>>>> } >>>>> >>>>> This function, patched with the extra 'else if' block, fixes the >>>>> regression on my Xen worksatation, and the pr_info message >>>>> "x86/PAT: PAT enabled by init_cache_modes" appears in the logs >>>>> when running this patched kernel in my Xen Dom0. This means >>>>> that in the Xen PV environment on my Xen Dom0 workstation, >>>>> rdmsrl(MSR_IA32_CR_PAT, pat) successfully tested for the presence >>>>> of PAT on the virtual CPU that Xen exposed to the Linux kernel on my >>>>> Xen Dom0 workstation. At least that is what I think my tests prove. >>>>> >>>>> So why is this not a valid way to test for the existence of >>>>> PAT in the Xen PV environment? Are the existing comments >>>>> in init_cache_modes() about supporting both the case when >>>>> the "nopat" boot option is set and the specific case of Xen and >>>>> MTRR disabled wrong? My testing confirms those comments are >>>>> correct. >>>> >>>> At the very least this ignores the possible "nopat" an admin may >>>> have passed to the kernel. >>> >>> I realize that. The patch I proposed here only fixes the regression. It >>> would be easy to also modify the patch to also observe the 'nopat" >>> setting. I think your patch had a force_pat_disable local variable that >>> is set if pat is disabled by the administrator with "nopat." With that >>> variable available, modifying the patch so in init_cache_modes we have: >>> >>>      if (!pat || force_pat_disable) { >>>          /* >>>           * No PAT. Emulate the PAT table that corresponds to the two >>> >>> Instead of: >>> >>>      if (!pat) { >>>          /* >>>           * No PAT. Emulate the PAT table that corresponds to the two >>> >>> would cause the kernel to respect the "nopat" setting by the administrator >>> in the Xen PV Dom0 environment. >> >> Chuck, could you please send out a proper patch with your initial fix >> (setting pat_bp_enabled) and the fix above? >> >> I've chatted with Boris Petkov on IRC and he is fine with that. > > That's great, I will submit a formal patch later today. > >> >>> I agree this needs to be fixed up, because currently the code is very >>> confusing and the current variable names and function names do not >>> always accurately describe what they actually do in the code. That is >>> why I am working on a patch to do some re-factoring, which only consists >>> of function and variable name changes and comment changes to fix >>> the places where the comments in the code are misleading or incomplete. >> >> Boris and I agreed to pursue my approach further by removing the >> dependency between PAT and MTRR and to make this whole mess more >> clear. >> >> I will start to work on this as soon as possible, which will >> probably be some time in September. > > Good, I will look for your patches and try them out. > >> >>> I think perhaps the most misnamed variable here is the  local >>> variable pat_disabled in memtypes.c and the most misnamed function is the >>> pat_disable function in memtypes.c. They should be named pat_init_disabled >>> and pat_init_disable, respectively, because they do not really disable >>> PAT in >>> the code but only prevent execution of the pat_init function. That >>> leaves open >>> the possibility for PAT to be enabled by init_cache_modes, which actually >>> occurs in the current code in the Xen PV Dom0 environment, but the current >>> code neglects to set pat_bp_enabled to true in that case. So we need a patch >>> to fix that in order to fix the regression. >> >> In principle I agree, but you should be aware of my refactoring plans. > > I will defer to you and stop working on this re-factoring effort, but I > will prepare a formal patch to fix the regression later today. > > I do think Jan's point about respecting the administrator's "nopat" setting > should also be considered. AFAICT, the "nopat" option in current code Yes, please add that, too. This was what I meant with "the fix above". > is also not being respected on the bare metal, and the patch to > init_cache_modes with a force_no_pat variable is also needed to > ensure "nopat" is respected on the bare metal, AFAICT. Hmm, I don't see how the PAT MSR will be written on bare metal when "nopat" has been specified. I just tried it with a 5.19 kernel and it booted with the correct PAT settings: [ 0.000000] x86/PAT: PAT support disabled via boot option. [ 0.000986] x86/PAT: Configuration [0-7]: WB WT UC- UC WB WT UC- UC Juergen