On 25.05.22 09:45, Thorsten Leemhuis wrote: > > > On 24.05.22 20:32, Chuck Zmudzinski wrote: >> On 5/21/22 6:47 AM, Thorsten Leemhuis wrote: >>> On 20.05.22 16:48, Chuck Zmudzinski wrote: >>>> On 5/20/2022 10:06 AM, Jan Beulich wrote: >>>>> On 20.05.2022 15:33, Chuck Zmudzinski wrote: >>>>>> On 5/20/2022 5:41 AM, Jan Beulich wrote: >>>>>>> On 20.05.2022 10:30, Chuck Zmudzinski wrote: >>>>>>>> On 5/20/2022 2:59 AM, Chuck Zmudzinski wrote: >>>>>>>>> On 5/20/2022 2:05 AM, Jan Beulich wrote: >>>>>>>>>> On 20.05.2022 06:43, Chuck Zmudzinski wrote: >>>>>>>>>>> On 5/4/22 5:14 AM, Juergen Gross wrote: >>>>>>>>>>>> On 04.05.22 10:31, Jan Beulich wrote: >>>>>>>>>>>>> On 03.05.2022 15:22, Juergen Gross wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> ... these uses there are several more. You say nothing on why >>>>>>>>>>>>> those want >>>>>>>>>>>>> leaving unaltered. When preparing my earlier patch I did >>>>>>>>>>>>> inspect them >>>>>>>>>>>>> and came to the conclusion that these all would also better >>>>>>>>>>>>> observe the >>>>>>>>>>>>> adjusted behavior (or else I couldn't have left pat_enabled() >>>>>>>>>>>>> as the >>>>>>>>>>>>> only predicate). In fact, as said in the description of my >>>>>>>>>>>>> earlier >>>>>>>>>>>>> patch, in >>>>>>>>>>>>> my debugging I did find the use in i915_gem_object_pin_map() >>>>>>>>>>>>> to be >>>>>>>>>>>>> the >>>>>>>>>>>>> problematic one, which you leave alone. >>>>>>>>>>>> Oh, I missed that one, sorry. >>>>>>>>>>> That is why your patch would not fix my Haswell unless >>>>>>>>>>> it also touches i915_gem_object_pin_map() in >>>>>>>>>>> drivers/gpu/drm/i915/gem/i915_gem_pages.c >>>>>>>>>>> >>>>>>>>>>>> I wanted to be rather defensive in my changes, but I agree at >>>>>>>>>>>> least >>>>>>>>>>>> the >>>>>>>>>>>> case in arch_phys_wc_add() might want to be changed, too. >>>>>>>>>>> I think your approach needs to be more aggressive so it will fix >>>>>>>>>>> all the known false negatives introduced by bdd8b6c98239 >>>>>>>>>>> such as the one in i915_gem_object_pin_map(). >>>>>>>>>>> >>>>>>>>>>> I looked at Jan's approach and I think it would fix the issue >>>>>>>>>>> with my Haswell as long as I don't use the nopat option. I >>>>>>>>>>> really don't have a strong opinion on that question, but I >>>>>>>>>>> think the nopat option as a Linux kernel option, as opposed >>>>>>>>>>> to a hypervisor option, should only affect the kernel, and >>>>>>>>>>> if the hypervisor provides the pat feature, then the kernel >>>>>>>>>>> should not override that, >>>>>>>>>> Hmm, why would the kernel not be allowed to override that? Such >>>>>>>>>> an override would affect only the single domain where the >>>>>>>>>> kernel runs; other domains could take their own decisions. >>>>>>>>>> >>>>>>>>>> Also, for the sake of completeness: "nopat" used when running on >>>>>>>>>> bare metal has the same bad effect on system boot, so there >>>>>>>>>> pretty clearly is an error cleanup issue in the i915 driver. But >>>>>>>>>> that's orthogonal, and I expect the maintainers may not even care >>>>>>>>>> (but tell us "don't do that then"). >>>>>>>> Actually I just did a test with the last official Debian kernel >>>>>>>> build of Linux 5.16, that is, a kernel before bdd8b6c98239 was >>>>>>>> applied. In fact, the nopat option does *not* break the i915 driver >>>>>>>> in 5.16. That is, with the nopat option, the i915 driver loads >>>>>>>> normally on both the bare metal and on the Xen hypervisor. >>>>>>>> That means your presumption (and the presumption of >>>>>>>> the author of bdd8b6c98239) that the "nopat" option was >>>>>>>> being observed by the i915 driver is incorrect. Setting "nopat" >>>>>>>> had no effect on my system with Linux 5.16. So after doing these >>>>>>>> tests, I am against the aggressive approach of breaking the i915 >>>>>>>> driver with the "nopat" option because prior to bdd8b6c98239, >>>>>>>> nopat did not break the i915 driver. Why break it now? >>>>>>> Because that's, in my understanding, is the purpose of "nopat" >>>>>>> (not breaking the driver of course - that's a driver bug -, but >>>>>>> having an effect on the driver). >>>>>> I wouldn't call it a driver bug, but an incorrect configuration of the >>>>>> kernel by the user.  I presume X86_FEATURE_PAT is required by the >>>>>> i915 driver >>>>> The driver ought to work fine without PAT (and hence without being >>>>> able to make WC mappings). It would use UC instead and be slow, but >>>>> it ought to work. >>>>> >>>>>> and therefore the driver should refuse to disable >>>>>> it if the user requests to disable it and instead warn the user that >>>>>> the driver did not disable the feature, contrary to what the user >>>>>> requested with the nopat option. >>>>>> >>>>>> In any case, my test did not verify that when nopat is set in Linux >>>>>> 5.16, >>>>>> the thread takes the same code path as when nopat is not set, >>>>>> so I am not totally sure that the reason nopat does not break the >>>>>> i915 driver in 5.16 is that static_cpu_has(X86_FEATURE_PAT) >>>>>> returns true even when nopat is set. I could test it with a custom >>>>>> log message in 5.16 if that is necessary. >>>>>> >>>>>> Are you saying it was wrong for static_cpu_has(X86_FEATURE_PAT) >>>>>> to return true in 5.16 when the user requests nopat? >>>>> No, I'm not saying that. It was wrong for this construct to be used >>>>> in the driver, which was fixed for 5.17 (and which had caused the >>>>> regression I did observe, leading to the patch as a hopefully least >>>>> bad option). >>>>> >>>>>> I think that is >>>>>> just permitting a bad configuration to break the driver that a >>>>>> well-written operating system should not allow. The i915 driver >>>>>> was, in my opinion, correctly ignoring the nopat option in 5.16 >>>>>> because that option is not compatible with the hardware the >>>>>> i915 driver is trying to initialize and setup at boot time. At least >>>>>> that is my understanding now, but I will need to test it on 5.16 >>>>>> to be sure I understand it correctly. >>>>>> >>>>>> Also, AFAICT, your patch would break the driver when the nopat >>>>>> option is set and only fix the regression introduced by bdd8b6c98239 >>>>>> when nopat is not set on my box, so your patch would >>>>>> introduce a regression relative to Linux 5.16 and earlier for the >>>>>> case when nopat is set on my box. I think your point would >>>>>> be that it is not a regression if it is an incorrect user >>>>>> configuration. >>>>> Again no - my view is that there's a separate, pre-existing issue >>>>> in the driver which was uncovered by the change. This may be a >>>>> perceived regression, but is imo different from a real one. >>> Sorry, for you maybe, but I'm pretty sure for Linus it's not when it >>> comes to the "no regressions rule". Just took a quick look at quotes >>> from Linus >>> https://www.kernel.org/doc/html/latest/process/handling-regressions.html >>> and found this statement from Linus to back this up: >>> >>> ``` >>> One _particularly_ last-minute revert is the top-most commit (ignoring >>> the version change itself) done just before the release, and while >>> it's very annoying, it's perhaps also instructive. >>> >>> What's instructive about it is that I reverted a commit that wasn't >>> actually buggy. In fact, it was doing exactly what it set out to do, >>> and did it very well. In fact it did it _so_ well that the much >>> improved IO patterns it caused then ended up revealing a user-visible >>> regression due to a real bug in a completely unrelated area. >>> ``` >>> >>> He said that here: >>> https://www.kernel.org/doc/html/latest/process/handling-regressions.html >>> >>> The situation is of course different here, but similar enough. >>> >>>> Since it is a regression, I think for now bdd8b6c98239 should >>>> be reverted and the fix backported to Linux 5.17 stable until >>>> the underlying memory subsystem can provide the i915 driver >>>> with an updated test for the PAT feature that also meets the >>>> requirements of the author of bdd8b6c98239 without breaking >>>> the i915 driver. >>> I'm not a developer and I'm don't known the details of this thread and >>> the backstory of the regression, but it sounds like that's the approach >>> that is needed here until someone comes up with a fix for the regression >>> exposed by bdd8b6c98239. >>> >>> But if I'm wrong, please tell me. >> >> You are mostly right, I think. Reverting bdd8b6c98239 fixes >> it. There is another way to fix it, though. > > Yeah, I'm aware of it. But it seems... > >> The patch proposed >> by Jan Beulich also fixes the regression on my system, so as >> the person reporting this is a regression, I would also be satisfied >> with Jan's patch instead of reverting bdd8b6c98239 as a fix. Jan >> posted his proposed patch here: >> >> https://lore.kernel.org/lkml/9385fa60-fa5d-f559-a137-6608408f88b0@suse.com/ > > ...that approach is not making any progress either? > > Jan, can could provide a short status update here? I'd really like to > get this regression fixed one way or another rather sooner than later, > as this is taken way to long already IMHO. > >> The only reservation I have about Jan's patch is that the commit >> message does not clearly explain how the patch changes what >> the nopat kernel boot option does. It doesn't affect me because >> I don't use nopat, but it should probably be mentioned in the >> commit message, as pointed out here: >> >> https://lore.kernel.org/lkml/bd9ed2c2-1337-27bb-c9da-dfc7b31d492c@netscape.net/ >> >> >> Whatever fix for the regression exposed by bdd8b6c98239 also >> needs to be backported to the stable versions 5.17 and 5.18. > > Sure. > > BTW, as you seem to be familiar with the issue: there is another report > about a regression WRT to Xen and i915 (that is also not making really > progress): > https://lore.kernel.org/lkml/Yn%2FTgj1Ehs%2FBdpHp@itl-email/ > > It's just a wild guess, but bould this somehow be related? No, doesn't seem so. I'm just reviewing the suggested fix: https://lore.kernel.org/lkml/Yo0LwmVUDSBZb44K@itl-email/ Juergen