* i915 and PAT attributes on Xen PV
@ 2022-12-08 13:55 ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 18+ messages in thread
From: Marek Marczykowski-Górecki @ 2022-12-08 13:55 UTC (permalink / raw)
  To: intel-gfx
  Cc: xen-devel, Demi M. Obenour, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Matt Roper, Lucas De Marchi,
	José Roberto de Souza

Hi,

There is an issue with i915 on Xen PV (dom0). The end result is a lot of
glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
(this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
with "modesetting" driver.

After some iterations of debugging, we narrowed it down to how i915 handles
caching. The main difference is that PAT is set up differently on Xen PV
than on native Linux. Normally, Linux has an appropriate abstraction
for that, but apparently something related to i915 doesn't play well
with it. The specific difference is:
native linux:
x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
xen pv:
x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
                                  ~~          ~~      ~~  ~~
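
To make the two layouts easier to compare, here is a small sketch (not
kernel code) that decodes a raw IA32_PAT value into the names printed in
the log lines above and reports which slots differ. The two 64-bit
constants are my reconstruction from those log lines using the
architectural type encodings; the function and macro names are made up
for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Reconstructed from the "x86/PAT: Configuration" lines (assumption). */
#define PAT_NATIVE 0x0407050600070106ULL /* WB WC UC- UC WB WP UC- WT */
#define PAT_XEN_PV 0x0000050100070406ULL /* WB WT UC- UC WC WP UC  UC */

/* Map an architectural PAT memory-type encoding to its short name. */
const char *pat_type_name(uint8_t enc)
{
	switch (enc & 0x7) {
	case 0: return "UC";
	case 1: return "WC";
	case 4: return "WT";
	case 5: return "WP";
	case 6: return "WB";
	case 7: return "UC-";
	default: return "??"; /* encodings 2 and 3 are reserved */
	}
}

/* Each of the 8 slots occupies one byte; only the low 3 bits are used. */
uint8_t pat_slot(uint64_t pat, unsigned int slot)
{
	return (uint8_t)((pat >> (slot * 8)) & 0x7);
}

/* Bit N of the result is set when slot N differs between the layouts. */
unsigned int pat_diff_mask(uint64_t a, uint64_t b)
{
	unsigned int mask = 0;

	for (unsigned int slot = 0; slot < 8; slot++)
		if (pat_slot(a, slot) != pat_slot(b, slot))
			mask |= 1u << slot;
	return mask;
}

/* Print one layout, slot 0 first, in the same order as the kernel log. */
void pat_print(const char *label, uint64_t pat)
{
	printf("%s:", label);
	for (unsigned int slot = 0; slot < 8; slot++)
		printf(" %-3s", pat_type_name(pat_slot(pat, slot)));
	printf("\n");
}

#ifdef PAT_DEMO_MAIN
int main(void)
{
	pat_print("native", PAT_NATIVE);
	pat_print("xen pv", PAT_XEN_PV);
	printf("differing slots mask: 0x%02x\n",
	       pat_diff_mask(PAT_NATIVE, PAT_XEN_PV));
	return 0;
}
#endif
```

With these constants the mask has bits 1, 4, 6 and 7 set, which matches
the slots marked with ~~ above. Building with -DPAT_DEMO_MAIN adds a
small main() that prints both layouts.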

The specific impact depends on kernel version and the hardware. The most
severe issues I see on >=ADL, but some older hardware is affected too -
sometimes only if composition is disabled in the window manager.
Some more information is collected at
https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
duplicates...).

A somewhat related commit is here:
https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
i915 explicitly checks for PAT support, so I'm cc-ing the people mentioned
there too.

Any ideas?

The issue can be easily reproduced without Xen too, by adjusting PAT in
Linux:
-----8<-----
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 66a209f7eb86..319ab60c8d8c 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -400,8 +400,8 @@ void pat_init(void)
 		 * The reserved slots are unused, but mapped to their
 		 * corresponding types in the presence of PAT errata.
 		 */
-		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
-		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
+		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
+		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
 	}
 
 	if (!pat_bp_initialized) {
-----8<-----
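
For anyone who wants to check what MSR value the change above produces
without booting a patched kernel, the composition can be mimicked in
userspace. This is only a sketch: the PAT() macro is modelled on the one
used in the diff, the enum values are the architectural encodings, and
the two function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural memory-type encodings for IA32_PAT entries. */
enum pat_type {
	PAT_UC = 0,
	PAT_WC = 1,
	PAT_WT = 4,
	PAT_WP = 5,
	PAT_WB = 6,
	PAT_UC_MINUS = 7,
};

/* Place memory type y into slot x; each slot is one byte of the MSR. */
#define PAT(x, y) ((uint64_t)PAT_##y << ((x) * 8))

/* Layout from the "-" lines of the diff (native Linux default). */
uint64_t pat_native_layout(void)
{
	return PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
	       PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
}

/* Layout from the "+" lines of the diff (same as reported on Xen PV). */
uint64_t pat_patched_layout(void)
{
	return PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
	       PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
}

#ifdef PAT_DEMO_MAIN
int main(void)
{
	printf("native : 0x%016llx\n",
	       (unsigned long long)pat_native_layout());
	printf("patched: 0x%016llx\n",
	       (unsigned long long)pat_patched_layout());
	return 0;
}
#endif
```

Both layouts still contain a WC entry; what changes is the slot it lives
in (1 natively, 4 with the patch), so anything that depends on a fixed
slot number rather than on the memory type would behave differently.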

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for i915 and PAT attributes on Xen PV
  2022-12-08 13:55 ` [Intel-gfx] " Marek Marczykowski-Górecki
  (?)
@ 2022-12-08 16:24 ` Patchwork
  -1 siblings, 0 replies; 18+ messages in thread
From: Patchwork @ 2022-12-08 16:24 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: intel-gfx

== Series Details ==

Series: i915 and PAT attributes on Xen PV
URL   : https://patchwork.freedesktop.org/series/111776/
State : warning

== Summary ==

Error: dim checkpatch failed
d4d2b7cd4c8c i915 and PAT attributes on Xen PV
-:59: ERROR:MISSING_SIGN_OFF: Missing Signed-off-by: line(s)

total: 1 errors, 0 warnings, 0 checks, 10 lines checked



* [Intel-gfx] ✗ Fi.CI.BAT: failure for i915 and PAT attributes on Xen PV
  2022-12-08 13:55 ` [Intel-gfx] " Marek Marczykowski-Górecki
  (?)
  (?)
@ 2022-12-08 16:51 ` Patchwork
  -1 siblings, 0 replies; 18+ messages in thread
From: Patchwork @ 2022-12-08 16:51 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: intel-gfx

== Series Details ==

Series: i915 and PAT attributes on Xen PV
URL   : https://patchwork.freedesktop.org/series/111776/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_12483 -> Patchwork_111776v1
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_111776v1 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_111776v1, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/index.html

Participating hosts (37 -> 37)
------------------------------

  Additional (2): fi-hsw-4770 bat-dg1-5 
  Missing    (2): fi-rkl-11600 bat-dg1-6 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_111776v1:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_busy@busy@all:
    - fi-glk-j4005:       [PASS][1] -> [FAIL][2] +19 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-glk-j4005/igt@gem_busy@busy@all.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-glk-j4005/igt@gem_busy@busy@all.html

  * igt@gem_exec_fence@basic-await@vcs0:
    - fi-elk-e7500:       [PASS][3] -> [TIMEOUT][4] +1 similar issue
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-elk-e7500/igt@gem_exec_fence@basic-await@vcs0.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-elk-e7500/igt@gem_exec_fence@basic-await@vcs0.html

  * igt@gem_exec_fence@nb-await@bcs0:
    - fi-bsw-nick:        [PASS][5] -> [FAIL][6] +7 similar issues
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-bsw-nick/igt@gem_exec_fence@nb-await@bcs0.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-bsw-nick/igt@gem_exec_fence@nb-await@bcs0.html
    - fi-bsw-kefka:       [PASS][7] -> [TIMEOUT][8] +1 similar issue
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-bsw-kefka/igt@gem_exec_fence@nb-await@bcs0.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-bsw-kefka/igt@gem_exec_fence@nb-await@bcs0.html

  * igt@gem_exec_fence@nb-await@vcs0:
    - fi-glk-j4005:       [PASS][9] -> [TIMEOUT][10] +1 similar issue
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-glk-j4005/igt@gem_exec_fence@nb-await@vcs0.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-glk-j4005/igt@gem_exec_fence@nb-await@vcs0.html
    - fi-bsw-nick:        [PASS][11] -> [TIMEOUT][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-bsw-nick/igt@gem_exec_fence@nb-await@vcs0.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-bsw-nick/igt@gem_exec_fence@nb-await@vcs0.html

  * igt@gem_tiled_blits@basic:
    - fi-ilk-650:         NOTRUN -> [TIMEOUT][13] +6 similar issues
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-ilk-650/igt@gem_tiled_blits@basic.html

  * igt@i915_module_load@load:
    - fi-ilk-650:         NOTRUN -> [FAIL][14] +18 similar issues
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-ilk-650/igt@i915_module_load@load.html

  * igt@i915_selftest@live@mman:
    - fi-elk-e7500:       [PASS][15] -> [DMESG-FAIL][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-elk-e7500/igt@i915_selftest@live@mman.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-elk-e7500/igt@i915_selftest@live@mman.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions:
    - fi-bsw-kefka:       [PASS][17] -> [FAIL][18] +9 similar issues
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor@varying-size:
    - fi-pnv-d510:        [PASS][19] -> [FAIL][20] +4 similar issues
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-pnv-d510/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@varying-size.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-pnv-d510/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@varying-size.html

  * igt@kms_frontbuffer_tracking@basic:
    - fi-ivb-3770:        [PASS][21] -> [FAIL][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-ivb-3770/igt@kms_frontbuffer_tracking@basic.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-ivb-3770/igt@kms_frontbuffer_tracking@basic.html
    - fi-rkl-guc:         [PASS][23] -> [FAIL][24]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-rkl-guc/igt@kms_frontbuffer_tracking@basic.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-rkl-guc/igt@kms_frontbuffer_tracking@basic.html
    - fi-adl-ddr5:        [PASS][25] -> [FAIL][26]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-adl-ddr5/igt@kms_frontbuffer_tracking@basic.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-adl-ddr5/igt@kms_frontbuffer_tracking@basic.html

  * igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-b-hdmi-a-1:
    - fi-elk-e7500:       [PASS][27] -> [FAIL][28] +32 similar issues
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-elk-e7500/igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-b-hdmi-a-1.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-elk-e7500/igt@kms_pipe_crc_basic@read-crc-frame-sequence@pipe-b-hdmi-a-1.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@kms_frontbuffer_tracking@basic:
    - {bat-adln-1}:       [PASS][29] -> [FAIL][30]
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/bat-adln-1/igt@kms_frontbuffer_tracking@basic.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-adln-1/igt@kms_frontbuffer_tracking@basic.html
    - {bat-rpls-2}:       [PASS][31] -> [FAIL][32]
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/bat-rpls-2/igt@kms_frontbuffer_tracking@basic.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-rpls-2/igt@kms_frontbuffer_tracking@basic.html
    - {bat-rplp-1}:       [PASS][33] -> [FAIL][34]
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/bat-rplp-1/igt@kms_frontbuffer_tracking@basic.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-rplp-1/igt@kms_frontbuffer_tracking@basic.html
    - {bat-adls-5}:       [PASS][35] -> [FAIL][36]
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/bat-adls-5/igt@kms_frontbuffer_tracking@basic.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-adls-5/igt@kms_frontbuffer_tracking@basic.html

  
Known issues
------------

  Here are the changes found in Patchwork_111776v1 that come from known issues:

### CI changes ###

#### Possible fixes ####

  * boot:
    - fi-ilk-650:         [FAIL][37] ([i915#7350]) -> [PASS][38]
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-ilk-650/boot.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-ilk-650/boot.html

  

### IGT changes ###

#### Issues hit ####

  * igt@gem_huc_copy@huc-copy:
    - fi-ilk-650:         NOTRUN -> [SKIP][39] ([fdo#109271]) +5 similar issues
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-ilk-650/igt@gem_huc_copy@huc-copy.html

  * igt@gem_mmap@basic:
    - bat-dg1-5:          NOTRUN -> [SKIP][40] ([i915#4083])
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@gem_mmap@basic.html

  * igt@gem_tiled_fence_blits@basic:
    - bat-dg1-5:          NOTRUN -> [SKIP][41] ([i915#4077]) +2 similar issues
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@gem_tiled_fence_blits@basic.html

  * igt@gem_tiled_pread_basic:
    - bat-dg1-5:          NOTRUN -> [SKIP][42] ([i915#4079]) +1 similar issue
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@gem_tiled_pread_basic.html

  * igt@i915_pm_backlight@basic-brightness:
    - bat-dg1-5:          NOTRUN -> [SKIP][43] ([i915#7561])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@i915_pm_backlight@basic-brightness.html

  * igt@i915_pm_rps@basic-api:
    - bat-dg1-5:          NOTRUN -> [SKIP][44] ([i915#6621])
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@i915_pm_rps@basic-api.html

  * igt@kms_addfb_basic@addfb25-y-tiled-small-legacy:
    - fi-hsw-4770:        NOTRUN -> [SKIP][45] ([fdo#109271]) +11 similar issues
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-hsw-4770/igt@kms_addfb_basic@addfb25-y-tiled-small-legacy.html

  * igt@kms_addfb_basic@basic-x-tiled-legacy:
    - bat-dg1-5:          NOTRUN -> [SKIP][46] ([i915#4212]) +7 similar issues
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@kms_addfb_basic@basic-x-tiled-legacy.html

  * igt@kms_addfb_basic@basic-y-tiled-legacy:
    - bat-dg1-5:          NOTRUN -> [SKIP][47] ([i915#4215])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@kms_addfb_basic@basic-y-tiled-legacy.html

  * igt@kms_chamelium@dp-crc-fast:
    - fi-hsw-4770:        NOTRUN -> [SKIP][48] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-hsw-4770/igt@kms_chamelium@dp-crc-fast.html

  * igt@kms_chamelium@hdmi-edid-read:
    - fi-ilk-650:         NOTRUN -> [SKIP][49] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-ilk-650/igt@kms_chamelium@hdmi-edid-read.html

  * igt@kms_chamelium@hdmi-hpd-fast:
    - bat-dg1-5:          NOTRUN -> [SKIP][50] ([fdo#111827]) +8 similar issues
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@kms_chamelium@hdmi-hpd-fast.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor:
    - bat-dg1-5:          NOTRUN -> [SKIP][51] ([i915#4103] / [i915#4213])
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@kms_cursor_legacy@basic-busy-flip-before-cursor.html

  * igt@kms_force_connector_basic@force-load-detect:
    - bat-dg1-5:          NOTRUN -> [SKIP][52] ([fdo#109285])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_frontbuffer_tracking@basic:
    - bat-adlp-4:         [PASS][53] -> [FAIL][54] ([i915#2546])
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/bat-adlp-4/igt@kms_frontbuffer_tracking@basic.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-adlp-4/igt@kms_frontbuffer_tracking@basic.html
    - fi-icl-u2:          [PASS][55] -> [FAIL][56] ([i915#2546])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-icl-u2/igt@kms_frontbuffer_tracking@basic.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-icl-u2/igt@kms_frontbuffer_tracking@basic.html
    - fi-glk-j4005:       [PASS][57] -> [FAIL][58] ([i915#2546])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-glk-j4005/igt@kms_frontbuffer_tracking@basic.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-glk-j4005/igt@kms_frontbuffer_tracking@basic.html
    - fi-skl-guc:         [PASS][59] -> [FAIL][60] ([i915#2546])
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-skl-guc/igt@kms_frontbuffer_tracking@basic.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-skl-guc/igt@kms_frontbuffer_tracking@basic.html
    - fi-kbl-soraka:      [PASS][61] -> [FAIL][62] ([i915#2546])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/fi-kbl-soraka/igt@kms_frontbuffer_tracking@basic.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-kbl-soraka/igt@kms_frontbuffer_tracking@basic.html

  * igt@kms_psr@sprite_plane_onoff:
    - bat-dg1-5:          NOTRUN -> [SKIP][63] ([i915#1072] / [i915#4078]) +3 similar issues
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@kms_psr@sprite_plane_onoff.html
    - fi-hsw-4770:        NOTRUN -> [SKIP][64] ([fdo#109271] / [i915#1072]) +3 similar issues
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/fi-hsw-4770/igt@kms_psr@sprite_plane_onoff.html

  * igt@kms_setmode@basic-clone-single-crtc:
    - bat-dg1-5:          NOTRUN -> [SKIP][65] ([i915#3555])
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@kms_setmode@basic-clone-single-crtc.html

  * igt@prime_vgem@basic-fence-read:
    - bat-dg1-5:          NOTRUN -> [SKIP][66] ([i915#3708]) +3 similar issues
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@prime_vgem@basic-fence-read.html

  * igt@prime_vgem@basic-gtt:
    - bat-dg1-5:          NOTRUN -> [SKIP][67] ([i915#3708] / [i915#4077]) +1 similar issue
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@prime_vgem@basic-gtt.html

  * igt@prime_vgem@basic-userptr:
    - bat-dg1-5:          NOTRUN -> [SKIP][68] ([i915#3708] / [i915#4873])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-dg1-5/igt@prime_vgem@basic-userptr.html

  
#### Possible fixes ####

  * igt@gem_exec_suspend@basic-s0@smem:
    - {bat-rplp-1}:       [DMESG-WARN][69] ([i915#2867]) -> [PASS][70]
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/bat-rplp-1/igt@gem_exec_suspend@basic-s0@smem.html
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-rplp-1/igt@gem_exec_suspend@basic-s0@smem.html

  * igt@i915_selftest@live@requests:
    - {bat-rpls-2}:       [INCOMPLETE][71] ([i915#6257]) -> [PASS][72]
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12483/bat-rpls-2/igt@i915_selftest@live@requests.html
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/bat-rpls-2/igt@i915_selftest@live@requests.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#2546]: https://gitlab.freedesktop.org/drm/intel/issues/2546
  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4078]: https://gitlab.freedesktop.org/drm/intel/issues/4078
  [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212
  [i915#4213]: https://gitlab.freedesktop.org/drm/intel/issues/4213
  [i915#4215]: https://gitlab.freedesktop.org/drm/intel/issues/4215
  [i915#4873]: https://gitlab.freedesktop.org/drm/intel/issues/4873
  [i915#6257]: https://gitlab.freedesktop.org/drm/intel/issues/6257
  [i915#6367]: https://gitlab.freedesktop.org/drm/intel/issues/6367
  [i915#6434]: https://gitlab.freedesktop.org/drm/intel/issues/6434
  [i915#6559]: https://gitlab.freedesktop.org/drm/intel/issues/6559
  [i915#6621]: https://gitlab.freedesktop.org/drm/intel/issues/6621
  [i915#7350]: https://gitlab.freedesktop.org/drm/intel/issues/7350
  [i915#7561]: https://gitlab.freedesktop.org/drm/intel/issues/7561


Build changes
-------------

  * Linux: CI_DRM_12483 -> Patchwork_111776v1

  CI-20190529: 20190529
  CI_DRM_12483: 365a519c3ad617b35a9c0eb49ba530614aa2c4f2 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7085: 11af20de3877b23a244b816453bfc41d83591a15 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_111776v1: 365a519c3ad617b35a9c0eb49ba530614aa2c4f2 @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

9796d8adb3fe i915 and PAT attributes on Xen PV

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111776v1/index.html

* Re: [cache coherency bug] i915 and PAT attributes
  2022-12-08 13:55 ` [Intel-gfx] " Marek Marczykowski-Górecki
@ 2022-12-16 15:30   ` Andrew Cooper
  -1 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2022-12-16 15:30 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, intel-gfx
  Cc: xen-devel, Demi M. Obenour, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Matt Roper, Lucas De Marchi,
	José Roberto de Souza, Daniel Vetter,
	the arch/x86 maintainers

On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> Hi,
>
> There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> with "modesetting" driver.
>
> After some iterations of debugging, we narrowed it down to i915 handling
> caching. The main difference is that PAT is setup differently on Xen PV
> than on native Linux. Normally, Linux does have appropriate abstraction
> for that, but apparently something related to i915 doesn't play well
> with it. The specific difference is:
> native linux:
> x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> xen pv:
> x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
>                                   ~~          ~~      ~~  ~~
>
> The specific impact depends on kernel version and the hardware. The most
> severe issues I see on >=ADL, but some older hardware is affected too -
> sometimes only if composition is disabled in the window manager.
> Some more information is collected at
> https://github.com/QubesOS/qubes-issues/issues/4782 (and few linked
> duplicates...).
>
> Kind-of related commit is here:
> https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> there too.
>
> Any ideas?
>
> The issue can be easily reproduced without Xen too, by adjusting PAT in
> Linux:
> -----8<-----
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 66a209f7eb86..319ab60c8d8c 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -400,8 +400,8 @@ void pat_init(void)
>  		 * The reserved slots are unused, but mapped to their
>  		 * corresponding types in the presence of PAT errata.
>  		 */
> -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
>  	}
>  
>  	if (!pat_bp_initialized) {
> -----8<-----
>

Hello, can anyone help please?

Intel's CI has taken this reproducer of the bug, and confirmed the
regression. 
https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb

We're reasonably confident that it is an i915 bug (given the repro with
no Xen in the mix), but we're out of any further ideas.

Thanks,

~Andrew

* Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
  2022-12-16 15:30   ` [Intel-gfx] " Andrew Cooper
@ 2022-12-22  8:29     ` Ville Syrjälä
  -1 siblings, 0 replies; 18+ messages in thread
From: Ville Syrjälä @ 2022-12-22  8:29 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: intel-gfx, the arch/x86 maintainers, Lucas De Marchi,
	Marek Marczykowski-Górecki, Daniel Vetter, Rodrigo Vivi,
	Demi M. Obenour, xen-devel

On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote:
> On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > Hi,
> >
> > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > with "modesetting" driver.
> >
> > After some iterations of debugging, we narrowed it down to i915 handling
> > caching. The main difference is that PAT is setup differently on Xen PV
> > than on native Linux. Normally, Linux does have appropriate abstraction
> > for that, but apparently something related to i915 doesn't play well
> > with it. The specific difference is:
> > native linux:
> > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > xen pv:
> > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> >                                   ~~          ~~      ~~  ~~
> >
> > The specific impact depends on kernel version and the hardware. The most
> > severe issues I see on >=ADL, but some older hardware is affected too -
> > sometimes only if composition is disabled in the window manager.
> > Some more information is collected at
> > https://github.com/QubesOS/qubes-issues/issues/4782 (and few linked
> > duplicates...).
> >
> > Kind-of related commit is here:
> > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > there too.
> >
> > Any ideas?
> >
> > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > Linux:
> > -----8<-----
> > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > index 66a209f7eb86..319ab60c8d8c 100644
> > --- a/arch/x86/mm/pat/memtype.c
> > +++ b/arch/x86/mm/pat/memtype.c
> > @@ -400,8 +400,8 @@ void pat_init(void)
> >  		 * The reserved slots are unused, but mapped to their
> >  		 * corresponding types in the presence of PAT errata.
> >  		 */
> > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
> >  	}
> >  
> >  	if (!pat_bp_initialized) {
> > -----8<-----
> >
> 
> Hello, can anyone help please?
> 
> Intel's CI has taken this reproducer of the bug, and confirmed the
> regression. 
> https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> 
> We're reasonably confident that it is an i915 bug (given the repro with
> no Xen in the mix), but we're out of any further ideas.

I don't think we have any code that assumes anything about the PAT,
apart from WC being available (which seems like it should still be
the case with your modified PAT). I suppose you'll just have to
start digging from pgprot_writecombine()/pgprot_noncached() and make
sure everything ends up using the correct PAT entry.
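
As a rough illustration of what "using the correct PAT entry" involves,
here is a minimal user-space sketch (not kernel code). It assumes the
lowest PAT slot holding the wanted type is the one that gets used; the
enum and helper names are made up for this example. Only the PWT/PCD/PAT
bit positions of a 4 KiB PTE and the memory-type encodings are
architectural.

```c
#include <assert.h>
#include <stdint.h>

/* Memory types, numbered with the architectural IA32_PAT encodings. */
enum pat_type {
	PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
	PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Cache-control bits of a 4 KiB x86 page-table entry. */
#define PTE_PWT (1ULL << 3)
#define PTE_PCD (1ULL << 4)
#define PTE_PAT (1ULL << 7)

/*
 * Hypothetical helper: return the lowest PAT index whose slot holds
 * `want`, or -1 if the layout does not provide that type at all.
 */
static int pat_index_for(const enum pat_type layout[8], enum pat_type want)
{
	for (int i = 0; i < 8; i++)
		if (layout[i] == want)
			return i;
	return -1;
}

/* Hypothetical helper: encode a PAT index (0-7) as PTE cache bits. */
static uint64_t pte_bits_for_index(int index)
{
	uint64_t bits = 0;

	assert(index >= 0 && index < 8);
	if (index & 1)
		bits |= PTE_PWT;
	if (index & 2)
		bits |= PTE_PCD;
	if (index & 4)
		bits |= PTE_PAT;
	return bits;
}
```

With the Xen PV layout from this thread, pat_index_for() returns 4 for
WC, so a WC mapping carries only the PAT bit; with the native layout it
returns 1, i.e. only PWT.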

-- 
Ville Syrjälä
Intel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
  2022-12-22  8:29     ` Ville Syrjälä
  (?)
@ 2023-01-01 23:24     ` Marek Marczykowski-Górecki
  2023-01-02  0:03       ` Demi Marie Obenour
  -1 siblings, 1 reply; 18+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-01-01 23:24 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Andrew Cooper, intel-gfx, the arch/x86 maintainers,
	Lucas De Marchi, Daniel Vetter, Rodrigo Vivi, Demi M. Obenour,
	xen-devel

On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote:
> > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > >
> > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > with "modesetting" driver.
> > >
> > > After some iterations of debugging, we narrowed it down to i915 handling
> > > caching. The main difference is that PAT is setup differently on Xen PV
> > > than on native Linux. Normally, Linux does have appropriate abstraction
> > > for that, but apparently something related to i915 doesn't play well
> > > with it. The specific difference is:
> > > native linux:
> > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > xen pv:
> > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > >                                   ~~          ~~      ~~  ~~
> > >
> > > The specific impact depends on kernel version and the hardware. The most
> > > severe issues I see on >=ADL, but some older hardware is affected too -
> > > sometimes only if composition is disabled in the window manager.
> > > Some more information is collected at
> > > https://github.com/QubesOS/qubes-issues/issues/4782 (and few linked
> > > duplicates...).
> > >
> > > Kind-of related commit is here:
> > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > there too.
> > >
> > > Any ideas?
> > >
> > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > Linux:
> > > -----8<-----
> > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > index 66a209f7eb86..319ab60c8d8c 100644
> > > --- a/arch/x86/mm/pat/memtype.c
> > > +++ b/arch/x86/mm/pat/memtype.c
> > > @@ -400,8 +400,8 @@ void pat_init(void)
> > >  		 * The reserved slots are unused, but mapped to their
> > >  		 * corresponding types in the presence of PAT errata.
> > >  		 */
> > > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
> > >  	}
> > >  
> > >  	if (!pat_bp_initialized) {
> > > -----8<-----
> > >
> > 
> > Hello, can anyone help please?
> > 
> > Intel's CI has taken this reproducer of the bug, and confirmed the
> > regression. 
> > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > 
> > We're reasonably confident that it is an i915 bug (given the repro with
> > no Xen in the mix), but we're out of any further ideas.
> 
> I don't think we have any code that assumes anything about the PAT,
> apart from WC being available (which seems like it should still be
> the case with your modified PAT). I suppose you'll just have to 
> start digging from pgprot_writecombine()/noncached() and make sure
> everything ends up using the correct PAT entry.

I tried several approaches to this, without success. Here is an update on
the debugging (also reported live on #intel-gfx):

I ran several tests with different PAT configurations (by modifying the
Xen code that sets the MSR). The full table is at https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
Some highlights:
- 1=WC, 4=WT - good
- 1=WT, 4=WC - bad
- 1=WT, 3=WC (4=WC too) - good
- 1=WT, 5=WC - good

So it seems to me that WC at index 4 is problematic for some reason.

Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
write PTEs and verify the requested cache attributes. There, all the
requested WC mappings appear to be translated properly (using index 1,
3, 4, or 5 according to the PAT settings), and reading the PTE back
shows it is indeed set correctly. I didn't add a read-back after
HYPERVISOR_update_va_mapping, but I verified it isn't used for setting WC.

Using the same method, I also checked that indexes that aren't supposed
to be used (for example index 4 when both 3 and 4 are WC) indeed are not
used. So, the hypothesis that specific indexes are hardcoded somewhere
is unlikely.
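
The kind of check described above can be sketched roughly as follows.
This is a stand-alone illustration, not the instrumentation actually
added to mmu_pv.c; the helper names are invented here, and it only
handles 4 KiB PTEs (large pages keep the PAT bit elsewhere).

```c
#include <assert.h>
#include <stdint.h>

/* Memory types, numbered with the architectural IA32_PAT encodings. */
enum pat_type {
	PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
	PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Cache-control bits of a 4 KiB x86 page-table entry. */
#define PTE_PWT (1ULL << 3)
#define PTE_PCD (1ULL << 4)
#define PTE_PAT (1ULL << 7)

/* Hypothetical helper: extract the 3-bit PAT index from a 4 KiB PTE. */
static int pte_pat_index(uint64_t pte)
{
	return ((pte & PTE_PWT) ? 1 : 0) |
	       ((pte & PTE_PCD) ? 2 : 0) |
	       ((pte & PTE_PAT) ? 4 : 0);
}

/* Hypothetical helper: memory type a PTE selects under a PAT layout. */
static enum pat_type pte_memory_type(uint64_t pte,
				     const enum pat_type layout[8])
{
	int index = pte_pat_index(pte);

	assert(index >= 0 && index < 8);
	return layout[index];
}
```

Comparing pte_memory_type() against the type that was requested, both
when the PTE is written and after reading it back, is the sort of
verification meant here.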

This all looks very weird to me. Any ideas?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
  2023-01-01 23:24     ` Marek Marczykowski-Górecki
@ 2023-01-02  0:03       ` Demi Marie Obenour
  2023-01-02  1:00           ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 18+ messages in thread
From: Demi Marie Obenour @ 2023-01-02  0:03 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Ville Syrjälä
  Cc: Andrew Cooper, intel-gfx, the arch/x86 maintainers,
	Lucas De Marchi, Daniel Vetter, Rodrigo Vivi, xen-devel

On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> > On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote:
> > > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > >
> > > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > > with "modesetting" driver.
> > > >
> > > > After some iterations of debugging, we narrowed it down to i915 handling
> > > > caching. The main difference is that PAT is setup differently on Xen PV
> > > > than on native Linux. Normally, Linux does have appropriate abstraction
> > > > for that, but apparently something related to i915 doesn't play well
> > > > with it. The specific difference is:
> > > > native linux:
> > > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > > xen pv:
> > > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > > >                                   ~~          ~~      ~~  ~~
> > > >
> > > > The specific impact depends on kernel version and the hardware. The most
> > > > severe issues I see on >=ADL, but some older hardware is affected too -
> > > > sometimes only if composition is disabled in the window manager.
> > > > Some more information is collected at
> > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and few linked
> > > > duplicates...).
> > > >
> > > > Kind-of related commit is here:
> > > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > > there too.
> > > >
> > > > Any ideas?
> > > >
> > > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > > Linux:
> > > > -----8<-----
> > > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > > index 66a209f7eb86..319ab60c8d8c 100644
> > > > --- a/arch/x86/mm/pat/memtype.c
> > > > +++ b/arch/x86/mm/pat/memtype.c
> > > > @@ -400,8 +400,8 @@ void pat_init(void)
> > > >  		 * The reserved slots are unused, but mapped to their
> > > >  		 * corresponding types in the presence of PAT errata.
> > > >  		 */
> > > > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
> > > >  	}
> > > >  
> > > >  	if (!pat_bp_initialized) {
> > > > -----8<-----
> > > >
> > > 
> > > Hello, can anyone help please?
> > > 
> > > Intel's CI has taken this reproducer of the bug, and confirmed the
> > > regression. 
> > > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > > 
> > > We're reasonably confident that it is an i915 bug (given the repro with
> > > no Xen in the mix), but we're out of any further ideas.
> > 
> > I don't think we have any code that assumes anything about the PAT,
> > apart from WC being available (which seems like it should still be
> > the case with your modified PAT). I suppose you'll just have to 
> > start digging from pgprot_writecombine()/noncached() and make sure
> > everything ends up using the correct PAT entry.
> 
> I tried several approaches to this, without success. Here is an update on
> the debugging (also reported live on #intel-gfx):
> 
> I ran several tests with different PAT configurations (by modifying the
> Xen code that sets the MSR). The full table is at https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
> Some highlights:
> - 1=WC, 4=WT - good
> - 1=WT, 4=WC - bad
> - 1=WT, 3=WC (4=WC too) - good
> - 1=WT, 5=WC - good
> 
> So it seems to me that WC at index 4 is problematic for some reason.
> 
> Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
> write PTEs and verify the requested cache attributes. There, all the
> requested WC mappings appear to be translated properly (using index 1,
> 3, 4, or 5 according to the PAT settings), and reading the PTE back
> shows it is indeed set correctly. I didn't add a read-back after
> HYPERVISOR_update_va_mapping, but I verified it isn't used for setting WC.
> 
> Using the same method, I also checked that indexes that aren't supposed
> to be used (for example index 4 when both 3 and 4 are WC) indeed are not
> used. So, the hypothesis that specific indexes are hardcoded somewhere
> is unlikely.
> 
> This all looks very weird to me. Any ideas?

Old CPUs have had hardware errata that caused the top bit of the PAT
index to be ignored in certain cases.  Could modern CPUs be ignoring
this bit when accessing iGPU memory or registers?  With WC at position
4, this would cause WC to be treated as WB, which is consistent with the
observed behavior.  WC at position 3 would not be impacted, and WC at
position 5 would be treated as WT (the type at position 1 in that test),
which I expect to be safe.  One way to check this is to test 1=WB, 5=WC.
If my hypothesis is correct, this should trigger the bug, even though
entry 1 in the PAT is unused because entry 0 is also WB.
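
To make the prediction concrete, here is a small sketch of the
hypothesized behavior: the OS picks the lowest index holding WC, and the
hardware then (hypothetically) drops bit 2 of that index. The function
names are invented for this illustration, and the masking models the
hypothesis only, not documented hardware behavior.

```c
#include <assert.h>

/* Memory types, numbered with the architectural IA32_PAT encodings. */
enum pat_type {
	PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
	PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Lowest PAT index whose slot holds `want`, or -1 if there is none. */
static int lowest_index_of(const enum pat_type layout[8], enum pat_type want)
{
	for (int i = 0; i < 8; i++)
		if (layout[i] == want)
			return i;
	return -1;
}

/*
 * Hypothesis under test: the top bit of the 3-bit PAT index is ignored,
 * so index N behaves like index N & 3.
 */
static enum pat_type type_if_top_bit_ignored(const enum pat_type layout[8],
					     int index)
{
	assert(index >= 0 && index < 8);
	return layout[index & 3];
}
```

For the Xen PV layout (WC at 4) this yields WB, matching the "bad"
result; for the native layout (WC at 1) it yields WC.  With WC at 5 the
outcome is whatever sits at index 1, which is why 1=WB, 5=WC is the
interesting case.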
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
  2023-01-02  0:03       ` Demi Marie Obenour
@ 2023-01-02  1:00           ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 18+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-01-02  1:00 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Andrew Cooper, intel-gfx, the arch/x86 maintainers,
	Lucas De Marchi, Daniel Vetter, Rodrigo Vivi, xen-devel

On Sun, Jan 01, 2023 at 07:03:18PM -0500, Demi Marie Obenour wrote:
> On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-Górecki wrote:
> > On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> > > On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote:
> > > > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > > > Hi,
> > > > >
> > > > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > > > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > > > with "modesetting" driver.
> > > > >
> > > > > After some iterations of debugging, we narrowed it down to i915 handling
> > > > > caching. The main difference is that PAT is setup differently on Xen PV
> > > > > than on native Linux. Normally, Linux does have appropriate abstraction
> > > > > for that, but apparently something related to i915 doesn't play well
> > > > > with it. The specific difference is:
> > > > > native linux:
> > > > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > > > xen pv:
> > > > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > > > >                                   ~~          ~~      ~~  ~~
> > > > >
> > > > > The specific impact depends on kernel version and the hardware. The most
> > > > > severe issues I see on >=ADL, but some older hardware is affected too -
> > > > > sometimes only if composition is disabled in the window manager.
> > > > > Some more information is collected at
> > > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and few linked
> > > > > duplicates...).
> > > > >
> > > > > Kind-of related commit is here:
> > > > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > > > there too.
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > > > Linux:
> > > > > -----8<-----
> > > > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > > > index 66a209f7eb86..319ab60c8d8c 100644
> > > > > --- a/arch/x86/mm/pat/memtype.c
> > > > > +++ b/arch/x86/mm/pat/memtype.c
> > > > > @@ -400,8 +400,8 @@ void pat_init(void)
> > > > >  		 * The reserved slots are unused, but mapped to their
> > > > >  		 * corresponding types in the presence of PAT errata.
> > > > >  		 */
> > > > > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > > > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
> > > > >  	}
> > > > >  
> > > > >  	if (!pat_bp_initialized) {
> > > > > -----8<-----
> > > > >
> > > > 
> > > > Hello, can anyone help please?
> > > > 
> > > > Intel's CI has taken this reproducer of the bug, and confirmed the
> > > > regression. 
> > > > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > > > 
> > > > We're reasonably confident that it is an i915 bug (given the repro with
> > > > no Xen in the mix), but we're out of any further ideas.
> > > 
> > > I don't think we have any code that assumes anything about the PAT,
> > > apart from WC being available (which seems like it should still be
> > > the case with your modified PAT). I suppose you'll just have to 
> > > start digging from pgprot_writecombine()/noncached() and make sure
> > > everything ends up using the correct PAT entry.
> > 
> > I tried several approaches to this, without success. Here is an update on
> > the debugging (also reported live on #intel-gfx):
> > 
> > I ran several tests with different PAT configurations (by modifying the
> > Xen code that sets the MSR). The full table is at https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
> > Some highlights:
> > - 1=WC, 4=WT - good
> > - 1=WT, 4=WC - bad
> > - 1=WT, 3=WC (4=WC too) - good
> > - 1=WT, 5=WC - good
> > 
> > So it seems to me that WC at index 4 is problematic for some reason.
> > 
> > Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
> > write PTEs and verify the requested cache attributes. There, all the
> > requested WC mappings appear to be translated properly (using index 1,
> > 3, 4, or 5 according to the PAT settings), and reading the PTE back
> > shows it is indeed set correctly. I didn't add a read-back after
> > HYPERVISOR_update_va_mapping, but I verified it isn't used for setting WC.
> > 
> > Using the same method, I also checked that indexes that aren't supposed
> > to be used (for example index 4 when both 3 and 4 are WC) indeed are not
> > used. So, the hypothesis that specific indexes are hardcoded somewhere
> > is unlikely.
> > 
> > This all looks very weird to me. Any ideas?
> 
> Old CPUs have had hardware errata that caused the top bit of the PAT
> index to be ignored in certain cases.  Could modern CPUs be ignoring
> this bit when accessing iGPU memory or registers?  With WC at position
> 4, this would cause WC to be treated as WB, which is consistent with the
> observed behavior.  WC at position 3 would not be impacted, and WC at
> position 5 would be treated as WT (the type at position 1 in that test),
> which I expect to be safe.  One way to check this is to test 1=WB, 5=WC.
> If my hypothesis is correct, this should trigger the bug, even though
> entry 1 in the PAT is unused because entry 0 is also WB.

This looks very plausible: 1=WB, 5=WC does indeed trigger the bug!
Specifically, this layout (entries 0-7):

    WB	WB	UC-	UC	WP	WC	WT	UC
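
For anyone wanting to reproduce this, the layout has to be packed into
the IA32_PAT MSR (0x277), one byte per entry with entry 0 in the lowest
byte. A minimal sketch of that packing follows; pat_msr_value() is a
name made up for this example and is not the Xen or Linux code that
programs the MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Memory types, numbered with the architectural IA32_PAT encodings. */
enum pat_type {
	PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
	PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Hypothetical helper: pack eight PAT entries into an IA32_PAT value. */
static uint64_t pat_msr_value(const enum pat_type layout[8])
{
	uint64_t msr = 0;

	for (int i = 0; i < 8; i++) {
		/* Each entry uses only the low 3 bits of its byte. */
		assert((layout[i] & ~7) == 0);
		msr |= (uint64_t)layout[i] << (8 * i);
	}
	return msr;
}
```

For the layout above this gives 0x0004010500070606; the native Linux
and Xen PV layouts quoted earlier in the thread give 0x0407050600070106
and 0x0000050100070406 respectively.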

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
  2023-01-02  1:00           ` Marek Marczykowski-Górecki
@ 2023-01-02  1:17             ` Demi Marie Obenour
  -1 siblings, 0 replies; 18+ messages in thread
From: Demi Marie Obenour @ 2023-01-02  1:17 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Ville Syrjälä,
	Andrew Cooper, intel-gfx, the arch/x86 maintainers,
	Lucas De Marchi, Daniel Vetter, Rodrigo Vivi, xen-devel

On Mon, Jan 02, 2023 at 02:00:51AM +0100, Marek Marczykowski-Górecki wrote:
> On Sun, Jan 01, 2023 at 07:03:18PM -0500, Demi Marie Obenour wrote:
> > On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> > > > On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote:
> > > > > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > > > > Hi,
> > > > > >
> > > > > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > > > > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > > > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > > > > with "modesetting" driver.
> > > > > >
> > > > > > After some iterations of debugging, we narrowed it down to i915 handling
> > > > > > caching. The main difference is that PAT is setup differently on Xen PV
> > > > > > than on native Linux. Normally, Linux does have appropriate abstraction
> > > > > > for that, but apparently something related to i915 doesn't play well
> > > > > > with it. The specific difference is:
> > > > > > native linux:
> > > > > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > > > > xen pv:
> > > > > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > > > > >                                   ~~          ~~      ~~  ~~
> > > > > >
> > > > > > The specific impact depends on kernel version and the hardware. The most
> > > > > > severe issues I see on >=ADL, but some older hardware is affected too -
> > > > > > sometimes only if composition is disabled in the window manager.
> > > > > > Some more information is collected at
> > > > > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > > > > > duplicates...).
> > > > > >
> > > > > > Kind-of related commit is here:
> > > > > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > > > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > > > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > > > > there too.
> > > > > >
> > > > > > Any ideas?
> > > > > >
> > > > > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > > > > Linux:
> > > > > > -----8<-----
> > > > > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > > > > index 66a209f7eb86..319ab60c8d8c 100644
> > > > > > --- a/arch/x86/mm/pat/memtype.c
> > > > > > +++ b/arch/x86/mm/pat/memtype.c
> > > > > > @@ -400,8 +400,8 @@ void pat_init(void)
> > > > > >  		 * The reserved slots are unused, but mapped to their
> > > > > >  		 * corresponding types in the presence of PAT errata.
> > > > > >  		 */
> > > > > > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > > > > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
> > > > > >  	}
> > > > > >  
> > > > > >  	if (!pat_bp_initialized) {
> > > > > > -----8<-----
> > > > > >
> > > > > 
> > > > > Hello, can anyone help please?
> > > > > 
> > > > > Intel's CI has taken this reproducer of the bug, and confirmed the
> > > > > regression. 
> > > > > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > > > > 
> > > > > We're reasonably confident that it is an i915 bug (given the repro with
> > > > > no Xen in the mix), but we're out of any further ideas.
> > > > 
> > > > I don't think we have any code that assumes anything about the PAT,
> > > > apart from WC being available (which seems like it should still be
> > > > the case with your modified PAT). I suppose you'll just have to 
> > > > start digging from pgprot_writecombine()/noncached() and make sure
> > > > everything ends up using the correct PAT entry.
> > > 
> > > > I tried several approaches to this, without success. Here is an update on
> > > debugging (reported also on #intel-gfx live):
> > > 
> > > I did several tests with different PAT configuration (by modifying Xen
> > > that sets the MSR). Full table is at https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
> > > Some highlights:
> > > - 1=WC, 4=WT - good
> > > - 1=WT, 4=WC - bad
> > > - 1=WT, 3=WC (4=WC too) - good
> > > - 1=WT, 5=WC - good
> > > 
> > > So, for me it seems WC at index 4 is problematic for some reason.
> > > 
> > > Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
> > > write PTEs and verify requested cache attributes. There, it seems all
> > > the requested WC are properly translated (using either index 1, 3, 4, or
> > > 5 according to PAT settings). And then after reading PTE back, it indeed
> > > > seems to be correctly set. I didn't add reading back after
> > > HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.
> > > 
> > > Using the same method, I also checked that indexes that aren't supposed
> > > to be used (for example index 4 when both 3 and 4 are WC) indeed are not
> > > used. So, the hypothesis that specific indexes are hardcoded somewhere
> > > is unlikely.
> > > 
> > > This all looks very weird to me. Any ideas?
> > 
> > Old CPUs have had hardware errata that caused the top bit of the PAT
> > entry to be ignored in certain cases.  Could modern CPUs be ignoring
> > this bit when accessing iGPU memory or registers?  With WC at position
> > 4, this would cause WC to be treated as WB, which is consistent with the
> > observed behavior.  WC at position 3 would not be impacted, and WC at
> > position 5 would be treated as WT which I expect to be safe.  One way to
> > test this is to test 1=WB, 5=WC.  If my hypothesis is correct, this
> > should trigger the bug, even if entry 1 in the PAT is unused because
> > entry 0 is also WB.
> 
> This looks like a very probable situation, indeed 1=WB, 5=WC does
> trigger the bug! Specifically this layout:
> 
>     WB	WB	UC-	UC	WP	WC	WT	UC

What about these two layouts?

    WB  WT  WB  UC  WB  WP  WC  UC-
    WB  WT  WT  UC  WB  WP  WC  UC-

They differ only in entry 2, which will not be used because it duplicates
entry 0 (in the first layout) or entry 1 (in the second).  Architecturally,
they should therefore behave identically.  If my hypothesis is correct, the
second will work fine, but the first will trigger the bug.
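
To make the prediction concrete, here is a rough sketch in C of the
hypothesis (not of anything the hardware is known to do).  It assumes that
a WC mapping uses the lowest PAT index that holds WC, and that the top bit
of the 3-bit index is dropped, so index i behaves like index i & 3.  All
names here (mem_type, lowest_index, effective_wc, ...) are made up for
illustration and do not come from the kernel.

```c
#include <assert.h>

/* Memory types as named in the "x86/PAT: Configuration" boot message. */
enum mem_type { WB, WC, UC_MINUS, UC, WP, WT };

/*
 * Lowest PAT index holding the wanted type, or -1 if it is absent.
 * Assumption: the lowest matching index is the one that gets used.
 */
static int lowest_index(const enum mem_type pat[8], enum mem_type want)
{
    for (int i = 0; i < 8; i++)
        if (pat[i] == want)
            return i;
    return -1;
}

/*
 * Hypothesis only: the top bit of the index is ignored, so the type
 * actually applied is the one stored at (index & 3).
 */
static enum mem_type type_if_top_bit_ignored(const enum mem_type pat[8],
                                             int index)
{
    assert(index >= 0 && index < 8);
    return pat[index & 3];
}

/* Type a WC mapping would end up with under the hypothesis. */
static enum mem_type effective_wc(const enum mem_type pat[8])
{
    return type_if_top_bit_ignored(pat, lowest_index(pat, WC));
}
```

For the Xen PV layout this gives WB (WC sits at index 4, which aliases
index 0).  For the two layouts above, WC sits at index 6, which aliases
index 2, so the first would yield WB and the second WT.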
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
  2023-01-02  1:17             ` Demi Marie Obenour
@ 2023-01-02  1:48               ` Demi Marie Obenour
  -1 siblings, 0 replies; 18+ messages in thread
From: Demi Marie Obenour @ 2023-01-02  1:48 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Ville Syrjälä,
	Andrew Cooper, intel-gfx, the arch/x86 maintainers,
	Lucas De Marchi, Daniel Vetter, Rodrigo Vivi, xen-devel

On Sun, Jan 01, 2023 at 08:17:52PM -0500, Demi Marie Obenour wrote:
> On Mon, Jan 02, 2023 at 02:00:51AM +0100, Marek Marczykowski-Górecki wrote:
> > On Sun, Jan 01, 2023 at 07:03:18PM -0500, Demi Marie Obenour wrote:
> > > On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> > > > > On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote:
> > > > > > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > > > > > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > > > > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > > > > > with "modesetting" driver.
> > > > > > >
> > > > > > > After some iterations of debugging, we narrowed it down to i915 handling
> > > > > > > > caching. The main difference is that PAT is set up differently on Xen PV
> > > > > > > than on native Linux. Normally, Linux does have appropriate abstraction
> > > > > > > for that, but apparently something related to i915 doesn't play well
> > > > > > > with it. The specific difference is:
> > > > > > > native linux:
> > > > > > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > > > > > xen pv:
> > > > > > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > > > > > >                                   ~~          ~~      ~~  ~~
> > > > > > >
> > > > > > > The specific impact depends on kernel version and the hardware. The most
> > > > > > > severe issues I see on >=ADL, but some older hardware is affected too -
> > > > > > > sometimes only if composition is disabled in the window manager.
> > > > > > > Some more information is collected at
> > > > > > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > > > > > > duplicates...).
> > > > > > >
> > > > > > > Kind-of related commit is here:
> > > > > > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > > > > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > > > > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > > > > > there too.
> > > > > > >
> > > > > > > Any ideas?
> > > > > > >
> > > > > > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > > > > > Linux:
> > > > > > > -----8<-----
> > > > > > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > > > > > index 66a209f7eb86..319ab60c8d8c 100644
> > > > > > > --- a/arch/x86/mm/pat/memtype.c
> > > > > > > +++ b/arch/x86/mm/pat/memtype.c
> > > > > > > @@ -400,8 +400,8 @@ void pat_init(void)
> > > > > > >  		 * The reserved slots are unused, but mapped to their
> > > > > > >  		 * corresponding types in the presence of PAT errata.
> > > > > > >  		 */
> > > > > > > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > > > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > > > > > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > > > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	if (!pat_bp_initialized) {
> > > > > > > -----8<-----
> > > > > > >
> > > > > > 
> > > > > > Hello, can anyone help please?
> > > > > > 
> > > > > > Intel's CI has taken this reproducer of the bug, and confirmed the
> > > > > > regression. 
> > > > > > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > > > > > 
> > > > > > We're reasonably confident that it is an i915 bug (given the repro with
> > > > > > no Xen in the mix), but we're out of any further ideas.
> > > > > 
> > > > > I don't think we have any code that assumes anything about the PAT,
> > > > > apart from WC being available (which seems like it should still be
> > > > > the case with your modified PAT). I suppose you'll just have to 
> > > > > start digging from pgprot_writecombine()/noncached() and make sure
> > > > > everything ends up using the correct PAT entry.
> > > > 
> > > > > I tried several approaches to this, without success. Here is an update on
> > > > debugging (reported also on #intel-gfx live):
> > > > 
> > > > I did several tests with different PAT configuration (by modifying Xen
> > > > that sets the MSR). Full table is at https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
> > > > Some highlights:
> > > > - 1=WC, 4=WT - good
> > > > - 1=WT, 4=WC - bad
> > > > - 1=WT, 3=WC (4=WC too) - good
> > > > - 1=WT, 5=WC - good
> > > > 
> > > > So, for me it seems WC at index 4 is problematic for some reason.
> > > > 
> > > > Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
> > > > write PTEs and verify requested cache attributes. There, it seems all
> > > > the requested WC are properly translated (using either index 1, 3, 4, or
> > > > 5 according to PAT settings). And then after reading PTE back, it indeed
> > > > > seems to be correctly set. I didn't add reading back after
> > > > HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.
> > > > 
> > > > Using the same method, I also checked that indexes that aren't supposed
> > > > to be used (for example index 4 when both 3 and 4 are WC) indeed are not
> > > > used. So, the hypothesis that specific indexes are hardcoded somewhere
> > > > is unlikely.
> > > > 
> > > > This all looks very weird to me. Any ideas?
> > > 
> > > Old CPUs have had hardware errata that caused the top bit of the PAT
> > > entry to be ignored in certain cases.  Could modern CPUs be ignoring
> > > this bit when accessing iGPU memory or registers?  With WC at position
> > > 4, this would cause WC to be treated as WB, which is consistent with the
> > > observed behavior.  WC at position 3 would not be impacted, and WC at
> > > position 5 would be treated as WT which I expect to be safe.  One way to
> > > test this is to test 1=WB, 5=WC.  If my hypothesis is correct, this
> > > should trigger the bug, even if entry 1 in the PAT is unused because
> > > entry 0 is also WB.
> > 
> > This looks like a very probable situation, indeed 1=WB, 5=WC does
> > trigger the bug! Specifically this layout:
> > 
> >     WB	WB	UC-	UC	WP	WC	WT	UC
> 
> What about WB WT WB UC WB WP WC UC- and WB WT WT UC WB WP WC UC-?  Those
> only differ in entry 2, which will not be used as it duplicates entry 0
> or 1.  Therefore, architecturally, these should behave identically.  If
> I am correct, the second will work fine, but the first will trigger the
> bug.

Also worth testing:

WB  UC- UC  WB  WB  WP  WT  WC
WB  UC- UC  UC  WB  WP  WT  WC

These differ only in (unused) entry 3.
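
For reference, a small sketch in C of how a 4 KiB PTE selects a PAT entry,
and which entry would be selected if the PAT bit were dropped.  The bit
positions (PWT = bit 3, PCD = bit 4, PAT = bit 7) are the architectural
ones for 4 KiB pages; the macro and function names are made up for
illustration, and the "dropped bit" part is only the hypothesis under test.

```c
#include <assert.h>
#include <stdint.h>

/* Cache-control bits of a 4 KiB x86 page-table entry. */
#define PTE_PWT (UINT64_C(1) << 3)
#define PTE_PCD (UINT64_C(1) << 4)
#define PTE_PAT (UINT64_C(1) << 7)

/* PTE bits that select PAT entry `index` (0..7). */
static uint64_t pte_bits_for_index(unsigned index)
{
    assert(index < 8);
    return ((index & 1) ? PTE_PWT : 0) |
           ((index & 2) ? PTE_PCD : 0) |
           ((index & 4) ? PTE_PAT : 0);
}

/* PAT entry selected by a PTE's cache-control bits. */
static unsigned index_from_pte(uint64_t pte)
{
    return ((pte & PTE_PWT) ? 1u : 0u) |
           ((pte & PTE_PCD) ? 2u : 0u) |
           ((pte & PTE_PAT) ? 4u : 0u);
}

/* Hypothesis only: some access path ignores the PAT bit of the PTE. */
static unsigned index_if_pat_bit_dropped(uint64_t pte)
{
    return index_from_pte(pte & ~PTE_PAT);
}
```

With WC at index 7, as in both layouts above, dropping the PAT bit would
select entry 3, which holds WB in the first layout and UC in the second.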
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


* Re: [Intel-gfx] [cache coherency bug] [hw bug?] i915 and PAT attributes
  2023-01-02  1:48               ` Demi Marie Obenour
@ 2023-01-02  1:58                 ` Marek Marczykowski-Górecki
  -1 siblings, 0 replies; 18+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-01-02  1:58 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Andrew Cooper, intel-gfx, the arch/x86 maintainers,
	Lucas De Marchi, Daniel Vetter, Rodrigo Vivi, xen-devel

On Sun, Jan 01, 2023 at 08:48:13PM -0500, Demi Marie Obenour wrote:
> On Sun, Jan 01, 2023 at 08:17:52PM -0500, Demi Marie Obenour wrote:
> > On Mon, Jan 02, 2023 at 02:00:51AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Sun, Jan 01, 2023 at 07:03:18PM -0500, Demi Marie Obenour wrote:
> > > > On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> > > > > > On Fri, Dec 16, 2022 at 03:30:13PM +0000, Andrew Cooper wrote:
> > > > > > > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > > > > > > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > > > > > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > > > > > > with "modesetting" driver.
> > > > > > > >
> > > > > > > > After some iterations of debugging, we narrowed it down to i915 handling
> > > > > > > > > caching. The main difference is that PAT is set up differently on Xen PV
> > > > > > > > than on native Linux. Normally, Linux does have appropriate abstraction
> > > > > > > > for that, but apparently something related to i915 doesn't play well
> > > > > > > > with it. The specific difference is:
> > > > > > > > native linux:
> > > > > > > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > > > > > > xen pv:
> > > > > > > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > > > > > > >                                   ~~          ~~      ~~  ~~
> > > > > > > >
> > > > > > > > The specific impact depends on kernel version and the hardware. The most
> > > > > > > > severe issues I see on >=ADL, but some older hardware is affected too -
> > > > > > > > sometimes only if composition is disabled in the window manager.
> > > > > > > > Some more information is collected at
> > > > > > > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > > > > > > > duplicates...).
> > > > > > > >
> > > > > > > > Kind-of related commit is here:
> > > > > > > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > > > > > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > > > > > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > > > > > > there too.
> > > > > > > >
> > > > > > > > Any ideas?
> > > > > > > >
> > > > > > > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > > > > > > Linux:
> > > > > > > > -----8<-----
> > > > > > > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > > > > > > index 66a209f7eb86..319ab60c8d8c 100644
> > > > > > > > --- a/arch/x86/mm/pat/memtype.c
> > > > > > > > +++ b/arch/x86/mm/pat/memtype.c
> > > > > > > > @@ -400,8 +400,8 @@ void pat_init(void)
> > > > > > > >  		 * The reserved slots are unused, but mapped to their
> > > > > > > >  		 * corresponding types in the presence of PAT errata.
> > > > > > > >  		 */
> > > > > > > > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > > > > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > > > > > > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > > > > > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)       | PAT(7, UC);
> > > > > > > >  	}
> > > > > > > >  
> > > > > > > >  	if (!pat_bp_initialized) {
> > > > > > > > -----8<-----
> > > > > > > >
> > > > > > > 
> > > > > > > Hello, can anyone help please?
> > > > > > > 
> > > > > > > Intel's CI has taken this reproducer of the bug, and confirmed the
> > > > > > > regression. 
> > > > > > > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > > > > > > 
> > > > > > > We're reasonably confident that it is an i915 bug (given the repro with
> > > > > > > no Xen in the mix), but we're out of any further ideas.
> > > > > > 
> > > > > > I don't think we have any code that assumes anything about the PAT,
> > > > > > apart from WC being available (which seems like it should still be
> > > > > > the case with your modified PAT). I suppose you'll just have to 
> > > > > > start digging from pgprot_writecombine()/noncached() and make sure
> > > > > > everything ends up using the correct PAT entry.
> > > > > 
> > > > > > I tried several approaches to this, without success. Here is an update on
> > > > > debugging (reported also on #intel-gfx live):
> > > > > 
> > > > > > I did several tests with different PAT configurations (by modifying the Xen
> > > > > > code that sets the MSR). Full table is at https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
> > > > > Some highlights:
> > > > > - 1=WC, 4=WT - good
> > > > > - 1=WT, 4=WC - bad
> > > > > - 1=WT, 3=WC (4=WC too) - good
> > > > > - 1=WT, 5=WC - good
> > > > > 
> > > > > > So, to me it seems WC at index 4 is problematic for some reason.
> > > > > 
> > > > > Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
> > > > > write PTEs and verify requested cache attributes. There, it seems all
> > > > > the requested WC are properly translated (using either index 1, 3, 4, or
> > > > > 5 according to PAT settings). And then after reading PTE back, it indeed
> > > > > > seems to be correctly set. I didn't add a read-back after
> > > > > HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.
> > > > > 
> > > > > Using the same method, I also checked that indexes that aren't supposed
> > > > > to be used (for example index 4 when both 3 and 4 are WC) indeed are not
> > > > > used. So, the hypothesis that specific indexes are hardcoded somewhere
> > > > > is unlikely.
> > > > > 
> > > > > This all looks very weird to me. Any ideas?
> > > > 
> > > > Old CPUs have had hardware errata that caused the top bit of the PAT
> > > > entry to be ignored in certain cases.  Could modern CPUs be ignoring
> > > > this bit when accessing iGPU memory or registers?  With WC at position
> > > > 4, this would cause WC to be treated as WB, which is consistent with the
> > > > observed behavior.  WC at position 3 would not be impacted, and WC at
> > > > position 5 would be treated as WT which I expect to be safe.  One way to
> > > > test this is to test 1=WB, 5=WC.  If my hypothesis is correct, this
> > > > should trigger the bug, even if entry 1 in the PAT is unused because
> > > > entry 0 is also WB.
> > > 
> > > This looks very plausible: indeed, 1=WB, 5=WC does
> > > trigger the bug! Specifically this layout:
> > > 
> > >     WB	WB	UC-	UC	WP	WC	WT	UC
> > 
> > What about WB WT WB UC WB WP WC UC- and WB WT WT UC WB WP WC UC-?  Those
> > only differ in entry 2, which will not be used as it duplicates entry 0
> > or 1.  Therefore, architecturally, these should behave identically.  If
> > I am correct, the second will work fine, but the first will trigger the
> > bug.

Bingo! This also behaves as predicted.

So, it indeed looks like the _PAGE_PAT bit is ignored by the hardware,
even though it is set in the relevant PTEs.
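
The model these results point to can be written down as a minimal sketch.
It is only a model of the hypothesis, not confirmed hardware behavior. It
assumes that the lowest PAT index holding a type is the one used (as
assumed earlier in this thread) and that the top bit of the 3-bit PAT
index is dropped. The names memtype, lowest_index() and effective_type()
are made up for illustration and are not kernel symbols.

```c
#include <assert.h>

/* Memory types that can appear in a PAT entry (illustrative enum). */
enum memtype { UC, UC_MINUS, WC, WT, WP, WB };

#define PAT_ENTRIES 8

/*
 * Lowest PAT index that holds `type`, or -1 if the type is absent.
 * Assumption: when a type appears more than once, the lowest index
 * is the one that ends up in the PTEs.
 */
int lowest_index(const enum memtype pat[PAT_ENTRIES], enum memtype type)
{
	for (int i = 0; i < PAT_ENTRIES; i++)
		if (pat[i] == type)
			return i;
	return -1;
}

/*
 * Memory type that would actually be applied if the top bit of the
 * PAT index (the _PAGE_PAT bit) were ignored: index i acts as i & 3.
 */
enum memtype effective_type(const enum memtype pat[PAT_ENTRIES],
			    enum memtype requested)
{
	int idx = lowest_index(pat, requested);

	assert(idx >= 0);	/* requested type must exist in the PAT */
	return pat[idx & 3];
}
```

Under this model, the Xen PV layout (WC at index 4) gives
effective_type(pat, WC) == WB, while the native layout (WC at index 1)
gives WC. The other layouts tested in this thread can be checked the same
way.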

> Also worth testing:
> 
> WB  UC- UC  WB  WB  WP  WT  WC
> WB  UC- UC  UC  WB  WP  WT  WC
> 
> These differ only in (unused) entry 3.

I'll skip this, as I think it's pretty clear what the result will be.
But if somebody else thinks it's worth testing anyway, let me know.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


end of thread, other threads:[~2023-01-03 11:59 UTC | newest]

Thread overview: 18+ messages
2022-12-08 13:55 i915 and PAT attributes on Xen PV Marek Marczykowski-Górecki
2022-12-08 13:55 ` [Intel-gfx] " Marek Marczykowski-Górecki
2022-12-08 16:24 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2022-12-08 16:51 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-12-16 15:30 ` [cache coherency bug] i915 and PAT attributes Andrew Cooper
2022-12-16 15:30   ` [Intel-gfx] " Andrew Cooper
2022-12-22  8:29   ` Ville Syrjälä
2022-12-22  8:29     ` Ville Syrjälä
2023-01-01 23:24     ` Marek Marczykowski-Górecki
2023-01-02  0:03       ` Demi Marie Obenour
2023-01-02  1:00         ` Marek Marczykowski-Górecki
2023-01-02  1:00           ` Marek Marczykowski-Górecki
2023-01-02  1:17           ` Demi Marie Obenour
2023-01-02  1:17             ` Demi Marie Obenour
2023-01-02  1:48             ` Demi Marie Obenour
2023-01-02  1:48               ` Demi Marie Obenour
2023-01-02  1:58               ` [Intel-gfx] [cache coherency bug] [hw bug?] " Marek Marczykowski-Górecki
2023-01-02  1:58                 ` Marek Marczykowski-Górecki
