dri-devel.lists.freedesktop.org archive mirror
* [PATCH v3 1/2] drm/i915: document caching related bits
@ 2021-07-22 11:34 Matthew Auld
  2021-07-22 11:34 ` [PATCH v3 2/2] drm/i915/ehl: unconditionally flush the pages on acquire Matthew Auld
  2021-07-22 11:54 ` [Intel-gfx] [PATCH v3 1/2] drm/i915: document caching related bits Daniel Vetter
  0 siblings, 2 replies; 4+ messages in thread
From: Matthew Auld @ 2021-07-22 11:34 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, dri-devel, Mika Kuoppala

Try to document the object caching related bits, like cache_coherent and
cache_dirty.

v2(Ville):
 - As pointed out by Ville, fix the completely incorrect assumptions
   about the "partial" coherency on shared LLC platforms.
v3(Daniel):
 - Fix nonsense about "dirtying" the cache with reads.

Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 .../gpu/drm/i915/gem/i915_gem_object_types.h  | 176 +++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h               |   9 -
 2 files changed, 172 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
index afbadfc5516b..40cce816a7e3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -92,6 +92,76 @@ struct drm_i915_gem_object_ops {
 	const char *name; /* friendly name for debug, e.g. lockdep classes */
 };
 
+/**
+ * enum i915_cache_level - The supported GTT caching values for system memory
+ * pages.
+ *
+ * These translate to some special GTT PTE bits when binding pages into some
+ * address space. It also determines whether an object, or rather its pages are
+ * coherent with the GPU, when also reading or writing through the CPU cache
+ * with those pages.
+ *
+ * Userspace can also control this through struct drm_i915_gem_caching.
+ */
+enum i915_cache_level {
+	/**
+	 * @I915_CACHE_NONE:
+	 *
+	 * Not coherent with the CPU cache. If the cache is dirty and we need
+	 * the underlying pages to be coherent with some later GPU access then
+	 * we need to manually flush the pages.
+	 *
+	 * Note that on shared LLC platforms reads and writes through the CPU
+	 * cache are still coherent even with this setting. See also
+	 * &drm_i915_gem_object.cache_coherent for more details.
+	 *
+	 * Note that on platforms with a shared LLC this should ideally only be
+	 * used for scanout surfaces, otherwise we end up over-flushing in some
+	 * places.
+	 */
+	I915_CACHE_NONE = 0,
+	/**
+	 * @I915_CACHE_LLC:
+	 *
+	 * Coherent with the CPU cache. If the cache is dirty, then the GPU will
+	 * ensure that access remains coherent, when both reading and writing
+	 * through the CPU cache.
+	 *
+	 * Not used for scanout surfaces.
+	 *
+	 * Applies to both platforms with shared LLC(HAS_LLC), and snooping
+	 * based platforms(HAS_SNOOP).
+	 *
+	 * This should be the default for platforms which share the LLC with the
+	 * CPU. The only exception is scanout objects, where the display engine
+	 * is not coherent with the LLC. For such objects I915_CACHE_NONE or
+	 * I915_CACHE_WT should be used.
+	 */
+	I915_CACHE_LLC,
+	/**
+	 * @I915_CACHE_L3_LLC:
+	 *
+	 * Explicitly enable the Gfx L3 cache, with snooped LLC.
+	 *
+	 * The Gfx L3 sits between the domain specific caches, e.g
+	 * sampler/render caches, and the larger LLC. LLC is coherent with the
+	 * GPU, but L3 is only visible to the GPU, so likely needs to be flushed
+	 * when the workload completes.
+	 *
+	 * Not used for scanout surfaces.
+	 *
+	 * Only exposed on some gen7 + GGTT. More recent hardware has dropped
+	 * this.
+	 */
+	I915_CACHE_L3_LLC,
+	/**
+	 * @I915_CACHE_WT:
+	 *
+	 * hsw:gt3e Write-through for scanout buffers.
+	 */
+	I915_CACHE_WT,
+};
+
 enum i915_map_type {
 	I915_MAP_WB = 0,
 	I915_MAP_WC,
@@ -229,14 +299,112 @@ struct drm_i915_gem_object {
 	unsigned int mem_flags;
 #define I915_BO_FLAG_STRUCT_PAGE BIT(0) /* Object backed by struct pages */
 #define I915_BO_FLAG_IOMEM       BIT(1) /* Object backed by IO memory */
-	/*
-	 * Is the object to be mapped as read-only to the GPU
-	 * Only honoured if hardware has relevant pte bit
+	/**
+	 * @cache_level: The desired GTT caching level.
+	 *
+	 * See enum i915_cache_level for possible values, along with what
+	 * each does.
 	 */
 	unsigned int cache_level:3;
-	unsigned int cache_coherent:2;
+	/**
+	 * @cache_coherent:
+	 *
+	 * Track whether the pages are coherent with the GPU if reading or
+	 * writing through the CPU caches. This largely depends on the
+	 * @cache_level setting.
+	 *
+	 * On platforms which don't have the shared LLC(HAS_SNOOP), like on Atom
+	 * platforms, coherency must be explicitly requested with some special
+	 * GTT caching bits(see enum i915_cache_level). When enabling coherency
+	 * it does come at a performance and power cost on such platforms. On
+	 * the flip side the kernel does need to manually flush any buffers
+	 * which need to be coherent with the GPU, if the object is not
+	 * coherent i.e @cache_coherent is zero.
+	 *
+	 * On platforms that share the LLC with the CPU(HAS_LLC), all GT memory
+	 * access will automatically snoop the CPU caches(even with CACHE_NONE).
+	 * The one exception is when dealing with the display engine, like with
+	 * scanout surfaces. To handle this the kernel will always flush the
+	 * surface out of the CPU caches when preparing it for scanout.  Also
+	 * note that since scanout surfaces are only ever read by the display
+	 * engine we only need to care about flushing any writes through the CPU
+	 * cache, reads on the other hand will always be coherent.
+	 *
+	 * Something strange here is why @cache_coherent is not a simple
+	 * boolean, i.e coherent vs non-coherent. The reasoning for this is back
+	 * to the display engine not being fully coherent. As a result scanout
+	 * surfaces will either be marked as I915_CACHE_NONE or I915_CACHE_WT.
+	 * In the case of seeing I915_CACHE_NONE the kernel makes the assumption
+	 * that this is likely a scanout surface, and will set @cache_coherent
+	 * as only I915_BO_CACHE_COHERENT_FOR_READ, on platforms with the shared
+	 * LLC. The kernel uses this to always flush writes through the CPU
+	 * cache as early as possible, where it can, in effect keeping
+	 * @cache_dirty clean, so we can potentially avoid stalling when
+	 * flushing the surface just before doing the scanout.  This does mean
+	 * we might unnecessarily flush non-scanout objects in some places, but
+	 * the default assumption is that all normal objects should be using
+	 * I915_CACHE_LLC, at least on platforms with the shared LLC.
+	 *
+	 * Supported values:
+	 *
+	 * I915_BO_CACHE_COHERENT_FOR_READ:
+	 *
+	 * On shared LLC platforms, we use this for special scanout surfaces,
+	 * where the display engine is not coherent with the CPU cache. As such
+	 * we need to ensure we flush any writes before doing the scanout. As an
+	 * optimisation we try to flush any writes as early as possible to avoid
+	 * stalling later.
+	 *
+	 * Thus for scanout surfaces using I915_CACHE_NONE, on shared LLC
+	 * platforms, we use:
+	 *
+	 *	cache_coherent = I915_BO_CACHE_COHERENT_FOR_READ
+	 *
+	 * While for normal objects that are fully coherent we use:
+	 *
+	 *	cache_coherent = I915_BO_CACHE_COHERENT_FOR_READ |
+	 *			 I915_BO_CACHE_COHERENT_FOR_WRITE
+	 *
+	 * And then for objects that are not coherent at all we use:
+	 *
+	 *	cache_coherent = 0
+	 *
+	 * I915_BO_CACHE_COHERENT_FOR_WRITE:
+	 *
+	 * When writing through the CPU cache, the GPU is still coherent. Note
+	 * that this also implies I915_BO_CACHE_COHERENT_FOR_READ.
+	 */
 #define I915_BO_CACHE_COHERENT_FOR_READ BIT(0)
 #define I915_BO_CACHE_COHERENT_FOR_WRITE BIT(1)
+	unsigned int cache_coherent:2;
+
+	/**
+	 * @cache_dirty:
+	 *
+	 * Track if we are dirty with writes through the CPU cache for this
+	 * object. As a result reading directly from main memory might yield
+	 * stale data.
+	 *
+	 * This also ties into whether the kernel is tracking the object as
+	 * coherent with the GPU, as per @cache_coherent, as it determines if
+	 * flushing might be needed at various points.
+	 *
+	 * Another part of @cache_dirty is managing flushing when first
+	 * acquiring the pages for system memory, at this point the pages are
+	 * considered foreign, so the default assumption is that the cache is
+	 * dirty, for example the page zeroing done by the kernel might leave
+	 * writes through the CPU cache, or swapping-in, while the actual data in
+	 * main memory is potentially stale.  Note that this is a potential
+	 * security issue when dealing with userspace objects and zeroing. Now,
+	 * whether we actually need to apply the big sledgehammer of flushing all
+	 * the pages on acquire depends on if @cache_coherent is marked as
+	 * I915_BO_CACHE_COHERENT_FOR_WRITE, i.e that the GPU will be coherent
+	 * for both reads and writes through the CPU cache.
+	 *
+	 * Note that on shared LLC platforms we still apply the heavy flush for
+	 * I915_CACHE_NONE objects, under the assumption that this is going to
+	 * be used for scanout.
+	 */
 	unsigned int cache_dirty:1;
 
 	/**
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 0321a1f9738d..f97792ccc199 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -394,15 +394,6 @@ struct drm_i915_display_funcs {
 	void (*read_luts)(struct intel_crtc_state *crtc_state);
 };
 
-enum i915_cache_level {
-	I915_CACHE_NONE = 0,
-	I915_CACHE_LLC, /* also used for snoopable memory on non-LLC */
-	I915_CACHE_L3_LLC, /* gen7+, L3 sits between the domain specifc
-			      caches, eg sampler/render caches, and the
-			      large Last-Level-Cache. LLC is coherent with
-			      the CPU, but L3 is only visible to the GPU. */
-	I915_CACHE_WT, /* hsw:gt3e WriteThrough for scanouts */
-};
 
 #define I915_COLOR_UNEVICTABLE (-1) /* a non-vma sharing the address space */
 
-- 
2.26.3
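
A minimal sketch of the coherency rules documented in the patch above,
written as a standalone userspace model rather than driver code. All
model_* names are hypothetical stand-ins. Treating every level other than
NONE as fully coherent, and starting with cache_dirty set whenever writes
are not coherent, are assumptions drawn from the comments in this patch,
not a copy of the i915 implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for enum i915_cache_level. */
enum model_cache_level {
	MODEL_CACHE_NONE = 0,
	MODEL_CACHE_LLC,
	MODEL_CACHE_L3_LLC,
	MODEL_CACHE_WT,
};

/* Hypothetical stand-ins for I915_BO_CACHE_COHERENT_FOR_{READ,WRITE}. */
#define MODEL_COHERENT_FOR_READ  (1u << 0)
#define MODEL_COHERENT_FOR_WRITE (1u << 1)

struct model_object {
	unsigned int cache_level;
	unsigned int cache_coherent;
	bool cache_dirty;
};

/*
 * Derive cache_coherent from the requested level and the platform type:
 *  - any level other than NONE: coherent for reads and writes (assumption)
 *  - NONE on a shared LLC platform: coherent for reads only, since the
 *    object is assumed to be a scanout surface
 *  - NONE without a shared LLC: not coherent at all
 */
static void model_set_cache_coherency(struct model_object *obj,
				      enum model_cache_level level,
				      bool has_llc)
{
	assert(level <= MODEL_CACHE_WT);

	obj->cache_level = level;

	if (level != MODEL_CACHE_NONE)
		obj->cache_coherent = MODEL_COHERENT_FOR_READ |
				      MODEL_COHERENT_FOR_WRITE;
	else if (has_llc)
		obj->cache_coherent = MODEL_COHERENT_FOR_READ;
	else
		obj->cache_coherent = 0;

	/* Without write coherency, assume the pages need a flush on acquire. */
	obj->cache_dirty = !(obj->cache_coherent & MODEL_COHERENT_FOR_WRITE);
}
```

In this model a CACHE_NONE object on a shared LLC platform ends up
read-coherent only and starts out dirty, which matches the "heavy flush
for I915_CACHE_NONE objects" note in the @cache_dirty comment.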



* [PATCH v3 2/2] drm/i915/ehl: unconditionally flush the pages on acquire
  2021-07-22 11:34 [PATCH v3 1/2] drm/i915: document caching related bits Matthew Auld
@ 2021-07-22 11:34 ` Matthew Auld
  2021-07-22 11:54 ` [Intel-gfx] [PATCH v3 1/2] drm/i915: document caching related bits Daniel Vetter
  1 sibling, 0 replies; 4+ messages in thread
From: Matthew Auld @ 2021-07-22 11:34 UTC (permalink / raw)
  To: intel-gfx
  Cc: Daniel Vetter, Lucas De Marchi, dri-devel, Jon Bloomfield,
	Chris Wilson, Francisco Jerez, Tejas Upadhyay

EHL and JSL add the 'Bypass LLC' MOCS entry, which should make it
possible for userspace to bypass the GTT caching bits set by the kernel,
as per the given object cache_level. This is troublesome since the heavy
flush we apply when first acquiring the pages is skipped if the kernel
thinks the object is coherent with the GPU. As a result it might be
possible to bypass the cache and read the contents of the page directly,
which could be stale data. If it's just a case of userspace shooting
themselves in the foot then so be it, but since i915 takes the stance of
always zeroing memory before handing it to userspace, we need to prevent
this.

v2: this time actually set cache_dirty in put_pages()
v3: move to get_pages() which looks simpler

BSpec: 34007
References: 046091758b50 ("Revert "drm/i915/ehl: Update MOCS table for EHL"")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Cc: Francisco Jerez <francisco.jerez.plata@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Cc: Chris Wilson <chris.p.wilson@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 .../gpu/drm/i915/gem/i915_gem_object_types.h   |  6 ++++++
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c      | 18 ++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
index 40cce816a7e3..f0948f6b1e1d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -404,6 +404,12 @@ struct drm_i915_gem_object {
 	 * Note that on shared LLC platforms we still apply the heavy flush for
 	 * I915_CACHE_NONE objects, under the assumption that this is going to
 	 * be used for scanout.
+	 *
+	 * Update: On some hardware there is now also the 'Bypass LLC' MOCS
+	 * entry, which defeats our @cache_coherent tracking, since userspace
+	 * can freely bypass the CPU cache when touching the pages with the GPU,
+	 * where the kernel is completely unaware. On such platforms we need to
+	 * apply the sledgehammer-on-acquire regardless of @cache_coherent.
 	 */
 	unsigned int cache_dirty:1;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 6a04cce188fc..11f072193f3b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -182,6 +182,24 @@ static int shmem_get_pages(struct drm_i915_gem_object *obj)
 	if (i915_gem_object_needs_bit17_swizzle(obj))
 		i915_gem_object_do_bit_17_swizzle(obj, st);
 
+	/*
+	 * EHL and JSL add the 'Bypass LLC' MOCS entry, which should make it
+	 * possible for userspace to bypass the GTT caching bits set by the
+	 * kernel, as per the given object cache_level. This is troublesome
+	 * since the heavy flush we apply when first gathering the pages is
+	 * skipped if the kernel thinks the object is coherent with the GPU. As
+	 * a result it might be possible to bypass the cache and read the
+	 * contents of the page directly, which could be stale data. If it's
+	 * just a case of userspace shooting themselves in the foot then so be
+	 * it, but since i915 takes the stance of always zeroing memory before
+	 * handing it to userspace, we need to prevent this.
+	 *
+	 * By setting cache_dirty here we make the clflush in set_pages
+	 * unconditional on such platforms.
+	 */
+	if (IS_JSL_EHL(i915) && obj->flags & I915_BO_ALLOC_USER)
+		obj->cache_dirty = true;
+
 	__i915_gem_object_set_pages(obj, st, sg_page_sizes);
 
 	return 0;
-- 
2.26.3
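
A minimal sketch of the acquire-time behaviour this patch is after, again
as a standalone userspace model. The model_* names and the flush counter
are hypothetical. The assumption that the set_pages step flushes once and
then clears cache_dirty is taken from the comment in the hunk above ("we
make the clflush in set_pages unconditional"), not from the driver source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for I915_BO_ALLOC_USER. */
#define MODEL_ALLOC_USER (1u << 0)

struct model_bo {
	unsigned int flags;
	bool cache_dirty;
	unsigned int flush_count; /* number of flushes performed so far */
};

/* Stand-in for the set_pages step: flush once if dirty, then mark clean. */
static void model_set_pages(struct model_bo *bo)
{
	if (bo->cache_dirty) {
		bo->flush_count++;
		bo->cache_dirty = false;
	}
}

/*
 * Stand-in for the get_pages step. On a platform with the 'Bypass LLC'
 * MOCS entry, userspace objects are marked dirty unconditionally, so the
 * flush in model_set_pages() always runs for them.
 */
static void model_get_pages(struct model_bo *bo, bool has_bypass_llc_mocs)
{
	assert(bo != NULL);

	if (has_bypass_llc_mocs && (bo->flags & MODEL_ALLOC_USER))
		bo->cache_dirty = true;

	model_set_pages(bo);
}
```

The point of the model is that an object the kernel considers coherent
(cache_dirty clear) still gets flushed once on acquire when it is a
userspace object on a 'Bypass LLC' platform, and is left alone otherwise.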



* Re: [Intel-gfx] [PATCH v3 1/2] drm/i915: document caching related bits
  2021-07-22 11:34 [PATCH v3 1/2] drm/i915: document caching related bits Matthew Auld
  2021-07-22 11:34 ` [PATCH v3 2/2] drm/i915/ehl: unconditionally flush the pages on acquire Matthew Auld
@ 2021-07-22 11:54 ` Daniel Vetter
  2021-07-23  8:58   ` Matthew Auld
  1 sibling, 1 reply; 4+ messages in thread
From: Daniel Vetter @ 2021-07-22 11:54 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Daniel Vetter, intel-gfx, dri-devel

On Thu, Jul 22, 2021 at 12:34:55PM +0100, Matthew Auld wrote:
> Try to document the object caching related bits, like cache_coherent and
> cache_dirty.
> 
> v2(Ville):
>  - As pointed out by Ville, fix the completely incorrect assumptions
>    about the "partial" coherency on shared LLC platforms.
> v3(Daniel):
>  - Fix nonsense about "dirtying" the cache with reads.
> 
> Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  .../gpu/drm/i915/gem/i915_gem_object_types.h  | 176 +++++++++++++++++-
>  drivers/gpu/drm/i915/i915_drv.h               |   9 -
>  2 files changed, 172 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> index afbadfc5516b..40cce816a7e3 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> @@ -92,6 +92,76 @@ struct drm_i915_gem_object_ops {
>  	const char *name; /* friendly name for debug, e.g. lockdep classes */
>  };
>  
> +/**
> + * enum i915_cache_level - The supported GTT caching values for system memory
> + * pages.
> + *
> + * These translate to some special GTT PTE bits when binding pages into some
> + * address space. It also determines whether an object, or rather its pages are
> + * coherent with the GPU, when also reading or writing through the CPU cache
> + * with those pages.
> + *
> + * Userspace can also control this through struct drm_i915_gem_caching.
> + */
> +enum i915_cache_level {
> +	/**
> +	 * @I915_CACHE_NONE:
> +	 *
> +	 * Not coherent with the CPU cache. If the cache is dirty and we need
> +	 * the underlying pages to be coherent with some later GPU access then
> +	 * we need to manually flush the pages.
> +	 *
> +	 * Note that on shared LLC platforms reads and writes through the CPU
> +	 * cache are still coherent even with this setting. See also
> +	 * &drm_i915_gem_object.cache_coherent for more details.
> +	 *
> +	 * Note that on platforms with a shared LLC this should ideally only be

Merge this with the previous note and maybe explain it with "Due to this
we should only use uncached for scanout surfaces on platforms with shared
LLC, otherwise ..."

As-is it reads a bit awkward/repetitive.

> +	 * used for scanout surfaces, otherwise we end up over-flushing in some
> +	 * places.

Maybe also note that on non-LLC platforms uncached is the default.

> +	 */
> +	I915_CACHE_NONE = 0,
> +	/**
> +	 * @I915_CACHE_LLC:
> +	 *
> +	 * Coherent with the CPU cache. If the cache is dirty, then the GPU will
> +	 * ensure that access remains coherent, when both reading and writing
> +	 * through the CPU cache.
> +	 *
> +	 * Not used for scanout surfaces.
> +	 *
> +	 * Applies to both platforms with shared LLC(HAS_LLC), and snooping
> +	 * based platforms(HAS_SNOOP).
> +	 *
> +	 * This should be the default for platforms which share the LLC with the
s/should/is/

After all it _is_ the default at object creation time.

> +	 * CPU. The only exception is scanout objects, where the display engine
> +	 * is not coherent with the LLC. For such objects I915_CACHE_NONE or
> +	 * I915_CACHE_WT should be used.

Maybe clarify that we automatically apply this transition upon
pin_for_display if userspace hasn't done it.

> +	 */
> +	I915_CACHE_LLC,
> +	/**
> +	 * @I915_CACHE_L3_LLC:
> +	 *
> +	 * Explicitly enable the Gfx L3 cache, with snooped LLC.
> +	 *
> +	 * The Gfx L3 sits between the domain specific caches, e.g
> +	 * sampler/render caches, and the larger LLC. LLC is coherent with the
> +	 * GPU, but L3 is only visible to the GPU, so likely needs to be flushed
> +	 * when the workload completes.
> +	 *
> +	 * Not used for scanout surfaces.
> +	 *
> +	 * Only exposed on some gen7 + GGTT. More recent hardware has dropped
> +	 * this.

I think it's also the default on these?

> +	 */
> +	I915_CACHE_L3_LLC,

> +	/**
> +	 * @I915_CACHE_WT:
> +	 *
> +	 * hsw:gt3e Write-through for scanout buffers.

I haven't checked, but are we using this automatically?

> +	 */
> +	I915_CACHE_WT,
> +};
> +
>  enum i915_map_type {
>  	I915_MAP_WB = 0,
>  	I915_MAP_WC,
> @@ -229,14 +299,112 @@ struct drm_i915_gem_object {
>  	unsigned int mem_flags;
>  #define I915_BO_FLAG_STRUCT_PAGE BIT(0) /* Object backed by struct pages */
>  #define I915_BO_FLAG_IOMEM       BIT(1) /* Object backed by IO memory */
> -	/*
> -	 * Is the object to be mapped as read-only to the GPU
> -	 * Only honoured if hardware has relevant pte bit
> +	/**
> +	 * @cache_level: The desired GTT caching level.
> +	 *
> +	 * See enum i915_cache_level for possible values, along with what
> +	 * each does.
>  	 */
>  	unsigned int cache_level:3;
> -	unsigned int cache_coherent:2;
> +	/**
> +	 * @cache_coherent:
> +	 *
> +	 * Track whether the pages are coherent with the GPU if reading or
> +	 * writing through the CPU caches. This largely depends on the
> +	 * @cache_level setting.
> +	 *
> +	 * On platforms which don't have the shared LLC(HAS_SNOOP), like on Atom
> +	 * platforms, coherency must be explicitly requested with some special
> +	 * GTT caching bits(see enum i915_cache_level). When enabling coherency
> +	 * it does come at a performance and power cost on such platforms. On
> +	 * the flip side the kernel does need to manually flush any buffers

does _not_ need

I think at least that's what you mean here.

> +	 * which need to be coherent with the GPU, if the object is not
> +	 * coherent i.e @cache_coherent is zero.
> +	 *
> +	 * On platforms that share the LLC with the CPU(HAS_LLC), all GT memory
> +	 * access will automatically snoop the CPU caches(even with CACHE_NONE).
> +	 * The one exception is when dealing with the display engine, like with
> +	 * scanout surfaces. To handle this the kernel will always flush the
> +	 * surface out of the CPU caches when preparing it for scanout.  Also
> +	 * note that since scanout surfaces are only ever read by the display
> +	 * engine we only need to care about flushing any writes through the CPU
> +	 * cache, reads on the other hand will always be coherent.
> +	 *
> +	 * Something strange here is why @cache_coherent is not a simple
> +	 * boolean, i.e coherent vs non-coherent. The reasoning for this is back
> +	 * to the display engine not being fully coherent. As a result scanout
> +	 * surfaces will either be marked as I915_CACHE_NONE or I915_CACHE_WT.
> +	 * In the case of seeing I915_CACHE_NONE the kernel makes the assumption
> +	 * that this is likely a scanout surface, and will set @cache_coherent
> +	 * as only I915_BO_CACHE_COHERENT_FOR_READ, on platforms with the shared

Do we only do this for NONE, and not for WT? That would be a bit of a bug I
guess ...

> +	 * LLC. The kernel uses this to always flush writes through the CPU
> +	 * cache as early as possible, where it can, in effect keeping
> +	 * @cache_dirty clean, so we can potentially avoid stalling when
> +	 * flushing the surface just before doing the scanout.  This does mean
> +	 * we might unnecessarily flush non-scanout objects in some places, but
> +	 * the default assumption is that all normal objects should be using
> +	 * I915_CACHE_LLC, at least on platforms with the shared LLC.
> +	 *
> +	 * Supported values:
> +	 *
> +	 * I915_BO_CACHE_COHERENT_FOR_READ:
> +	 *
> +	 * On shared LLC platforms, we use this for special scanout surfaces,
> +	 * where the display engine is not coherent with the CPU cache. As such
> +	 * we need to ensure we flush any writes before doing the scanout. As an
> +	 * optimisation we try to flush any writes as early as possible to avoid
> +	 * stalling later.
> +	 *
> +	 * Thus for scanout surfaces using I915_CACHE_NONE, on shared LLC
> +	 * platforms, we use:
> +	 *
> +	 *	cache_coherent = I915_BO_CACHE_COHERENT_FOR_READ
> +	 *
> +	 * While for normal objects that are fully coherent we use:
> +	 *
> +	 *	cache_coherent = I915_BO_CACHE_COHERENT_FOR_READ |
> +	 *			 I915_BO_CACHE_COHERENT_FOR_WRITE
> +	 *
> +	 * And then for objects that are not coherent at all we use:
> +	 *
> +	 *	cache_coherent = 0
> +	 *
> +	 * I915_BO_CACHE_COHERENT_FOR_WRITE:
> +	 *
> +	 * When writing through the CPU cache, the GPU is still coherent. Note
> +	 * that this also implies I915_BO_CACHE_COHERENT_FOR_READ.
> +	 */
>  #define I915_BO_CACHE_COHERENT_FOR_READ BIT(0)
>  #define I915_BO_CACHE_COHERENT_FOR_WRITE BIT(1)
> +	unsigned int cache_coherent:2;
> +
> +	/**
> +	 * @cache_dirty:
> +	 *
> +	 * Track if we are dirty with writes through the CPU cache for this
> +	 * object. As a result reading directly from main memory might yield
> +	 * stale data.
> +	 *
> +	 * This also ties into whether the kernel is tracking the object as
> +	 * coherent with the GPU, as per @cache_coherent, as it determines if
> +	 * flushing might be needed at various points.
> +	 *
> +	 * Another part of @cache_dirty is managing flushing when first
> +	 * acquiring the pages for system memory, at this point the pages are
> +	 * considered foreign, so the default assumption is that the cache is
> +	 * dirty, for example the page zeroing done by the kernel might leave
> +	 * writes through the CPU cache, or swapping-in, while the actual data in
> +	 * main memory is potentially stale.  Note that this is a potential
> +	 * security issue when dealing with userspace objects and zeroing. Now,
> +	 * whether we actually need to apply the big sledgehammer of flushing all
> +	 * the pages on acquire depends on if @cache_coherent is marked as
> +	 * I915_BO_CACHE_COHERENT_FOR_WRITE, i.e that the GPU will be coherent
> +	 * for both reads and writes through the CPU cache.
> +	 *
> +	 * Note that on shared LLC platforms we still apply the heavy flush for
> +	 * I915_CACHE_NONE objects, under the assumption that this is going to
> +	 * be used for scanout.
> +	 */

I feel like rethinking all our special cases here would be really good,
especially around whether we need to flush for security concerns, or not.

E.g. on !LLC platforms, if we set an object to CACHE_LLC, but then use
MOCS to not access it as such: can we bypass the CPU cache and potentially
get stale data because i915 didn't force the clflush for this case?

>  	unsigned int cache_dirty:1;
>  
>  	/**
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 0321a1f9738d..f97792ccc199 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -394,15 +394,6 @@ struct drm_i915_display_funcs {
>  	void (*read_luts)(struct intel_crtc_state *crtc_state);
>  };
>  
> -enum i915_cache_level {
> -	I915_CACHE_NONE = 0,
> -	I915_CACHE_LLC, /* also used for snoopable memory on non-LLC */
> -	I915_CACHE_L3_LLC, /* gen7+, L3 sits between the domain specifc
> -			      caches, eg sampler/render caches, and the
> -			      large Last-Level-Cache. LLC is coherent with
> -			      the CPU, but L3 is only visible to the GPU. */
> -	I915_CACHE_WT, /* hsw:gt3e WriteThrough for scanouts */
> -};
>  
>  #define I915_COLOR_UNEVICTABLE (-1) /* a non-vma sharing the address space */

With the nits addressed:

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

>  
> -- 
> 2.26.3
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Intel-gfx] [PATCH v3 1/2] drm/i915: document caching related bits
  2021-07-22 11:54 ` [Intel-gfx] [PATCH v3 1/2] drm/i915: document caching related bits Daniel Vetter
@ 2021-07-23  8:58   ` Matthew Auld
  0 siblings, 0 replies; 4+ messages in thread
From: Matthew Auld @ 2021-07-23  8:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, Intel Graphics Development, Matthew Auld, ML dri-devel

On Thu, 22 Jul 2021 at 12:54, Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Jul 22, 2021 at 12:34:55PM +0100, Matthew Auld wrote:
> > Try to document the object caching related bits, like cache_coherent and
> > cache_dirty.
> >
> > v2(Ville):
> >  - As pointed out by Ville, fix the completely incorrect assumptions
> >    about the "partial" coherency on shared LLC platforms.
> > v3(Daniel):
> >  - Fix nonsense about "dirtying" the cache with reads.
> >
> > Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >  .../gpu/drm/i915/gem/i915_gem_object_types.h  | 176 +++++++++++++++++-
> >  drivers/gpu/drm/i915/i915_drv.h               |   9 -
> >  2 files changed, 172 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > index afbadfc5516b..40cce816a7e3 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > @@ -92,6 +92,76 @@ struct drm_i915_gem_object_ops {
> >       const char *name; /* friendly name for debug, e.g. lockdep classes */
> >  };
> >
> > +/**
> > + * enum i915_cache_level - The supported GTT caching values for system memory
> > + * pages.
> > + *
> > + * These translate to some special GTT PTE bits when binding pages into some
> > + * address space. It also determines whether an object, or rather its pages are
> > + * coherent with the GPU, when also reading or writing through the CPU cache
> > + * with those pages.
> > + *
> > + * Userspace can also control this through struct drm_i915_gem_caching.
> > + */
> > +enum i915_cache_level {
> > +     /**
> > +      * @I915_CACHE_NONE:
> > +      *
> > +      * Not coherent with the CPU cache. If the cache is dirty and we need
> > +      * the underlying pages to be coherent with some later GPU access then
> > +      * we need to manually flush the pages.
> > +      *
> > +      * Note that on shared LLC platforms reads and writes through the CPU
> > +      * cache are still coherent even with this setting. See also
> > +      * &drm_i915_gem_object.cache_coherent for more details.
> > +      *
> > +      * Note that on platforms with a shared LLC this should ideally only be
>
> Merge this with the previous note and maybe explain it with "Due to this
> we should only use uncached for scanout surfaces on platforms with shared
> LLC, otherwise ..."
>
> As-is it reads a bit awkward/repetitive.
>
> > +      * used for scanout surfaces, otherwise we end up over-flushing in some
> > +      * places.
>
> Maybe also note that on non-LLC platforms uncached is the default.
>
> > +      */
> > +     I915_CACHE_NONE = 0,
> > +     /**
> > +      * @I915_CACHE_LLC:
> > +      *
> > +      * Coherent with the CPU cache. If the cache is dirty, then the GPU will
> > +      * ensure that access remains coherent, when both reading and writing
> > +      * through the CPU cache.
> > +      *
> > +      * Not used for scanout surfaces.
> > +      *
> > +      * Applies to both platforms with shared LLC(HAS_LLC), and snooping
> > +      * based platforms(HAS_SNOOP).
> > +      *
> > +      * This should be the default for platforms which share the LLC with the
> s/should/is/
>
> After all it _is_ the default at object creation time.
>
> > +      * CPU. The only exception is scanout objects, where the display engine
> > +      * is not coherent with the LLC. For such objects I915_CACHE_NONE or
> > +      * I915_CACHE_WT should be used.
>
> Maybe clarify that we automatically apply this transition upon
> pin_for_display if userspace hasn't done it.
>
> > +      */
> > +     I915_CACHE_LLC,
> > +     /**
> > +      * @I915_CACHE_L3_LLC:
> > +      *
> > +      * Explicitly enable the Gfx L3 cache, with snooped LLC.
> > +      *
> > +      * The Gfx L3 sits between the domain specific caches, e.g
> > +      * sampler/render caches, and the larger LLC. LLC is coherent with the
> > +      * GPU, but L3 is only visible to the GPU, so likely needs to be flushed
> > +      * when the workload completes.
> > +      *
> > +      * Not used for scanout surfaces.
> > +      *
> > +      * Only exposed on some gen7 + GGTT. More recent hardware has dropped
> > +      * this.
>
> I think it's also the default on these?

I would say yes.

>
> > +      */
> > +     I915_CACHE_L3_LLC,
>
> > +     /**
> > +      * @I915_CACHE_WT:
> > +      *
> > +      * hsw:gt3e Write-through for scanout buffers.
>
> I haven't checked, but are we using this automatically?

Yes, if the HW supports it.

>
> > +      */
> > +     I915_CACHE_WT,
> > +};
> > +
> >  enum i915_map_type {
> >       I915_MAP_WB = 0,
> >       I915_MAP_WC,
> > @@ -229,14 +299,112 @@ struct drm_i915_gem_object {
> >       unsigned int mem_flags;
> >  #define I915_BO_FLAG_STRUCT_PAGE BIT(0) /* Object backed by struct pages */
> >  #define I915_BO_FLAG_IOMEM       BIT(1) /* Object backed by IO memory */
> > -     /*
> > -      * Is the object to be mapped as read-only to the GPU
> > -      * Only honoured if hardware has relevant pte bit
> > +     /**
> > +      * @cache_level: The desired GTT caching level.
> > +      *
> > +      * See enum i915_cache_level for possible values, along with what
> > +      * each does.
> >        */
> >       unsigned int cache_level:3;
> > -     unsigned int cache_coherent:2;
> > +     /**
> > +      * @cache_coherent:
> > +      *
> > +      * Track whether the pages are coherent with the GPU if reading or
> > +      * writing through the CPU caches. This largely depends on the
> > +      * @cache_level setting.
> > +      *
> > +      * On platforms which don't have the shared LLC(HAS_SNOOP), like on Atom
> > +      * platforms, coherency must be explicitly requested with some special
> > +      * GTT caching bits(see enum i915_cache_level). When enabling coherency
> > +      * it does come at a performance and power cost on such platforms. On
> > +      * the flip side the kernel does need to manually flush any buffers
>
> does _not_ need
>
> I think at least that's what you mean here.
>
> > +      * which need to be coherent with the GPU, if the object is not
> > +      * coherent i.e @cache_coherent is zero.
> > +      *
> > +      * On platforms that share the LLC with the CPU(HAS_LLC), all GT memory
> > +      * access will automatically snoop the CPU caches(even with CACHE_NONE).
> > +      * The one exception is when dealing with the display engine, like with
> > +      * scanout surfaces. To handle this the kernel will always flush the
> > +      * surface out of the CPU caches when preparing it for scanout.  Also
> > +      * note that since scanout surfaces are only ever read by the display
> > +      * engine we only need to care about flushing any writes through the CPU
> > +      * cache, reads on the other hand will always be coherent.
> > +      *
> > +      * Something strange here is why @cache_coherent is not a simple
> > +      * boolean, i.e coherent vs non-coherent. The reasoning for this is back
> > +      * to the display engine not being fully coherent. As a result scanout
> > +      * surfaces will either be marked as I915_CACHE_NONE or I915_CACHE_WT.
> > +      * In the case of seeing I915_CACHE_NONE the kernel makes the assumption
> > +      * that this is likely a scanout surface, and will set @cache_coherent
> > +      * as only I915_BO_CACHE_COHERENT_FOR_READ, on platforms with the shared
>
> Do we only do this for NONE, and not for WT? That would be a bit of a bug I
> guess ...

If I'm reading this correctly, write-through only ensures we don't
need to flush the scanout surface when moving it out of the render
domain, but writes through the cache on the CPU side still need to be
flushed, so yeah I would have expected cache_coherent = FOR_READ here
for WT...
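
To spell out the behaviour being questioned, here is a minimal sketch. It assumes that only I915_CACHE_NONE is special-cased and that every other level, WT included, is treated as fully coherent. The names (SKETCH_COHERENT_*, sketch_cache_coherent, has_llc) are hypothetical and this is not a copy of the driver code:

```c
#include <stdbool.h>

#define SKETCH_COHERENT_FOR_READ  (1u << 0)
#define SKETCH_COHERENT_FOR_WRITE (1u << 1)

/* Hypothetical stand-in for enum i915_cache_level. */
enum sketch_cache_level {
	SKETCH_CACHE_NONE = 0,
	SKETCH_CACHE_LLC,
	SKETCH_CACHE_L3_LLC,
	SKETCH_CACHE_WT,
};

/*
 * Derive the coherency bits from the cache level.
 * Assumption: anything other than NONE counts as coherent for both
 * reads and writes, so WT is not treated like a scanout surface here.
 */
static unsigned int sketch_cache_coherent(enum sketch_cache_level level,
					  bool has_llc)
{
	if (level != SKETCH_CACHE_NONE)
		return SKETCH_COHERENT_FOR_READ | SKETCH_COHERENT_FOR_WRITE;

	/* Uncached on a shared-LLC platform: assumed to be scanout. */
	if (has_llc)
		return SKETCH_COHERENT_FOR_READ;

	/* Uncached without LLC: not coherent at all. */
	return 0;
}
```

Under that assumption a WT object ends up with both bits set, which is the case that looks suspicious for scanout.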

I guess that means we don't do the flush-early optimisations in some
places, and for the flush-on-acquire, the forced set_cache_level in
pin_to_display should still ensure cache_dirty = true before the
scanout?
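
The flush-on-acquire rule described in the @cache_dirty documentation could be modelled roughly like this. It is a sketch under stated assumptions: struct sketch_object, sketch_set_coherency and sketch_acquire_pages are hypothetical names, and the actual clflush is replaced by a return value so the logic can be checked in isolation:

```c
#include <stdbool.h>

#define SKETCH_COHERENT_FOR_READ  (1u << 0)
#define SKETCH_COHERENT_FOR_WRITE (1u << 1)

/* Hypothetical, heavily reduced model of the object's caching state. */
struct sketch_object {
	unsigned int cache_coherent;
	bool cache_dirty;
};

/*
 * Record the coherency bits. Pages are assumed dirty unless the GPU is
 * coherent with writes made through the CPU cache.
 */
static void sketch_set_coherency(struct sketch_object *obj,
				 unsigned int coherent)
{
	obj->cache_coherent = coherent;
	obj->cache_dirty = !(coherent & SKETCH_COHERENT_FOR_WRITE);
}

/*
 * Model of acquiring freshly allocated system memory pages.
 * Returns true if a full flush would have been issued.
 */
static bool sketch_acquire_pages(struct sketch_object *obj)
{
	if (!obj->cache_dirty)
		return false;

	/* The real driver would flush the pages from the CPU cache here. */
	obj->cache_dirty = false;
	return true;
}
```

With this model, a FOR_READ-only object (I915_CACHE_NONE on shared LLC) still takes the heavy flush on acquire, as the last paragraph of the documentation says.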

>
> > +      * LLC. The kernel uses this to always flush writes through the CPU
> > +      * cache as early as possible, where it can, in effect keeping
> > +      * @cache_dirty clean, so we can potentially avoid stalling when
> > +      * flushing the surface just before doing the scanout.  This does mean
> > +      * we might unnecessarily flush non-scanout objects in some places, but
> > +      * the default assumption is that all normal objects should be using
> > +      * I915_CACHE_LLC, at least on platforms with the shared LLC.
> > +      *
> > +      * Supported values:
> > +      *
> > +      * I915_BO_CACHE_COHERENT_FOR_READ:
> > +      *
> > +      * On shared LLC platforms, we use this for special scanout surfaces,
> > +      * where the display engine is not coherent with the CPU cache. As such
> > +      * we need to ensure we flush any writes before doing the scanout. As an
> > +      * optimisation we try to flush any writes as early as possible to avoid
> > +      * stalling later.
> > +      *
> > +      * Thus for scanout surfaces using I915_CACHE_NONE, on shared LLC
> > +      * platforms, we use:
> > +      *
> > +      *      cache_coherent = I915_BO_CACHE_COHERENT_FOR_READ
> > +      *
> > +      * While for normal objects that are fully coherent we use:
> > +      *
> > +      *      cache_coherent = I915_BO_CACHE_COHERENT_FOR_READ |
> > +      *                       I915_BO_CACHE_COHERENT_FOR_WRITE
> > +      *
> > +      * And then for objects that are not coherent at all we use:
> > +      *
> > +      *      cache_coherent = 0
> > +      *
> > +      * I915_BO_CACHE_COHERENT_FOR_WRITE:
> > +      *
> > +      * When writing through the CPU cache, the GPU is still coherent. Note
> > +      * that this also implies I915_BO_CACHE_COHERENT_FOR_READ.
> > +      */
> >  #define I915_BO_CACHE_COHERENT_FOR_READ BIT(0)
> >  #define I915_BO_CACHE_COHERENT_FOR_WRITE BIT(1)
> > +     unsigned int cache_coherent:2;
> > +
> > +     /**
> > +      * @cache_dirty:
> > +      *
> > +      * Track if we are dirty with writes through the CPU cache for this
> > +      * object. As a result reading directly from main memory might yield
> > +      * stale data.
> > +      *
> > +      * This also ties into whether the kernel is tracking the object as
> > +      * coherent with the GPU, as per @cache_coherent, as it determines if
> > +      * flushing might be needed at various points.
> > +      *
> > +      * Another part of @cache_dirty is managing flushing when first
> > +      * acquiring the pages for system memory, at this point the pages are
> > +      * considered foreign, so the default assumption is that the cache is
> > +      * dirty, for example the page zeroing done by the kernel might leave
> > +      * writes through the CPU cache, or swapping-in, while the actual data in
> > +      * main memory is potentially stale.  Note that this is a potential
> > +      * security issue when dealing with userspace objects and zeroing. Now,
> > +      * whether we actually need to apply the big sledgehammer of flushing all
> > +      * the pages on acquire depends on if @cache_coherent is marked as
> > +      * I915_BO_CACHE_COHERENT_FOR_WRITE, i.e that the GPU will be coherent
> > +      * for both reads and writes through the CPU cache.
> > +      *
> > +      * Note that on shared LLC platforms we still apply the heavy flush for
> > +      * I915_CACHE_NONE objects, under the assumption that this is going to
> > +      * be used for scanout.
> > +      */
>
> I feel like rethinking all our special cases here would be really good,
> especially around whether we need to flush for security concerns, or not.
>
> E.g. on !LLC platforms, if we set an object to CACHE_LLC, but then use
> mocs to not access it as such: Can we bypass the cpu cache and potentially
> get stale data because i915 didn't force the clflush for this case?
>
> >       unsigned int cache_dirty:1;
> >
> >       /**
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 0321a1f9738d..f97792ccc199 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -394,15 +394,6 @@ struct drm_i915_display_funcs {
> >       void (*read_luts)(struct intel_crtc_state *crtc_state);
> >  };
> >
> > -enum i915_cache_level {
> > -     I915_CACHE_NONE = 0,
> > -     I915_CACHE_LLC, /* also used for snoopable memory on non-LLC */
> > -     I915_CACHE_L3_LLC, /* gen7+, L3 sits between the domain specifc
> > -                           caches, eg sampler/render caches, and the
> > -                           large Last-Level-Cache. LLC is coherent with
> > -                           the CPU, but L3 is only visible to the GPU. */
> > -     I915_CACHE_WT, /* hsw:gt3e WriteThrough for scanouts */
> > -};
> >
> >  #define I915_COLOR_UNEVICTABLE (-1) /* a non-vma sharing the address space */
>
> With the nits addressed:
>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> >
> > --
> > 2.26.3
> >
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


end of thread, other threads:[~2021-07-23  8:58 UTC | newest]

Thread overview: 4+ messages
2021-07-22 11:34 [PATCH v3 1/2] drm/i915: document caching related bits Matthew Auld
2021-07-22 11:34 ` [PATCH v3 2/2] drm/i915/ehl: unconditionally flush the pages on acquire Matthew Auld
2021-07-22 11:54 ` [Intel-gfx] [PATCH v3 1/2] drm/i915: document caching related bits Daniel Vetter
2021-07-23  8:58   ` Matthew Auld
