[PATCH v6 0/2] drm/mm: Add an iterator to optimally walk over holes suitable for an allocation

* [PATCH v6 0/2] drm/mm: Add an iterator to optimally walk over holes suitable for an allocation
@ 2022-03-07 20:21 ` Vivek Kasireddy
  0 siblings, 0 replies; 31+ messages in thread
From: Vivek Kasireddy @ 2022-03-07 20:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel, tvrtko.ursulin

The first patch is a drm core patch that adds the iterator and uses it
to replace the for loop in drm_mm_insert_node_in_range(); it should not
cause any functional changes. The second patch is an i915 driver
specific patch that also uses the iterator but solves a different
problem.

v2:
- Added a new patch to this series to fix a potential NULL
  dereference.
- Fixed a typo associated with the iterator introduced in the
  drm core patch.
- Added locking around the snippet in the i915 patch that
  traverses the GGTT hole nodes.

v3: (Tvrtko)
- Replaced mutex_lock with mutex_lock_interruptible_nested() in
  the i915 patch.

v4: (Tvrtko)
- Dropped the patch added in v2 as it was deemed unnecessary.

v5: (Tvrtko)
- Fixed yet another typo in the drm core patch: should have
  passed caller_mode instead of mode to the iterator.

v6: (Tvrtko)
- Fixed the checkpatch warning about precedence issues.
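
For readers unfamiliar with the precedence item: checkpatch flags macro
arguments that are used without surrounding parentheses. The userspace
sketch below shows the hazard under stated assumptions; the flag values
and the STRIP_ONCE_*, pick_* and precedence_demo names are made up for
illustration and are not the drm_mm definitions.

```c
#include <assert.h>

/* Hypothetical flag values, chosen only for this illustration. */
#define INSERT_LOW   0x1u
#define INSERT_HIGH  0x2u
#define INSERT_ONCE  0x8u

/* Unparenthesized argument: the result depends on the caller's operators. */
#define STRIP_ONCE_BAD(mode)   (mode & ~INSERT_ONCE)
/* Parenthesized argument: the whole expression is masked. */
#define STRIP_ONCE_GOOD(mode)  ((mode) & ~INSERT_ONCE)

unsigned int pick_bad(int once)
{
	/*
	 * Expands to
	 *   (once ? INSERT_LOW | INSERT_ONCE : INSERT_HIGH & ~INSERT_ONCE)
	 * so the mask only applies to the else branch.
	 */
	return STRIP_ONCE_BAD(once ? INSERT_LOW | INSERT_ONCE : INSERT_HIGH);
}

unsigned int pick_good(int once)
{
	/* The conditional is evaluated first, then the mask is applied. */
	return STRIP_ONCE_GOOD(once ? INSERT_LOW | INSERT_ONCE : INSERT_HIGH);
}

void precedence_demo(void)
{
	/* ONCE leaks through the unparenthesized macro... */
	assert(pick_bad(1) == (INSERT_LOW | INSERT_ONCE));
	/* ...and is stripped by the parenthesized one. */
	assert(pick_good(1) == INSERT_LOW);
}
```

Writing (mode) everywhere in the iterator macro avoids this class of
problem regardless of what expression a caller passes in.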

Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Cc: Christian König <christian.koenig@amd.com>

Vivek Kasireddy (2):
  drm/mm: Add an iterator to optimally walk over holes for an allocation
    (v6)
  drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

 drivers/gpu/drm/drm_mm.c        |  32 ++++----
 drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
 include/drm/drm_mm.h            |  36 +++++++++
 3 files changed, 145 insertions(+), 51 deletions(-)

-- 
2.35.1



* [PATCH v6 1/2] drm/mm: Add an iterator to optimally walk over holes for an allocation (v6)
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
@ 2022-03-07 20:21   ` Vivek Kasireddy
  -1 siblings, 0 replies; 31+ messages in thread
From: Vivek Kasireddy @ 2022-03-07 20:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel, tvrtko.ursulin

This iterator relies on the __drm_mm_first_hole() and
__drm_mm_next_hole() functions to identify suitable holes for an
allocation of a given size by efficiently traversing the rbtree
associated with the given allocator.

It replaces the for loop in drm_mm_insert_node_in_range() and can
also be used by drm drivers to quickly identify holes of a certain
size within a given range.

v2: (Tvrtko)
- Prepend a double underscore for the newly exported first/next_hole
- s/each_best_hole/each_suitable_hole/g
- Mask out DRM_MM_INSERT_ONCE from the mode before calling
  first/next_hole and elsewhere.

v3: (Tvrtko)
- Reduce the number of hunks by retaining the "mode" variable name

v4:
- Typo: s/__drm_mm_next_hole(.., hole/__drm_mm_next_hole(.., pos

v5: (Tvrtko)
- Fixed another typo: should pass caller_mode instead of mode to
  the iterator in drm_mm_insert_node_in_range().

v6: (Tvrtko)
- Fix the checkpatch warning about precedence issues.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Acked-by: Christian König <christian.koenig@amd.com>
Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/gpu/drm/drm_mm.c | 32 +++++++++++++++-----------------
 include/drm/drm_mm.h     | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 17 deletions(-)
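
As a rough illustration of how the iterator behaves, here is a userspace
model. It is only a sketch: the model_* names and MODEL_* flag values are
hypothetical, a sorted array with a linear scan stands in for the rbtree,
and range clipping is left out. Only the shape of the macro, including
the DRM_MM_INSERT_ONCE-style handling, mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the drm_mm flags. */
#define MODEL_INSERT_LOW   0x1u
#define MODEL_INSERT_ONCE  0x8u

struct model_hole {
	uint64_t start;
	uint64_t size;
};

struct model_mm {
	const struct model_hole *holes;	/* sorted by start address */
	size_t nr_holes;
};

/* Linear scan standing in for the rbtree lookup of the real allocator. */
const struct model_hole *
model_find_hole(const struct model_mm *mm, size_t from, uint64_t size)
{
	for (size_t i = from; i < mm->nr_holes; i++)
		if (mm->holes[i].size >= size)
			return &mm->holes[i];
	return NULL;
}

const struct model_hole *
model_first_hole(const struct model_mm *mm, uint64_t size, unsigned int mode)
{
	(void)mode;	/* only a low-to-high walk is modelled here */
	return model_find_hole(mm, 0, size);
}

const struct model_hole *
model_next_hole(const struct model_mm *mm, const struct model_hole *pos,
		uint64_t size, unsigned int mode)
{
	(void)mode;
	return model_find_hole(mm, (size_t)(pos - mm->holes) + 1, size);
}

/*
 * Same shape as the macro in the patch: ONCE is stripped before calling
 * the helpers and decides whether to advance past the first hole.
 */
#define model_for_each_suitable_hole(pos, mm, size, mode) \
	for (pos = model_first_hole(mm, size, (mode) & ~MODEL_INSERT_ONCE); \
	     pos; \
	     pos = (mode) & MODEL_INSERT_ONCE ? \
	     NULL : model_next_hole(mm, pos, size, \
				    (mode) & ~MODEL_INSERT_ONCE))

/* Count how many holes the iterator visits for a given size and mode. */
unsigned int model_count_holes(const struct model_mm *mm, uint64_t size,
			       unsigned int mode)
{
	const struct model_hole *pos;
	unsigned int count = 0;

	assert(mm && mm->holes);
	model_for_each_suitable_hole(pos, mm, size, mode)
		count++;
	return count;
}
```

With the ONCE flag set the loop body runs for at most one hole, which is
what the open-coded "once" variable did before this patch.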

diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c
index 8257f9d4f619..6ff98a0e4df3 100644
--- a/drivers/gpu/drm/drm_mm.c
+++ b/drivers/gpu/drm/drm_mm.c
@@ -352,10 +352,10 @@ static struct drm_mm_node *find_hole_addr(struct drm_mm *mm, u64 addr, u64 size)
 	return node;
 }
 
-static struct drm_mm_node *
-first_hole(struct drm_mm *mm,
-	   u64 start, u64 end, u64 size,
-	   enum drm_mm_insert_mode mode)
+struct drm_mm_node *
+__drm_mm_first_hole(struct drm_mm *mm,
+		    u64 start, u64 end, u64 size,
+		    enum drm_mm_insert_mode mode)
 {
 	switch (mode) {
 	default:
@@ -374,6 +374,7 @@ first_hole(struct drm_mm *mm,
 						hole_stack);
 	}
 }
+EXPORT_SYMBOL(__drm_mm_first_hole);
 
 /**
  * DECLARE_NEXT_HOLE_ADDR - macro to declare next hole functions
@@ -410,11 +411,11 @@ static struct drm_mm_node *name(struct drm_mm_node *entry, u64 size)	\
 DECLARE_NEXT_HOLE_ADDR(next_hole_high_addr, rb_left, rb_right)
 DECLARE_NEXT_HOLE_ADDR(next_hole_low_addr, rb_right, rb_left)
 
-static struct drm_mm_node *
-next_hole(struct drm_mm *mm,
-	  struct drm_mm_node *node,
-	  u64 size,
-	  enum drm_mm_insert_mode mode)
+struct drm_mm_node *
+__drm_mm_next_hole(struct drm_mm *mm,
+		   struct drm_mm_node *node,
+		   u64 size,
+		   enum drm_mm_insert_mode mode)
 {
 	switch (mode) {
 	default:
@@ -432,6 +433,7 @@ next_hole(struct drm_mm *mm,
 		return &node->hole_stack == &mm->hole_stack ? NULL : node;
 	}
 }
+EXPORT_SYMBOL(__drm_mm_next_hole);
 
 /**
  * drm_mm_reserve_node - insert an pre-initialized node
@@ -516,11 +518,11 @@ int drm_mm_insert_node_in_range(struct drm_mm * const mm,
 				u64 size, u64 alignment,
 				unsigned long color,
 				u64 range_start, u64 range_end,
-				enum drm_mm_insert_mode mode)
+				enum drm_mm_insert_mode caller_mode)
 {
 	struct drm_mm_node *hole;
 	u64 remainder_mask;
-	bool once;
+	enum drm_mm_insert_mode mode = caller_mode & ~DRM_MM_INSERT_ONCE;
 
 	DRM_MM_BUG_ON(range_start > range_end);
 
@@ -533,13 +535,9 @@ int drm_mm_insert_node_in_range(struct drm_mm * const mm,
 	if (alignment <= 1)
 		alignment = 0;
 
-	once = mode & DRM_MM_INSERT_ONCE;
-	mode &= ~DRM_MM_INSERT_ONCE;
-
 	remainder_mask = is_power_of_2(alignment) ? alignment - 1 : 0;
-	for (hole = first_hole(mm, range_start, range_end, size, mode);
-	     hole;
-	     hole = once ? NULL : next_hole(mm, hole, size, mode)) {
+	drm_mm_for_each_suitable_hole(hole, mm, range_start, range_end,
+				      size, caller_mode) {
 		u64 hole_start = __drm_mm_hole_node_start(hole);
 		u64 hole_end = hole_start + hole->hole_size;
 		u64 adj_start, adj_end;
diff --git a/include/drm/drm_mm.h b/include/drm/drm_mm.h
index ac33ba1b18bc..896754fa6d69 100644
--- a/include/drm/drm_mm.h
+++ b/include/drm/drm_mm.h
@@ -400,6 +400,42 @@ static inline u64 drm_mm_hole_node_end(const struct drm_mm_node *hole_node)
 	     1 : 0; \
 	     pos = list_next_entry(pos, hole_stack))
 
+struct drm_mm_node *
+__drm_mm_first_hole(struct drm_mm *mm,
+		    u64 start, u64 end, u64 size,
+		    enum drm_mm_insert_mode mode);
+
+struct drm_mm_node *
+__drm_mm_next_hole(struct drm_mm *mm,
+		   struct drm_mm_node *node,
+		   u64 size,
+		   enum drm_mm_insert_mode mode);
+
+/**
+ * drm_mm_for_each_suitable_hole - iterator to optimally walk over all
+ * holes that can fit an allocation of the given @size.
+ * @pos: &drm_mm_node used internally to track progress
+ * @mm: &drm_mm allocator to walk
+ * @range_start: start of the allowed range for the allocation
+ * @range_end: end of the allowed range for the allocation
+ * @size: size of the allocation
+ * @mode: fine-tune the allocation search
+ *
+ * This iterator walks over all holes suitable for an allocation of the
+ * given @size. It is implemented by calling __drm_mm_first_hole() and
+ * __drm_mm_next_hole(), which identify the appropriate holes within the
+ * given range by efficiently traversing the rbtree associated with
+ * @mm.
+ */
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))
+
 /*
  * Basic range manager support (drm_mm.c)
  */
-- 
2.35.1



* [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
@ 2022-03-07 20:21   ` Vivek Kasireddy
  -1 siblings, 0 replies; 31+ messages in thread
From: Vivek Kasireddy @ 2022-03-07 20:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel, tvrtko.ursulin

On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
more framebuffers/scanout buffers results in only one of them being
mappable/fenceable. Therefore, pageflipping between these 2 FBs, where
only one is mappable/fenceable, creates latencies large enough to miss
alternate vblanks, thereby producing a suboptimal framerate.

This mainly happens because when i915_gem_object_pin_to_display_plane()
is called to pin one of the FB objs, the associated vma is identified
as misplaced and therefore i915_vma_unbind() is called, which unbinds and
evicts it. This misplaced vma gets subsequently pinned only when
i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
results in a latency of ~10ms and happens every other vblank/repaint cycle.
Therefore, to fix this issue, we try to see if there is space to map
at least two objects of a given size and return early if there isn't. This
ensures that we do not try with PIN_MAPPABLE for any objects that
are too big to map, thereby preventing an unnecessary unbind.

Testcase:
Running Weston and weston-simple-egl on an Alder Lake-S (ADL-S) platform
with an 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
a frame ~7ms before the next vblank, the latencies seen between atomic
commit and flip event are 7, 24 (7 + 16.66), 7, 24, ... ms, suggesting that
it misses the vblank every other frame.
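
The ~40 FPS figure is consistent with that pattern: if every second
flip lands one vblank late, two frames take three vblank intervals. A
small sketch of the arithmetic follows; the function names are
hypothetical and the model assumes a late flip slips by exactly one
vblank.

```c
#include <assert.h>

/*
 * Effective frame rate when, out of every `frames` consecutive flips,
 * `late` of them miss their target vblank and land one vblank later.
 */
double effective_fps(double refresh_hz, unsigned int frames, unsigned int late)
{
	assert(frames > 0 && late <= frames);
	/* An on-time flip consumes one vblank interval, a late flip two. */
	return refresh_hz * frames / (frames + late);
}

/*
 * Commit-to-flip latency of a late frame: the usual lead time plus one
 * extra vblank interval.
 */
double late_flip_latency_ms(double lead_ms, double refresh_hz)
{
	assert(refresh_hz > 0.0);
	return lead_ms + 1000.0 / refresh_hz;
}
```

For a 60 Hz mode, effective_fps(60.0, 2, 1) gives 40 and
late_flip_latency_ms(7.0, 60.0) gives roughly 23.7 ms, matching the
7/24 ms alternation reported above.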

Here is the ftrace snippet that shows the source of the ~10ms latency:
              i915_gem_object_pin_to_display_plane() {
0.102 us   |    i915_gem_object_set_cache_level();
                i915_gem_object_ggtt_pin_ww() {
0.390 us   |      i915_vma_instance();
0.178 us   |      i915_vma_misplaced();
                  i915_vma_unbind() {
                  __i915_active_wait() {
0.082 us   |        i915_active_acquire_if_busy();
0.475 us   |      }
                  intel_runtime_pm_get() {
0.087 us   |        intel_runtime_pm_acquire();
0.259 us   |      }
                  __i915_active_wait() {
0.085 us   |        i915_active_acquire_if_busy();
0.240 us   |      }
                  __i915_vma_evict() {
                    ggtt_unbind_vma() {
                      gen8_ggtt_clear_range() {
10507.255 us |        }
10507.689 us |      }
10508.516 us |   }

v2: Instead of using bigjoiner checks, determine whether a scanout
    buffer is too big by checking to see if it is possible to map
    two of them into the ggtt.

v3 (Ville):
- Count how many fb objects can be fit into the available holes
  instead of checking for a hole twice the object size.
- Take alignment constraints into account.
- Limit this large scanout buffer check to >= Gen 11 platforms.

v4:
- Remove existing heuristic that checks just for size. (Ville)
- Return early if we find space to map at least two objects. (Tvrtko)
- Slightly update the commit message.

v5: (Tvrtko)
- Rename the function to indicate that the object may be too big to
  map into the aperture.
- Account for guard pages while calculating the total size required
  for the object.
- Do not subject all objects to the heuristic check and instead
  consider objects only of a certain size.
- Do the hole walk using the rbtree.
- Preserve the existing PIN_NONBLOCK logic.
- Drop the PIN_MAPPABLE check while pinning the VMA.

v6: (Tvrtko)
- Return 0 on success and the specific error code on failure to
  preserve the existing behavior.

v7: (Ville)
- Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
  size < ggtt->mappable_end / 4 checks.
- Drop the redundant check that is based on previous heuristic.

v8:
- Make sure that we are holding the mutex associated with ggtt vm
  as we traverse the hole nodes.

v9: (Tvrtko)
- Use mutex_lock_interruptible_nested() instead of mutex_lock().

Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Manasi Navare <manasi.d.navare@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
 1 file changed, 94 insertions(+), 34 deletions(-)
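
To make the counting logic easier to follow, here is a userspace sketch
of the "can at least two aligned objects fit into the mappable holes"
check. It is a simplified model, not the driver code: the gtt_hole and
model_* names are hypothetical, holes come from a plain array, alignment
is assumed to be a non-zero power of two, and the guard pages added for
cache coloring are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct gtt_hole {
	uint64_t start;
	uint64_t size;
};

/* Round v up to a multiple of a power-of-two alignment. */
uint64_t model_round_up(uint64_t v, uint64_t align)
{
	return (v + align - 1) & ~(align - 1);
}

/*
 * Return 1 if at least `wanted` objects of `size` bytes, each aligned
 * to `align`, fit into the holes below `limit`; otherwise return 0.
 */
int model_fits(const struct gtt_hole *holes, size_t nr_holes,
	       uint64_t size, uint64_t align, uint64_t limit,
	       unsigned int wanted)
{
	unsigned int count = 0;

	assert(wanted > 0);
	assert(align && (align & (align - 1)) == 0);

	for (size_t i = 0; i < nr_holes; i++) {
		uint64_t pos = holes[i].start;
		uint64_t end = holes[i].start + holes[i].size;

		/* Clip the hole to the mappable limit. */
		if (end > limit)
			end = limit;

		for (;;) {
			uint64_t start = model_round_up(pos, align);

			/* Stop once an aligned object no longer fits. */
			if (start > end || end - start < size)
				break;
			/* Return early as soon as enough slots are found. */
			if (++count >= wanted)
				return 1;
			pos = start + size;
		}
	}
	return 0;
}
```

A single large hole can contribute more than one slot, which is why the
inner loop keeps advancing within the same hole instead of moving
straight to the next one.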

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9747924cc57b..e0d731b3f215 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,6 +49,7 @@
 #include "gem/i915_gem_pm.h"
 #include "gem/i915_gem_region.h"
 #include "gem/i915_gem_userptr.h"
+#include "gem/i915_gem_tiling.h"
 #include "gt/intel_engine_user.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_pm.h"
@@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
 	spin_unlock(&obj->vma.lock);
 }
 
+static int
+i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
+				 u64 alignment, u64 flags)
+{
+	struct drm_i915_private *i915 = to_i915(obj->base.dev);
+	struct i915_ggtt *ggtt = to_gt(i915)->ggtt;
+	struct drm_mm_node *hole;
+	u64 hole_start, hole_end, start, end;
+	u64 fence_size, fence_alignment;
+	unsigned int count = 0;
+	int err = 0;
+
+	/*
+	 * If the required space is larger than the available
+	 * aperture, we will not be able to find a slot for the
+	 * object and unbinding the object now will be in
+	 * vain. Worse, doing so may cause us to ping-pong
+	 * the object in and out of the Global GTT and
+	 * waste a lot of cycles under the mutex.
+	 */
+	if (obj->base.size > ggtt->mappable_end)
+		return -E2BIG;
+
+	/*
+	 * If NONBLOCK is set the caller is optimistically
+	 * trying to cache the full object within the mappable
+	 * aperture, and *must* have a fallback in place for
+	 * situations where we cannot bind the object. We
+	 * can be a little more lax here and use the fallback
+	 * more often to avoid costly migrations of ourselves
+	 * and other objects within the aperture.
+	 */
+	if (!(flags & PIN_NONBLOCK))
+		return 0;
+
+	/*
+	 * Other objects such as batchbuffers are fairly small compared
+	 * to FBs and are unlikely to exhaust the aperture space.
+	 * Therefore, return early if this obj is not an FB.
+	 */
+	if (!i915_gem_object_is_framebuffer(obj))
+		return 0;
+
+	fence_size = i915_gem_fence_size(i915, obj->base.size,
+					 i915_gem_object_get_tiling(obj),
+					 i915_gem_object_get_stride(obj));
+
+	if (i915_vm_has_cache_coloring(&ggtt->vm))
+		fence_size += 2 * I915_GTT_PAGE_SIZE;
+
+	fence_alignment = i915_gem_fence_alignment(i915, obj->base.size,
+						   i915_gem_object_get_tiling(obj),
+						   i915_gem_object_get_stride(obj));
+	alignment = max_t(u64, alignment, fence_alignment);
+
+	err = mutex_lock_interruptible_nested(&ggtt->vm.mutex, 0);
+	if (err)
+		return err;
+
+	/*
+	 * Assuming this object is a large scanout buffer, we try to find
+	 * out if there is room to map at least two of them. There could
+	 * be space available to map one but to be consistent, we try to
+	 * avoid mapping/fencing any of them.
+	 */
+	drm_mm_for_each_suitable_hole(hole, &ggtt->vm.mm, 0, ggtt->mappable_end,
+				      fence_size, DRM_MM_INSERT_LOW) {
+		hole_start = drm_mm_hole_node_start(hole);
+		hole_end = hole_start + hole->hole_size;
+
+		do {
+			start = round_up(hole_start, alignment);
+			end = min_t(u64, hole_end, ggtt->mappable_end);
+
+			if (range_overflows(start, fence_size, end))
+				break;
+
+			if (++count >= 2) {
+				mutex_unlock(&ggtt->vm.mutex);
+				return 0;
+			}
+
+			hole_start = start + fence_size;
+		} while (1);
+	}
+
+	mutex_unlock(&ggtt->vm.mutex);
+	return -ENOSPC;
+}
+
 struct i915_vma *
 i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
 			    struct i915_gem_ww_ctx *ww,
@@ -897,36 +988,9 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
 
 	if (flags & PIN_MAPPABLE &&
 	    (!view || view->type == I915_GGTT_VIEW_NORMAL)) {
-		/*
-		 * If the required space is larger than the available
-		 * aperture, we will not able to find a slot for the
-		 * object and unbinding the object now will be in
-		 * vain. Worse, doing so may cause us to ping-pong
-		 * the object in and out of the Global GTT and
-		 * waste a lot of cycles under the mutex.
-		 */
-		if (obj->base.size > ggtt->mappable_end)
-			return ERR_PTR(-E2BIG);
-
-		/*
-		 * If NONBLOCK is set the caller is optimistically
-		 * trying to cache the full object within the mappable
-		 * aperture, and *must* have a fallback in place for
-		 * situations where we cannot bind the object. We
-		 * can be a little more lax here and use the fallback
-		 * more often to avoid costly migrations of ourselves
-		 * and other objects within the aperture.
-		 *
-		 * Half-the-aperture is used as a simple heuristic.
-		 * More interesting would to do search for a free
-		 * block prior to making the commitment to unbind.
-		 * That caters for the self-harm case, and with a
-		 * little more heuristics (e.g. NOFAULT, NOEVICT)
-		 * we could try to minimise harm to others.
-		 */
-		if (flags & PIN_NONBLOCK &&
-		    obj->base.size > ggtt->mappable_end / 2)
-			return ERR_PTR(-ENOSPC);
+		ret = i915_gem_object_fits_in_aperture(obj, alignment, flags);
+		if (ret)
+			return ERR_PTR(ret);
 	}
 
 new_vma:
@@ -938,10 +1002,6 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
 		if (flags & PIN_NONBLOCK) {
 			if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
 				return ERR_PTR(-ENOSPC);
-
-			if (flags & PIN_MAPPABLE &&
-			    vma->fence_size > ggtt->mappable_end / 2)
-				return ERR_PTR(-ENOSPC);
 		}
 
 		if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma)) {
-- 
2.35.1



* [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
@ 2022-03-07 20:21   ` Vivek Kasireddy
  0 siblings, 0 replies; 31+ messages in thread
From: Vivek Kasireddy @ 2022-03-07 20:21 UTC (permalink / raw)
  To: intel-gfx, dri-devel, tvrtko.ursulin

On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
more framebuffers/scanout buffers results in only one of them being
mappable/fenceable. Therefore, pageflipping between these 2 FBs, where
only one is mappable/fenceable, creates latencies large enough to miss
alternate vblanks, thereby producing a suboptimal framerate.

This mainly happens because when i915_gem_object_pin_to_display_plane()
is called to pin one of the FB objs, the associated vma is identified
as misplaced and therefore i915_vma_unbind() is called, which unbinds and
evicts it. This misplaced vma gets subsequently pinned only when
i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
results in a latency of ~10ms and happens every other vblank/repaint cycle.
Therefore, to fix this issue, we try to see if there is space to map
at least two objects of a given size and return early if there isn't. This
ensures that we do not try with PIN_MAPPABLE for any objects that
are too big to map, thereby preventing an unnecessary unbind.

Testcase:
Running Weston and weston-simple-egl on an Alder Lake-S (ADL-S) platform
with an 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
a frame ~7ms before the next vblank, the latencies seen between atomic
commit and flip event are 7, 24 (7 + 16.66), 7, 24, ... ms, suggesting that
it misses the vblank every other frame.

Here is the ftrace snippet that shows the source of the ~10ms latency:
              i915_gem_object_pin_to_display_plane() {
0.102 us   |    i915_gem_object_set_cache_level();
                i915_gem_object_ggtt_pin_ww() {
0.390 us   |      i915_vma_instance();
0.178 us   |      i915_vma_misplaced();
                  i915_vma_unbind() {
                  __i915_active_wait() {
0.082 us   |        i915_active_acquire_if_busy();
0.475 us   |      }
                  intel_runtime_pm_get() {
0.087 us   |        intel_runtime_pm_acquire();
0.259 us   |      }
                  __i915_active_wait() {
0.085 us   |        i915_active_acquire_if_busy();
0.240 us   |      }
                  __i915_vma_evict() {
                    ggtt_unbind_vma() {
                      gen8_ggtt_clear_range() {
10507.255 us |        }
10507.689 us |      }
10508.516 us |   }

v2: Instead of using bigjoiner checks, determine whether a scanout
    buffer is too big by checking to see if it is possible to map
    two of them into the ggtt.

v3 (Ville):
- Count how many fb objects can be fit into the available holes
  instead of checking for a hole twice the object size.
- Take alignment constraints into account.
- Limit this large scanout buffer check to >= Gen 11 platforms.

v4:
- Remove existing heuristic that checks just for size. (Ville)
- Return early if we find space to map at least two objects. (Tvrtko)
- Slightly update the commit message.

v5: (Tvrtko)
- Rename the function to indicate that the object may be too big to
  map into the aperture.
- Account for guard pages while calculating the total size required
  for the object.
- Do not subject all objects to the heuristic check and instead
  consider objects only of a certain size.
- Do the hole walk using the rbtree.
- Preserve the existing PIN_NONBLOCK logic.
- Drop the PIN_MAPPABLE check while pinning the VMA.

v6: (Tvrtko)
- Return 0 on success and the specific error code on failure to
  preserve the existing behavior.

v7: (Ville)
- Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
  size < ggtt->mappable_end / 4 checks.
- Drop the redundant check that is based on previous heuristic.

v8:
- Make sure that we are holding the mutex associated with ggtt vm
  as we traverse the hole nodes.

v9: (Tvrtko)
- Use mutex_lock_interruptible_nested() instead of mutex_lock().

Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Manasi Navare <manasi.d.navare@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
 1 file changed, 94 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9747924cc57b..e0d731b3f215 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,6 +49,7 @@
 #include "gem/i915_gem_pm.h"
 #include "gem/i915_gem_region.h"
 #include "gem/i915_gem_userptr.h"
+#include "gem/i915_gem_tiling.h"
 #include "gt/intel_engine_user.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_pm.h"
@@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
 	spin_unlock(&obj->vma.lock);
 }
 
+static int
+i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
+				 u64 alignment, u64 flags)
+{
+	struct drm_i915_private *i915 = to_i915(obj->base.dev);
+	struct i915_ggtt *ggtt = to_gt(i915)->ggtt;
+	struct drm_mm_node *hole;
+	u64 hole_start, hole_end, start, end;
+	u64 fence_size, fence_alignment;
+	unsigned int count = 0;
+	int err = 0;
+
+	/*
+	 * If the required space is larger than the available
+	 * aperture, we will not be able to find a slot for the
+	 * object and unbinding the object now will be in
+	 * vain. Worse, doing so may cause us to ping-pong
+	 * the object in and out of the Global GTT and
+	 * waste a lot of cycles under the mutex.
+	 */
+	if (obj->base.size > ggtt->mappable_end)
+		return -E2BIG;
+
+	/*
+	 * If NONBLOCK is set the caller is optimistically
+	 * trying to cache the full object within the mappable
+	 * aperture, and *must* have a fallback in place for
+	 * situations where we cannot bind the object. We
+	 * can be a little more lax here and use the fallback
+	 * more often to avoid costly migrations of ourselves
+	 * and other objects within the aperture.
+	 */
+	if (!(flags & PIN_NONBLOCK))
+		return 0;
+
+	/*
+	 * Other objects such as batchbuffers are fairly small compared
+	 * to FBs and are unlikely to exhaust the aperture space.
+	 * Therefore, return early if this obj is not an FB.
+	 */
+	if (!i915_gem_object_is_framebuffer(obj))
+		return 0;
+
+	fence_size = i915_gem_fence_size(i915, obj->base.size,
+					 i915_gem_object_get_tiling(obj),
+					 i915_gem_object_get_stride(obj));
+
+	if (i915_vm_has_cache_coloring(&ggtt->vm))
+		fence_size += 2 * I915_GTT_PAGE_SIZE;
+
+	fence_alignment = i915_gem_fence_alignment(i915, obj->base.size,
+						   i915_gem_object_get_tiling(obj),
+						   i915_gem_object_get_stride(obj));
+	alignment = max_t(u64, alignment, fence_alignment);
+
+	err = mutex_lock_interruptible_nested(&ggtt->vm.mutex, 0);
+	if (err)
+		return err;
+
+	/*
+	 * Assuming this object is a large scanout buffer, we try to find
+	 * out if there is room to map at least two of them. There could
+	 * be space available to map one but to be consistent, we try to
+	 * avoid mapping/fencing any of them.
+	 */
+	drm_mm_for_each_suitable_hole(hole, &ggtt->vm.mm, 0, ggtt->mappable_end,
+				      fence_size, DRM_MM_INSERT_LOW) {
+		hole_start = drm_mm_hole_node_start(hole);
+		hole_end = hole_start + hole->hole_size;
+
+		do {
+			start = round_up(hole_start, alignment);
+			end = min_t(u64, hole_end, ggtt->mappable_end);
+
+			if (range_overflows(start, fence_size, end))
+				break;
+
+			if (++count >= 2) {
+				mutex_unlock(&ggtt->vm.mutex);
+				return 0;
+			}
+
+			hole_start = start + fence_size;
+		} while (1);
+	}
+
+	mutex_unlock(&ggtt->vm.mutex);
+	return -ENOSPC;
+}
+
 struct i915_vma *
 i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
 			    struct i915_gem_ww_ctx *ww,
@@ -897,36 +988,9 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
 
 	if (flags & PIN_MAPPABLE &&
 	    (!view || view->type == I915_GGTT_VIEW_NORMAL)) {
-		/*
-		 * If the required space is larger than the available
-		 * aperture, we will not able to find a slot for the
-		 * object and unbinding the object now will be in
-		 * vain. Worse, doing so may cause us to ping-pong
-		 * the object in and out of the Global GTT and
-		 * waste a lot of cycles under the mutex.
-		 */
-		if (obj->base.size > ggtt->mappable_end)
-			return ERR_PTR(-E2BIG);
-
-		/*
-		 * If NONBLOCK is set the caller is optimistically
-		 * trying to cache the full object within the mappable
-		 * aperture, and *must* have a fallback in place for
-		 * situations where we cannot bind the object. We
-		 * can be a little more lax here and use the fallback
-		 * more often to avoid costly migrations of ourselves
-		 * and other objects within the aperture.
-		 *
-		 * Half-the-aperture is used as a simple heuristic.
-		 * More interesting would to do search for a free
-		 * block prior to making the commitment to unbind.
-		 * That caters for the self-harm case, and with a
-		 * little more heuristics (e.g. NOFAULT, NOEVICT)
-		 * we could try to minimise harm to others.
-		 */
-		if (flags & PIN_NONBLOCK &&
-		    obj->base.size > ggtt->mappable_end / 2)
-			return ERR_PTR(-ENOSPC);
+		ret = i915_gem_object_fits_in_aperture(obj, alignment, flags);
+		if (ret)
+			return ERR_PTR(ret);
 	}
 
 new_vma:
@@ -938,10 +1002,6 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
 		if (flags & PIN_NONBLOCK) {
 			if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
 				return ERR_PTR(-ENOSPC);
-
-			if (flags & PIN_MAPPABLE &&
-			    vma->fence_size > ggtt->mappable_end / 2)
-				return ERR_PTR(-ENOSPC);
 		}
 
 		if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma)) {
-- 
2.35.1
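
For readers following the hole walk in i915_gem_object_fits_in_aperture()
above, here is a minimal userspace sketch of the same counting logic. It is
not kernel code: struct hole, round_up_pow2(), overflows() and count_slots()
are hypothetical stand-ins for drm_mm_node, round_up(), range_overflows() and
the loop body in the patch, and the sketch assumes the alignment is a non-zero
power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a drm_mm hole: the free range [start, start + size). */
struct hole {
	uint64_t start;
	uint64_t size;
};

/* Round x up to a multiple of align; align is assumed to be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* True if [start, start + size) does not fit below end. */
static bool overflows(uint64_t start, uint64_t size, uint64_t end)
{
	return start >= end || size > end - start;
}

/*
 * Count how many aligned slots of the given size fit into the holes,
 * never looking past limit (the mappable end in the patch). Stops early
 * once wanted slots have been found.
 */
static unsigned int count_slots(const struct hole *holes, size_t n,
				uint64_t size, uint64_t align,
				uint64_t limit, unsigned int wanted)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t pos = holes[i].start;
		uint64_t end = holes[i].start + holes[i].size;

		/* Clamp the hole to the usable range. */
		if (end > limit)
			end = limit;

		for (;;) {
			uint64_t start = round_up_pow2(pos, align);

			if (overflows(start, size, end))
				break;

			if (++count >= wanted)
				return count;

			/* Try to carve another slot out of the same hole. */
			pos = start + size;
		}
	}

	return count;
}
```

With a single hole of 0x3000 bytes starting at 0x1000 and a 0x1000-byte,
0x1000-aligned object, count_slots() finds two slots and stops, which
corresponds to the "return 0" path in the patch; finding fewer than two
corresponds to -ENOSPC.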



* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
                   ` (2 preceding siblings ...)
  (?)
@ 2022-03-07 20:56 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-07 20:56 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: intel-gfx

== Series Details ==

Series: drm/mm: Add an iterator to optimally walk over holes suitable for an allocation
URL   : https://patchwork.freedesktop.org/series/101123/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
521ab4ad04ad drm/mm: Add an iterator to optimally walk over holes for an allocation (v6)
-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'pos' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'mm' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'size' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'mode' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

total: 0 errors, 0 warnings, 4 checks, 114 lines checked
eccd97c3fed3 drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
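
As background on the MACRO_ARG_REUSE checks above: checkpatch flags any macro
that expands an argument more than once, because an argument with side effects
would then be evaluated more than once. The sketch below illustrates the
general hazard only. TWICE(), next_value(), demo_reuse() and demo_once() are
hypothetical names and are not part of drm_mm.h.

```c
/* Hypothetical macro that expands its argument twice. */
#define TWICE(x) ((x) + (x))

static int calls;

/* Function with a side effect: counts how often it has been called. */
static int next_value(void)
{
	return ++calls;
}

/* Passing the call directly means it runs once per expansion of x. */
static int demo_reuse(void)
{
	calls = 0;
	(void)TWICE(next_value());
	return calls;
}

/* Evaluating into a local first keeps the side effect to a single call. */
static int demo_once(void)
{
	int v;

	calls = 0;
	v = next_value();
	(void)TWICE(v);
	return calls;
}
```

Here demo_reuse() reports two calls and demo_once() reports one. For iterator
macros of the for_each kind, callers normally pass plain variables or
constants, in which case the repeated expansion is harmless.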




* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
                   ` (3 preceding siblings ...)
  (?)
@ 2022-03-07 20:58 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-07 20:58 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: intel-gfx

== Series Details ==

Series: drm/mm: Add an iterator to optimally walk over holes suitable for an allocation
URL   : https://patchwork.freedesktop.org/series/101123/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.




* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
                   ` (4 preceding siblings ...)
  (?)
@ 2022-03-08 12:42 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-08 12:42 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 20701 bytes --]

== Series Details ==

Series: drm/mm: Add an iterator to optimally walk over holes suitable for an allocation
URL   : https://patchwork.freedesktop.org/series/101123/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_11334 -> Patchwork_22506
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/index.html

Participating hosts (44 -> 45)
------------------------------

  Additional (5): fi-cml-u2 fi-skl-guc fi-pnv-d510 bat-jsl-2 fi-bsw-nick 
  Missing    (4): bat-rpls-2 fi-bsw-cyan fi-bdw-samus bat-dg1-5 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_22506:

### IGT changes ###

#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@gem_busy@busy@all:
    - {bat-dg2-9}:        [PASS][1] -> [DMESG-WARN][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/bat-dg2-9/igt@gem_busy@busy@all.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/bat-dg2-9/igt@gem_busy@busy@all.html

  
Known issues
------------

  Here are the changes found in Patchwork_22506 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_cs_nop@sync-compute0:
    - fi-cml-u2:          NOTRUN -> [SKIP][3] ([fdo#109315]) +17 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@amdgpu/amd_cs_nop@sync-compute0.html

  * igt@core_hotunplug@unbind-rebind:
    - fi-tgl-1115g4:      [PASS][4] -> [DMESG-WARN][5] ([i915#4002])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-tgl-1115g4/igt@core_hotunplug@unbind-rebind.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-tgl-1115g4/igt@core_hotunplug@unbind-rebind.html

  * igt@gem_exec_fence@basic-busy@bcs0:
    - fi-cml-u2:          NOTRUN -> [SKIP][6] ([i915#1208]) +1 similar issue
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@gem_exec_fence@basic-busy@bcs0.html

  * igt@gem_flink_basic@bad-flink:
    - fi-skl-6600u:       [PASS][7] -> [FAIL][8] ([i915#4547])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-skl-6600u/igt@gem_flink_basic@bad-flink.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-skl-6600u/igt@gem_flink_basic@bad-flink.html

  * igt@gem_huc_copy@huc-copy:
    - fi-pnv-d510:        NOTRUN -> [SKIP][9] ([fdo#109271]) +57 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-pnv-d510/igt@gem_huc_copy@huc-copy.html
    - fi-cml-u2:          NOTRUN -> [SKIP][10] ([i915#2190])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@parallel-random-engines:
    - fi-cml-u2:          NOTRUN -> [SKIP][11] ([i915#4613]) +3 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@gem_lmem_swapping@parallel-random-engines.html

  * igt@gem_lmem_swapping@random-engines:
    - fi-skl-guc:         NOTRUN -> [SKIP][12] ([fdo#109271] / [i915#4613]) +3 similar issues
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-skl-guc/igt@gem_lmem_swapping@random-engines.html

  * igt@gem_lmem_swapping@verify-random:
    - fi-bsw-nick:        NOTRUN -> [SKIP][13] ([fdo#109271]) +67 similar issues
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-bsw-nick/igt@gem_lmem_swapping@verify-random.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-skl-guc:         NOTRUN -> [SKIP][14] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-skl-guc/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@dp-hpd-fast:
    - fi-cml-u2:          NOTRUN -> [SKIP][15] ([fdo#109284] / [fdo#111827]) +8 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@kms_chamelium@dp-hpd-fast.html

  * igt@kms_chamelium@vga-edid-read:
    - fi-bsw-nick:        NOTRUN -> [SKIP][16] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-bsw-nick/igt@kms_chamelium@vga-edid-read.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy:
    - fi-cml-u2:          NOTRUN -> [SKIP][17] ([fdo#109278]) +1 similar issue
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-cml-u2:          NOTRUN -> [SKIP][18] ([fdo#109285])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_frontbuffer_tracking@basic:
    - fi-cml-u2:          NOTRUN -> [DMESG-WARN][19] ([i915#4269])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@kms_frontbuffer_tracking@basic.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-cml-u2:          NOTRUN -> [SKIP][20] ([fdo#109278] / [i915#533])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html
    - fi-skl-guc:         NOTRUN -> [SKIP][21] ([fdo#109271] / [i915#533])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-skl-guc/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  * igt@kms_psr@primary_mmap_gtt:
    - fi-skl-guc:         NOTRUN -> [SKIP][22] ([fdo#109271]) +28 similar issues
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-skl-guc/igt@kms_psr@primary_mmap_gtt.html

  * igt@prime_vgem@basic-userptr:
    - fi-cml-u2:          NOTRUN -> [SKIP][23] ([i915#3301])
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-cml-u2/igt@prime_vgem@basic-userptr.html

  * igt@runner@aborted:
    - fi-bdw-5557u:       NOTRUN -> [FAIL][24] ([i915#2426] / [i915#4312])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-bdw-5557u/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@gem_exec_suspend@basic-s3@smem:
    - fi-bdw-5557u:       [INCOMPLETE][25] ([i915#146]) -> [PASS][26]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@i915_pm_backlight@fade:
    - {shard-rkl}:        [SKIP][27] ([i915#3012]) -> [PASS][28]
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@i915_pm_backlight@fade.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@i915_pm_backlight@fade.html

  * igt@i915_pm_rps@basic-api:
    - fi-tgl-1115g4:      [DMESG-WARN][29] ([i915#4002]) -> [PASS][30] +1 similar issue
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-tgl-1115g4/igt@i915_pm_rps@basic-api.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-tgl-1115g4/igt@i915_pm_rps@basic-api.html

  * igt@kms_big_fb@x-tiled-32bpp-rotate-180:
    - {shard-dg1}:        [DMESG-WARN][31] ([i915#3891] / [i915#4935]) -> [PASS][32]
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-dg1-12/igt@kms_big_fb@x-tiled-32bpp-rotate-180.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-dg1-17/igt@kms_big_fb@x-tiled-32bpp-rotate-180.html

  * igt@kms_ccs@pipe-b-bad-pixel-format-y_tiled_gen12_rc_ccs:
    - {shard-rkl}:        [SKIP][33] ([i915#1845] / [i915#4098]) -> [PASS][34]
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_ccs@pipe-b-bad-pixel-format-y_tiled_gen12_rc_ccs.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_ccs@pipe-b-bad-pixel-format-y_tiled_gen12_rc_ccs.html

  * igt@kms_concurrent@pipe-a:
    - {shard-rkl}:        [SKIP][35] ([i915#1845] / [i915#4070]) -> [PASS][36]
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_concurrent@pipe-a.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_concurrent@pipe-a.html

  * igt@kms_cursor_edge_walk@pipe-b-256x256-bottom-edge:
    - {shard-rkl}:        [SKIP][37] ([i915#1849] / [i915#4070]) -> [PASS][38]
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_cursor_edge_walk@pipe-b-256x256-bottom-edge.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_cursor_edge_walk@pipe-b-256x256-bottom-edge.html

  * igt@kms_cursor_legacy@flip-vs-cursor-atomic:
    - {shard-rkl}:        [SKIP][39] ([fdo#111825] / [i915#4070]) -> [PASS][40]
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_cursor_legacy@flip-vs-cursor-atomic.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_cursor_legacy@flip-vs-cursor-atomic.html

  * igt@kms_fbcon_fbt@psr:
    - {shard-rkl}:        [SKIP][41] ([fdo#110189] / [i915#3955]) -> [PASS][42]
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_fbcon_fbt@psr.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_fbcon_fbt@psr.html

  * igt@kms_flip@basic-flip-vs-modeset@a-edp1:
    - {bat-adlp-6}:       [DMESG-WARN][43] ([i915#3576]) -> ([PASS][44], [PASS][45])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/bat-adlp-6/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/bat-adlp-6/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/bat-adlp-6/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html

  * igt@kms_flip@basic-flip-vs-wf_vblank@b-dsi1:
    - {fi-tgl-dsi}:       [FAIL][46] ([i915#2122]) -> [PASS][47]
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/fi-tgl-dsi/igt@kms_flip@basic-flip-vs-wf_vblank@b-dsi1.html
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/fi-tgl-dsi/igt@kms_flip@basic-flip-vs-wf_vblank@b-dsi1.html

  * igt@kms_frontbuffer_tracking@fbc-rgb565-draw-mmap-wc:
    - {shard-rkl}:        [SKIP][48] ([i915#1849]) -> [PASS][49] +6 similar issues
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_frontbuffer_tracking@fbc-rgb565-draw-mmap-wc.html
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_frontbuffer_tracking@fbc-rgb565-draw-mmap-wc.html

  * igt@kms_psr@cursor_render:
    - {shard-rkl}:        [SKIP][50] ([i915#1072]) -> [PASS][51]
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_psr@cursor_render.html
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_psr@cursor_render.html

  * igt@kms_vblank@pipe-b-query-forked-busy-hang:
    - {shard-rkl}:        [SKIP][52] ([i915#1845]) -> [PASS][53] +7 similar issues
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/shard-rkl-2/igt@kms_vblank@pipe-b-query-forked-busy-hang.html
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/shard-rkl-6/igt@kms_vblank@pipe-b-query-forked-busy-hang.html

  
#### Warnings ####

  * igt@i915_selftest@live@hangcheck:
    - bat-dg1-6:          [DMESG-FAIL][54] ([i915#4957]) -> ([DMESG-FAIL][55], [DMESG-FAIL][56]) ([i915#4494] / [i915#4957])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11334/bat-dg1-6/igt@i915_selftest@live@hangcheck.html
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/bat-dg1-6/igt@i915_selftest@live@hangcheck.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/bat-dg1-6/igt@i915_selftest@live@hangcheck.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109291]: https://bugs.freedesktop.org/show_bug.cgi?id=109291
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [fdo#109308]: https://bugs.freedesktop.org/show_bug.cgi?id=109308
  [fdo#109309]: https://bugs.freedesktop.org/show_bug.cgi?id=109309
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#109506]: https://bugs.freedesktop.org/show_bug.cgi?id=109506
  [fdo#110189]: https://bugs.freedesktop.org/show_bug.cgi?id=110189
  [fdo#110542]: https://bugs.freedesktop.org/show_bug.cgi?id=110542
  [fdo#110723]: https://bugs.freedesktop.org/show_bug.cgi?id=110723
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111314]: https://bugs.freedesktop.org/show_bug.cgi?id=111314
  [fdo#111615]: https://bugs.freedesktop.org/show_bug.cgi?id=111615
  [fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [fdo#112022]: https://bugs.freedesktop.org/show_bug.cgi?id=112022
  [fdo#112283]: https://bugs.freedesktop.org/show_bug.cgi?id=112283
  [i915#1063]: https://gitlab.freedesktop.org/drm/intel/issues/1063
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1149]: https://gitlab.freedesktop.org/drm/intel/issues/1149
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1187]: https://gitlab.freedesktop.org/drm/intel/issues/1187
  [i915#1208]: https://gitlab.freedesktop.org/drm/intel/issues/1208
  [i915#146]: https://gitlab.freedesktop.org/drm/intel/issues/146
  [i915#1769]: https://gitlab.freedesktop.org/drm/intel/issues/1769
  [i915#1825]: https://gitlab.freedesktop.org/drm/intel/issues/1825
  [i915#1839]: https://gitlab.freedesktop.org/drm/intel/issues/1839
  [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845
  [i915#1849]: https://gitlab.freedesktop.org/drm/intel/issues/1849
  [i915#2122]: https://gitlab.freedesktop.org/drm/intel/issues/2122
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426
  [i915#2433]: https://gitlab.freedesktop.org/drm/intel/issues/2433
  [i915#2436]: https://gitlab.freedesktop.org/drm/intel/issues/2436
  [i915#2527]: https://gitlab.freedesktop.org/drm/intel/issues/2527
  [i915#2530]: https://gitlab.freedesktop.org/drm/intel/issues/2530
  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#2705]: https://gitlab.freedesktop.org/drm/intel/issues/2705
  [i915#280]: https://gitlab.freedesktop.org/drm/intel/issues/280
  [i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
  [i915#2849]: https://gitlab.freedesktop.org/drm/intel/issues/2849
  [i915#2994]: https://gitlab.freedesktop.org/drm/intel/issues/2994
  [i915#3002]: https://gitlab.freedesktop.org/drm/intel/issues/3002
  [i915#3012]: https://gitlab.freedesktop.org/drm/intel/issues/3012
  [i915#3281]: https://gitlab.freedesktop.org/drm/intel/issues/3281
  [i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
  [i915#3297]: https://gitlab.freedesktop.org/drm/intel/issues/3297
  [i915#3299]: https://gitlab.freedesktop.org/drm/intel/issues/3299
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3319]: https://gitlab.freedesktop.org/drm/intel/issues/3319
  [i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
  [i915#3458]: https://gitlab.freedesktop.org/drm/intel/issues/3458
  [i915#3469]: https://gitlab.freedesktop.org/drm/intel/issues/3469
  [i915#3539]: https://gitlab.freedesktop.org/drm/intel/issues/3539
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3558]: https://gitlab.freedesktop.org/drm/intel/issues/3558
  [i915#3576]: https://gitlab.freedesktop.org/drm/intel/issues/3576
  [i915#3580]: https://gitlab.freedesktop.org/drm/intel/issues/3580
  [i915#3637]: https://gitlab.freedesktop.org/drm/intel/issues/3637
  [i915#3638]: https://gitlab.freedesktop.org/drm/intel/issues/3638
  [i915#3639]: https://gitlab.freedesktop.org/drm/intel/issues/3639
  [i915#3689]: https://gitlab.freedesktop.org/drm/intel/issues/3689
  [i915#3701]: https://gitlab.freedesktop.org/drm/intel/issues/3701
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#3719]: https://gitlab.freedesktop.org/drm/intel/issues/3719
  [i915#3734]: https://gitlab.freedesktop.org/drm/intel/issues/3734
  [i915#3804]: https://gitlab.freedesktop.org/drm/intel/issues/3804
  [i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
  [i915#3891]: https://gitlab.freedesktop.org/drm/intel/issues/3891
  [i915#3955]: https://gitlab.freedesktop.org/drm/intel/issues/3955
  [i915#4002]: https://gitlab.freedesktop.org/drm/intel/issues/4002
  [i915#4016]: https://gitlab.freedesktop.org/drm/intel/issues/4016
  [i915#402]: https://gitlab.freedesktop.org/drm/intel/issues/402
  [i915#4036]: https://gitlab.freedesktop.org/drm/intel/issues/4036
  [i915#4070]: https://gitlab.freedesktop.org/drm/intel/issues/4070
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#426]: https://gitlab.freedesktop.org/drm/intel/issues/426
  [i915#4269]: https://gitlab.freedesktop.org/drm/intel/issues/4269
  [i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
  [i915#4278]: https://gitlab.freedesktop.org/drm/intel/issues/4278
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4494]: https://gitlab.freedesktop.org/drm/intel/issues/4494
  [i915#4525]: https://gitlab.freedesktop.org/drm/intel/issues/4525
  [i915#4538]: https://gitlab.freedesktop.org/drm/intel/issues/4538
  [i915#4547]: https://gitlab.freedesktop.org/drm/intel/issues/4547
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4807]: https://gitlab.freedesktop.org/drm/intel/issues/4807
  [i915#4812]: https://gitlab.freedesktop.org/drm/intel/issues/4812
  [i915#4833]: https://gitlab.freedesktop.org/drm/intel/issues/4833
  [i915#4842]: https://gitlab.freedesktop.org/drm/intel/issues/4842
  [i915#4852]: https://gitlab.freedesktop.org/drm/intel/issues/4852
  [i915#4853]: https://gitlab.freedesktop.org/drm/intel/issues/4853
  [i915#4873]: https://gitlab.freedesktop.org/drm/intel/issues/4873
  [i915#4880]: https://gitlab.freedesktop.org/drm/intel/issues/4880
  [i915#4893]: https://gitlab.freedesktop.org/drm/intel/issues/4893
  [i915#4935]: https://gitlab.freedesktop.org/drm/intel/issues/4935
  [i915#4957]: https://gitlab.freedesktop.org/drm/intel/issues/4957
  [i915#4991]: https://gitlab.freedesktop.org/drm/intel/issues/4991
  [i915#5098]: https://gitlab.freedesktop.org/drm/intel/issues/5098
  [i915#5127]: https://gitlab.freedesktop.org/drm/intel/issues/5127
  [i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
  [i915#5257]: https://gitlab.freedesktop.org/drm/intel/issues/5257
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658


Build changes
-------------

  * Linux: CI_DRM_11334 -> Patchwork_22506

  CI-20190529: 20190529
  CI_DRM_11334: e7af229f52672104f4b170304c80e2d6849a2489 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6367: f8eac64564b12326721f1d5bea692bde4fe1ef15 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_22506: eccd97c3fed355a8d5bf56ff1b5dfa3f2fb8df9a @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

eccd97c3fed3 drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
521ab4ad04ad drm/mm: Add an iterator to optimally walk over holes for an allocation (v6)

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22506/index.html



* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
                   ` (5 preceding siblings ...)
  (?)
@ 2022-03-09 18:56 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-09 18:56 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: intel-gfx

== Series Details ==

Series: drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
URL   : https://patchwork.freedesktop.org/series/101123/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
14d91959f9bf drm/mm: Add an iterator to optimally walk over holes for an allocation (v6)
-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'pos' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'mm' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'size' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

-:160: CHECK:MACRO_ARG_REUSE: Macro argument reuse 'mode' - possible side-effects?
#160: FILE: include/drm/drm_mm.h:430:
+#define drm_mm_for_each_suitable_hole(pos, mm, range_start, range_end, \
+				      size, mode) \
+	for (pos = __drm_mm_first_hole(mm, range_start, range_end, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE); \
+	     pos; \
+	     pos = (mode) & DRM_MM_INSERT_ONCE ? \
+	     NULL : __drm_mm_next_hole(mm, pos, size, \
+				       (mode) & ~DRM_MM_INSERT_ONCE))

total: 0 errors, 0 warnings, 4 checks, 114 lines checked
f39e77441008 drm/i915/gem: Don't try to map and fence large scanout buffers (v9)




* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
                   ` (6 preceding siblings ...)
  (?)
@ 2022-03-09 18:59 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-09 18:59 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: intel-gfx

== Series Details ==

Series: drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
URL   : https://patchwork.freedesktop.org/series/101123/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.


* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
                   ` (7 preceding siblings ...)
  (?)
@ 2022-03-09 19:31 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-09 19:31 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: intel-gfx

== Series Details ==

Series: drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
URL   : https://patchwork.freedesktop.org/series/101123/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_11346 -> Patchwork_22523
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/index.html

Participating hosts (43 -> 35)
------------------------------

  Additional (1): fi-icl-u2 
  Missing    (9): bat-dg1-6 bat-dg1-5 bat-dg2-9 fi-bsw-cyan bat-adlp-6 bat-adlp-4 fi-ctg-p8600 bat-jsl-2 bat-jsl-1 

Known issues
------------

  Here are the changes found in Patchwork_22523 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_cs_nop@fork-gfx0:
    - fi-icl-u2:          NOTRUN -> [SKIP][1] ([fdo#109315]) +17 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@amdgpu/amd_cs_nop@fork-gfx0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-icl-u2:          NOTRUN -> [SKIP][2] ([i915#2190])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@parallel-random-engines:
    - fi-icl-u2:          NOTRUN -> [SKIP][3] ([i915#4613]) +3 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@gem_lmem_swapping@parallel-random-engines.html

  * igt@i915_selftest@live@gt_heartbeat:
    - fi-kbl-soraka:      [PASS][4] -> [DMESG-FAIL][5] ([i915#541])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/fi-kbl-soraka/igt@i915_selftest@live@gt_heartbeat.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-kbl-soraka/igt@i915_selftest@live@gt_heartbeat.html

  * igt@i915_selftest@live@hangcheck:
    - fi-hsw-4770:        [PASS][6] -> [INCOMPLETE][7] ([i915#4785])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html

  * igt@kms_chamelium@hdmi-hpd-fast:
    - fi-icl-u2:          NOTRUN -> [SKIP][8] ([fdo#111827]) +8 similar issues
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@kms_chamelium@hdmi-hpd-fast.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy:
    - fi-icl-u2:          NOTRUN -> [SKIP][9] ([fdo#109278]) +2 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-icl-u2:          NOTRUN -> [SKIP][10] ([fdo#109285])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_setmode@basic-clone-single-crtc:
    - fi-icl-u2:          NOTRUN -> [SKIP][11] ([i915#3555])
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@kms_setmode@basic-clone-single-crtc.html

  * igt@prime_vgem@basic-userptr:
    - fi-icl-u2:          NOTRUN -> [SKIP][12] ([i915#3301])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-icl-u2/igt@prime_vgem@basic-userptr.html

  * igt@runner@aborted:
    - fi-hsw-4770:        NOTRUN -> [FAIL][13] ([fdo#109271] / [i915#1436] / [i915#4312])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-hsw-4770/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@core_hotunplug@unbind-rebind:
    - fi-blb-e6850:       [FAIL][14] ([i915#3194]) -> [PASS][15]
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/fi-blb-e6850/igt@core_hotunplug@unbind-rebind.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/fi-blb-e6850/igt@core_hotunplug@unbind-rebind.html

  
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1436]: https://gitlab.freedesktop.org/drm/intel/issues/1436
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#3194]: https://gitlab.freedesktop.org/drm/intel/issues/3194
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4785]: https://gitlab.freedesktop.org/drm/intel/issues/4785
  [i915#541]: https://gitlab.freedesktop.org/drm/intel/issues/541


Build changes
-------------

  * Linux: CI_DRM_11346 -> Patchwork_22523

  CI-20190529: 20190529
  CI_DRM_11346: ab6456d23719e60c20e8cef05a5f322eea134b88 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6373: 82306f1903c0fee8371f43a156d8b63163ca61c1 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_22523: f39e77441008478397537bb4d081b3627f66897b @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

f39e77441008 drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
14d91959f9bf drm/mm: Add an iterator to optimally walk over holes for an allocation (v6)

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/index.html


* [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
  2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
                   ` (8 preceding siblings ...)
  (?)
@ 2022-03-10  5:00 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2022-03-10  5:00 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: intel-gfx

== Series Details ==

Series: drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2)
URL   : https://patchwork.freedesktop.org/series/101123/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_11346_full -> Patchwork_22523_full
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_22523_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_22523_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (13 -> 13)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_22523_full:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live@gem_contexts:
    - shard-skl:          [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl4/igt@i915_selftest@live@gem_contexts.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl2/igt@i915_selftest@live@gem_contexts.html

  * igt@kms_big_fb@4-tiled-max-hw-stride-32bpp-rotate-0-hflip:
    - shard-tglb:         NOTRUN -> [SKIP][3]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@kms_big_fb@4-tiled-max-hw-stride-32bpp-rotate-0-hflip.html

  * igt@kms_draw_crc@draw-method-xrgb8888-render-4tiled:
    - shard-iclb:         NOTRUN -> [SKIP][4]
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_draw_crc@draw-method-xrgb8888-render-4tiled.html

  * igt@kms_hdr@bpc-switch-dpms@bpc-switch-dpms-edp-1-pipe-a:
    - shard-skl:          [PASS][5] -> [FAIL][6] +2 similar issues
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl8/igt@kms_hdr@bpc-switch-dpms@bpc-switch-dpms-edp-1-pipe-a.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl4/igt@kms_hdr@bpc-switch-dpms@bpc-switch-dpms-edp-1-pipe-a.html

  * igt@kms_hdr@bpc-switch-suspend@bpc-switch-suspend-dp-1-pipe-a:
    - shard-kbl:          [PASS][7] -> [INCOMPLETE][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-kbl1/igt@kms_hdr@bpc-switch-suspend@bpc-switch-suspend-dp-1-pipe-a.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-kbl4/igt@kms_hdr@bpc-switch-suspend@bpc-switch-suspend-dp-1-pipe-a.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@kms_big_fb@4-tiled-64bpp-rotate-90:
    - {shard-dg1}:        NOTRUN -> [SKIP][9] +6 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-dg1-12/igt@kms_big_fb@4-tiled-64bpp-rotate-90.html

  * igt@kms_mmap_write_crc@main:
    - {shard-dg1}:        NOTRUN -> [FAIL][10]
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-dg1-19/igt@kms_mmap_write_crc@main.html

  * igt@kms_plane_multiple@atomic-pipe-d-tiling-4:
    - {shard-rkl}:        [SKIP][11] ([i915#4070]) -> [SKIP][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-rkl-1/igt@kms_plane_multiple@atomic-pipe-d-tiling-4.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-rkl-5/igt@kms_plane_multiple@atomic-pipe-d-tiling-4.html

  * {igt@kms_plane_scaling@scaler-with-pixel-format-unity-scaling@pipe-b-edp-1-scaler-with-pixel-format}:
    - shard-iclb:         [PASS][13] -> [INCOMPLETE][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-iclb4/igt@kms_plane_scaling@scaler-with-pixel-format-unity-scaling@pipe-b-edp-1-scaler-with-pixel-format.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb2/igt@kms_plane_scaling@scaler-with-pixel-format-unity-scaling@pipe-b-edp-1-scaler-with-pixel-format.html

  * igt@kms_setmode@invalid-clone-single-crtc:
    - {shard-rkl}:        NOTRUN -> [SKIP][15] +2 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-rkl-2/igt@kms_setmode@invalid-clone-single-crtc.html

  
Known issues
------------

  Here are the changes found in Patchwork_22523_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@feature_discovery@psr2:
    - shard-iclb:         [PASS][16] -> [SKIP][17] ([i915#658])
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-iclb2/igt@feature_discovery@psr2.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@feature_discovery@psr2.html

  * igt@gem_create@create-massive:
    - shard-kbl:          NOTRUN -> [DMESG-WARN][18] ([i915#4991])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-kbl3/igt@gem_create@create-massive.html

  * igt@gem_ctx_isolation@preservation-s3@vcs0:
    - shard-skl:          [PASS][19] -> [INCOMPLETE][20] ([i915#4793])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl8/igt@gem_ctx_isolation@preservation-s3@vcs0.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl3/igt@gem_ctx_isolation@preservation-s3@vcs0.html

  * igt@gem_exec_balancer@parallel-ordering:
    - shard-tglb:         NOTRUN -> [DMESG-FAIL][21] ([i915#5076])
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb8/igt@gem_exec_balancer@parallel-ordering.html

  * igt@gem_exec_capture@pi@vcs0:
    - shard-iclb:         [PASS][22] -> [INCOMPLETE][23] ([i915#3371])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-iclb8/igt@gem_exec_capture@pi@vcs0.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb1/igt@gem_exec_capture@pi@vcs0.html
    - shard-skl:          NOTRUN -> [INCOMPLETE][24] ([i915#4547])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl3/igt@gem_exec_capture@pi@vcs0.html

  * igt@gem_exec_capture@pi@vecs0:
    - shard-tglb:         NOTRUN -> [INCOMPLETE][25] ([i915#1373] / [i915#3371])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb5/igt@gem_exec_capture@pi@vecs0.html

  * igt@gem_exec_fair@basic-flow@rcs0:
    - shard-skl:          NOTRUN -> [SKIP][26] ([fdo#109271]) +191 similar issues
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl1/igt@gem_exec_fair@basic-flow@rcs0.html

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-apl:          [PASS][27] -> [FAIL][28] ([i915#2842])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-apl1/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-apl7/igt@gem_exec_fair@basic-pace-share@rcs0.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-glk:          [PASS][29] -> [FAIL][30] ([i915#2842]) +1 similar issue
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-glk6/igt@gem_exec_fair@basic-pace@vecs0.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk2/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_lmem_swapping@parallel-random:
    - shard-skl:          NOTRUN -> [SKIP][31] ([fdo#109271] / [i915#4613]) +2 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl1/igt@gem_lmem_swapping@parallel-random.html
    - shard-iclb:         NOTRUN -> [SKIP][32] ([i915#4613])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@gem_lmem_swapping@parallel-random.html

  * igt@gem_lmem_swapping@random:
    - shard-glk:          NOTRUN -> [SKIP][33] ([fdo#109271] / [i915#4613])
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@gem_lmem_swapping@random.html

  * igt@gem_pwrite@basic-exhaustion:
    - shard-skl:          NOTRUN -> [WARN][34] ([i915#2658])
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl10/igt@gem_pwrite@basic-exhaustion.html

  * igt@gem_pxp@create-regular-context-1:
    - shard-tglb:         NOTRUN -> [SKIP][35] ([i915#4270])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb8/igt@gem_pxp@create-regular-context-1.html

  * igt@gem_render_copy@yf-tiled-mc-ccs-to-vebox-yf-tiled:
    - shard-iclb:         NOTRUN -> [SKIP][36] ([i915#768]) +2 similar issues
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@gem_render_copy@yf-tiled-mc-ccs-to-vebox-yf-tiled.html

  * igt@gem_softpin@allocator-evict-all-engines:
    - shard-glk:          [PASS][37] -> [FAIL][38] ([i915#4171])
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-glk1/igt@gem_softpin@allocator-evict-all-engines.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk4/igt@gem_softpin@allocator-evict-all-engines.html

  * igt@gem_userptr_blits@create-destroy-unsync:
    - shard-tglb:         NOTRUN -> [SKIP][39] ([i915#3297])
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@gem_userptr_blits@create-destroy-unsync.html
    - shard-iclb:         NOTRUN -> [SKIP][40] ([i915#3297])
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@gem_userptr_blits@create-destroy-unsync.html

  * igt@gem_userptr_blits@input-checking:
    - shard-skl:          NOTRUN -> [DMESG-WARN][41] ([i915#4991])
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl3/igt@gem_userptr_blits@input-checking.html

  * igt@gem_userptr_blits@vma-merge:
    - shard-skl:          NOTRUN -> [FAIL][42] ([i915#3318])
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl6/igt@gem_userptr_blits@vma-merge.html

  * igt@gem_workarounds@suspend-resume:
    - shard-tglb:         [PASS][43] -> [INCOMPLETE][44] ([i915#456])
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-tglb3/igt@gem_workarounds@suspend-resume.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb1/igt@gem_workarounds@suspend-resume.html

  * igt@gen3_render_tiledy_blits:
    - shard-tglb:         NOTRUN -> [SKIP][45] ([fdo#109289]) +2 similar issues
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@gen3_render_tiledy_blits.html
    - shard-iclb:         NOTRUN -> [SKIP][46] ([fdo#109289])
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@gen3_render_tiledy_blits.html

  * igt@gen9_exec_parse@secure-batches:
    - shard-iclb:         NOTRUN -> [SKIP][47] ([i915#2856])
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@gen9_exec_parse@secure-batches.html
    - shard-tglb:         NOTRUN -> [SKIP][48] ([i915#2527] / [i915#2856])
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@gen9_exec_parse@secure-batches.html

  * igt@i915_pm_dc@dc6-dpms:
    - shard-skl:          NOTRUN -> [FAIL][49] ([i915#454])
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl6/igt@i915_pm_dc@dc6-dpms.html

  * igt@i915_pm_rpm@modeset-pc8-residency-stress:
    - shard-apl:          NOTRUN -> [SKIP][50] ([fdo#109271]) +46 similar issues
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-apl4/igt@i915_pm_rpm@modeset-pc8-residency-stress.html

  * igt@kms_big_fb@x-tiled-8bpp-rotate-90:
    - shard-iclb:         NOTRUN -> [SKIP][51] ([fdo#110725] / [fdo#111614])
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_big_fb@x-tiled-8bpp-rotate-90.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-32bpp-rotate-180-async-flip:
    - shard-skl:          NOTRUN -> [FAIL][52] ([i915#3743])
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl5/igt@kms_big_fb@x-tiled-max-hw-stride-32bpp-rotate-180-async-flip.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-0-hflip-async-flip:
    - shard-glk:          NOTRUN -> [SKIP][53] ([fdo#109271] / [i915#3777])
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-0-hflip-async-flip.html

  * igt@kms_big_fb@y-tiled-32bpp-rotate-0:
    - shard-glk:          [PASS][54] -> [DMESG-WARN][55] ([i915#118])
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-glk6/igt@kms_big_fb@y-tiled-32bpp-rotate-0.html
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk2/igt@kms_big_fb@y-tiled-32bpp-rotate-0.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-180-hflip-async-flip:
    - shard-skl:          NOTRUN -> [SKIP][56] ([fdo#109271] / [i915#3777]) +2 similar issues
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl5/igt@kms_big_fb@y-tiled-max-hw-stride-32bpp-rotate-180-hflip-async-flip.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-0-async-flip:
    - shard-iclb:         NOTRUN -> [SKIP][57] ([fdo#110723])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-0-async-flip.html
    - shard-tglb:         NOTRUN -> [SKIP][58] ([fdo#111615])
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-0-async-flip.html

  * igt@kms_ccs@pipe-a-bad-aux-stride-y_tiled_gen12_rc_ccs_cc:
    - shard-glk:          NOTRUN -> [SKIP][59] ([fdo#109271] / [i915#3886]) +1 similar issue
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@kms_ccs@pipe-a-bad-aux-stride-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-a-crc-primary-rotation-180-yf_tiled_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][60] ([fdo#111615] / [i915#3689])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@kms_ccs@pipe-a-crc-primary-rotation-180-yf_tiled_ccs.html

  * igt@kms_ccs@pipe-b-ccs-on-another-bo-y_tiled_gen12_mc_ccs:
    - shard-skl:          NOTRUN -> [SKIP][61] ([fdo#109271] / [i915#3886]) +11 similar issues
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl6/igt@kms_ccs@pipe-b-ccs-on-another-bo-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-c-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs_cc:
    - shard-apl:          NOTRUN -> [SKIP][62] ([fdo#109271] / [i915#3886])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-apl4/igt@kms_ccs@pipe-c-crc-sprite-planes-basic-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-c-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc:
    - shard-iclb:         NOTRUN -> [SKIP][63] ([fdo#109278] / [i915#3886])
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_ccs@pipe-c-missing-ccs-buffer-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_chamelium@dp-mode-timings:
    - shard-glk:          NOTRUN -> [SKIP][64] ([fdo#109271] / [fdo#111827]) +3 similar issues
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@kms_chamelium@dp-mode-timings.html

  * igt@kms_chamelium@hdmi-hpd:
    - shard-skl:          NOTRUN -> [SKIP][65] ([fdo#109271] / [fdo#111827]) +12 similar issues
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl5/igt@kms_chamelium@hdmi-hpd.html

  * igt@kms_chamelium@vga-hpd-enable-disable-mode:
    - shard-iclb:         NOTRUN -> [SKIP][66] ([fdo#109284] / [fdo#111827]) +1 similar issue
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_chamelium@vga-hpd-enable-disable-mode.html

  * igt@kms_color_chamelium@pipe-a-ctm-0-5:
    - shard-kbl:          NOTRUN -> [SKIP][67] ([fdo#109271] / [fdo#111827])
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-kbl3/igt@kms_color_chamelium@pipe-a-ctm-0-5.html

  * igt@kms_color_chamelium@pipe-d-ctm-max:
    - shard-iclb:         NOTRUN -> [SKIP][68] ([fdo#109278] / [fdo#109284] / [fdo#111827])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_color_chamelium@pipe-d-ctm-max.html

  * igt@kms_color_chamelium@pipe-d-ctm-negative:
    - shard-apl:          NOTRUN -> [SKIP][69] ([fdo#109271] / [fdo#111827]) +3 similar issues
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-apl4/igt@kms_color_chamelium@pipe-d-ctm-negative.html

  * igt@kms_cursor_crc@pipe-a-cursor-512x512-offscreen:
    - shard-iclb:         NOTRUN -> [SKIP][70] ([fdo#109278] / [fdo#109279]) +2 similar issues
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@kms_cursor_crc@pipe-a-cursor-512x512-offscreen.html

  * igt@kms_cursor_crc@pipe-a-cursor-suspend:
    - shard-kbl:          [PASS][71] -> [DMESG-WARN][72] ([i915#180]) +2 similar issues
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-kbl6/igt@kms_cursor_crc@pipe-a-cursor-suspend.html
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-kbl7/igt@kms_cursor_crc@pipe-a-cursor-suspend.html

  * igt@kms_cursor_crc@pipe-b-cursor-32x10-random:
    - shard-kbl:          NOTRUN -> [SKIP][73] ([fdo#109271]) +14 similar issues
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-kbl3/igt@kms_cursor_crc@pipe-b-cursor-32x10-random.html

  * igt@kms_cursor_crc@pipe-b-cursor-512x512-rapid-movement:
    - shard-tglb:         NOTRUN -> [SKIP][74] ([fdo#109279] / [i915#3359]) +2 similar issues
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@kms_cursor_crc@pipe-b-cursor-512x512-rapid-movement.html

  * igt@kms_cursor_crc@pipe-b-cursor-suspend:
    - shard-skl:          [PASS][75] -> [INCOMPLETE][76] ([i915#300])
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl9/igt@kms_cursor_crc@pipe-b-cursor-suspend.html
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl9/igt@kms_cursor_crc@pipe-b-cursor-suspend.html

  * igt@kms_cursor_crc@pipe-d-cursor-512x512-random:
    - shard-iclb:         NOTRUN -> [SKIP][77] ([fdo#109278]) +1 similar issue
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_cursor_crc@pipe-d-cursor-512x512-random.html

  * igt@kms_cursor_crc@pipe-d-cursor-max-size-offscreen:
    - shard-tglb:         NOTRUN -> [SKIP][78] ([i915#3359])
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb8/igt@kms_cursor_crc@pipe-d-cursor-max-size-offscreen.html

  * igt@kms_cursor_legacy@2x-long-nonblocking-modeset-vs-cursor-atomic:
    - shard-iclb:         NOTRUN -> [SKIP][79] ([fdo#109274] / [fdo#109278]) +1 similar issue
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@kms_cursor_legacy@2x-long-nonblocking-modeset-vs-cursor-atomic.html
    - shard-tglb:         NOTRUN -> [SKIP][80] ([fdo#109274] / [fdo#111825]) +2 similar issues
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb3/igt@kms_cursor_legacy@2x-long-nonblocking-modeset-vs-cursor-atomic.html

  * igt@kms_flip@2x-flip-vs-absolute-wf_vblank-interruptible@ac-hdmi-a1-hdmi-a2:
    - shard-glk:          NOTRUN -> [FAIL][81] ([i915#2122])
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@kms_flip@2x-flip-vs-absolute-wf_vblank-interruptible@ac-hdmi-a1-hdmi-a2.html

  * igt@kms_flip@2x-nonexisting-fb:
    - shard-iclb:         NOTRUN -> [SKIP][82] ([fdo#109274]) +2 similar issues
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_flip@2x-nonexisting-fb.html

  * igt@kms_flip@flip-vs-suspend-interruptible@c-dp1:
    - shard-apl:          [PASS][83] -> [DMESG-WARN][84] ([i915#180]) +2 similar issues
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-apl8/igt@kms_flip@flip-vs-suspend-interruptible@c-dp1.html
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-apl8/igt@kms_flip@flip-vs-suspend-interruptible@c-dp1.html

  * igt@kms_flip@flip-vs-suspend@a-dp1:
    - shard-kbl:          [PASS][85] -> [INCOMPLETE][86] ([i915#180] / [i915#3614])
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-kbl4/igt@kms_flip@flip-vs-suspend@a-dp1.html
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-kbl7/igt@kms_flip@flip-vs-suspend@a-dp1.html

  * igt@kms_flip@plain-flip-ts-check-interruptible@c-edp1:
    - shard-skl:          [PASS][87] -> [FAIL][88] ([i915#2122]) +1 similar issue
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl1/igt@kms_flip@plain-flip-ts-check-interruptible@c-edp1.html
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl7/igt@kms_flip@plain-flip-ts-check-interruptible@c-edp1.html

  * igt@kms_flip@plain-flip-ts-check@a-edp1:
    - shard-skl:          NOTRUN -> [FAIL][89] ([i915#2122])
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl10/igt@kms_flip@plain-flip-ts-check@a-edp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-upscaling:
    - shard-glk:          [PASS][90] -> [FAIL][91] ([i915#4911])
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-glk1/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-upscaling.html
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk8/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs-upscaling.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-64bpp-ytile-downscaling:
    - shard-iclb:         [PASS][92] -> [SKIP][93] ([i915#3701])
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-iclb7/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-64bpp-ytile-downscaling.html
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb2/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-64bpp-ytile-downscaling.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-render:
    - shard-iclb:         NOTRUN -> [SKIP][94] ([fdo#109280]) +6 similar issues
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-indfb-draw-render.html

  * igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-spr-indfb-draw-render:
    - shard-tglb:         NOTRUN -> [SKIP][95] ([fdo#109280] / [fdo#111825]) +6 similar issues
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb8/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-spr-indfb-draw-render.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-basic:
    - shard-glk:          NOTRUN -> [FAIL][96] ([fdo#108145] / [i915#265])
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@kms_plane_alpha_blend@pipe-a-alpha-basic.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb:
    - shard-skl:          NOTRUN -> [FAIL][97] ([i915#265])
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl5/igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb.html

  * igt@kms_plane_alpha_blend@pipe-a-constant-alpha-max:
    - shard-apl:          NOTRUN -> [FAIL][98] ([fdo#108145] / [i915#265])
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-apl4/igt@kms_plane_alpha_blend@pipe-a-constant-alpha-max.html

  * igt@kms_plane_alpha_blend@pipe-b-coverage-7efc:
    - shard-skl:          [PASS][99] -> [FAIL][100] ([fdo#108145] / [i915#265])
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl1/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl7/igt@kms_plane_alpha_blend@pipe-b-coverage-7efc.html

  * igt@kms_plane_alpha_blend@pipe-c-alpha-opaque-fb:
    - shard-skl:          NOTRUN -> [FAIL][101] ([fdo#108145] / [i915#265]) +1 similar issue
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl6/igt@kms_plane_alpha_blend@pipe-c-alpha-opaque-fb.html

  * igt@kms_psr2_su@frontbuffer-xrgb8888:
    - shard-skl:          NOTRUN -> [SKIP][102] ([fdo#109271] / [i915#658]) +1 similar issue
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl6/igt@kms_psr2_su@frontbuffer-xrgb8888.html

  * igt@kms_psr@psr2_primary_mmap_gtt:
    - shard-glk:          NOTRUN -> [SKIP][103] ([fdo#109271]) +40 similar issues
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@kms_psr@psr2_primary_mmap_gtt.html

  * igt@kms_psr@psr2_sprite_mmap_gtt:
    - shard-iclb:         [PASS][104] -> [SKIP][105] ([fdo#109441]) +2 similar issues
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-iclb2/igt@kms_psr@psr2_sprite_mmap_gtt.html
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb6/igt@kms_psr@psr2_sprite_mmap_gtt.html

  * igt@kms_writeback@writeback-invalid-parameters:
    - shard-skl:          NOTRUN -> [SKIP][106] ([fdo#109271] / [i915#2437])
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl10/igt@kms_writeback@writeback-invalid-parameters.html

  * igt@nouveau_crc@pipe-b-ctx-flip-detection:
    - shard-tglb:         NOTRUN -> [SKIP][107] ([i915#2530]) +1 similar issue
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb8/igt@nouveau_crc@pipe-b-ctx-flip-detection.html

  * igt@perf@polling-small-buf:
    - shard-skl:          [PASS][108] -> [FAIL][109] ([i915#1722])
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl9/igt@perf@polling-small-buf.html
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl1/igt@perf@polling-small-buf.html

  * igt@prime_nv_api@nv_i915_import_twice_check_flink_name:
    - shard-iclb:         NOTRUN -> [SKIP][110] ([fdo#109291])
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@prime_nv_api@nv_i915_import_twice_check_flink_name.html

  * igt@syncobj_timeline@transfer-timeline-point:
    - shard-glk:          NOTRUN -> [DMESG-FAIL][111] ([i915#5098])
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk5/igt@syncobj_timeline@transfer-timeline-point.html

  * igt@sysfs_clients@busy:
    - shard-skl:          NOTRUN -> [SKIP][112] ([fdo#109271] / [i915#2994]) +2 similar issues
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl1/igt@sysfs_clients@busy.html

  * igt@sysfs_clients@sema-50:
    - shard-iclb:         NOTRUN -> [SKIP][113] ([i915#2994]) +1 similar issue
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb8/igt@sysfs_clients@sema-50.html

  * igt@sysfs_clients@split-25:
    - shard-apl:          NOTRUN -> [SKIP][114] ([fdo#109271] / [i915#2994])
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-apl4/igt@sysfs_clients@split-25.html

  * igt@sysfs_heartbeat_interval@mixed@rcs0:
    - shard-skl:          [PASS][115] -> [FAIL][116] ([i915#1731]) +1 similar issue
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl10/igt@sysfs_heartbeat_interval@mixed@rcs0.html
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl2/igt@sysfs_heartbeat_interval@mixed@rcs0.html

  
#### Possible fixes ####

  * igt@fbdev@unaligned-read:
    - {shard-rkl}:        ([SKIP][117], [SKIP][118]) ([i915#2582]) -> [PASS][119]
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-rkl-5/igt@fbdev@unaligned-read.html
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-rkl-4/igt@fbdev@unaligned-read.html
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-rkl-6/igt@fbdev@unaligned-read.html

  * igt@gem_ctx_persistence@many-contexts:
    - {shard-rkl}:        [FAIL][120] ([i915#2410]) -> [PASS][121]
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-rkl-5/igt@gem_ctx_persistence@many-contexts.html
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-rkl-2/igt@gem_ctx_persistence@many-contexts.html

  * igt@gem_ctx_persistence@smoketest:
    - shard-iclb:         [FAIL][122] -> [PASS][123]
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-iclb5/igt@gem_ctx_persistence@smoketest.html
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-iclb1/igt@gem_ctx_persistence@smoketest.html

  * igt@gem_eio@in-flight-contexts-1us:
    - {shard-rkl}:        [TIMEOUT][124] ([i915#3063]) -> [PASS][125]
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-rkl-5/igt@gem_eio@in-flight-contexts-1us.html
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-rkl-2/igt@gem_eio@in-flight-contexts-1us.html

  * igt@gem_exec_capture@pi@rcs0:
    - shard-tglb:         [INCOMPLETE][126] ([i915#3371]) -> [PASS][127]
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-tglb5/igt@gem_exec_capture@pi@rcs0.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-tglb5/igt@gem_exec_capture@pi@rcs0.html
    - shard-skl:          [INCOMPLETE][128] ([i915#4547]) -> [PASS][129]
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-skl4/igt@gem_exec_capture@pi@rcs0.html
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-skl3/igt@gem_exec_capture@pi@rcs0.html
    - {shard-rkl}:        [INCOMPLETE][130] ([i915#3371]) -> [PASS][131]
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-rkl-5/igt@gem_exec_capture@pi@rcs0.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-rkl-2/igt@gem_exec_capture@pi@rcs0.html

  * igt@gem_exec_fair@basic-deadline:
    - shard-glk:          [FAIL][132] ([i915#2846]) -> [PASS][133]
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-glk5/igt@gem_exec_fair@basic-deadline.html
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/shard-glk1/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-none-share@rcs0:
    - {shard-tglu}:       [FAIL][134] ([i915#2842]) -> [PASS][135]
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11346/shard-tglu-2/igt@gem_exec_fair@b

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_22523/index.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-07 20:21   ` [Intel-gfx] " Vivek Kasireddy
@ 2022-03-11  9:39     ` Daniel Vetter
  -1 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-11  9:39 UTC (permalink / raw)
  To: Vivek Kasireddy; +Cc: tvrtko.ursulin, intel-gfx, dri-devel

On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
>
> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> more framebuffers/scanout buffers results in only one that is mappable/
> fenceable. Therefore, pageflipping between these 2 FBs where only one
> is mappable/fenceable creates latencies large enough to miss alternate
> vblanks thereby producing less optimal framerate.
>
> This mainly happens because when i915_gem_object_pin_to_display_plane()
> is called to pin one of the FB objs, the associated vma is identified
> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> evicts it. This misplaced vma gets subsequently pinned only when
> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> Therefore, to fix this issue, we try to see if there is space to map
> at-least two objects of a given size and return early if there isn't. This
> would ensure that we do not try with PIN_MAPPABLE for any objects that
> are too big to map thereby preventing unnecessary unbind.
>
> Testcase:
> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> a frame ~7ms before the next vblank, the latencies seen between atomic
> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> it misses the vblank every other frame.
>
> Here is the ftrace snippet that shows the source of the ~10ms latency:
>               i915_gem_object_pin_to_display_plane() {
> 0.102 us   |    i915_gem_object_set_cache_level();
>                 i915_gem_object_ggtt_pin_ww() {
> 0.390 us   |      i915_vma_instance();
> 0.178 us   |      i915_vma_misplaced();
>                   i915_vma_unbind() {
>                   __i915_active_wait() {
> 0.082 us   |        i915_active_acquire_if_busy();
> 0.475 us   |      }
>                   intel_runtime_pm_get() {
> 0.087 us   |        intel_runtime_pm_acquire();
> 0.259 us   |      }
>                   __i915_active_wait() {
> 0.085 us   |        i915_active_acquire_if_busy();
> 0.240 us   |      }
>                   __i915_vma_evict() {
>                     ggtt_unbind_vma() {
>                       gen8_ggtt_clear_range() {
> 10507.255 us |        }
> 10507.689 us |      }
> 10508.516 us |   }
>
> v2: Instead of using bigjoiner checks, determine whether a scanout
>     buffer is too big by checking to see if it is possible to map
>     two of them into the ggtt.
>
> v3 (Ville):
> - Count how many fb objects can be fit into the available holes
>   instead of checking for a hole twice the object size.
> - Take alignment constraints into account.
> - Limit this large scanout buffer check to >= Gen 11 platforms.
>
> v4:
> - Remove existing heuristic that checks just for size. (Ville)
> - Return early if we find space to map at-least two objects. (Tvrtko)
> - Slightly update the commit message.
>
> v5: (Tvrtko)
> - Rename the function to indicate that the object may be too big to
>   map into the aperture.
> - Account for guard pages while calculating the total size required
>   for the object.
> - Do not subject all objects to the heuristic check and instead
>   consider objects only of a certain size.
> - Do the hole walk using the rbtree.
> - Preserve the existing PIN_NONBLOCK logic.
> - Drop the PIN_MAPPABLE check while pinning the VMA.
>
> v6: (Tvrtko)
> - Return 0 on success and the specific error code on failure to
>   preserve the existing behavior.
>
> v7: (Ville)
> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>   size < ggtt->mappable_end / 4 checks.
> - Drop the redundant check that is based on previous heuristic.
>
> v8:
> - Make sure that we are holding the mutex associated with ggtt vm
>   as we traverse the hole nodes.
>
> v9: (Tvrtko)
> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Manasi Navare <manasi.d.navare@intel.com>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
>  1 file changed, 94 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 9747924cc57b..e0d731b3f215 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -49,6 +49,7 @@
>  #include "gem/i915_gem_pm.h"
>  #include "gem/i915_gem_region.h"
>  #include "gem/i915_gem_userptr.h"
> +#include "gem/i915_gem_tiling.h"
>  #include "gt/intel_engine_user.h"
>  #include "gt/intel_gt.h"
>  #include "gt/intel_gt_pm.h"
> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
>         spin_unlock(&obj->vma.lock);
>  }
>
> +static int
> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> +                                u64 alignment, u64 flags)

Tvrtko asked me to ack the first patch, but then I looked at this and
started wondering.

Conceptually this doesn't pass the smell test. What if we have
multiple per-crtc buffers? Multiple planes on the same crtc? What if
the app does triple buffering? You'll be forever busy tuning this
heuristic, which I think fundamentally can't be fixed. The old "half
of mappable" heuristic isn't really better, but at least it was dead
simple.

Imo what we need here is a change in approach:
1. Check whether a usable view for scanout already exists. If yes,
use that. This should avoid the constant unbinding stalls.
2. Try to pin the buffer as mappable, but without evicting anything (so
not the non-blocking thing).
3. Pin the buffer with the most lenient approach.
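
The three steps above can be sketched as a toy model. This is only an
illustration under stated assumptions: every type and function name
below (toy_buffer, toy_aperture, toy_pin_for_scanout) is hypothetical
and does not exist in i915, and eviction is modelled simply as "never
done", so a buffer lands in the mappable range only if free space is
already there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes; not i915 return values. */
enum pin_result {
	PIN_REUSED_EXISTING,   /* step 1: an existing usable view was kept */
	PIN_MAPPABLE_NO_EVICT, /* step 2: fit into free mappable space */
	PIN_LENIENT,           /* step 3: placed anywhere it fits */
	PIN_FAILED,
};

/* Hypothetical stand-in for a scanout buffer. */
struct toy_buffer {
	bool has_usable_scanout_view;
	size_t size;
};

/* Hypothetical free-space counters; total_free includes mappable_free. */
struct toy_aperture {
	size_t mappable_free;
	size_t total_free;
};

static enum pin_result
toy_pin_for_scanout(struct toy_buffer *buf, struct toy_aperture *ap)
{
	/* Step 1: never unbind a view that is already good enough. */
	if (buf->has_usable_scanout_view)
		return PIN_REUSED_EXISTING;

	/* Step 2: take mappable space only if it is free right now. */
	if (buf->size <= ap->mappable_free) {
		ap->mappable_free -= buf->size;
		ap->total_free -= buf->size;
		buf->has_usable_scanout_view = true;
		return PIN_MAPPABLE_NO_EVICT;
	}

	/* Step 3: most lenient placement, outside the mappable range. */
	if (buf->size <= ap->total_free) {
		ap->total_free -= buf->size;
		buf->has_usable_scanout_view = true;
		return PIN_LENIENT;
	}

	return PIN_FAILED;
}
```

In this model, two large buffers flipped alternately settle after their
first pin: whichever did not fit into the mappable range stays where
step 3 put it, and later flips hit step 1 instead of unbinding.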

Even the non-blocking interim stage is dangerous, since it'll just
result in other buffers (e.g. when triple-buffering) getting unbound
and we're back to the same stall. Note that this could have an impact
on cpu rendering compositors, where we might end up relying a lot more
on partial views. But as long as we are a tad more aggressive (i.e. the
non-blocking binding) in the mmap path that should work out to keep
everything balanced, since usually you render first before you display
anything. And so the buffer should end up in the ideal place.

I'd first try skipping step 2, since I think it'll require a bit of
work, and frankly I don't think we care about the potential fallout.
-Daniel

> +{
> +       struct drm_i915_private *i915 = to_i915(obj->base.dev);
> +       struct i915_ggtt *ggtt = to_gt(i915)->ggtt;
> +       struct drm_mm_node *hole;
> +       u64 hole_start, hole_end, start, end;
> +       u64 fence_size, fence_alignment;
> +       unsigned int count = 0;
> +       int err = 0;
> +
> +       /*
> +        * If the required space is larger than the available
> +        * aperture, we will not able to find a slot for the
> +        * object and unbinding the object now will be in
> +        * vain. Worse, doing so may cause us to ping-pong
> +        * the object in and out of the Global GTT and
> +        * waste a lot of cycles under the mutex.
> +        */
> +       if (obj->base.size > ggtt->mappable_end)
> +               return -E2BIG;
> +
> +       /*
> +        * If NONBLOCK is set the caller is optimistically
> +        * trying to cache the full object within the mappable
> +        * aperture, and *must* have a fallback in place for
> +        * situations where we cannot bind the object. We
> +        * can be a little more lax here and use the fallback
> +        * more often to avoid costly migrations of ourselves
> +        * and other objects within the aperture.
> +        */
> +       if (!(flags & PIN_NONBLOCK))
> +               return 0;
> +
> +       /*
> +        * Other objects such as batchbuffers are fairly small compared
> +        * to FBs and are unlikely to exahust the aperture space.
> +        * Therefore, return early if this obj is not an FB.
> +        */
> +       if (!i915_gem_object_is_framebuffer(obj))
> +               return 0;
> +
> +       fence_size = i915_gem_fence_size(i915, obj->base.size,
> +                                        i915_gem_object_get_tiling(obj),
> +                                        i915_gem_object_get_stride(obj));
> +
> +       if (i915_vm_has_cache_coloring(&ggtt->vm))
> +               fence_size += 2 * I915_GTT_PAGE_SIZE;
> +
> +       fence_alignment = i915_gem_fence_alignment(i915, obj->base.size,
> +                                                  i915_gem_object_get_tiling(obj),
> +                                                  i915_gem_object_get_stride(obj));
> +       alignment = max_t(u64, alignment, fence_alignment);
> +
> +       err = mutex_lock_interruptible_nested(&ggtt->vm.mutex, 0);
> +       if (err)
> +               return err;
> +
> +       /*
> +        * Assuming this object is a large scanout buffer, we try to find
> +        * out if there is room to map at-least two of them. There could
> +        * be space available to map one but to be consistent, we try to
> +        * avoid mapping/fencing any of them.
> +        */
> +       drm_mm_for_each_suitable_hole(hole, &ggtt->vm.mm, 0, ggtt->mappable_end,
> +                                     fence_size, DRM_MM_INSERT_LOW) {
> +               hole_start = drm_mm_hole_node_start(hole);
> +               hole_end = hole_start + hole->hole_size;
> +
> +               do {
> +                       start = round_up(hole_start, alignment);
> +                       end = min_t(u64, hole_end, ggtt->mappable_end);
> +
> +                       if (range_overflows(start, fence_size, end))
> +                               break;
> +
> +                       if (++count >= 2) {
> +                               mutex_unlock(&ggtt->vm.mutex);
> +                               return 0;
> +                       }
> +
> +                       hole_start = start + fence_size;
> +               } while (1);
> +       }
> +
> +       mutex_unlock(&ggtt->vm.mutex);
> +       return -ENOSPC;
> +}
> +
>  struct i915_vma *
>  i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>                             struct i915_gem_ww_ctx *ww,
> @@ -897,36 +988,9 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>
>         if (flags & PIN_MAPPABLE &&
>             (!view || view->type == I915_GGTT_VIEW_NORMAL)) {
> -               /*
> -                * If the required space is larger than the available
> -                * aperture, we will not able to find a slot for the
> -                * object and unbinding the object now will be in
> -                * vain. Worse, doing so may cause us to ping-pong
> -                * the object in and out of the Global GTT and
> -                * waste a lot of cycles under the mutex.
> -                */
> -               if (obj->base.size > ggtt->mappable_end)
> -                       return ERR_PTR(-E2BIG);
> -
> -               /*
> -                * If NONBLOCK is set the caller is optimistically
> -                * trying to cache the full object within the mappable
> -                * aperture, and *must* have a fallback in place for
> -                * situations where we cannot bind the object. We
> -                * can be a little more lax here and use the fallback
> -                * more often to avoid costly migrations of ourselves
> -                * and other objects within the aperture.
> -                *
> -                * Half-the-aperture is used as a simple heuristic.
> -                * More interesting would to do search for a free
> -                * block prior to making the commitment to unbind.
> -                * That caters for the self-harm case, and with a
> -                * little more heuristics (e.g. NOFAULT, NOEVICT)
> -                * we could try to minimise harm to others.
> -                */
> -               if (flags & PIN_NONBLOCK &&
> -                   obj->base.size > ggtt->mappable_end / 2)
> -                       return ERR_PTR(-ENOSPC);
> +               ret = i915_gem_object_fits_in_aperture(obj, alignment, flags);
> +               if (ret)
> +                       return ERR_PTR(ret);
>         }
>
>  new_vma:
> @@ -938,10 +1002,6 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>                 if (flags & PIN_NONBLOCK) {
>                         if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
>                                 return ERR_PTR(-ENOSPC);
> -
> -                       if (flags & PIN_MAPPABLE &&
> -                           vma->fence_size > ggtt->mappable_end / 2)
> -                               return ERR_PTR(-ENOSPC);
>                 }
>
>                 if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma)) {
> --
> 2.35.1
>
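
For reference, the counting done by the hole walk in the quoted
i915_gem_object_fits_in_aperture() can be sketched in isolation. This
is a simplified userspace model, not the kernel code: the holes are a
plain array rather than a drm_mm rbtree, there is no locking, the
round-up is assumed not to overflow, and all names (toy_hole,
toy_round_up, toy_fits_n) are hypothetical. The alignment must be
non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free range: [start, start + size). */
struct toy_hole {
	uint64_t start;
	uint64_t size;
};

/* Round x up to a multiple of align (align must be non-zero). */
static uint64_t toy_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) / align * align;
}

/*
 * Return 1 if at least `want` aligned objects of obj_size fit into the
 * parts of the holes that lie below `limit`, 0 otherwise.
 */
static int toy_fits_n(const struct toy_hole *holes, size_t nholes,
		      uint64_t limit, uint64_t obj_size,
		      uint64_t alignment, unsigned int want)
{
	unsigned int count = 0;
	size_t i;

	if (want == 0)
		return 1;

	for (i = 0; i < nholes; i++) {
		uint64_t pos = holes[i].start;
		uint64_t end = holes[i].start + holes[i].size;

		/* Clip the hole to the mappable limit. */
		if (end > limit)
			end = limit;

		for (;;) {
			uint64_t start = toy_round_up(pos, alignment);

			/* No room left in this hole for another object. */
			if (start > end || end - start < obj_size)
				break;

			if (++count >= want)
				return 1;

			/* Try to place the next object right after it. */
			pos = start + obj_size;
		}
	}

	return 0;
}
```

The model also shows why the result depends on what else is bound at
that moment: the same object size can pass or fail as the hole layout
changes.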


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-11  9:39     ` Daniel Vetter
  (?)
@ 2022-03-14 11:14     ` Tvrtko Ursulin
  2022-03-15  7:28         ` Kasireddy, Vivek
  -1 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2022-03-14 11:14 UTC (permalink / raw)
  To: Daniel Vetter, Vivek Kasireddy; +Cc: intel-gfx, dri-devel


On 11/03/2022 09:39, Daniel Vetter wrote:
> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
>>
>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
>> more framebuffers/scanout buffers results in only one that is mappable/
>> fenceable. Therefore, pageflipping between these 2 FBs where only one
>> is mappable/fenceable creates latencies large enough to miss alternate
>> vblanks thereby producing less optimal framerate.
>>
>> This mainly happens because when i915_gem_object_pin_to_display_plane()
>> is called to pin one of the FB objs, the associated vma is identified
>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
>> evicts it. This misplaced vma gets subsequently pinned only when
>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
>> Therefore, to fix this issue, we try to see if there is space to map
>> at-least two objects of a given size and return early if there isn't. This
>> would ensure that we do not try with PIN_MAPPABLE for any objects that
>> are too big to map thereby preventing unnecessary unbind.
>>
>> Testcase:
>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
>> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
>> a frame ~7ms before the next vblank, the latencies seen between atomic
>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
>> it misses the vblank every other frame.
>>
>> Here is the ftrace snippet that shows the source of the ~10ms latency:
>>                i915_gem_object_pin_to_display_plane() {
>> 0.102 us   |    i915_gem_object_set_cache_level();
>>                  i915_gem_object_ggtt_pin_ww() {
>> 0.390 us   |      i915_vma_instance();
>> 0.178 us   |      i915_vma_misplaced();
>>                    i915_vma_unbind() {
>>                    __i915_active_wait() {
>> 0.082 us   |        i915_active_acquire_if_busy();
>> 0.475 us   |      }
>>                    intel_runtime_pm_get() {
>> 0.087 us   |        intel_runtime_pm_acquire();
>> 0.259 us   |      }
>>                    __i915_active_wait() {
>> 0.085 us   |        i915_active_acquire_if_busy();
>> 0.240 us   |      }
>>                    __i915_vma_evict() {
>>                      ggtt_unbind_vma() {
>>                        gen8_ggtt_clear_range() {
>> 10507.255 us |        }
>> 10507.689 us |      }
>> 10508.516 us |   }
>>
>> v2: Instead of using bigjoiner checks, determine whether a scanout
>>      buffer is too big by checking to see if it is possible to map
>>      two of them into the ggtt.
>>
>> v3 (Ville):
>> - Count how many fb objects can be fit into the available holes
>>    instead of checking for a hole twice the object size.
>> - Take alignment constraints into account.
>> - Limit this large scanout buffer check to >= Gen 11 platforms.
>>
>> v4:
>> - Remove existing heuristic that checks just for size. (Ville)
>> - Return early if we find space to map at-least two objects. (Tvrtko)
>> - Slightly update the commit message.
>>
>> v5: (Tvrtko)
>> - Rename the function to indicate that the object may be too big to
>>    map into the aperture.
>> - Account for guard pages while calculating the total size required
>>    for the object.
>> - Do not subject all objects to the heuristic check and instead
>>    consider objects only of a certain size.
>> - Do the hole walk using the rbtree.
>> - Preserve the existing PIN_NONBLOCK logic.
>> - Drop the PIN_MAPPABLE check while pinning the VMA.
>>
>> v6: (Tvrtko)
>> - Return 0 on success and the specific error code on failure to
>>    preserve the existing behavior.
>>
>> v7: (Ville)
>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>>    size < ggtt->mappable_end / 4 checks.
>> - Drop the redundant check that is based on previous heuristic.
>>
>> v8:
>> - Make sure that we are holding the mutex associated with ggtt vm
>>    as we traverse the hole nodes.
>>
>> v9: (Tvrtko)
>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
>>
>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> Cc: Manasi Navare <manasi.d.navare@intel.com>
>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
>>   1 file changed, 94 insertions(+), 34 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 9747924cc57b..e0d731b3f215 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -49,6 +49,7 @@
>>   #include "gem/i915_gem_pm.h"
>>   #include "gem/i915_gem_region.h"
>>   #include "gem/i915_gem_userptr.h"
>> +#include "gem/i915_gem_tiling.h"
>>   #include "gt/intel_engine_user.h"
>>   #include "gt/intel_gt.h"
>>   #include "gt/intel_gt_pm.h"
>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
>>          spin_unlock(&obj->vma.lock);
>>   }
>>
>> +static int
>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
>> +                                u64 alignment, u64 flags)
> 
> Tvrtko asked me to ack the first patch, but then I looked at this and
> started wondering.
> 
> Conceptually this doesn't pass the smell test. What if we have
> multiple per-crtc buffers? Multiple planes on the same crtc? What if
> the app does triple buffer? You'll be forever busy tuning this
> heuristics, which can't fundamentally be fixed I think. The old "half
> of mappable" heuristic isn't really better, but at least it was dead
> simple.
> 
> Imo what we need here is a change in approach:
> 1. Check whether the useable view for scanout exists already. If yes,
> use that. This should avoid the constant unbinding stalls.
> 2. Try to pin the buffer to mappable, but without evicting anything (so
> not the non-blocking thing)
> 3. Pin the buffer with the most lenient approach
> 
> Even the non-blocking interim stage is dangerous, since it'll just
> result in other buffers (e.g. when triple-buffering) getting unbound
> and we're back to the same stall. Note that this could have an impact
> on cpu rendering compositors, where we might end up relying a lot more
> on partial views. But as long as we are a tad more aggressive (i.e. the
> non-blocking binding) in the mmap path that should work out to keep
> everything balanced, since usually you render first before you display
> anything. And so the buffer should end up in the ideal place.
> 
> I'd try to first skip the 2. step since I think it'll require a bit of
> work, and frankly I don't think we care about the potential fallout.

To be sure I understand: you propose to stop trying to pin mappable by default, i.e. stop respecting this comment from i915_gem_object_pin_to_display_plane():

	/*
	 * As the user may map the buffer once pinned in the display plane
	 * (e.g. libkms for the bootup splash), we have to ensure that we
	 * always use map_and_fenceable for all scanout buffers. However,
	 * it may simply be too big to fit into mappable, in which case
	 * put it anyway and hope that userspace can cope (but always first
	 * try to preserve the existing ABI).
	 */

From a quick look, it appears that in this case we would end up creating partial views for CPU access (since the normal mapping would be busy/unpinnable). The worst case is creating a bunch of 1MiB VMAs, so one thing to check is how long those persist in memory before they are released. Or perhaps the bootup splash use case is not common these days?
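
To put a rough number on that worst case, here is a back-of-the-envelope sketch in plain userspace C. It is not driver code: fb_size_bytes() and nr_partial_views() are hypothetical helpers, 4 bytes per pixel and a linear (untiled, unpadded) layout are assumptions, and the count is only an upper bound that is reached if the CPU touches every 1MiB chunk of the buffer.

```c
#include <assert.h>
#include <stdint.h>

/* 1 MiB chunk, matching the worst-case VMA size mentioned above. */
#define CHUNK_SIZE (1ULL << 20)

/*
 * Bytes needed for a linear framebuffer. The bytes-per-pixel value (cpp)
 * is an assumption of this sketch; stride padding and tiling are ignored.
 */
static uint64_t fb_size_bytes(uint32_t width, uint32_t height, uint32_t cpp)
{
	return (uint64_t)width * height * cpp;
}

/* Number of chunk-sized views needed to cover size bytes, rounding up. */
static uint64_t nr_partial_views(uint64_t size, uint64_t chunk)
{
	assert(chunk != 0);
	return (size + chunk - 1) / chunk;
}
```

Under those assumptions, fb_size_bytes(7680, 4320, 4) is 132710400 bytes (about 126.6 MiB), and nr_partial_views() of that size with CHUNK_SIZE gives 127 views per 8K framebuffer.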

Regards,

Tvrtko

> -Daniel
> 
>> +{
>> +       struct drm_i915_private *i915 = to_i915(obj->base.dev);
>> +       struct i915_ggtt *ggtt = to_gt(i915)->ggtt;
>> +       struct drm_mm_node *hole;
>> +       u64 hole_start, hole_end, start, end;
>> +       u64 fence_size, fence_alignment;
>> +       unsigned int count = 0;
>> +       int err = 0;
>> +
>> +       /*
>> +        * If the required space is larger than the available
>> +        * aperture, we will not able to find a slot for the
>> +        * object and unbinding the object now will be in
>> +        * vain. Worse, doing so may cause us to ping-pong
>> +        * the object in and out of the Global GTT and
>> +        * waste a lot of cycles under the mutex.
>> +        */
>> +       if (obj->base.size > ggtt->mappable_end)
>> +               return -E2BIG;
>> +
>> +       /*
>> +        * If NONBLOCK is set the caller is optimistically
>> +        * trying to cache the full object within the mappable
>> +        * aperture, and *must* have a fallback in place for
>> +        * situations where we cannot bind the object. We
>> +        * can be a little more lax here and use the fallback
>> +        * more often to avoid costly migrations of ourselves
>> +        * and other objects within the aperture.
>> +        */
>> +       if (!(flags & PIN_NONBLOCK))
>> +               return 0;
>> +
>> +       /*
>> +        * Other objects such as batchbuffers are fairly small compared
>> +        * to FBs and are unlikely to exhaust the aperture space.
>> +        * Therefore, return early if this obj is not an FB.
>> +        */
>> +       if (!i915_gem_object_is_framebuffer(obj))
>> +               return 0;
>> +
>> +       fence_size = i915_gem_fence_size(i915, obj->base.size,
>> +                                        i915_gem_object_get_tiling(obj),
>> +                                        i915_gem_object_get_stride(obj));
>> +
>> +       if (i915_vm_has_cache_coloring(&ggtt->vm))
>> +               fence_size += 2 * I915_GTT_PAGE_SIZE;
>> +
>> +       fence_alignment = i915_gem_fence_alignment(i915, obj->base.size,
>> +                                                  i915_gem_object_get_tiling(obj),
>> +                                                  i915_gem_object_get_stride(obj));
>> +       alignment = max_t(u64, alignment, fence_alignment);
>> +
>> +       err = mutex_lock_interruptible_nested(&ggtt->vm.mutex, 0);
>> +       if (err)
>> +               return err;
>> +
>> +       /*
>> +        * Assuming this object is a large scanout buffer, we try to find
>> +        * out if there is room to map at-least two of them. There could
>> +        * be space available to map one but to be consistent, we try to
>> +        * avoid mapping/fencing any of them.
>> +        */
>> +       drm_mm_for_each_suitable_hole(hole, &ggtt->vm.mm, 0, ggtt->mappable_end,
>> +                                     fence_size, DRM_MM_INSERT_LOW) {
>> +               hole_start = drm_mm_hole_node_start(hole);
>> +               hole_end = hole_start + hole->hole_size;
>> +
>> +               do {
>> +                       start = round_up(hole_start, alignment);
>> +                       end = min_t(u64, hole_end, ggtt->mappable_end);
>> +
>> +                       if (range_overflows(start, fence_size, end))
>> +                               break;
>> +
>> +                       if (++count >= 2) {
>> +                               mutex_unlock(&ggtt->vm.mutex);
>> +                               return 0;
>> +                       }
>> +
>> +                       hole_start = start + fence_size;
>> +               } while (1);
>> +       }
>> +
>> +       mutex_unlock(&ggtt->vm.mutex);
>> +       return -ENOSPC;
>> +}
>> +
>>   struct i915_vma *
>>   i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>>                              struct i915_gem_ww_ctx *ww,
>> @@ -897,36 +988,9 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>>
>>          if (flags & PIN_MAPPABLE &&
>>              (!view || view->type == I915_GGTT_VIEW_NORMAL)) {
>> -               /*
>> -                * If the required space is larger than the available
>> -                * aperture, we will not able to find a slot for the
>> -                * object and unbinding the object now will be in
>> -                * vain. Worse, doing so may cause us to ping-pong
>> -                * the object in and out of the Global GTT and
>> -                * waste a lot of cycles under the mutex.
>> -                */
>> -               if (obj->base.size > ggtt->mappable_end)
>> -                       return ERR_PTR(-E2BIG);
>> -
>> -               /*
>> -                * If NONBLOCK is set the caller is optimistically
>> -                * trying to cache the full object within the mappable
>> -                * aperture, and *must* have a fallback in place for
>> -                * situations where we cannot bind the object. We
>> -                * can be a little more lax here and use the fallback
>> -                * more often to avoid costly migrations of ourselves
>> -                * and other objects within the aperture.
>> -                *
>> -                * Half-the-aperture is used as a simple heuristic.
>> -                * More interesting would to do search for a free
>> -                * block prior to making the commitment to unbind.
>> -                * That caters for the self-harm case, and with a
>> -                * little more heuristics (e.g. NOFAULT, NOEVICT)
>> -                * we could try to minimise harm to others.
>> -                */
>> -               if (flags & PIN_NONBLOCK &&
>> -                   obj->base.size > ggtt->mappable_end / 2)
>> -                       return ERR_PTR(-ENOSPC);
>> +               ret = i915_gem_object_fits_in_aperture(obj, alignment, flags);
>> +               if (ret)
>> +                       return ERR_PTR(ret);
>>          }
>>
>>   new_vma:
>> @@ -938,10 +1002,6 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>>                  if (flags & PIN_NONBLOCK) {
>>                          if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
>>                                  return ERR_PTR(-ENOSPC);
>> -
>> -                       if (flags & PIN_MAPPABLE &&
>> -                           vma->fence_size > ggtt->mappable_end / 2)
>> -                               return ERR_PTR(-ENOSPC);
>>                  }
>>
>>                  if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma)) {
>> --
>> 2.35.1
>>
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-14 11:14     ` Tvrtko Ursulin
@ 2022-03-15  7:28         ` Kasireddy, Vivek
  0 siblings, 0 replies; 31+ messages in thread
From: Kasireddy, Vivek @ 2022-03-15  7:28 UTC (permalink / raw)
  To: Tvrtko Ursulin, Daniel Vetter; +Cc: intel-gfx, dri-devel

Hi Tvrtko, Daniel,

> 
> On 11/03/2022 09:39, Daniel Vetter wrote:
> > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> >>
> >> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >> more framebuffers/scanout buffers results in only one that is mappable/
> >> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >> is mappable/fenceable creates latencies large enough to miss alternate
> >> vblanks thereby producing less optimal framerate.
> >>
> >> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >> is called to pin one of the FB objs, the associated vma is identified
> >> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> >> evicts it. This misplaced vma gets subsequently pinned only when
> >> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> >> Therefore, to fix this issue, we try to see if there is space to map
> >> at-least two objects of a given size and return early if there isn't. This
> >> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >> are too big to map thereby preventing unnecessary unbind.
> >>
> >> Testcase:
> >> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> >> a frame ~7ms before the next vblank, the latencies seen between atomic
> >> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> >> it misses the vblank every other frame.
> >>
> >> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>                i915_gem_object_pin_to_display_plane() {
> >> 0.102 us   |    i915_gem_object_set_cache_level();
> >>                  i915_gem_object_ggtt_pin_ww() {
> >> 0.390 us   |      i915_vma_instance();
> >> 0.178 us   |      i915_vma_misplaced();
> >>                    i915_vma_unbind() {
> >>                    __i915_active_wait() {
> >> 0.082 us   |        i915_active_acquire_if_busy();
> >> 0.475 us   |      }
> >>                    intel_runtime_pm_get() {
> >> 0.087 us   |        intel_runtime_pm_acquire();
> >> 0.259 us   |      }
> >>                    __i915_active_wait() {
> >> 0.085 us   |        i915_active_acquire_if_busy();
> >> 0.240 us   |      }
> >>                    __i915_vma_evict() {
> >>                      ggtt_unbind_vma() {
> >>                        gen8_ggtt_clear_range() {
> >> 10507.255 us |        }
> >> 10507.689 us |      }
> >> 10508.516 us |   }
> >>
> >> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>      buffer is too big by checking to see if it is possible to map
> >>      two of them into the ggtt.
> >>
> >> v3 (Ville):
> >> - Count how many fb objects can be fit into the available holes
> >>    instead of checking for a hole twice the object size.
> >> - Take alignment constraints into account.
> >> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>
> >> v4:
> >> - Remove existing heuristic that checks just for size. (Ville)
> >> - Return early if we find space to map at-least two objects. (Tvrtko)
> >> - Slightly update the commit message.
> >>
> >> v5: (Tvrtko)
> >> - Rename the function to indicate that the object may be too big to
> >>    map into the aperture.
> >> - Account for guard pages while calculating the total size required
> >>    for the object.
> >> - Do not subject all objects to the heuristic check and instead
> >>    consider objects only of a certain size.
> >> - Do the hole walk using the rbtree.
> >> - Preserve the existing PIN_NONBLOCK logic.
> >> - Drop the PIN_MAPPABLE check while pinning the VMA.
> >>
> >> v6: (Tvrtko)
> >> - Return 0 on success and the specific error code on failure to
> >>    preserve the existing behavior.
> >>
> >> v7: (Ville)
> >> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> >>    size < ggtt->mappable_end / 4 checks.
> >> - Drop the redundant check that is based on previous heuristic.
> >>
> >> v8:
> >> - Make sure that we are holding the mutex associated with ggtt vm
> >>    as we traverse the hole nodes.
> >>
> >> v9: (Tvrtko)
> >> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>
> >> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >> Cc: Manasi Navare <manasi.d.navare@intel.com>
> >> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> >> ---
> >>   drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> >>   1 file changed, 94 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >> index 9747924cc57b..e0d731b3f215 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >> @@ -49,6 +49,7 @@
> >>   #include "gem/i915_gem_pm.h"
> >>   #include "gem/i915_gem_region.h"
> >>   #include "gem/i915_gem_userptr.h"
> >> +#include "gem/i915_gem_tiling.h"
> >>   #include "gt/intel_engine_user.h"
> >>   #include "gt/intel_gt.h"
> >>   #include "gt/intel_gt_pm.h"
> >> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> >>          spin_unlock(&obj->vma.lock);
> >>   }
> >>
> >> +static int
> >> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> >> +                                u64 alignment, u64 flags)
> >
> > Tvrtko asked me to ack the first patch, but then I looked at this and
> > started wondering.
> >
> > Conceptually this doesn't pass the smell test. What if we have
> > multiple per-crtc buffers? Multiple planes on the same crtc? What if
> > the app does triple buffer? You'll be forever busy tuning this
> > heuristics, which can't fundamentally be fixed I think. The old "half
> > of mappable" heuristic isn't really better, but at least it was dead
> > simple.
> >
> > Imo what we need here is a change in approach:
> > 1. Check whether the useable view for scanout exists already. If yes,
> > use that. This should avoid the constant unbinding stalls.
> > 2. Try to pin the buffer to mappable, but without evicting anything (so
> > not the non-blocking thing)
> > 3. Pin the buffer with the most lenient approach
> >
> > Even the non-blocking interim stage is dangerous, since it'll just
> > result in other buffers (e.g. when triple-buffering) getting unbound
> > and we're back to the same stall. Note that this could have an impact
> > on cpu rendering compositors, where we might end up relying a lot more
> > on partial views. But as long as we are a tad more aggressive (i.e. the
> > non-blocking binding) in the mmap path that should work out to keep
> > everything balanced, since usually you render first before you display
> > anything. And so the buffer should end up in the ideal place.
> >
> > I'd try to first skip the 2. step since I think it'll require a bit of
> > work, and frankly I don't think we care about the potential fallout.
> 
> To be sure I understand: you propose to stop trying to pin mappable by default, i.e. stop
> respecting this comment from i915_gem_object_pin_to_display_plane():
> 
> 	/*
> 	 * As the user may map the buffer once pinned in the display plane
> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> 	 * always use map_and_fenceable for all scanout buffers. However,
> 	 * it may simply be too big to fit into mappable, in which case
> 	 * put it anyway and hope that userspace can cope (but always first
> 	 * try to preserve the existing ABI).
> 	 */
[Kasireddy, Vivek] Digging further, this is what the commit message that added
the above comment says:
commit 2efb813d5388e18255c54afac77bd91acd586908
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 18 17:17:06 2016 +0100

    drm/i915: Fallback to using unmappable memory for scanout

    The existing ABI says that scanouts are pinned into the mappable region
    so that legacy clients (e.g. old Xorg or plymouthd) can write directly
    into the scanout through a GTT mapping. However if the surface does not
    fit into the mappable region, we are better off just trying to fit it
    anywhere and hoping for the best. (Any userspace that is capable of
    using ginormous scanouts is also likely not to rely on pure GTT
    updates.) With the partial vma fault support, we are no longer
    restricted to only using scanouts that we can pin (though it is still
    preferred for performance reasons and for powersaving features like
    FBC).

> 
> From a quick look, it appears that in this case we would end up creating partial views
> for CPU access (since the normal mapping would be busy/unpinnable). The worst case is
> creating a bunch of 1MiB VMAs, so one thing to check is how long those persist in
> memory before they are released. Or perhaps the bootup splash use case is not common
> these days?
[Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
Ubuntu and most other distributions. I took a quick look at it and, IIUC, Plymouth's
drm plugin creates a dumb FB, mmaps it and updates it via the dirty_fb ioctl. This
would not be a problem on ADL-S, where there is space in mappable for one 8K FB.

Given this, do you think it would work if we just preserve the existing behavior and
tweak the heuristic introduced in this patch to look for space in the aperture for only
one FB instead of two? Or is there no good option for solving this issue other than
creating 1MiB VMAs?
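
To make the one-versus-two question concrete, here is a minimal userspace sketch of the slot counting done in the hole walk, with the required count turned into a parameter. It is not the driver code: struct hole, round_up_pow2() and aperture_has_room() are hypothetical names, the array of holes stands in for what drm_mm_for_each_suitable_hole() would yield, the alignment is assumed to be a power of two, and address overflow is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a drm_mm hole covering [start, start + size). */
struct hole {
	uint64_t start;
	uint64_t size;
};

/* Round x up to a power-of-two alignment (assumption: align is 2^n). */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Return true if at least `wanted` aligned slots of `fence_size` bytes fit
 * into the given holes below `mappable_end`. wanted == 2 mirrors the check
 * in this patch; wanted == 1 is the tweak being asked about.
 */
static bool aperture_has_room(const struct hole *holes, size_t nr_holes,
			      uint64_t mappable_end, uint64_t fence_size,
			      uint64_t alignment, unsigned int wanted)
{
	unsigned int count = 0;
	size_t i;

	if (wanted == 0)
		return true;

	for (i = 0; i < nr_holes; i++) {
		uint64_t pos = holes[i].start;
		uint64_t end = holes[i].start + holes[i].size;

		/* Only the part of the hole below the mappable limit counts. */
		if (end > mappable_end)
			end = mappable_end;

		for (;;) {
			uint64_t start = round_up_pow2(pos, alignment);

			/* Stop once no further slot fits in this hole. */
			if (start > end || end - start < fence_size)
				break;

			if (++count >= wanted)
				return true;

			pos = start + fence_size;
		}
	}

	return false;
}
```

As an illustration with made-up numbers: with a 256 MiB mappable limit, a 64 MiB hole at offset 0 and a 160 MiB hole at offset 128 MiB, a request for 128 MiB slots at 1 MiB alignment succeeds with wanted = 1 and fails with wanted = 2.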

Thanks,
Vivek

> 
> Regards,
> 
> Tvrtko
> 
> > -Daniel
> >
> >> +{
> >> +       struct drm_i915_private *i915 = to_i915(obj->base.dev);
> >> +       struct i915_ggtt *ggtt = to_gt(i915)->ggtt;
> >> +       struct drm_mm_node *hole;
> >> +       u64 hole_start, hole_end, start, end;
> >> +       u64 fence_size, fence_alignment;
> >> +       unsigned int count = 0;
> >> +       int err = 0;
> >> +
> >> +       /*
> >> +        * If the required space is larger than the available
> >> +        * aperture, we will not able to find a slot for the
> >> +        * object and unbinding the object now will be in
> >> +        * vain. Worse, doing so may cause us to ping-pong
> >> +        * the object in and out of the Global GTT and
> >> +        * waste a lot of cycles under the mutex.
> >> +        */
> >> +       if (obj->base.size > ggtt->mappable_end)
> >> +               return -E2BIG;
> >> +
> >> +       /*
> >> +        * If NONBLOCK is set the caller is optimistically
> >> +        * trying to cache the full object within the mappable
> >> +        * aperture, and *must* have a fallback in place for
> >> +        * situations where we cannot bind the object. We
> >> +        * can be a little more lax here and use the fallback
> >> +        * more often to avoid costly migrations of ourselves
> >> +        * and other objects within the aperture.
> >> +        */
> >> +       if (!(flags & PIN_NONBLOCK))
> >> +               return 0;
> >> +
> >> +       /*
> >> +        * Other objects such as batchbuffers are fairly small compared
> >> +        * to FBs and are unlikely to exhaust the aperture space.
> >> +        * Therefore, return early if this obj is not an FB.
> >> +        */
> >> +       if (!i915_gem_object_is_framebuffer(obj))
> >> +               return 0;
> >> +
> >> +       fence_size = i915_gem_fence_size(i915, obj->base.size,
> >> +                                        i915_gem_object_get_tiling(obj),
> >> +                                        i915_gem_object_get_stride(obj));
> >> +
> >> +       if (i915_vm_has_cache_coloring(&ggtt->vm))
> >> +               fence_size += 2 * I915_GTT_PAGE_SIZE;
> >> +
> >> +       fence_alignment = i915_gem_fence_alignment(i915, obj->base.size,
> >> +                                                  i915_gem_object_get_tiling(obj),
> >> +                                                  i915_gem_object_get_stride(obj));
> >> +       alignment = max_t(u64, alignment, fence_alignment);
> >> +
> >> +       err = mutex_lock_interruptible_nested(&ggtt->vm.mutex, 0);
> >> +       if (err)
> >> +               return err;
> >> +
> >> +       /*
> >> +        * Assuming this object is a large scanout buffer, we try to find
> >> +        * out if there is room to map at-least two of them. There could
> >> +        * be space available to map one but to be consistent, we try to
> >> +        * avoid mapping/fencing any of them.
> >> +        */
> >> +       drm_mm_for_each_suitable_hole(hole, &ggtt->vm.mm, 0, ggtt->mappable_end,
> >> +                                     fence_size, DRM_MM_INSERT_LOW) {
> >> +               hole_start = drm_mm_hole_node_start(hole);
> >> +               hole_end = hole_start + hole->hole_size;
> >> +
> >> +               do {
> >> +                       start = round_up(hole_start, alignment);
> >> +                       end = min_t(u64, hole_end, ggtt->mappable_end);
> >> +
> >> +                       if (range_overflows(start, fence_size, end))
> >> +                               break;
> >> +
> >> +                       if (++count >= 2) {
> >> +                               mutex_unlock(&ggtt->vm.mutex);
> >> +                               return 0;
> >> +                       }
> >> +
> >> +                       hole_start = start + fence_size;
> >> +               } while (1);
> >> +       }
> >> +
> >> +       mutex_unlock(&ggtt->vm.mutex);
> >> +       return -ENOSPC;
> >> +}
> >> +
> >>   struct i915_vma *
> >>   i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >>                              struct i915_gem_ww_ctx *ww,
> >> @@ -897,36 +988,9 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >>
> >>          if (flags & PIN_MAPPABLE &&
> >>              (!view || view->type == I915_GGTT_VIEW_NORMAL)) {
> >> -               /*
> >> -                * If the required space is larger than the available
> >> -                * aperture, we will not able to find a slot for the
> >> -                * object and unbinding the object now will be in
> >> -                * vain. Worse, doing so may cause us to ping-pong
> >> -                * the object in and out of the Global GTT and
> >> -                * waste a lot of cycles under the mutex.
> >> -                */
> >> -               if (obj->base.size > ggtt->mappable_end)
> >> -                       return ERR_PTR(-E2BIG);
> >> -
> >> -               /*
> >> -                * If NONBLOCK is set the caller is optimistically
> >> -                * trying to cache the full object within the mappable
> >> -                * aperture, and *must* have a fallback in place for
> >> -                * situations where we cannot bind the object. We
> >> -                * can be a little more lax here and use the fallback
> >> -                * more often to avoid costly migrations of ourselves
> >> -                * and other objects within the aperture.
> >> -                *
> >> -                * Half-the-aperture is used as a simple heuristic.
> >> -                * More interesting would to do search for a free
> >> -                * block prior to making the commitment to unbind.
> >> -                * That caters for the self-harm case, and with a
> >> -                * little more heuristics (e.g. NOFAULT, NOEVICT)
> >> -                * we could try to minimise harm to others.
> >> -                */
> >> -               if (flags & PIN_NONBLOCK &&
> >> -                   obj->base.size > ggtt->mappable_end / 2)
> >> -                       return ERR_PTR(-ENOSPC);
> >> +               ret = i915_gem_object_fits_in_aperture(obj, alignment, flags);
> >> +               if (ret)
> >> +                       return ERR_PTR(ret);
> >>          }
> >>
> >>   new_vma:
> >> @@ -938,10 +1002,6 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >>                  if (flags & PIN_NONBLOCK) {
> >>                          if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
> >>                                  return ERR_PTR(-ENOSPC);
> >> -
> >> -                       if (flags & PIN_MAPPABLE &&
> >> -                           vma->fence_size > ggtt->mappable_end / 2)
> >> -                               return ERR_PTR(-ENOSPC);
> >>                  }
> >>
> >>                  if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma)) {
> >> --
> >> 2.35.1
> >>
> >
> >


* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
@ 2022-03-15  7:28         ` Kasireddy, Vivek
  0 siblings, 0 replies; 31+ messages in thread
From: Kasireddy, Vivek @ 2022-03-15  7:28 UTC (permalink / raw)
  To: Tvrtko Ursulin, Daniel Vetter; +Cc: intel-gfx, dri-devel

Hi Tvrtko, Daniel,

> 
> On 11/03/2022 09:39, Daniel Vetter wrote:
> > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> >>
> >> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >> more framebuffers/scanout buffers results in only one that is mappable/
> >> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >> is mappable/fenceable creates latencies large enough to miss alternate
> >> vblanks thereby producing less optimal framerate.
> >>
> >> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >> is called to pin one of the FB objs, the associated vma is identified
> >> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> >> evicts it. This misplaced vma gets subsequently pinned only when
> >> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> >> Therefore, to fix this issue, we try to see if there is space to map
> >> at-least two objects of a given size and return early if there isn't. This
> >> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >> are too big to map thereby preventing unnecessary unbind.
> >>
> >> Testcase:
> >> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> >> a frame ~7ms before the next vblank, the latencies seen between atomic
> >> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> >> it misses the vblank every other frame.
> >>
> >> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>                i915_gem_object_pin_to_display_plane() {
> >> 0.102 us   |    i915_gem_object_set_cache_level();
> >>                  i915_gem_object_ggtt_pin_ww() {
> >> 0.390 us   |      i915_vma_instance();
> >> 0.178 us   |      i915_vma_misplaced();
> >>                    i915_vma_unbind() {
> >>                    __i915_active_wait() {
> >> 0.082 us   |        i915_active_acquire_if_busy();
> >> 0.475 us   |      }
> >>                    intel_runtime_pm_get() {
> >> 0.087 us   |        intel_runtime_pm_acquire();
> >> 0.259 us   |      }
> >>                    __i915_active_wait() {
> >> 0.085 us   |        i915_active_acquire_if_busy();
> >> 0.240 us   |      }
> >>                    __i915_vma_evict() {
> >>                      ggtt_unbind_vma() {
> >>                        gen8_ggtt_clear_range() {
> >> 10507.255 us |        }
> >> 10507.689 us |      }
> >> 10508.516 us |   }
> >>
> >> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>      buffer is too big by checking to see if it is possible to map
> >>      two of them into the ggtt.
> >>
> >> v3 (Ville):
> >> - Count how many fb objects can be fit into the available holes
> >>    instead of checking for a hole twice the object size.
> >> - Take alignment constraints into account.
> >> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>
> >> v4:
> >> - Remove existing heuristic that checks just for size. (Ville)
> >> - Return early if we find space to map at-least two objects. (Tvrtko)
> >> - Slightly update the commit message.
> >>
> >> v5: (Tvrtko)
> >> - Rename the function to indicate that the object may be too big to
> >>    map into the aperture.
> >> - Account for guard pages while calculating the total size required
> >>    for the object.
> >> - Do not subject all objects to the heuristic check and instead
> >>    consider objects only of a certain size.
> >> - Do the hole walk using the rbtree.
> >> - Preserve the existing PIN_NONBLOCK logic.
> >> - Drop the PIN_MAPPABLE check while pinning the VMA.
> >>
> >> v6: (Tvrtko)
> >> - Return 0 on success and the specific error code on failure to
> >>    preserve the existing behavior.
> >>
> >> v7: (Ville)
> >> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> >>    size < ggtt->mappable_end / 4 checks.
> >> - Drop the redundant check that is based on previous heuristic.
> >>
> >> v8:
> >> - Make sure that we are holding the mutex associated with ggtt vm
> >>    as we traverse the hole nodes.
> >>
> >> v9: (Tvrtko)
> >> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>
> >> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >> Cc: Manasi Navare <manasi.d.navare@intel.com>
> >> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> >> ---
> >>   drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> >>   1 file changed, 94 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >> index 9747924cc57b..e0d731b3f215 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >> @@ -49,6 +49,7 @@
> >>   #include "gem/i915_gem_pm.h"
> >>   #include "gem/i915_gem_region.h"
> >>   #include "gem/i915_gem_userptr.h"
> >> +#include "gem/i915_gem_tiling.h"
> >>   #include "gt/intel_engine_user.h"
> >>   #include "gt/intel_gt.h"
> >>   #include "gt/intel_gt_pm.h"
> >> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> >>          spin_unlock(&obj->vma.lock);
> >>   }
> >>
> >> +static int
> >> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> >> +                                u64 alignment, u64 flags)
> >
> > Tvrtko asked me to ack the first patch, but then I looked at this and
> > started wondering.
> >
> > Conceptually this doesn't pass the smell test. What if we have
> > multiple per-crtc buffers? Multiple planes on the same crtc? What if
> > the app does triple buffer? You'll be forever busy tuning this
> > heuristic, which can't fundamentally be fixed I think. The old "half
> > of mappable" heuristic isn't really better, but at least it was dead
> > simple.
> >
> > Imo what we need here is a change in approach:
> > 1. Check whether the useable view for scanout exists already. If yes,
> > use that. This should avoid the constant unbinding stalls.
> > 2. Try to pin the buffer as mappable, but without evicting anything (so
> > not the non-blocking thing)
> > 3. Pin the buffer with the most lenient approach
> >
> > Even the non-blocking interim stage is dangerous, since it'll just
> > result in other buffers (e.g. when triple-buffering) getting unbound
> > and we're back to the same stall. Note that this could have an impact
> > on cpu rendering compositors, where we might end up relying a lot more
> > on partial views. But as long as we are a tad more aggressive (i.e. the
> > non-blocking binding) in the mmap path that should work out to keep
> > everything balanced, since usually you render first before you display
> > anything. And so the buffer should end up in the ideal place.
> >
> > I'd try to first skip the 2. step since I think it'll require a bit of
> > work, and frankly I don't think we care about the potential fallout.
> 
> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
> respecting this comment from i915_gem_object_pin_to_display_plane:
> 
> 	/*
> 	 * As the user may map the buffer once pinned in the display plane
> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> 	 * always use map_and_fenceable for all scanout buffers. However,
> 	 * it may simply be too big to fit into mappable, in which case
> 	 * put it anyway and hope that userspace can cope (but always first
> 	 * try to preserve the existing ABI).
> 	 */
[Kasireddy, Vivek] Digging further, this is what the commit message that added
the above comment says:
commit 2efb813d5388e18255c54afac77bd91acd586908
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 18 17:17:06 2016 +0100

    drm/i915: Fallback to using unmappable memory for scanout

    The existing ABI says that scanouts are pinned into the mappable region
    so that legacy clients (e.g. old Xorg or plymouthd) can write directly
    into the scanout through a GTT mapping. However if the surface does not
    fit into the mappable region, we are better off just trying to fit it
    anywhere and hoping for the best. (Any userspace that is capable of
    using ginormous scanouts is also likely not to rely on pure GTT
    updates.) With the partial vma fault support, we are no longer
    restricted to only using scanouts that we can pin (though it is still
    preferred for performance reasons and for powersaving features like
    FBC).

> 
> By a quick look, for this case it appears we would end up creating partial views for CPU
> access (since the normal mapping would be busy/unpinnable). Worst case for this is to
> create a bunch of 1MiB VMAs so something to check would be how long those persist in
> memory before they get released. Or perhaps the bootup splash use case is not common
> these days?
[Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
would not be a problem on ADL-S where there is space in mappable for one 8K FB.
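
To make the sizes concrete, here is a rough back-of-the-envelope sketch in C.
It assumes a linear XRGB8888 FB (4 bytes per pixel) and a 256 MiB mappable
aperture; neither number comes from this thread, and the real object size also
depends on stride, tiling and fence alignment. The helper names are made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a linear framebuffer (hypothetical helper). */
static uint64_t fb_size_bytes(uint32_t width, uint32_t height,
			      uint32_t bytes_per_pixel)
{
	assert(bytes_per_pixel != 0);
	return (uint64_t)width * height * bytes_per_pixel;
}

/*
 * Aperture space left after placing 'count' framebuffers of 'fb_size'
 * bytes back to back, ignoring alignment and guard pages. Returns 0 if
 * they do not fit at all.
 */
static uint64_t aperture_left_after(uint64_t aperture, uint64_t fb_size,
				    unsigned int count)
{
	uint64_t needed = fb_size * count;

	return needed > aperture ? 0 : aperture - needed;
}
```

With those assumptions, fb_size_bytes(7680, 4320, 4) is 132710400 bytes
(~126.6 MiB), and two such FBs would leave under 3 MiB of a 256 MiB aperture
for everything else, which would be consistent with only one of them being
mappable in practice.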

Given this, do you think it would work if we just preserve the existing behavior and
tweak the heuristic introduced in this patch to look for space in aperture for only 
one FB instead of two? Or, is there no good option for solving this issue other than
to create 1MB VMAs?
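
For reference, the counting idea with the number of required slots made a
parameter can be sketched in plain userspace C as below. This is not the
drm_mm code: struct hole, round_up_u64() and holes_fit_n() are hypothetical
names, and the holes are a plain array instead of the rbtree walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A free range in the address space (stand-in for a drm_mm hole). */
struct hole {
	uint64_t start;
	uint64_t size;
};

static uint64_t round_up_u64(uint64_t v, uint64_t align)
{
	assert(align != 0);
	return (v + align - 1) / align * align;
}

/*
 * Return true if at least 'needed' aligned slots of 'size' bytes fit
 * into the holes below 'limit' (the mappable end in the patch).
 */
static bool holes_fit_n(const struct hole *holes, size_t nholes,
			uint64_t limit, uint64_t size, uint64_t align,
			unsigned int needed)
{
	unsigned int count = 0;

	if (needed == 0)
		return true;

	for (size_t i = 0; i < nholes; i++) {
		uint64_t pos = holes[i].start;
		uint64_t end = holes[i].start + holes[i].size;

		/* Only the part of the hole below the limit is usable. */
		if (end > limit)
			end = limit;

		for (;;) {
			uint64_t start = round_up_u64(pos, align);

			if (start > end || end - start < size)
				break;

			/* Stop walking as soon as enough slots are found. */
			if (++count >= needed)
				return true;

			pos = start + size;
		}
	}

	return false;
}
```

Passing needed = 1 gives the one-FB variant of the check, while needed = 2
corresponds to what the patch does today.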

Thanks,
Vivek

> 
> Regards,
> 
> Tvrtko
> 
> > -Daniel
> >
> >> +{
> >> +       struct drm_i915_private *i915 = to_i915(obj->base.dev);
> >> +       struct i915_ggtt *ggtt = to_gt(i915)->ggtt;
> >> +       struct drm_mm_node *hole;
> >> +       u64 hole_start, hole_end, start, end;
> >> +       u64 fence_size, fence_alignment;
> >> +       unsigned int count = 0;
> >> +       int err = 0;
> >> +
> >> +       /*
> >> +        * If the required space is larger than the available
> >> +        * aperture, we will not able to find a slot for the
> >> +        * object and unbinding the object now will be in
> >> +        * vain. Worse, doing so may cause us to ping-pong
> >> +        * the object in and out of the Global GTT and
> >> +        * waste a lot of cycles under the mutex.
> >> +        */
> >> +       if (obj->base.size > ggtt->mappable_end)
> >> +               return -E2BIG;
> >> +
> >> +       /*
> >> +        * If NONBLOCK is set the caller is optimistically
> >> +        * trying to cache the full object within the mappable
> >> +        * aperture, and *must* have a fallback in place for
> >> +        * situations where we cannot bind the object. We
> >> +        * can be a little more lax here and use the fallback
> >> +        * more often to avoid costly migrations of ourselves
> >> +        * and other objects within the aperture.
> >> +        */
> >> +       if (!(flags & PIN_NONBLOCK))
> >> +               return 0;
> >> +
> >> +       /*
> >> +        * Other objects such as batchbuffers are fairly small compared
> >> +        * to FBs and are unlikely to exhaust the aperture space.
> >> +        * Therefore, return early if this obj is not an FB.
> >> +        */
> >> +       if (!i915_gem_object_is_framebuffer(obj))
> >> +               return 0;
> >> +
> >> +       fence_size = i915_gem_fence_size(i915, obj->base.size,
> >> +                                        i915_gem_object_get_tiling(obj),
> >> +                                        i915_gem_object_get_stride(obj));
> >> +
> >> +       if (i915_vm_has_cache_coloring(&ggtt->vm))
> >> +               fence_size += 2 * I915_GTT_PAGE_SIZE;
> >> +
> >> +       fence_alignment = i915_gem_fence_alignment(i915, obj->base.size,
> >> +                                                  i915_gem_object_get_tiling(obj),
> >> +                                                  i915_gem_object_get_stride(obj));
> >> +       alignment = max_t(u64, alignment, fence_alignment);
> >> +
> >> +       err = mutex_lock_interruptible_nested(&ggtt->vm.mutex, 0);
> >> +       if (err)
> >> +               return err;
> >> +
> >> +       /*
> >> +        * Assuming this object is a large scanout buffer, we try to find
> >> +        * out if there is room to map at-least two of them. There could
> >> +        * be space available to map one but to be consistent, we try to
> >> +        * avoid mapping/fencing any of them.
> >> +        */
> >> +       drm_mm_for_each_suitable_hole(hole, &ggtt->vm.mm, 0, ggtt->mappable_end,
> >> +                                     fence_size, DRM_MM_INSERT_LOW) {
> >> +               hole_start = drm_mm_hole_node_start(hole);
> >> +               hole_end = hole_start + hole->hole_size;
> >> +
> >> +               do {
> >> +                       start = round_up(hole_start, alignment);
> >> +                       end = min_t(u64, hole_end, ggtt->mappable_end);
> >> +
> >> +                       if (range_overflows(start, fence_size, end))
> >> +                               break;
> >> +
> >> +                       if (++count >= 2) {
> >> +                               mutex_unlock(&ggtt->vm.mutex);
> >> +                               return 0;
> >> +                       }
> >> +
> >> +                       hole_start = start + fence_size;
> >> +               } while (1);
> >> +       }
> >> +
> >> +       mutex_unlock(&ggtt->vm.mutex);
> >> +       return -ENOSPC;
> >> +}
> >> +
> >>   struct i915_vma *
> >>   i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >>                              struct i915_gem_ww_ctx *ww,
> >> @@ -897,36 +988,9 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >>
> >>          if (flags & PIN_MAPPABLE &&
> >>              (!view || view->type == I915_GGTT_VIEW_NORMAL)) {
> >> -               /*
> >> -                * If the required space is larger than the available
> >> -                * aperture, we will not able to find a slot for the
> >> -                * object and unbinding the object now will be in
> >> -                * vain. Worse, doing so may cause us to ping-pong
> >> -                * the object in and out of the Global GTT and
> >> -                * waste a lot of cycles under the mutex.
> >> -                */
> >> -               if (obj->base.size > ggtt->mappable_end)
> >> -                       return ERR_PTR(-E2BIG);
> >> -
> >> -               /*
> >> -                * If NONBLOCK is set the caller is optimistically
> >> -                * trying to cache the full object within the mappable
> >> -                * aperture, and *must* have a fallback in place for
> >> -                * situations where we cannot bind the object. We
> >> -                * can be a little more lax here and use the fallback
> >> -                * more often to avoid costly migrations of ourselves
> >> -                * and other objects within the aperture.
> >> -                *
> >> -                * Half-the-aperture is used as a simple heuristic.
> >> -                * More interesting would to do search for a free
> >> -                * block prior to making the commitment to unbind.
> >> -                * That caters for the self-harm case, and with a
> >> -                * little more heuristics (e.g. NOFAULT, NOEVICT)
> >> -                * we could try to minimise harm to others.
> >> -                */
> >> -               if (flags & PIN_NONBLOCK &&
> >> -                   obj->base.size > ggtt->mappable_end / 2)
> >> -                       return ERR_PTR(-ENOSPC);
> >> +               ret = i915_gem_object_fits_in_aperture(obj, alignment, flags);
> >> +               if (ret)
> >> +                       return ERR_PTR(ret);
> >>          }
> >>
> >>   new_vma:
> >> @@ -938,10 +1002,6 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >>                  if (flags & PIN_NONBLOCK) {
> >>                          if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma))
> >>                                  return ERR_PTR(-ENOSPC);
> >> -
> >> -                       if (flags & PIN_MAPPABLE &&
> >> -                           vma->fence_size > ggtt->mappable_end / 2)
> >> -                               return ERR_PTR(-ENOSPC);
> >>                  }
> >>
> >>                  if (i915_vma_is_pinned(vma) || i915_vma_is_active(vma)) {
> >> --
> >> 2.35.1
> >>
> >
> >


* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-15  7:28         ` Kasireddy, Vivek
  (?)
@ 2022-03-15  9:45         ` Tvrtko Ursulin
  2022-03-16  7:37             ` Kasireddy, Vivek
  2022-03-17  9:47             ` Daniel Vetter
  -1 siblings, 2 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2022-03-15  9:45 UTC (permalink / raw)
  To: Kasireddy, Vivek, Daniel Vetter; +Cc: intel-gfx, dri-devel


On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> Hi Tvrtko, Daniel,
> 
>>
>> On 11/03/2022 09:39, Daniel Vetter wrote:
>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
>>>>
>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
>>>> more framebuffers/scanout buffers results in only one that is mappable/
>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
>>>> is mappable/fenceable creates latencies large enough to miss alternate
>>>> vblanks thereby producing less optimal framerate.
>>>>
>>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
>>>> is called to pin one of the FB objs, the associated vma is identified
>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
>>>> evicts it. This misplaced vma gets subsequently pinned only when
>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
>>>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
>>>> Therefore, to fix this issue, we try to see if there is space to map
>>>> at-least two objects of a given size and return early if there isn't. This
>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
>>>> are too big to map thereby preventing unnecessary unbind.
>>>>
>>>> Testcase:
>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
>>>> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
>>>> a frame ~7ms before the next vblank, the latencies seen between atomic
>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
>>>> it misses the vblank every other frame.
>>>>
>>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
>>>>                 i915_gem_object_pin_to_display_plane() {
>>>> 0.102 us   |    i915_gem_object_set_cache_level();
>>>>                   i915_gem_object_ggtt_pin_ww() {
>>>> 0.390 us   |      i915_vma_instance();
>>>> 0.178 us   |      i915_vma_misplaced();
>>>>                     i915_vma_unbind() {
>>>>                     __i915_active_wait() {
>>>> 0.082 us   |        i915_active_acquire_if_busy();
>>>> 0.475 us   |      }
>>>>                     intel_runtime_pm_get() {
>>>> 0.087 us   |        intel_runtime_pm_acquire();
>>>> 0.259 us   |      }
>>>>                     __i915_active_wait() {
>>>> 0.085 us   |        i915_active_acquire_if_busy();
>>>> 0.240 us   |      }
>>>>                     __i915_vma_evict() {
>>>>                       ggtt_unbind_vma() {
>>>>                         gen8_ggtt_clear_range() {
>>>> 10507.255 us |        }
>>>> 10507.689 us |      }
>>>> 10508.516 us |   }
>>>>
>>>> v2: Instead of using bigjoiner checks, determine whether a scanout
>>>>       buffer is too big by checking to see if it is possible to map
>>>>       two of them into the ggtt.
>>>>
>>>> v3 (Ville):
>>>> - Count how many fb objects can be fit into the available holes
>>>>     instead of checking for a hole twice the object size.
>>>> - Take alignment constraints into account.
>>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
>>>>
>>>> v4:
>>>> - Remove existing heuristic that checks just for size. (Ville)
>>>> - Return early if we find space to map at-least two objects. (Tvrtko)
>>>> - Slightly update the commit message.
>>>>
>>>> v5: (Tvrtko)
>>>> - Rename the function to indicate that the object may be too big to
>>>>     map into the aperture.
>>>> - Account for guard pages while calculating the total size required
>>>>     for the object.
>>>> - Do not subject all objects to the heuristic check and instead
>>>>     consider objects only of a certain size.
>>>> - Do the hole walk using the rbtree.
>>>> - Preserve the existing PIN_NONBLOCK logic.
>>>> - Drop the PIN_MAPPABLE check while pinning the VMA.
>>>>
>>>> v6: (Tvrtko)
>>>> - Return 0 on success and the specific error code on failure to
>>>>     preserve the existing behavior.
>>>>
>>>> v7: (Ville)
>>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>>>>     size < ggtt->mappable_end / 4 checks.
>>>> - Drop the redundant check that is based on previous heuristic.
>>>>
>>>> v8:
>>>> - Make sure that we are holding the mutex associated with ggtt vm
>>>>     as we traverse the hole nodes.
>>>>
>>>> v9: (Tvrtko)
>>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
>>>>
>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>> Cc: Manasi Navare <manasi.d.navare@intel.com>
>>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
>>>>    1 file changed, 94 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>> index 9747924cc57b..e0d731b3f215 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>> @@ -49,6 +49,7 @@
>>>>    #include "gem/i915_gem_pm.h"
>>>>    #include "gem/i915_gem_region.h"
>>>>    #include "gem/i915_gem_userptr.h"
>>>> +#include "gem/i915_gem_tiling.h"
>>>>    #include "gt/intel_engine_user.h"
>>>>    #include "gt/intel_gt.h"
>>>>    #include "gt/intel_gt_pm.h"
>>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
>>>>           spin_unlock(&obj->vma.lock);
>>>>    }
>>>>
>>>> +static int
>>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
>>>> +                                u64 alignment, u64 flags)
>>>
>>> Tvrtko asked me to ack the first patch, but then I looked at this and
>>> started wondering.
>>>
>>> Conceptually this doesn't pass the smell test. What if we have
>>> multiple per-crtc buffers? Multiple planes on the same crtc? What if
>>> the app does triple buffer? You'll be forever busy tuning this
>>> heuristic, which can't fundamentally be fixed I think. The old "half
>>> of mappable" heuristic isn't really better, but at least it was dead
>>> simple.
>>>
>>> Imo what we need here is a change in approach:
>>> 1. Check whether the useable view for scanout exists already. If yes,
>>> use that. This should avoid the constant unbinding stalls.
>>> 2. Try to pin the buffer as mappable, but without evicting anything (so
>>> not the non-blocking thing)
>>> 3. Pin the buffer with the most lenient approach
>>>
>>> Even the non-blocking interim stage is dangerous, since it'll just
>>> result in other buffers (e.g. when triple-buffering) getting unbound
>>> and we're back to the same stall. Note that this could have an impact
>>> on cpu rendering compositors, where we might end up relying a lot more
>>> on partial views. But as long as we are a tad more aggressive (i.e. the
>>> non-blocking binding) in the mmap path that should work out to keep
>>> everything balanced, since usually you render first before you display
>>> anything. And so the buffer should end up in the ideal place.
>>>
>>> I'd try to first skip the 2. step since I think it'll require a bit of
>>> work, and frankly I don't think we care about the potential fallout.
>>
>> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
>> respecting this comment from i915_gem_object_pin_to_display_plane:
>>
>> 	/*
>> 	 * As the user may map the buffer once pinned in the display plane
>> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
>> 	 * always use map_and_fenceable for all scanout buffers. However,
>> 	 * it may simply be too big to fit into mappable, in which case
>> 	 * put it anyway and hope that userspace can cope (but always first
>> 	 * try to preserve the existing ABI).
>> 	 */
> [Kasireddy, Vivek] Digging further, this is what the commit message that added
> the above comment says:
> commit 2efb813d5388e18255c54afac77bd91acd586908
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Aug 18 17:17:06 2016 +0100
> 
>      drm/i915: Fallback to using unmappable memory for scanout
> 
>      The existing ABI says that scanouts are pinned into the mappable region
>      so that legacy clients (e.g. old Xorg or plymouthd) can write directly
>      into the scanout through a GTT mapping. However if the surface does not
>      fit into the mappable region, we are better off just trying to fit it
>      anywhere and hoping for the best. (Any userspace that is capable of
>      using ginormous scanouts is also likely not to rely on pure GTT
>      updates.) With the partial vma fault support, we are no longer
>      restricted to only using scanouts that we can pin (though it is still
>      preferred for performance reasons and for powersaving features like
>      FBC).
> 
>>
>> By a quick look, for this case it appears we would end up creating partial views for CPU
>> access (since the normal mapping would be busy/unpinnable). Worst case for this is to
>> create a bunch of 1MiB VMAs so something to check would be how long those persist in
>> memory before they get released. Or perhaps the bootup splash use case is not common
>> these days?
> [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
> Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
> drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
> would not be a problem on ADL-S where there is space in mappable for one 8K FB.
> 

FBC is a good point - correct me if I am wrong, but if we dropped trying 
to map in aperture by default it looks like we would lose it and that 
would be a significant power regression. In which case it doesn't seem 
like that would be an option.

Which I think leaves us with _some_ heuristics in any case.

1) N-holes heuristics.

2) Don't ever try PIN_MAPPABLE for framebuffers larger than some 
percentage of aperture.

Could this solve the 8k issue, most of the time, maybe? Could the 
current "aperture / 2" test be expressed generically in some terms? Like 
"(aperture - 10% (or some absolute value)) / 2" to account for non-fb 
objects? I forgot what you said the relationship between aperture size 
and 8k fb size was.

3) Don't evict for PIN_MAPPABLE mismatches when 
i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of 
i915_gem_object_pin_to_display_plane. Assumption being if we ended up 
with a non-mappable fb to start with, we must not try to re-bind it or 
we risk ping-pong latencies.

The last option would, I guess, need to distinguish between PIN_MAPPABLE
passed in by the caller and PIN_MAPPABLE added opportunistically by
i915_gem_object_pin_to_display_plane.

I am not sure how intrusive it would be to implement this option without
trying it myself.
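
As a rough illustration of option 2, the check could look something like the
sketch below. It is standalone C, not i915 code; fb_too_big_for_mappable() and
the reserve parameter are made-up names, and the 256 MiB aperture and 4 bytes
per pixel used in the note afterwards are assumptions rather than numbers from
this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if PIN_MAPPABLE should not even be attempted for a
 * framebuffer of 'fb_size' bytes: two such buffers must fit into what
 * is left of the aperture after 'reserve' bytes are set aside for
 * non-fb objects. reserve = 0 gives the plain "aperture / 2" test.
 */
static bool fb_too_big_for_mappable(uint64_t fb_size, uint64_t aperture,
				    uint64_t reserve)
{
	uint64_t usable;

	assert(aperture != 0);

	usable = reserve >= aperture ? 0 : aperture - reserve;

	return fb_size > usable / 2;
}
```

Under those assumptions a 7680 x 4320 FB (132710400 bytes) passes the plain
aperture / 2 test (reserve = 0) but is rejected once 10% of the aperture is
reserved for non-fb objects.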

> Given this, do you think it would work if we just preserve the existing behavior and
> tweak the heuristic introduced in this patch to look for space in aperture for only
> one FB instead of two? Or, is there no good option for solving this issue other than
> to create 1MB VMAs?

I did not get how having one hole would solve the issue. Wouldn't it
still hit the re-bind ping-pong? Or is there typically not even a single
hole for an 8k fb?

Regards,

Tvrtko


* RE: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-15  9:45         ` Tvrtko Ursulin
@ 2022-03-16  7:37             ` Kasireddy, Vivek
  2022-03-17  9:47             ` Daniel Vetter
  1 sibling, 0 replies; 31+ messages in thread
From: Kasireddy, Vivek @ 2022-03-16  7:37 UTC (permalink / raw)
  To: Tvrtko Ursulin, Daniel Vetter; +Cc: intel-gfx, dri-devel

Hi Tvrtko,

> 
> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > Hi Tvrtko, Daniel,
> >
> >>
> >> On 11/03/2022 09:39, Daniel Vetter wrote:
> >>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> >>>>
> >>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >>>> more framebuffers/scanout buffers results in only one that is mappable/
> >>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >>>> is mappable/fenceable creates latencies large enough to miss alternate
> >>>> vblanks thereby producing less optimal framerate.
> >>>>
> >>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >>>> is called to pin one of the FB objs, the associated vma is identified
> >>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> >>>> evicts it. This misplaced vma gets subsequently pinned only when
> >>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >>>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> >>>> Therefore, to fix this issue, we try to see if there is space to map
> >>>> at-least two objects of a given size and return early if there isn't. This
> >>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >>>> are too big to map thereby preventing unnecessary unbind.
> >>>>
> >>>> Testcase:
> >>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >>>> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> >>>> a frame ~7ms before the next vblank, the latencies seen between atomic
> >>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> >>>> it misses the vblank every other frame.
> >>>>
> >>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>>>                 i915_gem_object_pin_to_display_plane() {
> >>>> 0.102 us   |    i915_gem_object_set_cache_level();
> >>>>                   i915_gem_object_ggtt_pin_ww() {
> >>>> 0.390 us   |      i915_vma_instance();
> >>>> 0.178 us   |      i915_vma_misplaced();
> >>>>                     i915_vma_unbind() {
> >>>>                     __i915_active_wait() {
> >>>> 0.082 us   |        i915_active_acquire_if_busy();
> >>>> 0.475 us   |      }
> >>>>                     intel_runtime_pm_get() {
> >>>> 0.087 us   |        intel_runtime_pm_acquire();
> >>>> 0.259 us   |      }
> >>>>                     __i915_active_wait() {
> >>>> 0.085 us   |        i915_active_acquire_if_busy();
> >>>> 0.240 us   |      }
> >>>>                     __i915_vma_evict() {
> >>>>                       ggtt_unbind_vma() {
> >>>>                         gen8_ggtt_clear_range() {
> >>>> 10507.255 us |        }
> >>>> 10507.689 us |      }
> >>>> 10508.516 us |   }
> >>>>
> >>>> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>>>       buffer is too big by checking to see if it is possible to map
> >>>>       two of them into the ggtt.
> >>>>
> >>>> v3 (Ville):
> >>>> - Count how many fb objects can be fit into the available holes
> >>>>     instead of checking for a hole twice the object size.
> >>>> - Take alignment constraints into account.
> >>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>>>
> >>>> v4:
> >>>> - Remove existing heuristic that checks just for size. (Ville)
> >>>> - Return early if we find space to map at least two objects. (Tvrtko)
> >>>> - Slightly update the commit message.
> >>>>
> >>>> v5: (Tvrtko)
> >>>> - Rename the function to indicate that the object may be too big to
> >>>>     map into the aperture.
> >>>> - Account for guard pages while calculating the total size required
> >>>>     for the object.
> >>>> - Do not subject all objects to the heuristic check and instead
> >>>>     consider objects only of a certain size.
> >>>> - Do the hole walk using the rbtree.
> >>>> - Preserve the existing PIN_NONBLOCK logic.
> >>>> - Drop the PIN_MAPPABLE check while pinning the VMA.
> >>>>
> >>>> v6: (Tvrtko)
> >>>> - Return 0 on success and the specific error code on failure to
> >>>>     preserve the existing behavior.
> >>>>
> >>>> v7: (Ville)
> >>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> >>>>     size < ggtt->mappable_end / 4 checks.
> >>>> - Drop the redundant check that is based on previous heuristic.
> >>>>
> >>>> v8:
> >>>> - Make sure that we are holding the mutex associated with ggtt vm
> >>>>     as we traverse the hole nodes.
> >>>>
> >>>> v9: (Tvrtko)
> >>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>>>
> >>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>> Cc: Manasi Navare <manasi.d.navare@intel.com>
> >>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> >>>> ---
> >>>>    drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> >>>>    1 file changed, 94 insertions(+), 34 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>> index 9747924cc57b..e0d731b3f215 100644
> >>>> --- a/drivers/gpu/drm/i915/i915_gem.c
> >>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>> @@ -49,6 +49,7 @@
> >>>>    #include "gem/i915_gem_pm.h"
> >>>>    #include "gem/i915_gem_region.h"
> >>>>    #include "gem/i915_gem_userptr.h"
> >>>> +#include "gem/i915_gem_tiling.h"
> >>>>    #include "gt/intel_engine_user.h"
> >>>>    #include "gt/intel_gt.h"
> >>>>    #include "gt/intel_gt_pm.h"
> >>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> >>>>           spin_unlock(&obj->vma.lock);
> >>>>    }
> >>>>
> >>>> +static int
> >>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> >>>> +                                u64 alignment, u64 flags)
> >>>
> >>> Tvrtko asked me to ack the first patch, but then I looked at this and
> >>> started wondering.
> >>>
> >>> Conceptually this doesn't pass the smell test. What if we have
> >>> multiple per-crtc buffers? Multiple planes on the same crtc? What if
> >>> the app does triple buffer? You'll be forever busy tuning this
> >>> heuristics, which can't fundamentally be fixed I think. The old "half
> >>> of mappable" heuristic isn't really better, but at least it was dead
> >>> simple.
> >>>
> >>> Imo what we need here is a change in approach:
> >>> 1. Check whether the useable view for scanout exists already. If yes,
> >>> use that. This should avoid the constant unbinding stalls.
> >>> 2. Try to pin the buffer into mappable, but without evicting anything (so
> >>> not the non-blocking thing)
> >>> 3. Pin the buffer with the most lenient approach
> >>>
> >>> Even the non-blocking interim stage is dangerous, since it'll just
> >>> result in other buffers (e.g. when triple-buffering) getting unbound
> >>> and we're back to the same stall. Note that this could have an impact
> >>> on cpu rendering compositors, where we might end up relying a lot more
> >>> on partial views. But as long as we are a tad more aggressive (i.e. the
> >>> non-blocking binding) in the mmap path that should work out to keep
> >>> everything balanced, since usually you render first before you display
> >>> anything. And so the buffer should end up in the ideal place.
> >>>
> >>> I'd try to first skip the 2. step since I think it'll require a bit of
> >>> work, and frankly I don't think we care about the potential fallout.
> >>
> >> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
> >> respecting this comment from i915_gem_object_pin_to_display_plane:
> >>
> >> 	/*
> >> 	 * As the user may map the buffer once pinned in the display plane
> >> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> >> 	 * always use map_and_fenceable for all scanout buffers. However,
> >> 	 * it may simply be too big to fit into mappable, in which case
> >> 	 * put it anyway and hope that userspace can cope (but always first
> >> 	 * try to preserve the existing ABI).
> >> 	 */
> > [Kasireddy, Vivek] Digging further, this is what the commit message that added
> > the above comment says:
> > commit 2efb813d5388e18255c54afac77bd91acd586908
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Thu Aug 18 17:17:06 2016 +0100
> >
> >      drm/i915: Fallback to using unmappable memory for scanout
> >
> >      The existing ABI says that scanouts are pinned into the mappable region
> >      so that legacy clients (e.g. old Xorg or plymouthd) can write directly
> >      into the scanout through a GTT mapping. However if the surface does not
> >      fit into the mappable region, we are better off just trying to fit it
> >      anywhere and hoping for the best. (Any userspace that is capable of
> >      using ginormous scanouts is also likely not to rely on pure GTT
> >      updates.) With the partial vma fault support, we are no longer
> >      restricted to only using scanouts that we can pin (though it is still
> >      preferred for performance reasons and for powersaving features like
> >      FBC).
> >
> >>
> >> By a quick look, for this case it appears we would end up creating partial views for
> >> CPU access (since the normal mapping would be busy/unpinnable). Worst case for this is to
> >> create a bunch of 1MiB VMAs so something to check would be how long those persist in
> >> memory before they get released. Or perhaps the bootup splash use case is not common
> >> these days?
> > [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
> > Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it
> > (Plymouth's drm plugin) seems to create a dumb FB, mmap it, and update it via the
> > dirty_fb ioctl. This would not be a problem on ADL-S where there is space in mappable
> > for one 8K FB.
> >
> 
> FBC is a good point - correct me if I am wrong, but if we dropped trying
> to map in aperture by default it looks like we would lose it and that
> would be a significant power regression. In which case it doesn't seem
> like that would be an option.
[Kasireddy, Vivek] Ok, makes sense.

> 
> Which I think leaves us with _some_ heuristics in any case.
> 
> 1) N-holes heuristics.
> 
> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some
> percentage of aperture.
> 
> Could this solve the 8k issue, most of the time, maybe? Could the
> current "aperture / 2" test be expressed generically in some terms? Like
> "(aperture - 10% (or some absolute value)) / 2" to account for non-fb
> objects? I forgot what you said the relationship between aperture size
> and 8k fb size was.
> 
> 3) Don't evict for PIN_MAPPABLE mismatches when
> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
> i915_gem_object_pin_to_display_plane. Assumption being if we ended up
> with a non-mappable fb to start with, we must not try to re-bind it or
> we risk ping-pong latencies.
> 
> The last would I guess need to distinguish between PIN_MAPPABLE passed
> in versus opportunistically added by i915_gem_object_pin_to_display_plane.
> 
> How intrusive would it be to implement this option I am not sure without
> trying myself.
[Kasireddy, Vivek] I suspect I might be missing something, but could it not be
as simple as below:
@@ -940,7 +940,8 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
                                return ERR_PTR(-ENOSPC);

                        if (flags & PIN_MAPPABLE &&
-                           vma->fence_size > ggtt->mappable_end / 2)
+                           (vma->fence_size > ggtt->mappable_end / 2 ||
+                           !i915_vma_is_map_and_fenceable(vma)))
                                    return ERR_PTR(-ENOSPC);
                }
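
To make the condition in the diff above easier to reason about, here is a
minimal user-space sketch of just that predicate. It is not driver code:
struct model_vma, MODEL_PIN_MAPPABLE and model_skip_mappable_pin() are
hypothetical stand-ins, and it only models the if condition shown in the
diff, nothing around it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the PIN_MAPPABLE flag bit. */
#define MODEL_PIN_MAPPABLE (1u << 0)

/* Hypothetical, heavily reduced stand-in for struct i915_vma. */
struct model_vma {
	uint64_t fence_size;     /* size required for a fenced mapping */
	bool map_and_fenceable;  /* currently bound inside the aperture? */
};

/*
 * Returns true when the mappable attempt should be abandoned early
 * (the diff returns ERR_PTR(-ENOSPC) at this point) rather than
 * falling through to an unbind of the existing binding.
 */
static bool model_skip_mappable_pin(const struct model_vma *vma,
				    uint64_t mappable_end,
				    unsigned int flags)
{
	assert(vma);

	if (!(flags & MODEL_PIN_MAPPABLE))
		return false;

	/* Existing heuristic: larger than half of the aperture. */
	if (vma->fence_size > mappable_end / 2)
		return true;

	/* Proposed addition: already bound, but not in the aperture. */
	return !vma->map_and_fenceable;
}
```

With an assumed mappable_end of 256 MiB and an assumed fence_size of 127 MiB
(illustrative values only), a vma that is already map_and_fenceable (FB1)
yields false, while one bound outside the aperture (FB2) yields true, which
corresponds to returning -ENOSPC before any unbind happens.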
> 
> > Given this, do you think it would work if we just preserve the existing behavior and
> > tweak the heuristic introduced in this patch to look for space in aperture for only
> > one FB instead of two? Or, is there no good option for solving this issue other than
> > to create 1MB VMAs?
> 
> I did not get how having one hole would solve the issue. Wouldn't it
> still hit the re-bind ping-pong? Or there isn't even a single hole for
> 8k fb typically?
[Kasireddy, Vivek] IIUC, Mesa gives Weston a max of 4 backbuffers but it
almost always uses only 2 except when it needs to share the FB -- with a plugin
such as "remoting" for desktop streaming.
Given the common use-case, let's assume there are two 8K FBs: FB1 and FB2.
FB1 is mappable/fenceable and therefore not misplaced.
FB2 is NOT mappable and hence identified as misplaced, because it trips
the check (flags & PIN_MAPPABLE && !i915_vma_is_map_and_fenceable(vma)).

As you suggest in 3) above, the goal is to ensure that FB2 does not get evicted
when we try to pin with PIN_MAPPABLE -- after it gets identified as misplaced.
Or, alternatively, when we pin with PIN_MAPPABLE, we could just check to
see if there is space in aperture for only FB2 (N = 1) and return early -- before
even getting to i915_vma_misplaced(). As you can see, we avoid the ping-pong
issue in both these cases.

The current version of this patch -- when running Weston -- puts both FB1
and FB2 (N = 2) outside of aperture although there may be space for FB1.
I don't think this makes sense anymore given Plymouth's single-buffer 
use-case that uses dirtyfb ioctl.
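
As a rough illustration of the N = 1 versus N = 2 difference, here is a
user-space sketch of the "do N objects fit into the free holes" question.
It is only a model: struct model_hole, model_align_up() and model_fits_n()
are hypothetical names, the holes are a plain array, and guard pages are
ignored. The patch itself walks the drm_mm hole nodes while holding the
ggtt vm mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free range [start, end) inside the mappable aperture. */
struct model_hole {
	uint64_t start;
	uint64_t end;
};

/* Round v up to the next multiple of a (a must be non-zero). */
static uint64_t model_align_up(uint64_t v, uint64_t a)
{
	return (v + a - 1) / a * a;
}

/*
 * Returns true as soon as n objects of the given size and alignment
 * are known to fit into the holes, false if all holes were checked
 * without reaching n.
 */
static bool model_fits_n(const struct model_hole *holes, size_t nholes,
			 uint64_t size, uint64_t alignment, uint64_t n)
{
	uint64_t stride, count = 0;

	assert(size > 0 && alignment > 0);

	if (n == 0)
		return true;

	/* Distance between the starts of two aligned objects in one hole. */
	stride = model_align_up(size, alignment);

	for (size_t i = 0; i < nholes; i++) {
		uint64_t start = model_align_up(holes[i].start, alignment);
		uint64_t avail;

		if (start >= holes[i].end)
			continue;

		avail = holes[i].end - start;
		if (avail < size)
			continue;

		/* One object, plus one more per additional full stride. */
		count += 1 + (avail - size) / stride;
		if (count >= n)
			return true; /* early return, no need to walk further */
	}

	return false;
}
```

For example, with two assumed holes of 130 MiB and 56 MiB and an assumed
127 MiB object, model_fits_n() succeeds for N = 1 but fails for N = 2. In
that situation an N = 2 rule keeps both FBs out of the aperture even though
one of them would fit.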

Thanks,
Vivek

> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-16  7:37             ` Kasireddy, Vivek
  (?)
@ 2022-03-16 13:34             ` Tvrtko Ursulin
  2022-03-17  7:08                 ` Kasireddy, Vivek
  -1 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2022-03-16 13:34 UTC (permalink / raw)
  To: Kasireddy, Vivek, Daniel Vetter; +Cc: intel-gfx, dri-devel


On 16/03/2022 07:37, Kasireddy, Vivek wrote:
> Hi Tvrtko,
> 
>>
>> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
>>> Hi Tvrtko, Daniel,
>>>
>>>>
>>>> On 11/03/2022 09:39, Daniel Vetter wrote:
>>>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
>>>>>>
>>>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
>>>>>> more framebuffers/scanout buffers results in only one that is mappable/
>>>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
>>>>>> is mappable/fenceable creates latencies large enough to miss alternate
>>>>>> vblanks thereby producing less optimal framerate.
>>>>>>
>>>>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
>>>>>> is called to pin one of the FB objs, the associated vma is identified
>>>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
>>>>>> evicts it. This misplaced vma gets subsequently pinned only when
>>>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
>>>>>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
>>>>>> Therefore, to fix this issue, we try to see if there is space to map
>>>>>> at least two objects of a given size and return early if there isn't. This
>>>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
>>>>>> are too big to map thereby preventing unnecessary unbind.
>>>>>>
>>>>>> Testcase:
>>>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
>>>>>> with an 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
>>>>>> a frame ~7ms before the next vblank, the latencies seen between atomic
>>>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
>>>>>> it misses the vblank every other frame.
>>>>>>
>>>>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
>>>>>>                  i915_gem_object_pin_to_display_plane() {
>>>>>> 0.102 us   |    i915_gem_object_set_cache_level();
>>>>>>                    i915_gem_object_ggtt_pin_ww() {
>>>>>> 0.390 us   |      i915_vma_instance();
>>>>>> 0.178 us   |      i915_vma_misplaced();
>>>>>>                      i915_vma_unbind() {
>>>>>>                      __i915_active_wait() {
>>>>>> 0.082 us   |        i915_active_acquire_if_busy();
>>>>>> 0.475 us   |      }
>>>>>>                      intel_runtime_pm_get() {
>>>>>> 0.087 us   |        intel_runtime_pm_acquire();
>>>>>> 0.259 us   |      }
>>>>>>                      __i915_active_wait() {
>>>>>> 0.085 us   |        i915_active_acquire_if_busy();
>>>>>> 0.240 us   |      }
>>>>>>                      __i915_vma_evict() {
>>>>>>                        ggtt_unbind_vma() {
>>>>>>                          gen8_ggtt_clear_range() {
>>>>>> 10507.255 us |        }
>>>>>> 10507.689 us |      }
>>>>>> 10508.516 us |   }
>>>>>>
>>>>>> v2: Instead of using bigjoiner checks, determine whether a scanout
>>>>>>        buffer is too big by checking to see if it is possible to map
>>>>>>        two of them into the ggtt.
>>>>>>
>>>>>> v3 (Ville):
>>>>>> - Count how many fb objects can be fit into the available holes
>>>>>>      instead of checking for a hole twice the object size.
>>>>>> - Take alignment constraints into account.
>>>>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
>>>>>>
>>>>>> v4:
>>>>>> - Remove existing heuristic that checks just for size. (Ville)
>>>>>> - Return early if we find space to map at least two objects. (Tvrtko)
>>>>>> - Slightly update the commit message.
>>>>>>
>>>>>> v5: (Tvrtko)
>>>>>> - Rename the function to indicate that the object may be too big to
>>>>>>      map into the aperture.
>>>>>> - Account for guard pages while calculating the total size required
>>>>>>      for the object.
>>>>>> - Do not subject all objects to the heuristic check and instead
>>>>>>      consider objects only of a certain size.
>>>>>> - Do the hole walk using the rbtree.
>>>>>> - Preserve the existing PIN_NONBLOCK logic.
>>>>>> - Drop the PIN_MAPPABLE check while pinning the VMA.
>>>>>>
>>>>>> v6: (Tvrtko)
>>>>>> - Return 0 on success and the specific error code on failure to
>>>>>>      preserve the existing behavior.
>>>>>>
>>>>>> v7: (Ville)
>>>>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>>>>>>      size < ggtt->mappable_end / 4 checks.
>>>>>> - Drop the redundant check that is based on previous heuristic.
>>>>>>
>>>>>> v8:
>>>>>> - Make sure that we are holding the mutex associated with ggtt vm
>>>>>>      as we traverse the hole nodes.
>>>>>>
>>>>>> v9: (Tvrtko)
>>>>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
>>>>>>
>>>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>>> Cc: Manasi Navare <manasi.d.navare@intel.com>
>>>>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
>>>>>>     1 file changed, 94 insertions(+), 34 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>> index 9747924cc57b..e0d731b3f215 100644
>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>> @@ -49,6 +49,7 @@
>>>>>>     #include "gem/i915_gem_pm.h"
>>>>>>     #include "gem/i915_gem_region.h"
>>>>>>     #include "gem/i915_gem_userptr.h"
>>>>>> +#include "gem/i915_gem_tiling.h"
>>>>>>     #include "gt/intel_engine_user.h"
>>>>>>     #include "gt/intel_gt.h"
>>>>>>     #include "gt/intel_gt_pm.h"
>>>>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
>>>>>>            spin_unlock(&obj->vma.lock);
>>>>>>     }
>>>>>>
>>>>>> +static int
>>>>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
>>>>>> +                                u64 alignment, u64 flags)
>>>>>
>>>>> Tvrtko asked me to ack the first patch, but then I looked at this and
>>>>> started wondering.
>>>>>
>>>>> Conceptually this doesn't pass the smell test. What if we have
>>>>> multiple per-crtc buffers? Multiple planes on the same crtc? What if
>>>>> the app does triple buffer? You'll be forever busy tuning this
>>>>> heuristics, which can't fundamentally be fixed I think. The old "half
>>>>> of mappable" heuristic isn't really better, but at least it was dead
>>>>> simple.
>>>>>
>>>>> Imo what we need here is a change in approach:
>>>>> 1. Check whether the useable view for scanout exists already. If yes,
>>>>> use that. This should avoid the constant unbinding stalls.
>>>>> 2. Try to pin the buffer into mappable, but without evicting anything (so
>>>>> not the non-blocking thing)
>>>>> 3. Pin the buffer with the most lenient approach
>>>>>
>>>>> Even the non-blocking interim stage is dangerous, since it'll just
>>>>> result in other buffers (e.g. when triple-buffering) getting unbound
>>>>> and we're back to the same stall. Note that this could have an impact
>>>>> on cpu rendering compositors, where we might end up relying a lot more
>>>>> on partial views. But as long as we are a tad more aggressive (i.e. the
>>>>> non-blocking binding) in the mmap path that should work out to keep
>>>>> everything balanced, since usually you render first before you display
>>>>> anything. And so the buffer should end up in the ideal place.
>>>>>
>>>>> I'd try to first skip the 2. step since I think it'll require a bit of
>>>>> work, and frankly I don't think we care about the potential fallout.
>>>>
>>>> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
>>>> respecting this comment from i915_gem_object_pin_to_display_plane:
>>>>
>>>> 	/*
>>>> 	 * As the user may map the buffer once pinned in the display plane
>>>> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
>>>> 	 * always use map_and_fenceable for all scanout buffers. However,
>>>> 	 * it may simply be too big to fit into mappable, in which case
>>>> 	 * put it anyway and hope that userspace can cope (but always first
>>>> 	 * try to preserve the existing ABI).
>>>> 	 */
>>> [Kasireddy, Vivek] Digging further, this is what the commit message that added
>>> the above comment says:
>>> commit 2efb813d5388e18255c54afac77bd91acd586908
>>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>>> Date:   Thu Aug 18 17:17:06 2016 +0100
>>>
>>>       drm/i915: Fallback to using unmappable memory for scanout
>>>
>>>       The existing ABI says that scanouts are pinned into the mappable region
>>>       so that legacy clients (e.g. old Xorg or plymouthd) can write directly
>>>       into the scanout through a GTT mapping. However if the surface does not
>>>       fit into the mappable region, we are better off just trying to fit it
>>>       anywhere and hoping for the best. (Any userspace that is capable of
>>>       using ginormous scanouts is also likely not to rely on pure GTT
>>>       updates.) With the partial vma fault support, we are no longer
>>>       restricted to only using scanouts that we can pin (though it is still
>>>       preferred for performance reasons and for powersaving features like
>>>       FBC).
>>>
>>>>
>>>> By a quick look, for this case it appears we would end up creating partial views for CPU
>>>> access (since the normal mapping would be busy/unpinnable). Worst case for this is to
>>>> create a bunch of 1MiB VMAs so something to check would be how long those persist in
>>>> memory before they get released. Or perhaps the bootup splash use case is not common
>>>> these days?
>>> [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
>>> Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
>>> drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
>>> would not be a problem on ADL-S where there is space in mappable for one 8K FB.
>>>
>>
>> FBC is a good point - correct me if I am wrong, but if we dropped trying
>> to map in aperture by default it looks like we would lose it and that
>> would be a significant power regression. In which case it doesn't seem
>> like that would be an option.
> [Kasireddy, Vivek] Ok, makes sense.
> 
>>
>> Which I think leaves us with _some_ heuristics in any case.
>>
>> 1) N-holes heuristics.
>>
>> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some
>> percentage of aperture.
>>
>> Could this solve the 8k issue, most of the time, maybe? Could the
>> current "aperture / 2" test be expressed generically in some terms? Like
>> "(aperture - 10% (or some absolute value)) / 2" to account for non-fb
>> objects? I forgot what you said the relationship between aperture size
>> and 8k fb size was.
>>
>> 3) Don't evict for PIN_MAPPABLE mismatches when
>> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
>> i915_gem_object_pin_to_display_plane. Assumption being if we ended up
>> with a non-mappable fb to start with, we must not try to re-bind it or
>> we risk ping-pong latencies.
>>
>> The last would I guess need to distinguish between PIN_MAPPABLE passed
>> in versus opportunistically added by i915_gem_object_pin_to_display_plane.
>>
>> How intrusive would it be to implement this option I am not sure without
>> trying myself.
> [Kasireddy, Vivek] I suspect I might be missing something, but could it not be
> as simple as below:
> @@ -940,7 +940,8 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>                                  return ERR_PTR(-ENOSPC);
> 
>                          if (flags & PIN_MAPPABLE &&
> -                           vma->fence_size > ggtt->mappable_end / 2)
> +                           (vma->fence_size > ggtt->mappable_end / 2 ||
> +                           !i915_vma_is_map_and_fenceable(vma)))
>                                      return ERR_PTR(-ENOSPC);
>                  }

Looks like this would work...

>>
>>> Given this, do you think it would work if we just preserve the existing behavior and
>>> tweak the heuristic introduced in this patch to look for space in aperture for only
>>> one FB instead of two? Or, is there no good option for solving this issue other than
>>> to create 1MB VMAs?
>>
>> I did not get how having one hole would solve the issue. Wouldn't it
>> still hit the re-bind ping-pong? Or there isn't even a single hole for
>> 8k fb typically?
> [Kasireddy, Vivek] IIUC, Mesa gives Weston a max of 4 backbuffers but it
> almost always uses only 2 except when it needs to share the FB -- with a plugin
> such as "remoting" for desktop streaming.
> Given the common use-case, let's assume there are two 8K FBs: FB1 and FB2.
> FB1 is mappable/fenceable and therefore not misplaced.
> FB2 is NOT mappable and hence identified as misplaced
> (because it fails the check
> (flags & PIN_MAPPABLE && !i915_vma_is_map_and_fenceable(vma))).
> 
> As you suggest in 3) above the goal is to ensure that FB2 does not get evicted
> when we try to pin with PIN_MAPPABLE -- after it gets identified as misplaced.
> Or, alternatively, when we pin with PIN_MAPPABLE, we could just check to
> see if there is space in aperture for only FB2 (N = 1) and return early -- before
> even getting to i915_vma_misplaced(). As you can see, we avoid the ping-pong
> issue in both these cases.

... got it, yes, it seems both approaches work for this use case.

Not sure that I have a preference between the two approaches at this point.

Both would be behind a "PIN_MAPPABLE && PIN_NONBLOCK" check, so both 
would only apply to opportunistic PIN_MAPPABLE attempts. That is, any 
caller who only passes PIN_MAPPABLE would be unaffected which is what we 
want.

The extra i915_vma_is_map_and_fenceable check I guess is simpler and 
self-contained. I assume you have a test setup and can try it out to 
check it really works?
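
For anyone following along, here is a minimal user-space sketch of the check being discussed. It is not the driver code: the struct, the function name and the flag values are hypothetical stand-ins, and it only models the branch taken once a VMA has already been found misplaced. It assumes, as described above, that the extra test sits behind PIN_NONBLOCK and so only affects opportunistic PIN_MAPPABLE attempts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the i915 pin flags; the bit values are illustrative only. */
#define PIN_NONBLOCK (1ull << 0)
#define PIN_MAPPABLE (1ull << 1)

/* Hypothetical, simplified model of an already-bound GGTT VMA. */
struct model_vma {
	uint64_t fence_size;     /* size needed to map and fence the object */
	bool map_and_fenceable;  /* true if currently bound inside the aperture */
};

/*
 * Models the decision for a VMA that was already identified as misplaced.
 * Returns -ENOSPC when an opportunistic mappable pin should be refused
 * (so the VMA is left where it is), and 0 when the caller may go on to
 * unbind and rebind it.
 */
int model_check_misplaced(const struct model_vma *vma, uint64_t flags,
			  uint64_t mappable_end)
{
	assert(mappable_end > 0);

	/* Callers that do not pass PIN_NONBLOCK are unaffected. */
	if (!(flags & PIN_NONBLOCK))
		return 0;

	/* Existing size heuristic, plus the proposed "not already mappable" test. */
	if ((flags & PIN_MAPPABLE) &&
	    (vma->fence_size > mappable_end / 2 || !vma->map_and_fenceable))
		return -ENOSPC;

	return 0;
}
```

In this model, an FB that is bound outside the aperture gets -ENOSPC for a PIN_MAPPABLE | PIN_NONBLOCK request and therefore is not unbound, while a plain PIN_MAPPABLE request still proceeds to the rebind path.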

> The current version of this patch -- when running Weston -- puts both FB1
> and FB2 (N = 2) outside of aperture although there may be space for FB1.
> I don't think this makes sense anymore given Plymouth's single-buffer
> use-case that uses dirtyfb ioctl.

Yes agreed, it sounds preferable to preserve the current behaviour there.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-16 13:34             ` Tvrtko Ursulin
@ 2022-03-17  7:08                 ` Kasireddy, Vivek
  0 siblings, 0 replies; 31+ messages in thread
From: Kasireddy, Vivek @ 2022-03-17  7:08 UTC (permalink / raw)
  To: Tvrtko Ursulin, Daniel Vetter; +Cc: intel-gfx, dri-devel

Hi Tvrtko,

> 
> On 16/03/2022 07:37, Kasireddy, Vivek wrote:
> > Hi Tvrtko,
> >
> >>
> >> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> >>> Hi Tvrtko, Daniel,
> >>>
> >>>>
> >>>> On 11/03/2022 09:39, Daniel Vetter wrote:
> >>>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> >>>>>>
> >>>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >>>>>> more framebuffers/scanout buffers results in only one that is mappable/
> >>>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >>>>>> is mappable/fenceable creates latencies large enough to miss alternate
> >>>>>> vblanks thereby producing less optimal framerate.
> >>>>>>
> >>>>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >>>>>> is called to pin one of the FB objs, the associated vma is identified
> >>>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> >>>>>> evicts it. This misplaced vma gets subsequently pinned only when
> >>>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >>>>>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> >>>>>> Therefore, to fix this issue, we try to see if there is space to map
> >>>>>> at-least two objects of a given size and return early if there isn't. This
> >>>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >>>>>> are too big to map thereby preventing unnecessary unbind.
> >>>>>>
> >>>>>> Testcase:
> >>>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >>>>>> with an 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> >>>>>> a frame ~7ms before the next vblank, the latencies seen between atomic
> >>>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> >>>>>> it misses the vblank every other frame.
> >>>>>>
> >>>>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>>>>>                  i915_gem_object_pin_to_display_plane() {
> >>>>>> 0.102 us   |    i915_gem_object_set_cache_level();
> >>>>>>                    i915_gem_object_ggtt_pin_ww() {
> >>>>>> 0.390 us   |      i915_vma_instance();
> >>>>>> 0.178 us   |      i915_vma_misplaced();
> >>>>>>                      i915_vma_unbind() {
> >>>>>>                      __i915_active_wait() {
> >>>>>> 0.082 us   |        i915_active_acquire_if_busy();
> >>>>>> 0.475 us   |      }
> >>>>>>                      intel_runtime_pm_get() {
> >>>>>> 0.087 us   |        intel_runtime_pm_acquire();
> >>>>>> 0.259 us   |      }
> >>>>>>                      __i915_active_wait() {
> >>>>>> 0.085 us   |        i915_active_acquire_if_busy();
> >>>>>> 0.240 us   |      }
> >>>>>>                      __i915_vma_evict() {
> >>>>>>                        ggtt_unbind_vma() {
> >>>>>>                          gen8_ggtt_clear_range() {
> >>>>>> 10507.255 us |        }
> >>>>>> 10507.689 us |      }
> >>>>>> 10508.516 us |   }
> >>>>>>
> >>>>>> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>>>>>        buffer is too big by checking to see if it is possible to map
> >>>>>>        two of them into the ggtt.
> >>>>>>
> >>>>>> v3 (Ville):
> >>>>>> - Count how many fb objects can be fit into the available holes
> >>>>>>      instead of checking for a hole twice the object size.
> >>>>>> - Take alignment constraints into account.
> >>>>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>>>>>
> >>>>>> v4:
> >>>>>> - Remove existing heuristic that checks just for size. (Ville)
> >>>>>> - Return early if we find space to map at-least two objects. (Tvrtko)
> >>>>>> - Slightly update the commit message.
> >>>>>>
> >>>>>> v5: (Tvrtko)
> >>>>>> - Rename the function to indicate that the object may be too big to
> >>>>>>      map into the aperture.
> >>>>>> - Account for guard pages while calculating the total size required
> >>>>>>      for the object.
> >>>>>> - Do not subject all objects to the heuristic check and instead
> >>>>>>      consider objects only of a certain size.
> >>>>>> - Do the hole walk using the rbtree.
> >>>>>> - Preserve the existing PIN_NONBLOCK logic.
> >>>>>> - Drop the PIN_MAPPABLE check while pinning the VMA.
> >>>>>>
> >>>>>> v6: (Tvrtko)
> >>>>>> - Return 0 on success and the specific error code on failure to
> >>>>>>      preserve the existing behavior.
> >>>>>>
> >>>>>> v7: (Ville)
> >>>>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> >>>>>>      size < ggtt->mappable_end / 4 checks.
> >>>>>> - Drop the redundant check that is based on previous heuristic.
> >>>>>>
> >>>>>> v8:
> >>>>>> - Make sure that we are holding the mutex associated with ggtt vm
> >>>>>>      as we traverse the hole nodes.
> >>>>>>
> >>>>>> v9: (Tvrtko)
> >>>>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>>>>>
> >>>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>>>> Cc: Manasi Navare <manasi.d.navare@intel.com>
> >>>>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> >>>>>>     1 file changed, 94 insertions(+), 34 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>> index 9747924cc57b..e0d731b3f215 100644
> >>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
> >>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>> @@ -49,6 +49,7 @@
> >>>>>>     #include "gem/i915_gem_pm.h"
> >>>>>>     #include "gem/i915_gem_region.h"
> >>>>>>     #include "gem/i915_gem_userptr.h"
> >>>>>> +#include "gem/i915_gem_tiling.h"
> >>>>>>     #include "gt/intel_engine_user.h"
> >>>>>>     #include "gt/intel_gt.h"
> >>>>>>     #include "gt/intel_gt_pm.h"
> >>>>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> >>>>>>            spin_unlock(&obj->vma.lock);
> >>>>>>     }
> >>>>>>
> >>>>>> +static int
> >>>>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> >>>>>> +                                u64 alignment, u64 flags)
> >>>>>
> >>>>> Tvrtko asked me to ack the first patch, but then I looked at this and
> >>>>> started wondering.
> >>>>>
> >>>>> Conceptually this doesn't pass the smell test. What if we have
> >>>>> multiple per-crtc buffers? Multiple planes on the same crtc? What if
> >>>>> the app does triple buffer? You'll be forever busy tuning this
> >>>>> heuristics, which can't fundamentally be fixed I think. The old "half
> >>>>> of mappable" heuristic isn't really better, but at least it was dead
> >>>>> simple.
> >>>>>
> >>>>> Imo what we need here is a change in approach:
> >>>>> 1. Check whether the useable view for scanout exists already. If yes,
> >>>>> use that. This should avoid the constant unbinding stalls.
> >>>>> 2. Try to pin the buffer into mappable, but without evicting anything (so
> >>>>> not the non-blocking thing)
> >>>>> 3. Pin the buffer with the most lenient approach
> >>>>>
> >>>>> Even the non-blocking interim stage is dangerous, since it'll just
> >>>>> result in other buffers (e.g. when triple-buffering) getting unbound
> >>>>> and we're back to the same stall. Note that this could have an impact
> >>>>> on cpu rendering compositors, where we might end up relying a lot more
> >>>>> on partial views. But as long as we are a tad more aggressive (i.e. the
> >>>>> non-blocking binding) in the mmap path that should work out to keep
> >>>>> everything balanced, since usually you render first before you display
> >>>>> anything. And so the buffer should end up in the ideal place.
> >>>>>
> >>>>> I'd try to first skip the 2. step since I think it'll require a bit of
> >>>>> work, and frankly I don't think we care about the potential fallout.
> >>>>
> >>>> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
> >>>> respecting this comment from i915_gem_object_pin_to_display_plane:
> >>>>
> >>>> 	/*
> >>>> 	 * As the user may map the buffer once pinned in the display plane
> >>>> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> >>>> 	 * always use map_and_fenceable for all scanout buffers. However,
> >>>> 	 * it may simply be too big to fit into mappable, in which case
> >>>> 	 * put it anyway and hope that userspace can cope (but always first
> >>>> 	 * try to preserve the existing ABI).
> >>>> 	 */
> >>> [Kasireddy, Vivek] Digging further, this is what the commit message that added
> >>> the above comment says:
> >>> commit 2efb813d5388e18255c54afac77bd91acd586908
> >>> Author: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Date:   Thu Aug 18 17:17:06 2016 +0100
> >>>
> >>>       drm/i915: Fallback to using unmappable memory for scanout
> >>>
> >>>       The existing ABI says that scanouts are pinned into the mappable region
> >>>       so that legacy clients (e.g. old Xorg or plymouthd) can write directly
> >>>       into the scanout through a GTT mapping. However if the surface does not
> >>>       fit into the mappable region, we are better off just trying to fit it
> >>>       anywhere and hoping for the best. (Any userspace that is capable of
> >>>       using ginormous scanouts is also likely not to rely on pure GTT
> >>>       updates.) With the partial vma fault support, we are no longer
> >>>       restricted to only using scanouts that we can pin (though it is still
> >>>       preferred for performance reasons and for powersaving features like
> >>>       FBC).
> >>>
> >>>>
> >>>> By a quick look, for this case it appears we would end up creating partial views for CPU
> >>>> access (since the normal mapping would be busy/unpinnable). Worst case for this is to
> >>>> create a bunch of 1MiB VMAs so something to check would be how long those persist in
> >>>> memory before they get released. Or perhaps the bootup splash use case is not common
> >>>> these days?
> >>> [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
> >>> Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
> >>> drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
> >>> would not be a problem on ADL-S where there is space in mappable for one 8K FB.
> >>>
> >>
> >> FBC is a good point - correct me if I am wrong, but if we dropped trying
> >> to map in aperture by default it looks like we would lose it and that
> >> would be a significant power regression. In which case it doesn't seem
> >> like that would be an option.
> > [Kasireddy, Vivek] Ok, makes sense.
> >
> >>
> >> Which I think leaves us with _some_ heuristics in any case.
> >>
> >> 1) N-holes heuristics.
> >>
> >> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some
> >> percentage of aperture.
> >>
> >> Could this solve the 8k issue, most of the time, maybe? Could the
> >> current "aperture / 2" test be expressed generically in some terms? Like
> >> "(aperture - 10% (or some absolute value)) / 2" to account for non-fb
> >> objects? I forgot what you said the relationship between aperture size
> >> and 8k fb size was.
> >>
> >> 3) Don't evict for PIN_MAPPABLE mismatches when
> >> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
> >> i915_gem_object_pin_to_display_plane. Assumption being if we ended up
> >> with a non-mappable fb to start with, we must not try to re-bind it or
> >> we risk ping-pong latencies.
> >>
> >> The last would I guess need to distinguish between PIN_MAPPABLE passed
> >> in versus opportunistically added by i915_gem_object_pin_to_display_plane.
> >>
> >> How intrusive would it be to implement this option I am not sure without
> >> trying myself.
> > [Kasireddy, Vivek] I suspect I might be missing something, but could it not be
> > as simple as below:
> > @@ -940,7 +940,8 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >                                  return ERR_PTR(-ENOSPC);
> >
> >                          if (flags & PIN_MAPPABLE &&
> > -                           vma->fence_size > ggtt->mappable_end / 2)
> > +                           (vma->fence_size > ggtt->mappable_end / 2 ||
> > +                           !i915_vma_is_map_and_fenceable(vma)))
> >                                      return ERR_PTR(-ENOSPC);
> >                  }
> 
> Looks like this would work...
> 
> >>
> >>> Given this, do you think it would work if we just preserve the existing behavior and
> >>> tweak the heuristic introduced in this patch to look for space in aperture for only
> >>> one FB instead of two? Or, is there no good option for solving this issue other than
> >>> to create 1MB VMAs?
> >>
> >> I did not get how having one hole would solve the issue. Wouldn't it
> >> still hit the re-bind ping-pong? Or there isn't even a single hole for
> >> 8k fb typically?
> > [Kasireddy, Vivek] IIUC, Mesa gives Weston a max of 4 backbuffers but it
> > almost always uses only 2 except when it needs to share the FB -- with a plugin
> > such as "remoting" for desktop streaming.
> > Given the common use-case, let's assume there are two 8K FBs: FB1 and FB2.
> > FB1 is mappable/fenceable and therefore not misplaced.
> > FB2 is NOT mappable and hence identified as misplaced
> > (because it fails the check
> > (flags & PIN_MAPPABLE && !i915_vma_is_map_and_fenceable(vma))).
> >
> > As you suggest in 3) above the goal is to ensure that FB2 does not get evicted
> > when we try to pin with PIN_MAPPABLE -- after it gets identified as misplaced.
> > Or, alternatively, when we pin with PIN_MAPPABLE, we could just check to
> > see if there is space in aperture for only FB2 (N = 1) and return early -- before
> > even getting to i915_vma_misplaced(). As you can see, we avoid the ping-pong
> > issue in both these cases.
> 
> ... got it, yes, it seems both approaches work for this use case.
> 
> Not sure that I have a preference between the two approaches at this point.
> 
> Both would be behind a "PIN_MAPPABLE && PIN_NONBLOCK" check, so both
> would only apply to opportunistic PIN_MAPPABLE attempts. That is, any
> caller who only passes PIN_MAPPABLE would be unaffected which is what we
> want.
> 
> The extra i915_vma_is_map_and_fenceable check I guess is simpler and
> self-contained. I assume you have a test setup and can try it out to
> check it really works?
[Kasireddy, Vivek] Yes, it works; my testcase just involves running Weston 
with a mode of 8K@60 on ADL-S and checking the FPS of the sample client
weston-simple-egl. With the fix included, the perf improves to 59 FPS from
40 FPS. I'll send out a new patch for review soon.
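
As a rough sanity check on the ~40 FPS figure, the small sketch below works through the arithmetic under one simplifying assumption that is mine and not stated in the patch: frames alternate between landing on the next vblank and landing one vblank late. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Average frame rate when frames alternate between taking one refresh
 * period and taking periods_on_miss refresh periods to reach the screen.
 * With periods_on_miss == 1 no vblank is ever missed.
 */
double model_avg_fps(double refresh_hz, unsigned int periods_on_miss)
{
	assert(refresh_hz > 0.0);

	double period_ms = 1000.0 / refresh_hz;

	/* Time for one on-time frame plus one late frame. */
	double two_frames_ms = period_ms + periods_on_miss * period_ms;

	/* Two frames are shown in two_frames_ms milliseconds. */
	return 2.0 * 1000.0 / two_frames_ms;
}
```

At 60 Hz, model_avg_fps(60.0, 2) comes out at about 40, which matches the behaviour before the fix, and model_avg_fps(60.0, 1) gives 60, close to the 59 FPS measured with the fix.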

Oh, btw, do you think it is now pointless to merge the drm/mm patch that adds
the iterator given that we'd no longer have the i915 patch that uses it?

Thanks,
Vivek
> 
> > The current version of this patch -- when running Weston -- puts both FB1
> > and FB2 (N = 2) outside of aperture although there may be space for FB1.
> > I don't think this makes sense anymore given Plymouth's single-buffer
> > use-case that uses dirtyfb ioctl.
> 
> Yes agreed, it sounds preferable to preserve the current behaviour there.
> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
@ 2022-03-17  7:08                 ` Kasireddy, Vivek
  0 siblings, 0 replies; 31+ messages in thread
From: Kasireddy, Vivek @ 2022-03-17  7:08 UTC (permalink / raw)
  To: Tvrtko Ursulin, Daniel Vetter; +Cc: intel-gfx, dri-devel

Hi Tvrtko,

> 
> On 16/03/2022 07:37, Kasireddy, Vivek wrote:
> > Hi Tvrtko,
> >
> >>
> >> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> >>> Hi Tvrtko, Daniel,
> >>>
> >>>>
> >>>> On 11/03/2022 09:39, Daniel Vetter wrote:
> >>>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> >>>>>>
> >>>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >>>>>> more framebuffers/scanout buffers results in only one that is mappable/
> >>>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >>>>>> is mappable/fenceable creates latencies large enough to miss alternate
> >>>>>> vblanks thereby producing less optimal framerate.
> >>>>>>
> >>>>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >>>>>> is called to pin one of the FB objs, the associated vma is identified
> >>>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> >>>>>> evicts it. This misplaced vma gets subsequently pinned only when
> >>>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >>>>>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> >>>>>> Therefore, to fix this issue, we try to see if there is space to map
> >>>>>> at-least two objects of a given size and return early if there isn't. This
> >>>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >>>>>> are too big to map thereby preventing unnecessary unbind.
> >>>>>>
> >>>>>> Testcase:
> >>>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >>>>>> with an 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> >>>>>> a frame ~7ms before the next vblank, the latencies seen between atomic
> >>>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> >>>>>> it misses the vblank every other frame.
> >>>>>>
> >>>>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>>>>>                  i915_gem_object_pin_to_display_plane() {
> >>>>>> 0.102 us   |    i915_gem_object_set_cache_level();
> >>>>>>                    i915_gem_object_ggtt_pin_ww() {
> >>>>>> 0.390 us   |      i915_vma_instance();
> >>>>>> 0.178 us   |      i915_vma_misplaced();
> >>>>>>                      i915_vma_unbind() {
> >>>>>>                      __i915_active_wait() {
> >>>>>> 0.082 us   |        i915_active_acquire_if_busy();
> >>>>>> 0.475 us   |      }
> >>>>>>                      intel_runtime_pm_get() {
> >>>>>> 0.087 us   |        intel_runtime_pm_acquire();
> >>>>>> 0.259 us   |      }
> >>>>>>                      __i915_active_wait() {
> >>>>>> 0.085 us   |        i915_active_acquire_if_busy();
> >>>>>> 0.240 us   |      }
> >>>>>>                      __i915_vma_evict() {
> >>>>>>                        ggtt_unbind_vma() {
> >>>>>>                          gen8_ggtt_clear_range() {
> >>>>>> 10507.255 us |        }
> >>>>>> 10507.689 us |      }
> >>>>>> 10508.516 us |   }
> >>>>>>
> >>>>>> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>>>>>        buffer is too big by checking to see if it is possible to map
> >>>>>>        two of them into the ggtt.
> >>>>>>
> >>>>>> v3 (Ville):
> >>>>>> - Count how many fb objects can be fit into the available holes
> >>>>>>      instead of checking for a hole twice the object size.
> >>>>>> - Take alignment constraints into account.
> >>>>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>>>>>
> >>>>>> v4:
> >>>>>> - Remove existing heuristic that checks just for size. (Ville)
> >>>>>> - Return early if we find space to map at-least two objects. (Tvrtko)
> >>>>>> - Slightly update the commit message.
> >>>>>>
> >>>>>> v5: (Tvrtko)
> >>>>>> - Rename the function to indicate that the object may be too big to
> >>>>>>      map into the aperture.
> >>>>>> - Account for guard pages while calculating the total size required
> >>>>>>      for the object.
> >>>>>> - Do not subject all objects to the heuristic check and instead
> >>>>>>      consider objects only of a certain size.
> >>>>>> - Do the hole walk using the rbtree.
> >>>>>> - Preserve the existing PIN_NONBLOCK logic.
> >>>>>> - Drop the PIN_MAPPABLE check while pinning the VMA.
> >>>>>>
> >>>>>> v6: (Tvrtko)
> >>>>>> - Return 0 on success and the specific error code on failure to
> >>>>>>      preserve the existing behavior.
> >>>>>>
> >>>>>> v7: (Ville)
> >>>>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> >>>>>>      size < ggtt->mappable_end / 4 checks.
> >>>>>> - Drop the redundant check that is based on previous heuristic.
> >>>>>>
> >>>>>> v8:
> >>>>>> - Make sure that we are holding the mutex associated with ggtt vm
> >>>>>>      as we traverse the hole nodes.
> >>>>>>
> >>>>>> v9: (Tvrtko)
> >>>>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>>>>>
> >>>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>>>> Cc: Manasi Navare <manasi.d.navare@intel.com>
> >>>>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> >>>>>>     1 file changed, 94 insertions(+), 34 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>> index 9747924cc57b..e0d731b3f215 100644
> >>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
> >>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>>>> @@ -49,6 +49,7 @@
> >>>>>>     #include "gem/i915_gem_pm.h"
> >>>>>>     #include "gem/i915_gem_region.h"
> >>>>>>     #include "gem/i915_gem_userptr.h"
> >>>>>> +#include "gem/i915_gem_tiling.h"
> >>>>>>     #include "gt/intel_engine_user.h"
> >>>>>>     #include "gt/intel_gt.h"
> >>>>>>     #include "gt/intel_gt_pm.h"
> >>>>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> >>>>>>            spin_unlock(&obj->vma.lock);
> >>>>>>     }
> >>>>>>
> >>>>>> +static int
> >>>>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> >>>>>> +                                u64 alignment, u64 flags)
> >>>>>
> >>>>> Tvrtko asked me to ack the first patch, but then I looked at this and
> >>>>> started wondering.
> >>>>>
> >>>>> Conceptually this doesn't pass the smell test. What if we have
> >>>>> multiple per-crtc buffers? Multiple planes on the same crtc? What if
> >>>>> the app does triple buffer? You'll be forever busy tuning this
> >>>>> heuristics, which can't fundamentally be fixed I think. The old "half
> >>>>> of mappable" heuristic isn't really better, but at least it was dead
> >>>>> simple.
> >>>>>
> >>>>> Imo what we need here is a change in approach:
> >>>>> 1. Check whether the useable view for scanout exists already. If yes,
> >>>>> use that. This should avoid the constant unbinding stalls.
> >>>>> 2. Try to pin the buffer into mappable, but without evicting anything (so
> >>>>> not the non-blocking thing)
> >>>>> 3. Pin the buffer with the most lenient approach
> >>>>>
> >>>>> Even the non-blocking interim stage is dangerous, since it'll just
> >>>>> result in other buffers (e.g. when triple-buffering) getting unbound
> >>>>> and we're back to the same stall. Note that this could have an impact
> >>>>> on cpu rendering compositors, where we might end up relying a lot more
> >>>>> on partial views. But as long as we are a tad more aggressive (i.e. the
> >>>>> non-blocking binding) in the mmap path that should work out to keep
> >>>>> everything balanced, since usually you render first before you display
> >>>>> anything. And so the buffer should end up in the ideal place.
> >>>>>
> >>>>> I'd try to first skip the 2. step since I think it'll require a bit of
> >>>>> work, and frankly I don't think we care about the potential fallout.
> >>>>
> >>>> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
> >>>> respecting this comment from i915_gem_object_pin_to_display_plane:
> >>>>
> >>>> 	/*
> >>>> 	 * As the user may map the buffer once pinned in the display plane
> >>>> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> >>>> 	 * always use map_and_fenceable for all scanout buffers. However,
> >>>> 	 * it may simply be too big to fit into mappable, in which case
> >>>> 	 * put it anyway and hope that userspace can cope (but always first
> >>>> 	 * try to preserve the existing ABI).
> >>>> 	 */
> >>> [Kasireddy, Vivek] Digging further, this is what the commit message that added
> >>> the above comment says:
> >>> commit 2efb813d5388e18255c54afac77bd91acd586908
> >>> Author: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Date:   Thu Aug 18 17:17:06 2016 +0100
> >>>
> >>>       drm/i915: Fallback to using unmappable memory for scanout
> >>>
> >>>       The existing ABI says that scanouts are pinned into the mappable region
> >>>       so that legacy clients (e.g. old Xorg or plymouthd) can write directly
> >>>       into the scanout through a GTT mapping. However if the surface does not
> >>>       fit into the mappable region, we are better off just trying to fit it
> >>>       anywhere and hoping for the best. (Any userspace that is capable of
> >>>       using ginormous scanouts is also likely not to rely on pure GTT
> >>>       updates.) With the partial vma fault support, we are no longer
> >>>       restricted to only using scanouts that we can pin (though it is still
> >>>       preferred for performance reasons and for powersaving features like
> >>>       FBC).
> >>>
> >>>>
> >>>> By a quick look, for this case it appears we would end up creating partial views for CPU
> >>>> access (since the normal mapping would be busy/unpinnable). Worst case for this is to
> >>>> create a bunch of 1MiB VMAs so something to check would be how long those persist in
> >>>> memory before they get released. Or perhaps the bootup splash use case is not common
> >>>> these days?
> >>> [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
> >>> Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
> >>> drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
> >>> would not be a problem on ADL-S where there is space in mappable for one 8K FB.
> >>>
> >>
> >> FBC is a good point - correct me if I am wrong, but if we dropped trying
> >> to map in aperture by default it looks like we would lose it and that
> >> would be a significant power regression. In which case it doesn't seem
> >> like that would be an option.
> > [Kasireddy, Vivek] Ok, makes sense.
> >
> >>
> >> Which I think leaves us with _some_ heuristics in any case.
> >>
> >> 1) N-holes heuristics.
> >>
> >> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some
> >> percentage of aperture.
> >>
> >> Could this solve the 8k issue, most of the time, maybe? Could the
> >> current "aperture / 2" test be expressed generically in some terms? Like
> >> "(aperture - 10% (or some absolute value)) / 2" to account for non-fb
> >> objects? I forgot what you said the relationship between aperture size
> >> and 8k fb size was.
> >>
> >> 3) Don't evict for PIN_MAPPABLE mismatches when
> >> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
> >> i915_gem_object_pin_to_display_plane. Assumption being if we ended up
> >> with a non-mappable fb to start with, we must not try to re-bind it or
> >> we risk ping-pong latencies.
> >>
> >> The last would I guess need to distinguish between PIN_MAPPABLE passed
> >> in versus opportunistically added by i915_gem_object_pin_to_display_plane.
> >>
> >> How intrusive would it be to implement this option I am not sure without
> >> trying myself.
> > [Kasireddy, Vivek] I suspect I might be missing something, but could it not be
> > as simple as below:
> > @@ -940,7 +940,8 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >                                  return ERR_PTR(-ENOSPC);
> >
> >                          if (flags & PIN_MAPPABLE &&
> > -                           vma->fence_size > ggtt->mappable_end / 2)
> > +                           (vma->fence_size > ggtt->mappable_end / 2 ||
> > +                           !i915_vma_is_map_and_fenceable(vma)))
> >                                      return ERR_PTR(-ENOSPC);
> >                  }
> 
> Looks like this would work...
> 
> >>
> >>> Given this, do you think it would work if we just preserve the existing behavior and
> >>> tweak the heuristic introduced in this patch to look for space in aperture for only
> >>> one FB instead of two? Or, is there no good option for solving this issue other than
> >>> to create 1MB VMAs?
> >>
> >> I did not get how having one hole would solve the issue. Wouldn't it
> >> still hit the re-bind ping-pong? Or there isn't even a single hole for
> >> 8k fb typically?
> > [Kasireddy, Vivek] IIUC, Mesa gives Weston a max of 4 backbuffers but it
> > almost always uses only 2 except when it needs to share the FB -- with a plugin
> > such as "remoting" for desktop streaming.
> > Given the common use-case, lets assume there are two 8K FBs: FB1 and FB2
> > FB1 is mappable/fenceable and therefore not misplaced.
> > FB2 is NOT mappable and hence identified as misplaced
> > (because it fails the check
> > (flags & PIN_MAPPABLE && !i915_vma_is_map_and_fenceable(vma))).
> >
> > As you suggest in 3) above the goal is to ensure that FB2 does not get evicted
> > when we try to pin with PIN_MAPPABLE -- after it gets identified as misplaced.
> > Or, alternatively, when we pin with PIN_MAPPABLE, we could just check to
> > see if there is space in aperture for only FB2 (N = 1) and return early -- before
> > even getting to i915_vma_misplaced(). As you can see, we avoid the ping-pong
> > issue in both these cases.
> 
> ... got it, yes, it seems both approaches work for this use case.
> 
> Not sure that I have a preference between the two approaches at this point.
> 
> Both would be behind a "PIN_MAPPABLE && PIN_NONBLOCK" check, so both
> would only apply to opportunistic PIN_MAPPABLE attempts. That is, any
> caller who only passes PIN_MAPPABLE would be unaffected which is what we
> want.
> 
> The extra i915_vma_is_map_and_fenceable check I guess is simpler and
> self-contained. I assume you have a test setup and can try it out to
> check it really works?
[Kasireddy, Vivek] Yes, it works; my testcase just involves running Weston 
with a mode of 8K@60 on ADL-S and checking the FPS of the sample client
weston-simple-egl. With the fix included, the perf improves to 59 FPS from
40 FPS. I'll send out a new patch for review soon.
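
For reference, the condition being tested can be modeled outside the driver.
This is only a sketch under assumptions: the MODEL_* flag values,
struct model_vma and model_misplaced_check() are made-up stand-ins and not
i915 definitions, and the function assumes the vma has already been found
misplaced. The driver's additional pinned/active check under PIN_NONBLOCK is
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the i915 pin flags; values are illustrative. */
#define MODEL_PIN_NONBLOCK (1u << 0)
#define MODEL_PIN_MAPPABLE (1u << 1)

/* Hypothetical, trimmed-down view of the vma state the check looks at. */
struct model_vma {
	uint64_t fence_size;     /* size needed for a fenced mapping */
	bool map_and_fenceable;  /* current binding is mappable and fenceable */
};

/*
 * Called once a vma has been found misplaced. Returns 0 if the caller may
 * go on to unbind and re-pin it, or -ENOSPC if an opportunistic mappable
 * pin should give up early and leave the existing binding alone.
 */
static int model_misplaced_check(const struct model_vma *vma,
				 uint64_t mappable_end, unsigned int flags)
{
	assert(vma != NULL);

	/* Callers that pass only PIN_MAPPABLE are unaffected. */
	if (!(flags & MODEL_PIN_NONBLOCK))
		return 0;

	/* Old size heuristic, plus the proposed map_and_fenceable test. */
	if ((flags & MODEL_PIN_MAPPABLE) &&
	    (vma->fence_size > mappable_end / 2 ||
	     !vma->map_and_fenceable))
		return -ENOSPC;

	return 0;
}
```

In this model FB2 (bound, but not map_and_fenceable) gets -ENOSPC when pinned
with both flags set, so it is never unbound, while a caller passing only
PIN_MAPPABLE still falls through to the rebind path.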

Oh, btw, do you think it is now pointless to merge the drm/mm patch that adds
the iterator, given that we'd no longer have the i915 patch that uses it?

Thanks,
Vivek
> 
> > The current version of this patch -- when running Weston -- puts both FB1
> > and FB2 (N = 2) outside of aperture although there may be space for FB1.
> > I don't think this makes sense anymore given Plymouth's single-buffer
> > use-case that uses dirtyfb ioctl.
> 
> Yes agreed, it sounds preferable to preserve the current behaviour there.
> 
> Regards,
> 
> Tvrtko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-15  9:45         ` Tvrtko Ursulin
@ 2022-03-17  9:47             ` Daniel Vetter
  2022-03-17  9:47             ` Daniel Vetter
  1 sibling, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-17  9:47 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: dri-devel, intel-gfx, Kasireddy, Vivek

On Tue, Mar 15, 2022 at 09:45:20AM +0000, Tvrtko Ursulin wrote:
> 
> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > Hi Tvrtko, Daniel,
> > 
> > > 
> > > On 11/03/2022 09:39, Daniel Vetter wrote:
> > > > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> > > > > 
> > > > > On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> > > > > more framebuffers/scanout buffers results in only one that is mappable/
> > > > > fenceable. Therefore, pageflipping between these 2 FBs where only one
> > > > > is mappable/fenceable creates latencies large enough to miss alternate
> > > > > vblanks thereby producing less optimal framerate.
> > > > > 
> > > > > This mainly happens because when i915_gem_object_pin_to_display_plane()
> > > > > is called to pin one of the FB objs, the associated vma is identified
> > > > > as misplaced and therefore i915_vma_unbind() is called which unbinds and
> > > > > > evicts it. This misplaced vma gets subsequently pinned only when
> > > > > i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> > > > > results in a latency of ~10ms and happens every other vblank/repaint cycle.
> > > > > Therefore, to fix this issue, we try to see if there is space to map
> > > > > at-least two objects of a given size and return early if there isn't. This
> > > > > would ensure that we do not try with PIN_MAPPABLE for any objects that
> > > > > > are too big to map thereby preventing unnecessary unbind.
> > > > > 
> > > > > Testcase:
> > > > > Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> > > > > with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> > > > > a frame ~7ms before the next vblank, the latencies seen between atomic
> > > > > commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> > > > > it misses the vblank every other frame.
> > > > > 
> > > > > Here is the ftrace snippet that shows the source of the ~10ms latency:
> > > > >                 i915_gem_object_pin_to_display_plane() {
> > > > > 0.102 us   |    i915_gem_object_set_cache_level();
> > > > >                   i915_gem_object_ggtt_pin_ww() {
> > > > > 0.390 us   |      i915_vma_instance();
> > > > > 0.178 us   |      i915_vma_misplaced();
> > > > >                     i915_vma_unbind() {
> > > > >                     __i915_active_wait() {
> > > > > 0.082 us   |        i915_active_acquire_if_busy();
> > > > > 0.475 us   |      }
> > > > >                     intel_runtime_pm_get() {
> > > > > 0.087 us   |        intel_runtime_pm_acquire();
> > > > > 0.259 us   |      }
> > > > >                     __i915_active_wait() {
> > > > > 0.085 us   |        i915_active_acquire_if_busy();
> > > > > 0.240 us   |      }
> > > > >                     __i915_vma_evict() {
> > > > >                       ggtt_unbind_vma() {
> > > > >                         gen8_ggtt_clear_range() {
> > > > > 10507.255 us |        }
> > > > > 10507.689 us |      }
> > > > > 10508.516 us |   }
> > > > > 
> > > > > v2: Instead of using bigjoiner checks, determine whether a scanout
> > > > >       buffer is too big by checking to see if it is possible to map
> > > > >       two of them into the ggtt.
> > > > > 
> > > > > v3 (Ville):
> > > > > - Count how many fb objects can be fit into the available holes
> > > > >     instead of checking for a hole twice the object size.
> > > > > - Take alignment constraints into account.
> > > > > - Limit this large scanout buffer check to >= Gen 11 platforms.
> > > > > 
> > > > > v4:
> > > > > - Remove existing heuristic that checks just for size. (Ville)
> > > > > - Return early if we find space to map at-least two objects. (Tvrtko)
> > > > > - Slightly update the commit message.
> > > > > 
> > > > > v5: (Tvrtko)
> > > > > - Rename the function to indicate that the object may be too big to
> > > > >     map into the aperture.
> > > > > - Account for guard pages while calculating the total size required
> > > > >     for the object.
> > > > > - Do not subject all objects to the heuristic check and instead
> > > > >     consider objects only of a certain size.
> > > > > - Do the hole walk using the rbtree.
> > > > > - Preserve the existing PIN_NONBLOCK logic.
> > > > > - Drop the PIN_MAPPABLE check while pinning the VMA.
> > > > > 
> > > > > v6: (Tvrtko)
> > > > > - Return 0 on success and the specific error code on failure to
> > > > >     preserve the existing behavior.
> > > > > 
> > > > > v7: (Ville)
> > > > > - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> > > > >     size < ggtt->mappable_end / 4 checks.
> > > > > - Drop the redundant check that is based on previous heuristic.
> > > > > 
> > > > > v8:
> > > > > - Make sure that we are holding the mutex associated with ggtt vm
> > > > >     as we traverse the hole nodes.
> > > > > 
> > > > > v9: (Tvrtko)
> > > > > - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> > > > > 
> > > > > Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > > > > Cc: Manasi Navare <manasi.d.navare@intel.com>
> > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> > > > >    1 file changed, 94 insertions(+), 34 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > > > index 9747924cc57b..e0d731b3f215 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > > > @@ -49,6 +49,7 @@
> > > > >    #include "gem/i915_gem_pm.h"
> > > > >    #include "gem/i915_gem_region.h"
> > > > >    #include "gem/i915_gem_userptr.h"
> > > > > +#include "gem/i915_gem_tiling.h"
> > > > >    #include "gt/intel_engine_user.h"
> > > > >    #include "gt/intel_gt.h"
> > > > >    #include "gt/intel_gt_pm.h"
> > > > > @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> > > > >           spin_unlock(&obj->vma.lock);
> > > > >    }
> > > > > 
> > > > > +static int
> > > > > +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> > > > > +                                u64 alignment, u64 flags)
> > > > 
> > > > Tvrtko asked me to ack the first patch, but then I looked at this and
> > > > started wondering.
> > > > 
> > > > Conceptually this doesn't pass the smell test. What if we have
> > > > multiple per-crtc buffers? Multiple planes on the same crtc? What if
> > > > the app does triple buffer? You'll be forever busy tuning this
> > > > heuristics, which can't fundamentally be fixed I think. The old "half
> > > > of mappable" heuristic isn't really better, but at least it was dead
> > > > simple.
> > > > 
> > > > Imo what we need here is a change in approach:
> > > > 1. Check whether the useable view for scanout exists already. If yes,
> > > > use that. This should avoid the constant unbinding stalls.
> > > > > 2. Try to pin the buffer into mappable, but without evicting anything (so
> > > > not the non-blocking thing)
> > > > 3. Pin the buffer with the most lenient approach
> > > > 
> > > > Even the non-blocking interim stage is dangerous, since it'll just
> > > > result in other buffers (e.g. when triple-buffering) getting unbound
> > > > and we're back to the same stall. Note that this could have an impact
> > > > > on cpu rendering compositors, where we might end up relying a lot more
> > > > > on partial views. But as long as we are a tad more aggressive (i.e. the
> > > > non-blocking binding) in the mmap path that should work out to keep
> > > > everything balanced, since usually you render first before you display
> > > > anything. And so the buffer should end up in the ideal place.
> > > > 
> > > > I'd try to first skip the 2. step since I think it'll require a bit of
> > > > work, and frankly I don't think we care about the potential fallout.
> > > 
> > > To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
> > > respecting this comment from i915_gem_object_pin_to_display_plane:
> > > 
> > > 	/*
> > > 	 * As the user may map the buffer once pinned in the display plane
> > > 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> > > 	 * always use map_and_fenceable for all scanout buffers. However,
> > > 	 * it may simply be too big to fit into mappable, in which case
> > > 	 * put it anyway and hope that userspace can cope (but always first
> > > 	 * try to preserve the existing ABI).
> > > 	 */
> > [Kasireddy, Vivek] Digging further, this is what the commit message that added
> > the above comment says:
> > commit 2efb813d5388e18255c54afac77bd91acd586908
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Thu Aug 18 17:17:06 2016 +0100
> > 
> >      drm/i915: Fallback to using unmappable memory for scanout
> > 
> >      The existing ABI says that scanouts are pinned into the mappable region
> >      so that legacy clients (e.g. old Xorg or plymouthd) can write directly
> >      into the scanout through a GTT mapping. However if the surface does not
> >      fit into the mappable region, we are better off just trying to fit it
> >      anywhere and hoping for the best. (Any userspace that is capable of
> >      using ginormous scanouts is also likely not to rely on pure GTT
> >      updates.) With the partial vma fault support, we are no longer
> >      restricted to only using scanouts that we can pin (though it is still
> >      preferred for performance reasons and for powersaving features like
> >      FBC).
> > 
> > > 
> > > By a quick look, for this case it appears we would end up creating partial views for CPU
> > > access (since the normal mapping would be busy/unpinnable). Worst case for this is to
> > > create a bunch of 1MiB VMAs so something to check would be how long those persist in
> > > memory before they get released. Or perhaps the bootup splash use case is not common
> > > these days?
> > [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
> > Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
> > drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
> > would not be a problem on ADL-S where there is space in mappable for one 8K FB.
> > 
> 
> FBC is a good point - correct me if I am wrong, but if we dropped trying to
> map in aperture by default it looks like we would lose it and that would be
> a significant power regression. In which case it doesn't seem like that
> would be an option.

FBC fence is only required for frontbuffer hw tracking, which is another
thing that's somewhere between "meh" and "we should just sunset it
right away". I think that work has even been done.

So I wouldn't worry about this.

If you are worried, then I'd check with display folks whether we need
a platform based cut-off for this heuristic.

> Which I think leaves us with _some_ heuristics in any case.
> 
> 1) N-holes heuristics.
> 
> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some percentage
> of aperture.
> 
> Could this solve the 8k issue, most of the time, maybe? Could the current
> "aperture / 2" test be expressed generically in some terms? Like "(aperture
> - 10% (or some absolute value)) / 2" to account for non-fb objects? I forgot
> what you said the relationship between aperture size and 8k fb size was.
> 
> 3) Don't evict for PIN_MAPPABLE mismatches when
> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
> i915_gem_object_pin_to_display_plane. Assumption being if we ended up with a
> non-mappable fb to start with, we must not try to re-bind it or we risk
> ping-pong latencies.
> 
> The last would I guess need to distinguish between PIN_MAPPABLE passed in
> versus opportunistically added by i915_gem_object_pin_to_display_plane.
> 
> How intrusive would it be to implement this option I am not sure without
> trying myself.

This won't work, see my initial mail. All you need is triple buffering (or
multiple per-crtc buffers that flip):

1. fb A gets pinned as mappable
2. fb B gets pinned as mappable, fb A is unpinned
3. fb C gets pinned as mappable, we don't have space and end up evicting
fb A

Repeat, and you have exactly the same old eviction loop as with two
buffers. Not good.

Therefore for this to work we don't just need to make sure that we don't
move our own buffer, but also that we don't move any other buffer.

The downside of that is that if a buffer is ever misplaced as mappable, we
never fix up that mistake (at least not until the application entirely
destroys all the involved fb and bo). I think that's acceptable, but
definitely deserves a comment.
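
The loop above can be reproduced with a toy model. This is a sketch under
assumptions, not driver code: simulate_flips(), struct sim_result and the
slot-based view of the aperture are made up for illustration, framebuffers
flip round-robin, each one takes exactly one mappable slot, and the buffer
currently on screen is never chosen as a victim. Only evictions of other
buffers are counted.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FBS 8

enum placement { UNBOUND = 0, MAPPABLE, UNMAPPABLE };

struct sim_result {
	int evictions;        /* other buffers unbound to make room */
	int stuck_unmappable; /* buffers left outside mappable at the end */
};

static struct sim_result simulate_flips(int nr_fbs, int mappable_slots,
					int nr_flips, bool allow_evict)
{
	enum placement where[MAX_FBS] = { UNBOUND };
	int last_used[MAX_FBS] = { 0 };
	struct sim_result res = { 0, 0 };
	int used_slots = 0;

	assert(nr_fbs > 0 && nr_fbs <= MAX_FBS);

	for (int flip = 0; flip < nr_flips; flip++) {
		int fb = flip % nr_fbs;
		int on_screen = flip > 0 ? (flip - 1) % nr_fbs : -1;
		/* Without eviction, any existing binding is reused as-is. */
		bool keep = where[fb] == MAPPABLE ||
			    (where[fb] == UNMAPPABLE && !allow_evict);

		if (!keep && used_slots < mappable_slots) {
			where[fb] = MAPPABLE;
			used_slots++;
		} else if (!keep) {
			int victim = -1;

			/* Pick the least recently used mappable buffer. */
			for (int i = 0; allow_evict && i < nr_fbs; i++) {
				if (i == fb || i == on_screen ||
				    where[i] != MAPPABLE)
					continue;
				if (victim < 0 ||
				    last_used[i] < last_used[victim])
					victim = i;
			}

			if (victim >= 0) {
				where[victim] = UNBOUND;
				where[fb] = MAPPABLE;
				res.evictions++;
			} else {
				where[fb] = UNMAPPABLE;
			}
		}
		last_used[fb] = flip + 1;
	}

	for (int i = 0; i < nr_fbs; i++)
		if (where[i] == UNMAPPABLE)
			res.stuck_unmappable++;

	return res;
}
```

With three framebuffers and room for two in mappable,
simulate_flips(3, 2, 12, true) reports an eviction on every flip after the
second, while simulate_flips(3, 2, 12, false) reports none and leaves one
buffer outside mappable for good, which is the trade-off described above.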

Cheers, Daniel

> 
> > Given this, do you think it would work if we just preserve the existing behavior and
> > tweak the heuristic introduced in this patch to look for space in aperture for only
> > one FB instead of two? Or, is there no good option for solving this issue other than
> > to create 1MB VMAs?
> 
> I did not get how having one hole would solve the issue. Wouldn't it still
> hit the re-bind ping-pong? Or there isn't even a single hole for 8k fb
> typically?
> 
> Regards,
> 
> Tvrtko

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
@ 2022-03-17  9:47             ` Daniel Vetter
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-17  9:47 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: dri-devel, intel-gfx

On Tue, Mar 15, 2022 at 09:45:20AM +0000, Tvrtko Ursulin wrote:
> 
> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > Hi Tvrtko, Daniel,
> > 
> > > 
> > > On 11/03/2022 09:39, Daniel Vetter wrote:
> > > > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> > > > > 
> > > > > On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> > > > > more framebuffers/scanout buffers results in only one that is mappable/
> > > > > fenceable. Therefore, pageflipping between these 2 FBs where only one
> > > > > is mappable/fenceable creates latencies large enough to miss alternate
> > > > > vblanks thereby producing less optimal framerate.
> > > > > 
> > > > > This mainly happens because when i915_gem_object_pin_to_display_plane()
> > > > > is called to pin one of the FB objs, the associated vma is identified
> > > > > as misplaced and therefore i915_vma_unbind() is called which unbinds and
> > > > > > evicts it. This misplaced vma gets subsequently pinned only when
> > > > > i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> > > > > results in a latency of ~10ms and happens every other vblank/repaint cycle.
> > > > > Therefore, to fix this issue, we try to see if there is space to map
> > > > > at-least two objects of a given size and return early if there isn't. This
> > > > > would ensure that we do not try with PIN_MAPPABLE for any objects that
> > > > > > are too big to map thereby preventing unnecessary unbind.
> > > > > 
> > > > > Testcase:
> > > > > Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> > > > > with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> > > > > a frame ~7ms before the next vblank, the latencies seen between atomic
> > > > > commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> > > > > it misses the vblank every other frame.
> > > > > 
> > > > > Here is the ftrace snippet that shows the source of the ~10ms latency:
> > > > >                 i915_gem_object_pin_to_display_plane() {
> > > > > 0.102 us   |    i915_gem_object_set_cache_level();
> > > > >                   i915_gem_object_ggtt_pin_ww() {
> > > > > 0.390 us   |      i915_vma_instance();
> > > > > 0.178 us   |      i915_vma_misplaced();
> > > > >                     i915_vma_unbind() {
> > > > >                     __i915_active_wait() {
> > > > > 0.082 us   |        i915_active_acquire_if_busy();
> > > > > 0.475 us   |      }
> > > > >                     intel_runtime_pm_get() {
> > > > > 0.087 us   |        intel_runtime_pm_acquire();
> > > > > 0.259 us   |      }
> > > > >                     __i915_active_wait() {
> > > > > 0.085 us   |        i915_active_acquire_if_busy();
> > > > > 0.240 us   |      }
> > > > >                     __i915_vma_evict() {
> > > > >                       ggtt_unbind_vma() {
> > > > >                         gen8_ggtt_clear_range() {
> > > > > 10507.255 us |        }
> > > > > 10507.689 us |      }
> > > > > 10508.516 us |   }
> > > > > 
> > > > > v2: Instead of using bigjoiner checks, determine whether a scanout
> > > > >       buffer is too big by checking to see if it is possible to map
> > > > >       two of them into the ggtt.
> > > > > 
> > > > > v3 (Ville):
> > > > > - Count how many fb objects can be fit into the available holes
> > > > >     instead of checking for a hole twice the object size.
> > > > > - Take alignment constraints into account.
> > > > > - Limit this large scanout buffer check to >= Gen 11 platforms.
> > > > > 
> > > > > v4:
> > > > > - Remove existing heuristic that checks just for size. (Ville)
> > > > > - Return early if we find space to map at-least two objects. (Tvrtko)
> > > > > - Slightly update the commit message.
> > > > > 
> > > > > v5: (Tvrtko)
> > > > > - Rename the function to indicate that the object may be too big to
> > > > >     map into the aperture.
> > > > > - Account for guard pages while calculating the total size required
> > > > >     for the object.
> > > > > - Do not subject all objects to the heuristic check and instead
> > > > >     consider objects only of a certain size.
> > > > > - Do the hole walk using the rbtree.
> > > > > - Preserve the existing PIN_NONBLOCK logic.
> > > > > - Drop the PIN_MAPPABLE check while pinning the VMA.
> > > > > 
> > > > > v6: (Tvrtko)
> > > > > - Return 0 on success and the specific error code on failure to
> > > > >     preserve the existing behavior.
> > > > > 
> > > > > v7: (Ville)
> > > > > - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> > > > >     size < ggtt->mappable_end / 4 checks.
> > > > > - Drop the redundant check that is based on previous heuristic.
> > > > > 
> > > > > v8:
> > > > > - Make sure that we are holding the mutex associated with ggtt vm
> > > > >     as we traverse the hole nodes.
> > > > > 
> > > > > v9: (Tvrtko)
> > > > > - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> > > > > 
> > > > > Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > > > > Cc: Manasi Navare <manasi.d.navare@intel.com>
> > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> > > > >    1 file changed, 94 insertions(+), 34 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > > > index 9747924cc57b..e0d731b3f215 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > > > @@ -49,6 +49,7 @@
> > > > >    #include "gem/i915_gem_pm.h"
> > > > >    #include "gem/i915_gem_region.h"
> > > > >    #include "gem/i915_gem_userptr.h"
> > > > > +#include "gem/i915_gem_tiling.h"
> > > > >    #include "gt/intel_engine_user.h"
> > > > >    #include "gt/intel_gt.h"
> > > > >    #include "gt/intel_gt_pm.h"
> > > > > @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> > > > >           spin_unlock(&obj->vma.lock);
> > > > >    }
> > > > > 
> > > > > +static int
> > > > > +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> > > > > +                                u64 alignment, u64 flags)
> > > > 
> > > > Tvrtko asked me to ack the first patch, but then I looked at this and
> > > > started wondering.
> > > > 
> > > > Conceptually this doesn't pass the smell test. What if we have
> > > > multiple per-crtc buffers? Multiple planes on the same crtc? What if
> > > > the app does triple buffer? You'll be forever busy tuning this
> > > > heuristics, which can't fundamentally be fixed I think. The old "half
> > > > of mappable" heuristic isn't really better, but at least it was dead
> > > > simple.
> > > > 
> > > > Imo what we need here is a change in approach:
> > > > 1. Check whether the useable view for scanout exists already. If yes,
> > > > use that. This should avoid the constant unbinding stalls.
> > > > > 2. Try to pin the buffer into mappable, but without evicting anything (so
> > > > not the non-blocking thing)
> > > > 3. Pin the buffer with the most lenient approach
> > > > 
> > > > Even the non-blocking interim stage is dangerous, since it'll just
> > > > result in other buffers (e.g. when triple-buffering) getting unbound
> > > > and we're back to the same stall. Note that this could have an impact
> > > > > on cpu rendering compositors, where we might end up relying a lot more
> > > > > on partial views. But as long as we are a tad more aggressive (i.e. the
> > > > non-blocking binding) in the mmap path that should work out to keep
> > > > everything balanced, since usually you render first before you display
> > > > anything. And so the buffer should end up in the ideal place.
> > > > 
> > > > I'd try to first skip the 2. step since I think it'll require a bit of
> > > > work, and frankly I don't think we care about the potential fallout.
> > > 
> > > To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
> > > respecting this comment from i915_gem_object_pin_to_display_plane:
> > > 
> > > 	/*
> > > 	 * As the user may map the buffer once pinned in the display plane
> > > 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> > > 	 * always use map_and_fenceable for all scanout buffers. However,
> > > 	 * it may simply be too big to fit into mappable, in which case
> > > 	 * put it anyway and hope that userspace can cope (but always first
> > > 	 * try to preserve the existing ABI).
> > > 	 */
> > [Kasireddy, Vivek] Digging further, this is what the commit message that added
> > the above comment says:
> > commit 2efb813d5388e18255c54afac77bd91acd586908
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Thu Aug 18 17:17:06 2016 +0100
> > 
> >      drm/i915: Fallback to using unmappable memory for scanout
> > 
> >      The existing ABI says that scanouts are pinned into the mappable region
> >      so that legacy clients (e.g. old Xorg or plymouthd) can write directly
> >      into the scanout through a GTT mapping. However if the surface does not
> >      fit into the mappable region, we are better off just trying to fit it
> >      anywhere and hoping for the best. (Any userspace that is capable of
> >      using ginormous scanouts is also likely not to rely on pure GTT
> >      updates.) With the partial vma fault support, we are no longer
> >      restricted to only using scanouts that we can pin (though it is still
> >      preferred for performance reasons and for powersaving features like
> >      FBC).
> > 
> > > 
> > > By a quick look, for this case it appears we would end up creating partial views for CPU
> > > access (since the normal mapping would be busy/unpinnable). Worst case for this is to
> > > create a bunch of 1MiB VMAs so something to check would be how long those persist in
> > > memory before they get released. Or perhaps the bootup splash use case is not common
> > > these days?
> > [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
> > Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
> > drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
> > would not be a problem on ADL-S where there is space in mappable for one 8K FB.
> > 
> 
> FBC is a good point - correct me if I am wrong, but if we dropped trying to
> map in aperture by default it looks like we would lose it and that would be
> a significant power regression. In which case it doesn't seem like that
> would be an option.

FBC fence is only required for frontbuffer hw tracking, which is another
thing that's somewhere between "meh" and "we should just sunset it
right away". I think that work has even been done.

So I wouldn't worry about this.

If you are worried, then I'd check with display folks whether we need
a platform-based cut-off for this heuristic.

> Which I think leaves us with _some_ heuristics in any case.
> 
> 1) N-holes heuristics.
> 
> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some percentage
> of aperture.
> 
> Could this solve the 8k issue, most of the time, maybe? Could the current
> "aperture / 2" test be expressed generically in some terms? Like "(aperture
> - 10% (or some absolute value)) / 2" to account for non-fb objects? I forgot
> what you said the relationship between aperture size and 8k fb size was.
> 
> 3) Don't evict for PIN_MAPPABLE mismatches when
> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
> i915_gem_object_pin_to_display_plane. Assumption being if we ended up with a
> non-mappable fb to start with, we must not try to re-bind it or we risk
> ping-pong latencies.
> 
> The last would I guess need to distinguish between PIN_MAPPABLE passed in
> versus opportunistically added by i915_gem_object_pin_to_display_plane.
> 
> How intrusive would it be to implement this option I am not sure without
> trying myself.

This won't work, see my initial mail. All you need is triple buffering (or
multiple per-crtc buffers that flip):

1. fb A gets pinned as mappable
2. fb B gets pinned as mappable, fb A is unpinned
3. fb C gets pinned as mappable, we don't have space and end up evicting
fb A

Repeat, and you have exactly the same old eviction loop as with two
buffers. Not good.
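
As an illustration only, the loop in the three steps above can be reproduced
with a toy model. This is a minimal sketch, not i915 code: count_evictions(),
its parameters and MAX_FBS are hypothetical names, all framebuffers are assumed
to be the same size, and the aperture is assumed to hold a fixed number of
them. It only models evicting other buffers to make room, not the rebind of a
buffer's own misplaced vma.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FBS 8	/* hypothetical upper bound for this sketch */

/*
 * Toy model: the mappable aperture holds `slots` equally sized fbs and
 * `nr_fbs` fbs are flipped round-robin. The fb being flipped to is pinned
 * while the previous scanout is still pinned, so the previous scanout can
 * never be the eviction victim. Returns how many flips evicted another fb.
 */
int count_evictions(int slots, int nr_fbs, int flips)
{
	bool resident[MAX_FBS] = { false };
	int used = 0, evictions = 0, scanout = -1;

	assert(slots >= 1 && nr_fbs >= 1 && nr_fbs <= MAX_FBS);

	for (int flip = 0; flip < flips; flip++) {
		int fb = flip % nr_fbs;

		if (!resident[fb]) {
			if (used == slots) {
				int victim = -1;

				/* Evict-to-fit: any resident fb except the scanout. */
				for (int i = 0; i < nr_fbs; i++) {
					if (resident[i] && i != scanout) {
						victim = i;
						break;
					}
				}
				if (victim < 0) {
					/* Nothing evictable: fb stays unmappable. */
					scanout = fb;
					continue;
				}
				resident[victim] = false;
				used--;
				evictions++;
			}
			resident[fb] = true;
			used++;
		}
		scanout = fb;
	}
	return evictions;
}
```

Under these assumptions, count_evictions(2, 3, 9) reports an eviction on every
flip after the first two, while count_evictions(2, 2, 9) reports none, which
is why protecting only the buffer being pinned is not enough.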

Therefore for this to work we don't just need to make sure that we don't
move our own buffer, but also that we don't move any other buffer.

The downside of that is that if a buffer is ever misplaced as mappable, we
never fix up that mistake (at least not until the application entirely
destroys all the involved fb and bo). I think that's acceptable, but
definitely deserves a comment.

Cheers, Daniel

> 
> > Given this, do you think it would work if we just preserve the existing behavior and
> > tweak the heuristic introduced in this patch to look for space in aperture for only
> > one FB instead of two? Or, is there no good option for solving this issue other than
> > to create 1MB VMAs?
> 
> I did not get how having one hole would solve the issue. Wouldn't it still
> hit the re-bind ping-pong? Or there isn't even a single hole for 8k fb
> typically?
> 
> Regards,
> 
> Tvrtko

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-17  9:47             ` Daniel Vetter
@ 2022-03-17 10:04               ` Tvrtko Ursulin
  -1 siblings, 0 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2022-03-17 10:04 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, Kasireddy, Vivek, dri-devel


On 17/03/2022 09:47, Daniel Vetter wrote:
> On Tue, Mar 15, 2022 at 09:45:20AM +0000, Tvrtko Ursulin wrote:
>>
>> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
>>> Hi Tvrtko, Daniel,
>>>
>>>>
>>>> On 11/03/2022 09:39, Daniel Vetter wrote:
>>>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
>>>>>>
>>>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
>>>>>> more framebuffers/scanout buffers results in only one that is mappable/
>>>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
>>>>>> is mappable/fenceable creates latencies large enough to miss alternate
>>>>>> vblanks thereby producing less optimal framerate.
>>>>>>
>>>>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
>>>>>> is called to pin one of the FB objs, the associated vma is identified
>>>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
>>>>>> evicts it. This misplaced vma gets subsequently pinned only when
>>>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
>>>>>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
>>>>>> Therefore, to fix this issue, we try to see if there is space to map
>>>>>> at least two objects of a given size and return early if there isn't. This
>>>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
>>>>>> are too big to map thereby preventing unnecessary unbind.
>>>>>>
>>>>>> Testcase:
>>>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
>>>>>> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
>>>>>> a frame ~7ms before the next vblank, the latencies seen between atomic
>>>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
>>>>>> it misses the vblank every other frame.
>>>>>>
>>>>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
>>>>>>                  i915_gem_object_pin_to_display_plane() {
>>>>>> 0.102 us   |    i915_gem_object_set_cache_level();
>>>>>>                    i915_gem_object_ggtt_pin_ww() {
>>>>>> 0.390 us   |      i915_vma_instance();
>>>>>> 0.178 us   |      i915_vma_misplaced();
>>>>>>                      i915_vma_unbind() {
>>>>>>                      __i915_active_wait() {
>>>>>> 0.082 us   |        i915_active_acquire_if_busy();
>>>>>> 0.475 us   |      }
>>>>>>                      intel_runtime_pm_get() {
>>>>>> 0.087 us   |        intel_runtime_pm_acquire();
>>>>>> 0.259 us   |      }
>>>>>>                      __i915_active_wait() {
>>>>>> 0.085 us   |        i915_active_acquire_if_busy();
>>>>>> 0.240 us   |      }
>>>>>>                      __i915_vma_evict() {
>>>>>>                        ggtt_unbind_vma() {
>>>>>>                          gen8_ggtt_clear_range() {
>>>>>> 10507.255 us |        }
>>>>>> 10507.689 us |      }
>>>>>> 10508.516 us |   }
>>>>>>
>>>>>> v2: Instead of using bigjoiner checks, determine whether a scanout
>>>>>>        buffer is too big by checking to see if it is possible to map
>>>>>>        two of them into the ggtt.
>>>>>>
>>>>>> v3 (Ville):
>>>>>> - Count how many fb objects can be fit into the available holes
>>>>>>      instead of checking for a hole twice the object size.
>>>>>> - Take alignment constraints into account.
>>>>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
>>>>>>
>>>>>> v4:
>>>>>> - Remove existing heuristic that checks just for size. (Ville)
>>>>>> - Return early if we find space to map at-least two objects. (Tvrtko)
>>>>>> - Slightly update the commit message.
>>>>>>
>>>>>> v5: (Tvrtko)
>>>>>> - Rename the function to indicate that the object may be too big to
>>>>>>      map into the aperture.
>>>>>> - Account for guard pages while calculating the total size required
>>>>>>      for the object.
>>>>>> - Do not subject all objects to the heuristic check and instead
>>>>>>      consider objects only of a certain size.
>>>>>> - Do the hole walk using the rbtree.
>>>>>> - Preserve the existing PIN_NONBLOCK logic.
>>>>>> - Drop the PIN_MAPPABLE check while pinning the VMA.
>>>>>>
>>>>>> v6: (Tvrtko)
>>>>>> - Return 0 on success and the specific error code on failure to
>>>>>>      preserve the existing behavior.
>>>>>>
>>>>>> v7: (Ville)
>>>>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>>>>>>      size < ggtt->mappable_end / 4 checks.
>>>>>> - Drop the redundant check that is based on previous heuristic.
>>>>>>
>>>>>> v8:
>>>>>> - Make sure that we are holding the mutex associated with ggtt vm
>>>>>>      as we traverse the hole nodes.
>>>>>>
>>>>>> v9: (Tvrtko)
>>>>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
>>>>>>
>>>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>>> Cc: Manasi Navare <manasi.d.navare@intel.com>
>>>>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
>>>>>>     1 file changed, 94 insertions(+), 34 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>> index 9747924cc57b..e0d731b3f215 100644
>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>> @@ -49,6 +49,7 @@
>>>>>>     #include "gem/i915_gem_pm.h"
>>>>>>     #include "gem/i915_gem_region.h"
>>>>>>     #include "gem/i915_gem_userptr.h"
>>>>>> +#include "gem/i915_gem_tiling.h"
>>>>>>     #include "gt/intel_engine_user.h"
>>>>>>     #include "gt/intel_gt.h"
>>>>>>     #include "gt/intel_gt_pm.h"
>>>>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
>>>>>>            spin_unlock(&obj->vma.lock);
>>>>>>     }
>>>>>>
>>>>>> +static int
>>>>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
>>>>>> +                                u64 alignment, u64 flags)
>>>>>
>>>>> Tvrtko asked me to ack the first patch, but then I looked at this and
>>>>> started wondering.
>>>>>
>>>>> Conceptually this doesn't pass the smell test. What if we have
>>>>> multiple per-crtc buffers? Multiple planes on the same crtc? What if
>>>>> the app does triple buffer? You'll be forever busy tuning this
>>>>> heuristics, which can't fundamentally be fixed I think. The old "half
>>>>> of mappable" heuristic isn't really better, but at least it was dead
>>>>> simple.
>>>>>
>>>>> Imo what we need here is a change in approach:
>>>>> 1. Check whether the useable view for scanout exists already. If yes,
>>>>> use that. This should avoid the constant unbinding stalls.
>>>>> 2. Try to pin the buffer as mappable, but without evicting anything (so
>>>>> not the non-blocking thing)
>>>>> 3. Pin the buffer with the most lenient approach
>>>>>
>>>>> Even the non-blocking interim stage is dangerous, since it'll just
>>>>> result in other buffers (e.g. when triple-buffering) getting unbound
>>>>> and we're back to the same stall. Note that this could have an impact
>>>>> on cpu rendering compositors, where we might end up relying a lot more
>>>>> on partial views. But as long as we are a tad more aggressive (i.e. the
>>>>> non-blocking binding) in the mmap path that should work out to keep
>>>>> everything balanced, since usually you render first before you display
>>>>> anything. And so the buffer should end up in the ideal place.
>>>>>
>>>>> I'd try to first skip the 2. step since I think it'll require a bit of
>>>>> work, and frankly I don't think we care about the potential fallout.
>>>>
>>>> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
>>>> respecting this comment from i915_gem_object_pin_to_display_plane:
>>>>
>>>> 	/*
>>>> 	 * As the user may map the buffer once pinned in the display plane
>>>> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
>>>> 	 * always use map_and_fenceable for all scanout buffers. However,
>>>> 	 * it may simply be too big to fit into mappable, in which case
>>>> 	 * put it anyway and hope that userspace can cope (but always first
>>>> 	 * try to preserve the existing ABI).
>>>> 	 */
>>> [Kasireddy, Vivek] Digging further, this is what the commit message that added
>>> the above comment says:
>>> commit 2efb813d5388e18255c54afac77bd91acd586908
>>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>>> Date:   Thu Aug 18 17:17:06 2016 +0100
>>>
>>>       drm/i915: Fallback to using unmappable memory for scanout
>>>
>>>       The existing ABI says that scanouts are pinned into the mappable region
>>>       so that legacy clients (e.g. old Xorg or plymouthd) can write directly
>>>       into the scanout through a GTT mapping. However if the surface does not
>>>       fit into the mappable region, we are better off just trying to fit it
>>>       anywhere and hoping for the best. (Any userspace that is capable of
>>>       using ginormous scanouts is also likely not to rely on pure GTT
>>>       updates.) With the partial vma fault support, we are no longer
>>>       restricted to only using scanouts that we can pin (though it is still
>>>       preferred for performance reasons and for powersaving features like
>>>       FBC).
>>>
>>>>
>>>> By a quick look, for this case it appears we would end up creating partial views for CPU
>>>> access (since the normal mapping would be busy/unpinnable). Worst case for this is to
>>>> create a bunch of 1MiB VMAs so something to check would be how long those persist in
>>>> memory before they get released. Or perhaps the bootup splash use case is not common
>>>> these days?
>>> [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
>>> Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
>>> drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
>>> would not be a problem on ADL-S where there is space in mappable for one 8K FB.
>>>
>>
>> FBC is a good point - correct me if I am wrong, but if we dropped trying to
>> map in aperture by default it looks like we would lose it and that would be
>> a significant power regression. In which case it doesn't seem like that
>> would be an option.
> 
> FBC fence is only required for frontbuffer hw tracking, which is another
> thing that's somewhere between "meh" and "we should just sunset it
> right away". I think that work has even been done.
> 
> So I wouldn't worry about this.
> 
> If you are worried, then I'd check with display folks whether we need
> a platform-based cut-off for this heuristic.
> 
>> Which I think leaves us with _some_ heuristics in any case.
>>
>> 1) N-holes heuristics.
>>
>> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some percentage
>> of aperture.
>>
>> Could this solve the 8k issue, most of the time, maybe? Could the current
>> "aperture / 2" test be expressed generically in some terms? Like "(aperture
>> - 10% (or some absolute value)) / 2" to account for non-fb objects? I forgot
>> what you said the relationship between aperture size and 8k fb size was.
>>
>> 3) Don't evict for PIN_MAPPABLE mismatches when
>> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
>> i915_gem_object_pin_to_display_plane. Assumption being if we ended up with a
>> non-mappable fb to start with, we must not try to re-bind it or we risk
>> ping-pong latencies.
>>
>> The last would I guess need to distinguish between PIN_MAPPABLE passed in
>> versus opportunistically added by i915_gem_object_pin_to_display_plane.
>>
>> How intrusive would it be to implement this option I am not sure without
>> trying myself.
> 
> This won't work, see my initial mail. All you need is triple buffering (or
> multiple per-crtc buffers that flip)

I asked for clarifications on your initial email but you went a bit 
quiet on us, which is why I tried to drive this forward.

> 
> 1. fb A gets pinned as mappable
> 2. fb B gets pinned as mappable, fb A is unpinned
> 3. fb C gets pinned as mappable, we don't have space and end up evicting
> fb A
> 
> Repeat, and you have exactly the same old eviction loop as with two
> buffers. Not good.

Maybe a misunderstanding of what I wrote above? Idea was specifically 
not to evict for "opportunistic" PIN_MAPPABLE. Anyway, with the current 
solution to implement that, this is what would happen (see latest patch):

1. fb A gets pinned as mappable
2. fb B gets pinned as mappable, assuming there is space, fb A unpinned
3. fb C, assuming there is no space, does not get pinned as mappable so 
nothing is evicted

> Therefore for this to work we don't just need to make sure that we don't
> move our own buffer, but also that we don't move any other buffer.

I think we achieved it by failing the "opportunistic" PIN_MAPPABLE 
attempts for all vmas which weren't already bound mappable in the past.
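
To make the rule concrete, here is a minimal sketch of the decision as
described in this mail. It is not the code from the patch: the enum, the
struct and choose_pin_action() are hypothetical names, and "opportunistic"
stands for a mappable request added by the display pinning path rather than
one passed in by the caller.

```c
#include <stdbool.h>

/* Hypothetical outcomes for this sketch, not i915 flags. */
enum sketch_pin_action {
	SKETCH_REUSE_BINDING,		/* keep the vma where it is */
	SKETCH_BIND_ANYWHERE,		/* no mappable requirement */
	SKETCH_TRY_MAPPABLE_NOEVICT,	/* free mappable hole only, else anywhere */
	SKETCH_REBIND_MAPPABLE,		/* may unbind and evict */
};

struct sketch_vma_state {
	bool bound;	/* vma currently has a GGTT binding */
	bool mappable;	/* that binding lies inside the aperture */
};

enum sketch_pin_action
choose_pin_action(struct sketch_vma_state vma, bool want_mappable,
		  bool opportunistic)
{
	if (!want_mappable)
		return vma.bound ? SKETCH_REUSE_BINDING : SKETCH_BIND_ANYWHERE;

	/* Already where we want it: nothing to do. */
	if (vma.bound && vma.mappable)
		return SKETCH_REUSE_BINDING;

	if (opportunistic) {
		/* Never move a bound fb and never evict others for it. */
		return vma.bound ? SKETCH_REUSE_BINDING
				 : SKETCH_TRY_MAPPABLE_NOEVICT;
	}

	/* An explicit mappable request keeps the old behaviour. */
	return SKETCH_REBIND_MAPPABLE;
}
```

In this model fb C from step 3 is either left in its existing non-mappable
binding or bound without eviction, so neither fb A nor fb B is disturbed.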

> The downside of that is that if a buffer is ever misplaced as mappable, we
> never fix up that mistake (at least not until the application entirely
> destroys all the involved fb and bo). I think that's acceptable, but
> definitely deserves a comment.

This is true yes.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-17 10:04               ` Tvrtko Ursulin
@ 2022-03-17 10:10                 ` Daniel Vetter
  -1 siblings, 0 replies; 31+ messages in thread
From: Daniel Vetter @ 2022-03-17 10:10 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: dri-devel, intel-gfx, Kasireddy, Vivek

On Thu, Mar 17, 2022 at 10:04:36AM +0000, Tvrtko Ursulin wrote:
> 
> On 17/03/2022 09:47, Daniel Vetter wrote:
> > On Tue, Mar 15, 2022 at 09:45:20AM +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > > > Hi Tvrtko, Daniel,
> > > > 
> > > > > 
> > > > > On 11/03/2022 09:39, Daniel Vetter wrote:
> > > > > > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
> > > > > > > 
> > > > > > > On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> > > > > > > more framebuffers/scanout buffers results in only one that is mappable/
> > > > > > > fenceable. Therefore, pageflipping between these 2 FBs where only one
> > > > > > > is mappable/fenceable creates latencies large enough to miss alternate
> > > > > > > vblanks thereby producing less optimal framerate.
> > > > > > > 
> > > > > > > This mainly happens because when i915_gem_object_pin_to_display_plane()
> > > > > > > is called to pin one of the FB objs, the associated vma is identified
> > > > > > > as misplaced and therefore i915_vma_unbind() is called which unbinds and
> > > > > > > > evicts it. This misplaced vma gets subsequently pinned only when
> > > > > > > i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> > > > > > > results in a latency of ~10ms and happens every other vblank/repaint cycle.
> > > > > > > Therefore, to fix this issue, we try to see if there is space to map
> > > > > > > at-least two objects of a given size and return early if there isn't. This
> > > > > > > would ensure that we do not try with PIN_MAPPABLE for any objects that
> > > > > > > > are too big to map thereby preventing unnecessary unbind.
> > > > > > > 
> > > > > > > Testcase:
> > > > > > > Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> > > > > > > > with an 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> > > > > > > a frame ~7ms before the next vblank, the latencies seen between atomic
> > > > > > > commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> > > > > > > it misses the vblank every other frame.
> > > > > > > 
> > > > > > > Here is the ftrace snippet that shows the source of the ~10ms latency:
> > > > > > >                  i915_gem_object_pin_to_display_plane() {
> > > > > > > 0.102 us   |    i915_gem_object_set_cache_level();
> > > > > > >                    i915_gem_object_ggtt_pin_ww() {
> > > > > > > 0.390 us   |      i915_vma_instance();
> > > > > > > 0.178 us   |      i915_vma_misplaced();
> > > > > > >                      i915_vma_unbind() {
> > > > > > >                      __i915_active_wait() {
> > > > > > > 0.082 us   |        i915_active_acquire_if_busy();
> > > > > > > 0.475 us   |      }
> > > > > > >                      intel_runtime_pm_get() {
> > > > > > > 0.087 us   |        intel_runtime_pm_acquire();
> > > > > > > 0.259 us   |      }
> > > > > > >                      __i915_active_wait() {
> > > > > > > 0.085 us   |        i915_active_acquire_if_busy();
> > > > > > > 0.240 us   |      }
> > > > > > >                      __i915_vma_evict() {
> > > > > > >                        ggtt_unbind_vma() {
> > > > > > >                          gen8_ggtt_clear_range() {
> > > > > > > 10507.255 us |        }
> > > > > > > 10507.689 us |      }
> > > > > > > 10508.516 us |   }
> > > > > > > 
> > > > > > > v2: Instead of using bigjoiner checks, determine whether a scanout
> > > > > > >        buffer is too big by checking to see if it is possible to map
> > > > > > >        two of them into the ggtt.
> > > > > > > 
> > > > > > > v3 (Ville):
> > > > > > > - Count how many fb objects can be fit into the available holes
> > > > > > >      instead of checking for a hole twice the object size.
> > > > > > > - Take alignment constraints into account.
> > > > > > > - Limit this large scanout buffer check to >= Gen 11 platforms.
> > > > > > > 
> > > > > > > v4:
> > > > > > > - Remove existing heuristic that checks just for size. (Ville)
> > > > > > > - Return early if we find space to map at-least two objects. (Tvrtko)
> > > > > > > - Slightly update the commit message.
> > > > > > > 
> > > > > > > v5: (Tvrtko)
> > > > > > > - Rename the function to indicate that the object may be too big to
> > > > > > >      map into the aperture.
> > > > > > > - Account for guard pages while calculating the total size required
> > > > > > >      for the object.
> > > > > > > - Do not subject all objects to the heuristic check and instead
> > > > > > >      consider objects only of a certain size.
> > > > > > > - Do the hole walk using the rbtree.
> > > > > > > - Preserve the existing PIN_NONBLOCK logic.
> > > > > > > - Drop the PIN_MAPPABLE check while pinning the VMA.
> > > > > > > 
> > > > > > > v6: (Tvrtko)
> > > > > > > - Return 0 on success and the specific error code on failure to
> > > > > > >      preserve the existing behavior.
> > > > > > > 
> > > > > > > v7: (Ville)
> > > > > > > - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> > > > > > >      size < ggtt->mappable_end / 4 checks.
> > > > > > > - Drop the redundant check that is based on previous heuristic.
> > > > > > > 
> > > > > > > v8:
> > > > > > > - Make sure that we are holding the mutex associated with ggtt vm
> > > > > > >      as we traverse the hole nodes.
> > > > > > > 
> > > > > > > v9: (Tvrtko)
> > > > > > > - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> > > > > > > 
> > > > > > > Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > > > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > > > > > > Cc: Manasi Navare <manasi.d.navare@intel.com>
> > > > > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > > > > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > > > > > > ---
> > > > > > >     drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
> > > > > > >     1 file changed, 94 insertions(+), 34 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > > > > > index 9747924cc57b..e0d731b3f215 100644
> > > > > > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > > > > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > > > > > @@ -49,6 +49,7 @@
> > > > > > >     #include "gem/i915_gem_pm.h"
> > > > > > >     #include "gem/i915_gem_region.h"
> > > > > > >     #include "gem/i915_gem_userptr.h"
> > > > > > > +#include "gem/i915_gem_tiling.h"
> > > > > > >     #include "gt/intel_engine_user.h"
> > > > > > >     #include "gt/intel_gt.h"
> > > > > > >     #include "gt/intel_gt_pm.h"
> > > > > > > @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> > > > > > >            spin_unlock(&obj->vma.lock);
> > > > > > >     }
> > > > > > > 
> > > > > > > +static int
> > > > > > > +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> > > > > > > +                                u64 alignment, u64 flags)
> > > > > > 
> > > > > > Tvrtko asked me to ack the first patch, but then I looked at this and
> > > > > > started wondering.
> > > > > > 
> > > > > > Conceptually this doesn't pass the smell test. What if we have
> > > > > > multiple per-crtc buffers? Multiple planes on the same crtc? What if
> > > > > > the app does triple buffer? You'll be forever busy tuning this
> > > > > > heuristics, which can't fundamentally be fixed I think. The old "half
> > > > > > of mappable" heuristic isn't really better, but at least it was dead
> > > > > > simple.
> > > > > > 
> > > > > > Imo what we need here is a change in approach:
> > > > > > 1. Check whether the useable view for scanout exists already. If yes,
> > > > > > use that. This should avoid the constant unbinding stalls.
> > > > > > > 2. Try to pin the buffer as mappable, but without evicting anything (so
> > > > > > not the non-blocking thing)
> > > > > > 3. Pin the buffer with the most lenient approach
> > > > > > 
> > > > > > Even the non-blocking interim stage is dangerous, since it'll just
> > > > > > result in other buffers (e.g. when triple-buffering) getting unbound
> > > > > > and we're back to the same stall. Note that this could have an impact
> > > > > > on cpu rendering compositors, where we might end up relying a lot more
> > > > > > > on partial views. But as long as we are a tad more aggressive (i.e. the
> > > > > > non-blocking binding) in the mmap path that should work out to keep
> > > > > > everything balanced, since usually you render first before you display
> > > > > > anything. And so the buffer should end up in the ideal place.
> > > > > > 
> > > > > > I'd try to first skip the 2. step since I think it'll require a bit of
> > > > > > work, and frankly I don't think we care about the potential fallout.
> > > > > 
> > > > > To be sure I understand, you propose to stop trying to pin mappable by default. Ie. stop
> > > > > respecting this comment from i915_gem_object_pin_to_display_plane:
> > > > > 
> > > > > 	/*
> > > > > 	 * As the user may map the buffer once pinned in the display plane
> > > > > 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> > > > > 	 * always use map_and_fenceable for all scanout buffers. However,
> > > > > 	 * it may simply be too big to fit into mappable, in which case
> > > > > 	 * put it anyway and hope that userspace can cope (but always first
> > > > > 	 * try to preserve the existing ABI).
> > > > > 	 */
> > > > [Kasireddy, Vivek] Digging further, this is what the commit message that added
> > > > the above comment says:
> > > > commit 2efb813d5388e18255c54afac77bd91acd586908
> > > > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Date:   Thu Aug 18 17:17:06 2016 +0100
> > > > 
> > > >       drm/i915: Fallback to using unmappable memory for scanout
> > > > 
> > > >       The existing ABI says that scanouts are pinned into the mappable region
> > > >       so that legacy clients (e.g. old Xorg or plymouthd) can write directly
> > > >       into the scanout through a GTT mapping. However if the surface does not
> > > >       fit into the mappable region, we are better off just trying to fit it
> > > >       anywhere and hoping for the best. (Any userspace that is capable of
> > > >       using ginormous scanouts is also likely not to rely on pure GTT
> > > >       updates.) With the partial vma fault support, we are no longer
> > > >       restricted to only using scanouts that we can pin (though it is still
> > > >       preferred for performance reasons and for powersaving features like
> > > >       FBC).
> > > > 
> > > > > 
> > > > > By a quick look, for this case it appears we would end up creating partial views for CPU
> > > > > access (since the normal mapping would be busy/unpinnable). Worst case for this is to
> > > > > create a bunch of 1MiB VMAs so something to check would be how long those persist in
> > > > > memory before they get released. Or perhaps the bootup splash use case is not common
> > > > > these days?
> > > > [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on Fedora,
> > > > Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it (Plymouth's
> > > > drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. This
> > > > would not be a problem on ADL-S where there is space in mappable for one 8K FB.
> > > > 
> > > 
> > > FBC is a good point - correct me if I am wrong, but if we dropped trying to
> > > map in aperture by default it looks like we would lose it and that would be
> > > a significant power regression. In which case it doesn't seem like that
> > > would be an option.
> > 
> > FBC fence is only required for frontbuffer hw tracking, which is another
> > thing that's somewhere between "meh" and "we should just sunset it
> > right away". I think that work has even been done.
> > 
> > So I wouldn't worry about this.
> > 
> > If you are worried, then I'd check with display folks whether we need
> > a platform based cut-off for this heuristics.
> > 
> > > Which I think leaves us with _some_ heuristics in any case.
> > > 
> > > 1) N-holes heuristics.
> > > 
> > > 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some percentage
> > > of aperture.
> > > 
> > > Could this solve the 8k issue, most of the time, maybe? Could the current
> > > "aperture / 2" test be expressed generically in some terms? Like "(aperture
> > > - 10% (or some absolute value)) / 2" to account for non-fb objects? I forgot
> > > what you said the relationship between aperture size and 8k fb size was.
> > > 
> > > 3) Don't evict for PIN_MAPPABLE mismatches when
> > > i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
> > > i915_gem_object_pin_to_display_plane. Assumption being if we ended up with a
> > > non-mappable fb to start with, we must not try to re-bind it or we risk
> > > ping-pong latencies.
> > > 
> > > The last would I guess need to distinguish between PIN_MAPPABLE passed in
> > > versus opportunistically added by i915_gem_object_pin_to_display_plane.
> > > 
> > > How intrusive would it be to implement this option I am not sure without
> > > trying myself.
> > 
> > This won't work, see my initial mail. All you need is triple buffering (or
> > multiple per-crtc buffers that flip)
> 
> I asked for clarifications on your initial email but you went a bit quiet on
> us, which is why I tried to drive this forward.

Yeah I'm buried on mailing list stuff pretty badly.

> > 1. fb A gets pinned as mappable
> > 2. fb B gets pinned as mappable, fb A is unpinned
> > 3. fb C gets pinned as mappable, we don't have space and end up evicting
> > fb A
> > 
> > Repeat, and you have exactly the same old eviction loop as with two
> > buffers. Not good.
> 
> Maybe a misunderstanding of what I wrote above? Idea was specifically not to
> evict for "opportunistic" PIN_MAPPABLE. Anyway, with the current solution to
> implement that, this is what would happen (see latest patch):
> 
> 1. fb A gets pinned as mappable
> 2. fb B gets pinned as mappable, assuming there is space, fb A unpinned
> 3. fb C, assuming there is no space, does not get pinned as mappable so
> nothing is evicted
> 
> > Therefore for this to work we don't just need to make sure that we don't
> > move our own buffer, but also that we don't move any other buffer.
> 
> I think we achieved it by failing the "opportunistic" PIN_MAPPABLE attempts
> for all vmas which weren't already bound mappable in the past.

Ah yeah if that all works like that then I think we're fine.
-Daniel
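
The two A/B/C sequences above can be compared with a toy userspace model.
This is only a sketch under stated assumptions, not i915 code: struct
toy_fb, toy_pin(), toy_simulate() and MAPPABLE_SLOTS are hypothetical names,
and the mappable aperture is assumed to hold exactly two framebuffers.
may_evict == true models the eviction loop with triple buffering, while
may_evict == false models failing the "opportunistic" PIN_MAPPABLE attempt
instead of unbinding anything.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_FBS        3	/* triple buffering: fb A, B and C */
#define MAPPABLE_SLOTS 2	/* assumption: mappable fits two such fbs */

enum toy_placement { TOY_UNBOUND, TOY_MAPPABLE, TOY_ANYWHERE };

struct toy_fb {
	enum toy_placement where;
	bool pinned;
};

static int toy_mappable_in_use(const struct toy_fb *fbs)
{
	int i, n = 0;

	for (i = 0; i < NUM_FBS; i++)
		n += (fbs[i].where == TOY_MAPPABLE);
	return n;
}

/* Unbind the first unpinned mappable fb; returns true if one was found. */
static bool toy_evict_one(struct toy_fb *fbs)
{
	int i;

	for (i = 0; i < NUM_FBS; i++) {
		if (fbs[i].where == TOY_MAPPABLE && !fbs[i].pinned) {
			fbs[i].where = TOY_UNBOUND;
			return true;
		}
	}
	return false;
}

/* Pin fbs[idx] for scanout and return how many unbinds that caused. */
static int toy_pin(struct toy_fb *fbs, int idx, bool may_evict)
{
	struct toy_fb *fb;
	int unbinds = 0;

	assert(idx >= 0 && idx < NUM_FBS);
	fb = &fbs[idx];

	/* Evicting mode: a bound but non-mappable fb counts as misplaced. */
	if (may_evict && fb->where == TOY_ANYWHERE) {
		fb->where = TOY_UNBOUND;
		unbinds++;
	}

	if (fb->where == TOY_UNBOUND) {
		if (toy_mappable_in_use(fbs) < MAPPABLE_SLOTS) {
			fb->where = TOY_MAPPABLE;
		} else if (may_evict && toy_evict_one(fbs)) {
			fb->where = TOY_MAPPABLE;
			unbinds++;
		} else {
			/* Opportunistic attempt failed: bind anywhere. */
			fb->where = TOY_ANYWHERE;
		}
	}

	fb->pinned = true;
	return unbinds;
}

/* Flip A, B, C, A, ... for nflips and return the total unbind count. */
static int toy_simulate(bool may_evict, int nflips)
{
	struct toy_fb fbs[NUM_FBS] = { { TOY_UNBOUND, false } };
	int i, prev = -1, unbinds = 0;

	for (i = 0; i < nflips; i++) {
		int cur = i % NUM_FBS;

		unbinds += toy_pin(fbs, cur, may_evict);
		if (prev >= 0)
			fbs[prev].pinned = false;	/* old scanout released */
		prev = cur;
	}
	return unbinds;
}
```

In this model toy_simulate(true, 9) unbinds on every flip after the first
two, whereas toy_simulate(false, 9) never unbinds: fb C simply stays
non-mappable, which is the "never fix up that mistake" downside mentioned
in the thread.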

> > The downside of that is that if a buffer is ever misplaced as mappable, we
> > never fix up that mistake (at least not until the application entirely
> > destroys all the involved fb and bo). I think that's acceptable, but
> > definitely deserves a comment.
> 
> This is true yes.
> 
> Regards,
> 
> Tvrtko

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)
  2022-03-17  7:08                 ` Kasireddy, Vivek
  (?)
@ 2022-03-17 10:33                 ` Tvrtko Ursulin
  -1 siblings, 0 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2022-03-17 10:33 UTC (permalink / raw)
  To: Kasireddy, Vivek, Daniel Vetter; +Cc: intel-gfx, dri-devel


On 17/03/2022 07:08, Kasireddy, Vivek wrote:
> Hi Tvrtko,
> 
>>
>> On 16/03/2022 07:37, Kasireddy, Vivek wrote:
>>> Hi Tvrtko,
>>>
>>>>
>>>> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
>>>>> Hi Tvrtko, Daniel,
>>>>>
>>>>>>
>>>>>> On 11/03/2022 09:39, Daniel Vetter wrote:
>>>>>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@intel.com>
>> wrote:
>>>>>>>>
>>>>>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
>>>>>>>> more framebuffers/scanout buffers results in only one that is mappable/
>>>>>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one
>>>>>>>> is mappable/fenceable creates latencies large enough to miss alternate
>>>>>>>> vblanks thereby producing less optimal framerate.
>>>>>>>>
>>>>>>>> This mainly happens because when i915_gem_object_pin_to_display_plane()
>>>>>>>> is called to pin one of the FB objs, the associated vma is identified
>>>>>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and
>>>>>>>> evicts it. This misplaced vma gets subsequently pinned only when
>>>>>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
>>>>>>>> results in a latency of ~10ms and happens every other vblank/repaint cycle.
>>>>>>>> Therefore, to fix this issue, we try to see if there is space to map
>>>>>>>> at-least two objects of a given size and return early if there isn't. This
>>>>>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that
>>>>>>>> are too big to map thereby preventing unnecessary unbind.
>>>>>>>>
>>>>>>>> Testcase:
>>>>>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
>>>>>>>> with an 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
>>>>>>>> a frame ~7ms before the next vblank, the latencies seen between atomic
>>>>>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
>>>>>>>> it misses the vblank every other frame.
>>>>>>>>
>>>>>>>> Here is the ftrace snippet that shows the source of the ~10ms latency:
>>>>>>>>                   i915_gem_object_pin_to_display_plane() {
>>>>>>>> 0.102 us   |    i915_gem_object_set_cache_level();
>>>>>>>>                     i915_gem_object_ggtt_pin_ww() {
>>>>>>>> 0.390 us   |      i915_vma_instance();
>>>>>>>> 0.178 us   |      i915_vma_misplaced();
>>>>>>>>                       i915_vma_unbind() {
>>>>>>>>                       __i915_active_wait() {
>>>>>>>> 0.082 us   |        i915_active_acquire_if_busy();
>>>>>>>> 0.475 us   |      }
>>>>>>>>                       intel_runtime_pm_get() {
>>>>>>>> 0.087 us   |        intel_runtime_pm_acquire();
>>>>>>>> 0.259 us   |      }
>>>>>>>>                       __i915_active_wait() {
>>>>>>>> 0.085 us   |        i915_active_acquire_if_busy();
>>>>>>>> 0.240 us   |      }
>>>>>>>>                       __i915_vma_evict() {
>>>>>>>>                         ggtt_unbind_vma() {
>>>>>>>>                           gen8_ggtt_clear_range() {
>>>>>>>> 10507.255 us |        }
>>>>>>>> 10507.689 us |      }
>>>>>>>> 10508.516 us |   }
>>>>>>>>
>>>>>>>> v2: Instead of using bigjoiner checks, determine whether a scanout
>>>>>>>>         buffer is too big by checking to see if it is possible to map
>>>>>>>>         two of them into the ggtt.
>>>>>>>>
>>>>>>>> v3 (Ville):
>>>>>>>> - Count how many fb objects can be fit into the available holes
>>>>>>>>       instead of checking for a hole twice the object size.
>>>>>>>> - Take alignment constraints into account.
>>>>>>>> - Limit this large scanout buffer check to >= Gen 11 platforms.
>>>>>>>>
>>>>>>>> v4:
>>>>>>>> - Remove existing heuristic that checks just for size. (Ville)
>>>>>>>> - Return early if we find space to map at least two objects. (Tvrtko)
>>>>>>>> - Slightly update the commit message.
>>>>>>>>
>>>>>>>> v5: (Tvrtko)
>>>>>>>> - Rename the function to indicate that the object may be too big to
>>>>>>>>       map into the aperture.
>>>>>>>> - Account for guard pages while calculating the total size required
>>>>>>>>       for the object.
>>>>>>>> - Do not subject all objects to the heuristic check and instead
>>>>>>>>       consider objects only of a certain size.
>>>>>>>> - Do the hole walk using the rbtree.
>>>>>>>> - Preserve the existing PIN_NONBLOCK logic.
>>>>>>>> - Drop the PIN_MAPPABLE check while pinning the VMA.
>>>>>>>>
>>>>>>>> v6: (Tvrtko)
>>>>>>>> - Return 0 on success and the specific error code on failure to
>>>>>>>>       preserve the existing behavior.
>>>>>>>>
>>>>>>>> v7: (Ville)
>>>>>>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>>>>>>>>       size < ggtt->mappable_end / 4 checks.
>>>>>>>> - Drop the redundant check that is based on previous heuristic.
>>>>>>>>
>>>>>>>> v8:
>>>>>>>> - Make sure that we are holding the mutex associated with ggtt vm
>>>>>>>>       as we traverse the hole nodes.
>>>>>>>>
>>>>>>>> v9: (Tvrtko)
>>>>>>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
>>>>>>>>
>>>>>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>>>>> Cc: Manasi Navare <manasi.d.navare@intel.com>
>>>>>>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++---------
>>>>>>>>      1 file changed, 94 insertions(+), 34 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>>> index 9747924cc57b..e0d731b3f215 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>>>> @@ -49,6 +49,7 @@
>>>>>>>>      #include "gem/i915_gem_pm.h"
>>>>>>>>      #include "gem/i915_gem_region.h"
>>>>>>>>      #include "gem/i915_gem_userptr.h"
>>>>>>>> +#include "gem/i915_gem_tiling.h"
>>>>>>>>      #include "gt/intel_engine_user.h"
>>>>>>>>      #include "gt/intel_gt.h"
>>>>>>>>      #include "gt/intel_gt_pm.h"
>>>>>>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
>>>>>>>>             spin_unlock(&obj->vma.lock);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +static int
>>>>>>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
>>>>>>>> +                                u64 alignment, u64 flags)
>>>>>>>
>>>>>>> Tvrtko asked me to ack the first patch, but then I looked at this and
>>>>>>> started wondering.
>>>>>>>
>>>>>>> Conceptually this doesn't pass the smell test. What if we have
>>>>>>> multiple per-crtc buffers? Multiple planes on the same crtc? What if
>>>>>>> the app does triple buffer? You'll be forever busy tuning these
>>>>>>> heuristics, which can't fundamentally be fixed I think. The old "half
>>>>>>> of mappable" heuristic isn't really better, but at least it was dead
>>>>>>> simple.
>>>>>>>
>>>>>>> Imo what we need here is a change in approach:
>>>>>>> 1. Check whether the useable view for scanout exists already. If yes,
>>>>>>> use that. This should avoid the constant unbinding stalls.
>>>>>>> 2. Try to pin the buffer into mappable, but without evicting anything (so
>>>>>>> not the non-blocking thing)
>>>>>>> 3. Pin the buffer with the most lenient approach
>>>>>>>
>>>>>>> Even the non-blocking interim stage is dangerous, since it'll just
>>>>>>> result in other buffers (e.g. when triple-buffering) getting unbound
>>>>>>> and we're back to the same stall. Note that this could have an impact
>>>>>>> on cpu rendering compositors, where we might end up relying a lot more
>>>>>>> on partial views. But as long as we are a tad more aggressive (i.e. the
>>>>>>> non-blocking binding) in the mmap path that should work out to keep
>>>>>>> everything balanced, since usually you render first before you display
>>>>>>> anything. And so the buffer should end up in the ideal place.
>>>>>>>
>>>>>>> I'd try to first skip the 2. step since I think it'll require a bit of
>>>>>>> work, and frankly I don't think we care about the potential fallout.
>>>>>>
>>>>>> To be sure I understand, you propose to stop trying to pin mappable by
>>>>>> default, i.e. stop respecting this comment from
>>>>>> i915_gem_object_pin_to_display_plane:
>>>>>>
>>>>>> 	/*
>>>>>> 	 * As the user may map the buffer once pinned in the display plane
>>>>>> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
>>>>>> 	 * always use map_and_fenceable for all scanout buffers. However,
>>>>>> 	 * it may simply be too big to fit into mappable, in which case
>>>>>> 	 * put it anyway and hope that userspace can cope (but always first
>>>>>> 	 * try to preserve the existing ABI).
>>>>>> 	 */
>>>>> [Kasireddy, Vivek] Digging further, this is what the commit message that added
>>>>> the above comment says:
>>>>> commit 2efb813d5388e18255c54afac77bd91acd586908
>>>>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Date:   Thu Aug 18 17:17:06 2016 +0100
>>>>>
>>>>>        drm/i915: Fallback to using unmappable memory for scanout
>>>>>
>>>>>        The existing ABI says that scanouts are pinned into the mappable region
>>>>>        so that legacy clients (e.g. old Xorg or plymouthd) can write directly
>>>>>        into the scanout through a GTT mapping. However if the surface does not
>>>>>        fit into the mappable region, we are better off just trying to fit it
>>>>>        anywhere and hoping for the best. (Any userspace that is capable of
>>>>>        using ginormous scanouts is also likely not to rely on pure GTT
>>>>>        updates.) With the partial vma fault support, we are no longer
>>>>>        restricted to only using scanouts that we can pin (though it is still
>>>>>        preferred for performance reasons and for powersaving features like
>>>>>        FBC).
>>>>>
>>>>>>
>>>>>> By a quick look, for this case it appears we would end up creating partial
>>>>>> views for CPU access (since the normal mapping would be busy/unpinnable).
>>>>>> Worst case for this is to create a bunch of 1MiB VMAs so something to check
>>>>>> would be how long those persist in memory before they get released. Or
>>>>>> perhaps the bootup splash use case is not common these days?
>>>>> [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service
>>>>> on Fedora, Ubuntu and most other distributions. And, I took a quick look at it
>>>>> and IIUC, it (Plymouth's drm plugin) seems to create a dumb FB, mmap and update
>>>>> it via the dirty_fb ioctl. This would not be a problem on ADL-S where there is
>>>>> space in mappable for one 8K FB.
>>>>>
>>>>
>>>> FBC is a good point - correct me if I am wrong, but if we dropped trying
>>>> to map in aperture by default it looks like we would lose it and that
>>>> would be a significant power regression. In which case it doesn't seem
>>>> like that would be an option.
>>> [Kasireddy, Vivek] Ok, makes sense.
>>>
>>>>
>>>> Which I think leaves us with _some_ heuristics in any case.
>>>>
>>>> 1) N-holes heuristics.
>>>>
>>>> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some
>>>> percentage of aperture.
>>>>
>>>> Could this solve the 8k issue, most of the time, maybe? Could the
>>>> current "aperture / 2" test be expressed generically in some terms? Like
>>>> "(aperture - 10% (or some absolute value)) / 2" to account for non-fb
>>>> objects? I forgot what you said the relationship between aperture size
>>>> and 8k fb size was.
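Option 2) can be sketched as a pure size check. The snippet below is a hypothetical model, not i915 code: the function name, the percentage parameter, and the example numbers used afterwards (a 256 MiB mappable aperture and an unpadded 7680 x 4320 framebuffer at 4 bytes per pixel) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical version of the "(aperture - reserve) / 2" test: reserve a
 * percentage of the mappable aperture for non-fb objects and refuse
 * PIN_MAPPABLE when the fb needs more than half of the remainder.
 * reserve_pct == 0 reproduces the existing "aperture / 2" check.
 */
static bool fb_too_big_for_mappable(uint64_t fence_size, uint64_t mappable_end,
				    unsigned int reserve_pct)
{
	uint64_t reserve;

	assert(reserve_pct <= 100);

	reserve = mappable_end / 100 * reserve_pct;

	return fence_size > (mappable_end - reserve) / 2;
}
```

Under those assumed numbers the framebuffer is 132,710,400 bytes, just below half of the aperture (134,217,728 bytes), so a plain "aperture / 2" test would let it through while a 10% reserve would reject it. Whether those numbers match a real ADL-S configuration is not established in this thread.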
>>>>
>>>> 3) Don't evict for PIN_MAPPABLE mismatches when
>>>> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of
>>>> i915_gem_object_pin_to_display_plane. Assumption being if we ended up
>>>> with a non-mappable fb to start with, we must not try to re-bind it or
>>>> we risk ping-pong latencies.
>>>>
>>>> The last would I guess need to distinguish between PIN_MAPPABLE passed
>>>> in versus opportunistically added by i915_gem_object_pin_to_display_plane.
>>>>
>>>> How intrusive would it be to implement this option I am not sure without
>>>> trying myself.
>>> [Kasireddy, Vivek] I suspect I might be missing something, but could it not be
>>> as simple as below:
>>> @@ -940,7 +940,8 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
>>>                                   return ERR_PTR(-ENOSPC);
>>>
>>>                           if (flags & PIN_MAPPABLE &&
>>> -                           vma->fence_size > ggtt->mappable_end / 2)
>>> +                           (vma->fence_size > ggtt->mappable_end / 2 ||
>>> +                           !i915_vma_is_map_and_fenceable(vma)))
>>>                                       return ERR_PTR(-ENOSPC);
>>>                   }
>>
>> Looks like this would work...
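The condition in the diff above can be read as a standalone predicate. The following is a simplified model rather than the driver code: the flag values and the struct are placeholders, and the gating on both flags reflects the "PIN_MAPPABLE && PIN_NONBLOCK" description given later in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag bits for this sketch; not the real i915 values. */
#define SKETCH_PIN_NONBLOCK (1u << 0)
#define SKETCH_PIN_MAPPABLE (1u << 1)

/* Minimal stand-in for the vma state the check looks at. */
struct sketch_vma {
	uint64_t fence_size;
	bool map_and_fenceable;
};

/*
 * Returns true when an opportunistic (non-blocking) mappable pin should
 * give up with -ENOSPC instead of going on to unbind the vma.
 */
static bool skip_mappable_rebind(const struct sketch_vma *vma,
				 uint64_t mappable_end, unsigned int flags)
{
	assert(vma);

	/* Callers that pass only PIN_MAPPABLE are left unaffected. */
	if (!(flags & SKETCH_PIN_NONBLOCK) || !(flags & SKETCH_PIN_MAPPABLE))
		return false;

	return vma->fence_size > mappable_end / 2 || !vma->map_and_fenceable;
}
```

The second term is the new part: a vma that is already bound outside the mappable range is left where it is rather than being evicted for another attempt.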
>>
>>>>
>>>>> Given this, do you think it would work if we just preserve the existing behavior and
>>>>> tweak the heuristic introduced in this patch to look for space in aperture for only
>>>>> one FB instead of two? Or, is there no good option for solving this issue other than
>>>>> to create 1MB VMAs?
>>>>
>>>> I did not get how having one hole would solve the issue. Wouldn't it
>>>> still hit the re-bind ping-pong? Or there isn't even a single hole for
>>>> 8k fb typically?
>>> [Kasireddy, Vivek] IIUC, Mesa gives Weston a max of 4 backbuffers but it
>>> almost always uses only 2 except when it needs to share the FB -- with a plugin
>>> such as "remoting" for desktop streaming.
>>> Given the common use-case, let's assume there are two 8K FBs: FB1 and FB2.
>>> FB1 is mappable/fenceable and therefore not misplaced.
>>> FB2 is NOT mappable and hence identified as misplaced
>>> (because it fails the check
>>> (flags & PIN_MAPPABLE && !i915_vma_is_map_and_fenceable(vma))).
>>>
>>> As you suggest in 3) above, the goal is to ensure that FB2 does not get evicted
>>> when we try to pin with PIN_MAPPABLE -- after it gets identified as misplaced.
>>> Or, alternatively, when we pin with PIN_MAPPABLE, we could just check to
>>> see if there is space in aperture for only FB2 (N = 1) and return early -- before
>>> even getting to i915_vma_misplaced(). As you can see, we avoid the ping-pong
>>> issue in both these cases.
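The two-framebuffer scenario above can be replayed with a toy model. Everything in this sketch (the struct, the function names, and the assumption that the mappable aperture has room for exactly one 8K FB) is hypothetical and only mirrors the behaviour described in this thread, not the driver's actual code paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state for one framebuffer vma; not the i915 structures. */
struct toy_fb {
	bool bound;
	bool mappable;
};

/*
 * Pin fb for scanout and return how many unbinds it cost. aperture_free
 * says whether the mappable aperture has room for this fb; with_fix
 * enables the "already bound but not mappable" early exit.
 */
static int toy_pin_for_display(struct toy_fb *fb, bool aperture_free,
			       bool with_fix)
{
	int unbinds = 0;

	assert(fb);

	/* First attempt: opportunistic mappable pin. */
	if (fb->bound && !fb->mappable && !with_fix) {
		/* Treated as misplaced: unbind before trying to rebind. */
		fb->bound = false;
		unbinds++;
	}
	if (!fb->bound && aperture_free) {
		fb->bound = true;
		fb->mappable = true;
	}

	/* Fallback: pin anywhere in the GGTT. */
	if (!fb->bound) {
		fb->bound = true;
		fb->mappable = false;
	}

	return unbinds;
}

/* Count unbinds over a run of flips alternating between two fbs. */
static int toy_count_unbinds(int flips, bool with_fix)
{
	struct toy_fb fb[2] = { { false, false }, { false, false } };
	int unbinds = 0;

	for (int i = 0; i < flips; i++) {
		struct toy_fb *cur = &fb[i % 2];
		struct toy_fb *other = &fb[(i + 1) % 2];
		/* Assumption: the aperture only has room for one fb. */
		bool aperture_free = !(other->bound && other->mappable);

		unbinds += toy_pin_for_display(cur, aperture_free, with_fix);
	}

	return unbinds;
}
```

In this model FB2 is unbound on every one of its flips after the first when the early exit is disabled, which corresponds to the "every other vblank/repaint cycle" stall in the patch description; with the early exit enabled no unbind happens at all.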
>>
>> ... got it, yes, it seems both approaches work for this use case.
>>
>> Not sure that I have a preference between the two approaches at this point.
>>
>> Both would be behind a "PIN_MAPPABLE && PIN_NONBLOCK" check, so both
>> would only apply to opportunistic PIN_MAPPABLE attempts. That is, any
>> caller who only passes PIN_MAPPABLE would be unaffected which is what we
>> want.
>>
>> The extra i915_vma_is_map_and_fenceable check I guess is simpler and
>> self-contained. I assume you have a test setup and can try it out to
>> check it really works?
> [Kasireddy, Vivek] Yes, it works; my testcase just involves running Weston
> with a mode of 8K@60 on ADL-S and checking the FPS of the sample client
> weston-simple-egl. With the fix included, the perf improves to 59 FPS from
> 40 FPS. I'll send out a new patch for review soon.
> 
> Oh, btw, do you think it is now pointless to merge the drm/mm patch that adds
> the iterator given that we'd no longer have the i915 patch that uses it?

Yeah, with no users there is no reason to merge it right now.

Regards,

Tvrtko


end of thread, other threads:[~2022-03-17 10:34 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-07 20:21 [PATCH v6 0/2] drm/mm: Add an iterator to optimally walk over holes suitable for an allocation Vivek Kasireddy
2022-03-07 20:21 ` [Intel-gfx] " Vivek Kasireddy
2022-03-07 20:21 ` [PATCH v6 1/2] drm/mm: Add an iterator to optimally walk over holes for an allocation (v6) Vivek Kasireddy
2022-03-07 20:21   ` [Intel-gfx] " Vivek Kasireddy
2022-03-07 20:21 ` [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9) Vivek Kasireddy
2022-03-07 20:21   ` [Intel-gfx] " Vivek Kasireddy
2022-03-11  9:39   ` Daniel Vetter
2022-03-11  9:39     ` Daniel Vetter
2022-03-14 11:14     ` Tvrtko Ursulin
2022-03-15  7:28       ` Kasireddy, Vivek
2022-03-15  7:28         ` Kasireddy, Vivek
2022-03-15  9:45         ` Tvrtko Ursulin
2022-03-16  7:37           ` Kasireddy, Vivek
2022-03-16  7:37             ` Kasireddy, Vivek
2022-03-16 13:34             ` Tvrtko Ursulin
2022-03-17  7:08               ` Kasireddy, Vivek
2022-03-17  7:08                 ` Kasireddy, Vivek
2022-03-17 10:33                 ` Tvrtko Ursulin
2022-03-17  9:47           ` Daniel Vetter
2022-03-17  9:47             ` Daniel Vetter
2022-03-17 10:04             ` Tvrtko Ursulin
2022-03-17 10:04               ` Tvrtko Ursulin
2022-03-17 10:10               ` Daniel Vetter
2022-03-17 10:10                 ` Daniel Vetter
2022-03-07 20:56 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation Patchwork
2022-03-07 20:58 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-03-08 12:42 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-03-09 18:56 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/mm: Add an iterator to optimally walk over holes suitable for an allocation (rev2) Patchwork
2022-03-09 18:59 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-03-09 19:31 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-03-10  5:00 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
