All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-24 14:23 ` Luben Tuikov
  0 siblings, 0 replies; 43+ messages in thread
From: Luben Tuikov @ 2022-08-24 14:23 UTC (permalink / raw)
  To: dri-devel, Intel Graphics; +Cc: Luben Tuikov, Christian König

From: Christian König <christian.koenig@amd.com>

Make sure we can at least move and alloc TT objects without backing store.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index bc9c432edffe03..45ce2d1f754cc4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 {
 	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
 						     bdev);
-	struct ttm_resource_manager *man =
-		ttm_manager_type(bo->bdev, bo->resource->mem_type);
 	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
 	unsigned long ccs_pages = 0;
 	enum ttm_caching caching;
@@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 	if (!i915_tt)
 		return NULL;
 
-	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
-	    man->use_tt)
+	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
+	    ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
 		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
 
 	caching = i915_ttm_select_tt_caching(obj);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 9a7e50534b84bb..c420d1ab605b6f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
 	bool clear;
 	int ret;
 
-	if (GEM_WARN_ON(!obj)) {
+	if (GEM_WARN_ON(!obj) || !bo->resource) {
 		ttm_bo_move_null(bo, dst_mem);
 		return 0;
 	}
-- 
2.36.1.74.g277cf0bc36


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-24 14:23 ` Luben Tuikov
  0 siblings, 0 replies; 43+ messages in thread
From: Luben Tuikov @ 2022-08-24 14:23 UTC (permalink / raw)
  To: dri-devel, Intel Graphics; +Cc: Luben Tuikov, Christian König

From: Christian König <christian.koenig@amd.com>

Make sure we can at least move and alloc TT objects without backing store.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index bc9c432edffe03..45ce2d1f754cc4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 {
 	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
 						     bdev);
-	struct ttm_resource_manager *man =
-		ttm_manager_type(bo->bdev, bo->resource->mem_type);
 	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
 	unsigned long ccs_pages = 0;
 	enum ttm_caching caching;
@@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 	if (!i915_tt)
 		return NULL;
 
-	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
-	    man->use_tt)
+	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
+	    ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
 		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
 
 	caching = i915_ttm_select_tt_caching(obj);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index 9a7e50534b84bb..c420d1ab605b6f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
 	bool clear;
 	int ret;
 
-	if (GEM_WARN_ON(!obj)) {
+	if (GEM_WARN_ON(!obj) || !bo->resource) {
 		ttm_bo_move_null(bo, dst_mem);
 		return 0;
 	}
-- 
2.36.1.74.g277cf0bc36


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation
  2022-08-24 14:23 ` [Intel-gfx] " Luben Tuikov
@ 2022-08-24 14:23   ` Luben Tuikov
  -1 siblings, 0 replies; 43+ messages in thread
From: Luben Tuikov @ 2022-08-24 14:23 UTC (permalink / raw)
  To: dri-devel, Intel Graphics
  Cc: Michael J . Ruhl, Luben Tuikov, Christian König

From: Christian König <christian.koenig@amd.com>

That should not be necessary any more when drivers should at least be
able to handle the move without a resource.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index f066e8124c5058..d7ed891d388769 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -959,7 +959,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 			 struct sg_table *sg, struct dma_resv *resv,
 			 void (*destroy) (struct ttm_buffer_object *))
 {
-	static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM };
 	int ret;
 
 	kref_init(&bo->kref);
@@ -977,12 +976,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 		bo->base.resv = &bo->base._resv;
 	atomic_inc(&ttm_glob.bo_count);
 
-	ret = ttm_resource_alloc(bo, &sys_mem, &bo->resource);
-	if (unlikely(ret)) {
-		ttm_bo_put(bo);
-		return ret;
-	}
-
 	/*
 	 * For ttm_bo_type_device buffers, allocate
 	 * address space from the device.
-- 
2.36.1.74.g277cf0bc36


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Intel-gfx] [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation
@ 2022-08-24 14:23   ` Luben Tuikov
  0 siblings, 0 replies; 43+ messages in thread
From: Luben Tuikov @ 2022-08-24 14:23 UTC (permalink / raw)
  To: dri-devel, Intel Graphics; +Cc: Luben Tuikov, Christian König

From: Christian König <christian.koenig@amd.com>

That should not be necessary any more when drivers should at least be
able to handle the move without a resource.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index f066e8124c5058..d7ed891d388769 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -959,7 +959,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 			 struct sg_table *sg, struct dma_resv *resv,
 			 void (*destroy) (struct ttm_buffer_object *))
 {
-	static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM };
 	int ret;
 
 	kref_init(&bo->kref);
@@ -977,12 +976,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 		bo->base.resv = &bo->base._resv;
 	atomic_inc(&ttm_glob.bo_count);
 
-	ret = ttm_resource_alloc(bo, &sys_mem, &bo->resource);
-	if (unlikely(ret)) {
-		ttm_bo_put(bo);
-		return ret;
-	}
-
 	/*
 	 * For ttm_bo_type_device buffers, allocate
 	 * address space from the device.
-- 
2.36.1.74.g277cf0bc36


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/3] drm/ttm: stop allocating a dummy resource for pipelined gutting
  2022-08-24 14:23 ` [Intel-gfx] " Luben Tuikov
@ 2022-08-24 14:23   ` Luben Tuikov
  -1 siblings, 0 replies; 43+ messages in thread
From: Luben Tuikov @ 2022-08-24 14:23 UTC (permalink / raw)
  To: dri-devel, Intel Graphics
  Cc: Michael J . Ruhl, Luben Tuikov, Christian König

From: Christian König <christian.koenig@amd.com>

That should not be necessary any more when drivers should at least be
able to handle a move without a resource.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 497ee1fdbad7bf..a1c4dc031cae63 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -607,16 +607,10 @@ EXPORT_SYMBOL(ttm_bo_move_sync_cleanup);
  */
 int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 {
-	static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM };
 	struct ttm_buffer_object *ghost;
-	struct ttm_resource *sys_res;
 	struct ttm_tt *ttm;
 	int ret;
 
-	ret = ttm_resource_alloc(bo, &sys_mem, &sys_res);
-	if (ret)
-		return ret;
-
 	/* If already idle, no need for ghost object dance. */
 	ret = ttm_bo_wait(bo, false, true);
 	if (ret != -EBUSY) {
@@ -624,14 +618,13 @@ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 			/* See comment below about clearing. */
 			ret = ttm_tt_create(bo, true);
 			if (ret)
-				goto error_free_sys_mem;
+				return ret;
 		} else {
 			ttm_tt_unpopulate(bo->bdev, bo->ttm);
 			if (bo->type == ttm_bo_type_device)
 				ttm_tt_mark_for_clear(bo->ttm);
 		}
 		ttm_resource_free(bo, &bo->resource);
-		ttm_bo_assign_mem(bo, sys_res);
 		return 0;
 	}
 
@@ -648,7 +641,7 @@ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 	ret = ttm_tt_create(bo, true);
 	swap(bo->ttm, ttm);
 	if (ret)
-		goto error_free_sys_mem;
+		return ret;
 
 	ret = ttm_buffer_object_transfer(bo, &ghost);
 	if (ret)
@@ -662,13 +655,9 @@ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 	dma_resv_unlock(&ghost->base._resv);
 	ttm_bo_put(ghost);
 	bo->ttm = ttm;
-	ttm_bo_assign_mem(bo, sys_res);
 	return 0;
 
 error_destroy_tt:
 	ttm_tt_destroy(bo->bdev, ttm);
-
-error_free_sys_mem:
-	ttm_resource_free(bo, &sys_res);
 	return ret;
 }
-- 
2.36.1.74.g277cf0bc36


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Intel-gfx] [PATCH 3/3] drm/ttm: stop allocating a dummy resource for pipelined gutting
@ 2022-08-24 14:23   ` Luben Tuikov
  0 siblings, 0 replies; 43+ messages in thread
From: Luben Tuikov @ 2022-08-24 14:23 UTC (permalink / raw)
  To: dri-devel, Intel Graphics; +Cc: Luben Tuikov, Christian König

From: Christian König <christian.koenig@amd.com>

That should not be necessary any more when drivers should at least be
able to handle a move without a resource.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 497ee1fdbad7bf..a1c4dc031cae63 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -607,16 +607,10 @@ EXPORT_SYMBOL(ttm_bo_move_sync_cleanup);
  */
 int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 {
-	static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM };
 	struct ttm_buffer_object *ghost;
-	struct ttm_resource *sys_res;
 	struct ttm_tt *ttm;
 	int ret;
 
-	ret = ttm_resource_alloc(bo, &sys_mem, &sys_res);
-	if (ret)
-		return ret;
-
 	/* If already idle, no need for ghost object dance. */
 	ret = ttm_bo_wait(bo, false, true);
 	if (ret != -EBUSY) {
@@ -624,14 +618,13 @@ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 			/* See comment below about clearing. */
 			ret = ttm_tt_create(bo, true);
 			if (ret)
-				goto error_free_sys_mem;
+				return ret;
 		} else {
 			ttm_tt_unpopulate(bo->bdev, bo->ttm);
 			if (bo->type == ttm_bo_type_device)
 				ttm_tt_mark_for_clear(bo->ttm);
 		}
 		ttm_resource_free(bo, &bo->resource);
-		ttm_bo_assign_mem(bo, sys_res);
 		return 0;
 	}
 
@@ -648,7 +641,7 @@ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 	ret = ttm_tt_create(bo, true);
 	swap(bo->ttm, ttm);
 	if (ret)
-		goto error_free_sys_mem;
+		return ret;
 
 	ret = ttm_buffer_object_transfer(bo, &ghost);
 	if (ret)
@@ -662,13 +655,9 @@ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo)
 	dma_resv_unlock(&ghost->base._resv);
 	ttm_bo_put(ghost);
 	bo->ttm = ttm;
-	ttm_bo_assign_mem(bo, sys_res);
 	return 0;
 
 error_destroy_tt:
 	ttm_tt_destroy(bo->bdev, ttm);
-
-error_free_sys_mem:
-	ttm_resource_free(bo, &sys_res);
 	return ret;
 }
-- 
2.36.1.74.g277cf0bc36


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for series starting with [1/3] drm/i915: audit bo->resource usage
  2022-08-24 14:23 ` [Intel-gfx] " Luben Tuikov
                   ` (2 preceding siblings ...)
  (?)
@ 2022-08-24 16:21 ` Patchwork
  -1 siblings, 0 replies; 43+ messages in thread
From: Patchwork @ 2022-08-24 16:21 UTC (permalink / raw)
  To: Luben Tuikov; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915: audit bo->resource usage
URL   : https://patchwork.freedesktop.org/series/107680/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+./arch/x86/include/asm/bitops.h:117:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:117:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:117:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:148:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:148:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:148:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:150:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:150:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:150:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:154:26: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:154:26: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:154:26: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:156:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:156:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:174:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:174:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:174:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:176:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:176:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:176:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:180:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:180:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:180:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:182:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:182:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:186:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:186:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:186:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:188:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:188:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:188:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:192:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:192:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:192:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:195:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:195:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:223:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:223:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:223:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:225:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:225:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:225:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:92:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:92:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:92:1: warning: unreplaced symbol 'return'
+drivers/gpu/drm/ttm/ttm_bo.c:1142:20: warning: context imbalance in 'ttm_bo_swapout' - unexpected unlock
+drivers/gpu/drm/ttm/ttm_bo.c:259:28: warning: context imbalance in 'ttm_bo_cleanup_refs' - unexpected unlock
+drivers/gpu/drm/ttm/ttm_bo.c:317:27: warning: context imbalance in 'ttm_bo_delayed_delete' - different lock contexts for basic block
+drivers/gpu/drm/ttm/ttm_bo.c:613:5: warning: context imbalance in 'ttm_mem_evict_first' - wrong count at exit
+./include/asm-generic/bitops/generic-non-atomic.h:104:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:104:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:104:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:106:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:106:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:106:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:107:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:107:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:107:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:108:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:108:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:108:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:110:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:110:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:110:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:110:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:110:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:110:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:110:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:110:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:110:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:111:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:111:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:111:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:111:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:111:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:111:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:120:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:120:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:120:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:127:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:127:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:127:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:152:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:152:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:152:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:154:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:154:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:154:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:155:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:155:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:155:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:156:9: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:156:9: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:156:9: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:158:19: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:158:19: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:158:19: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:158:25: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:158:25: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:158:25: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:158:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:158:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:158:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:27:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:27:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:27:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:29:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:29:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:29:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:32:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:32:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:32:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:32:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:32:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:32:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:36:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:36:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:36:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:38:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:38:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:38:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:39:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:39:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:39:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:41:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:41:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:41:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:41:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:41:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:41:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:54:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:54:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:54:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:56:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:56:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:56:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:57:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:57:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:57:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:59:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:59:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:59:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:59:15: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:59:15: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:59:15: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:72:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:72:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:72:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:74:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:74:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:74:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:75:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:75:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:75:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:76:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:76:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:76:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:78:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:78:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:78:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:78:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:78:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:78:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:78:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:78:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:78:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:79:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:79:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:79:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:79:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:79:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:79:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:92:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:92:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:92:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:94:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:94:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:94:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:95:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:95:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:95:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:96:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:96:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:96:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:98:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:98:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:98:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:98:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:98:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:98:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:98:21: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:98:21: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:98:21: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:99:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:99:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:99:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:99:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:99:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:99:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:112:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:112:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:112:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:115:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:115:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:115:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:127:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:127:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:127:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:130:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:130:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:130:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:139:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:139:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:139:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:142:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:142:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:142:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:42:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:42:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:42:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:58:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:58:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:58:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:97:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:97:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:97:1: warning: unreplaced symbol 'return'



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for series starting with [1/3] drm/i915: audit bo->resource usage
  2022-08-24 14:23 ` [Intel-gfx] " Luben Tuikov
                   ` (3 preceding siblings ...)
  (?)
@ 2022-08-24 16:44 ` Patchwork
  -1 siblings, 0 replies; 43+ messages in thread
From: Patchwork @ 2022-08-24 16:44 UTC (permalink / raw)
  To: Luben Tuikov; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 6279 bytes --]

== Series Details ==

Series: series starting with [1/3] drm/i915: audit bo->resource usage
URL   : https://patchwork.freedesktop.org/series/107680/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_12021 -> Patchwork_107680v1
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_107680v1 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_107680v1, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/index.html

Participating hosts (37 -> 37)
------------------------------

  Additional (1): fi-kbl-soraka 
  Missing    (1): fi-hsw-4770 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_107680v1:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_module_load@load:
    - bat-dg1-6:          [PASS][1] -> [FAIL][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/bat-dg1-6/igt@i915_module_load@load.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/bat-dg1-6/igt@i915_module_load@load.html

  * igt@i915_selftest@live@memory_region:
    - bat-dg1-5:          [PASS][3] -> [DMESG-FAIL][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/bat-dg1-5/igt@i915_selftest@live@memory_region.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/bat-dg1-5/igt@i915_selftest@live@memory_region.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@core_hotunplug@unbind-rebind:
    - {bat-dg2-9}:        [PASS][5] -> [{ABORT}][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/bat-dg2-9/igt@core_hotunplug@unbind-rebind.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/bat-dg2-9/igt@core_hotunplug@unbind-rebind.html
    - {bat-dg2-11}:       [PASS][7] -> [{ABORT}][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/bat-dg2-11/igt@core_hotunplug@unbind-rebind.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/bat-dg2-11/igt@core_hotunplug@unbind-rebind.html

  * igt@gem_lmem_swapping@basic:
    - {bat-dg2-8}:        NOTRUN -> [SKIP][9]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/bat-dg2-8/igt@gem_lmem_swapping@basic.html

  * igt@i915_module_load@load:
    - {bat-dg2-10}:       [PASS][10] -> [INCOMPLETE][11]
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/bat-dg2-10/igt@i915_module_load@load.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/bat-dg2-10/igt@i915_module_load@load.html

  
Known issues
------------

  Here are the changes found in Patchwork_107680v1 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@i915_suspend@basic-s3-without-i915:
    - fi-pnv-d510:        NOTRUN -> [INCOMPLETE][12] ([i915#6598] / [i915#6601])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/fi-pnv-d510/igt@i915_suspend@basic-s3-without-i915.html

  * igt@runner@aborted:
    - fi-kbl-soraka:      NOTRUN -> [FAIL][13] ([i915#6219])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/fi-kbl-soraka/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@gem_exec_suspend@basic-s3@smem:
    - {bat-rplp-1}:       [DMESG-WARN][14] ([i915#2867]) -> [PASS][15]
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/bat-rplp-1/igt@gem_exec_suspend@basic-s3@smem.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/bat-rplp-1/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@i915_selftest@live@requests:
    - fi-pnv-d510:        [DMESG-FAIL][16] ([i915#4528]) -> [PASS][17]
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/fi-pnv-d510/igt@i915_selftest@live@requests.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/fi-pnv-d510/igt@i915_selftest@live@requests.html

  
#### Warnings ####

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions:
    - fi-bsw-kefka:       [FAIL][18] -> [FAIL][19] ([i915#6298])
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12021/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/fi-bsw-kefka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4528]: https://gitlab.freedesktop.org/drm/intel/issues/4528
  [i915#4983]: https://gitlab.freedesktop.org/drm/intel/issues/4983
  [i915#6219]: https://gitlab.freedesktop.org/drm/intel/issues/6219
  [i915#6257]: https://gitlab.freedesktop.org/drm/intel/issues/6257
  [i915#6298]: https://gitlab.freedesktop.org/drm/intel/issues/6298
  [i915#6380]: https://gitlab.freedesktop.org/drm/intel/issues/6380
  [i915#6598]: https://gitlab.freedesktop.org/drm/intel/issues/6598
  [i915#6601]: https://gitlab.freedesktop.org/drm/intel/issues/6601


Build changes
-------------

  * Linux: CI_DRM_12021 -> Patchwork_107680v1

  CI-20190529: 20190529
  CI_DRM_12021: 078959b4819e4e0ab8cf2965e7bfd98278c0b35d @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6636: 1298b5f0e1b3e010657ffba41d2e775fab028e08 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_107680v1: 078959b4819e4e0ab8cf2965e7bfd98278c0b35d @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

d89370bd4ce1 drm/ttm: stop allocating a dummy resource for pipelined gutting
5700995ed3d0 drm/ttm: stop allocating dummy resources during BO creation
2f18a077f77a drm/i915: audit bo->resource usage

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/index.html

[-- Attachment #2: Type: text/html, Size: 6958 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-24 14:23 ` [Intel-gfx] " Luben Tuikov
@ 2022-08-30  7:33   ` Christian König
  -1 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-30  7:33 UTC (permalink / raw)
  To: dri-devel, Intel Graphics, Matthew Auld, Thomas Hellström
  Cc: Luben Tuikov

Hi guys,

can we get an rb/acked-by for this i915 change?

Basically we are just making sure that the driver doesn't crash when 
bo->resource is NULL and a bo doesn't have any backing store assigned to it.

The Intel CI seems to be happy with this change, so I'm pretty sure it 
is correct.

Thanks,
Christian.

Am 24.08.22 um 16:23 schrieb Luben Tuikov:
> From: Christian König <christian.koenig@amd.com>
>
> Make sure we can at least move and alloc TT objects without backing store.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>   2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index bc9c432edffe03..45ce2d1f754cc4 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   {
>   	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
>   						     bdev);
> -	struct ttm_resource_manager *man =
> -		ttm_manager_type(bo->bdev, bo->resource->mem_type);
>   	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>   	unsigned long ccs_pages = 0;
>   	enum ttm_caching caching;
> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   	if (!i915_tt)
>   		return NULL;
>   
> -	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
> -	    man->use_tt)
> +	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
> +	    ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>   		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>   
>   	caching = i915_ttm_select_tt_caching(obj);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index 9a7e50534b84bb..c420d1ab605b6f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
>   	bool clear;
>   	int ret;
>   
> -	if (GEM_WARN_ON(!obj)) {
> +	if (GEM_WARN_ON(!obj) || !bo->resource) {
>   		ttm_bo_move_null(bo, dst_mem);
>   		return 0;
>   	}


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-30  7:33   ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-30  7:33 UTC (permalink / raw)
  To: dri-devel, Intel Graphics, Matthew Auld, Thomas Hellström
  Cc: Luben Tuikov

Hi guys,

can we get an rb/acked-by for this i915 change?

Basically we are just making sure that the driver doesn't crash when 
bo->resource is NULL and a bo doesn't have any backing store assigned to it.

The Intel CI seems to be happy with this change, so I'm pretty sure it 
is correct.

Thanks,
Christian.

Am 24.08.22 um 16:23 schrieb Luben Tuikov:
> From: Christian König <christian.koenig@amd.com>
>
> Make sure we can at least move and alloc TT objects without backing store.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>   2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index bc9c432edffe03..45ce2d1f754cc4 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   {
>   	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
>   						     bdev);
> -	struct ttm_resource_manager *man =
> -		ttm_manager_type(bo->bdev, bo->resource->mem_type);
>   	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>   	unsigned long ccs_pages = 0;
>   	enum ttm_caching caching;
> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   	if (!i915_tt)
>   		return NULL;
>   
> -	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
> -	    man->use_tt)
> +	if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
> +	    ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>   		page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>   
>   	caching = i915_ttm_select_tt_caching(obj);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index 9a7e50534b84bb..c420d1ab605b6f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
>   	bool clear;
>   	int ret;
>   
> -	if (GEM_WARN_ON(!obj)) {
> +	if (GEM_WARN_ON(!obj) || !bo->resource) {
>   		ttm_bo_move_null(bo, dst_mem);
>   		return 0;
>   	}


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-30  7:33   ` [Intel-gfx] " Christian König
@ 2022-08-30 10:45     ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-30 10:45 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Hi,

On 30/08/2022 08:33, Christian König wrote:
> Hi guys,
> 
> can we get an rb/acked-by for this i915 change?
> 
> Basically we are just making sure that the driver doesn't crash when 
> bo->resource is NULL and a bo doesn't have any backing store assigned to 
> it.
> 
> The Intel CI seems to be happy with this change, so I'm pretty sure it 
> is correct.

It looks like DG2/DG1 (which happen to use TTM here) are no longer 
loading the module:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/index.html?

According to the logs the firmware is failing to load, so perhaps 
related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare users. 
See below.

> 
> Thanks,
> Christian.
> 
> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>> From: Christian König <christian.koenig@amd.com>
>>
>> Make sure we can at least move and alloc TT objects without backing 
>> store.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> index bc9c432edffe03..45ce2d1f754cc4 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>> ttm_buffer_object *bo,
>>   {
>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>> typeof(*i915),
>>                                bdev);
>> -    struct ttm_resource_manager *man =
>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>       unsigned long ccs_pages = 0;
>>       enum ttm_caching caching;
>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>> ttm_buffer_object *bo,
>>       if (!i915_tt)
>>           return NULL;
>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>> -        man->use_tt)
>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;

AFAICT i915 was massively relying on everything starting out in a 
"dummy" system memory resource (ttm_tt), where it then later 
"transitions" to the real resource. And if we need to clear the memory 
we rely on ZERO_ALLOC being set before calling into the i915_ttm_move() 
callback (even when allocating local-memory).

For ttm_bo_type_device objects (userspace stuff) it looks like this was 
previously handled by ttm_bo_validate() always doing:

   ret = ttm_tt_create(bo, true); /* clear = true */

Which we would always hit since the resource was always "compatible" for 
the dummy case. But it looks like this is no longer even called, since 
we can now call into ttm_move with bo->resource == NULL, which still 
calls tt_create eventually, which not always with clear = true.

All other objects are then ttm_bo_type_kernel so we don't care about 
clearing, except in the rare case of ALLOC_CPU_CLEAR, which was handled 
as per above in i915_ttm_tt_create(). But I think here bo->resource is 
NULL at the start when first creating the object, which will skip 
setting ZERO_ALLOC, which might explain the CI failure.

The other possible concern (not sure since CI didn't get that far) is 
around ttm_bo_pipeline_gutting(), which now leaves bo->resource = NULL. 
It looks like i915_ttm_shrink() was relying on that to unpopulate the 
ttm_tt. When later calling ttm_bo_validate(), i915_ttm_move() would see 
the SWAPPED flag set on the ttm_tt , re-populate it and then potentially 
move it back to local-memory.

What are your thoughts here? Also sorry if i915 is making a bit of mess 
here.

>>       caching = i915_ttm_select_tt_caching(obj);
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>> bool evict,
>>       bool clear;
>>       int ret;
>> -    if (GEM_WARN_ON(!obj)) {
>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>           ttm_bo_move_null(bo, dst_mem);
>>           return 0;
>>       }
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-30 10:45     ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-30 10:45 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Hi,

On 30/08/2022 08:33, Christian König wrote:
> Hi guys,
> 
> can we get an rb/acked-by for this i915 change?
> 
> Basically we are just making sure that the driver doesn't crash when 
> bo->resource is NULL and a bo doesn't have any backing store assigned to 
> it.
> 
> The Intel CI seems to be happy with this change, so I'm pretty sure it 
> is correct.

It looks like DG2/DG1 (which happen to use TTM here) are no longer 
loading the module:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_107680v1/index.html?

According to the logs the firmware is failing to load, so perhaps 
related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare users. 
See below.

> 
> Thanks,
> Christian.
> 
> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>> From: Christian König <christian.koenig@amd.com>
>>
>> Make sure we can at least move and alloc TT objects without backing 
>> store.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> index bc9c432edffe03..45ce2d1f754cc4 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>> ttm_buffer_object *bo,
>>   {
>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>> typeof(*i915),
>>                                bdev);
>> -    struct ttm_resource_manager *man =
>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>       unsigned long ccs_pages = 0;
>>       enum ttm_caching caching;
>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>> ttm_buffer_object *bo,
>>       if (!i915_tt)
>>           return NULL;
>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>> -        man->use_tt)
>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;

AFAICT i915 was massively relying on everything starting out in a 
"dummy" system memory resource (ttm_tt), where it then later 
"transitions" to the real resource. And if we need to clear the memory 
we rely on ZERO_ALLOC being set before calling into the i915_ttm_move() 
callback (even when allocating local-memory).

For ttm_bo_type_device objects (userspace stuff) it looks like this was 
previously handled by ttm_bo_validate() always doing:

   ret = ttm_tt_create(bo, true); /* clear = true */

Which we would always hit since the resource was always "compatible" for 
the dummy case. But it looks like this is no longer even called, since 
we can now call into ttm_move with bo->resource == NULL, which still 
calls tt_create eventually, which not always with clear = true.

All other objects are then ttm_bo_type_kernel so we don't care about 
clearing, except in the rare case of ALLOC_CPU_CLEAR, which was handled 
as per above in i915_ttm_tt_create(). But I think here bo->resource is 
NULL at the start when first creating the object, which will skip 
setting ZERO_ALLOC, which might explain the CI failure.

The other possible concern (not sure since CI didn't get that far) is 
around ttm_bo_pipeline_gutting(), which now leaves bo->resource = NULL. 
It looks like i915_ttm_shrink() was relying on that to unpopulate the 
ttm_tt. When later calling ttm_bo_validate(), i915_ttm_move() would see 
the SWAPPED flag set on the ttm_tt , re-populate it and then potentially 
move it back to local-memory.

What are your thoughts here? Also sorry if i915 is making a bit of mess 
here.

>>       caching = i915_ttm_select_tt_caching(obj);
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>> bool evict,
>>       bool clear;
>>       int ret;
>> -    if (GEM_WARN_ON(!obj)) {
>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>           ttm_bo_move_null(bo, dst_mem);
>>           return 0;
>>       }
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-30 10:45     ` [Intel-gfx] " Matthew Auld
@ 2022-08-31  8:16       ` Christian König
  -1 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31  8:16 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Hi Matthew,

Am 30.08.22 um 12:45 schrieb Matthew Auld:
> Hi,
>
> On 30/08/2022 08:33, Christian König wrote:
>> Hi guys,
>>
>> can we get an rb/acked-by for this i915 change?
>>
>> Basically we are just making sure that the driver doesn't crash when 
>> bo->resource is NULL and a bo doesn't have any backing store assigned 
>> to it.
>>
>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>> it is correct.
>
> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
> loading the module:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Caa9bdb0e31064859568708da8a74b899%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637974531164663116%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=UW8BEnIFXHawhAfLUcknGmE88g2wwAiTLAQ3Y5v1pFA%3D&amp;reserved=0? 
>
>
> According to the logs the firmware is failing to load, so perhaps 
> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
> users. See below.
>
>>
>> Thanks,
>> Christian.
>>
>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>> From: Christian König <christian.koenig@amd.com>
>>>
>>> Make sure we can at least move and alloc TT objects without backing 
>>> store.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>> ttm_buffer_object *bo,
>>>   {
>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>> typeof(*i915),
>>>                                bdev);
>>> -    struct ttm_resource_manager *man =
>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>       unsigned long ccs_pages = 0;
>>>       enum ttm_caching caching;
>>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>> ttm_buffer_object *bo,
>>>       if (!i915_tt)
>>>           return NULL;
>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>> -        man->use_tt)
>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>
> AFAICT i915 was massively relying on everything starting out in a 
> "dummy" system memory resource (ttm_tt), where it then later 
> "transitions" to the real resource. And if we need to clear the memory 
> we rely on ZERO_ALLOC being set before calling into the 
> i915_ttm_move() callback (even when allocating local-memory).
>
> For ttm_bo_type_device objects (userspace stuff) it looks like this 
> was previously handled by ttm_bo_validate() always doing:
>
>   ret = ttm_tt_create(bo, true); /* clear = true */
>
> Which we would always hit since the resource was always "compatible" 
> for the dummy case. But it looks like this is no longer even called, 
> since we can now call into ttm_move with bo->resource == NULL, which 
> still calls tt_create eventually, which not always with clear = true.
>
> All other objects are then ttm_bo_type_kernel so we don't care about 
> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
> handled as per above in i915_ttm_tt_create(). But I think here 
> bo->resource is NULL at the start when first creating the object, 
> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>
> The other possible concern (not sure since CI didn't get that far) is 
> around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
> NULL. It looks like i915_ttm_shrink() was relying on that to 
> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
> re-populate it and then potentially move it back to local-memory.
>
> What are your thoughts here? Also sorry if i915 is making a bit of 
> mess here.

First of all thanks a lot for taking a look. We previously got reports 
about strange crashes with this patch, but couldn't really reproduce 
them (even not by sending them out again). This explains that a bit.

The simplest solution would be to just change the && into a ||, e.g. 
when previously either no resource is allocated or the resource requires 
to use a tt we clear it.

That should give you the same behavior as before, but I agree that this 
is a bit messy.

I've been considering to replacing the ttm_bo_type with a bunch of 
behavior flags for a bo. I'm hoping that this will clean things up a bit.

Regards,
Christian.

>
>>>       caching = i915_ttm_select_tt_caching(obj);
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>>> bool evict,
>>>       bool clear;
>>>       int ret;
>>> -    if (GEM_WARN_ON(!obj)) {
>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>           ttm_bo_move_null(bo, dst_mem);
>>>           return 0;
>>>       }
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31  8:16       ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31  8:16 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Hi Matthew,

Am 30.08.22 um 12:45 schrieb Matthew Auld:
> Hi,
>
> On 30/08/2022 08:33, Christian König wrote:
>> Hi guys,
>>
>> can we get an rb/acked-by for this i915 change?
>>
>> Basically we are just making sure that the driver doesn't crash when 
>> bo->resource is NULL and a bo doesn't have any backing store assigned 
>> to it.
>>
>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>> it is correct.
>
> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
> loading the module:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Caa9bdb0e31064859568708da8a74b899%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637974531164663116%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=UW8BEnIFXHawhAfLUcknGmE88g2wwAiTLAQ3Y5v1pFA%3D&amp;reserved=0? 
>
>
> According to the logs the firmware is failing to load, so perhaps 
> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
> users. See below.
>
>>
>> Thanks,
>> Christian.
>>
>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>> From: Christian König <christian.koenig@amd.com>
>>>
>>> Make sure we can at least move and alloc TT objects without backing 
>>> store.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>> ttm_buffer_object *bo,
>>>   {
>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>> typeof(*i915),
>>>                                bdev);
>>> -    struct ttm_resource_manager *man =
>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>       unsigned long ccs_pages = 0;
>>>       enum ttm_caching caching;
>>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>> ttm_buffer_object *bo,
>>>       if (!i915_tt)
>>>           return NULL;
>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>> -        man->use_tt)
>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>
> AFAICT i915 was massively relying on everything starting out in a 
> "dummy" system memory resource (ttm_tt), where it then later 
> "transitions" to the real resource. And if we need to clear the memory 
> we rely on ZERO_ALLOC being set before calling into the 
> i915_ttm_move() callback (even when allocating local-memory).
>
> For ttm_bo_type_device objects (userspace stuff) it looks like this 
> was previously handled by ttm_bo_validate() always doing:
>
>   ret = ttm_tt_create(bo, true); /* clear = true */
>
> Which we would always hit since the resource was always "compatible" 
> for the dummy case. But it looks like this is no longer even called, 
> since we can now call into ttm_move with bo->resource == NULL, which 
> still calls tt_create eventually, which not always with clear = true.
>
> All other objects are then ttm_bo_type_kernel so we don't care about 
> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
> handled as per above in i915_ttm_tt_create(). But I think here 
> bo->resource is NULL at the start when first creating the object, 
> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>
> The other possible concern (not sure since CI didn't get that far) is 
> around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
> NULL. It looks like i915_ttm_shrink() was relying on that to 
> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
> re-populate it and then potentially move it back to local-memory.
>
> What are your thoughts here? Also sorry if i915 is making a bit of 
> mess here.

First of all thanks a lot for taking a look. We previously got reports 
about strange crashes with this patch, but couldn't really reproduce 
them (even not by sending them out again). This explains that a bit.

The simplest solution would be to just change the && into a ||, e.g. 
when previously either no resource is allocated or the resource requires 
to use a tt we clear it.

That should give you the same behavior as before, but I agree that this 
is a bit messy.

I've been considering to replacing the ttm_bo_type with a bunch of 
behavior flags for a bo. I'm hoping that this will clean things up a bit.

Regards,
Christian.

>
>>>       caching = i915_ttm_select_tt_caching(obj);
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>>> bool evict,
>>>       bool clear;
>>>       int ret;
>>> -    if (GEM_WARN_ON(!obj)) {
>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>           ttm_bo_move_null(bo, dst_mem);
>>>           return 0;
>>>       }
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31  8:16       ` [Intel-gfx] " Christian König
@ 2022-08-31  9:26         ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31  9:26 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 09:16, Christian König wrote:
> Hi Matthew,
> 
> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>> Hi,
>>
>> On 30/08/2022 08:33, Christian König wrote:
>>> Hi guys,
>>>
>>> can we get an rb/acked-by for this i915 change?
>>>
>>> Basically we are just making sure that the driver doesn't crash when 
>>> bo->resource is NULL and a bo doesn't have any backing store assigned 
>>> to it.
>>>
>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>> it is correct.
>>
>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>> loading the module:
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Caa9bdb0e31064859568708da8a74b899%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637974531164663116%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=UW8BEnIFXHawhAfLUcknGmE88g2wwAiTLAQ3Y5v1pFA%3D&amp;reserved=0?
>>
>> According to the logs the firmware is failing to load, so perhaps 
>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>> users. See below.
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>> From: Christian König <christian.koenig@amd.com>
>>>>
>>>> Make sure we can at least move and alloc TT objects without backing 
>>>> store.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>>> ttm_buffer_object *bo,
>>>>   {
>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>> typeof(*i915),
>>>>                                bdev);
>>>> -    struct ttm_resource_manager *man =
>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>       unsigned long ccs_pages = 0;
>>>>       enum ttm_caching caching;
>>>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>>> ttm_buffer_object *bo,
>>>>       if (!i915_tt)
>>>>           return NULL;
>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>> -        man->use_tt)
>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>
>> AFAICT i915 was massively relying on everything starting out in a 
>> "dummy" system memory resource (ttm_tt), where it then later 
>> "transitions" to the real resource. And if we need to clear the memory 
>> we rely on ZERO_ALLOC being set before calling into the 
>> i915_ttm_move() callback (even when allocating local-memory).
>>
>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>> was previously handled by ttm_bo_validate() always doing:
>>
>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>
>> Which we would always hit since the resource was always "compatible" 
>> for the dummy case. But it looks like this is no longer even called, 
>> since we can now call into ttm_move with bo->resource == NULL, which 
>> still calls tt_create eventually, which not always with clear = true.
>>
>> All other objects are then ttm_bo_type_kernel so we don't care about 
>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>> handled as per above in i915_ttm_tt_create(). But I think here 
>> bo->resource is NULL at the start when first creating the object, 
>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>
>> The other possible concern (not sure since CI didn't get that far) is 
>> around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>> re-populate it and then potentially move it back to local-memory.
>>
>> What are your thoughts here? Also sorry if i915 is making a bit of 
>> mess here.
> 
> First of all thanks a lot for taking a look. We previously got reports 
> about strange crashes with this patch, but couldn't really reproduce 
> them (even not by sending them out again). This explains that a bit.
> 
> The simplest solution would be to just change the && into a ||, e.g. 
> when previously either no resource is allocated or the resource requires 
> to use a tt we clear it.
> 
> That should give you the same behavior as before, but I agree that this 
> is a bit messy.

Yeah, that should do the trick.

That hopefully just leaves i915_ttm_shrink(), which is swapping out 
shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
force the pipeline-gutting path, which importantly unpopulates the 
ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
AFAICT it looks like that will now also nuke the bo->resource, instead 
of just leaving it in system memory. My assumption is that when later 
calling ttm_bo_validate(), it will just do the bo_move_null() in 
i915_ttm_move(), instead of re-populating the ttm_tt and then 
potentially copying it back to local-memory?

> 
> I've been considering to replacing the ttm_bo_type with a bunch of 
> behavior flags for a bo. I'm hoping that this will clean things up a bit.
> 
> Regards,
> Christian.
> 
>>
>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>>>> bool evict,
>>>>       bool clear;
>>>>       int ret;
>>>> -    if (GEM_WARN_ON(!obj)) {
>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>           return 0;
>>>>       }
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31  9:26         ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31  9:26 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 09:16, Christian König wrote:
> Hi Matthew,
> 
> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>> Hi,
>>
>> On 30/08/2022 08:33, Christian König wrote:
>>> Hi guys,
>>>
>>> can we get an rb/acked-by for this i915 change?
>>>
>>> Basically we are just making sure that the driver doesn't crash when 
>>> bo->resource is NULL and a bo doesn't have any backing store assigned 
>>> to it.
>>>
>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>> it is correct.
>>
>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>> loading the module:
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Caa9bdb0e31064859568708da8a74b899%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637974531164663116%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=UW8BEnIFXHawhAfLUcknGmE88g2wwAiTLAQ3Y5v1pFA%3D&amp;reserved=0?
>>
>> According to the logs the firmware is failing to load, so perhaps 
>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>> users. See below.
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>> From: Christian König <christian.koenig@amd.com>
>>>>
>>>> Make sure we can at least move and alloc TT objects without backing 
>>>> store.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>> @@ -271,8 +271,6 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>>> ttm_buffer_object *bo,
>>>>   {
>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>> typeof(*i915),
>>>>                                bdev);
>>>> -    struct ttm_resource_manager *man =
>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>       unsigned long ccs_pages = 0;
>>>>       enum ttm_caching caching;
>>>> @@ -286,8 +284,8 @@ static struct ttm_tt *i915_ttm_tt_create(struct 
>>>> ttm_buffer_object *bo,
>>>>       if (!i915_tt)
>>>>           return NULL;
>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>> -        man->use_tt)
>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>
>> AFAICT i915 was massively relying on everything starting out in a 
>> "dummy" system memory resource (ttm_tt), where it then later 
>> "transitions" to the real resource. And if we need to clear the memory 
>> we rely on ZERO_ALLOC being set before calling into the 
>> i915_ttm_move() callback (even when allocating local-memory).
>>
>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>> was previously handled by ttm_bo_validate() always doing:
>>
>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>
>> Which we would always hit since the resource was always "compatible" 
>> for the dummy case. But it looks like this is no longer even called, 
>> since we can now call into ttm_move with bo->resource == NULL, which 
>> still calls tt_create eventually, which not always with clear = true.
>>
>> All other objects are then ttm_bo_type_kernel so we don't care about 
>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>> handled as per above in i915_ttm_tt_create(). But I think here 
>> bo->resource is NULL at the start when first creating the object, 
>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>
>> The other possible concern (not sure since CI didn't get that far) is 
>> around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>> re-populate it and then potentially move it back to local-memory.
>>
>> What are your thoughts here? Also sorry if i915 is making a bit of 
>> mess here.
> 
> First of all thanks a lot for taking a look. We previously got reports 
> about strange crashes with this patch, but couldn't really reproduce 
> them (even not by sending them out again). This explains that a bit.
> 
> The simplest solution would be to just change the && into a ||, e.g. 
> when previously either no resource is allocated or the resource requires 
> to use a tt we clear it.
> 
> That should give you the same behavior as before, but I agree that this 
> is a bit messy.

Yeah, that should do the trick.

That hopefully just leaves i915_ttm_shrink(), which is swapping out 
shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
force the pipeline-gutting path, which importantly unpopulates the 
ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
AFAICT it looks like that will now also nuke the bo->resource, instead 
of just leaving it in system memory. My assumption is that when later 
calling ttm_bo_validate(), it will just do the bo_move_null() in 
i915_ttm_move(), instead of re-populating the ttm_tt and then 
potentially copying it back to local-memory?

> 
> I've been considering to replacing the ttm_bo_type with a bunch of 
> behavior flags for a bo. I'm hoping that this will clean things up a bit.
> 
> Regards,
> Christian.
> 
>>
>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
>>>> bool evict,
>>>>       bool clear;
>>>>       int ret;
>>>> -    if (GEM_WARN_ON(!obj)) {
>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>           return 0;
>>>>       }
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31  9:26         ` [Intel-gfx] " Matthew Auld
@ 2022-08-31  9:38           ` Christian König
  -1 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31  9:38 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 11:26 schrieb Matthew Auld:
> On 31/08/2022 09:16, Christian König wrote:
>> Hi Matthew,
>>
>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>> Hi,
>>>
>>> On 30/08/2022 08:33, Christian König wrote:
>>>> Hi guys,
>>>>
>>>> can we get an rb/acked-by for this i915 change?
>>>>
>>>> Basically we are just making sure that the driver doesn't crash 
>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>> assigned to it.
>>>>
>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>> it is correct.
>>>
>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>> loading the module:
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&amp;reserved=0? 
>>>
>>>
>>> According to the logs the firmware is failing to load, so perhaps 
>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>> users. See below.
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>
>>>>> Make sure we can at least move and alloc TT objects without 
>>>>> backing store.
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>   {
>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>> typeof(*i915),
>>>>>                                bdev);
>>>>> -    struct ttm_resource_manager *man =
>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>       unsigned long ccs_pages = 0;
>>>>>       enum ttm_caching caching;
>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>       if (!i915_tt)
>>>>>           return NULL;
>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>> -        man->use_tt)
>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>
>>> AFAICT i915 was massively relying on everything starting out in a 
>>> "dummy" system memory resource (ttm_tt), where it then later 
>>> "transitions" to the real resource. And if we need to clear the 
>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>> i915_ttm_move() callback (even when allocating local-memory).
>>>
>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>> was previously handled by ttm_bo_validate() always doing:
>>>
>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>
>>> Which we would always hit since the resource was always "compatible" 
>>> for the dummy case. But it looks like this is no longer even called, 
>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>> still calls tt_create eventually, which not always with clear = true.
>>>
>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>> bo->resource is NULL at the start when first creating the object, 
>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>
>>> The other possible concern (not sure since CI didn't get that far) 
>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>> re-populate it and then potentially move it back to local-memory.
>>>
>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>> mess here.
>>
>> First of all thanks a lot for taking a look. We previously got 
>> reports about strange crashes with this patch, but couldn't really 
>> reproduce them (even not by sending them out again). This explains 
>> that a bit.
>>
>> The simplest solution would be to just change the && into a ||, e.g. 
>> when previously either no resource is allocated or the resource 
>> requires to use a tt we clear it.
>>
>> That should give you the same behavior as before, but I agree that 
>> this is a bit messy.
>
> Yeah, that should do the trick.
>
> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
> force the pipeline-gutting path, which importantly unpopulates the 
> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
> AFAICT it looks like that will now also nuke the bo->resource, instead 
> of just leaving it in system memory. My assumption is that when later 
> calling ttm_bo_validate(), it will just do the bo_move_null() in 
> i915_ttm_move(), instead of re-populating the ttm_tt and then 
> potentially copying it back to local-memory?

Well you do ttm_bo_validate() with something like GTT domain, don't you? 
This should result in re-populating the tt object, but I'm not 100% sure 
if that really works as expected.

Thanks,
Christian.

>
>>
>> I've been considering to replacing the ttm_bo_type with a bunch of 
>> behavior flags for a bo. I'm hoping that this will clean things up a 
>> bit.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>> *bo, bool evict,
>>>>>       bool clear;
>>>>>       int ret;
>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>           return 0;
>>>>>       }
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31  9:38           ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31  9:38 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 11:26 schrieb Matthew Auld:
> On 31/08/2022 09:16, Christian König wrote:
>> Hi Matthew,
>>
>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>> Hi,
>>>
>>> On 30/08/2022 08:33, Christian König wrote:
>>>> Hi guys,
>>>>
>>>> can we get an rb/acked-by for this i915 change?
>>>>
>>>> Basically we are just making sure that the driver doesn't crash 
>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>> assigned to it.
>>>>
>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>> it is correct.
>>>
>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>> loading the module:
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&amp;reserved=0? 
>>>
>>>
>>> According to the logs the firmware is failing to load, so perhaps 
>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>> users. See below.
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>
>>>>> Make sure we can at least move and alloc TT objects without 
>>>>> backing store.
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>   {
>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>> typeof(*i915),
>>>>>                                bdev);
>>>>> -    struct ttm_resource_manager *man =
>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>       unsigned long ccs_pages = 0;
>>>>>       enum ttm_caching caching;
>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>       if (!i915_tt)
>>>>>           return NULL;
>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>> -        man->use_tt)
>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>
>>> AFAICT i915 was massively relying on everything starting out in a 
>>> "dummy" system memory resource (ttm_tt), where it then later 
>>> "transitions" to the real resource. And if we need to clear the 
>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>> i915_ttm_move() callback (even when allocating local-memory).
>>>
>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>> was previously handled by ttm_bo_validate() always doing:
>>>
>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>
>>> Which we would always hit since the resource was always "compatible" 
>>> for the dummy case. But it looks like this is no longer even called, 
>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>> still calls tt_create eventually, which not always with clear = true.
>>>
>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>> bo->resource is NULL at the start when first creating the object, 
>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>
>>> The other possible concern (not sure since CI didn't get that far) 
>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>> re-populate it and then potentially move it back to local-memory.
>>>
>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>> mess here.
>>
>> First of all thanks a lot for taking a look. We previously got 
>> reports about strange crashes with this patch, but couldn't really 
>> reproduce them (even not by sending them out again). This explains 
>> that a bit.
>>
>> The simplest solution would be to just change the && into a ||, e.g. 
>> when previously either no resource is allocated or the resource 
>> requires to use a tt we clear it.
>>
>> That should give you the same behavior as before, but I agree that 
>> this is a bit messy.
>
> Yeah, that should do the trick.
>
> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
> force the pipeline-gutting path, which importantly unpopulates the 
> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
> AFAICT it looks like that will now also nuke the bo->resource, instead 
> of just leaving it in system memory. My assumption is that when later 
> calling ttm_bo_validate(), it will just do the bo_move_null() in 
> i915_ttm_move(), instead of re-populating the ttm_tt and then 
> potentially copying it back to local-memory?

Well you do ttm_bo_validate() with something like GTT domain, don't you? 
This should result in re-populating the tt object, but I'm not 100% sure 
if that really works as expected.

Thanks,
Christian.

>
>>
>> I've been considering to replacing the ttm_bo_type with a bunch of 
>> behavior flags for a bo. I'm hoping that this will clean things up a 
>> bit.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>> *bo, bool evict,
>>>>>       bool clear;
>>>>>       int ret;
>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>           return 0;
>>>>>       }
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31  9:38           ` [Intel-gfx] " Christian König
@ 2022-08-31 10:37             ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 10:37 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 10:38, Christian König wrote:
> Am 31.08.22 um 11:26 schrieb Matthew Auld:
>> On 31/08/2022 09:16, Christian König wrote:
>>> Hi Matthew,
>>>
>>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>>> Hi,
>>>>
>>>> On 30/08/2022 08:33, Christian König wrote:
>>>>> Hi guys,
>>>>>
>>>>> can we get an rb/acked-by for this i915 change?
>>>>>
>>>>> Basically we are just making sure that the driver doesn't crash 
>>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>>> assigned to it.
>>>>>
>>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>>> it is correct.
>>>>
>>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>>> loading the module:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&amp;reserved=0?
>>>>
>>>> According to the logs the firmware is failing to load, so perhaps 
>>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>>> users. See below.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>> Make sure we can at least move and alloc TT objects without 
>>>>>> backing store.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>   {
>>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>>> typeof(*i915),
>>>>>>                                bdev);
>>>>>> -    struct ttm_resource_manager *man =
>>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>>       unsigned long ccs_pages = 0;
>>>>>>       enum ttm_caching caching;
>>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>       if (!i915_tt)
>>>>>>           return NULL;
>>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>>> -        man->use_tt)
>>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>>
>>>> AFAICT i915 was massively relying on everything starting out in a 
>>>> "dummy" system memory resource (ttm_tt), where it then later 
>>>> "transitions" to the real resource. And if we need to clear the 
>>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>>> i915_ttm_move() callback (even when allocating local-memory).
>>>>
>>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>>> was previously handled by ttm_bo_validate() always doing:
>>>>
>>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>>
>>>> Which we would always hit since the resource was always "compatible" 
>>>> for the dummy case. But it looks like this is no longer even called, 
>>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>>> still calls tt_create eventually, which not always with clear = true.
>>>>
>>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>>> bo->resource is NULL at the start when first creating the object, 
>>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>>
>>>> The other possible concern (not sure since CI didn't get that far) 
>>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>>> re-populate it and then potentially move it back to local-memory.
>>>>
>>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>>> mess here.
>>>
>>> First of all thanks a lot for taking a look. We previously got 
>>> reports about strange crashes with this patch, but couldn't really 
>>> reproduce them (even not by sending them out again). This explains 
>>> that a bit.
>>>
>>> The simplest solution would be to just change the && into a ||, e.g. 
>>> when previously either no resource is allocated or the resource 
>>> requires to use a tt we clear it.
>>>
>>> That should give you the same behavior as before, but I agree that 
>>> this is a bit messy.
>>
>> Yeah, that should do the trick.
>>
>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
>> force the pipeline-gutting path, which importantly unpopulates the 
>> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
>> AFAICT it looks like that will now also nuke the bo->resource, instead 
>> of just leaving it in system memory. My assumption is that when later 
>> calling ttm_bo_validate(), it will just do the bo_move_null() in 
>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>> potentially copying it back to local-memory?
> 
> Well you do ttm_bo_validate() with something like GTT domain, don't you? 
> This should result in re-populating the tt object, but I'm not 100% sure 
> if that really works as expected.

AFAIK for domains we either have system memory (which uses ttm_tt and 
might be shmem underneath) or local-memory. But perhaps i915 is doing 
something wrong here, or abusing TTM in some way. I'm not sure tbh.

Anyway, I think we have two cases here:

- We have some system memory only object. After doing i915_ttm_shrink(), 
bo->resource is now NULL. We then call ttm_bo_validate() at some later 
point, but here we don't need to copy anything, but it also looks like 
ttm_bo_handle_move_mem() won't populate the ttm_tt or us either, since 
mem_type == TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care 
of this, but now we just call ttm_bo_move_null().

- We have a local-memory only object, which was evicted to shmem, and 
then swapped out by the shrinker like above. The bo->resource is NULL. 
However this time when calling ttm_bo_validate() we need to actually do 
a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
i915_ttm_move() was taking care of this, but now we just call 
ttm_bo_move_null().

Perhaps i915 is doing something wrong in the above two cases?

> 
> Thanks,
> Christian.
> 
>>
>>>
>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>> behavior flags for a bo. I'm hoping that this will clean things up a 
>>> bit.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>> *bo, bool evict,
>>>>>>       bool clear;
>>>>>>       int ret;
>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>           return 0;
>>>>>>       }
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 10:37             ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 10:37 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 10:38, Christian König wrote:
> Am 31.08.22 um 11:26 schrieb Matthew Auld:
>> On 31/08/2022 09:16, Christian König wrote:
>>> Hi Matthew,
>>>
>>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>>> Hi,
>>>>
>>>> On 30/08/2022 08:33, Christian König wrote:
>>>>> Hi guys,
>>>>>
>>>>> can we get an rb/acked-by for this i915 change?
>>>>>
>>>>> Basically we are just making sure that the driver doesn't crash 
>>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>>> assigned to it.
>>>>>
>>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>>> it is correct.
>>>>
>>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>>> loading the module:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&amp;data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&amp;reserved=0?
>>>>
>>>> According to the logs the firmware is failing to load, so perhaps 
>>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>>> users. See below.
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>> Make sure we can at least move and alloc TT objects without 
>>>>>> backing store.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>   {
>>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>>> typeof(*i915),
>>>>>>                                bdev);
>>>>>> -    struct ttm_resource_manager *man =
>>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>>       unsigned long ccs_pages = 0;
>>>>>>       enum ttm_caching caching;
>>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>>       if (!i915_tt)
>>>>>>           return NULL;
>>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>>> -        man->use_tt)
>>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>>
>>>> AFAICT i915 was massively relying on everything starting out in a 
>>>> "dummy" system memory resource (ttm_tt), where it then later 
>>>> "transitions" to the real resource. And if we need to clear the 
>>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>>> i915_ttm_move() callback (even when allocating local-memory).
>>>>
>>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>>> was previously handled by ttm_bo_validate() always doing:
>>>>
>>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>>
>>>> Which we would always hit since the resource was always "compatible" 
>>>> for the dummy case. But it looks like this is no longer even called, 
>>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>>> still calls tt_create eventually, which not always with clear = true.
>>>>
>>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>>> bo->resource is NULL at the start when first creating the object, 
>>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>>
>>>> The other possible concern (not sure since CI didn't get that far) 
>>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>>> re-populate it and then potentially move it back to local-memory.
>>>>
>>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>>> mess here.
>>>
>>> First of all thanks a lot for taking a look. We previously got 
>>> reports about strange crashes with this patch, but couldn't really 
>>> reproduce them (even not by sending them out again). This explains 
>>> that a bit.
>>>
>>> The simplest solution would be to just change the && into a ||, e.g. 
>>> when previously either no resource is allocated or the resource 
>>> requires to use a tt we clear it.
>>>
>>> That should give you the same behavior as before, but I agree that 
>>> this is a bit messy.
>>
>> Yeah, that should do the trick.
>>
>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
>> force the pipeline-gutting path, which importantly unpopulates the 
>> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
>> AFAICT it looks like that will now also nuke the bo->resource, instead 
>> of just leaving it in system memory. My assumption is that when later 
>> calling ttm_bo_validate(), it will just do the bo_move_null() in 
>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>> potentially copying it back to local-memory?
> 
> Well you do ttm_bo_validate() with something like GTT domain, don't you? 
> This should result in re-populating the tt object, but I'm not 100% sure 
> if that really works as expected.

AFAIK for domains we either have system memory (which uses ttm_tt and 
might be shmem underneath) or local-memory. But perhaps i915 is doing 
something wrong here, or abusing TTM in some way. I'm not sure tbh.

Anyway, I think we have two cases here:

- We have some system memory only object. After doing i915_ttm_shrink(), 
bo->resource is now NULL. We then call ttm_bo_validate() at some later 
point, but here we don't need to copy anything, but it also looks like 
ttm_bo_handle_move_mem() won't populate the ttm_tt or us either, since 
mem_type == TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care 
of this, but now we just call ttm_bo_move_null().

- We have a local-memory only object, which was evicted to shmem, and 
then swapped out by the shrinker like above. The bo->resource is NULL. 
However this time when calling ttm_bo_validate() we need to actually do 
a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
i915_ttm_move() was taking care of this, but now we just call 
ttm_bo_move_null().

Perhaps i915 is doing something wrong in the above two cases?

> 
> Thanks,
> Christian.
> 
>>
>>>
>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>> behavior flags for a bo. I'm hoping that this will clean things up a 
>>> bit.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>> *bo, bool evict,
>>>>>>       bool clear;
>>>>>>       int ret;
>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>           return 0;
>>>>>>       }
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 10:37             ` [Intel-gfx] " Matthew Auld
@ 2022-08-31 11:03               ` Christian König
  -1 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31 11:03 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 12:37 schrieb Matthew Auld:
> [SNIP]
>>>
>>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements 
>>> to force the pipeline-gutting path, which importantly unpopulates 
>>> the ttm_tt for us (since ttm_tt_unpopulate is not exported it 
>>> seems). But AFAICT it looks like that will now also nuke the 
>>> bo->resource, instead of just leaving it in system memory. My 
>>> assumption is that when later calling ttm_bo_validate(), it will 
>>> just do the bo_move_null() in i915_ttm_move(), instead of 
>>> re-populating the ttm_tt and then potentially copying it back to 
>>> local-memory?
>>
>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>> you? This should result in re-populating the tt object, but I'm not 
>> 100% sure if that really works as expected.
>
> AFAIK for domains we either have system memory (which uses ttm_tt and 
> might be shmem underneath) or local-memory. But perhaps i915 is doing 
> something wrong here, or abusing TTM in some way. I'm not sure tbh.
>
> Anyway, I think we have two cases here:
>
> - We have some system memory only object. After doing 
> i915_ttm_shrink(), bo->resource is now NULL. We then call 
> ttm_bo_validate() at some later point, but here we don't need to copy 
> anything, but it also looks like ttm_bo_handle_move_mem() won't 
> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. It 
> looks like i915_ttm_move() was taking care of this, but now we just 
> call ttm_bo_move_null().
>
> - We have a local-memory only object, which was evicted to shmem, and 
> then swapped out by the shrinker like above. The bo->resource is NULL. 
> However this time when calling ttm_bo_validate() we need to actually 
> do a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
> i915_ttm_move() was taking care of this, but now we just call 
> ttm_bo_move_null().
>
> Perhaps i915 is doing something wrong in the above two cases?

Mhm, as far as I can see that should still work.

See previously you should got a transition from SYSTEM->GTT in 
i915_ttm_move() to re-create your backing store. Not you get 
NULL->SYSTEM which is handled by ttm_bo_move_null() and then SYSTEM->GTT.

If you just validated to SYSTEM memory before I think the tt object 
wouldn't have been populated either.

Regards,
Christian.

>
>>
>> Thanks,
>> Christian.
>>
>>>
>>>>
>>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>>> behavior flags for a bo. I'm hoping that this will clean things up 
>>>> a bit.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>> *bo, bool evict,
>>>>>>>       bool clear;
>>>>>>>       int ret;
>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>           return 0;
>>>>>>>       }
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 11:03               ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31 11:03 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 12:37 schrieb Matthew Auld:
> [SNIP]
>>>
>>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements 
>>> to force the pipeline-gutting path, which importantly unpopulates 
>>> the ttm_tt for us (since ttm_tt_unpopulate is not exported it 
>>> seems). But AFAICT it looks like that will now also nuke the 
>>> bo->resource, instead of just leaving it in system memory. My 
>>> assumption is that when later calling ttm_bo_validate(), it will 
>>> just do the bo_move_null() in i915_ttm_move(), instead of 
>>> re-populating the ttm_tt and then potentially copying it back to 
>>> local-memory?
>>
>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>> you? This should result in re-populating the tt object, but I'm not 
>> 100% sure if that really works as expected.
>
> AFAIK for domains we either have system memory (which uses ttm_tt and 
> might be shmem underneath) or local-memory. But perhaps i915 is doing 
> something wrong here, or abusing TTM in some way. I'm not sure tbh.
>
> Anyway, I think we have two cases here:
>
> - We have some system memory only object. After doing 
> i915_ttm_shrink(), bo->resource is now NULL. We then call 
> ttm_bo_validate() at some later point, but here we don't need to copy 
> anything, but it also looks like ttm_bo_handle_move_mem() won't 
> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. It 
> looks like i915_ttm_move() was taking care of this, but now we just 
> call ttm_bo_move_null().
>
> - We have a local-memory only object, which was evicted to shmem, and 
> then swapped out by the shrinker like above. The bo->resource is NULL. 
> However this time when calling ttm_bo_validate() we need to actually 
> do a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
> i915_ttm_move() was taking care of this, but now we just call 
> ttm_bo_move_null().
>
> Perhaps i915 is doing something wrong in the above two cases?

Mhm, as far as I can see that should still work.

See previously you should got a transition from SYSTEM->GTT in 
i915_ttm_move() to re-create your backing store. Not you get 
NULL->SYSTEM which is handled by ttm_bo_move_null() and then SYSTEM->GTT.

If you just validated to SYSTEM memory before I think the tt object 
wouldn't have been populated either.

Regards,
Christian.

>
>>
>> Thanks,
>> Christian.
>>
>>>
>>>>
>>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>>> behavior flags for a bo. I'm hoping that this will clean things up 
>>>> a bit.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>> *bo, bool evict,
>>>>>>>       bool clear;
>>>>>>>       int ret;
>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>           return 0;
>>>>>>>       }
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 11:03               ` [Intel-gfx] " Christian König
@ 2022-08-31 12:06                 ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 12:06 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 12:03, Christian König wrote:
> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>> [SNIP]
>>>>
>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>>>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements 
>>>> to force the pipeline-gutting path, which importantly unpopulates 
>>>> the ttm_tt for us (since ttm_tt_unpopulate is not exported it 
>>>> seems). But AFAICT it looks like that will now also nuke the 
>>>> bo->resource, instead of just leaving it in system memory. My 
>>>> assumption is that when later calling ttm_bo_validate(), it will 
>>>> just do the bo_move_null() in i915_ttm_move(), instead of 
>>>> re-populating the ttm_tt and then potentially copying it back to 
>>>> local-memory?
>>>
>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>> you? This should result in re-populating the tt object, but I'm not 
>>> 100% sure if that really works as expected.
>>
>> AFAIK for domains we either have system memory (which uses ttm_tt and 
>> might be shmem underneath) or local-memory. But perhaps i915 is doing 
>> something wrong here, or abusing TTM in some way. I'm not sure tbh.
>>
>> Anyway, I think we have two cases here:
>>
>> - We have some system memory only object. After doing 
>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>> ttm_bo_validate() at some later point, but here we don't need to copy 
>> anything, but it also looks like ttm_bo_handle_move_mem() won't 
>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. It 
>> looks like i915_ttm_move() was taking care of this, but now we just 
>> call ttm_bo_move_null().
>>
>> - We have a local-memory only object, which was evicted to shmem, and 
>> then swapped out by the shrinker like above. The bo->resource is NULL. 
>> However this time when calling ttm_bo_validate() we need to actually 
>> do a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
>> i915_ttm_move() was taking care of this, but now we just call 
>> ttm_bo_move_null().
>>
>> Perhaps i915 is doing something wrong in the above two cases?
> 
> Mhm, as far as I can see that should still work.
> 
> See previously you should got a transition from SYSTEM->GTT in 
> i915_ttm_move() to re-create your backing store. Not you get 
> NULL->SYSTEM which is handled by ttm_bo_move_null() and then SYSTEM->GTT.

What is GTT here in TTM world? Also I'm not seeing where there is this 
SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is only 
calling ttm_bo_validate() once when acquiring the pages, and we don't 
call it again, unless it was evicted (and potentially swapped out).

> 
> If you just validated to SYSTEM memory before I think the tt object 
> wouldn't have been populated either.
> 
> Regards,
> Christian.
> 
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>>>> behavior flags for a bo. I'm hoping that this will clean things up 
>>>>> a bit.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>> *bo, bool evict,
>>>>>>>>       bool clear;
>>>>>>>>       int ret;
>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>           return 0;
>>>>>>>>       }
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 12:06                 ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 12:06 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 12:03, Christian König wrote:
> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>> [SNIP]
>>>>
>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
>>>> shmem ttm_tt and is calling ttm_bo_validate() with empty placements 
>>>> to force the pipeline-gutting path, which importantly unpopulates 
>>>> the ttm_tt for us (since ttm_tt_unpopulate is not exported it 
>>>> seems). But AFAICT it looks like that will now also nuke the 
>>>> bo->resource, instead of just leaving it in system memory. My 
>>>> assumption is that when later calling ttm_bo_validate(), it will 
>>>> just do the bo_move_null() in i915_ttm_move(), instead of 
>>>> re-populating the ttm_tt and then potentially copying it back to 
>>>> local-memory?
>>>
>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>> you? This should result in re-populating the tt object, but I'm not 
>>> 100% sure if that really works as expected.
>>
>> AFAIK for domains we either have system memory (which uses ttm_tt and 
>> might be shmem underneath) or local-memory. But perhaps i915 is doing 
>> something wrong here, or abusing TTM in some way. I'm not sure tbh.
>>
>> Anyway, I think we have two cases here:
>>
>> - We have some system memory only object. After doing 
>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>> ttm_bo_validate() at some later point, but here we don't need to copy 
>> anything, but it also looks like ttm_bo_handle_move_mem() won't 
>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. It 
>> looks like i915_ttm_move() was taking care of this, but now we just 
>> call ttm_bo_move_null().
>>
>> - We have a local-memory only object, which was evicted to shmem, and 
>> then swapped out by the shrinker like above. The bo->resource is NULL. 
>> However this time when calling ttm_bo_validate() we need to actually 
>> do a copy in i915_ttm_move(), as well as re-populate the ttm_tt. 
>> i915_ttm_move() was taking care of this, but now we just call 
>> ttm_bo_move_null().
>>
>> Perhaps i915 is doing something wrong in the above two cases?
> 
> Mhm, as far as I can see that should still work.
> 
> See previously you should got a transition from SYSTEM->GTT in 
> i915_ttm_move() to re-create your backing store. Not you get 
> NULL->SYSTEM which is handled by ttm_bo_move_null() and then SYSTEM->GTT.

What is GTT here in TTM world? Also I'm not seeing where there is this 
SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is only 
calling ttm_bo_validate() once when acquiring the pages, and we don't 
call it again, unless it was evicted (and potentially swapped out).

> 
> If you just validated to SYSTEM memory before I think the tt object 
> wouldn't have been populated either.
> 
> Regards,
> Christian.
> 
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> I've been considering to replacing the ttm_bo_type with a bunch of 
>>>>> behavior flags for a bo. I'm hoping that this will clean things up 
>>>>> a bit.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>> *bo, bool evict,
>>>>>>>>       bool clear;
>>>>>>>>       int ret;
>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>           return 0;
>>>>>>>>       }
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 12:06                 ` [Intel-gfx] " Matthew Auld
@ 2022-08-31 12:35                   ` Christian König
  -1 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31 12:35 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 14:06 schrieb Matthew Auld:
> On 31/08/2022 12:03, Christian König wrote:
>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>> [SNIP]
>>>>>
>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>> memory. My assumption is that when later calling 
>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>> potentially copying it back to local-memory?
>>>>
>>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>>> you? This should result in re-populating the tt object, but I'm not 
>>>> 100% sure if that really works as expected.
>>>
>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>> and might be shmem underneath) or local-memory. But perhaps i915 is 
>>> doing something wrong here, or abusing TTM in some way. I'm not sure 
>>> tbh.
>>>
>>> Anyway, I think we have two cases here:
>>>
>>> - We have some system memory only object. After doing 
>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>> ttm_bo_validate() at some later point, but here we don't need to 
>>> copy anything, but it also looks like ttm_bo_handle_move_mem() won't 
>>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. 
>>> It looks like i915_ttm_move() was taking care of this, but now we 
>>> just call ttm_bo_move_null().
>>>
>>> - We have a local-memory only object, which was evicted to shmem, 
>>> and then swapped out by the shrinker like above. The bo->resource is 
>>> NULL. However this time when calling ttm_bo_validate() we need to 
>>> actually do a copy in i915_ttm_move(), as well as re-populate the 
>>> ttm_tt. i915_ttm_move() was taking care of this, but now we just 
>>> call ttm_bo_move_null().
>>>
>>> Perhaps i915 is doing something wrong in the above two cases?
>>
>> Mhm, as far as I can see that should still work.
>>
>> See previously you should got a transition from SYSTEM->GTT in 
>> i915_ttm_move() to re-create your backing store. Not you get 
>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>> SYSTEM->GTT.
>
> What is GTT here in TTM world? Also I'm not seeing where there is this 
> SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is 
> only calling ttm_bo_validate() once when acquiring the pages, and we 
> don't call it again, unless it was evicted (and potentially swapped out).

Well GTT means TTM_PL_TT.

And calling it only once is perfectly fine, TTM will internally see that 
we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
transition and then SYSTEM->TT.

As far as I can see that should work like it did before.

Christian.

>
>>
>> If you just validated to SYSTEM memory before I think the tt object 
>> wouldn't have been populated either.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>> things up a bit.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>>> *bo, bool evict,
>>>>>>>>>       bool clear;
>>>>>>>>>       int ret;
>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>           return 0;
>>>>>>>>>       }
>>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 12:35                   ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31 12:35 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 14:06 schrieb Matthew Auld:
> On 31/08/2022 12:03, Christian König wrote:
>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>> [SNIP]
>>>>>
>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>> memory. My assumption is that when later calling 
>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>> potentially copying it back to local-memory?
>>>>
>>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>>> you? This should result in re-populating the tt object, but I'm not 
>>>> 100% sure if that really works as expected.
>>>
>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>> and might be shmem underneath) or local-memory. But perhaps i915 is 
>>> doing something wrong here, or abusing TTM in some way. I'm not sure 
>>> tbh.
>>>
>>> Anyway, I think we have two cases here:
>>>
>>> - We have some system memory only object. After doing 
>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>> ttm_bo_validate() at some later point, but here we don't need to 
>>> copy anything, but it also looks like ttm_bo_handle_move_mem() won't 
>>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. 
>>> It looks like i915_ttm_move() was taking care of this, but now we 
>>> just call ttm_bo_move_null().
>>>
>>> - We have a local-memory only object, which was evicted to shmem, 
>>> and then swapped out by the shrinker like above. The bo->resource is 
>>> NULL. However this time when calling ttm_bo_validate() we need to 
>>> actually do a copy in i915_ttm_move(), as well as re-populate the 
>>> ttm_tt. i915_ttm_move() was taking care of this, but now we just 
>>> call ttm_bo_move_null().
>>>
>>> Perhaps i915 is doing something wrong in the above two cases?
>>
>> Mhm, as far as I can see that should still work.
>>
>> See previously you should got a transition from SYSTEM->GTT in 
>> i915_ttm_move() to re-create your backing store. Not you get 
>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>> SYSTEM->GTT.
>
> What is GTT here in TTM world? Also I'm not seeing where there is this 
> SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is 
> only calling ttm_bo_validate() once when acquiring the pages, and we 
> don't call it again, unless it was evicted (and potentially swapped out).

Well GTT means TTM_PL_TT.

And calling it only once is perfectly fine, TTM will internally see that 
we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
transition and then SYSTEM->TT.

As far as I can see that should work like it did before.

Christian.

>
>>
>> If you just validated to SYSTEM memory before I think the tt object 
>> wouldn't have been populated either.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>> things up a bit.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>>> *bo, bool evict,
>>>>>>>>>       bool clear;
>>>>>>>>>       int ret;
>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>           return 0;
>>>>>>>>>       }
>>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 12:35                   ` [Intel-gfx] " Christian König
@ 2022-08-31 12:50                     ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 12:50 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 13:35, Christian König wrote:
> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>> On 31/08/2022 12:03, Christian König wrote:
>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>> [SNIP]
>>>>>>
>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>> memory. My assumption is that when later calling 
>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>> potentially copying it back to local-memory?
>>>>>
>>>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>>>> you? This should result in re-populating the tt object, but I'm not 
>>>>> 100% sure if that really works as expected.
>>>>
>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>> and might be shmem underneath) or local-memory. But perhaps i915 is 
>>>> doing something wrong here, or abusing TTM in some way. I'm not sure 
>>>> tbh.
>>>>
>>>> Anyway, I think we have two cases here:
>>>>
>>>> - We have some system memory only object. After doing 
>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() won't 
>>>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. 
>>>> It looks like i915_ttm_move() was taking care of this, but now we 
>>>> just call ttm_bo_move_null().
>>>>
>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>> and then swapped out by the shrinker like above. The bo->resource is 
>>>> NULL. However this time when calling ttm_bo_validate() we need to 
>>>> actually do a copy in i915_ttm_move(), as well as re-populate the 
>>>> ttm_tt. i915_ttm_move() was taking care of this, but now we just 
>>>> call ttm_bo_move_null().
>>>>
>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>
>>> Mhm, as far as I can see that should still work.
>>>
>>> See previously you should got a transition from SYSTEM->GTT in 
>>> i915_ttm_move() to re-create your backing store. Not you get 
>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>> SYSTEM->GTT.
>>
>> What is GTT here in TTM world? Also I'm not seeing where there is this 
>> SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is 
>> only calling ttm_bo_validate() once when acquiring the pages, and we 
>> don't call it again, unless it was evicted (and potentially swapped out).
> 
> Well GTT means TTM_PL_TT.
> 
> And calling it only once is perfectly fine, TTM will internally see that 
> we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
> transition and then SYSTEM->TT.

Ah interesting, so that's what the multi-hop thing does. But AFAICT i915 
is not using either TTM_PL_TT or -EMULTIHOP.

Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
should you use one over the other?

> 
> As far as I can see that should work like it did before.
> 
> Christian.
> 
>>
>>>
>>> If you just validated to SYSTEM memory before I think the tt object 
>>> wouldn't have been populated either.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>> things up a bit.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>>>> *bo, bool evict,
>>>>>>>>>>       bool clear;
>>>>>>>>>>       int ret;
>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>           return 0;
>>>>>>>>>>       }
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 12:50                     ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 12:50 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 13:35, Christian König wrote:
> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>> On 31/08/2022 12:03, Christian König wrote:
>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>> [SNIP]
>>>>>>
>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>> memory. My assumption is that when later calling 
>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>> potentially copying it back to local-memory?
>>>>>
>>>>> Well you do ttm_bo_validate() with something like GTT domain, don't 
>>>>> you? This should result in re-populating the tt object, but I'm not 
>>>>> 100% sure if that really works as expected.
>>>>
>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>> and might be shmem underneath) or local-memory. But perhaps i915 is 
>>>> doing something wrong here, or abusing TTM in some way. I'm not sure 
>>>> tbh.
>>>>
>>>> Anyway, I think we have two cases here:
>>>>
>>>> - We have some system memory only object. After doing 
>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() won't 
>>>> populate the ttm_tt or us either, since mem_type == TTM_PL_SYSTEM. 
>>>> It looks like i915_ttm_move() was taking care of this, but now we 
>>>> just call ttm_bo_move_null().
>>>>
>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>> and then swapped out by the shrinker like above. The bo->resource is 
>>>> NULL. However this time when calling ttm_bo_validate() we need to 
>>>> actually do a copy in i915_ttm_move(), as well as re-populate the 
>>>> ttm_tt. i915_ttm_move() was taking care of this, but now we just 
>>>> call ttm_bo_move_null().
>>>>
>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>
>>> Mhm, as far as I can see that should still work.
>>>
>>> See previously you should got a transition from SYSTEM->GTT in 
>>> i915_ttm_move() to re-create your backing store. Not you get 
>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>> SYSTEM->GTT.
>>
>> What is GTT here in TTM world? Also I'm not seeing where there is this 
>> SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 is 
>> only calling ttm_bo_validate() once when acquiring the pages, and we 
>> don't call it again, unless it was evicted (and potentially swapped out).
> 
> Well GTT means TTM_PL_TT.
> 
> And calling it only once is perfectly fine, TTM will internally see that 
> we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
> transition and then SYSTEM->TT.

Ah interesting, so that's what the multi-hop thing does. But AFAICT i915 
is not using either TTM_PL_TT or -EMULTIHOP.

Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
should you use one over the other?

> 
> As far as I can see that should work like it did before.
> 
> Christian.
> 
>>
>>>
>>> If you just validated to SYSTEM memory before I think the tt object 
>>> wouldn't have been populated either.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>> things up a bit.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>>>>>>> *bo, bool evict,
>>>>>>>>>>       bool clear;
>>>>>>>>>>       int ret;
>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>           return 0;
>>>>>>>>>>       }
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 12:50                     ` [Intel-gfx] " Matthew Auld
@ 2022-08-31 13:34                       ` Christian König
  -1 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31 13:34 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 14:50 schrieb Matthew Auld:
> On 31/08/2022 13:35, Christian König wrote:
>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>> On 31/08/2022 12:03, Christian König wrote:
>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>> [SNIP]
>>>>>>>
>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>>> memory. My assumption is that when later calling 
>>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>>> potentially copying it back to local-memory?
>>>>>>
>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>> don't you? This should result in re-populating the tt object, but 
>>>>>> I'm not 100% sure if that really works as expected.
>>>>>
>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>> is doing something wrong here, or abusing TTM in some way. I'm not 
>>>>> sure tbh.
>>>>>
>>>>> Anyway, I think we have two cases here:
>>>>>
>>>>> - We have some system memory only object. After doing 
>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>> this, but now we just call ttm_bo_move_null().
>>>>>
>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>> just call ttm_bo_move_null().
>>>>>
>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>
>>>> Mhm, as far as I can see that should still work.
>>>>
>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>> SYSTEM->GTT.
>>>
>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 
>>> is only calling ttm_bo_validate() once when acquiring the pages, and 
>>> we don't call it again, unless it was evicted (and potentially 
>>> swapped out).
>>
>> Well GTT means TTM_PL_TT.
>>
>> And calling it only once is perfectly fine, TTM will internally see 
>> that we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
>> transition and then SYSTEM->TT.
>
> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
> i915 is not using either TTM_PL_TT or -EMULTIHOP.

Mhm, it could be that we then have a problem and the i915 driver only 
sees NULL->TT directly. But I really don't know the i915 driver code 
good enough to judge that.

Can you take a look at this and test it maybe?

>
> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
> should you use one over the other?

TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
the control over the backing store and can swapout/swapin as it wants it.

TTM_PL_TT means that the device is accessing the data (TT stands for 
translation table) and so TTM can't swap the backing store in/out.

TTM_PL_VRAM well that one is obvious.

Thanks,
Christian.

>
>>
>> As far as I can see that should work like it did before.
>>
>> Christian.
>>
>>>
>>>>
>>>> If you just validated to SYSTEM memory before I think the tt object 
>>>> wouldn't have been populated either.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>>> things up a bit.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>       bool clear;
>>>>>>>>>>>       int ret;
>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>           return 0;
>>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 13:34                       ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31 13:34 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 14:50 schrieb Matthew Auld:
> On 31/08/2022 13:35, Christian König wrote:
>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>> On 31/08/2022 12:03, Christian König wrote:
>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>> [SNIP]
>>>>>>>
>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>>> memory. My assumption is that when later calling 
>>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>>> potentially copying it back to local-memory?
>>>>>>
>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>> don't you? This should result in re-populating the tt object, but 
>>>>>> I'm not 100% sure if that really works as expected.
>>>>>
>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>> is doing something wrong here, or abusing TTM in some way. I'm not 
>>>>> sure tbh.
>>>>>
>>>>> Anyway, I think we have two cases here:
>>>>>
>>>>> - We have some system memory only object. After doing 
>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>> this, but now we just call ttm_bo_move_null().
>>>>>
>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>> just call ttm_bo_move_null().
>>>>>
>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>
>>>> Mhm, as far as I can see that should still work.
>>>>
>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>> SYSTEM->GTT.
>>>
>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 
>>> is only calling ttm_bo_validate() once when acquiring the pages, and 
>>> we don't call it again, unless it was evicted (and potentially 
>>> swapped out).
>>
>> Well GTT means TTM_PL_TT.
>>
>> And calling it only once is perfectly fine, TTM will internally see 
>> that we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
>> transition and then SYSTEM->TT.
>
> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
> i915 is not using either TTM_PL_TT or -EMULTIHOP.

Mhm, it could be that we then have a problem and the i915 driver only 
sees NULL->TT directly. But I really don't know the i915 driver code 
good enough to judge that.

Can you take a look at this and test it maybe?

>
> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
> should you use one over the other?

TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
the control over the backing store and can swapout/swapin as it wants it.

TTM_PL_TT means that the device is accessing the data (TT stands for 
translation table) and so TTM can't swap the backing store in/out.

TTM_PL_VRAM well that one is obvious.

Thanks,
Christian.

>
>>
>> As far as I can see that should work like it did before.
>>
>> Christian.
>>
>>>
>>>>
>>>> If you just validated to SYSTEM memory before I think the tt object 
>>>> wouldn't have been populated either.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>>> things up a bit.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>       bool clear;
>>>>>>>>>>>       int ret;
>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>           return 0;
>>>>>>>>>>>       }
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 13:34                       ` [Intel-gfx] " Christian König
@ 2022-08-31 14:53                         ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 14:53 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 14:34, Christian König wrote:
> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>> On 31/08/2022 13:35, Christian König wrote:
>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>> [SNIP]
>>>>>>>>
>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>>>> memory. My assumption is that when later calling 
>>>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>>>> potentially copying it back to local-memory?
>>>>>>>
>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>> don't you? This should result in re-populating the tt object, but 
>>>>>>> I'm not 100% sure if that really works as expected.
>>>>>>
>>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>>> is doing something wrong here, or abusing TTM in some way. I'm not 
>>>>>> sure tbh.
>>>>>>
>>>>>> Anyway, I think we have two cases here:
>>>>>>
>>>>>> - We have some system memory only object. After doing 
>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>>> this, but now we just call ttm_bo_move_null().
>>>>>>
>>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>>> just call ttm_bo_move_null().
>>>>>>
>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>
>>>>> Mhm, as far as I can see that should still work.
>>>>>
>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>> SYSTEM->GTT.
>>>>
>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 
>>>> is only calling ttm_bo_validate() once when acquiring the pages, and 
>>>> we don't call it again, unless it was evicted (and potentially 
>>>> swapped out).
>>>
>>> Well GTT means TTM_PL_TT.
>>>
>>> And calling it only once is perfectly fine, TTM will internally see 
>>> that we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
>>> transition and then SYSTEM->TT.
>>
>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
> 
> Mhm, it could be that we then have a problem and the i915 driver only 
> sees NULL->TT directly. But I really don't know the i915 driver code 
> good enough to judge that.
> 
> Can you take a look at this and test it maybe?

I'll grab a machine and try to see what is going on here.

> 
>>
>> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
>> should you use one over the other?
> 
> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
> the control over the backing store and can swapout/swapin as it wants it.
> 
> TTM_PL_TT means that the device is accessing the data (TT stands for 
> translation table) and so TTM can't swap the backing store in/out.
> 
> TTM_PL_VRAM well that one is obvious.

Thanks for the explanation. So it looks like i915 is using TTM_PL_SYSTEM 
even for device access it seems.

> 
> Thanks,
> Christian.
> 
>>
>>>
>>> As far as I can see that should work like it did before.
>>>
>>> Christian.
>>>
>>>>
>>>>>
>>>>> If you just validated to SYSTEM memory before I think the tt object 
>>>>> wouldn't have been populated either.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>>>> things up a bit.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>>       bool clear;
>>>>>>>>>>>>       int ret;
>>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>>           return 0;
>>>>>>>>>>>>       }
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 14:53                         ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 14:53 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 14:34, Christian König wrote:
> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>> On 31/08/2022 13:35, Christian König wrote:
>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>> [SNIP]
>>>>>>>>
>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>>> placements to force the pipeline-gutting path, which importantly 
>>>>>>>> unpopulates the ttm_tt for us (since ttm_tt_unpopulate is not 
>>>>>>>> exported it seems). But AFAICT it looks like that will now also 
>>>>>>>> nuke the bo->resource, instead of just leaving it in system 
>>>>>>>> memory. My assumption is that when later calling 
>>>>>>>> ttm_bo_validate(), it will just do the bo_move_null() in 
>>>>>>>> i915_ttm_move(), instead of re-populating the ttm_tt and then 
>>>>>>>> potentially copying it back to local-memory?
>>>>>>>
>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>> don't you? This should result in re-populating the tt object, but 
>>>>>>> I'm not 100% sure if that really works as expected.
>>>>>>
>>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>>> is doing something wrong here, or abusing TTM in some way. I'm not 
>>>>>> sure tbh.
>>>>>>
>>>>>> Anyway, I think we have two cases here:
>>>>>>
>>>>>> - We have some system memory only object. After doing 
>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>>> this, but now we just call ttm_bo_move_null().
>>>>>>
>>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>>> just call ttm_bo_move_null().
>>>>>>
>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>
>>>>> Mhm, as far as I can see that should still work.
>>>>>
>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>> SYSTEM->GTT.
>>>>
>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, i915 
>>>> is only calling ttm_bo_validate() once when acquiring the pages, and 
>>>> we don't call it again, unless it was evicted (and potentially 
>>>> swapped out).
>>>
>>> Well GTT means TTM_PL_TT.
>>>
>>> And calling it only once is perfectly fine, TTM will internally see 
>>> that we need two hops to reach TTM_PL_TT and so does the NULL->SYSTEM 
>>> transition and then SYSTEM->TT.
>>
>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
> 
> Mhm, it could be that we then have a problem and the i915 driver only 
> sees NULL->TT directly. But I really don't know the i915 driver code 
> good enough to judge that.
> 
> Can you take a look at this and test it maybe?

I'll grab a machine and try to see what is going on here.

> 
>>
>> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
>> should you use one over the other?
> 
> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
> the control over the backing store and can swapout/swapin as it wants it.
> 
> TTM_PL_TT means that the device is accessing the data (TT stands for 
> translation table) and so TTM can't swap the backing store in/out.
> 
> TTM_PL_VRAM well that one is obvious.

Thanks for the explanation. So it looks like i915 is using TTM_PL_SYSTEM 
even for device access it seems.

> 
> Thanks,
> Christian.
> 
>>
>>>
>>> As far as I can see that should work like it did before.
>>>
>>> Christian.
>>>
>>>>
>>>>>
>>>>> If you just validated to SYSTEM memory before I think the tt object 
>>>>> wouldn't have been populated either.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've been considering to replacing the ttm_bo_type with a bunch 
>>>>>>>>> of behavior flags for a bo. I'm hoping that this will clean 
>>>>>>>>> things up a bit.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>>       bool clear;
>>>>>>>>>>>>       int ret;
>>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>>           return 0;
>>>>>>>>>>>>       }
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 14:53                         ` [Intel-gfx] " Matthew Auld
@ 2022-08-31 16:32                           ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 16:32 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 15:53, Matthew Auld wrote:
> On 31/08/2022 14:34, Christian König wrote:
>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>> On 31/08/2022 13:35, Christian König wrote:
>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>> [SNIP]
>>>>>>>>>
>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>>>> placements to force the pipeline-gutting path, which 
>>>>>>>>> importantly unpopulates the ttm_tt for us (since 
>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>> looks like that will now also nuke the bo->resource, instead of 
>>>>>>>>> just leaving it in system memory. My assumption is that when 
>>>>>>>>> later calling ttm_bo_validate(), it will just do the 
>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating the 
>>>>>>>>> ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>
>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>
>>>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>>>> is doing something wrong here, or abusing TTM in some way. I'm 
>>>>>>> not sure tbh.
>>>>>>>
>>>>>>> Anyway, I think we have two cases here:
>>>>>>>
>>>>>>> - We have some system memory only object. After doing 
>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>>>> this, but now we just call ttm_bo_move_null().
>>>>>>>
>>>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>>>> just call ttm_bo_move_null().
>>>>>>>
>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>
>>>>>> Mhm, as far as I can see that should still work.
>>>>>>
>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>> SYSTEM->GTT.
>>>>>
>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>> potentially swapped out).
>>>>
>>>> Well GTT means TTM_PL_TT.
>>>>
>>>> And calling it only once is perfectly fine, TTM will internally see 
>>>> that we need two hops to reach TTM_PL_TT and so does the 
>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>
>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>
>> Mhm, it could be that we then have a problem and the i915 driver only 
>> sees NULL->TT directly. But I really don't know the i915 driver code 
>> good enough to judge that.
>>
>> Can you take a look at this and test it maybe?
> 
> I'll grab a machine and try to see what is going on here.

Well at least the issue with the firmware not loading looks to be fixed now.

So running some eviction + oom tests it looks it now does:

/* eviction kicks in */
i915_ttm_move(bo):  LMEM -> PL_SYSTEM

/* shrinker/oom kicks in at some point */
i915_ttm_shrink(bo):
     bo->resource = NULL, /* pipeline_gutting */
     shmem ttm_tt is unpopulated and pages are correctly swapped out

/* user touches the same object later */
i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()

So seems to incorrectly skip swapping it back in and then copy over to 
lmem. It just allocates directly in lmem.

And previously the last two steps would have been:

i915_ttm_shrink(bo):
     bo->resource = PL_SYSTEM, /* pipeline_gutting */
     shmem ttm_tt is unpopulated and pages are correctly swapped out

i915_ttm_move(bo):
     PL_SYSTEM -> LMEM,
     ttm_tt is repopulated and pages are copied over to lmem

> 
>>
>>>
>>> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
>>> should you use one over the other?
>>
>> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
>> the control over the backing store and can swapout/swapin as it wants it.
>>
>> TTM_PL_TT means that the device is accessing the data (TT stands for 
>> translation table) and so TTM can't swap the backing store in/out.
>>
>> TTM_PL_VRAM well that one is obvious.
> 
> Thanks for the explanation. So it looks like i915 is using TTM_PL_SYSTEM 
> even for device access it seems.
> 
>>
>> Thanks,
>> Christian.
>>
>>>
>>>>
>>>> As far as I can see that should work like it did before.
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> If you just validated to SYSTEM memory before I think the tt 
>>>>>> object wouldn't have been populated either.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've been considering to replacing the ttm_bo_type with a 
>>>>>>>>>> bunch of behavior flags for a bo. I'm hoping that this will 
>>>>>>>>>> clean things up a bit.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>>>       bool clear;
>>>>>>>>>>>>>       int ret;
>>>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>>>           return 0;
>>>>>>>>>>>>>       }
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-08-31 16:32                           ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-08-31 16:32 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 31/08/2022 15:53, Matthew Auld wrote:
> On 31/08/2022 14:34, Christian König wrote:
>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>> On 31/08/2022 13:35, Christian König wrote:
>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>> [SNIP]
>>>>>>>>>
>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is swapping 
>>>>>>>>> out shmem ttm_tt and is calling ttm_bo_validate() with empty 
>>>>>>>>> placements to force the pipeline-gutting path, which 
>>>>>>>>> importantly unpopulates the ttm_tt for us (since 
>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>> looks like that will now also nuke the bo->resource, instead of 
>>>>>>>>> just leaving it in system memory. My assumption is that when 
>>>>>>>>> later calling ttm_bo_validate(), it will just do the 
>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating the 
>>>>>>>>> ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>
>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>
>>>>>>> AFAIK for domains we either have system memory (which uses ttm_tt 
>>>>>>> and might be shmem underneath) or local-memory. But perhaps i915 
>>>>>>> is doing something wrong here, or abusing TTM in some way. I'm 
>>>>>>> not sure tbh.
>>>>>>>
>>>>>>> Anyway, I think we have two cases here:
>>>>>>>
>>>>>>> - We have some system memory only object. After doing 
>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>> ttm_bo_validate() at some later point, but here we don't need to 
>>>>>>> copy anything, but it also looks like ttm_bo_handle_move_mem() 
>>>>>>> won't populate the ttm_tt or us either, since mem_type == 
>>>>>>> TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking care of 
>>>>>>> this, but now we just call ttm_bo_move_null().
>>>>>>>
>>>>>>> - We have a local-memory only object, which was evicted to shmem, 
>>>>>>> and then swapped out by the shrinker like above. The bo->resource 
>>>>>>> is NULL. However this time when calling ttm_bo_validate() we need 
>>>>>>> to actually do a copy in i915_ttm_move(), as well as re-populate 
>>>>>>> the ttm_tt. i915_ttm_move() was taking care of this, but now we 
>>>>>>> just call ttm_bo_move_null().
>>>>>>>
>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>
>>>>>> Mhm, as far as I can see that should still work.
>>>>>>
>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>> SYSTEM->GTT.
>>>>>
>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>> potentially swapped out).
>>>>
>>>> Well GTT means TTM_PL_TT.
>>>>
>>>> And calling it only once is perfectly fine, TTM will internally see 
>>>> that we need two hops to reach TTM_PL_TT and so does the 
>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>
>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>
>> Mhm, it could be that we then have a problem and the i915 driver only 
>> sees NULL->TT directly. But I really don't know the i915 driver code 
>> good enough to judge that.
>>
>> Can you take a look at this and test it maybe?
> 
> I'll grab a machine and try to see what is going on here.

Well at least the issue with the firmware not loading looks to be fixed now.

So running some eviction + oom tests it looks it now does:

/* eviction kicks in */
i915_ttm_move(bo):  LMEM -> PL_SYSTEM

/* shrinker/oom kicks in at some point */
i915_ttm_shrink(bo):
     bo->resource = NULL, /* pipeline_gutting */
     shmem ttm_tt is unpopulated and pages are correctly swapped out

/* user touches the same object later */
i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()

So seems to incorrectly skip swapping it back in and then copy over to 
lmem. It just allocates directly in lmem.

And previously the last two steps would have been:

i915_ttm_shrink(bo):
     bo->resource = PL_SYSTEM, /* pipeline_gutting */
     shmem ttm_tt is unpopulated and pages are correctly swapped out

i915_ttm_move(bo):
     PL_SYSTEM -> LMEM,
     ttm_tt is repopulated and pages are copied over to lmem

> 
>>
>>>
>>> Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM? When 
>>> should you use one over the other?
>>
>> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM has 
>> the control over the backing store and can swapout/swapin as it wants it.
>>
>> TTM_PL_TT means that the device is accessing the data (TT stands for 
>> translation table) and so TTM can't swap the backing store in/out.
>>
>> TTM_PL_VRAM well that one is obvious.
> 
> Thanks for the explanation. So it looks like i915 is using TTM_PL_SYSTEM 
> even for device access it seems.
> 
>>
>> Thanks,
>> Christian.
>>
>>>
>>>>
>>>> As far as I can see that should work like it did before.
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> If you just validated to SYSTEM memory before I think the tt 
>>>>>> object wouldn't have been populated either.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've been considering to replacing the ttm_bo_type with a 
>>>>>>>>>> bunch of behavior flags for a bo. I'm hoping that this will 
>>>>>>>>>> clean things up a bit.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>>>>>>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>>>>>>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
>>>>>>>>>>>>> ttm_buffer_object *bo, bool evict,
>>>>>>>>>>>>>       bool clear;
>>>>>>>>>>>>>       int ret;
>>>>>>>>>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>>>>>>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>>>>>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>>>>>>>>>           return 0;
>>>>>>>>>>>>>       }
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 16:32                           ` [Intel-gfx] " Matthew Auld
@ 2022-09-01  8:00                             ` Christian König
  -1 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-09-01  8:00 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 18:32 schrieb Matthew Auld:
> On 31/08/2022 15:53, Matthew Auld wrote:
>> On 31/08/2022 14:34, Christian König wrote:
>>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is 
>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate() 
>>>>>>>>>> with empty placements to force the pipeline-gutting path, 
>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since 
>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>>> looks like that will now also nuke the bo->resource, instead 
>>>>>>>>>> of just leaving it in system memory. My assumption is that 
>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the 
>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating 
>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>
>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>
>>>>>>>> AFAIK for domains we either have system memory (which uses 
>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But 
>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in 
>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>
>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>
>>>>>>>> - We have some system memory only object. After doing 
>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>>> ttm_bo_validate() at some later point, but here we don't need 
>>>>>>>> to copy anything, but it also looks like 
>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt or us 
>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like 
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> - We have a local-memory only object, which was evicted to 
>>>>>>>> shmem, and then swapped out by the shrinker like above. The 
>>>>>>>> bo->resource is NULL. However this time when calling 
>>>>>>>> ttm_bo_validate() we need to actually do a copy in 
>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt. 
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>
>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>
>>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>>> SYSTEM->GTT.
>>>>>>
>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>>> potentially swapped out).
>>>>>
>>>>> Well GTT means TTM_PL_TT.
>>>>>
>>>>> And calling it only once is perfectly fine, TTM will internally 
>>>>> see that we need two hops to reach TTM_PL_TT and so does the 
>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>
>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>
>>> Mhm, it could be that we then have a problem and the i915 driver 
>>> only sees NULL->TT directly. But I really don't know the i915 driver 
>>> code good enough to judge that.
>>>
>>> Can you take a look at this and test it maybe?
>>
>> I'll grab a machine and try to see what is going on here.
>
> Well at least the issue with the firmware not loading looks to be 
> fixed now.
>
> So running some eviction + oom tests it looks it now does:
>
> /* eviction kicks in */
> i915_ttm_move(bo):  LMEM -> PL_SYSTEM
>
> /* shrinker/oom kicks in at some point */
> i915_ttm_shrink(bo):
>     bo->resource = NULL, /* pipeline_gutting */
>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> /* user touches the same object later */
> i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()
>
> So seems to incorrectly skip swapping it back in and then copy over to 
> lmem. It just allocates directly in lmem.
>
> And previously the last two steps would have been:
>
> i915_ttm_shrink(bo):
>     bo->resource = PL_SYSTEM, /* pipeline_gutting */
>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> i915_ttm_move(bo):
>     PL_SYSTEM -> LMEM,
>     ttm_tt is repopulated and pages are copied over to lmem

Mhm, crap. That indeed looks like it won't work.

How about changing the code like this:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index c420d1ab605b..1ee7d81ddcbc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
bool evict,
         bool clear;
         int ret;

-       if (GEM_WARN_ON(!obj) || !bo->resource) {
+       if (GEM_WARN_ON(!obj)) {
+               ttm_bo_move_null(bo, dst_mem);
+               return 0;
+       }
+
+       if (!bo->resource) {
+               if (dst_mem->mem_type != TTM_PL_SYSTEM) {
+                        hop->mem_type = TTM_PL_SYSTEM;
+                        hop->flags = TTM_PL_FLAG_TEMPORARY;
+                       return -EMULTIHOP;
+               }
                 ttm_bo_move_null(bo, dst_mem);
                 return 0;
         }

That should result in allocating a TTM_PL_SYSTEM resource first and then 
moving from that to the final destination.

Thanks,
Christian.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-09-01  8:00                             ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-09-01  8:00 UTC (permalink / raw)
  To: Matthew Auld, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

Am 31.08.22 um 18:32 schrieb Matthew Auld:
> On 31/08/2022 15:53, Matthew Auld wrote:
>> On 31/08/2022 14:34, Christian König wrote:
>>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is 
>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate() 
>>>>>>>>>> with empty placements to force the pipeline-gutting path, 
>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since 
>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>>> looks like that will now also nuke the bo->resource, instead 
>>>>>>>>>> of just leaving it in system memory. My assumption is that 
>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the 
>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating 
>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>
>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>
>>>>>>>> AFAIK for domains we either have system memory (which uses 
>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But 
>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in 
>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>
>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>
>>>>>>>> - We have some system memory only object. After doing 
>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>>> ttm_bo_validate() at some later point, but here we don't need 
>>>>>>>> to copy anything, but it also looks like 
>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt or us 
>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like 
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> - We have a local-memory only object, which was evicted to 
>>>>>>>> shmem, and then swapped out by the shrinker like above. The 
>>>>>>>> bo->resource is NULL. However this time when calling 
>>>>>>>> ttm_bo_validate() we need to actually do a copy in 
>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt. 
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>
>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>
>>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>>> SYSTEM->GTT.
>>>>>>
>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>>> potentially swapped out).
>>>>>
>>>>> Well GTT means TTM_PL_TT.
>>>>>
>>>>> And calling it only once is perfectly fine, TTM will internally 
>>>>> see that we need two hops to reach TTM_PL_TT and so does the 
>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>
>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>
>>> Mhm, it could be that we then have a problem and the i915 driver 
>>> only sees NULL->TT directly. But I really don't know the i915 driver 
>>> code good enough to judge that.
>>>
>>> Can you take a look at this and test it maybe?
>>
>> I'll grab a machine and try to see what is going on here.
>
> Well at least the issue with the firmware not loading looks to be 
> fixed now.
>
> So running some eviction + oom tests it looks it now does:
>
> /* eviction kicks in */
> i915_ttm_move(bo):  LMEM -> PL_SYSTEM
>
> /* shrinker/oom kicks in at some point */
> i915_ttm_shrink(bo):
>     bo->resource = NULL, /* pipeline_gutting */
>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> /* user touches the same object later */
> i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()
>
> So seems to incorrectly skip swapping it back in and then copy over to 
> lmem. It just allocates directly in lmem.
>
> And previously the last two steps would have been:
>
> i915_ttm_shrink(bo):
>     bo->resource = PL_SYSTEM, /* pipeline_gutting */
>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> i915_ttm_move(bo):
>     PL_SYSTEM -> LMEM,
>     ttm_tt is repopulated and pages are copied over to lmem

Mhm, crap. That indeed looks like it won't work.

How about changing the code like this:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index c420d1ab605b..1ee7d81ddcbc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
bool evict,
         bool clear;
         int ret;

-       if (GEM_WARN_ON(!obj) || !bo->resource) {
+       if (GEM_WARN_ON(!obj)) {
+               ttm_bo_move_null(bo, dst_mem);
+               return 0;
+       }
+
+       if (!bo->resource) {
+               if (dst_mem->mem_type != TTM_PL_SYSTEM) {
+                        hop->mem_type = TTM_PL_SYSTEM;
+                        hop->flags = TTM_PL_FLAG_TEMPORARY;
+                       return -EMULTIHOP;
+               }
                 ttm_bo_move_null(bo, dst_mem);
                 return 0;
         }

That should result in allocating a TTM_PL_SYSTEM resource first and then 
moving from that to the final destination.

Thanks,
Christian.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [1/3] drm/i915: audit bo->resource usage (rev2)
  2022-08-24 14:23 ` [Intel-gfx] " Luben Tuikov
                   ` (5 preceding siblings ...)
  (?)
@ 2022-09-01  8:43 ` Patchwork
  -1 siblings, 0 replies; 43+ messages in thread
From: Patchwork @ 2022-09-01  8:43 UTC (permalink / raw)
  To: Christian König; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915: audit bo->resource usage (rev2)
URL   : https://patchwork.freedesktop.org/series/107680/
State : failure

== Summary ==

Error: patch https://patchwork.freedesktop.org/api/1.0/series/107680/revisions/2/mbox/ not applied
Applying: drm/i915: audit bo->resource usage
error: git diff header lacks filename information when removing 1 leading pathname component (line 2)
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 drm/i915: audit bo->resource usage
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-09-01  8:00                             ` [Intel-gfx] " Christian König
@ 2022-09-01 12:52                               ` Matthew Auld
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-09-01 12:52 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 01/09/2022 09:00, Christian König wrote:
> Am 31.08.22 um 18:32 schrieb Matthew Auld:
>> On 31/08/2022 15:53, Matthew Auld wrote:
>>> On 31/08/2022 14:34, Christian König wrote:
>>>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is 
>>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate() 
>>>>>>>>>>> with empty placements to force the pipeline-gutting path, 
>>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since 
>>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>>>> looks like that will now also nuke the bo->resource, instead 
>>>>>>>>>>> of just leaving it in system memory. My assumption is that 
>>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the 
>>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating 
>>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>>
>>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>>
>>>>>>>>> AFAIK for domains we either have system memory (which uses 
>>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But 
>>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in 
>>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>>
>>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>>
>>>>>>>>> - We have some system memory only object. After doing 
>>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>>>> ttm_bo_validate() at some later point, but here we don't need 
>>>>>>>>> to copy anything, but it also looks like 
>>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt or us 
>>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> - We have a local-memory only object, which was evicted to 
>>>>>>>>> shmem, and then swapped out by the shrinker like above. The 
>>>>>>>>> bo->resource is NULL. However this time when calling 
>>>>>>>>> ttm_bo_validate() we need to actually do a copy in 
>>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt. 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>>
>>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>>
>>>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>>>> SYSTEM->GTT.
>>>>>>>
>>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>>>> potentially swapped out).
>>>>>>
>>>>>> Well GTT means TTM_PL_TT.
>>>>>>
>>>>>> And calling it only once is perfectly fine, TTM will internally 
>>>>>> see that we need two hops to reach TTM_PL_TT and so does the 
>>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>>
>>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>>
>>>> Mhm, it could be that we then have a problem and the i915 driver 
>>>> only sees NULL->TT directly. But I really don't know the i915 driver 
>>>> code good enough to judge that.
>>>>
>>>> Can you take a look at this and test it maybe?
>>>
>>> I'll grab a machine and try to see what is going on here.
>>
>> Well at least the issue with the firmware not loading looks to be 
>> fixed now.
>>
>> So running some eviction + oom tests it looks it now does:
>>
>> /* eviction kicks in */
>> i915_ttm_move(bo):  LMEM -> PL_SYSTEM
>>
>> /* shrinker/oom kicks in at some point */
>> i915_ttm_shrink(bo):
>>     bo->resource = NULL, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> /* user touches the same object later */
>> i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()
>>
>> So seems to incorrectly skip swapping it back in and then copy over to 
>> lmem. It just allocates directly in lmem.
>>
>> And previously the last two steps would have been:
>>
>> i915_ttm_shrink(bo):
>>     bo->resource = PL_SYSTEM, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> i915_ttm_move(bo):
>>     PL_SYSTEM -> LMEM,
>>     ttm_tt is repopulated and pages are copied over to lmem
> 
> Mhm, crap. That indeed looks like it won't work.
> 
> How about changing the code like this:
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index c420d1ab605b..1ee7d81ddcbc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
> bool evict,
>          bool clear;
>          int ret;
> 
> -       if (GEM_WARN_ON(!obj) || !bo->resource) {
> +       if (GEM_WARN_ON(!obj)) {
> +               ttm_bo_move_null(bo, dst_mem);
> +               return 0;
> +       }
> +
> +       if (!bo->resource) {
> +               if (dst_mem->mem_type != TTM_PL_SYSTEM) {
> +                        hop->mem_type = TTM_PL_SYSTEM;
> +                        hop->flags = TTM_PL_FLAG_TEMPORARY;
> +                       return -EMULTIHOP;
> +               }
>                  ttm_bo_move_null(bo, dst_mem);
>                  return 0;
>          }
> 
> That should result in allocating a TTM_PL_SYSTEM resource first and then 
> moving from that to the final destination.

Ok, I played around with this some more. The final diff looks like:
https://patchwork.freedesktop.org/patch/500786/?series=108027&rev=1

It looks like we had some more places where bo->resource == NULL didn't 
end well. It at least now seems to survive my local testing.

> 
> Thanks,
> Christian.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-09-01 12:52                               ` Matthew Auld
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Auld @ 2022-09-01 12:52 UTC (permalink / raw)
  To: Christian König, dri-devel, Intel Graphics, Thomas Hellström
  Cc: Luben Tuikov

On 01/09/2022 09:00, Christian König wrote:
> Am 31.08.22 um 18:32 schrieb Matthew Auld:
>> On 31/08/2022 15:53, Matthew Auld wrote:
>>> On 31/08/2022 14:34, Christian König wrote:
>>>> Am 31.08.22 um 14:50 schrieb Matthew Auld:
>>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>>> Am 31.08.22 um 14:06 schrieb Matthew Auld:
>>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>>> Am 31.08.22 um 12:37 schrieb Matthew Auld:
>>>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is 
>>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate() 
>>>>>>>>>>> with empty placements to force the pipeline-gutting path, 
>>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since 
>>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>>>> looks like that will now also nuke the bo->resource, instead 
>>>>>>>>>>> of just leaving it in system memory. My assumption is that 
>>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the 
>>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating 
>>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>>
>>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>>
>>>>>>>>> AFAIK for domains we either have system memory (which uses 
>>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But 
>>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in 
>>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>>
>>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>>
>>>>>>>>> - We have some system memory only object. After doing 
>>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>>>> ttm_bo_validate() at some later point, but here we don't need 
>>>>>>>>> to copy anything, but it also looks like 
>>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt or us 
>>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> - We have a local-memory only object, which was evicted to 
>>>>>>>>> shmem, and then swapped out by the shrinker like above. The 
>>>>>>>>> bo->resource is NULL. However this time when calling 
>>>>>>>>> ttm_bo_validate() we need to actually do a copy in 
>>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt. 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>>
>>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>>
>>>>>>>> See previously you should got a transition from SYSTEM->GTT in 
>>>>>>>> i915_ttm_move() to re-create your backing store. Not you get 
>>>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
>>>>>>>> SYSTEM->GTT.
>>>>>>>
>>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>>>> potentially swapped out).
>>>>>>
>>>>>> Well GTT means TTM_PL_TT.
>>>>>>
>>>>>> And calling it only once is perfectly fine, TTM will internally 
>>>>>> see that we need two hops to reach TTM_PL_TT and so does the 
>>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>>
>>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>>
>>>> Mhm, it could be that we then have a problem and the i915 driver 
>>>> only sees NULL->TT directly. But I really don't know the i915 driver 
>>>> code good enough to judge that.
>>>>
>>>> Can you take a look at this and test it maybe?
>>>
>>> I'll grab a machine and try to see what is going on here.
>>
>> Well at least the issue with the firmware not loading looks to be 
>> fixed now.
>>
>> So running some eviction + oom tests it looks it now does:
>>
>> /* eviction kicks in */
>> i915_ttm_move(bo):  LMEM -> PL_SYSTEM
>>
>> /* shrinker/oom kicks in at some point */
>> i915_ttm_shrink(bo):
>>     bo->resource = NULL, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> /* user touches the same object later */
>> i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()
>>
>> So seems to incorrectly skip swapping it back in and then copy over to 
>> lmem. It just allocates directly in lmem.
>>
>> And previously the last two steps would have been:
>>
>> i915_ttm_shrink(bo):
>>     bo->resource = PL_SYSTEM, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> i915_ttm_move(bo):
>>     PL_SYSTEM -> LMEM,
>>     ttm_tt is repopulated and pages are copied over to lmem
> 
> Mhm, crap. That indeed looks like it won't work.
> 
> How about changing the code like this:
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index c420d1ab605b..1ee7d81ddcbc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, 
> bool evict,
>          bool clear;
>          int ret;
> 
> -       if (GEM_WARN_ON(!obj) || !bo->resource) {
> +       if (GEM_WARN_ON(!obj)) {
> +               ttm_bo_move_null(bo, dst_mem);
> +               return 0;
> +       }
> +
> +       if (!bo->resource) {
> +               if (dst_mem->mem_type != TTM_PL_SYSTEM) {
> +                        hop->mem_type = TTM_PL_SYSTEM;
> +                        hop->flags = TTM_PL_FLAG_TEMPORARY;
> +                       return -EMULTIHOP;
> +               }
>                  ttm_bo_move_null(bo, dst_mem);
>                  return 0;
>          }
> 
> That should result in allocating a TTM_PL_SYSTEM resource first and then 
> moving from that to the final destination.

Ok, I played around with this some more. The final diff looks like:
https://patchwork.freedesktop.org/patch/500786/?series=108027&rev=1

It looks like we had some more places where bo->resource == NULL didn't 
end well. It at least now seems to survive my local testing.

> 
> Thanks,
> Christian.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] drm/i915: audit bo->resource usage
  2022-08-31 13:34                       ` [Intel-gfx] " Christian König
@ 2022-09-01 17:48                         ` Thomas Hellström
  -1 siblings, 0 replies; 43+ messages in thread
From: Thomas Hellström @ 2022-09-01 17:48 UTC (permalink / raw)
  To: Christian König, Matthew Auld, dri-devel, Intel Graphics
  Cc: Luben Tuikov

On Wed, 2022-08-31 at 15:34 +0200, Christian König wrote:
> Am 31.08.22 um 14:50 schrieb Matthew Auld:
> > On 31/08/2022 13:35, Christian König wrote:
> > > Am 31.08.22 um 14:06 schrieb Matthew Auld:
> > > > On 31/08/2022 12:03, Christian König wrote:
> > > > > Am 31.08.22 um 12:37 schrieb Matthew Auld:
> > > > > > [SNIP]
> > > > > > > > 
> > > > > > > > That hopefully just leaves i915_ttm_shrink(), which is
> > > > > > > > swapping 
> > > > > > > > out shmem ttm_tt and is calling ttm_bo_validate() with
> > > > > > > > empty 
> > > > > > > > placements to force the pipeline-gutting path, which
> > > > > > > > importantly 
> > > > > > > > unpopulates the ttm_tt for us (since ttm_tt_unpopulate
> > > > > > > > is not 
> > > > > > > > exported it seems). But AFAICT it looks like that will
> > > > > > > > now also 
> > > > > > > > nuke the bo->resource, instead of just leaving it in
> > > > > > > > system 
> > > > > > > > memory. My assumption is that when later calling 
> > > > > > > > ttm_bo_validate(), it will just do the bo_move_null()
> > > > > > > > in 
> > > > > > > > i915_ttm_move(), instead of re-populating the ttm_tt
> > > > > > > > and then 
> > > > > > > > potentially copying it back to local-memory?
> > > > > > > 
> > > > > > > Well you do ttm_bo_validate() with something like GTT
> > > > > > > domain, 
> > > > > > > don't you? This should result in re-populating the tt
> > > > > > > object, but 
> > > > > > > I'm not 100% sure if that really works as expected.
> > > > > > 
> > > > > > AFAIK for domains we either have system memory (which uses
> > > > > > ttm_tt 
> > > > > > and might be shmem underneath) or local-memory. But perhaps
> > > > > > i915 
> > > > > > is doing something wrong here, or abusing TTM in some way.
> > > > > > I'm not 
> > > > > > sure tbh.
> > > > > > 
> > > > > > Anyway, I think we have two cases here:
> > > > > > 
> > > > > > - We have some system memory only object. After doing 
> > > > > > i915_ttm_shrink(), bo->resource is now NULL. We then call 
> > > > > > ttm_bo_validate() at some later point, but here we don't
> > > > > > need to 
> > > > > > copy anything, but it also looks like
> > > > > > ttm_bo_handle_move_mem() 
> > > > > > won't populate the ttm_tt or us either, since mem_type == 
> > > > > > TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking
> > > > > > care of 
> > > > > > this, but now we just call ttm_bo_move_null().
> > > > > > 
> > > > > > - We have a local-memory only object, which was evicted to
> > > > > > shmem, 
> > > > > > and then swapped out by the shrinker like above. The bo-
> > > > > > >resource 
> > > > > > is NULL. However this time when calling ttm_bo_validate()
> > > > > > we need 
> > > > > > to actually do a copy in i915_ttm_move(), as well as re-
> > > > > > populate 
> > > > > > the ttm_tt. i915_ttm_move() was taking care of this, but
> > > > > > now we 
> > > > > > just call ttm_bo_move_null().
> > > > > > 
> > > > > > Perhaps i915 is doing something wrong in the above two
> > > > > > cases?
> > > > > 
> > > > > Mhm, as far as I can see that should still work.
> > > > > 
> > > > > See previously you should got a transition from SYSTEM->GTT
> > > > > in 
> > > > > i915_ttm_move() to re-create your backing store. Not you get 
> > > > > NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
> > > > > SYSTEM->GTT.
> > > > 
> > > > What is GTT here in TTM world? Also I'm not seeing where there
> > > > is 
> > > > this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear,
> > > > i915 
> > > > is only calling ttm_bo_validate() once when acquiring the
> > > > pages, and 
> > > > we don't call it again, unless it was evicted (and potentially 
> > > > swapped out).
> > > 
> > > Well GTT means TTM_PL_TT.
> > > 
> > > And calling it only once is perfectly fine, TTM will internally
> > > see 
> > > that we need two hops to reach TTM_PL_TT and so does the NULL-
> > > >SYSTEM 
> > > transition and then SYSTEM->TT.
> > 
> > Ah interesting, so that's what the multi-hop thing does. But AFAICT
> > i915 is not using either TTM_PL_TT or -EMULTIHOP.
> 
> Mhm, it could be that we then have a problem and the i915 driver only
> sees NULL->TT directly. But I really don't know the i915 driver code 
> good enough to judge that.
> 
> Can you take a look at this and test it maybe?
> 
> > 
> > Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM?
> > When 
> > should you use one over the other?
> 
> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM
> has 
> the control over the backing store and can swapout/swapin as it wants
> it.
> 
> TTM_PL_TT means that the device is accessing the data (TT stands for 
> translation table) and so TTM can't swap the backing store in/out.
> 
> TTM_PL_VRAM well that one is obvious.
> 
> Thanks,
> Christian.

We've had a previous long discussions on this listing pros and cons,
and IIRC concluded that either binding to the device from system needed
some TTM fixes and documentation to be straightforward, or a driver
should use the above scheme bouncing in TT. IIRC we concluded that the
best thing for i915 would be to transition and introduce a dummy TT
region and obey the scheme outlined above by Christian.
We unfortunately never gotten around to that, though, due to other work
got prioritized. Also need to solve the limbo (not populated) -> vram
transition without populating when moving to TT....

Originaly TT was intended for GGTT-like and AGP apertures that needed
cpu-mapping to the PCI address. Using it like Christian outlines helps
avoid special casing for swapout. Devices that bind to System memory
needs the swap notifier to unbind.

/Thomas



> 
> > 
> > > 
> > > As far as I can see that should work like it did before.
> > > 
> > > Christian.
> > > 
> > > > 
> > > > > 
> > > > > If you just validated to SYSTEM memory before I think the tt
> > > > > object 
> > > > > wouldn't have been populated either.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I've been considering to replacing the ttm_bo_type
> > > > > > > > > with a bunch 
> > > > > > > > > of behavior flags for a bo. I'm hoping that this will
> > > > > > > > > clean 
> > > > > > > > > things up a bit.
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > >       caching =
> > > > > > > > > > > > i915_ttm_select_tt_caching(obj);
> > > > > > > > > > > > diff --git
> > > > > > > > > > > > a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
> > > > > > > > > > > > b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > index 9a7e50534b84bb..c420d1ab605b6f 100644
> > > > > > > > > > > > ---
> > > > > > > > > > > > a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > +++
> > > > > > > > > > > > b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
> > > > > > > > > > > > ttm_buffer_object *bo, bool evict,
> > > > > > > > > > > >       bool clear;
> > > > > > > > > > > >       int ret;
> > > > > > > > > > > > -    if (GEM_WARN_ON(!obj)) {
> > > > > > > > > > > > +    if (GEM_WARN_ON(!obj) || !bo->resource) {
> > > > > > > > > > > >           ttm_bo_move_null(bo, dst_mem);
> > > > > > > > > > > >           return 0;
> > > > > > > > > > > >       }
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
@ 2022-09-01 17:48                         ` Thomas Hellström
  0 siblings, 0 replies; 43+ messages in thread
From: Thomas Hellström @ 2022-09-01 17:48 UTC (permalink / raw)
  To: Christian König, Matthew Auld, dri-devel, Intel Graphics
  Cc: Luben Tuikov

On Wed, 2022-08-31 at 15:34 +0200, Christian König wrote:
> Am 31.08.22 um 14:50 schrieb Matthew Auld:
> > On 31/08/2022 13:35, Christian König wrote:
> > > Am 31.08.22 um 14:06 schrieb Matthew Auld:
> > > > On 31/08/2022 12:03, Christian König wrote:
> > > > > Am 31.08.22 um 12:37 schrieb Matthew Auld:
> > > > > > [SNIP]
> > > > > > > > 
> > > > > > > > That hopefully just leaves i915_ttm_shrink(), which is
> > > > > > > > swapping 
> > > > > > > > out shmem ttm_tt and is calling ttm_bo_validate() with
> > > > > > > > empty 
> > > > > > > > placements to force the pipeline-gutting path, which
> > > > > > > > importantly 
> > > > > > > > unpopulates the ttm_tt for us (since ttm_tt_unpopulate
> > > > > > > > is not 
> > > > > > > > exported it seems). But AFAICT it looks like that will
> > > > > > > > now also 
> > > > > > > > nuke the bo->resource, instead of just leaving it in
> > > > > > > > system 
> > > > > > > > memory. My assumption is that when later calling 
> > > > > > > > ttm_bo_validate(), it will just do the bo_move_null()
> > > > > > > > in 
> > > > > > > > i915_ttm_move(), instead of re-populating the ttm_tt
> > > > > > > > and then 
> > > > > > > > potentially copying it back to local-memory?
> > > > > > > 
> > > > > > > Well you do ttm_bo_validate() with something like GTT
> > > > > > > domain, 
> > > > > > > don't you? This should result in re-populating the tt
> > > > > > > object, but 
> > > > > > > I'm not 100% sure if that really works as expected.
> > > > > > 
> > > > > > AFAIK for domains we either have system memory (which uses
> > > > > > ttm_tt 
> > > > > > and might be shmem underneath) or local-memory. But perhaps
> > > > > > i915 
> > > > > > is doing something wrong here, or abusing TTM in some way.
> > > > > > I'm not 
> > > > > > sure tbh.
> > > > > > 
> > > > > > Anyway, I think we have two cases here:
> > > > > > 
> > > > > > - We have some system memory only object. After doing 
> > > > > > i915_ttm_shrink(), bo->resource is now NULL. We then call 
> > > > > > ttm_bo_validate() at some later point, but here we don't
> > > > > > need to 
> > > > > > copy anything, but it also looks like
> > > > > > ttm_bo_handle_move_mem() 
> > > > > > won't populate the ttm_tt or us either, since mem_type == 
> > > > > > TTM_PL_SYSTEM. It looks like i915_ttm_move() was taking
> > > > > > care of 
> > > > > > this, but now we just call ttm_bo_move_null().
> > > > > > 
> > > > > > - We have a local-memory only object, which was evicted to
> > > > > > shmem, 
> > > > > > and then swapped out by the shrinker like above. The bo-
> > > > > > >resource 
> > > > > > is NULL. However this time when calling ttm_bo_validate()
> > > > > > we need 
> > > > > > to actually do a copy in i915_ttm_move(), as well as re-
> > > > > > populate 
> > > > > > the ttm_tt. i915_ttm_move() was taking care of this, but
> > > > > > now we 
> > > > > > just call ttm_bo_move_null().
> > > > > > 
> > > > > > Perhaps i915 is doing something wrong in the above two
> > > > > > cases?
> > > > > 
> > > > > Mhm, as far as I can see that should still work.
> > > > > 
> > > > > See previously you should got a transition from SYSTEM->GTT
> > > > > in 
> > > > > i915_ttm_move() to re-create your backing store. Not you get 
> > > > > NULL->SYSTEM which is handled by ttm_bo_move_null() and then 
> > > > > SYSTEM->GTT.
> > > > 
> > > > What is GTT here in TTM world? Also I'm not seeing where there
> > > > is 
> > > > this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear,
> > > > i915 
> > > > is only calling ttm_bo_validate() once when acquiring the
> > > > pages, and 
> > > > we don't call it again, unless it was evicted (and potentially 
> > > > swapped out).
> > > 
> > > Well GTT means TTM_PL_TT.
> > > 
> > > And calling it only once is perfectly fine, TTM will internally
> > > see 
> > > that we need two hops to reach TTM_PL_TT and so does the NULL-
> > > >SYSTEM 
> > > transition and then SYSTEM->TT.
> > 
> > Ah interesting, so that's what the multi-hop thing does. But AFAICT
> > i915 is not using either TTM_PL_TT or -EMULTIHOP.
> 
> Mhm, it could be that we then have a problem and the i915 driver only
> sees NULL->TT directly. But I really don't know the i915 driver code 
> good enough to judge that.
> 
> Can you take a look at this and test it maybe?
> 
> > 
> > Also what is the difference between TTM_PL_TT and TM_PL_SYSTEM?
> > When 
> > should you use one over the other?
> 
> TTM_PL_SYSTEM means the device is not accessing the buffer and TTM
> has 
> the control over the backing store and can swapout/swapin as it wants
> it.
> 
> TTM_PL_TT means that the device is accessing the data (TT stands for 
> translation table) and so TTM can't swap the backing store in/out.
> 
> TTM_PL_VRAM well that one is obvious.
> 
> Thanks,
> Christian.

We've had a previous long discussions on this listing pros and cons,
and IIRC concluded that either binding to the device from system needed
some TTM fixes and documentation to be straightforward, or a driver
should use the above scheme bouncing in TT. IIRC we concluded that the
best thing for i915 would be to transition and introduce a dummy TT
region and obey the scheme outlined above by Christian.
We unfortunately never gotten around to that, though, due to other work
got prioritized. Also need to solve the limbo (not populated) -> vram
transition without populating when moving to TT....

Originaly TT was intended for GGTT-like and AGP apertures that needed
cpu-mapping to the PCI address. Using it like Christian outlines helps
avoid special casing for swapout. Devices that bind to System memory
needs the swap notifier to unbind.

/Thomas



> 
> > 
> > > 
> > > As far as I can see that should work like it did before.
> > > 
> > > Christian.
> > > 
> > > > 
> > > > > 
> > > > > If you just validated to SYSTEM memory before I think the tt
> > > > > object 
> > > > > wouldn't have been populated either.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I've been considering to replacing the ttm_bo_type
> > > > > > > > > with a bunch 
> > > > > > > > > of behavior flags for a bo. I'm hoping that this will
> > > > > > > > > clean 
> > > > > > > > > things up a bit.
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > >       caching =
> > > > > > > > > > > > i915_ttm_select_tt_caching(obj);
> > > > > > > > > > > > diff --git
> > > > > > > > > > > > a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
> > > > > > > > > > > > b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > index 9a7e50534b84bb..c420d1ab605b6f 100644
> > > > > > > > > > > > ---
> > > > > > > > > > > > a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > +++
> > > > > > > > > > > > b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> > > > > > > > > > > > @@ -560,7 +560,7 @@ int i915_ttm_move(struct 
> > > > > > > > > > > > ttm_buffer_object *bo, bool evict,
> > > > > > > > > > > >       bool clear;
> > > > > > > > > > > >       int ret;
> > > > > > > > > > > > -    if (GEM_WARN_ON(!obj)) {
> > > > > > > > > > > > +    if (GEM_WARN_ON(!obj) || !bo->resource) {
> > > > > > > > > > > >           ttm_bo_move_null(bo, dst_mem);
> > > > > > > > > > > >           return 0;
> > > > > > > > > > > >       }
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > 
> > > > > 
> > > 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation
  2022-08-31  9:34 [PATCH 1/3] drm/i915: audit bo->resource usage v2 Christian König
@ 2022-08-31  9:34 ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-08-31  9:34 UTC (permalink / raw)
  To: matthew.auld, dri-devel, intel-gfx; +Cc: Michael J . Ruhl, Christian König

That should not be necessary any more when drivers should at least be
able to handle the move without a resource.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index f066e8124c50..d7ed891d3887 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -959,7 +959,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 			 struct sg_table *sg, struct dma_resv *resv,
 			 void (*destroy) (struct ttm_buffer_object *))
 {
-	static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM };
 	int ret;
 
 	kref_init(&bo->kref);
@@ -977,12 +976,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 		bo->base.resv = &bo->base._resv;
 	atomic_inc(&ttm_glob.bo_count);
 
-	ret = ttm_resource_alloc(bo, &sys_mem, &bo->resource);
-	if (unlikely(ret)) {
-		ttm_bo_put(bo);
-		return ret;
-	}
-
 	/*
 	 * For ttm_bo_type_device buffers, allocate
 	 * address space from the device.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation
  2022-07-12 11:46 [PATCH 1/3] drm/i915: audit bo->resource usage Christian König
@ 2022-07-12 11:46 ` Christian König
  0 siblings, 0 replies; 43+ messages in thread
From: Christian König @ 2022-07-12 11:46 UTC (permalink / raw)
  To: intel-gfx, dri-devel; +Cc: Michael J . Ruhl, Christian König

That should not be necessary any more when drivers should at least be
able to handle the move without a resource.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index c1bd006a5525..a95826be8048 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -960,7 +960,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 			 struct sg_table *sg, struct dma_resv *resv,
 			 void (*destroy) (struct ttm_buffer_object *))
 {
-	static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM };
 	int ret;
 
 	kref_init(&bo->kref);
@@ -978,12 +977,6 @@ int ttm_bo_init_reserved(struct ttm_device *bdev, struct ttm_buffer_object *bo,
 		bo->base.resv = &bo->base._resv;
 	atomic_inc(&ttm_glob.bo_count);
 
-	ret = ttm_resource_alloc(bo, &sys_mem, &bo->resource);
-	if (unlikely(ret)) {
-		ttm_bo_put(bo);
-		return ret;
-	}
-
 	/*
 	 * For ttm_bo_type_device buffers, allocate
 	 * address space from the device.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-09-01 17:48 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-24 14:23 [PATCH 1/3] drm/i915: audit bo->resource usage Luben Tuikov
2022-08-24 14:23 ` [Intel-gfx] " Luben Tuikov
2022-08-24 14:23 ` [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation Luben Tuikov
2022-08-24 14:23   ` [Intel-gfx] " Luben Tuikov
2022-08-24 14:23 ` [PATCH 3/3] drm/ttm: stop allocating a dummy resource for pipelined gutting Luben Tuikov
2022-08-24 14:23   ` [Intel-gfx] " Luben Tuikov
2022-08-24 16:21 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for series starting with [1/3] drm/i915: audit bo->resource usage Patchwork
2022-08-24 16:44 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2022-08-30  7:33 ` [PATCH 1/3] " Christian König
2022-08-30  7:33   ` [Intel-gfx] " Christian König
2022-08-30 10:45   ` Matthew Auld
2022-08-30 10:45     ` [Intel-gfx] " Matthew Auld
2022-08-31  8:16     ` Christian König
2022-08-31  8:16       ` [Intel-gfx] " Christian König
2022-08-31  9:26       ` Matthew Auld
2022-08-31  9:26         ` [Intel-gfx] " Matthew Auld
2022-08-31  9:38         ` Christian König
2022-08-31  9:38           ` [Intel-gfx] " Christian König
2022-08-31 10:37           ` Matthew Auld
2022-08-31 10:37             ` [Intel-gfx] " Matthew Auld
2022-08-31 11:03             ` Christian König
2022-08-31 11:03               ` [Intel-gfx] " Christian König
2022-08-31 12:06               ` Matthew Auld
2022-08-31 12:06                 ` [Intel-gfx] " Matthew Auld
2022-08-31 12:35                 ` Christian König
2022-08-31 12:35                   ` [Intel-gfx] " Christian König
2022-08-31 12:50                   ` Matthew Auld
2022-08-31 12:50                     ` [Intel-gfx] " Matthew Auld
2022-08-31 13:34                     ` Christian König
2022-08-31 13:34                       ` [Intel-gfx] " Christian König
2022-08-31 14:53                       ` Matthew Auld
2022-08-31 14:53                         ` [Intel-gfx] " Matthew Auld
2022-08-31 16:32                         ` Matthew Auld
2022-08-31 16:32                           ` [Intel-gfx] " Matthew Auld
2022-09-01  8:00                           ` Christian König
2022-09-01  8:00                             ` [Intel-gfx] " Christian König
2022-09-01 12:52                             ` Matthew Auld
2022-09-01 12:52                               ` [Intel-gfx] " Matthew Auld
2022-09-01 17:48                       ` Thomas Hellström
2022-09-01 17:48                         ` [Intel-gfx] " Thomas Hellström
2022-09-01  8:43 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for series starting with [1/3] drm/i915: audit bo->resource usage (rev2) Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2022-08-31  9:34 [PATCH 1/3] drm/i915: audit bo->resource usage v2 Christian König
2022-08-31  9:34 ` [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation Christian König
2022-07-12 11:46 [PATCH 1/3] drm/i915: audit bo->resource usage Christian König
2022-07-12 11:46 ` [PATCH 2/3] drm/ttm: stop allocating dummy resources during BO creation Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.