* [RFC 0/2] drm/i915/ttm: Evict and store of compressed object
@ 2022-02-07  9:37 ` Ramalingam C
  0 siblings, 0 replies; 29+ messages in thread
From: Ramalingam C @ 2022-02-07  9:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: Hellstrom Thomas, Christian Koenig

On flat-ccs capable platforms we need to evict and restore the ccs data
along with the corresponding main memory.

This ccs data can only be accessed through the BLT engine, via a special
cmd (XY_CTRL_SURF_COPY_BLT).

To support this requirement on flat-ccs enabled i915 platforms, this
series adds a new parameter, ccs_pages_needed, to ttm_tt_init(), which
increases ttm_tt->num_pages of system memory when the object has a
possible lmem placement.

This will be on top of the flat-ccs enabling series
https://patchwork.freedesktop.org/series/95686/

For more about the flat-ccs feature please see
https://patchwork.freedesktop.org/patch/471777/?series=95686&rev=5

Testing of the series is WIP; we are looking forward to early review of
the amendment to ttm_tt_init() and of the overall approach.
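
As a rough standalone illustration of the extra-page calculation this series
performs (a sketch only, assuming 4 KiB pages and the 1:256 CCS ratio;
ccs_pages_needed() here is a hypothetical helper for illustration, not the
kernel function):

```c
#include <assert.h>

/* Illustrative constants; the kernel derives these elsewhere. */
#define PAGE_SIZE		4096UL
#define NUM_CCS_BYTES_PER_BLOCK	256UL	/* 1 CCS byte covers 256 main-memory bytes */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Extra system-memory pages needed to hold the CCS data of an object
 * of `size` bytes: size/256 bytes of CCS, rounded up to whole pages. */
unsigned long ccs_pages_needed(unsigned long size)
{
	return DIV_ROUND_UP(DIV_ROUND_UP(size, NUM_CCS_BYTES_PER_BLOCK),
			    PAGE_SIZE);
}
```

This mirrors the DIV_ROUND_UP(DIV_ROUND_UP(bo->base.size, 256), PAGE_SIZE)
expression in patch 1: a 1 MiB object needs 4 KiB of CCS, i.e. one extra page.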

Ramalingam C (2):
  drm/i915/ttm: Add extra pages for handling ccs data
  drm/i915/migrate: Evict and restore the ccs data

 drivers/gpu/drm/drm_gem_vram_helper.c      |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c    |  23 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c    | 283 +++++++++++----------
 drivers/gpu/drm/qxl/qxl_ttm.c              |   2 +-
 drivers/gpu/drm/ttm/ttm_agp_backend.c      |   2 +-
 drivers/gpu/drm/ttm/ttm_tt.c               |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |   2 +-
 include/drm/ttm/ttm_tt.h                   |   4 +-
 8 files changed, 191 insertions(+), 139 deletions(-)

-- 
2.20.1




* [RFC 1/2] drm/i915/ttm: Add extra pages for handling ccs data
  2022-02-07  9:37 ` [Intel-gfx] " Ramalingam C
@ 2022-02-07  9:37   ` Ramalingam C
  -1 siblings, 0 replies; 29+ messages in thread
From: Ramalingam C @ 2022-02-07  9:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: Hellstrom Thomas, Christian Koenig

While evicting local memory on a flat-ccs capable platform, we also need
to evict the ccs data associated with it. For this, we add extra pages
(DIV_ROUND_UP(size / 256, PAGE_SIZE)) into the ttm_tt.

To achieve this, a new parameter, ccs_pages_needed, is added to
ttm_tt_init(); it is added into ttm_tt->num_pages.

Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
Suggested-by: Thomas Hellstrom <thomas.hellstrom@intel.com>
---
 drivers/gpu/drm/drm_gem_vram_helper.c      |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c    | 23 +++++++++++++++++++++-
 drivers/gpu/drm/qxl/qxl_ttm.c              |  2 +-
 drivers/gpu/drm/ttm/ttm_agp_backend.c      |  2 +-
 drivers/gpu/drm/ttm/ttm_tt.c               | 12 ++++++-----
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |  2 +-
 include/drm/ttm/ttm_tt.h                   |  4 +++-
 7 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_vram_helper.c b/drivers/gpu/drm/drm_gem_vram_helper.c
index 3f00192215d1..eef1f4dc7232 100644
--- a/drivers/gpu/drm/drm_gem_vram_helper.c
+++ b/drivers/gpu/drm/drm_gem_vram_helper.c
@@ -864,7 +864,7 @@ static struct ttm_tt *bo_driver_ttm_tt_create(struct ttm_buffer_object *bo,
 	if (!tt)
 		return NULL;
 
-	ret = ttm_tt_init(tt, bo, page_flags, ttm_cached);
+	ret = ttm_tt_init(tt, bo, page_flags, ttm_cached, 0);
 	if (ret < 0)
 		goto err_ttm_tt_init;
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index 84cae740b4a5..bb71aa6d66c0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -16,6 +16,7 @@
 #include "gem/i915_gem_ttm.h"
 #include "gem/i915_gem_ttm_move.h"
 #include "gem/i915_gem_ttm_pm.h"
+#include "gt/intel_gpu_commands.h"
 
 #define I915_TTM_PRIO_PURGE     0
 #define I915_TTM_PRIO_NO_PAGES  1
@@ -242,12 +243,27 @@ static const struct i915_refct_sgt_ops tt_rsgt_ops = {
 	.release = i915_ttm_tt_release
 };
 
+static inline bool
+i915_gem_object_has_lmem_placement(struct drm_i915_gem_object *obj)
+{
+	int i;
+
+	for (i = 0; i < obj->mm.n_placements; i++)
+		if (obj->mm.placements[i]->type == INTEL_MEMORY_LOCAL)
+			return true;
+
+	return false;
+}
+
 static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 					 uint32_t page_flags)
 {
+	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
+						     bdev);
 	struct ttm_resource_manager *man =
 		ttm_manager_type(bo->bdev, bo->resource->mem_type);
 	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
+	unsigned long ccs_pages_needed = 0;
 	enum ttm_caching caching;
 	struct i915_ttm_tt *i915_tt;
 	int ret;
@@ -270,7 +286,12 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
 		i915_tt->is_shmem = true;
 	}
 
-	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags, caching);
+	if (HAS_FLAT_CCS(i915) && i915_gem_object_has_lmem_placement(obj))
+		ccs_pages_needed = DIV_ROUND_UP(DIV_ROUND_UP(bo->base.size,
+					       NUM_CCS_BYTES_PER_BLOCK), PAGE_SIZE);
+
+	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags,
+			  caching, ccs_pages_needed);
 	if (ret)
 		goto err_free;
 
diff --git a/drivers/gpu/drm/qxl/qxl_ttm.c b/drivers/gpu/drm/qxl/qxl_ttm.c
index b2e33d5ba5d0..52156b54498f 100644
--- a/drivers/gpu/drm/qxl/qxl_ttm.c
+++ b/drivers/gpu/drm/qxl/qxl_ttm.c
@@ -113,7 +113,7 @@ static struct ttm_tt *qxl_ttm_tt_create(struct ttm_buffer_object *bo,
 	ttm = kzalloc(sizeof(struct ttm_tt), GFP_KERNEL);
 	if (ttm == NULL)
 		return NULL;
-	if (ttm_tt_init(ttm, bo, page_flags, ttm_cached)) {
+	if (ttm_tt_init(ttm, bo, page_flags, ttm_cached, 0)) {
 		kfree(ttm);
 		return NULL;
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_agp_backend.c b/drivers/gpu/drm/ttm/ttm_agp_backend.c
index 6ddc16f0fe2b..d27691f2e451 100644
--- a/drivers/gpu/drm/ttm/ttm_agp_backend.c
+++ b/drivers/gpu/drm/ttm/ttm_agp_backend.c
@@ -134,7 +134,7 @@ struct ttm_tt *ttm_agp_tt_create(struct ttm_buffer_object *bo,
 	agp_be->mem = NULL;
 	agp_be->bridge = bridge;
 
-	if (ttm_tt_init(&agp_be->ttm, bo, page_flags, ttm_write_combined)) {
+	if (ttm_tt_init(&agp_be->ttm, bo, page_flags, ttm_write_combined, 0)) {
 		kfree(agp_be);
 		return NULL;
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 79c870a3bef8..80355465f717 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -134,9 +134,10 @@ void ttm_tt_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
 static void ttm_tt_init_fields(struct ttm_tt *ttm,
 			       struct ttm_buffer_object *bo,
 			       uint32_t page_flags,
-			       enum ttm_caching caching)
+			       enum ttm_caching caching,
+			       unsigned long ccs_pages)
 {
-	ttm->num_pages = PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT;
+	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + ccs_pages;
 	ttm->caching = ttm_cached;
 	ttm->page_flags = page_flags;
 	ttm->dma_address = NULL;
@@ -146,9 +147,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
 }
 
 int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
-		uint32_t page_flags, enum ttm_caching caching)
+		uint32_t page_flags, enum ttm_caching caching,
+		unsigned long ccs_pages)
 {
-	ttm_tt_init_fields(ttm, bo, page_flags, caching);
+	ttm_tt_init_fields(ttm, bo, page_flags, caching, ccs_pages);
 
 	if (ttm_tt_alloc_page_directory(ttm)) {
 		pr_err("Failed allocating page table\n");
@@ -180,7 +182,7 @@ int ttm_sg_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
 {
 	int ret;
 
-	ttm_tt_init_fields(ttm, bo, page_flags, caching);
+	ttm_tt_init_fields(ttm, bo, page_flags, caching, 0);
 
 	if (page_flags & TTM_TT_FLAG_EXTERNAL)
 		ret = ttm_sg_tt_alloc_page_directory(ttm);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index b84ecc6d6611..4e3938e62c08 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -517,7 +517,7 @@ static struct ttm_tt *vmw_ttm_tt_create(struct ttm_buffer_object *bo,
 				     ttm_cached);
 	else
 		ret = ttm_tt_init(&vmw_be->dma_ttm, bo, page_flags,
-				  ttm_cached);
+				  ttm_cached, 0);
 	if (unlikely(ret != 0))
 		goto out_no_init;
 
diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
index f20832139815..2c4ff08ea354 100644
--- a/include/drm/ttm/ttm_tt.h
+++ b/include/drm/ttm/ttm_tt.h
@@ -140,6 +140,7 @@ int ttm_tt_create(struct ttm_buffer_object *bo, bool zero_alloc);
  * @bo: The buffer object we create the ttm for.
  * @page_flags: Page flags as identified by TTM_TT_FLAG_XX flags.
  * @caching: the desired caching state of the pages
+ * @ccs_pages_needed: Extra pages needed for the ccs (compression) data.
  *
  * Create a struct ttm_tt to back data with system memory pages.
  * No pages are actually allocated.
@@ -147,7 +148,8 @@ int ttm_tt_create(struct ttm_buffer_object *bo, bool zero_alloc);
  * NULL: Out of memory.
  */
 int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
-		uint32_t page_flags, enum ttm_caching caching);
+		uint32_t page_flags, enum ttm_caching caching,
+		unsigned long ccs_pages_needed);
 int ttm_sg_tt_init(struct ttm_tt *ttm_dma, struct ttm_buffer_object *bo,
 		   uint32_t page_flags, enum ttm_caching caching);
 
-- 
2.20.1




* [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
  2022-02-07  9:37 ` [Intel-gfx] " Ramalingam C
@ 2022-02-07  9:37   ` Ramalingam C
  -1 siblings, 0 replies; 29+ messages in thread
From: Ramalingam C @ 2022-02-07  9:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx; +Cc: Hellstrom Thomas, Christian Koenig

When swapping out a local memory object on a flat-ccs capable platform,
we need to capture the ccs data along with the main memory, and restore
it when swapping the content back in.

Extracting and restoring the CCS data is done through a special cmd
called XY_CTRL_SURF_COPY_BLT.
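
The sizing arithmetic behind this can be sketched standalone as follows (a
hedged illustration using the 256-byte CCS block and 1024-blocks-per-command
limits described in the patch; ccs_blocks() and ccs_copy_cmds() are
illustrative helpers, not the kernel's):

```c
#include <assert.h>

#define NUM_CCS_BYTES_PER_BLOCK	256UL	/* CCS moves in 256 B blocks */
#define NUM_CCS_BLKS_PER_XFER	1024UL	/* max blocks per XY_CTRL_SURF_COPY_BLT */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* CCS blocks covering `size` bytes of lmem: the CCS is size/256 bytes,
 * carried in 256-byte blocks, so one block per 64 KiB of lmem. */
unsigned long ccs_blocks(unsigned long size)
{
	return DIV_ROUND_UP(DIV_ROUND_UP(size, NUM_CCS_BYTES_PER_BLOCK),
			    NUM_CCS_BYTES_PER_BLOCK);
}

/* XY_CTRL_SURF_COPY_BLT instructions needed: one per 1024 blocks,
 * i.e. one per 64 MiB of lmem. */
unsigned long ccs_copy_cmds(unsigned long size)
{
	return DIV_ROUND_UP(ccs_blocks(size), NUM_CCS_BLKS_PER_XFER);
}
```

One command thus handles 64 MiB of lmem, which matches the SZ_64M stride the
patch applies to src_addr/dst_addr between successive XY_CTRL_SURF_COPY_BLT
instructions.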

Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++-----------
 1 file changed, 155 insertions(+), 128 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 5bdab0b3c735..e60ae6ff1847 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver, u32 size)
 	return height % 4 == 3 && height <= 8;
 }
 
+/**
+ * DOC: Flat-CCS - Memory compression for Local memory
+ *
+ * On Xe-HP and later devices, we use dedicated compression control state (CCS)
+ * stored in local memory for each surface, to support the 3D and media
+ * compression formats.
+ *
+ * The memory required for the CCS of the entire local memory is 1/256 of the
+ * local memory size. So before the kernel boot, the required memory is reserved
+ * for the CCS data and a secure register will be programmed with the CCS base
+ * address.
+ *
+ * Flat CCS data needs to be cleared when an lmem object is allocated, and CCS
+ * data can be copied in and out of the CCS region through
+ * XY_CTRL_SURF_COPY_BLT. The CPU can't access the CCS data directly.
+ *
+ * When we exhaust the lmem, if the object's placements support smem, then we can
+ * directly decompress the compressed lmem object into smem and start using it
+ * from smem itself.
+ *
+ * But when we need to swap out the compressed lmem object into a smem region,
+ * even though the object's placement doesn't support smem, we copy the lmem
+ * content as it is into the smem region along with the ccs data (using
+ * XY_CTRL_SURF_COPY_BLT). When the object is referenced again, the lmem content
+ * is swapped back in, along with restoration of the CCS data (using
+ * XY_CTRL_SURF_COPY_BLT) at the corresponding location.
+ *
+ *
+ * Flat-CCS Modifiers for different compression formats
+ * ----------------------------------------------------
+ *
+ * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers of Flat CCS
+ * render compression formats. Though the general layout is the same as
+ * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, a new hashing/compression algorithm is
+ * used. Render compression uses 128 byte compression blocks.
+ *
+ * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS - used to indicate the buffers of Flat CCS
+ * media compression formats. Though the general layout is the same as
+ * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, a new hashing/compression algorithm is
+ * used. Media compression uses 256 byte compression blocks.
+ *
+ * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the buffers of Flat
+ * CCS clear color render compression formats. Unified compression format for
+ * clear color render compression. The general layout is a tiled layout using
+ * 4KB tiles, i.e. the Tile4 layout.
+ */
+
+static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
+{
+	/* Mask the 3 LSB to use the PPGTT address space */
+	*cmd++ = MI_FLUSH_DW | flags;
+	*cmd++ = lower_32_bits(dst);
+	*cmd++ = upper_32_bits(dst);
+
+	return cmd;
+}
+
+static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
+{
+	u32 num_cmds, num_blks, total_size;
+
+	if (!GET_CCS_SIZE(i915, size))
+		return 0;
+
+	/*
+	 * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
+	 * blocks. One XY_CTRL_SURF_COPY_BLT command can
+	 * transfer up to 1024 blocks.
+	 */
+	num_blks = GET_CCS_SIZE(i915, size);
+	num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
+	total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
+
+	/*
+	 * We need to add a flush before and after
+	 * XY_CTRL_SURF_COPY_BLT
+	 */
+	total_size += 2 * MI_FLUSH_DW_SIZE;
+	return total_size;
+}
+
+static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
+				     u8 src_mem_access, u8 dst_mem_access,
+				     int src_mocs, int dst_mocs,
+				     u16 num_ccs_blocks)
+{
+	int i = num_ccs_blocks;
+
+	/*
+	 * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
+	 * data in and out of the CCS region.
+	 *
+	 * We can copy at most 1024 blocks of 256 bytes using one
+	 * XY_CTRL_SURF_COPY_BLT instruction.
+	 *
+	 * In case we need to copy more than 1024 blocks, we need to add
+	 * another instruction to the same batch buffer.
+	 *
+	 * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
+	 *
+	 * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
+	 */
+	do {
+		/*
+		 * We use a bitwise AND with 1023 since the size field
+		 * takes values in the range 0 - 1023
+		 */
+		*cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
+			  (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
+			  (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
+			  (((i - 1) & 1023) << CCS_SIZE_SHIFT));
+		*cmd++ = lower_32_bits(src_addr);
+		*cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
+			  (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
+		*cmd++ = lower_32_bits(dst_addr);
+		*cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
+			  (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
+		src_addr += SZ_64M;
+		dst_addr += SZ_64M;
+		i -= NUM_CCS_BLKS_PER_XFER;
+	} while (i > 0);
+
+	return cmd;
+}
+
 static int emit_copy(struct i915_request *rq,
-		     u32 dst_offset, u32 src_offset, int size)
+		     bool dst_is_lmem, u32 dst_offset,
+		     bool src_is_lmem, u32 src_offset, int size)
 {
+	struct drm_i915_private *i915 = rq->engine->i915;
 	const int ver = GRAPHICS_VER(rq->engine->i915);
 	u32 instance = rq->engine->instance;
+	u32 num_ccs_blks, ccs_ring_size;
+	u8 src_access, dst_access;
 	u32 *cs;
 
-	cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
+	ccs_ring_size = ((src_is_lmem || dst_is_lmem) && HAS_FLAT_CCS(i915)) ?
+			 calc_ctrl_surf_instr_size(i915, size) : 0;
+
+	cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
@@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
 		*cs++ = src_offset;
 	}
 
+	if (ccs_ring_size) {
+		/* TODO: Migration needs to be handled with resolve of compressed data */
+		num_ccs_blks = (GET_CCS_SIZE(i915, size) +
+				NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
+
+		src_access = !src_is_lmem && dst_is_lmem;
+		dst_access = !src_access;
+
+		if (src_access) /* Swapin of compressed data */
+			src_offset += size;
+		else
+			dst_offset += size;
+
+		cs = _i915_ctrl_surf_copy_blt(cs, src_offset, dst_offset,
+					      src_access, dst_access,
+					      1, 1, num_ccs_blks);
+		cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
+	}
+
 	intel_ring_advance(rq, cs);
 	return 0;
 }
@@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context *ce,
 		if (err)
 			goto out_rq;
 
-		err = emit_copy(rq, dst_offset, src_offset, len);
+		err = emit_copy(rq, dst_is_lmem, dst_offset,
+				src_is_lmem, src_offset, len);
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
@@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context *ce,
 	return err;
 }
 
-/**
- * DOC: Flat-CCS - Memory compression for Local memory
- *
- * On Xe-HP and later devices, we use dedicated compression control state (CCS)
- * stored in local memory for each surface, to support the 3D and media
- * compression formats.
- *
- * The memory required for the CCS of the entire local memory is 1/256 of the
- * local memory size. So before the kernel boot, the required memory is reserved
- * for the CCS data and a secure register will be programmed with the CCS base
- * address.
- *
- * Flat CCS data needs to be cleared when a lmem object is allocated.
- * And CCS data can be copied in and out of CCS region through
- * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
- *
- * When we exaust the lmem, if the object's placements support smem, then we can
- * directly decompress the compressed lmem object into smem and start using it
- * from smem itself.
- *
- * But when we need to swapout the compressed lmem object into a smem region
- * though objects' placement doesn't support smem, then we copy the lmem content
- * as it is into smem region along with ccs data (using XY_CTRL_SURF_COPY_BLT).
- * When the object is referred, lmem content will be swaped in along with
- * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at corresponding
- * location.
- *
- *
- * Flat-CCS Modifiers for different compression formats
- * ----------------------------------------------------
- *
- * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers of Flat CCS
- * render compression formats. Though the general layout is same as
- * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression algorithm is
- * used. Render compression uses 128 byte compression blocks
- *
- * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers of Flat CCS
- * media compression formats. Though the general layout is same as
- * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression algorithm is
- * used. Media compression uses 256 byte compression blocks.
- *
- * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the buffers of Flat
- * CCS clear color render compression formats. Unified compression format for
- * clear color render compression. The genral layout is a tiled layout using
- * 4Kb tiles i.e Tile4 layout.
- */
-
-static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
-{
-	/* Mask the 3 LSB to use the PPGTT address space */
-	*cmd++ = MI_FLUSH_DW | flags;
-	*cmd++ = lower_32_bits(dst);
-	*cmd++ = upper_32_bits(dst);
-
-	return cmd;
-}
-
-static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
-{
-	u32 num_cmds, num_blks, total_size;
-
-	if (!GET_CCS_SIZE(i915, size))
-		return 0;
-
-	/*
-	 * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
-	 * blocks. one XY_CTRL_SURF_COPY_BLT command can
-	 * trnasfer upto 1024 blocks.
-	 */
-	num_blks = GET_CCS_SIZE(i915, size);
-	num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
-	total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
-
-	/*
-	 * We need to add a flush before and after
-	 * XY_CTRL_SURF_COPY_BLT
-	 */
-	total_size += 2 * MI_FLUSH_DW_SIZE;
-	return total_size;
-}
-
-static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
-				     u8 src_mem_access, u8 dst_mem_access,
-				     int src_mocs, int dst_mocs,
-				     u16 num_ccs_blocks)
-{
-	int i = num_ccs_blocks;
-
-	/*
-	 * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
-	 * data in and out of the CCS region.
-	 *
-	 * We can copy at most 1024 blocks of 256 bytes using one
-	 * XY_CTRL_SURF_COPY_BLT instruction.
-	 *
-	 * In case we need to copy more than 1024 blocks, we need to add
-	 * another instruction to the same batch buffer.
-	 *
-	 * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
-	 *
-	 * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
-	 */
-	do {
-		/*
-		 * We use logical AND with 1023 since the size field
-		 * takes values which is in the range of 0 - 1023
-		 */
-		*cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
-			  (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
-			  (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
-			  (((i - 1) & 1023) << CCS_SIZE_SHIFT));
-		*cmd++ = lower_32_bits(src_addr);
-		*cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
-			  (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
-		*cmd++ = lower_32_bits(dst_addr);
-		*cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
-			  (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
-		src_addr += SZ_64M;
-		dst_addr += SZ_64M;
-		i -= NUM_CCS_BLKS_PER_XFER;
-	} while (i > 0);
-
-	return cmd;
-}
-
 static int emit_clear(struct i915_request *rq,
 		      u64 offset,
 		      int size,
-- 
2.20.1



+ * media compression formats. Though the general layout is same as
+ * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression algorithm is
+ * used. Media compression uses 256 byte compression blocks.
+ *
+ * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the buffers of Flat
+ * CCS clear color render compression formats. Unified compression format for
+ * clear color render compression. The genral layout is a tiled layout using
+ * 4Kb tiles i.e Tile4 layout.
+ */
+
+static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
+{
+	/* Mask the 3 LSB to use the PPGTT address space */
+	*cmd++ = MI_FLUSH_DW | flags;
+	*cmd++ = lower_32_bits(dst);
+	*cmd++ = upper_32_bits(dst);
+
+	return cmd;
+}
+
+static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
+{
+	u32 num_cmds, num_blks, total_size;
+
+	if (!GET_CCS_SIZE(i915, size))
+		return 0;
+
+	/*
+	 * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
+	 * blocks. one XY_CTRL_SURF_COPY_BLT command can
+	 * trnasfer upto 1024 blocks.
+	 */
+	num_blks = GET_CCS_SIZE(i915, size);
+	num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
+	total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
+
+	/*
+	 * We need to add a flush before and after
+	 * XY_CTRL_SURF_COPY_BLT
+	 */
+	total_size += 2 * MI_FLUSH_DW_SIZE;
+	return total_size;
+}
+
+static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
+				     u8 src_mem_access, u8 dst_mem_access,
+				     int src_mocs, int dst_mocs,
+				     u16 num_ccs_blocks)
+{
+	int i = num_ccs_blocks;
+
+	/*
+	 * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
+	 * data in and out of the CCS region.
+	 *
+	 * We can copy at most 1024 blocks of 256 bytes using one
+	 * XY_CTRL_SURF_COPY_BLT instruction.
+	 *
+	 * In case we need to copy more than 1024 blocks, we need to add
+	 * another instruction to the same batch buffer.
+	 *
+	 * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
+	 *
+	 * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
+	 */
+	do {
+		/*
+		 * We use logical AND with 1023 since the size field
+		 * takes values which is in the range of 0 - 1023
+		 */
+		*cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
+			  (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
+			  (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
+			  (((i - 1) & 1023) << CCS_SIZE_SHIFT));
+		*cmd++ = lower_32_bits(src_addr);
+		*cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
+			  (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
+		*cmd++ = lower_32_bits(dst_addr);
+		*cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
+			  (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
+		src_addr += SZ_64M;
+		dst_addr += SZ_64M;
+		i -= NUM_CCS_BLKS_PER_XFER;
+	} while (i > 0);
+
+	return cmd;
+}
+
 static int emit_copy(struct i915_request *rq,
-		     u32 dst_offset, u32 src_offset, int size)
+		     bool dst_is_lmem, u32 dst_offset,
+		     bool src_is_lmem, u32 src_offset, int size)
 {
+	struct drm_i915_private *i915 = rq->engine->i915;
 	const int ver = GRAPHICS_VER(rq->engine->i915);
 	u32 instance = rq->engine->instance;
+	u32 num_ccs_blks, ccs_ring_size;
+	u8 src_access, dst_access;
 	u32 *cs;
 
-	cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
+	ccs_ring_size = ((src_is_lmem || dst_is_lmem) && HAS_FLAT_CCS(i915)) ?
+			 calc_ctrl_surf_instr_size(i915, size) : 0;
+
+	cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
@@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
 		*cs++ = src_offset;
 	}
 
+	if (ccs_ring_size) {
+		/* TODO: Migration needs to be handled with resolve of compressed data */
+		num_ccs_blks = (GET_CCS_SIZE(i915, size) +
+				NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
+
+		src_access = !src_is_lmem && dst_is_lmem;
+		dst_access = !src_access;
+
+		if (src_access) /* Swapin of compressed data */
+			src_offset += size;
+		else
+			dst_offset += size;
+
+		cs = _i915_ctrl_surf_copy_blt(cs, src_offset, dst_offset,
+					      src_access, dst_access,
+					      1, 1, num_ccs_blks);
+		cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
+	}
+
 	intel_ring_advance(rq, cs);
 	return 0;
 }
@@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context *ce,
 		if (err)
 			goto out_rq;
 
-		err = emit_copy(rq, dst_offset, src_offset, len);
+		err = emit_copy(rq, dst_is_lmem, dst_offset,
+				src_is_lmem, src_offset, len);
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
@@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context *ce,
 	return err;
 }
 
-/**
- * DOC: Flat-CCS - Memory compression for Local memory
- *
- * On Xe-HP and later devices, we use dedicated compression control state (CCS)
- * stored in local memory for each surface, to support the 3D and media
- * compression formats.
- *
- * The memory required for the CCS of the entire local memory is 1/256 of the
- * local memory size. So before the kernel boot, the required memory is reserved
- * for the CCS data and a secure register will be programmed with the CCS base
- * address.
- *
- * Flat CCS data needs to be cleared when a lmem object is allocated.
- * And CCS data can be copied in and out of CCS region through
- * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
- *
- * When we exaust the lmem, if the object's placements support smem, then we can
- * directly decompress the compressed lmem object into smem and start using it
- * from smem itself.
- *
- * But when we need to swapout the compressed lmem object into a smem region
- * though objects' placement doesn't support smem, then we copy the lmem content
- * as it is into smem region along with ccs data (using XY_CTRL_SURF_COPY_BLT).
- * When the object is referred, lmem content will be swaped in along with
- * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at corresponding
- * location.
- *
- *
- * Flat-CCS Modifiers for different compression formats
- * ----------------------------------------------------
- *
- * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers of Flat CCS
- * render compression formats. Though the general layout is same as
- * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression algorithm is
- * used. Render compression uses 128 byte compression blocks
- *
- * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers of Flat CCS
- * media compression formats. Though the general layout is same as
- * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression algorithm is
- * used. Media compression uses 256 byte compression blocks.
- *
- * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the buffers of Flat
- * CCS clear color render compression formats. Unified compression format for
- * clear color render compression. The genral layout is a tiled layout using
- * 4Kb tiles i.e Tile4 layout.
- */
-
-static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
-{
-	/* Mask the 3 LSB to use the PPGTT address space */
-	*cmd++ = MI_FLUSH_DW | flags;
-	*cmd++ = lower_32_bits(dst);
-	*cmd++ = upper_32_bits(dst);
-
-	return cmd;
-}
-
-static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
-{
-	u32 num_cmds, num_blks, total_size;
-
-	if (!GET_CCS_SIZE(i915, size))
-		return 0;
-
-	/*
-	 * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
-	 * blocks. one XY_CTRL_SURF_COPY_BLT command can
-	 * trnasfer upto 1024 blocks.
-	 */
-	num_blks = GET_CCS_SIZE(i915, size);
-	num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
-	total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
-
-	/*
-	 * We need to add a flush before and after
-	 * XY_CTRL_SURF_COPY_BLT
-	 */
-	total_size += 2 * MI_FLUSH_DW_SIZE;
-	return total_size;
-}
-
-static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
-				     u8 src_mem_access, u8 dst_mem_access,
-				     int src_mocs, int dst_mocs,
-				     u16 num_ccs_blocks)
-{
-	int i = num_ccs_blocks;
-
-	/*
-	 * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
-	 * data in and out of the CCS region.
-	 *
-	 * We can copy at most 1024 blocks of 256 bytes using one
-	 * XY_CTRL_SURF_COPY_BLT instruction.
-	 *
-	 * In case we need to copy more than 1024 blocks, we need to add
-	 * another instruction to the same batch buffer.
-	 *
-	 * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
-	 *
-	 * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
-	 */
-	do {
-		/*
-		 * We use logical AND with 1023 since the size field
-		 * takes values which is in the range of 0 - 1023
-		 */
-		*cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
-			  (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
-			  (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
-			  (((i - 1) & 1023) << CCS_SIZE_SHIFT));
-		*cmd++ = lower_32_bits(src_addr);
-		*cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
-			  (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
-		*cmd++ = lower_32_bits(dst_addr);
-		*cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
-			  (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
-		src_addr += SZ_64M;
-		dst_addr += SZ_64M;
-		i -= NUM_CCS_BLKS_PER_XFER;
-	} while (i > 0);
-
-	return cmd;
-}
-
 static int emit_clear(struct i915_request *rq,
 		      u64 offset,
 		      int size,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [RFC 1/2] drm/i915/ttm: Add extra pages for handling ccs data
  2022-02-07  9:37   ` [Intel-gfx] " Ramalingam C
  (?)
@ 2022-02-07 10:41   ` Thomas Hellström (Intel)
  -1 siblings, 0 replies; 29+ messages in thread
From: Thomas Hellström (Intel) @ 2022-02-07 10:41 UTC (permalink / raw)
  To: Ramalingam C, dri-devel, intel-gfx; +Cc: Hellstrom Thomas, Christian Koenig

Hi, Ram,


On 2/7/22 10:37, Ramalingam C wrote:
> While evicting the local memory data on flat-ccs capable platform we
> need to evict the ccs data associated to the data.

>   For this, we are
> adding extra pages ((size / 256) >> PAGE_SIZE) into the ttm_tt.
>
> To achieve this we are adding a new param into the ttm_tt_init as
> ccs_pages_needed, which will be added into the ttm_tt->num_pages.

Please use the imperative form above. Instead of "We are adding..", use "Add".


>
> Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> Suggested-by: Thomas Hellstorm <thomas.hellstrom@intel.com>
Hellstorm instead of Hellstrom might scare people off. :)
> ---
>   drivers/gpu/drm/drm_gem_vram_helper.c      |  2 +-
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c    | 23 +++++++++++++++++++++-
>   drivers/gpu/drm/qxl/qxl_ttm.c              |  2 +-
>   drivers/gpu/drm/ttm/ttm_agp_backend.c      |  2 +-
>   drivers/gpu/drm/ttm/ttm_tt.c               | 12 ++++++-----
>   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |  2 +-
>   include/drm/ttm/ttm_tt.h                   |  4 +++-
>   7 files changed, 36 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem_vram_helper.c b/drivers/gpu/drm/drm_gem_vram_helper.c
> index 3f00192215d1..eef1f4dc7232 100644
> --- a/drivers/gpu/drm/drm_gem_vram_helper.c
> +++ b/drivers/gpu/drm/drm_gem_vram_helper.c
> @@ -864,7 +864,7 @@ static struct ttm_tt *bo_driver_ttm_tt_create(struct ttm_buffer_object *bo,
>   	if (!tt)
>   		return NULL;
>   
> -	ret = ttm_tt_init(tt, bo, page_flags, ttm_cached);
> +	ret = ttm_tt_init(tt, bo, page_flags, ttm_cached, 0);
>   	if (ret < 0)
>   		goto err_ttm_tt_init;
>   
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index 84cae740b4a5..bb71aa6d66c0 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -16,6 +16,7 @@
>   #include "gem/i915_gem_ttm.h"
>   #include "gem/i915_gem_ttm_move.h"
>   #include "gem/i915_gem_ttm_pm.h"
> +#include "gt/intel_gpu_commands.h"
>   
>   #define I915_TTM_PRIO_PURGE     0
>   #define I915_TTM_PRIO_NO_PAGES  1
> @@ -242,12 +243,27 @@ static const struct i915_refct_sgt_ops tt_rsgt_ops = {
>   	.release = i915_ttm_tt_release
>   };
>   
> +static inline bool
> +i915_gem_object_has_lmem_placement(struct drm_i915_gem_object *obj)
> +{
> +	int i;
> +
> +	for (i = 0; i < obj->mm.n_placements; i++)
> +		if (obj->mm.placements[i]->type == INTEL_MEMORY_LOCAL)
> +			return true;
> +
> +	return false;
> +}
> +
>   static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   					 uint32_t page_flags)
>   {
> +	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
> +						     bdev);
>   	struct ttm_resource_manager *man =
>   		ttm_manager_type(bo->bdev, bo->resource->mem_type);
>   	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
> +	unsigned long ccs_pages_needed = 0;
>   	enum ttm_caching caching;
>   	struct i915_ttm_tt *i915_tt;
>   	int ret;
> @@ -270,7 +286,12 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   		i915_tt->is_shmem = true;
>   	}
>   
> -	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags, caching);
> +	if (HAS_FLAT_CCS(i915) && i915_gem_object_has_lmem_placement(obj))
> +		ccs_pages_needed = DIV_ROUND_UP(DIV_ROUND_UP(bo->base.size,
> +					       NUM_CCS_BYTES_PER_BLOCK), PAGE_SIZE);
> +
> +	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags,
> +			  caching, ccs_pages_needed);

I'd suggest a patch that first adds the functionality to TTM, where even 
i915 passes in 0 here, and a follow-up patch for the i915 functionality 
where we add the ccs requirement.


>   	if (ret)
>   		goto err_free;
>   
> diff --git a/drivers/gpu/drm/qxl/qxl_ttm.c b/drivers/gpu/drm/qxl/qxl_ttm.c
> index b2e33d5ba5d0..52156b54498f 100644
> --- a/drivers/gpu/drm/qxl/qxl_ttm.c
> +++ b/drivers/gpu/drm/qxl/qxl_ttm.c
> @@ -113,7 +113,7 @@ static struct ttm_tt *qxl_ttm_tt_create(struct ttm_buffer_object *bo,
>   	ttm = kzalloc(sizeof(struct ttm_tt), GFP_KERNEL);
>   	if (ttm == NULL)
>   		return NULL;
> -	if (ttm_tt_init(ttm, bo, page_flags, ttm_cached)) {
> +	if (ttm_tt_init(ttm, bo, page_flags, ttm_cached, 0)) {
>   		kfree(ttm);
>   		return NULL;
>   	}
> diff --git a/drivers/gpu/drm/ttm/ttm_agp_backend.c b/drivers/gpu/drm/ttm/ttm_agp_backend.c
> index 6ddc16f0fe2b..d27691f2e451 100644
> --- a/drivers/gpu/drm/ttm/ttm_agp_backend.c
> +++ b/drivers/gpu/drm/ttm/ttm_agp_backend.c
> @@ -134,7 +134,7 @@ struct ttm_tt *ttm_agp_tt_create(struct ttm_buffer_object *bo,
>   	agp_be->mem = NULL;
>   	agp_be->bridge = bridge;
>   
> -	if (ttm_tt_init(&agp_be->ttm, bo, page_flags, ttm_write_combined)) {
> +	if (ttm_tt_init(&agp_be->ttm, bo, page_flags, ttm_write_combined, 0)) {
>   		kfree(agp_be);
>   		return NULL;
>   	}
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> index 79c870a3bef8..80355465f717 100644
> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> @@ -134,9 +134,10 @@ void ttm_tt_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
>   static void ttm_tt_init_fields(struct ttm_tt *ttm,
>   			       struct ttm_buffer_object *bo,
>   			       uint32_t page_flags,
> -			       enum ttm_caching caching)
> +			       enum ttm_caching caching,
> +			       unsigned long ccs_pages)
>   {
> -	ttm->num_pages = PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT;
> +	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + ccs_pages;
>   	ttm->caching = ttm_cached;
>   	ttm->page_flags = page_flags;
>   	ttm->dma_address = NULL;
> @@ -146,9 +147,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>   }
>   
>   int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> -		uint32_t page_flags, enum ttm_caching caching)
> +		uint32_t page_flags, enum ttm_caching caching,
> +		unsigned long ccs_pages)
>   {
> -	ttm_tt_init_fields(ttm, bo, page_flags, caching);
> +	ttm_tt_init_fields(ttm, bo, page_flags, caching, ccs_pages);
>   
>   	if (ttm_tt_alloc_page_directory(ttm)) {
>   		pr_err("Failed allocating page table\n");
> @@ -180,7 +182,7 @@ int ttm_sg_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
>   {
>   	int ret;
>   
> -	ttm_tt_init_fields(ttm, bo, page_flags, caching);
> +	ttm_tt_init_fields(ttm, bo, page_flags, caching, 0);
>   
>   	if (page_flags & TTM_TT_FLAG_EXTERNAL)
>   		ret = ttm_sg_tt_alloc_page_directory(ttm);
> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> index b84ecc6d6611..4e3938e62c08 100644
> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> @@ -517,7 +517,7 @@ static struct ttm_tt *vmw_ttm_tt_create(struct ttm_buffer_object *bo,
>   				     ttm_cached);
>   	else
>   		ret = ttm_tt_init(&vmw_be->dma_ttm, bo, page_flags,
> -				  ttm_cached);
> +				  ttm_cached, 0);
>   	if (unlikely(ret != 0))
>   		goto out_no_init;
>   
> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> index f20832139815..2c4ff08ea354 100644
> --- a/include/drm/ttm/ttm_tt.h
> +++ b/include/drm/ttm/ttm_tt.h
> @@ -140,6 +140,7 @@ int ttm_tt_create(struct ttm_buffer_object *bo, bool zero_alloc);
>    * @bo: The buffer object we create the ttm for.
>    * @page_flags: Page flags as identified by TTM_TT_FLAG_XX flags.
>    * @caching: the desired caching state of the pages
> + * @ccs_pages_needed: Extra pages needed for the ccs data of compression.

The name and use-case "ccs" are driver-specific, and TTM knows nothing 
about CCS. Hence we should use "additional_pages", "additional_size" or 
something similar for this. Christian might have some additional 
guidance here.

>    *
>    * Create a struct ttm_tt to back data with system memory pages.
>    * No pages are actually allocated.
> @@ -147,7 +148,8 @@ int ttm_tt_create(struct ttm_buffer_object *bo, bool zero_alloc);
>    * NULL: Out of memory.
>    */
>   int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> -		uint32_t page_flags, enum ttm_caching caching);
> +		uint32_t page_flags, enum ttm_caching caching,
> +		unsigned long ccs_pages_needed);
>   int ttm_sg_tt_init(struct ttm_tt *ttm_dma, struct ttm_buffer_object *bo,
>   		   uint32_t page_flags, enum ttm_caching caching);
>   

Thanks,

Thomas



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [RFC 1/2] drm/i915/ttm: Add extra pages for handling ccs data
  2022-02-07  9:37   ` [Intel-gfx] " Ramalingam C
  (?)
  (?)
@ 2022-02-07 10:41   ` Das, Nirmoy
  -1 siblings, 0 replies; 29+ messages in thread
From: Das, Nirmoy @ 2022-02-07 10:41 UTC (permalink / raw)
  To: Ramalingam C, dri-devel, intel-gfx; +Cc: Hellstrom Thomas, Christian Koenig


On 07/02/2022 10:37, Ramalingam C wrote:
> While evicting the local memory data on flat-ccs capable platform we
> need to evict the ccs data associated to the data. For this, we are
> adding extra pages ((size / 256) >> PAGE_SIZE) into the ttm_tt.
>
> To achieve this we are adding a new param into the ttm_tt_init as
> ccs_pages_needed, which will be added into the ttm_tt->num_pages.
>
> Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> Suggested-by: Thomas Hellstorm <thomas.hellstrom@intel.com>
> ---
>   drivers/gpu/drm/drm_gem_vram_helper.c      |  2 +-
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c    | 23 +++++++++++++++++++++-
>   drivers/gpu/drm/qxl/qxl_ttm.c              |  2 +-
>   drivers/gpu/drm/ttm/ttm_agp_backend.c      |  2 +-
>   drivers/gpu/drm/ttm/ttm_tt.c               | 12 ++++++-----
>   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |  2 +-
>   include/drm/ttm/ttm_tt.h                   |  4 +++-
>   7 files changed, 36 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem_vram_helper.c b/drivers/gpu/drm/drm_gem_vram_helper.c
> index 3f00192215d1..eef1f4dc7232 100644
> --- a/drivers/gpu/drm/drm_gem_vram_helper.c
> +++ b/drivers/gpu/drm/drm_gem_vram_helper.c
> @@ -864,7 +864,7 @@ static struct ttm_tt *bo_driver_ttm_tt_create(struct ttm_buffer_object *bo,
>   	if (!tt)
>   		return NULL;
>   
> -	ret = ttm_tt_init(tt, bo, page_flags, ttm_cached);
> +	ret = ttm_tt_init(tt, bo, page_flags, ttm_cached, 0);
>   	if (ret < 0)
>   		goto err_ttm_tt_init;
>   
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index 84cae740b4a5..bb71aa6d66c0 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -16,6 +16,7 @@
>   #include "gem/i915_gem_ttm.h"
>   #include "gem/i915_gem_ttm_move.h"
>   #include "gem/i915_gem_ttm_pm.h"
> +#include "gt/intel_gpu_commands.h"
>   
>   #define I915_TTM_PRIO_PURGE     0
>   #define I915_TTM_PRIO_NO_PAGES  1
> @@ -242,12 +243,27 @@ static const struct i915_refct_sgt_ops tt_rsgt_ops = {
>   	.release = i915_ttm_tt_release
>   };
>   
> +static inline bool
> +i915_gem_object_has_lmem_placement(struct drm_i915_gem_object *obj)
> +{
> +	int i;
> +
> +	for (i = 0; i < obj->mm.n_placements; i++)
> +		if (obj->mm.placements[i]->type == INTEL_MEMORY_LOCAL)
> +			return true;
> +
> +	return false;
> +}
> +
>   static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   					 uint32_t page_flags)
>   {
> +	struct drm_i915_private *i915 = container_of(bo->bdev, typeof(*i915),
> +						     bdev);
>   	struct ttm_resource_manager *man =
>   		ttm_manager_type(bo->bdev, bo->resource->mem_type);
>   	struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
> +	unsigned long ccs_pages_needed = 0;
>   	enum ttm_caching caching;
>   	struct i915_ttm_tt *i915_tt;
>   	int ret;
> @@ -270,7 +286,12 @@ static struct ttm_tt *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>   		i915_tt->is_shmem = true;
>   	}
>   
> -	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags, caching);
> +	if (HAS_FLAT_CCS(i915) && i915_gem_object_has_lmem_placement(obj))
> +		ccs_pages_needed = DIV_ROUND_UP(DIV_ROUND_UP(bo->base.size,
> +					       NUM_CCS_BYTES_PER_BLOCK), PAGE_SIZE);
> +
> +	ret = ttm_tt_init(&i915_tt->ttm, bo, page_flags,
> +			  caching, ccs_pages_needed);

I am wondering if we should do this in the driver itself and pass 
ttm->num_pages with CCS size included.


Regards,

Nirmoy


>   	if (ret)
>   		goto err_free;
>   
> diff --git a/drivers/gpu/drm/qxl/qxl_ttm.c b/drivers/gpu/drm/qxl/qxl_ttm.c
> index b2e33d5ba5d0..52156b54498f 100644
> --- a/drivers/gpu/drm/qxl/qxl_ttm.c
> +++ b/drivers/gpu/drm/qxl/qxl_ttm.c
> @@ -113,7 +113,7 @@ static struct ttm_tt *qxl_ttm_tt_create(struct ttm_buffer_object *bo,
>   	ttm = kzalloc(sizeof(struct ttm_tt), GFP_KERNEL);
>   	if (ttm == NULL)
>   		return NULL;
> -	if (ttm_tt_init(ttm, bo, page_flags, ttm_cached)) {
> +	if (ttm_tt_init(ttm, bo, page_flags, ttm_cached, 0)) {
>   		kfree(ttm);
>   		return NULL;
>   	}
> diff --git a/drivers/gpu/drm/ttm/ttm_agp_backend.c b/drivers/gpu/drm/ttm/ttm_agp_backend.c
> index 6ddc16f0fe2b..d27691f2e451 100644
> --- a/drivers/gpu/drm/ttm/ttm_agp_backend.c
> +++ b/drivers/gpu/drm/ttm/ttm_agp_backend.c
> @@ -134,7 +134,7 @@ struct ttm_tt *ttm_agp_tt_create(struct ttm_buffer_object *bo,
>   	agp_be->mem = NULL;
>   	agp_be->bridge = bridge;
>   
> -	if (ttm_tt_init(&agp_be->ttm, bo, page_flags, ttm_write_combined)) {
> +	if (ttm_tt_init(&agp_be->ttm, bo, page_flags, ttm_write_combined, 0)) {
>   		kfree(agp_be);
>   		return NULL;
>   	}
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> index 79c870a3bef8..80355465f717 100644
> --- a/drivers/gpu/drm/ttm/ttm_tt.c
> +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> @@ -134,9 +134,10 @@ void ttm_tt_destroy(struct ttm_device *bdev, struct ttm_tt *ttm)
>   static void ttm_tt_init_fields(struct ttm_tt *ttm,
>   			       struct ttm_buffer_object *bo,
>   			       uint32_t page_flags,
> -			       enum ttm_caching caching)
> +			       enum ttm_caching caching,
> +			       unsigned long ccs_pages)
>   {
> -	ttm->num_pages = PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT;
> +	ttm->num_pages = (PAGE_ALIGN(bo->base.size) >> PAGE_SHIFT) + ccs_pages;
>   	ttm->caching = ttm_cached;
>   	ttm->page_flags = page_flags;
>   	ttm->dma_address = NULL;
> @@ -146,9 +147,10 @@ static void ttm_tt_init_fields(struct ttm_tt *ttm,
>   }
>   
>   int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> -		uint32_t page_flags, enum ttm_caching caching)
> +		uint32_t page_flags, enum ttm_caching caching,
> +		unsigned long ccs_pages)
>   {
> -	ttm_tt_init_fields(ttm, bo, page_flags, caching);
> +	ttm_tt_init_fields(ttm, bo, page_flags, caching, ccs_pages);
>   
>   	if (ttm_tt_alloc_page_directory(ttm)) {
>   		pr_err("Failed allocating page table\n");
> @@ -180,7 +182,7 @@ int ttm_sg_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
>   {
>   	int ret;
>   
> -	ttm_tt_init_fields(ttm, bo, page_flags, caching);
> +	ttm_tt_init_fields(ttm, bo, page_flags, caching, 0);
>   
>   	if (page_flags & TTM_TT_FLAG_EXTERNAL)
>   		ret = ttm_sg_tt_alloc_page_directory(ttm);
> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> index b84ecc6d6611..4e3938e62c08 100644
> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> @@ -517,7 +517,7 @@ static struct ttm_tt *vmw_ttm_tt_create(struct ttm_buffer_object *bo,
>   				     ttm_cached);
>   	else
>   		ret = ttm_tt_init(&vmw_be->dma_ttm, bo, page_flags,
> -				  ttm_cached);
> +				  ttm_cached, 0);
>   	if (unlikely(ret != 0))
>   		goto out_no_init;
>   
> diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h
> index f20832139815..2c4ff08ea354 100644
> --- a/include/drm/ttm/ttm_tt.h
> +++ b/include/drm/ttm/ttm_tt.h
> @@ -140,6 +140,7 @@ int ttm_tt_create(struct ttm_buffer_object *bo, bool zero_alloc);
>    * @bo: The buffer object we create the ttm for.
>    * @page_flags: Page flags as identified by TTM_TT_FLAG_XX flags.
>    * @caching: the desired caching state of the pages
> + * @ccs_pages_needed: Extra pages needed for the ccs data of compression.
>    *
>    * Create a struct ttm_tt to back data with system memory pages.
>    * No pages are actually allocated.
> @@ -147,7 +148,8 @@ int ttm_tt_create(struct ttm_buffer_object *bo, bool zero_alloc);
>    * NULL: Out of memory.
>    */
>   int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> -		uint32_t page_flags, enum ttm_caching caching);
> +		uint32_t page_flags, enum ttm_caching caching,
> +		unsigned long ccs_pages_needed);
>   int ttm_sg_tt_init(struct ttm_tt *ttm_dma, struct ttm_buffer_object *bo,
>   		   uint32_t page_flags, enum ttm_caching caching);
>   

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for drm/i915/ttm: Evict and store of compressed object
  2022-02-07  9:37 ` [Intel-gfx] " Ramalingam C
                   ` (2 preceding siblings ...)
  (?)
@ 2022-02-07 10:48 ` Patchwork
  -1 siblings, 0 replies; 29+ messages in thread
From: Patchwork @ 2022-02-07 10:48 UTC (permalink / raw)
  To: Ramalingam C; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/ttm: Evict and store of compressed object
URL   : https://patchwork.freedesktop.org/series/99759/
State : failure

== Summary ==

Applying: drm/i915/ttm: Add extra pages for handling ccs data
Applying: drm/i915/migrate: Evict and restore the ccs data
Using index info to reconstruct a base tree...
M	drivers/gpu/drm/i915/gt/intel_migrate.c
Falling back to patching base and 3-way merge...
Auto-merging drivers/gpu/drm/i915/gt/intel_migrate.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/gt/intel_migrate.c
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 drm/i915/migrate: Evict and restore the ccs data
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/2] drm/i915/ttm: Evict and store of compressed object
  2022-02-07  9:37 ` [Intel-gfx] " Ramalingam C
@ 2022-02-07 11:41   ` Christian König
  -1 siblings, 0 replies; 29+ messages in thread
From: Christian König @ 2022-02-07 11:41 UTC (permalink / raw)
  To: Ramalingam C, dri-devel, intel-gfx; +Cc: Hellstrom Thomas

Am 07.02.22 um 10:37 schrieb Ramalingam C:
> On flat-ccs capable platform we need to evict and resore the ccs data
> along with the corresponding main memory.
>
> This ccs data can only be access through BLT engine through a special
> cmd ( )
>
> To support above requirement of flat-ccs enabled i915 platforms this
> series adds new param called ccs_pages_needed to the ttm_tt_init(),
> to increase the ttm_tt->num_pages of system memory when the obj has the
> lmem placement possibility.

Well question is why isn't the buffer object allocated with the extra 
space in the first place?

Regards,
Christian.

>
> This will be on top of the flat-ccs enabling series
> https://patchwork.freedesktop.org/series/95686/
>
> For more about flat-ccs feature please have a look at
> https://patchwork.freedesktop.org/patch/471777/?series=95686&rev=5
>
> Testing of the series is WIP and looking forward for the early review on
> the amendment to ttm_tt_init and the approach.
>
> Ramalingam C (2):
>    drm/i915/ttm: Add extra pages for handling ccs data
>    drm/i915/migrate: Evict and restore the ccs data
>
>   drivers/gpu/drm/drm_gem_vram_helper.c      |   2 +-
>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c    |  23 +-
>   drivers/gpu/drm/i915/gt/intel_migrate.c    | 283 +++++++++++----------
>   drivers/gpu/drm/qxl/qxl_ttm.c              |   2 +-
>   drivers/gpu/drm/ttm/ttm_agp_backend.c      |   2 +-
>   drivers/gpu/drm/ttm/ttm_tt.c               |  12 +-
>   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |   2 +-
>   include/drm/ttm/ttm_tt.h                   |   4 +-
>   8 files changed, 191 insertions(+), 139 deletions(-)
>


^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [RFC 0/2] drm/i915/ttm: Evict and store of compressed object
  2022-02-07 11:41   ` [Intel-gfx] " Christian König
@ 2022-02-07 13:49     ` Hellstrom, Thomas
  -1 siblings, 0 replies; 29+ messages in thread
From: Hellstrom, Thomas @ 2022-02-07 13:49 UTC (permalink / raw)
  To: dri-devel, christian.koenig, C, Ramalingam, intel-gfx

Hi, Christian,

On Mon, 2022-02-07 at 12:41 +0100, Christian König wrote:
> On 07.02.22 at 10:37, Ramalingam C wrote:
> > On flat-ccs capable platforms we need to evict and restore the ccs data
> > along with the corresponding main memory.
> > 
> > This ccs data can only be accessed through the BLT engine, via a special
> > cmd ( )
> > 
> > To support the above requirement on flat-ccs enabled i915 platforms, this
> > series adds a new param called ccs_pages_needed to ttm_tt_init(),
> > to increase ttm_tt->num_pages for system memory when the obj has the
> > lmem placement possibility.
> 
> Well, the question is why isn't the buffer object allocated with the extra
> space in the first place?

That wastes precious VRAM. The extra space is needed only when the bo
is evicted.

We've had a previous short discussion on this here:
https://lists.freedesktop.org/archives/dri-devel/2021-August/321161.html

Thanks,
Thomas


> 
> Regards,
> Christian.
> 
> > 
> > This will be on top of the flat-ccs enabling series
> > https://patchwork.freedesktop.org/series/95686/
> > 
> > For more about flat-ccs feature please have a look at
> > https://patchwork.freedesktop.org/patch/471777/?series=95686&rev=5
> > 
> > Testing of the series is WIP and we are looking forward to early review on
> > the amendment to ttm_tt_init and the approach.
> > 
> > Ramalingam C (2):
> >    drm/i915/ttm: Add extra pages for handling ccs data
> >    drm/i915/migrate: Evict and restore the ccs data
> > 
> >   drivers/gpu/drm/drm_gem_vram_helper.c      |   2 +-
> >   drivers/gpu/drm/i915/gem/i915_gem_ttm.c    |  23 +-
> >   drivers/gpu/drm/i915/gt/intel_migrate.c    | 283 +++++++++++-------
> > ---
> >   drivers/gpu/drm/qxl/qxl_ttm.c              |   2 +-
> >   drivers/gpu/drm/ttm/ttm_agp_backend.c      |   2 +-
> >   drivers/gpu/drm/ttm/ttm_tt.c               |  12 +-
> >   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |   2 +-
> >   include/drm/ttm/ttm_tt.h                   |   4 +-
> >   8 files changed, 191 insertions(+), 139 deletions(-)
> > 
> 

----------------------------------------------------------------------
Intel Sweden AB
Registered Office: Isafjordsgatan 30B, 164 40 Kista, Stockholm, Sweden
Registration Number: 556189-6027

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [RFC 0/2] drm/i915/ttm: Evict and store of compressed object
  2022-02-07 11:41   ` [Intel-gfx] " Christian König
@ 2022-02-07 13:53     ` Ramalingam C
  -1 siblings, 0 replies; 29+ messages in thread
From: Ramalingam C @ 2022-02-07 13:53 UTC (permalink / raw)
  To: Christian König; +Cc: intel-gfx, Hellstrom Thomas, dri-devel

On 2022-02-07 at 12:41:59 +0100, Christian König wrote:
> On 07.02.22 at 10:37, Ramalingam C wrote:
> > On flat-ccs capable platforms we need to evict and restore the ccs data
> > along with the corresponding main memory.
> > 
> > This ccs data can only be accessed through the BLT engine, via a special
> > cmd ( )
> > 
> > To support the above requirement on flat-ccs enabled i915 platforms, this
> > series adds a new param called ccs_pages_needed to ttm_tt_init(),
> > to increase ttm_tt->num_pages for system memory when the obj has the
> > lmem placement possibility.
> 
> Well, the question is why isn't the buffer object allocated with the extra space
> in the first place?
Hi Christian,

On Xe-HP and later devices, we use a dedicated compression control state (CCS),
stored in local memory for each surface, to support the 3D and media
compression formats.

The memory required for the CCS of the entire local memory is 1/256 of the
local memory size. So before the kernel boots, the required memory is reserved
for the CCS data, and a secure register is programmed with the CCS base
address.

So when we allocate an object in local memory we don't need to explicitly
allocate the space for the ccs data. But when we evict the obj into smem,
to hold the compression related data along with the obj we need smem
space of obj_size + (obj_size/256).

Hence, when we create the smem backing for an obj with an lmem placement
possibility, we create it with the extra space.

 Ram.
> 
> Regards,
> Christian.
> 
> > 
> > This will be on top of the flat-ccs enabling series
> > https://patchwork.freedesktop.org/series/95686/
> > 
> > For more about flat-ccs feature please have a look at
> > https://patchwork.freedesktop.org/patch/471777/?series=95686&rev=5
> > 
> > Testing of the series is WIP and we are looking forward to early review on
> > the amendment to ttm_tt_init and the approach.
> > 
> > Ramalingam C (2):
> >    drm/i915/ttm: Add extra pages for handling ccs data
> >    drm/i915/migrate: Evict and restore the ccs data
> > 
> >   drivers/gpu/drm/drm_gem_vram_helper.c      |   2 +-
> >   drivers/gpu/drm/i915/gem/i915_gem_ttm.c    |  23 +-
> >   drivers/gpu/drm/i915/gt/intel_migrate.c    | 283 +++++++++++----------
> >   drivers/gpu/drm/qxl/qxl_ttm.c              |   2 +-
> >   drivers/gpu/drm/ttm/ttm_agp_backend.c      |   2 +-
> >   drivers/gpu/drm/ttm/ttm_tt.c               |  12 +-
> >   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |   2 +-
> >   include/drm/ttm/ttm_tt.h                   |   4 +-
> >   8 files changed, 191 insertions(+), 139 deletions(-)
> > 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [RFC 0/2] drm/i915/ttm: Evict and store of compressed object
  2022-02-07 13:53     ` [Intel-gfx] " Ramalingam C
@ 2022-02-07 14:37       ` Christian König
  -1 siblings, 0 replies; 29+ messages in thread
From: Christian König @ 2022-02-07 14:37 UTC (permalink / raw)
  To: Ramalingam C; +Cc: intel-gfx, Hellstrom Thomas, dri-devel

On 07.02.22 at 14:53, Ramalingam C wrote:
> On 2022-02-07 at 12:41:59 +0100, Christian König wrote:
>> On 07.02.22 at 10:37, Ramalingam C wrote:
>>> On flat-ccs capable platforms we need to evict and restore the ccs data
>>> along with the corresponding main memory.
>>>
>>> This ccs data can only be accessed through the BLT engine, via a special
>>> cmd ( )
>>>
>>> To support the above requirement on flat-ccs enabled i915 platforms, this
>>> series adds a new param called ccs_pages_needed to ttm_tt_init(),
>>> to increase ttm_tt->num_pages for system memory when the obj has the
>>> lmem placement possibility.
>> Well, the question is why isn't the buffer object allocated with the extra space
>> in the first place?
> Hi Christian,
>
> On Xe-HP and later devices, we use a dedicated compression control state (CCS),
> stored in local memory for each surface, to support the 3D and media
> compression formats.
>
> The memory required for the CCS of the entire local memory is 1/256 of the
> local memory size. So before the kernel boots, the required memory is reserved
> for the CCS data, and a secure register is programmed with the CCS base
> address.
>
> So when we allocate an object in local memory we don't need to explicitly
> allocate the space for the ccs data. But when we evict the obj into smem,
>   to hold the compression related data along with the obj we need smem
>   space of obj_size + (obj_size/256).
>
>   Hence, when we create the smem backing for an obj with an lmem placement
>   possibility, we create it with the extra space.

That's exactly what I've been missing in the cover letter and/or commit 
messages, comments, etc.

Overall this sounds like a valid explanation to me; just one comment on the 
code/naming:

>   int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> -		uint32_t page_flags, enum ttm_caching caching)
> +		uint32_t page_flags, enum ttm_caching caching,
> +		unsigned long ccs_pages)

Please don't try to leak any i915 specific stuff into common TTM code.

For example use the wording extra_pages instead of ccs_pages here.

Apart from that looks good to me,
Christian.

>
>   Ram.
>> Regards,
>> Christian.
>>
>>> This will be on top of the flat-ccs enabling series
>>> https://patchwork.freedesktop.org/series/95686/
>>>
>>> For more about flat-ccs feature please have a look at
>>> https://patchwork.freedesktop.org/patch/471777/?series=95686&rev=5
>>>
>>> Testing of the series is WIP and we are looking forward to early review on
>>> the amendment to ttm_tt_init and the approach.
>>>
>>> Ramalingam C (2):
>>>     drm/i915/ttm: Add extra pages for handling ccs data
>>>     drm/i915/migrate: Evict and restore the ccs data
>>>
>>>    drivers/gpu/drm/drm_gem_vram_helper.c      |   2 +-
>>>    drivers/gpu/drm/i915/gem/i915_gem_ttm.c    |  23 +-
>>>    drivers/gpu/drm/i915/gt/intel_migrate.c    | 283 +++++++++++----------
>>>    drivers/gpu/drm/qxl/qxl_ttm.c              |   2 +-
>>>    drivers/gpu/drm/ttm/ttm_agp_backend.c      |   2 +-
>>>    drivers/gpu/drm/ttm/ttm_tt.c               |  12 +-
>>>    drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |   2 +-
>>>    include/drm/ttm/ttm_tt.h                   |   4 +-
>>>    8 files changed, 191 insertions(+), 139 deletions(-)
>>>


^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [RFC 0/2] drm/i915/ttm: Evict and store of compressed object
  2022-02-07 14:37       ` [Intel-gfx] " Christian König
@ 2022-02-07 14:47         ` C, Ramalingam
  -1 siblings, 0 replies; 29+ messages in thread
From: C, Ramalingam @ 2022-02-07 14:47 UTC (permalink / raw)
  To: Christian König; +Cc: intel-gfx, Hellstrom, Thomas, dri-devel

On 2022-02-07 at 15:37:09 +0100, Christian König wrote:
> On 07.02.22 at 14:53, Ramalingam C wrote:
> > On 2022-02-07 at 12:41:59 +0100, Christian König wrote:
> > > On 07.02.22 at 10:37, Ramalingam C wrote:
> > > > On flat-ccs capable platforms we need to evict and restore the ccs data
> > > > along with the corresponding main memory.
> > > > 
> > > > This ccs data can only be accessed through the BLT engine, via a special
> > > > cmd ( )
> > > > 
> > > > To support the above requirement on flat-ccs enabled i915 platforms, this
> > > > series adds a new param called ccs_pages_needed to ttm_tt_init(),
> > > > to increase ttm_tt->num_pages for system memory when the obj has the
> > > > lmem placement possibility.
> > > Well, the question is why isn't the buffer object allocated with the extra space
> > > in the first place?
> > Hi Christian,
> > 
> > On Xe-HP and later devices, we use a dedicated compression control state (CCS),
> > stored in local memory for each surface, to support the 3D and media
> > compression formats.
> > 
> > The memory required for the CCS of the entire local memory is 1/256 of the
> > local memory size. So before the kernel boots, the required memory is reserved
> > for the CCS data, and a secure register is programmed with the CCS base
> > address.
> > 
> > So when we allocate an object in local memory we don't need to explicitly
> > allocate the space for the ccs data. But when we evict the obj into smem,
> >   to hold the compression related data along with the obj we need smem
> >   space of obj_size + (obj_size/256).
> > 
> >   Hence, when we create the smem backing for an obj with an lmem placement
> >   possibility, we create it with the extra space.
> 
> That's exactly what I've been missing in the cover letter and/or commit
> messages, comments, etc.
> 
> Overall this sounds like a valid explanation to me; just one comment on the
> code/naming:
> 
> >   int ttm_tt_init(struct ttm_tt *ttm, struct ttm_buffer_object *bo,
> > -		uint32_t page_flags, enum ttm_caching caching)
> > +		uint32_t page_flags, enum ttm_caching caching,
> > +		unsigned long ccs_pages)
> 
> Please don't try to leak any i915 specific stuff into common TTM code.
> 
> For example use the wording extra_pages instead of ccs_pages here.
> 
> Apart from that looks good to me,

Thank you. I will address the comments on naming.

Ram
> Christian.
> 
> > 
> >   Ram.
> > > Regards,
> > > Christian.
> > > 
> > > > This will be on top of the flat-ccs enabling series
> > > > https://patchwork.freedesktop.org/series/95686/
> > > > 
> > > > For more about flat-ccs feature please have a look at
> > > > https://patchwork.freedesktop.org/patch/471777/?series=95686&rev=5
> > > > 
> > > > Testing of the series is WIP and we are looking forward to early review on
> > > > the amendment to ttm_tt_init and the approach.
> > > > 
> > > > Ramalingam C (2):
> > > >     drm/i915/ttm: Add extra pages for handling ccs data
> > > >     drm/i915/migrate: Evict and restore the ccs data
> > > > 
> > > >    drivers/gpu/drm/drm_gem_vram_helper.c      |   2 +-
> > > >    drivers/gpu/drm/i915/gem/i915_gem_ttm.c    |  23 +-
> > > >    drivers/gpu/drm/i915/gt/intel_migrate.c    | 283 +++++++++++----------
> > > >    drivers/gpu/drm/qxl/qxl_ttm.c              |   2 +-
> > > >    drivers/gpu/drm/ttm/ttm_agp_backend.c      |   2 +-
> > > >    drivers/gpu/drm/ttm/ttm_tt.c               |  12 +-
> > > >    drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |   2 +-
> > > >    include/drm/ttm/ttm_tt.h                   |   4 +-
> > > >    8 files changed, 191 insertions(+), 139 deletions(-)
> > > > 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread


* Re: [Intel-gfx] [RFC 0/2] drm/i915/ttm: Evict and store of compressed object
  2022-02-07 13:53     ` [Intel-gfx] " Ramalingam C
  (?)
  (?)
@ 2022-02-07 14:49     ` Das, Nirmoy
  -1 siblings, 0 replies; 29+ messages in thread
From: Das, Nirmoy @ 2022-02-07 14:49 UTC (permalink / raw)
  To: Ramalingam C, Christian König; +Cc: intel-gfx, Hellstrom Thomas, dri-devel

Thanks for the clarification, Ram!

On 07/02/2022 14:53, Ramalingam C wrote:
> On 2022-02-07 at 12:41:59 +0100, Christian König wrote:
>> Am 07.02.22 um 10:37 schrieb Ramalingam C:
>>> On flat-ccs capable platforms we need to evict and restore the ccs data
>>> along with the corresponding main memory.
>>>
>>> This ccs data can only be accessed through the BLT engine via a special
>>> cmd (XY_CTRL_SURF_COPY_BLT)
>>>
>>> To support the above requirement on flat-ccs enabled i915 platforms, this
>>> series adds a new param called ccs_pages_needed to ttm_tt_init(),
>>> to increase ttm_tt->num_pages of system memory when the obj has a
>>> lmem placement possibility.
>> Well, the question is why isn't the buffer object allocated with the extra
>> space in the first place?
> Hi Christian,
>
> On Xe-HP and later devices, we use a dedicated compression control state (CCS)
> stored in local memory for each surface, to support the 3D and media
> compression formats.
>
> The memory required for the CCS of the entire local memory is 1/256 of the
> local memory size. So before the kernel boots, the required memory is reserved
> for the CCS data and a secure register is programmed with the CCS base
> address.
>
> So when we allocate an object in local memory we don't need to explicitly
> allocate space for the ccs data. But when we evict the obj into smem,
> to hold the compression related data along with the obj we need smem
> space of obj_size + (obj_size/256).
>
> Hence when we create smem backing for an obj with a lmem placement
> possibility, we create it with the extra space.
>
> Ram.
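The 1/256 sizing rule described above can be sketched as a tiny helper. This is an illustrative stand-alone version, not the actual i915 code (which threads this value through ttm_tt_init()); the macro name and helper are hypothetical:

```c
#include <assert.h>

/* Flat-CCS metadata occupies 1/256th of the main surface size. */
#define NUM_BYTES_PER_CCS_BYTE	256

/*
 * Hypothetical helper: extra system-memory pages needed to also hold
 * the CCS data of an object backed by obj_pages pages, rounded up so
 * a partial CCS page still gets a whole backing page.
 */
static unsigned long ccs_pages_needed(unsigned long obj_pages)
{
	return (obj_pages + NUM_BYTES_PER_CCS_BYTE - 1) /
	       NUM_BYTES_PER_CCS_BYTE;
}
```

So a 64 MiB lmem-capable object (16384 4K pages) would get 16384 + 64 pages of smem backing when it has to be evictable.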
>> Regards,
>> Christian.
>>
>>> This will be on top of the flat-ccs enabling series
>>> https://patchwork.freedesktop.org/series/95686/
>>>
>>> For more about flat-ccs feature please have a look at
>>> https://patchwork.freedesktop.org/patch/471777/?series=95686&rev=5
>>>
>>> Testing of the series is WIP; looking forward to early review of
>>> the amendment to ttm_tt_init and the approach.
>>>
>>> Ramalingam C (2):
>>>     drm/i915/ttm: Add extra pages for handling ccs data
>>>     drm/i915/migrate: Evict and restore the ccs data
>>>
>>>    drivers/gpu/drm/drm_gem_vram_helper.c      |   2 +-
>>>    drivers/gpu/drm/i915/gem/i915_gem_ttm.c    |  23 +-
>>>    drivers/gpu/drm/i915/gt/intel_migrate.c    | 283 +++++++++++----------
>>>    drivers/gpu/drm/qxl/qxl_ttm.c              |   2 +-
>>>    drivers/gpu/drm/ttm/ttm_agp_backend.c      |   2 +-
>>>    drivers/gpu/drm/ttm/ttm_tt.c               |  12 +-
>>>    drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c |   2 +-
>>>    include/drm/ttm/ttm_tt.h                   |   4 +-
>>>    8 files changed, 191 insertions(+), 139 deletions(-)
>>>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
  2022-02-07  9:37   ` [Intel-gfx] " Ramalingam C
@ 2022-02-07 14:55     ` Hellstrom, Thomas
  -1 siblings, 0 replies; 29+ messages in thread
From: Hellstrom, Thomas @ 2022-02-07 14:55 UTC (permalink / raw)
  To: dri-devel, C, Ramalingam, intel-gfx; +Cc: christian.koenig

Hi, Ram,

A couple of quick questions before starting a more detailed review:

1) Does this also support migration of compressed data LMEM->LMEM?
What about inter-tile migration?

2) Do we need to block faulting of compressed data in the fault handler
as a follow-up patch?

/Thomas


On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> When we are swapping out the local memory obj on a flat-ccs capable
> platform, we need to capture the ccs data too along with the main memory,
> and we need to restore it when we are swapping in the content.
> 
> Extracting and restoring the CCS data is done through a special cmd
> called XY_CTRL_SURF_COPY_BLT
> 
> Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++-----------
>  1 file changed, 155 insertions(+), 128 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> index 5bdab0b3c735..e60ae6ff1847 100644
> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver, u32 size)
>         return height % 4 == 3 && height <= 8;
>  }
>  
> +/**
> + * DOC: Flat-CCS - Memory compression for Local memory
> + *
> + * On Xe-HP and later devices, we use a dedicated compression control
> + * state (CCS) stored in local memory for each surface, to support the
> + * 3D and media compression formats.
> + *
> + * The memory required for the CCS of the entire local memory is 1/256
> + * of the local memory size. So before the kernel boots, the required
> + * memory is reserved for the CCS data and a secure register is
> + * programmed with the CCS base address.
> + *
> + * Flat CCS data needs to be cleared when a lmem object is allocated.
> + * CCS data can be copied in and out of the CCS region through
> + * XY_CTRL_SURF_COPY_BLT. The CPU can't access the CCS data directly.
> + *
> + * When we exhaust the lmem, if the object's placements support smem,
> + * then we can directly decompress the compressed lmem object into smem
> + * and start using it from smem itself.
> + *
> + * But when we need to swap out the compressed lmem object into a smem
> + * region even though the object's placement doesn't support smem, then
> + * we copy the lmem content as it is into the smem region along with the
> + * ccs data (using XY_CTRL_SURF_COPY_BLT). When the object is referenced,
> + * the lmem content will be swapped in along with restoration of the CCS
> + * data (using XY_CTRL_SURF_COPY_BLT) at the corresponding location.
> + *
> + *
> + * Flat-CCS Modifiers for different compression formats
> + * ----------------------------------------------------
> + *
> + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate buffers of the
> + * Flat CCS render compression formats. Though the general layout is the
> + * same as I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, a new hashing/compression
> + * algorithm is used. Render compression uses 128 byte compression blocks.
> + *
> + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS - used to indicate buffers of the
> + * Flat CCS media compression formats. Though the general layout is the
> + * same as I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, a new hashing/compression
> + * algorithm is used. Media compression uses 256 byte compression blocks.
> + *
> + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate buffers of the
> + * Flat CCS clear color render compression formats. Unified compression
> + * format for clear color render compression. The general layout is a tiled
> + * layout using 4Kb tiles, i.e. the Tile4 layout.
> + */
> +
> +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> +{
> +       /* Mask the 3 LSB to use the PPGTT address space */
> +       *cmd++ = MI_FLUSH_DW | flags;
> +       *cmd++ = lower_32_bits(dst);
> +       *cmd++ = upper_32_bits(dst);
> +
> +       return cmd;
> +}
> +
> +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
> +{
> +       u32 num_cmds, num_blks, total_size;
> +
> +       if (!GET_CCS_SIZE(i915, size))
> +               return 0;
> +
> +       /*
> +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> +        * blocks. One XY_CTRL_SURF_COPY_BLT command can
> +        * transfer up to 1024 blocks.
> +        */
> +       num_blks = GET_CCS_SIZE(i915, size);
> +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> +
> +       /*
> +        * We need to add a flush before and after
> +        * XY_CTRL_SURF_COPY_BLT
> +        */
> +       total_size += 2 * MI_FLUSH_DW_SIZE;
> +       return total_size;
> +}
> +
> +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
> +                                    u8 src_mem_access, u8 dst_mem_access,
> +                                    int src_mocs, int dst_mocs,
> +                                    u16 num_ccs_blocks)
> +{
> +       int i = num_ccs_blocks;
> +
> +       /*
> +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
> +        * data in and out of the CCS region.
> +        *
> +        * We can copy at most 1024 blocks of 256 bytes using one
> +        * XY_CTRL_SURF_COPY_BLT instruction.
> +        *
> +        * In case we need to copy more than 1024 blocks, we need to add
> +        * another instruction to the same batch buffer.
> +        *
> +        * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
> +        *
> +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> +        */
> +       do {
> +               /*
> +                * We use a bitwise AND with 1023 since the size field
> +                * takes values in the range 0 - 1023.
> +                */
> +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> +                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> +                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> +               *cmd++ = lower_32_bits(src_addr);
> +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> +               *cmd++ = lower_32_bits(dst_addr);
> +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> +               src_addr += SZ_64M;
> +               dst_addr += SZ_64M;
> +               i -= NUM_CCS_BLKS_PER_XFER;
> +       } while (i > 0);
> +
> +       return cmd;
> +}
> +
>  static int emit_copy(struct i915_request *rq,
> -                    u32 dst_offset, u32 src_offset, int size)
> +                    bool dst_is_lmem, u32 dst_offset,
> +                    bool src_is_lmem, u32 src_offset, int size)
>  {
> +       struct drm_i915_private *i915 = rq->engine->i915;
>         const int ver = GRAPHICS_VER(rq->engine->i915);
>         u32 instance = rq->engine->instance;
> +       u32 num_ccs_blks, ccs_ring_size;
> +       u8 src_access, dst_access;
>         u32 *cs;
>  
> -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) && HAS_FLAT_CCS(i915)) ?
> +                        calc_ctrl_surf_instr_size(i915, size) : 0;
> +
> +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
>         if (IS_ERR(cs))
>                 return PTR_ERR(cs);
>  
> @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
>                 *cs++ = src_offset;
>         }
>  
> +       if (ccs_ring_size) {
> +               /* TODO: Migration needs to be handled with resolve of compressed data */
> +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> +                               NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
> +
> +               src_access = !src_is_lmem && dst_is_lmem;
> +               dst_access = !src_access;
> +
> +               if (src_access) /* Swapin of compressed data */
> +                       src_offset += size;
> +               else
> +                       dst_offset += size;
> +
> +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset, dst_offset,
> +                                             src_access, dst_access,
> +                                             1, 1, num_ccs_blks);
> +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
> +       }
> +
>         intel_ring_advance(rq, cs);
>         return 0;
>  }
> @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context *ce,
>                 if (err)
>                         goto out_rq;
>  
> -               err = emit_copy(rq, dst_offset, src_offset, len);
> +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> +                               src_is_lmem, src_offset, len);
>  
>                 /* Arbitration is re-enabled between requests. */
>  out_rq:
> @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context *ce,
>         return err;
>  }
>  
> -/**
> - * DOC: Flat-CCS - Memory compression for Local memory
> - *
> - * On Xe-HP and later devices, we use a dedicated compression control
> - * state (CCS) stored in local memory for each surface, to support the
> - * 3D and media compression formats.
> - *
> - * The memory required for the CCS of the entire local memory is 1/256
> - * of the local memory size. So before the kernel boots, the required
> - * memory is reserved for the CCS data and a secure register is
> - * programmed with the CCS base address.
> - *
> - * Flat CCS data needs to be cleared when a lmem object is allocated.
> - * CCS data can be copied in and out of the CCS region through
> - * XY_CTRL_SURF_COPY_BLT. The CPU can't access the CCS data directly.
> - *
> - * When we exhaust the lmem, if the object's placements support smem,
> - * then we can directly decompress the compressed lmem object into smem
> - * and start using it from smem itself.
> - *
> - * But when we need to swap out the compressed lmem object into a smem
> - * region even though the object's placement doesn't support smem, then
> - * we copy the lmem content as it is into the smem region along with the
> - * ccs data (using XY_CTRL_SURF_COPY_BLT). When the object is referenced,
> - * the lmem content will be swapped in along with restoration of the CCS
> - * data (using XY_CTRL_SURF_COPY_BLT) at the corresponding location.
> - *
> - *
> - * Flat-CCS Modifiers for different compression formats
> - * ----------------------------------------------------
> - *
> - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate buffers of the
> - * Flat CCS render compression formats. Though the general layout is the
> - * same as I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, a new hashing/compression
> - * algorithm is used. Render compression uses 128 byte compression blocks.
> - *
> - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS - used to indicate buffers of the
> - * Flat CCS media compression formats. Though the general layout is the
> - * same as I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, a new hashing/compression
> - * algorithm is used. Media compression uses 256 byte compression blocks.
> - *
> - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate buffers of the
> - * Flat CCS clear color render compression formats. Unified compression
> - * format for clear color render compression. The general layout is a tiled
> - * layout using 4Kb tiles, i.e. the Tile4 layout.
> - */
> -
> -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> -{
> -       /* Mask the 3 LSB to use the PPGTT address space */
> -       *cmd++ = MI_FLUSH_DW | flags;
> -       *cmd++ = lower_32_bits(dst);
> -       *cmd++ = upper_32_bits(dst);
> -
> -       return cmd;
> -}
> -
> -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
> -{
> -       u32 num_cmds, num_blks, total_size;
> -
> -       if (!GET_CCS_SIZE(i915, size))
> -               return 0;
> -
> -       /*
> -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> -        * blocks. One XY_CTRL_SURF_COPY_BLT command can
> -        * transfer up to 1024 blocks.
> -        */
> -       num_blks = GET_CCS_SIZE(i915, size);
> -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> -
> -       /*
> -        * We need to add a flush before and after
> -        * XY_CTRL_SURF_COPY_BLT
> -        */
> -       total_size += 2 * MI_FLUSH_DW_SIZE;
> -       return total_size;
> -}
> -
> -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
> -                                    u8 src_mem_access, u8 dst_mem_access,
> -                                    int src_mocs, int dst_mocs,
> -                                    u16 num_ccs_blocks)
> -{
> -       int i = num_ccs_blocks;
> -
> -       /*
> -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
> -        * data in and out of the CCS region.
> -        *
> -        * We can copy at most 1024 blocks of 256 bytes using one
> -        * XY_CTRL_SURF_COPY_BLT instruction.
> -        *
> -        * In case we need to copy more than 1024 blocks, we need to add
> -        * another instruction to the same batch buffer.
> -        *
> -        * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
> -        *
> -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> -        */
> -       do {
> -               /*
> -                * We use a bitwise AND with 1023 since the size field
> -                * takes values in the range 0 - 1023.
> -                */
> -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> -                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> -                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> -               *cmd++ = lower_32_bits(src_addr);
> -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> -               *cmd++ = lower_32_bits(dst_addr);
> -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> -               src_addr += SZ_64M;
> -               dst_addr += SZ_64M;
> -               i -= NUM_CCS_BLKS_PER_XFER;
> -       } while (i > 0);
> -
> -       return cmd;
> -}
> -
>  static int emit_clear(struct i915_request *rq,
>                       u64 offset,
>                       int size,
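As a footnote to the sizing arithmetic in calc_ctrl_surf_instr_size() quoted above: one XY_CTRL_SURF_COPY_BLT moves at most 1024 blocks of 256 bytes of CCS (256 KiB), and since CCS is 1/256th of the surface, each instruction covers 64 MiB of LMEM. A stand-alone sketch of that instruction count follows; the macro values mirror the patch, but the helper itself is illustrative, not the i915 implementation:

```c
#include <assert.h>

#define NUM_CCS_BYTES_PER_BLOCK	256	/* CCS granule moved by the blitter */
#define NUM_CCS_BLKS_PER_XFER	1024	/* max blocks per XY_CTRL_SURF_COPY_BLT */

/*
 * Illustrative helper: number of XY_CTRL_SURF_COPY_BLT instructions
 * needed for the CCS of lmem_size bytes of local memory, assuming the
 * CCS size is lmem_size / 256 as described in the DOC comment above.
 */
static unsigned int ctrl_surf_cmds_needed(unsigned long lmem_size)
{
	unsigned long ccs_bytes = lmem_size / 256;
	unsigned long num_blks = (ccs_bytes + NUM_CCS_BYTES_PER_BLOCK - 1) /
				 NUM_CCS_BYTES_PER_BLOCK;

	return (num_blks + NUM_CCS_BLKS_PER_XFER - 1) / NUM_CCS_BLKS_PER_XFER;
}
```

In other words, one instruction per 64 MiB of LMEM, which is also why _i915_ctrl_surf_copy_blt() advances src_addr and dst_addr by SZ_64M per loop iteration.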

----------------------------------------------------------------------
Intel Sweden AB
Registered Office: Isafjordsgatan 30B, 164 40 Kista, Stockholm, Sweden
Registration Number: 556189-6027

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
@ 2022-02-07 14:55     ` Hellstrom, Thomas
  0 siblings, 0 replies; 29+ messages in thread
From: Hellstrom, Thomas @ 2022-02-07 14:55 UTC (permalink / raw)
  To: dri-devel, C, Ramalingam, intel-gfx; +Cc: christian.koenig

Hi, Ram,

A couple of quick questions before starting a more detailed review:

1) Does this also support migrating of compressed data LMEM->LMEM?
What-about inter-tile?

2) Do we need to block faulting of compressed data in the fault handler
as a follow-up patch?

/Thomas


On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> When we are swapping out the local memory obj on flat-ccs capable
> platform,
> we need to capture the ccs data too along with main meory and we need
> to
> restore it when we are swapping in the content.
> 
> Extracting and restoring the CCS data is done through a special cmd
> called
> XY_CTRL_SURF_COPY_BLT
> 
> Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++---------
> --
>  1 file changed, 155 insertions(+), 128 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> b/drivers/gpu/drm/i915/gt/intel_migrate.c
> index 5bdab0b3c735..e60ae6ff1847 100644
> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver, u32
> size)
>         return height % 4 == 3 && height <= 8;
>  }
>  
> +/**
> + * DOC: Flat-CCS - Memory compression for Local memory
> + *
> + * On Xe-HP and later devices, we use dedicated compression control
> state (CCS)
> + * stored in local memory for each surface, to support the 3D and
> media
> + * compression formats.
> + *
> + * The memory required for the CCS of the entire local memory is
> 1/256 of the
> + * local memory size. So before the kernel boot, the required memory
> is reserved
> + * for the CCS data and a secure register will be programmed with
> the CCS base
> + * address.
> + *
> + * Flat CCS data needs to be cleared when a lmem object is
> allocated.
> + * And CCS data can be copied in and out of CCS region through
> + * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
> + *
> + * When we exaust the lmem, if the object's placements support smem,
> then we can
> + * directly decompress the compressed lmem object into smem and
> start using it
> + * from smem itself.
> + *
> + * But when we need to swapout the compressed lmem object into a
> smem region
> + * though objects' placement doesn't support smem, then we copy the
> lmem content
> + * as it is into smem region along with ccs data (using
> XY_CTRL_SURF_COPY_BLT).
> + * When the object is referred, lmem content will be swaped in along
> with
> + * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> corresponding
> + * location.
> + *
> + *
> + * Flat-CCS Modifiers for different compression formats
> + * ----------------------------------------------------
> + *
> + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers
> of Flat CCS
> + * render compression formats. Though the general layout is same as
> + * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> algorithm is
> + * used. Render compression uses 128 byte compression blocks
> + *
> + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers
> of Flat CCS
> + * media compression formats. Though the general layout is same as
> + * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> algorithm is
> + * used. Media compression uses 256 byte compression blocks.
> + *
> + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> buffers of Flat
> + * CCS clear color render compression formats. Unified compression
> format for
> + * clear color render compression. The genral layout is a tiled
> layout using
> + * 4Kb tiles i.e Tile4 layout.
> + */
> +
> +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> +{
> +       /* Mask the 3 LSB to use the PPGTT address space */
> +       *cmd++ = MI_FLUSH_DW | flags;
> +       *cmd++ = lower_32_bits(dst);
> +       *cmd++ = upper_32_bits(dst);
> +
> +       return cmd;
> +}
> +
> +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915,
> int size)
> +{
> +       u32 num_cmds, num_blks, total_size;
> +
> +       if (!GET_CCS_SIZE(i915, size))
> +               return 0;
> +
> +       /*
> +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> +        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> +        * trnasfer upto 1024 blocks.
> +        */
> +       num_blks = GET_CCS_SIZE(i915, size);
> +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> +
> +       /*
> +        * We need to add a flush before and after
> +        * XY_CTRL_SURF_COPY_BLT
> +        */
> +       total_size += 2 * MI_FLUSH_DW_SIZE;
> +       return total_size;
> +}
> +
> +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> dst_addr,
> +                                    u8 src_mem_access, u8
> dst_mem_access,
> +                                    int src_mocs, int dst_mocs,
> +                                    u16 num_ccs_blocks)
> +{
> +       int i = num_ccs_blocks;
> +
> +       /*
> +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the
> CCS
> +        * data in and out of the CCS region.
> +        *
> +        * We can copy at most 1024 blocks of 256 bytes using one
> +        * XY_CTRL_SURF_COPY_BLT instruction.
> +        *
> +        * In case we need to copy more than 1024 blocks, we need to
> add
> +        * another instruction to the same batch buffer.
> +        *
> +        * 1024 blocks of 256 bytes of CCS represent a total 256KB of
> CCS.
> +        *
> +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> +        */
> +       do {
> +               /*
> +                * We use logical AND with 1023 since the size field
> +                * takes values which is in the range of 0 - 1023
> +                */
> +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> +                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> +                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> +               *cmd++ = lower_32_bits(src_addr);
> +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> +               *cmd++ = lower_32_bits(dst_addr);
> +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> +               src_addr += SZ_64M;
> +               dst_addr += SZ_64M;
> +               i -= NUM_CCS_BLKS_PER_XFER;
> +       } while (i > 0);
> +
> +       return cmd;
> +}
> +
>  static int emit_copy(struct i915_request *rq,
> -                    u32 dst_offset, u32 src_offset, int size)
> +                    bool dst_is_lmem, u32 dst_offset,
> +                    bool src_is_lmem, u32 src_offset, int size)
>  {
> +       struct drm_i915_private *i915 = rq->engine->i915;
>         const int ver = GRAPHICS_VER(rq->engine->i915);
>         u32 instance = rq->engine->instance;
> +       u32 num_ccs_blks, ccs_ring_size;
> +       u8 src_access, dst_access;
>         u32 *cs;
>  
> -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) &&
> HAS_FLAT_CCS(i915)) ?
> +                        calc_ctrl_surf_instr_size(i915, size) : 0;
> +
> +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
>         if (IS_ERR(cs))
>                 return PTR_ERR(cs);
>  
> @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
>                 *cs++ = src_offset;
>         }
>  
> +       if (ccs_ring_size) {
> +               /* TODO: Migration needs to be handled with resolve
> of compressed data */
> +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> +                               NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
> +
> +               src_access = !src_is_lmem && dst_is_lmem;
> +               dst_access = !src_access;
> +
> +               if (src_access) /* Swapin of compressed data */
> +                       src_offset += size;
> +               else
> +                       dst_offset += size;
> +
> +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset,
> dst_offset,
> +                                             src_access, dst_access,
> +                                             1, 1, num_ccs_blks);
> +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC |
> MI_FLUSH_CCS);
> +       }
> +
>         intel_ring_advance(rq, cs);
>         return 0;
>  }
> @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context
> *ce,
>                 if (err)
>                         goto out_rq;
>  
> -               err = emit_copy(rq, dst_offset, src_offset, len);
> +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> +                               src_is_lmem, src_offset, len);
>  
>                 /* Arbitration is re-enabled between requests. */
>  out_rq:
> @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context
> *ce,
>         return err;
>  }
>  
> -/**
> - * DOC: Flat-CCS - Memory compression for Local memory
> - *
> - * On Xe-HP and later devices, we use dedicated compression control
> state (CCS)
> - * stored in local memory for each surface, to support the 3D and
> media
> - * compression formats.
> - *
> - * The memory required for the CCS of the entire local memory is
> 1/256 of the
> - * local memory size. So before the kernel boot, the required memory
> is reserved
> - * for the CCS data and a secure register will be programmed with
> the CCS base
> - * address.
> - *
> - * Flat CCS data needs to be cleared when a lmem object is
> allocated.
> - * And CCS data can be copied in and out of CCS region through
> - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
> - *
> - * When we exaust the lmem, if the object's placements support smem,
> then we can
> - * directly decompress the compressed lmem object into smem and
> start using it
> - * from smem itself.
> - *
> - * But when we need to swapout the compressed lmem object into a
> smem region
> - * though objects' placement doesn't support smem, then we copy the
> lmem content
> - * as it is into smem region along with ccs data (using
> XY_CTRL_SURF_COPY_BLT).
> - * When the object is referred, lmem content will be swaped in along
> with
> - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> corresponding
> - * location.
> - *
> - *
> - * Flat-CCS Modifiers for different compression formats
> - * ----------------------------------------------------
> - *
> - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers
> of Flat CCS
> - * render compression formats. Though the general layout is same as
> - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> algorithm is
> - * used. Render compression uses 128 byte compression blocks
> - *
> - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers
> of Flat CCS
> - * media compression formats. Though the general layout is same as
> - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> algorithm is
> - * used. Media compression uses 256 byte compression blocks.
> - *
> - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> buffers of Flat
> - * CCS clear color render compression formats. Unified compression
> format for
> - * clear color render compression. The genral layout is a tiled
> layout using
> - * 4Kb tiles i.e Tile4 layout.
> - */
> -
> -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> -{
> -       /* Mask the 3 LSB to use the PPGTT address space */
> -       *cmd++ = MI_FLUSH_DW | flags;
> -       *cmd++ = lower_32_bits(dst);
> -       *cmd++ = upper_32_bits(dst);
> -
> -       return cmd;
> -}
> -
> -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915,
> int size)
> -{
> -       u32 num_cmds, num_blks, total_size;
> -
> -       if (!GET_CCS_SIZE(i915, size))
> -               return 0;
> -
> -       /*
> -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> -        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> -        * trnasfer upto 1024 blocks.
> -        */
> -       num_blks = GET_CCS_SIZE(i915, size);
> -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> -
> -       /*
> -        * We need to add a flush before and after
> -        * XY_CTRL_SURF_COPY_BLT
> -        */
> -       total_size += 2 * MI_FLUSH_DW_SIZE;
> -       return total_size;
> -}
> -
> -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> dst_addr,
> -                                    u8 src_mem_access, u8
> dst_mem_access,
> -                                    int src_mocs, int dst_mocs,
> -                                    u16 num_ccs_blocks)
> -{
> -       int i = num_ccs_blocks;
> -
> -       /*
> -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the
> CCS
> -        * data in and out of the CCS region.
> -        *
> -        * We can copy at most 1024 blocks of 256 bytes using one
> -        * XY_CTRL_SURF_COPY_BLT instruction.
> -        *
> -        * In case we need to copy more than 1024 blocks, we need to
> add
> -        * another instruction to the same batch buffer.
> -        *
> -        * 1024 blocks of 256 bytes of CCS represent a total 256KB of
> CCS.
> -        *
> -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> -        */
> -       do {
> -               /*
> -                * We use logical AND with 1023 since the size field
> -                * takes values which is in the range of 0 - 1023
> -                */
> -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> -                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> -                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> -               *cmd++ = lower_32_bits(src_addr);
> -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> -               *cmd++ = lower_32_bits(dst_addr);
> -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> -               src_addr += SZ_64M;
> -               dst_addr += SZ_64M;
> -               i -= NUM_CCS_BLKS_PER_XFER;
> -       } while (i > 0);
> -
> -       return cmd;
> -}
> -
>  static int emit_clear(struct i915_request *rq,
>                       u64 offset,
>                       int size,


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
  2022-02-07 14:55     ` [Intel-gfx] " Hellstrom, Thomas
@ 2022-02-07 15:14       ` Ramalingam C
  -1 siblings, 0 replies; 29+ messages in thread
From: Ramalingam C @ 2022-02-07 15:14 UTC (permalink / raw)
  To: Hellstrom, Thomas; +Cc: intel-gfx, christian.koenig, dri-devel

On 2022-02-07 at 20:25:42 +0530, Hellstrom, Thomas wrote:
> Hi, Ram,
> 
> A couple of quick questions before starting a more detailed review:
> 
> 1) Does this also support migrating of compressed data LMEM->LMEM?
> What-about inter-tile?
Honestly, this series mainly focused on eviction of lmem into smem and
restoration of the same.

To cover migration, we need to handle it differently from eviction,
because when we migrate the compressed content we need to be able to
use it from the new placement; we can't keep the ccs data separately.

Migration of lmem->smem needs decompression incorporated.
Migration of lmem_m->lmem_n needs to maintain the
compressed/decompressed state as it is.

So we need to pass the information up to emit_copy to differentiate
eviction from migration.

If you don't have an objection, I would like to take up migration once
we have the eviction of lmem in place.
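For clarity, the distinction I mean could be sketched roughly like
this, for the eviction/restore cases only (a hypothetical standalone
helper with made-up names, not the actual i915 code; lmem->lmem
migration is exactly the open case discussed above and is left out):

```c
#include <assert.h>
#include <stdbool.h>

enum ccs_copy_mode {
	CCS_NONE,	/* plain copy, no control-surface transfer */
	CCS_EVICT,	/* lmem -> smem: save raw CCS alongside main data */
	CCS_RESTORE,	/* smem -> lmem: put the saved CCS back in place */
};

/*
 * Eviction keeps the compressed payload plus its CCS verbatim;
 * a real migration to smem would instead need a resolving
 * (decompressing) blit, so it has to be flagged differently.
 */
static enum ccs_copy_mode
ccs_mode(bool src_is_lmem, bool dst_is_lmem, bool has_flat_ccs)
{
	if (!has_flat_ccs || src_is_lmem == dst_is_lmem)
		return CCS_NONE;
	return dst_is_lmem ? CCS_RESTORE : CCS_EVICT;
}
```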

> 
> 2) Do we need to block faulting of compressed data in the fault handler
> as a follow-up patch?

In case of evicted compressed data, we don't need to treat it any
differently from evicted normal data, so I don't think this needs
special treatment. Sorry if I don't understand your question.

Ram
> 
> /Thomas
> 
> 
> On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> > When we are swapping out a local memory obj on a flat-ccs capable
> > platform, we need to capture the ccs data too along with the main
> > memory, and we need to restore it when we are swapping the content
> > back in.
> >
> > Extracting and restoring the CCS data is done through a special cmd
> > called
> > XY_CTRL_SURF_COPY_BLT
> >
> > Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++---------
> > --
> >  1 file changed, 155 insertions(+), 128 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > index 5bdab0b3c735..e60ae6ff1847 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver, u32
> > size)
> >         return height % 4 == 3 && height <= 8;
> >  }
> >
> > +/**
> > + * DOC: Flat-CCS - Memory compression for Local memory
> > + *
> > + * On Xe-HP and later devices, we use dedicated compression control
> > + * state (CCS) stored in local memory for each surface, to support
> > + * the 3D and media compression formats.
> > + *
> > + * The memory required for the CCS of the entire local memory is
> > + * 1/256 of the local memory size. So before the kernel boots, the
> > + * required memory is reserved for the CCS data and a secure register
> > + * is programmed with the CCS base address.
> > + *
> > + * Flat CCS data needs to be cleared when an lmem object is
> > + * allocated, and CCS data can be copied in and out of the CCS region
> > + * through XY_CTRL_SURF_COPY_BLT. The CPU can't access the CCS data
> > + * directly.
> > + *
> > + * When we exhaust the lmem, if the object's placements support
> > + * smem, then we can directly decompress the compressed lmem object
> > + * into smem and start using it from smem itself.
> > + *
> > + * But when we need to swap out the compressed lmem object into a
> > + * smem region even though the object's placement doesn't support
> > + * smem, then we copy the lmem content as it is into the smem region
> > + * along with the ccs data (using XY_CTRL_SURF_COPY_BLT). When the
> > + * object is referenced again, the lmem content will be swapped back
> > + * in along with restoration of the CCS data (using
> > + * XY_CTRL_SURF_COPY_BLT) at the corresponding location.
> > + *
> > + *
> > + * Flat-CCS Modifiers for different compression formats
> > + * ----------------------------------------------------
> > + *
> > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate buffers of
> > + * Flat CCS render compression formats. Though the general layout is
> > + * the same as I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, a new
> > + * hashing/compression algorithm is used. Render compression uses
> > + * 128 byte compression blocks.
> > + *
> > + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS - used to indicate buffers of
> > + * Flat CCS media compression formats. Though the general layout is
> > + * the same as I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, a new
> > + * hashing/compression algorithm is used. Media compression uses
> > + * 256 byte compression blocks.
> > + *
> > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate buffers
> > + * of Flat CCS clear color render compression formats. Unified
> > + * compression format for clear color render compression. The general
> > + * layout is a tiled layout using 4Kb tiles, i.e. the Tile4 layout.
> > + */
> > +
> > +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > +{
> > +       /* Mask the 3 LSB to use the PPGTT address space */
> > +       *cmd++ = MI_FLUSH_DW | flags;
> > +       *cmd++ = lower_32_bits(dst);
> > +       *cmd++ = upper_32_bits(dst);
> > +
> > +       return cmd;
> > +}
> > +
> > +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915,
> > int size)
> > +{
> > +       u32 num_cmds, num_blks, total_size;
> > +
> > +       if (!GET_CCS_SIZE(i915, size))
> > +               return 0;
> > +
> > +       /*
> > +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > +        * blocks. One XY_CTRL_SURF_COPY_BLT command can
> > +        * transfer up to 1024 blocks.
> > +        */
> > +       num_blks = GET_CCS_SIZE(i915, size);
> > +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> > +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > +
> > +       /*
> > +        * We need to add a flush before and after
> > +        * XY_CTRL_SURF_COPY_BLT
> > +        */
> > +       total_size += 2 * MI_FLUSH_DW_SIZE;
> > +       return total_size;
> > +}
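As a standalone sanity check of the sizing math above, the arithmetic
can be modelled like this. The constants here are assumed illustrative
values (the instruction sizes in particular), not taken from the i915
headers:

```c
#include <assert.h>
#include <stdint.h>

#define CCS_RATIO		256	/* 1 byte of CCS per 256 bytes of lmem */
#define NUM_CCS_BYTES_PER_BLOCK	256	/* CCS transfer block size */
#define NUM_CCS_BLKS_PER_XFER	1024	/* max blocks per XY_CTRL_SURF_COPY_BLT */
#define XY_CTRL_SURF_INSTR_SIZE	5	/* dwords per blit command (assumed) */
#define MI_FLUSH_DW_SIZE	3	/* dwords per flush (assumed) */

/* Ring space (in dwords) needed to move the CCS of lmem_bytes of lmem. */
static uint32_t ctrl_surf_instr_size(uint64_t lmem_bytes)
{
	uint64_t ccs_bytes = lmem_bytes / CCS_RATIO;
	uint32_t num_blks = (ccs_bytes + NUM_CCS_BYTES_PER_BLOCK - 1) /
			    NUM_CCS_BYTES_PER_BLOCK;
	uint32_t num_cmds = (num_blks + NUM_CCS_BLKS_PER_XFER - 1) /
			    NUM_CCS_BLKS_PER_XFER;

	if (!num_blks)
		return 0;
	/* one flush before and one after the XY_CTRL_SURF_COPY_BLT run */
	return num_cmds * XY_CTRL_SURF_INSTR_SIZE + 2 * MI_FLUSH_DW_SIZE;
}
```

With these assumptions, 64M of lmem maps to 256K of CCS, i.e. 1024
blocks, which fits in a single command.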
> > +
> > +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > dst_addr,
> > +                                    u8 src_mem_access, u8
> > dst_mem_access,
> > +                                    int src_mocs, int dst_mocs,
> > +                                    u16 num_ccs_blocks)
> > +{
> > +       int i = num_ccs_blocks;
> > +
> > +       /*
> > +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the
> > CCS
> > +        * data in and out of the CCS region.
> > +        *
> > +        * We can copy at most 1024 blocks of 256 bytes using one
> > +        * XY_CTRL_SURF_COPY_BLT instruction.
> > +        *
> > +        * In case we need to copy more than 1024 blocks, we need to
> > add
> > +        * another instruction to the same batch buffer.
> > +        *
> > +        * 1024 blocks of 256 bytes of CCS represent a total 256KB of
> > CCS.
> > +        *
> > +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > +        */
> > +       do {
> > +               /*
> > +                * We use a bitwise AND with 1023 since the size
> > +                * field takes values in the range 0 - 1023
> > +                */
> > +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > +                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> > +                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> > +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > +               *cmd++ = lower_32_bits(src_addr);
> > +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > +               *cmd++ = lower_32_bits(dst_addr);
> > +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > +               src_addr += SZ_64M;
> > +               dst_addr += SZ_64M;
> > +               i -= NUM_CCS_BLKS_PER_XFER;
> > +       } while (i > 0);
> > +
> > +       return cmd;
> > +}
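A quick standalone model of the loop above (again with assumed
constants, not the i915 header values) shows it emits one command per
1024-block chunk, with both addresses stepping 64M per command:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CCS_BLKS_PER_XFER	1024		/* 256 B blocks per command */
#define SZ_64M			(64ULL << 20)	/* lmem covered per command */

/*
 * Mirror the do/while above: count the emitted XY_CTRL_SURF_COPY_BLT
 * commands and report how far the src/dst addresses advance in total.
 */
static unsigned int ctrl_surf_num_cmds(int num_ccs_blocks, uint64_t *addr_span)
{
	unsigned int cmds = 0;
	int i = num_ccs_blocks;

	do {
		cmds++;
		i -= NUM_CCS_BLKS_PER_XFER;
	} while (i > 0);

	*addr_span = (uint64_t)cmds * SZ_64M;
	return cmds;
}
```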
> > +
> >  static int emit_copy(struct i915_request *rq,
> > -                    u32 dst_offset, u32 src_offset, int size)
> > +                    bool dst_is_lmem, u32 dst_offset,
> > +                    bool src_is_lmem, u32 src_offset, int size)
> >  {
> > +       struct drm_i915_private *i915 = rq->engine->i915;
> >         const int ver = GRAPHICS_VER(rq->engine->i915);
> >         u32 instance = rq->engine->instance;
> > +       u32 num_ccs_blks, ccs_ring_size;
> > +       u8 src_access, dst_access;
> >         u32 *cs;
> >
> > -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> > +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) &&
> > HAS_FLAT_CCS(i915)) ?
> > +                        calc_ctrl_surf_instr_size(i915, size) : 0;
> > +
> > +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
> >         if (IS_ERR(cs))
> >                 return PTR_ERR(cs);
> >
> > @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
> >                 *cs++ = src_offset;
> >         }
> >
> > +       if (ccs_ring_size) {
> > +               /* TODO: Migration needs to be handled with resolve
> > of compressed data */
> > +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> > +                               NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
> > +
> > +               src_access = !src_is_lmem && dst_is_lmem;
> > +               dst_access = !src_access;
> > +
> > +               if (src_access) /* Swapin of compressed data */
> > +                       src_offset += size;
> > +               else
> > +                       dst_offset += size;
> > +
> > +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset,
> > dst_offset,
> > +                                             src_access, dst_access,
> > +                                             1, 1, num_ccs_blks);
> > +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC |
> > MI_FLUSH_CCS);
> > +       }
> > +
> >         intel_ring_advance(rq, cs);
> >         return 0;
> >  }
> > @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context
> > *ce,
> >                 if (err)
> >                         goto out_rq;
> >
> > -               err = emit_copy(rq, dst_offset, src_offset, len);
> > +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> > +                               src_is_lmem, src_offset, len);
> >
> >                 /* Arbitration is re-enabled between requests. */
> >  out_rq:
> > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context
> > *ce,
> >         return err;
> >  }
> >
> > -/**
> > - * DOC: Flat-CCS - Memory compression for Local memory
> > - *
> > - * On Xe-HP and later devices, we use dedicated compression control
> > state (CCS)
> > - * stored in local memory for each surface, to support the 3D and
> > media
> > - * compression formats.
> > - *
> > - * The memory required for the CCS of the entire local memory is
> > 1/256 of the
> > - * local memory size. So before the kernel boot, the required memory
> > is reserved
> > - * for the CCS data and a secure register will be programmed with
> > the CCS base
> > - * address.
> > - *
> > - * Flat CCS data needs to be cleared when a lmem object is
> > allocated.
> > - * And CCS data can be copied in and out of CCS region through
> > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
> > - *
> > - * When we exaust the lmem, if the object's placements support smem,
> > then we can
> > - * directly decompress the compressed lmem object into smem and
> > start using it
> > - * from smem itself.
> > - *
> > - * But when we need to swapout the compressed lmem object into a
> > smem region
> > - * though objects' placement doesn't support smem, then we copy the
> > lmem content
> > - * as it is into smem region along with ccs data (using
> > XY_CTRL_SURF_COPY_BLT).
> > - * When the object is referred, lmem content will be swaped in along
> > with
> > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > corresponding
> > - * location.
> > - *
> > - *
> > - * Flat-CCS Modifiers for different compression formats
> > - * ----------------------------------------------------
> > - *
> > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers
> > of Flat CCS
> > - * render compression formats. Though the general layout is same as
> > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > algorithm is
> > - * used. Render compression uses 128 byte compression blocks
> > - *
> > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers
> > of Flat CCS
> > - * media compression formats. Though the general layout is same as
> > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > algorithm is
> > - * used. Media compression uses 256 byte compression blocks.
> > - *
> > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > buffers of Flat
> > - * CCS clear color render compression formats. Unified compression
> > format for
> > - * clear color render compression. The genral layout is a tiled
> > layout using
> > - * 4Kb tiles i.e Tile4 layout.
> > - */
> > -
> > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > -{
> > -       /* Mask the 3 LSB to use the PPGTT address space */
> > -       *cmd++ = MI_FLUSH_DW | flags;
> > -       *cmd++ = lower_32_bits(dst);
> > -       *cmd++ = upper_32_bits(dst);
> > -
> > -       return cmd;
> > -}
> > -
> > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915,
> > int size)
> > -{
> > -       u32 num_cmds, num_blks, total_size;
> > -
> > -       if (!GET_CCS_SIZE(i915, size))
> > -               return 0;
> > -
> > -       /*
> > -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > -        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > -        * trnasfer upto 1024 blocks.
> > -        */
> > -       num_blks = GET_CCS_SIZE(i915, size);
> > -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> > -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > -
> > -       /*
> > -        * We need to add a flush before and after
> > -        * XY_CTRL_SURF_COPY_BLT
> > -        */
> > -       total_size += 2 * MI_FLUSH_DW_SIZE;
> > -       return total_size;
> > -}
> > -
> > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > dst_addr,
> > -                                    u8 src_mem_access, u8
> > dst_mem_access,
> > -                                    int src_mocs, int dst_mocs,
> > -                                    u16 num_ccs_blocks)
> > -{
> > -       int i = num_ccs_blocks;
> > -
> > -       /*
> > -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the
> > CCS
> > -        * data in and out of the CCS region.
> > -        *
> > -        * We can copy at most 1024 blocks of 256 bytes using one
> > -        * XY_CTRL_SURF_COPY_BLT instruction.
> > -        *
> > -        * In case we need to copy more than 1024 blocks, we need to
> > add
> > -        * another instruction to the same batch buffer.
> > -        *
> > -        * 1024 blocks of 256 bytes of CCS represent a total 256KB of
> > CCS.
> > -        *
> > -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > -        */
> > -       do {
> > -               /*
> > -                * We use logical AND with 1023 since the size field
> > -                * takes values which is in the range of 0 - 1023
> > -                */
> > -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > -                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> > -                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> > -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > -               *cmd++ = lower_32_bits(src_addr);
> > -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > -               *cmd++ = lower_32_bits(dst_addr);
> > -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > -               src_addr += SZ_64M;
> > -               dst_addr += SZ_64M;
> > -               i -= NUM_CCS_BLKS_PER_XFER;
> > -       } while (i > 0);
> > -
> > -       return cmd;
> > -}
> > -
> >  static int emit_clear(struct i915_request *rq,
> >                       u64 offset,
> >                       int size,
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Intel-gfx] [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
@ 2022-02-07 15:14       ` Ramalingam C
  0 siblings, 0 replies; 29+ messages in thread
From: Ramalingam C @ 2022-02-07 15:14 UTC (permalink / raw)
  To: Hellstrom, Thomas; +Cc: intel-gfx, christian.koenig, dri-devel

On 2022-02-07 at 20:25:42 +0530, Hellstrom, Thomas wrote:
> Hi, Ram,
> 
> A couple of quick questions before starting a more detailed review:
> 
> 1) Does this also support migrating of compressed data LMEM->LMEM?
> What-about inter-tile?
Honestly this series mainly facused on eviction of lmem into smem and
restoration of same.

To cover migration, we need to handle this differently from eviction.
Becasue when we migrate the compressed content we need to be able to use
that from that new placement. can't keep the ccs data separately.

Migration of lmem->smem needs decompression incorportated.
Migration of lmem_m->lmem_n needs to maintain the
compressed/decompressed state as it is.

So we need to pass the information upto emit_copy to differentiate
eviction and migration

If you dont have objection I would like to take the migration once we
have the eviction of lmem in place.

> 
> 2) Do we need to block faulting of compressed data in the fault handler
> as a follow-up patch?

In case of evicted compressed data we dont need to treat it differently
from the evicted normal data. So I dont think this needs a special
treatment. Sorry if i dont understand your question.

Ram
> 
> /Thomas
> 
> 
> On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> > When we are swapping out the local memory obj on flat-ccs capable
> > platform,
> > we need to capture the ccs data too along with main meory and we need
> > to
> > restore it when we are swapping in the content.
> >
> > Extracting and restoring the CCS data is done through a special cmd
> > called
> > XY_CTRL_SURF_COPY_BLT
> >
> > Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++---------
> > --
> >  1 file changed, 155 insertions(+), 128 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > index 5bdab0b3c735..e60ae6ff1847 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver, u32
> > size)
> >         return height % 4 == 3 && height <= 8;
> >  }
> >
> > +/**
> > + * DOC: Flat-CCS - Memory compression for Local memory
> > + *
> > + * On Xe-HP and later devices, we use dedicated compression control
> > state (CCS)
> > + * stored in local memory for each surface, to support the 3D and
> > media
> > + * compression formats.
> > + *
> > + * The memory required for the CCS of the entire local memory is
> > 1/256 of the
> > + * local memory size. So before the kernel boot, the required memory
> > is reserved
> > + * for the CCS data and a secure register will be programmed with
> > the CCS base
> > + * address.
> > + *
> > + * Flat CCS data needs to be cleared when a lmem object is
> > allocated.
> > + * And CCS data can be copied in and out of CCS region through
> > + * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
> > + *
> > + * When we exaust the lmem, if the object's placements support smem,
> > then we can
> > + * directly decompress the compressed lmem object into smem and
> > start using it
> > + * from smem itself.
> > + *
> > + * But when we need to swapout the compressed lmem object into a
> > smem region
> > + * though objects' placement doesn't support smem, then we copy the
> > lmem content
> > + * as it is into smem region along with ccs data (using
> > XY_CTRL_SURF_COPY_BLT).
> > + * When the object is referred, lmem content will be swaped in along
> > with
> > + * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > corresponding
> > + * location.
> > + *
> > + *
> > + * Flat-CCS Modifiers for different compression formats
> > + * ----------------------------------------------------
> > + *
> > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers
> > of Flat CCS
> > + * render compression formats. Though the general layout is same as
> > + * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > algorithm is
> > + * used. Render compression uses 128 byte compression blocks
> > + *
> > + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers
> > of Flat CCS
> > + * media compression formats. Though the general layout is same as
> > + * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > algorithm is
> > + * used. Media compression uses 256 byte compression blocks.
> > + *
> > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > buffers of Flat
> > + * CCS clear color render compression formats. Unified compression
> > format for
> > + * clear color render compression. The genral layout is a tiled
> > layout using
> > + * 4Kb tiles i.e Tile4 layout.
> > + */
> > +
> > +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > +{
> > +       /* Mask the 3 LSB to use the PPGTT address space */
> > +       *cmd++ = MI_FLUSH_DW | flags;
> > +       *cmd++ = lower_32_bits(dst);
> > +       *cmd++ = upper_32_bits(dst);
> > +
> > +       return cmd;
> > +}
> > +
> > +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915,
> > int size)
> > +{
> > +       u32 num_cmds, num_blks, total_size;
> > +
> > +       if (!GET_CCS_SIZE(i915, size))
> > +               return 0;
> > +
> > +       /*
> > +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > +        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > +        * trnasfer upto 1024 blocks.
> > +        */
> > +       num_blks = GET_CCS_SIZE(i915, size);
> > +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> > +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > +
> > +       /*
> > +        * We need to add a flush before and after
> > +        * XY_CTRL_SURF_COPY_BLT
> > +        */
> > +       total_size += 2 * MI_FLUSH_DW_SIZE;
> > +       return total_size;
> > +}
> > +
> > +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > dst_addr,
> > +                                    u8 src_mem_access, u8
> > dst_mem_access,
> > +                                    int src_mocs, int dst_mocs,
> > +                                    u16 num_ccs_blocks)
> > +{
> > +       int i = num_ccs_blocks;
> > +
> > +       /*
> > +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the
> > CCS
> > +        * data in and out of the CCS region.
> > +        *
> > +        * We can copy at most 1024 blocks of 256 bytes using one
> > +        * XY_CTRL_SURF_COPY_BLT instruction.
> > +        *
> > +        * In case we need to copy more than 1024 blocks, we need to
> > add
> > +        * another instruction to the same batch buffer.
> > +        *
> > +        * 1024 blocks of 256 bytes of CCS represent a total 256KB of
> > CCS.
> > +        *
> > +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > +        */
> > +       do {
> > +               /*
> > +                * We use logical AND with 1023 since the size field
> > +                * takes values which is in the range of 0 - 1023
> > +                */
> > +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > +                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> > +                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> > +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > +               *cmd++ = lower_32_bits(src_addr);
> > +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > +               *cmd++ = lower_32_bits(dst_addr);
> > +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > +               src_addr += SZ_64M;
> > +               dst_addr += SZ_64M;
> > +               i -= NUM_CCS_BLKS_PER_XFER;
> > +       } while (i > 0);
> > +
> > +       return cmd;
> > +}
> > +
> >  static int emit_copy(struct i915_request *rq,
> > -                    u32 dst_offset, u32 src_offset, int size)
> > +                    bool dst_is_lmem, u32 dst_offset,
> > +                    bool src_is_lmem, u32 src_offset, int size)
> >  {
> > +       struct drm_i915_private *i915 = rq->engine->i915;
> >         const int ver = GRAPHICS_VER(rq->engine->i915);
> >         u32 instance = rq->engine->instance;
> > +       u32 num_ccs_blks, ccs_ring_size;
> > +       u8 src_access, dst_access;
> >         u32 *cs;
> >
> > -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> > +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) &&
> > HAS_FLAT_CCS(i915)) ?
> > +                        calc_ctrl_surf_instr_size(i915, size) : 0;
> > +
> > +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
> >         if (IS_ERR(cs))
> >                 return PTR_ERR(cs);
> >
> > @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
> >                 *cs++ = src_offset;
> >         }
> >
> > +       if (ccs_ring_size) {
> > +               /* TODO: Migration needs to be handled with resolve
> > of compressed data */
> > +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> > +                               NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
> > +
> > +               src_access = !src_is_lmem && dst_is_lmem;
> > +               dst_access = !src_access;
> > +
> > +               if (src_access) /* Swapin of compressed data */
> > +                       src_offset += size;
> > +               else
> > +                       dst_offset += size;
> > +
> > +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset,
> > dst_offset,
> > +                                             src_access, dst_access,
> > +                                             1, 1, num_ccs_blks);
> > +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC |
> > MI_FLUSH_CCS);
> > +       }
> > +
> >         intel_ring_advance(rq, cs);
> >         return 0;
> >  }
> > @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context
> > *ce,
> >                 if (err)
> >                         goto out_rq;
> >
> > -               err = emit_copy(rq, dst_offset, src_offset, len);
> > +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> > +                               src_is_lmem, src_offset, len);
> >
> >                 /* Arbitration is re-enabled between requests. */
> >  out_rq:
> > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context
> > *ce,
> >         return err;
> >  }
> >
> > -/**
> > - * DOC: Flat-CCS - Memory compression for Local memory
> > - *
> > - * On Xe-HP and later devices, we use dedicated compression control
> > state (CCS)
> > - * stored in local memory for each surface, to support the 3D and
> > media
> > - * compression formats.
> > - *
> > - * The memory required for the CCS of the entire local memory is
> > 1/256 of the
> > - * local memory size. So before the kernel boot, the required memory
> > is reserved
> > - * for the CCS data and a secure register will be programmed with
> > the CCS base
> > - * address.
> > - *
> > - * Flat CCS data needs to be cleared when a lmem object is
> > allocated.
> > - * And CCS data can be copied in and out of CCS region through
> > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
> > - *
> > - * When we exaust the lmem, if the object's placements support smem,
> > then we can
> > - * directly decompress the compressed lmem object into smem and
> > start using it
> > - * from smem itself.
> > - *
> > - * But when we need to swapout the compressed lmem object into a
> > smem region
> > - * though objects' placement doesn't support smem, then we copy the
> > lmem content
> > - * as it is into smem region along with ccs data (using
> > XY_CTRL_SURF_COPY_BLT).
> > - * When the object is referred, lmem content will be swaped in along
> > with
> > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > corresponding
> > - * location.
> > - *
> > - *
> > - * Flat-CCS Modifiers for different compression formats
> > - * ----------------------------------------------------
> > - *
> > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers
> > of Flat CCS
> > - * render compression formats. Though the general layout is same as
> > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > algorithm is
> > - * used. Render compression uses 128 byte compression blocks
> > - *
> > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers
> > of Flat CCS
> > - * media compression formats. Though the general layout is same as
> > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > algorithm is
> > - * used. Media compression uses 256 byte compression blocks.
> > - *
> > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > buffers of Flat
> > - * CCS clear color render compression formats. Unified compression
> > format for
> > - * clear color render compression. The genral layout is a tiled
> > layout using
> > - * 4Kb tiles i.e Tile4 layout.
> > - */
> > -
> > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > -{
> > -       /* Mask the 3 LSB to use the PPGTT address space */
> > -       *cmd++ = MI_FLUSH_DW | flags;
> > -       *cmd++ = lower_32_bits(dst);
> > -       *cmd++ = upper_32_bits(dst);
> > -
> > -       return cmd;
> > -}
> > -
> > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915,
> > int size)
> > -{
> > -       u32 num_cmds, num_blks, total_size;
> > -
> > -       if (!GET_CCS_SIZE(i915, size))
> > -               return 0;
> > -
> > -       /*
> > -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > -        * blocks. One XY_CTRL_SURF_COPY_BLT command can
> > -        * transfer up to 1024 blocks.
> > -        */
> > -       num_blks = GET_CCS_SIZE(i915, size);
> > -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
> > -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > -
> > -       /*
> > -        * We need to add a flush before and after
> > -        * XY_CTRL_SURF_COPY_BLT
> > -        */
> > -       total_size += 2 * MI_FLUSH_DW_SIZE;
> > -       return total_size;
> > -}
> > -
> > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > dst_addr,
> > -                                    u8 src_mem_access, u8
> > dst_mem_access,
> > -                                    int src_mocs, int dst_mocs,
> > -                                    u16 num_ccs_blocks)
> > -{
> > -       int i = num_ccs_blocks;
> > -
> > -       /*
> > -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the
> > CCS
> > -        * data in and out of the CCS region.
> > -        *
> > -        * We can copy at most 1024 blocks of 256 bytes using one
> > -        * XY_CTRL_SURF_COPY_BLT instruction.
> > -        *
> > -        * In case we need to copy more than 1024 blocks, we need to
> > add
> > -        * another instruction to the same batch buffer.
> > -        *
> > -        * 1024 blocks of 256 bytes of CCS represent a total 256KB of
> > CCS.
> > -        *
> > -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > -        */
> > -       do {
> > -               /*
> > -        * We use a bitwise AND with 1023 since the size field
> > -        * takes values that are in the range 0 - 1023
> > -                */
> > -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > -                         (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
> > -                         (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
> > -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > -               *cmd++ = lower_32_bits(src_addr);
> > -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > -               *cmd++ = lower_32_bits(dst_addr);
> > -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > -               src_addr += SZ_64M;
> > -               dst_addr += SZ_64M;
> > -               i -= NUM_CCS_BLKS_PER_XFER;
> > -       } while (i > 0);
> > -
> > -       return cmd;
> > -}
> > -
> >  static int emit_clear(struct i915_request *rq,
> >                       u64 offset,
> >                       int size,
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread
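As a sanity check on the sizing math in the quoted calc_ctrl_surf_instr_size() (CCS is 1/256 of the LMEM size, moved in 256-byte blocks, at most 1024 blocks per XY_CTRL_SURF_COPY_BLT), here is a small standalone sketch. The 1/256 ratio modeled in place of GET_CCS_SIZE() and the per-instruction dword counts are assumptions for illustration, not values taken from the i915 headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only; not the real i915 defines. */
#define NUM_CCS_BYTES_PER_BLOCK  256   /* one CCS block */
#define NUM_CCS_BLKS_PER_XFER    1024  /* max blocks per XY_CTRL_SURF_COPY_BLT */
#define XY_CTRL_SURF_INSTR_SIZE  5     /* dwords per copy instruction (assumed) */
#define MI_FLUSH_DW_SIZE         3     /* dwords per MI_FLUSH_DW (assumed) */

/* CCS blocks covering `size` bytes of LMEM: CCS is 1/256 of LMEM,
 * split into 256-byte blocks, rounded up. */
static uint32_t ccs_blocks(uint64_t size)
{
	uint64_t ccs_bytes = size / 256;

	return (uint32_t)((ccs_bytes + NUM_CCS_BYTES_PER_BLOCK - 1) /
			  NUM_CCS_BYTES_PER_BLOCK);
}

/* Ring-space estimate mirroring calc_ctrl_surf_instr_size(): one copy
 * instruction per 1024 blocks, plus a flush before and after. */
static uint32_t ctrl_surf_instr_size(uint64_t size)
{
	uint32_t num_blks = ccs_blocks(size);
	uint32_t num_cmds;

	if (!num_blks)
		return 0;

	num_cmds = (num_blks + NUM_CCS_BLKS_PER_XFER - 1) / NUM_CCS_BLKS_PER_XFER;
	return XY_CTRL_SURF_INSTR_SIZE * num_cmds + 2 * MI_FLUSH_DW_SIZE;
}
```

With these assumptions, 64 MB of LMEM maps to 256 KB of CCS (exactly 1024 blocks, one command), and 65 MB already needs a second XY_CTRL_SURF_COPY_BLT.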

* Re: [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
  2022-02-07 15:14       ` [Intel-gfx] " Ramalingam C
@ 2022-02-07 15:22         ` Hellstrom, Thomas
  -1 siblings, 0 replies; 29+ messages in thread
From: Hellstrom, Thomas @ 2022-02-07 15:22 UTC (permalink / raw)
  To: C, Ramalingam; +Cc: intel-gfx, christian.koenig, dri-devel

On Mon, 2022-02-07 at 20:44 +0530, Ramalingam C wrote:
> On 2022-02-07 at 20:25:42 +0530, Hellstrom, Thomas wrote:
> > Hi, Ram,
> > 
> > A couple of quick questions before starting a more detailed review:
> > 
> > 1) Does this also support migrating of compressed data LMEM->LMEM?
> > What-about inter-tile?
> Honestly this series mainly focused on eviction of lmem into smem and
> restoration of the same.
> 
> To cover migration, we need to handle this differently from eviction,
> because when we migrate the compressed content we need to be able to
> use it from the new placement; we can't keep the ccs data separately.
> 
> Migration of lmem->smem needs decompression incorporated.
> Migration of lmem_m->lmem_n needs to maintain the
> compressed/decompressed state as it is.
> 
> So we need to pass the information up to emit_copy to differentiate
> eviction and migration.
> 
> If you don't have an objection, I would like to take up migration once
> we have the eviction of lmem in place.

Sure NP. I was thinking that in the final solution we might also need
to think about the possibility that we might evict to another lmem
region, although I figure that won't be enabled until we support multi-
tile.

> 
> > 
> > 2) Do we need to block faulting of compressed data in the fault
> > handler
> > as a follow-up patch?
> 
> In case of evicted compressed data we don't need to treat it
> differently from evicted normal data, so I don't think this needs
> special treatment. Sorry if I don't understand your question.

My question wasn't directly related to eviction actually, but does
user-space need to have mmap access to compressed data? If not, block
it?

Thanks,
Thomas



> 
> Ram
> > 
> > /Thomas
> > 
> > 
> > On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> > > When we are swapping out a local memory object on a flat-ccs
> > > capable platform, we need to capture the ccs data too along with
> > > the main memory, and we need to restore it when we are swapping
> > > the content back in.
> > > 
> > > Extracting and restoring the CCS data is done through a special
> > > command called XY_CTRL_SURF_COPY_BLT.
> > > 
> > > Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++-----
> > > ----
> > > --
> > >  1 file changed, 155 insertions(+), 128 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > index 5bdab0b3c735..e60ae6ff1847 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver,
> > > u32
> > > size)
> > >         return height % 4 == 3 && height <= 8;
> > >  }
> > > 
> > > +/**
> > > + * DOC: Flat-CCS - Memory compression for Local memory
> > > + *
> > > + * On Xe-HP and later devices, we use dedicated compression
> > > control
> > > state (CCS)
> > > + * stored in local memory for each surface, to support the 3D
> > > and
> > > media
> > > + * compression formats.
> > > + *
> > > + * The memory required for the CCS of the entire local memory is
> > > 1/256 of the
> > > + * local memory size. So before the kernel boot, the required
> > > memory
> > > is reserved
> > > + * for the CCS data and a secure register will be programmed
> > > with
> > > the CCS base
> > > + * address.
> > > + *
> > > + * Flat CCS data needs to be cleared when a lmem object is
> > > allocated.
> > > + * And CCS data can be copied in and out of CCS region through
> > > + * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > directly.
> > > + *
> > > + * When we exhaust the lmem, if the object's placements support
> > > smem,
> > > then we can
> > > + * directly decompress the compressed lmem object into smem and
> > > start using it
> > > + * from smem itself.
> > > + *
> > > + * But when we need to swap out the compressed lmem object into a
> > > smem region
> > > + * though objects' placement doesn't support smem, then we copy
> > > the
> > > lmem content
> > > + * as it is into smem region along with ccs data (using
> > > XY_CTRL_SURF_COPY_BLT).
> > > + * When the object is referred, lmem content will be swapped in
> > > along
> > > with
> > > + * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > corresponding
> > > + * location.
> > > + *
> > > + *
> > > + * Flat-CCS Modifiers for different compression formats
> > > + * ----------------------------------------------------
> > > + *
> > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > buffers
> > > of Flat CCS
> > > + * render compression formats. Though the general layout is the same
> > > as
> > > + * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > algorithm is
> > > + * used. Render compression uses 128 byte compression blocks.
> > > + *
> > > + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS - used to indicate the
> > > buffers
> > > of Flat CCS
> > > + * media compression formats. Though the general layout is the same
> > > as
> > > + * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > algorithm is
> > > + * used. Media compression uses 256 byte compression blocks.
> > > + *
> > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > buffers of Flat
> > > + * CCS clear color render compression formats. Unified
> > > compression
> > > format for
> > > + * clear color render compression. The general layout is a tiled
> > > layout using
> > > + * 4KB tiles, i.e. Tile4 layout.
> > > + */
> > > +
> > > +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > +{
> > > +       /* Mask the 3 LSB to use the PPGTT address space */
> > > +       *cmd++ = MI_FLUSH_DW | flags;
> > > +       *cmd++ = lower_32_bits(dst);
> > > +       *cmd++ = upper_32_bits(dst);
> > > +
> > > +       return cmd;
> > > +}
> > > +
> > > +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > *i915,
> > > int size)
> > > +{
> > > +       u32 num_cmds, num_blks, total_size;
> > > +
> > > +       if (!GET_CCS_SIZE(i915, size))
> > > +               return 0;
> > > +
> > > +       /*
> > > +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > +        * blocks. One XY_CTRL_SURF_COPY_BLT command can
> > > +        * transfer up to 1024 blocks.
> > > +        */
> > > +       num_blks = GET_CCS_SIZE(i915, size);
> > > +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > 10;
> > > +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > +
> > > +       /*
> > > +        * We need to add a flush before and after
> > > +        * XY_CTRL_SURF_COPY_BLT
> > > +        */
> > > +       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > +       return total_size;
> > > +}
> > > +
> > > +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > dst_addr,
> > > +                                    u8 src_mem_access, u8
> > > dst_mem_access,
> > > +                                    int src_mocs, int dst_mocs,
> > > +                                    u16 num_ccs_blocks)
> > > +{
> > > +       int i = num_ccs_blocks;
> > > +
> > > +       /*
> > > +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > the
> > > CCS
> > > +        * data in and out of the CCS region.
> > > +        *
> > > +        * We can copy at most 1024 blocks of 256 bytes using one
> > > +        * XY_CTRL_SURF_COPY_BLT instruction.
> > > +        *
> > > +        * In case we need to copy more than 1024 blocks, we need
> > > to
> > > add
> > > +        * another instruction to the same batch buffer.
> > > +        *
> > > +        * 1024 blocks of 256 bytes of CCS represent a total
> > > 256KB of
> > > CCS.
> > > +        *
> > > +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > +        */
> > > +       do {
> > > +               /*
> > > +                * We use a bitwise AND with 1023 since the size
> > > field
> > > +                * takes values that are in the range 0 - 1023
> > > +                */
> > > +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > +                         (src_mem_access <<
> > > SRC_ACCESS_TYPE_SHIFT) |
> > > +                         (dst_mem_access <<
> > > DST_ACCESS_TYPE_SHIFT) |
> > > +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > +               *cmd++ = lower_32_bits(src_addr);
> > > +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > +               *cmd++ = lower_32_bits(dst_addr);
> > > +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > +               src_addr += SZ_64M;
> > > +               dst_addr += SZ_64M;
> > > +               i -= NUM_CCS_BLKS_PER_XFER;
> > > +       } while (i > 0);
> > > +
> > > +       return cmd;
> > > +}
> > > +
> > >  static int emit_copy(struct i915_request *rq,
> > > -                    u32 dst_offset, u32 src_offset, int size)
> > > +                    bool dst_is_lmem, u32 dst_offset,
> > > +                    bool src_is_lmem, u32 src_offset, int size)
> > >  {
> > > +       struct drm_i915_private *i915 = rq->engine->i915;
> > >         const int ver = GRAPHICS_VER(rq->engine->i915);
> > >         u32 instance = rq->engine->instance;
> > > +       u32 num_ccs_blks, ccs_ring_size;
> > > +       u8 src_access, dst_access;
> > >         u32 *cs;
> > > 
> > > -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> > > +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) &&
> > > HAS_FLAT_CCS(i915)) ?
> > > +                        calc_ctrl_surf_instr_size(i915, size) :
> > > 0;
> > > +
> > > +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size :
> > > 6);
> > >         if (IS_ERR(cs))
> > >                 return PTR_ERR(cs);
> > > 
> > > @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request
> > > *rq,
> > >                 *cs++ = src_offset;
> > >         }
> > > 
> > > +       if (ccs_ring_size) {
> > > +               /* TODO: Migration needs to be handled with
> > > resolve
> > > of compressed data */
> > > +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> > > +                               NUM_CCS_BYTES_PER_BLOCK - 1) >>
> > > 8;
> > > +
> > > +               src_access = !src_is_lmem && dst_is_lmem;
> > > +               dst_access = !src_access;
> > > +
> > > +               if (src_access) /* Swapin of compressed data */
> > > +                       src_offset += size;
> > > +               else
> > > +                       dst_offset += size;
> > > +
> > > +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset,
> > > dst_offset,
> > > +                                             src_access,
> > > dst_access,
> > > +                                             1, 1,
> > > num_ccs_blks);
> > > +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC |
> > > MI_FLUSH_CCS);
> > > +       }
> > > +
> > >         intel_ring_advance(rq, cs);
> > >         return 0;
> > >  }
> > > @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct
> > > intel_context
> > > *ce,
> > >                 if (err)
> > >                         goto out_rq;
> > > 
> > > -               err = emit_copy(rq, dst_offset, src_offset, len);
> > > +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> > > +                               src_is_lmem, src_offset, len);
> > > 
> > >                 /* Arbitration is re-enabled between requests. */
> > >  out_rq:
> > > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct
> > > intel_context
> > > *ce,
> > >         return err;
> > >  }
> > > 
> > > -/**
> > > - * DOC: Flat-CCS - Memory compression for Local memory
> > > - *
> > > - * On Xe-HP and later devices, we use dedicated compression
> > > control
> > > state (CCS)
> > > - * stored in local memory for each surface, to support the 3D
> > > and
> > > media
> > > - * compression formats.
> > > - *
> > > - * The memory required for the CCS of the entire local memory is
> > > 1/256 of the
> > > - * local memory size. So before the kernel boot, the required
> > > memory
> > > is reserved
> > > - * for the CCS data and a secure register will be programmed
> > > with
> > > the CCS base
> > > - * address.
> > > - *
> > > - * Flat CCS data needs to be cleared when a lmem object is
> > > allocated.
> > > - * And CCS data can be copied in and out of CCS region through
> > > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > directly.
> > > - *
> > > - * When we exhaust the lmem, if the object's placements support
> > > smem,
> > > then we can
> > > - * directly decompress the compressed lmem object into smem and
> > > start using it
> > > - * from smem itself.
> > > - *
> > > - * But when we need to swap out the compressed lmem object into a
> > > smem region
> > > - * though objects' placement doesn't support smem, then we copy
> > > the
> > > lmem content
> > > - * as it is into smem region along with ccs data (using
> > > XY_CTRL_SURF_COPY_BLT).
> > > - * When the object is referred, lmem content will be swapped in
> > > along
> > > with
> > > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > corresponding
> > > - * location.
> > > - *
> > > - *
> > > - * Flat-CCS Modifiers for different compression formats
> > > - * ----------------------------------------------------
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > buffers
> > > of Flat CCS
> > > - * render compression formats. Though the general layout is the same
> > > as
> > > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > algorithm is
> > > - * used. Render compression uses 128 byte compression blocks.
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS - used to indicate the
> > > buffers
> > > of Flat CCS
> > > - * media compression formats. Though the general layout is the same
> > > as
> > > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > algorithm is
> > > - * used. Media compression uses 256 byte compression blocks.
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > buffers of Flat
> > > - * CCS clear color render compression formats. Unified
> > > compression
> > > format for
> > > - * clear color render compression. The general layout is a tiled
> > > layout using
> > > - * 4KB tiles, i.e. Tile4 layout.
> > > - */
> > > -
> > > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > -{
> > > -       /* Mask the 3 LSB to use the PPGTT address space */
> > > -       *cmd++ = MI_FLUSH_DW | flags;
> > > -       *cmd++ = lower_32_bits(dst);
> > > -       *cmd++ = upper_32_bits(dst);
> > > -
> > > -       return cmd;
> > > -}
> > > -
> > > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > *i915,
> > > int size)
> > > -{
> > > -       u32 num_cmds, num_blks, total_size;
> > > -
> > > -       if (!GET_CCS_SIZE(i915, size))
> > > -               return 0;
> > > -
> > > -       /*
> > > -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > -        * blocks. One XY_CTRL_SURF_COPY_BLT command can
> > > -        * transfer up to 1024 blocks.
> > > -        */
> > > -       num_blks = GET_CCS_SIZE(i915, size);
> > > -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > 10;
> > > -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > -
> > > -       /*
> > > -        * We need to add a flush before and after
> > > -        * XY_CTRL_SURF_COPY_BLT
> > > -        */
> > > -       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > -       return total_size;
> > > -}
> > > -
> > > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > dst_addr,
> > > -                                    u8 src_mem_access, u8
> > > dst_mem_access,
> > > -                                    int src_mocs, int dst_mocs,
> > > -                                    u16 num_ccs_blocks)
> > > -{
> > > -       int i = num_ccs_blocks;
> > > -
> > > -       /*
> > > -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > the
> > > CCS
> > > -        * data in and out of the CCS region.
> > > -        *
> > > -        * We can copy at most 1024 blocks of 256 bytes using one
> > > -        * XY_CTRL_SURF_COPY_BLT instruction.
> > > -        *
> > > -        * In case we need to copy more than 1024 blocks, we need
> > > to
> > > add
> > > -        * another instruction to the same batch buffer.
> > > -        *
> > > -        * 1024 blocks of 256 bytes of CCS represent a total
> > > 256KB of
> > > CCS.
> > > -        *
> > > -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > -        */
> > > -       do {
> > > -               /*
> > > -                * We use a bitwise AND with 1023 since the size
> > > field
> > > -                * takes values that are in the range 0 - 1023
> > > -                */
> > > -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > -                         (src_mem_access <<
> > > SRC_ACCESS_TYPE_SHIFT) |
> > > -                         (dst_mem_access <<
> > > DST_ACCESS_TYPE_SHIFT) |
> > > -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > -               *cmd++ = lower_32_bits(src_addr);
> > > -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > -               *cmd++ = lower_32_bits(dst_addr);
> > > -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > -               src_addr += SZ_64M;
> > > -               dst_addr += SZ_64M;
> > > -               i -= NUM_CCS_BLKS_PER_XFER;
> > > -       } while (i > 0);
> > > -
> > > -       return cmd;
> > > -}
> > > -
> > >  static int emit_clear(struct i915_request *rq,
> > >                       u64 offset,
> > >                       int size,
> > 
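The loop in the quoted _i915_ctrl_surf_copy_blt() advances in 64 MB strides of LMEM per command (1024 blocks x 256 bytes of CCS correspond to 64 MB of main memory). A minimal sketch of just that chunking, modeling only the command count and address bookkeeping rather than the actual dword emission, under the same assumed constants:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CCS_BLKS_PER_XFER 1024           /* blocks per command */
#define SZ_64M ((uint64_t)64 << 20)          /* LMEM covered per command */

/* Count the XY_CTRL_SURF_COPY_BLT commands the quoted loop would emit
 * for num_ccs_blocks, advancing src/dst by 64 MB per command exactly as
 * the driver loop does. The 10-bit size field encodes (blocks - 1) & 1023. */
static int count_ctrl_surf_cmds(int num_ccs_blocks,
				uint64_t *src_addr, uint64_t *dst_addr)
{
	int i = num_ccs_blocks;
	int cmds = 0;

	do {
		/* size field written into the instruction: 0..1023 */
		uint32_t size_field = (uint32_t)(i - 1) & 1023;
		(void)size_field;

		*src_addr += SZ_64M;
		*dst_addr += SZ_64M;
		i -= NUM_CCS_BLKS_PER_XFER;
		cmds++;
	} while (i > 0);

	return cmds;
}
```

For example, 1024 blocks fit in one command, while 2048 take two, with both addresses advanced by 128 MB in total.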


^ permalink raw reply	[flat|nested] 29+ messages in thread
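The emit_copy() hunk quoted in this thread picks the CCS transfer direction from the placements: on swap-in (smem to lmem) the stored CCS is read from behind the smem content, while on swap-out (lmem to smem) it is stashed there. A hedged sketch of just that decision; the names ccs_direction and struct ccs_dir are illustrative, not the driver's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative container for the direction decision in emit_copy(). */
struct ccs_dir {
	bool src_access;      /* ctrl-surf access on the source side (swap-in) */
	bool dst_access;      /* ctrl-surf access on the dest side (swap-out) */
	uint64_t src_offset;
	uint64_t dst_offset;
};

static struct ccs_dir ccs_direction(bool src_is_lmem, bool dst_is_lmem,
				    uint64_t src_offset, uint64_t dst_offset,
				    uint64_t size)
{
	struct ccs_dir d;

	/* smem -> lmem means swap-in: read stored CCS from the source */
	d.src_access = !src_is_lmem && dst_is_lmem;
	d.dst_access = !d.src_access;

	/* CCS bytes ride after the main content on the smem side, so the
	 * smem offset is bumped past the object size. */
	if (d.src_access)
		src_offset += size;   /* swap-in: CCS follows smem content */
	else
		dst_offset += size;   /* swap-out: stash CCS after content */

	d.src_offset = src_offset;
	d.dst_offset = dst_offset;
	return d;
}
```

This only models the eviction/restore path; as discussed above, true migration (lmem to lmem, or resolve on lmem to smem) would need different handling.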

> > >  out_rq:
> > > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct
> > > intel_context
> > > *ce,
> > >         return err;
> > >  }
> > > 
> > > -/**
> > > - * DOC: Flat-CCS - Memory compression for Local memory
> > > - *
> > > - * On Xe-HP and later devices, we use dedicated compression
> > > control
> > > state (CCS)
> > > - * stored in local memory for each surface, to support the 3D
> > > and
> > > media
> > > - * compression formats.
> > > - *
> > > - * The memory required for the CCS of the entire local memory is
> > > 1/256 of the
> > > - * local memory size. So before the kernel boot, the required
> > > memory
> > > is reserved
> > > - * for the CCS data and a secure register will be programmed
> > > with
> > > the CCS base
> > > - * address.
> > > - *
> > > - * Flat CCS data needs to be cleared when a lmem object is
> > > allocated.
> > > - * And CCS data can be copied in and out of CCS region through
> > > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > directly.
> > > - *
> > > - * When we exaust the lmem, if the object's placements support
> > > smem,
> > > then we can
> > > - * directly decompress the compressed lmem object into smem and
> > > start using it
> > > - * from smem itself.
> > > - *
> > > - * But when we need to swapout the compressed lmem object into a
> > > smem region
> > > - * though objects' placement doesn't support smem, then we copy
> > > the
> > > lmem content
> > > - * as it is into smem region along with ccs data (using
> > > XY_CTRL_SURF_COPY_BLT).
> > > - * When the object is referred, lmem content will be swaped in
> > > along
> > > with
> > > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > corresponding
> > > - * location.
> > > - *
> > > - *
> > > - * Flat-CCS Modifiers for different compression formats
> > > - * ----------------------------------------------------
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > buffers
> > > of Flat CCS
> > > - * render compression formats. Though the general layout is same
> > > as
> > > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > algorithm is
> > > - * used. Render compression uses 128 byte compression blocks
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the
> > > buffers
> > > of Flat CCS
> > > - * media compression formats. Though the general layout is same
> > > as
> > > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > algorithm is
> > > - * used. Media compression uses 256 byte compression blocks.
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > buffers of Flat
> > > - * CCS clear color render compression formats. Unified
> > > compression
> > > format for
> > > - * clear color render compression. The genral layout is a tiled
> > > layout using
> > > - * 4Kb tiles i.e Tile4 layout.
> > > - */
> > > -
> > > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > -{
> > > -       /* Mask the 3 LSB to use the PPGTT address space */
> > > -       *cmd++ = MI_FLUSH_DW | flags;
> > > -       *cmd++ = lower_32_bits(dst);
> > > -       *cmd++ = upper_32_bits(dst);
> > > -
> > > -       return cmd;
> > > -}
> > > -
> > > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > *i915,
> > > int size)
> > > -{
> > > -       u32 num_cmds, num_blks, total_size;
> > > -
> > > -       if (!GET_CCS_SIZE(i915, size))
> > > -               return 0;
> > > -
> > > -       /*
> > > -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > -        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > -        * trnasfer upto 1024 blocks.
> > > -        */
> > > -       num_blks = GET_CCS_SIZE(i915, size);
> > > -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > 10;
> > > -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > -
> > > -       /*
> > > -        * We need to add a flush before and after
> > > -        * XY_CTRL_SURF_COPY_BLT
> > > -        */
> > > -       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > -       return total_size;
> > > -}
> > > -
> > > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > dst_addr,
> > > -                                    u8 src_mem_access, u8
> > > dst_mem_access,
> > > -                                    int src_mocs, int dst_mocs,
> > > -                                    u16 num_ccs_blocks)
> > > -{
> > > -       int i = num_ccs_blocks;
> > > -
> > > -       /*
> > > -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > the
> > > CCS
> > > -        * data in and out of the CCS region.
> > > -        *
> > > -        * We can copy at most 1024 blocks of 256 bytes using one
> > > -        * XY_CTRL_SURF_COPY_BLT instruction.
> > > -        *
> > > -        * In case we need to copy more than 1024 blocks, we need
> > > to
> > > add
> > > -        * another instruction to the same batch buffer.
> > > -        *
> > > -        * 1024 blocks of 256 bytes of CCS represent a total
> > > 256KB of
> > > CCS.
> > > -        *
> > > -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > -        */
> > > -       do {
> > > -               /*
> > > -                * We use logical AND with 1023 since the size
> > > field
> > > -                * takes values which is in the range of 0 - 1023
> > > -                */
> > > -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > -                         (src_mem_access <<
> > > SRC_ACCESS_TYPE_SHIFT) |
> > > -                         (dst_mem_access <<
> > > DST_ACCESS_TYPE_SHIFT) |
> > > -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > -               *cmd++ = lower_32_bits(src_addr);
> > > -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > -               *cmd++ = lower_32_bits(dst_addr);
> > > -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > -               src_addr += SZ_64M;
> > > -               dst_addr += SZ_64M;
> > > -               i -= NUM_CCS_BLKS_PER_XFER;
> > > -       } while (i > 0);
> > > -
> > > -       return cmd;
> > > -}
> > > -
> > >  static int emit_clear(struct i915_request *rq,
> > >                       u64 offset,
> > >                       int size,
> > 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
  2022-02-07 15:22         ` [Intel-gfx] " Hellstrom, Thomas
@ 2022-02-07 15:33           ` Ramalingam C
  -1 siblings, 0 replies; 29+ messages in thread
From: Ramalingam C @ 2022-02-07 15:33 UTC (permalink / raw)
  To: Hellstrom, Thomas; +Cc: intel-gfx, christian.koenig, dri-devel

On 2022-02-07 at 20:52:33 +0530, Hellstrom, Thomas wrote:
> On Mon, 2022-02-07 at 20:44 +0530, Ramalingam C wrote:
> > On 2022-02-07 at 20:25:42 +0530, Hellstrom, Thomas wrote:
> > > Hi, Ram,
> > >
> > > A couple of quick questions before starting a more detailed review:
> > >
> > > 1) Does this also support migrating of compressed data LMEM->LMEM?
> > > What-about inter-tile?
> > Honestly this series mainly focused on eviction of lmem into smem and
> > restoration of the same.
> >
> > To cover migration, we need to handle this differently from eviction,
> > because when we migrate the compressed content we need to be able to
> > use it from the new placement; we can't keep the ccs data separately.
> >
> > Migration of lmem->smem needs decompression incorporated.
> > Migration of lmem_m->lmem_n needs to maintain the
> > compressed/decompressed state as it is.
> >
> > So we need to pass the information up to emit_copy to differentiate
> > eviction and migration.
> >
> > If you don't have an objection, I would like to take up migration once
> > we have the eviction of lmem in place.
> 
> Sure NP. I was thinking that in the final solution we might also need
> to think about the possibility that we might evict to another lmem
> region, although I figure that won't be enabled until we support multi-
> tile.

Yes, we need it for the multi-tile enablement of XeHPSDV.
> 
> >
> > >
> > > 2) Do we need to block faulting of compressed data in the fault
> > > handler
> > > as a follow-up patch?
> >
> > In the case of evicted compressed data, we don't need to treat it
> > differently from evicted normal data, so I don't think this needs
> > special treatment. Sorry if I don't understand your question.
> 
> My question wasn't directly related to eviction actually, but does
> user-space need to have mmap access to compressed data? If not, block
> it?

We shouldn't mmap the ccs data. As per my understanding, we should only be
mmapping the obj size, which doesn't count the inflated ttm_tt size.

I will verify this part and, if needed, will prepare a change to exclude
the extra pages from the mmap range.

Ram.
> 
> Thanks,
> Thomas
> 
> 
> 
> >
> > Ram
> > >
> > > /Thomas
> > >
> > >
> > > On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> > > > When we are swapping out a local memory obj on a flat-ccs capable
> > > > platform, we need to capture the ccs data too along with the main
> > > > memory, and we need to restore it when we are swapping in the
> > > > content.
> > > >
> > > > Extracting and restoring the CCS data is done through a special
> > > > cmd called XY_CTRL_SURF_COPY_BLT.
> > > >
> > > > Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++-----
> > > > ----
> > > > --
> > > >  1 file changed, 155 insertions(+), 128 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > > b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > > index 5bdab0b3c735..e60ae6ff1847 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > > @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver,
> > > > u32
> > > > size)
> > > >         return height % 4 == 3 && height <= 8;
> > > >  }
> > > >
> > > > +/**
> > > > + * DOC: Flat-CCS - Memory compression for Local memory
> > > > + *
> > > > + * On Xe-HP and later devices, we use dedicated compression
> > > > control
> > > > state (CCS)
> > > > + * stored in local memory for each surface, to support the 3D
> > > > and
> > > > media
> > > > + * compression formats.
> > > > + *
> > > > + * The memory required for the CCS of the entire local memory is
> > > > 1/256 of the
> > > > + * local memory size. So before the kernel boot, the required
> > > > memory
> > > > is reserved
> > > > + * for the CCS data and a secure register will be programmed
> > > > with
> > > > the CCS base
> > > > + * address.
> > > > + *
> > > > + * Flat CCS data needs to be cleared when a lmem object is
> > > > allocated.
> > > > + * And CCS data can be copied in and out of CCS region through
> > > > + * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > > directly.
> > > > + *
> > > > + * When we exaust the lmem, if the object's placements support
> > > > smem,
> > > > then we can
> > > > + * directly decompress the compressed lmem object into smem and
> > > > start using it
> > > > + * from smem itself.
> > > > + *
> > > > + * But when we need to swapout the compressed lmem object into a
> > > > smem region
> > > > + * though objects' placement doesn't support smem, then we copy
> > > > the
> > > > lmem content
> > > > + * as it is into smem region along with ccs data (using
> > > > XY_CTRL_SURF_COPY_BLT).
> > > > + * When the object is referred, lmem content will be swaped in
> > > > along
> > > > with
> > > > + * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > > corresponding
> > > > + * location.
> > > > + *
> > > > + *
> > > > + * Flat-CCS Modifiers for different compression formats
> > > > + * ----------------------------------------------------
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > + * render compression formats. Though the general layout is same
> > > > as
> > > > + * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > > algorithm is
> > > > + * used. Render compression uses 128 byte compression blocks
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > + * media compression formats. Though the general layout is same
> > > > as
> > > > + * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > > algorithm is
> > > > + * used. Media compression uses 256 byte compression blocks.
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > > buffers of Flat
> > > > + * CCS clear color render compression formats. Unified
> > > > compression
> > > > format for
> > > > + * clear color render compression. The genral layout is a tiled
> > > > layout using
> > > > + * 4Kb tiles i.e Tile4 layout.
> > > > + */
> > > > +
> > > > +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > > +{
> > > > +       /* Mask the 3 LSB to use the PPGTT address space */
> > > > +       *cmd++ = MI_FLUSH_DW | flags;
> > > > +       *cmd++ = lower_32_bits(dst);
> > > > +       *cmd++ = upper_32_bits(dst);
> > > > +
> > > > +       return cmd;
> > > > +}
> > > > +
> > > > +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > > *i915,
> > > > int size)
> > > > +{
> > > > +       u32 num_cmds, num_blks, total_size;
> > > > +
> > > > +       if (!GET_CCS_SIZE(i915, size))
> > > > +               return 0;
> > > > +
> > > > +       /*
> > > > +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > > +        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > > +        * trnasfer upto 1024 blocks.
> > > > +        */
> > > > +       num_blks = GET_CCS_SIZE(i915, size);
> > > > +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > > 10;
> > > > +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > > +
> > > > +       /*
> > > > +        * We need to add a flush before and after
> > > > +        * XY_CTRL_SURF_COPY_BLT
> > > > +        */
> > > > +       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > > +       return total_size;
> > > > +}
> > > > +
> > > > +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > > dst_addr,
> > > > +                                    u8 src_mem_access, u8
> > > > dst_mem_access,
> > > > +                                    int src_mocs, int dst_mocs,
> > > > +                                    u16 num_ccs_blocks)
> > > > +{
> > > > +       int i = num_ccs_blocks;
> > > > +
> > > > +       /*
> > > > +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > > the
> > > > CCS
> > > > +        * data in and out of the CCS region.
> > > > +        *
> > > > +        * We can copy at most 1024 blocks of 256 bytes using one
> > > > +        * XY_CTRL_SURF_COPY_BLT instruction.
> > > > +        *
> > > > +        * In case we need to copy more than 1024 blocks, we need
> > > > to
> > > > add
> > > > +        * another instruction to the same batch buffer.
> > > > +        *
> > > > +        * 1024 blocks of 256 bytes of CCS represent a total
> > > > 256KB of
> > > > CCS.
> > > > +        *
> > > > +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > > +        */
> > > > +       do {
> > > > +               /*
> > > > +                * We use logical AND with 1023 since the size
> > > > field
> > > > +                * takes values which is in the range of 0 - 1023
> > > > +                */
> > > > +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > > +                         (src_mem_access <<
> > > > SRC_ACCESS_TYPE_SHIFT) |
> > > > +                         (dst_mem_access <<
> > > > DST_ACCESS_TYPE_SHIFT) |
> > > > +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > > +               *cmd++ = lower_32_bits(src_addr);
> > > > +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > > +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > +               *cmd++ = lower_32_bits(dst_addr);
> > > > +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > > +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > +               src_addr += SZ_64M;
> > > > +               dst_addr += SZ_64M;
> > > > +               i -= NUM_CCS_BLKS_PER_XFER;
> > > > +       } while (i > 0);
> > > > +
> > > > +       return cmd;
> > > > +}
> > > > +
> > > >  static int emit_copy(struct i915_request *rq,
> > > > -                    u32 dst_offset, u32 src_offset, int size)
> > > > +                    bool dst_is_lmem, u32 dst_offset,
> > > > +                    bool src_is_lmem, u32 src_offset, int size)
> > > >  {
> > > > +       struct drm_i915_private *i915 = rq->engine->i915;
> > > >         const int ver = GRAPHICS_VER(rq->engine->i915);
> > > >         u32 instance = rq->engine->instance;
> > > > +       u32 num_ccs_blks, ccs_ring_size;
> > > > +       u8 src_access, dst_access;
> > > >         u32 *cs;
> > > >
> > > > -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> > > > +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) &&
> > > > HAS_FLAT_CCS(i915)) ?
> > > > +                        calc_ctrl_surf_instr_size(i915, size) :
> > > > 0;
> > > > +
> > > > +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size :
> > > > 6);
> > > >         if (IS_ERR(cs))
> > > >                 return PTR_ERR(cs);
> > > >
> > > > @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request
> > > > *rq,
> > > >                 *cs++ = src_offset;
> > > >         }
> > > >
> > > > +       if (ccs_ring_size) {
> > > > +               /* TODO: Migration needs to be handled with
> > > > resolve
> > > > of compressed data */
> > > > +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> > > > +                               NUM_CCS_BYTES_PER_BLOCK - 1) >>
> > > > 8;
> > > > +
> > > > +               src_access = !src_is_lmem && dst_is_lmem;
> > > > +               dst_access = !src_access;
> > > > +
> > > > +               if (src_access) /* Swapin of compressed data */
> > > > +                       src_offset += size;
> > > > +               else
> > > > +                       dst_offset += size;
> > > > +
> > > > +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset,
> > > > dst_offset,
> > > > +                                             src_access,
> > > > dst_access,
> > > > +                                             1, 1,
> > > > num_ccs_blks);
> > > > +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC |
> > > > MI_FLUSH_CCS);
> > > > +       }
> > > > +
> > > >         intel_ring_advance(rq, cs);
> > > >         return 0;
> > > >  }
> > > > @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct
> > > > intel_context
> > > > *ce,
> > > >                 if (err)
> > > >                         goto out_rq;
> > > >
> > > > -               err = emit_copy(rq, dst_offset, src_offset, len);
> > > > +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> > > > +                               src_is_lmem, src_offset, len);
> > > >
> > > >                 /* Arbitration is re-enabled between requests. */
> > > >  out_rq:
> > > > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct
> > > > intel_context
> > > > *ce,
> > > >         return err;
> > > >  }
> > > >
> > > > -/**
> > > > - * DOC: Flat-CCS - Memory compression for Local memory
> > > > - *
> > > > - * On Xe-HP and later devices, we use dedicated compression
> > > > control
> > > > state (CCS)
> > > > - * stored in local memory for each surface, to support the 3D
> > > > and
> > > > media
> > > > - * compression formats.
> > > > - *
> > > > - * The memory required for the CCS of the entire local memory is
> > > > 1/256 of the
> > > > - * local memory size. So before the kernel boot, the required
> > > > memory
> > > > is reserved
> > > > - * for the CCS data and a secure register will be programmed
> > > > with
> > > > the CCS base
> > > > - * address.
> > > > - *
> > > > - * Flat CCS data needs to be cleared when a lmem object is
> > > > allocated.
> > > > - * And CCS data can be copied in and out of CCS region through
> > > > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > > directly.
> > > > - *
> > > > - * When we exaust the lmem, if the object's placements support
> > > > smem,
> > > > then we can
> > > > - * directly decompress the compressed lmem object into smem and
> > > > start using it
> > > > - * from smem itself.
> > > > - *
> > > > - * But when we need to swapout the compressed lmem object into a
> > > > smem region
> > > > - * though objects' placement doesn't support smem, then we copy
> > > > the
> > > > lmem content
> > > > - * as it is into smem region along with ccs data (using
> > > > XY_CTRL_SURF_COPY_BLT).
> > > > - * When the object is referred, lmem content will be swaped in
> > > > along
> > > > with
> > > > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > > corresponding
> > > > - * location.
> > > > - *
> > > > - *
> > > > - * Flat-CCS Modifiers for different compression formats
> > > > - * ----------------------------------------------------
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > - * render compression formats. Though the general layout is same
> > > > as
> > > > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > > algorithm is
> > > > - * used. Render compression uses 128 byte compression blocks
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > - * media compression formats. Though the general layout is same
> > > > as
> > > > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > > algorithm is
> > > > - * used. Media compression uses 256 byte compression blocks.
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > > buffers of Flat
> > > > - * CCS clear color render compression formats. Unified
> > > > compression
> > > > format for
> > > > - * clear color render compression. The genral layout is a tiled
> > > > layout using
> > > > - * 4Kb tiles i.e Tile4 layout.
> > > > - */
> > > > -
> > > > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > > -{
> > > > -       /* Mask the 3 LSB to use the PPGTT address space */
> > > > -       *cmd++ = MI_FLUSH_DW | flags;
> > > > -       *cmd++ = lower_32_bits(dst);
> > > > -       *cmd++ = upper_32_bits(dst);
> > > > -
> > > > -       return cmd;
> > > > -}
> > > > -
> > > > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > > *i915,
> > > > int size)
> > > > -{
> > > > -       u32 num_cmds, num_blks, total_size;
> > > > -
> > > > -       if (!GET_CCS_SIZE(i915, size))
> > > > -               return 0;
> > > > -
> > > > -       /*
> > > > -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > > -        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > > -        * trnasfer upto 1024 blocks.
> > > > -        */
> > > > -       num_blks = GET_CCS_SIZE(i915, size);
> > > > -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > > 10;
> > > > -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > > -
> > > > -       /*
> > > > -        * We need to add a flush before and after
> > > > -        * XY_CTRL_SURF_COPY_BLT
> > > > -        */
> > > > -       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > > -       return total_size;
> > > > -}
> > > > -
> > > > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > > dst_addr,
> > > > -                                    u8 src_mem_access, u8
> > > > dst_mem_access,
> > > > -                                    int src_mocs, int dst_mocs,
> > > > -                                    u16 num_ccs_blocks)
> > > > -{
> > > > -       int i = num_ccs_blocks;
> > > > -
> > > > -       /*
> > > > -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > > the
> > > > CCS
> > > > -        * data in and out of the CCS region.
> > > > -        *
> > > > -        * We can copy at most 1024 blocks of 256 bytes using one
> > > > -        * XY_CTRL_SURF_COPY_BLT instruction.
> > > > -        *
> > > > -        * In case we need to copy more than 1024 blocks, we need
> > > > to
> > > > add
> > > > -        * another instruction to the same batch buffer.
> > > > -        *
> > > > -        * 1024 blocks of 256 bytes of CCS represent a total
> > > > 256KB of
> > > > CCS.
> > > > -        *
> > > > -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > > -        */
> > > > -       do {
> > > > -               /*
> > > > -                * We use logical AND with 1023 since the size
> > > > field
> > > > -                * takes values which is in the range of 0 - 1023
> > > > -                */
> > > > -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > > -                         (src_mem_access <<
> > > > SRC_ACCESS_TYPE_SHIFT) |
> > > > -                         (dst_mem_access <<
> > > > DST_ACCESS_TYPE_SHIFT) |
> > > > -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > > -               *cmd++ = lower_32_bits(src_addr);
> > > > -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > > -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > -               *cmd++ = lower_32_bits(dst_addr);
> > > > -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > > -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > -               src_addr += SZ_64M;
> > > > -               dst_addr += SZ_64M;
> > > > -               i -= NUM_CCS_BLKS_PER_XFER;
> > > > -       } while (i > 0);
> > > > -
> > > > -       return cmd;
> > > > -}
> > > > -
> > > >  static int emit_clear(struct i915_request *rq,
> > > >                       u64 offset,
> > > >                       int size,
> > >
> 

> > > > corresponding
> > > > + * location.
> > > > + *
> > > > + *
> > > > + * Flat-CCS Modifiers for different compression formats
> > > > + * ----------------------------------------------------
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > + * render compression formats. Though the general layout is same
> > > > as
> > > > + * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > > algorithm is
> > > > + * used. Render compression uses 128 byte compression blocks
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > + * media compression formats. Though the general layout is same
> > > > as
> > > > + * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > > algorithm is
> > > > + * used. Media compression uses 256 byte compression blocks.
> > > > + *
> > > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > > buffers of Flat
> > > > + * CCS clear color render compression formats. Unified
> > > > compression
> > > > format for
> > > > + * clear color render compression. The genral layout is a tiled
> > > > layout using
> > > > + * 4Kb tiles i.e Tile4 layout.
> > > > + */
> > > > +
> > > > +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > > +{
> > > > +       /* Mask the 3 LSB to use the PPGTT address space */
> > > > +       *cmd++ = MI_FLUSH_DW | flags;
> > > > +       *cmd++ = lower_32_bits(dst);
> > > > +       *cmd++ = upper_32_bits(dst);
> > > > +
> > > > +       return cmd;
> > > > +}
> > > > +
> > > > +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > > *i915,
> > > > int size)
> > > > +{
> > > > +       u32 num_cmds, num_blks, total_size;
> > > > +
> > > > +       if (!GET_CCS_SIZE(i915, size))
> > > > +               return 0;
> > > > +
> > > > +       /*
> > > > +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > > +        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > > +        * trnasfer upto 1024 blocks.
> > > > +        */
> > > > +       num_blks = GET_CCS_SIZE(i915, size);
> > > > +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > > 10;
> > > > +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > > +
> > > > +       /*
> > > > +        * We need to add a flush before and after
> > > > +        * XY_CTRL_SURF_COPY_BLT
> > > > +        */
> > > > +       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > > +       return total_size;
> > > > +}
> > > > +
> > > > +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > > dst_addr,
> > > > +                                    u8 src_mem_access, u8
> > > > dst_mem_access,
> > > > +                                    int src_mocs, int dst_mocs,
> > > > +                                    u16 num_ccs_blocks)
> > > > +{
> > > > +       int i = num_ccs_blocks;
> > > > +
> > > > +       /*
> > > > +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > > the
> > > > CCS
> > > > +        * data in and out of the CCS region.
> > > > +        *
> > > > +        * We can copy at most 1024 blocks of 256 bytes using one
> > > > +        * XY_CTRL_SURF_COPY_BLT instruction.
> > > > +        *
> > > > +        * In case we need to copy more than 1024 blocks, we need
> > > > to
> > > > add
> > > > +        * another instruction to the same batch buffer.
> > > > +        *
> > > > +        * 1024 blocks of 256 bytes of CCS represent a total
> > > > 256KB of
> > > > CCS.
> > > > +        *
> > > > +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > > +        */
> > > > +       do {
> > > > +               /*
> > > > +                * We use logical AND with 1023 since the size
> > > > field
> > > > +                * takes values which is in the range of 0 - 1023
> > > > +                */
> > > > +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > > +                         (src_mem_access <<
> > > > SRC_ACCESS_TYPE_SHIFT) |
> > > > +                         (dst_mem_access <<
> > > > DST_ACCESS_TYPE_SHIFT) |
> > > > +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > > +               *cmd++ = lower_32_bits(src_addr);
> > > > +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > > +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > +               *cmd++ = lower_32_bits(dst_addr);
> > > > +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > > +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > +               src_addr += SZ_64M;
> > > > +               dst_addr += SZ_64M;
> > > > +               i -= NUM_CCS_BLKS_PER_XFER;
> > > > +       } while (i > 0);
> > > > +
> > > > +       return cmd;
> > > > +}
> > > > +
> > > >  static int emit_copy(struct i915_request *rq,
> > > > -                    u32 dst_offset, u32 src_offset, int size)
> > > > +                    bool dst_is_lmem, u32 dst_offset,
> > > > +                    bool src_is_lmem, u32 src_offset, int size)
> > > >  {
> > > > +       struct drm_i915_private *i915 = rq->engine->i915;
> > > >         const int ver = GRAPHICS_VER(rq->engine->i915);
> > > >         u32 instance = rq->engine->instance;
> > > > +       u32 num_ccs_blks, ccs_ring_size;
> > > > +       u8 src_access, dst_access;
> > > >         u32 *cs;
> > > >
> > > > -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> > > > +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) &&
> > > > HAS_FLAT_CCS(i915)) ?
> > > > +                        calc_ctrl_surf_instr_size(i915, size) :
> > > > 0;
> > > > +
> > > > +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size :
> > > > 6);
> > > >         if (IS_ERR(cs))
> > > >                 return PTR_ERR(cs);
> > > >
> > > > @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request
> > > > *rq,
> > > >                 *cs++ = src_offset;
> > > >         }
> > > >
> > > > +       if (ccs_ring_size) {
> > > > +               /* TODO: Migration needs to be handled with
> > > > resolve
> > > > of compressed data */
> > > > +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> > > > +                               NUM_CCS_BYTES_PER_BLOCK - 1) >>
> > > > 8;
> > > > +
> > > > +               src_access = !src_is_lmem && dst_is_lmem;
> > > > +               dst_access = !src_access;
> > > > +
> > > > +               if (src_access) /* Swapin of compressed data */
> > > > +                       src_offset += size;
> > > > +               else
> > > > +                       dst_offset += size;
> > > > +
> > > > +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset,
> > > > dst_offset,
> > > > +                                             src_access,
> > > > dst_access,
> > > > +                                             1, 1,
> > > > num_ccs_blks);
> > > > +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC |
> > > > MI_FLUSH_CCS);
> > > > +       }
> > > > +
> > > >         intel_ring_advance(rq, cs);
> > > >         return 0;
> > > >  }
> > > > @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct
> > > > intel_context
> > > > *ce,
> > > >                 if (err)
> > > >                         goto out_rq;
> > > >
> > > > -               err = emit_copy(rq, dst_offset, src_offset, len);
> > > > +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> > > > +                               src_is_lmem, src_offset, len);
> > > >
> > > >                 /* Arbitration is re-enabled between requests. */
> > > >  out_rq:
> > > > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct
> > > > intel_context
> > > > *ce,
> > > >         return err;
> > > >  }
> > > >
> > > > -/**
> > > > - * DOC: Flat-CCS - Memory compression for Local memory
> > > > - *
> > > > - * On Xe-HP and later devices, we use dedicated compression
> > > > control
> > > > state (CCS)
> > > > - * stored in local memory for each surface, to support the 3D
> > > > and
> > > > media
> > > > - * compression formats.
> > > > - *
> > > > - * The memory required for the CCS of the entire local memory is
> > > > 1/256 of the
> > > > - * local memory size. So before the kernel boot, the required
> > > > memory
> > > > is reserved
> > > > - * for the CCS data and a secure register will be programmed
> > > > with
> > > > the CCS base
> > > > - * address.
> > > > - *
> > > > - * Flat CCS data needs to be cleared when a lmem object is
> > > > allocated.
> > > > - * And CCS data can be copied in and out of CCS region through
> > > > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > > directly.
> > > > - *
> > > > - * When we exaust the lmem, if the object's placements support
> > > > smem,
> > > > then we can
> > > > - * directly decompress the compressed lmem object into smem and
> > > > start using it
> > > > - * from smem itself.
> > > > - *
> > > > - * But when we need to swapout the compressed lmem object into a
> > > > smem region
> > > > - * though objects' placement doesn't support smem, then we copy
> > > > the
> > > > lmem content
> > > > - * as it is into smem region along with ccs data (using
> > > > XY_CTRL_SURF_COPY_BLT).
> > > > - * When the object is referred, lmem content will be swaped in
> > > > along
> > > > with
> > > > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > > corresponding
> > > > - * location.
> > > > - *
> > > > - *
> > > > - * Flat-CCS Modifiers for different compression formats
> > > > - * ----------------------------------------------------
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > - * render compression formats. Though the general layout is same
> > > > as
> > > > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > > algorithm is
> > > > - * used. Render compression uses 128 byte compression blocks
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the
> > > > buffers
> > > > of Flat CCS
> > > > - * media compression formats. Though the general layout is same
> > > > as
> > > > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > > algorithm is
> > > > - * used. Media compression uses 256 byte compression blocks.
> > > > - *
> > > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > > buffers of Flat
> > > > - * CCS clear color render compression formats. Unified
> > > > compression
> > > > format for
> > > > - * clear color render compression. The genral layout is a tiled
> > > > layout using
> > > > - * 4Kb tiles i.e Tile4 layout.
> > > > - */
> > > > -
> > > > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > > -{
> > > > -       /* Mask the 3 LSB to use the PPGTT address space */
> > > > -       *cmd++ = MI_FLUSH_DW | flags;
> > > > -       *cmd++ = lower_32_bits(dst);
> > > > -       *cmd++ = upper_32_bits(dst);
> > > > -
> > > > -       return cmd;
> > > > -}
> > > > -
> > > > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > > *i915,
> > > > int size)
> > > > -{
> > > > -       u32 num_cmds, num_blks, total_size;
> > > > -
> > > > -       if (!GET_CCS_SIZE(i915, size))
> > > > -               return 0;
> > > > -
> > > > -       /*
> > > > -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > > -        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > > -        * trnasfer upto 1024 blocks.
> > > > -        */
> > > > -       num_blks = GET_CCS_SIZE(i915, size);
> > > > -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > > 10;
> > > > -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > > -
> > > > -       /*
> > > > -        * We need to add a flush before and after
> > > > -        * XY_CTRL_SURF_COPY_BLT
> > > > -        */
> > > > -       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > > -       return total_size;
> > > > -}
> > > > -
> > > > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > > dst_addr,
> > > > -                                    u8 src_mem_access, u8
> > > > dst_mem_access,
> > > > -                                    int src_mocs, int dst_mocs,
> > > > -                                    u16 num_ccs_blocks)
> > > > -{
> > > > -       int i = num_ccs_blocks;
> > > > -
> > > > -       /*
> > > > -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > > the
> > > > CCS
> > > > -        * data in and out of the CCS region.
> > > > -        *
> > > > -        * We can copy at most 1024 blocks of 256 bytes using one
> > > > -        * XY_CTRL_SURF_COPY_BLT instruction.
> > > > -        *
> > > > -        * In case we need to copy more than 1024 blocks, we need
> > > > to
> > > > add
> > > > -        * another instruction to the same batch buffer.
> > > > -        *
> > > > -        * 1024 blocks of 256 bytes of CCS represent a total
> > > > 256KB of
> > > > CCS.
> > > > -        *
> > > > -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > > -        */
> > > > -       do {
> > > > -               /*
> > > > -                * We use logical AND with 1023 since the size
> > > > field
> > > > -                * takes values which is in the range of 0 - 1023
> > > > -                */
> > > > -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > > -                         (src_mem_access <<
> > > > SRC_ACCESS_TYPE_SHIFT) |
> > > > -                         (dst_mem_access <<
> > > > DST_ACCESS_TYPE_SHIFT) |
> > > > -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > > -               *cmd++ = lower_32_bits(src_addr);
> > > > -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > > -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > -               *cmd++ = lower_32_bits(dst_addr);
> > > > -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > > -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > > -               src_addr += SZ_64M;
> > > > -               dst_addr += SZ_64M;
> > > > -               i -= NUM_CCS_BLKS_PER_XFER;
> > > > -       } while (i > 0);
> > > > -
> > > > -       return cmd;
> > > > -}
> > > > -
> > > >  static int emit_clear(struct i915_request *rq,
> > > >                       u64 offset,
> > > >                       int size,
> > >
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread
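
The batching arithmetic in the quoted calc_ctrl_surf_instr_size() hunk can
be sketched as follows. This is an illustrative reconstruction under the
constants stated in the patch comments (256-byte CCS blocks, at most 1024
blocks per XY_CTRL_SURF_COPY_BLT, so one command per 64 MiB of LMEM); the
function names are placeholders, not the driver's.

```c
#include <assert.h>

#define NUM_CCS_BYTES_PER_BLOCK	256u
#define NUM_CCS_BLKS_PER_XFER	1024u	/* max blocks per XY_CTRL_SURF_COPY_BLT */

/*
 * CCS blocks covering @lmem_size bytes of LMEM: the CCS is 1/256 of the
 * LMEM size, carried in 256-byte blocks, i.e. one block per 64 KiB of LMEM.
 */
static unsigned int ccs_blocks(unsigned int lmem_size)
{
	unsigned int ccs = (lmem_size + 255u) / 256u;	/* 1/256 of lmem */

	return (ccs + NUM_CCS_BYTES_PER_BLOCK - 1) / NUM_CCS_BYTES_PER_BLOCK;
}

/*
 * Number of XY_CTRL_SURF_COPY_BLT commands needed: one per 1024 blocks,
 * i.e. one per 64 MiB of LMEM, mirroring the ">> 10" in the patch.
 */
static unsigned int ctrl_surf_cmds(unsigned int lmem_size)
{
	unsigned int blks = ccs_blocks(lmem_size);

	return (blks + NUM_CCS_BLKS_PER_XFER - 1) / NUM_CCS_BLKS_PER_XFER;
}
```

E.g. 64 MiB of LMEM yields 256 KiB of CCS, exactly 1024 blocks, so a
single command; 128 MiB needs two.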

* Re: [Intel-gfx] [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data
  2022-02-07  9:37   ` [Intel-gfx] " Ramalingam C
@ 2022-02-18  0:05   ` Lucas De Marchi
  -1 siblings, 0 replies; 29+ messages in thread
From: Lucas De Marchi @ 2022-02-18  0:05 UTC (permalink / raw)
  To: Ramalingam C; +Cc: intel-gfx, Hellstrom Thomas, Christian Koenig, dri-devel

On Mon, Feb 07, 2022 at 03:07:43PM +0530, Ramalingam C wrote:
>When we are swapping out a local memory obj on a flat-ccs capable platform,
>we need to capture the ccs data too, along with the main memory, and we need
>to restore it when we are swapping in the content.
>
>Extracting and restoring the CCS data is done through a special cmd called
>XY_CTRL_SURF_COPY_BLT
>
>Signed-off-by: Ramalingam C <ramalingam.c@intel.com>
>---
> drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++-----------
> 1 file changed, 155 insertions(+), 128 deletions(-)
>
>diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
>index 5bdab0b3c735..e60ae6ff1847 100644
>--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
>+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
>@@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver, u32 size)
> 	return height % 4 == 3 && height <= 8;
> }
>
>+/**
>+ * DOC: Flat-CCS - Memory compression for Local memory
>+ *
>+ * On Xe-HP and later devices, we use dedicated compression control state (CCS)
>+ * stored in local memory for each surface, to support the 3D and media
>+ * compression formats.
>+ *
>+ * The memory required for the CCS of the entire local memory is 1/256 of the
>+ * local memory size. So before the kernel boot, the required memory is reserved
>+ * for the CCS data and a secure register will be programmed with the CCS base
>+ * address.
>+ *
>+ * Flat CCS data needs to be cleared when a lmem object is allocated.
>+ * And CCS data can be copied in and out of CCS region through
>+ * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data directly.
>+ *
>+ * When we exaust the lmem, if the object's placements support smem, then we can
>+ * directly decompress the compressed lmem object into smem and start using it
>+ * from smem itself.
>+ *
>+ * But when we need to swapout the compressed lmem object into a smem region
>+ * though objects' placement doesn't support smem, then we copy the lmem content
>+ * as it is into smem region along with ccs data (using XY_CTRL_SURF_COPY_BLT).
>+ * When the object is referred, lmem content will be swaped in along with
>+ * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at corresponding
>+ * location.
>+ *
>+ *
>+ * Flat-CCS Modifiers for different compression formats
>+ * ----------------------------------------------------
>+ *
>+ * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the buffers of Flat CCS
>+ * render compression formats. Though the general layout is same as
>+ * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression algorithm is
>+ * used. Render compression uses 128 byte compression blocks
>+ *
>+ * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the buffers of Flat CCS
>+ * media compression formats. Though the general layout is same as
>+ * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression algorithm is
>+ * used. Media compression uses 256 byte compression blocks.
>+ *
>+ * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the buffers of Flat
>+ * CCS clear color render compression formats. Unified compression format for
>+ * clear color render compression. The genral layout is a tiled layout using
>+ * 4Kb tiles i.e Tile4 layout.
>+ */
>+
>+static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
>+{
>+	/* Mask the 3 LSB to use the PPGTT address space */
>+	*cmd++ = MI_FLUSH_DW | flags;
>+	*cmd++ = lower_32_bits(dst);
>+	*cmd++ = upper_32_bits(dst);
>+
>+	return cmd;
>+}
>+
>+static u32 calc_ctrl_surf_instr_size(struct drm_i915_private *i915, int size)
>+{
>+	u32 num_cmds, num_blks, total_size;
>+
>+	if (!GET_CCS_SIZE(i915, size))
>+		return 0;
>+
>+	/*
>+	 * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
>+	 * blocks. one XY_CTRL_SURF_COPY_BLT command can
>+	 * trnasfer upto 1024 blocks.
>+	 */
>+	num_blks = GET_CCS_SIZE(i915, size);
>+	num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >> 10;
>+	total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
>+
>+	/*
>+	 * We need to add a flush before and after
>+	 * XY_CTRL_SURF_COPY_BLT
>+	 */
>+	total_size += 2 * MI_FLUSH_DW_SIZE;
>+	return total_size;
>+}
>+
>+static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
>+				     u8 src_mem_access, u8 dst_mem_access,
>+				     int src_mocs, int dst_mocs,
>+				     u16 num_ccs_blocks)
>+{
>+	int i = num_ccs_blocks;
>+
>+	/*
>+	 * The XY_CTRL_SURF_COPY_BLT instruction is used to copy the CCS
>+	 * data in and out of the CCS region.
>+	 *
>+	 * We can copy at most 1024 blocks of 256 bytes using one
>+	 * XY_CTRL_SURF_COPY_BLT instruction.
>+	 *
>+	 * In case we need to copy more than 1024 blocks, we need to add
>+	 * another instruction to the same batch buffer.
>+	 *
>+	 * 1024 blocks of 256 bytes of CCS represent a total 256KB of CCS.
>+	 *
>+	 * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
>+	 */
>+	do {
>+		/*
>+		 * We use logical AND with 1023 since the size field
>+		 * takes values which is in the range of 0 - 1023
>+		 */
>+		*cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
>+			  (src_mem_access << SRC_ACCESS_TYPE_SHIFT) |
>+			  (dst_mem_access << DST_ACCESS_TYPE_SHIFT) |
>+			  (((i - 1) & 1023) << CCS_SIZE_SHIFT));
>+		*cmd++ = lower_32_bits(src_addr);
>+		*cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
>+			  (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
>+		*cmd++ = lower_32_bits(dst_addr);
>+		*cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
>+			  (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
>+		src_addr += SZ_64M;
>+		dst_addr += SZ_64M;
>+		i -= NUM_CCS_BLKS_PER_XFER;
>+	} while (i > 0);
>+
>+	return cmd;
>+}
>+
> static int emit_copy(struct i915_request *rq,
>-		     u32 dst_offset, u32 src_offset, int size)
>+		     bool dst_is_lmem, u32 dst_offset,
>+		     bool src_is_lmem, u32 src_offset, int size)
> {
>+	struct drm_i915_private *i915 = rq->engine->i915;
> 	const int ver = GRAPHICS_VER(rq->engine->i915);
> 	u32 instance = rq->engine->instance;
>+	u32 num_ccs_blks, ccs_ring_size;
>+	u8 src_access, dst_access;
> 	u32 *cs;
>
>-	cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
>+	ccs_ring_size = ((src_is_lmem || dst_is_lmem) && HAS_FLAT_CCS(i915)) ?
>+			 calc_ctrl_surf_instr_size(i915, size) : 0;
>+
>+	cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size : 6);
> 	if (IS_ERR(cs))
> 		return PTR_ERR(cs);
>
>@@ -492,6 +624,25 @@ static int emit_copy(struct i915_request *rq,
> 		*cs++ = src_offset;
> 	}
>
>+	if (ccs_ring_size) {
>+		/* TODO: Migration needs to be handled with resolve of compressed data */
>+		num_ccs_blks = (GET_CCS_SIZE(i915, size) +
>+				NUM_CCS_BYTES_PER_BLOCK - 1) >> 8;
>+
>+		src_access = !src_is_lmem && dst_is_lmem;
>+		dst_access = !src_access;
>+
>+		if (src_access) /* Swapin of compressed data */
>+			src_offset += size;
>+		else
>+			dst_offset += size;
>+
>+		cs = _i915_ctrl_surf_copy_blt(cs, src_offset, dst_offset,
>+					      src_access, dst_access,
>+					      1, 1, num_ccs_blks);
>+		cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
>+	}
>+
> 	intel_ring_advance(rq, cs);
> 	return 0;
> }
>@@ -578,7 +729,8 @@ intel_context_migrate_copy(struct intel_context *ce,
> 		if (err)
> 			goto out_rq;
>
>-		err = emit_copy(rq, dst_offset, src_offset, len);
>+		err = emit_copy(rq, dst_is_lmem, dst_offset,
>+				src_is_lmem, src_offset, len);
>
> 		/* Arbitration is re-enabled between requests. */
> out_rq:
>@@ -596,131 +748,6 @@ intel_context_migrate_copy(struct intel_context *ce,
> 	return err;
> }
>
>-/**
>- * DOC: Flat-CCS - Memory compression for Local memory

The patch that added this should add it above, where you are moving it
now, rather than having this additional hunk.

Lucas De Marchi

^ permalink raw reply	[flat|nested] 29+ messages in thread
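
As an aside, the swap-direction selection in the quoted emit_copy() hunk
(src_access = !src_is_lmem && dst_is_lmem) can be restated as a small
sketch. This is an illustrative reconstruction of the logic only, under
the assumption stated in the patch that the CCS bytes travel appended
after the main content on the smem side; the struct and names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct ccs_dir {
	bool ccs_after_src;	/* restore: read CCS from the inflated smem copy */
	bool ccs_after_dst;	/* evict: append CCS to the inflated smem copy */
};

/*
 * Restoring (smem -> lmem) reads CCS from past the end of the source
 * buffer; evicting (lmem -> smem) writes it past the end of the
 * destination buffer. Exactly one of the two holds for a given copy.
 */
static struct ccs_dir ccs_direction(bool src_is_lmem, bool dst_is_lmem)
{
	struct ccs_dir d;

	d.ccs_after_src = !src_is_lmem && dst_is_lmem;	/* swap-in path */
	d.ccs_after_dst = !d.ccs_after_src;		/* swap-out path */
	return d;
}
```

This also makes explicit why the patch bumps src_offset on swap-in but
dst_offset otherwise: the offset that grows by `size` is always the one
on the inflated smem side.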

end of thread, other threads:[~2022-02-18  0:05 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-07  9:37 [RFC 0/2] drm/i915/ttm: Evict and store of compressed object Ramalingam C
2022-02-07  9:37 ` [Intel-gfx] " Ramalingam C
2022-02-07  9:37 ` [RFC 1/2] drm/i915/ttm: Add extra pages for handling ccs data Ramalingam C
2022-02-07  9:37   ` [Intel-gfx] " Ramalingam C
2022-02-07 10:41   ` Thomas Hellström (Intel)
2022-02-07 10:41   ` Das, Nirmoy
2022-02-07  9:37 ` [RFC 2/2] drm/i915/migrate: Evict and restore the " Ramalingam C
2022-02-07  9:37   ` [Intel-gfx] " Ramalingam C
2022-02-07 14:55   ` Hellstrom, Thomas
2022-02-07 14:55     ` [Intel-gfx] " Hellstrom, Thomas
2022-02-07 15:14     ` Ramalingam C
2022-02-07 15:14       ` [Intel-gfx] " Ramalingam C
2022-02-07 15:22       ` Hellstrom, Thomas
2022-02-07 15:22         ` [Intel-gfx] " Hellstrom, Thomas
2022-02-07 15:33         ` Ramalingam C
2022-02-07 15:33           ` [Intel-gfx] " Ramalingam C
2022-02-18  0:05   ` Lucas De Marchi
2022-02-07 10:48 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for drm/i915/ttm: Evict and store of compressed object Patchwork
2022-02-07 11:41 ` [RFC 0/2] " Christian König
2022-02-07 11:41   ` [Intel-gfx] " Christian König
2022-02-07 13:49   ` Hellstrom, Thomas
2022-02-07 13:49     ` [Intel-gfx] " Hellstrom, Thomas
2022-02-07 13:53   ` Ramalingam C
2022-02-07 13:53     ` [Intel-gfx] " Ramalingam C
2022-02-07 14:37     ` Christian König
2022-02-07 14:37       ` [Intel-gfx] " Christian König
2022-02-07 14:47       ` C, Ramalingam
2022-02-07 14:47         ` [Intel-gfx] " C, Ramalingam
2022-02-07 14:49     ` Das, Nirmoy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.