* [PATCH v3 0/8] DG2 accelerated migration/clearing support
@ 2021-12-06 13:31 ` Matthew Auld
  0 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel

Enable accelerated moves and clearing on DG2. On such HW we have minimum page
size restrictions when accessing LMEM from the GTT, where we now have to use
64K GTT pages or larger. With the ppGTT the page-table also has a slightly
different layout from past generations when using the 64K GTT mode (which is
still enabled via some PDE bit), where it is now compacted down to 32 qword
entries. Note that on discrete the paging structures must also be placed in
LMEM, and we need to be able to modify them via the GTT itself (see patch 7),
which is one of the complications here.

The series needs to be applied on top of the DG2 enabling branch:
https://cgit.freedesktop.org/~ramaling/linux/log/?h=dg2_enabling_ww49.3

Matthew Auld (8):
  drm/i915/migrate: don't check the scratch page
  drm/i915/migrate: fix offset calculation
  drm/i915/migrate: fix length calculation
  drm/i915/selftests: handle object rounding
  drm/i915/gtt: allow overriding the pt alignment
  drm/i915/gtt: add xehpsdv_ppgtt_insert_entry
  drm/i915/migrate: add acceleration support for DG2
  drm/i915/migrate: turn on acceleration for DG2

 drivers/gpu/drm/i915/gt/gen8_ppgtt.c       |  50 +++++-
 drivers/gpu/drm/i915/gt/intel_gtt.h        |  10 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c    | 195 ++++++++++++++++-----
 drivers/gpu/drm/i915/gt/intel_ppgtt.c      |  16 +-
 drivers/gpu/drm/i915/gt/selftest_migrate.c |   1 +
 5 files changed, 221 insertions(+), 51 deletions(-)

-- 
2.31.1


* [PATCH v3 1/8] drm/i915/migrate: don't check the scratch page
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: Thomas Hellström, dri-devel

The scratch page might not be allocated in LMEM (like on DG2), so instead
of using that as the deciding factor for where the paging structures
live, let's just query the pt itself before mapping it.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
Reviewed-by: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 765c6d48fe52..2d3188a398dd 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -13,7 +13,6 @@
 
 struct insert_pte_data {
 	u64 offset;
-	bool is_lmem;
 };
 
 #define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
@@ -41,7 +40,7 @@ static void insert_pte(struct i915_address_space *vm,
 	struct insert_pte_data *d = data;
 
 	vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE,
-			d->is_lmem ? PTE_LM : 0);
+			i915_gem_object_is_lmem(pt->base) ? PTE_LM : 0);
 	d->offset += PAGE_SIZE;
 }
 
@@ -135,7 +134,6 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
 			goto err_vm;
 
 		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
-		d.is_lmem = i915_gem_object_is_lmem(vm->vm.scratch[0]);
 		vm->vm.foreach(&vm->vm, base, base + sz, insert_pte, &d);
 	}
 
-- 
2.31.1


* [PATCH v3 2/8] drm/i915/migrate: fix offset calculation
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: Thomas Hellström, dri-devel

Ensure we add the engine base only after we calculate the qword offset
into the PTE window.
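
For illustration (a sketch with arbitrarily chosen values, not the driver
code): with CHUNK_SZ = 8M, GTT offset 0 and engine instance 1, adding the
engine base first scales it along with the offset, so the PTE writes land
inside instance 0's window rather than instance 1's.

#include <linux/sizes.h>
#include <linux/types.h>

#define CHUNK_SZ SZ_8M

/* Sketch of the two orderings; only the "base last" one is correct. */
static u64 pte_window_offset(u64 offset, u32 instance, bool base_first)
{
	if (base_first)			/* old, broken ordering */
		offset += (u64)instance << 32;

	offset >>= 12;			/* qword index into the PTE window */
	offset *= sizeof(u64);
	offset += 2 * CHUNK_SZ;

	if (!base_first)		/* fixed ordering */
		offset += (u64)instance << 32;

	return offset;	/* instance 1, offset 0: 0x1800000 vs 0x101000000 */
}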

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
Reviewed-by: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 2d3188a398dd..6f2c4388ebb4 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -282,10 +282,10 @@ static int emit_pte(struct i915_request *rq,
 	GEM_BUG_ON(GRAPHICS_VER(rq->engine->i915) < 8);
 
 	/* Compute the page directory offset for the target address range */
-	offset += (u64)rq->engine->instance << 32;
 	offset >>= 12;
 	offset *= sizeof(u64);
 	offset += 2 * CHUNK_SZ;
+	offset += (u64)rq->engine->instance << 32;
 
 	cs = intel_ring_begin(rq, 6);
 	if (IS_ERR(cs))
-- 
2.31.1


* [PATCH v3 3/8] drm/i915/migrate: fix length calculation
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: Thomas Hellström, dri-devel

There is no need to insert PTEs for the PTE window itself. Also, foreach
expects a length, not an end offset, which could be gigantic here with a
second engine.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
Reviewed-by: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 6f2c4388ebb4..0192b61ab541 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -134,7 +134,7 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
 			goto err_vm;
 
 		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
-		vm->vm.foreach(&vm->vm, base, base + sz, insert_pte, &d);
+		vm->vm.foreach(&vm->vm, base, d.offset - base, insert_pte, &d);
 	}
 
 	return &vm->vm;
-- 
2.31.1


* [PATCH v3 4/8] drm/i915/selftests: handle object rounding
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: Thomas Hellström, dri-devel

Ensure we account for any object rounding due to min_page_size
restrictions.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
Reviewed-by: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_migrate.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/i915/gt/selftest_migrate.c b/drivers/gpu/drm/i915/gt/selftest_migrate.c
index 12ef2837c89b..e21787301bbd 100644
--- a/drivers/gpu/drm/i915/gt/selftest_migrate.c
+++ b/drivers/gpu/drm/i915/gt/selftest_migrate.c
@@ -49,6 +49,7 @@ static int copy(struct intel_migrate *migrate,
 	if (IS_ERR(src))
 		return 0;
 
+	sz = src->base.size;
 	dst = i915_gem_object_create_internal(i915, sz);
 	if (IS_ERR(dst))
 		goto err_free_src;
-- 
2.31.1


* [PATCH v3 5/8] drm/i915/gtt: allow overriding the pt alignment
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, Thomas Hellström

On some platforms we have alignment restrictions when accessing LMEM
from the GTT. In the next few patches we need to be able to modify the
page-tables directly via the GTT itself.
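
A caller that needs the PT backing store to be 64K sized/aligned would then
do something along these lines (this mirrors the migrate_vm() change later
in the series):

struct i915_vm_pt_stash stash = {};
int err;

if (HAS_64K_PAGES(gt->i915))
	stash.pt_sz = I915_GTT_PAGE_SIZE_64K;	/* 64K-aligned, LMEM-backed PTs */

err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
if (err)
	goto err_vm;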

Suggested-by: Ramalingam C <ramalingam.c@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gtt.h   | 10 +++++++++-
 drivers/gpu/drm/i915/gt/intel_ppgtt.c | 16 ++++++++++++----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index cbc0b5266cb4..a00d278d8175 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -196,6 +196,14 @@ void *__px_vaddr(struct drm_i915_gem_object *p);
 struct i915_vm_pt_stash {
 	/* preallocated chains of page tables/directories */
 	struct i915_page_table *pt[2];
+	/*
+	 * Optionally override the alignment/size of the physical page that
+	 * contains each PT. If not set defaults back to the usual
+	 * I915_GTT_PAGE_SIZE_4K. This does not influence the other paging
+	 * structures. MUST be a power-of-two. ONLY applicable on discrete
+	 * platforms.
+	 */
+	int pt_sz;
 };
 
 struct i915_vma_ops {
@@ -583,7 +591,7 @@ void free_scratch(struct i915_address_space *vm);
 
 struct drm_i915_gem_object *alloc_pt_dma(struct i915_address_space *vm, int sz);
 struct drm_i915_gem_object *alloc_pt_lmem(struct i915_address_space *vm, int sz);
-struct i915_page_table *alloc_pt(struct i915_address_space *vm);
+struct i915_page_table *alloc_pt(struct i915_address_space *vm, int sz);
 struct i915_page_directory *alloc_pd(struct i915_address_space *vm);
 struct i915_page_directory *__alloc_pd(int npde);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_ppgtt.c b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
index b8238f5bc8b1..3c90aea25072 100644
--- a/drivers/gpu/drm/i915/gt/intel_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
@@ -12,7 +12,7 @@
 #include "gen6_ppgtt.h"
 #include "gen8_ppgtt.h"
 
-struct i915_page_table *alloc_pt(struct i915_address_space *vm)
+struct i915_page_table *alloc_pt(struct i915_address_space *vm, int sz)
 {
 	struct i915_page_table *pt;
 
@@ -20,7 +20,7 @@ struct i915_page_table *alloc_pt(struct i915_address_space *vm)
 	if (unlikely(!pt))
 		return ERR_PTR(-ENOMEM);
 
-	pt->base = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
+	pt->base = vm->alloc_pt_dma(vm, sz);
 	if (IS_ERR(pt->base)) {
 		kfree(pt);
 		return ERR_PTR(-ENOMEM);
@@ -219,17 +219,25 @@ int i915_vm_alloc_pt_stash(struct i915_address_space *vm,
 			   u64 size)
 {
 	unsigned long count;
-	int shift, n;
+	int shift, n, pt_sz;
 
 	shift = vm->pd_shift;
 	if (!shift)
 		return 0;
 
+	pt_sz = stash->pt_sz;
+	if (!pt_sz)
+		pt_sz = I915_GTT_PAGE_SIZE_4K;
+	else
+		GEM_BUG_ON(!IS_DGFX(vm->i915));
+
+	GEM_BUG_ON(!is_power_of_2(pt_sz));
+
 	count = pd_count(size, shift);
 	while (count--) {
 		struct i915_page_table *pt;
 
-		pt = alloc_pt(vm);
+		pt = alloc_pt(vm, pt_sz);
 		if (IS_ERR(pt)) {
 			i915_vm_free_pt_stash(vm, stash);
 			return PTR_ERR(pt);
-- 
2.31.1


* [PATCH v3 6/8] drm/i915/gtt: add xehpsdv_ppgtt_insert_entry
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: Thomas Hellström, dri-devel

If this is LMEM then we get a 32-entry PT, with each PTE pointing to
some 64K block of memory; otherwise it's just the usual 512-entry PT.
This very much assumes the caller knows what they are doing.
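
As a worked example of the compacted write below (numbers only, chosen
arbitrarily): for a 64K-aligned GTT offset of 0x30000 we get idx = offset >>
12 = 48, gen8_pd_index(idx, 0) = 48, and 48 / 16 = 3, i.e. the fourth of the
32 compact entries. The 64K alignment check guarantees the level-0 index is
always a multiple of 16, so that division never truncates.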

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
Reviewed-by: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 50 ++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
index bd3ca0996a23..312b2267bf87 100644
--- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c
@@ -728,13 +728,56 @@ static void gen8_ppgtt_insert_entry(struct i915_address_space *vm,
 		gen8_pdp_for_page_index(vm, idx);
 	struct i915_page_directory *pd =
 		i915_pd_entry(pdp, gen8_pd_index(idx, 2));
+	struct i915_page_table *pt = i915_pt_entry(pd, gen8_pd_index(idx, 1));
 	gen8_pte_t *vaddr;
 
-	vaddr = px_vaddr(i915_pt_entry(pd, gen8_pd_index(idx, 1)));
+	GEM_BUG_ON(pt->is_compact);
+
+	vaddr = px_vaddr(pt);
 	vaddr[gen8_pd_index(idx, 0)] = gen8_pte_encode(addr, level, flags);
 	clflush_cache_range(&vaddr[gen8_pd_index(idx, 0)], sizeof(*vaddr));
 }
 
+static void __xehpsdv_ppgtt_insert_entry_lm(struct i915_address_space *vm,
+					    dma_addr_t addr,
+					    u64 offset,
+					    enum i915_cache_level level,
+					    u32 flags)
+{
+	u64 idx = offset >> GEN8_PTE_SHIFT;
+	struct i915_page_directory * const pdp =
+		gen8_pdp_for_page_index(vm, idx);
+	struct i915_page_directory *pd =
+		i915_pd_entry(pdp, gen8_pd_index(idx, 2));
+	struct i915_page_table *pt = i915_pt_entry(pd, gen8_pd_index(idx, 1));
+	gen8_pte_t *vaddr;
+
+	GEM_BUG_ON(!IS_ALIGNED(addr, SZ_64K));
+	GEM_BUG_ON(!IS_ALIGNED(offset, SZ_64K));
+
+	if (!pt->is_compact) {
+		vaddr = px_vaddr(pd);
+		vaddr[gen8_pd_index(idx, 1)] |= GEN12_PDE_64K;
+		pt->is_compact = true;
+	}
+
+	vaddr = px_vaddr(pt);
+	vaddr[gen8_pd_index(idx, 0) / 16] = gen8_pte_encode(addr, level, flags);
+}
+
+static void xehpsdv_ppgtt_insert_entry(struct i915_address_space *vm,
+				       dma_addr_t addr,
+				       u64 offset,
+				       enum i915_cache_level level,
+				       u32 flags)
+{
+	if (flags & PTE_LM)
+		return __xehpsdv_ppgtt_insert_entry_lm(vm, addr, offset,
+						       level, flags);
+
+	return gen8_ppgtt_insert_entry(vm, addr, offset, level, flags);
+}
+
 static int gen8_init_scratch(struct i915_address_space *vm)
 {
 	u32 pte_flags;
@@ -937,7 +980,10 @@ struct i915_ppgtt *gen8_ppgtt_create(struct intel_gt *gt,
 
 	ppgtt->vm.bind_async_flags = I915_VMA_LOCAL_BIND;
 	ppgtt->vm.insert_entries = gen8_ppgtt_insert;
-	ppgtt->vm.insert_page = gen8_ppgtt_insert_entry;
+	if (HAS_64K_PAGES(gt->i915))
+		ppgtt->vm.insert_page = xehpsdv_ppgtt_insert_entry;
+	else
+		ppgtt->vm.insert_page = gen8_ppgtt_insert_entry;
 	ppgtt->vm.allocate_va_range = gen8_ppgtt_alloc;
 	ppgtt->vm.clear_range = gen8_ppgtt_clear;
 	ppgtt->vm.foreach = gen8_ppgtt_foreach;
-- 
2.31.1


* [PATCH v3 7/8] drm/i915/migrate: add acceleration support for DG2
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: Thomas Hellström, dri-devel

This is all kinds of awkward since we now have to contend with using 64K
GTT pages when mapping anything in LMEM (including the page-tables
themselves).
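
Roughly, the per-engine window layout used here on HAS_64K_PAGES platforms
ends up as follows (offsets relative to the engine's chunk base; see
migrate_vm() and emit_pte() below):

  [0,            CHUNK_SZ)  smem src or dst window, regular 4K GTT pages
  [CHUNK_SZ,   2*CHUNK_SZ)  lmem src window, 64K GTT pages
  [2*CHUNK_SZ, 3*CHUNK_SZ)  lmem dst window, 64K GTT pages
  [3*CHUNK_SZ, ...       )  PTE window used to rewrite the PTEs above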

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 189 +++++++++++++++++++-----
 1 file changed, 150 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 0192b61ab541..fb658ae70a8d 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -33,6 +33,38 @@ static bool engine_supports_migration(struct intel_engine_cs *engine)
 	return true;
 }
 
+static void xehpsdv_toggle_pdes(struct i915_address_space *vm,
+				struct i915_page_table *pt,
+				void *data)
+{
+	struct insert_pte_data *d = data;
+
+	/*
+	 * Insert a dummy PTE into every PT that will map to LMEM to ensure
+	 * we have a correctly setup PDE structure for later use.
+	 */
+	vm->insert_page(vm, 0, d->offset, I915_CACHE_NONE, PTE_LM);
+	GEM_BUG_ON(!pt->is_compact);
+	d->offset += SZ_2M;
+}
+
+static void xehpsdv_insert_pte(struct i915_address_space *vm,
+			       struct i915_page_table *pt,
+			       void *data)
+{
+	struct insert_pte_data *d = data;
+
+	/*
+	 * We are playing tricks here, since the actual pt, from the hw
+	 * pov, is only 256bytes with 32 entries, or 4096bytes with 512
+	 * entries, but we are still guaranteed that the physical
+	 * alignment is 64K underneath for the pt, and we are careful
+	 * not to access the space in the void.
+	 */
+	vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE, PTE_LM);
+	d->offset += SZ_64K;
+}
+
 static void insert_pte(struct i915_address_space *vm,
 		       struct i915_page_table *pt,
 		       void *data)
@@ -75,7 +107,12 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
 	 * i.e. within the same non-preemptible window so that we do not switch
 	 * to another migration context that overwrites the PTE.
 	 *
-	 * TODO: Add support for huge LMEM PTEs
+	 * On platforms with HAS_64K_PAGES support we have three windows, and
+	 * dedicate two windows just for mapping lmem pages(smem <-> smem is not
+	 * a thing), since we are forced to use 64K GTT pages underneath which
+	 * requires also modifying the PDE. An alternative might be to instead
+	 * map the PD into the GTT, and then on the fly toggle the 4K/64K mode
+	 * in the PDE from the same batch that also modifies the PTEs.
 	 */
 
 	vm = i915_ppgtt_create(gt, I915_BO_ALLOC_PM_EARLY);
@@ -87,6 +124,9 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
 		goto err_vm;
 	}
 
+	if (HAS_64K_PAGES(gt->i915))
+		stash.pt_sz = I915_GTT_PAGE_SIZE_64K;
+
 	/*
 	 * Each engine instance is assigned its own chunk in the VM, so
 	 * that we can run multiple instances concurrently
@@ -106,14 +146,20 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
 		 * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
 		 * 4x2 page directories for source/destination.
 		 */
-		sz = 2 * CHUNK_SZ;
+		if (HAS_64K_PAGES(gt->i915))
+			sz = 3 * CHUNK_SZ;
+		else
+			sz = 2 * CHUNK_SZ;
 		d.offset = base + sz;
 
 		/*
 		 * We need another page directory setup so that we can write
 		 * the 8x512 PTE in each chunk.
 		 */
-		sz += (sz >> 12) * sizeof(u64);
+		if (HAS_64K_PAGES(gt->i915))
+			sz += (sz / SZ_2M) * SZ_64K;
+		else
+			sz += (sz >> 12) * sizeof(u64);
 
 		err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
 		if (err)
@@ -134,7 +180,18 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
 			goto err_vm;
 
 		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
-		vm->vm.foreach(&vm->vm, base, d.offset - base, insert_pte, &d);
+		if (HAS_64K_PAGES(gt->i915)) {
+			vm->vm.foreach(&vm->vm, base, d.offset - base,
+				       xehpsdv_insert_pte, &d);
+			d.offset = base + CHUNK_SZ;
+			vm->vm.foreach(&vm->vm,
+				       d.offset,
+				       2 * CHUNK_SZ,
+				       xehpsdv_toggle_pdes, &d);
+		} else {
+			vm->vm.foreach(&vm->vm, base, d.offset - base,
+				       insert_pte, &d);
+		}
 	}
 
 	return &vm->vm;
@@ -272,19 +329,38 @@ static int emit_pte(struct i915_request *rq,
 		    u64 offset,
 		    int length)
 {
+	bool has_64K_pages = HAS_64K_PAGES(rq->engine->i915);
 	const u64 encode = rq->context->vm->pte_encode(0, cache_level,
 						       is_lmem ? PTE_LM : 0);
 	struct intel_ring *ring = rq->ring;
-	int total = 0;
+	int pkt, dword_length;
+	u32 total = 0;
+	u32 page_size;
 	u32 *hdr, *cs;
-	int pkt;
 
 	GEM_BUG_ON(GRAPHICS_VER(rq->engine->i915) < 8);
 
+	page_size = I915_GTT_PAGE_SIZE;
+	dword_length = 0x400;
+
 	/* Compute the page directory offset for the target address range */
-	offset >>= 12;
-	offset *= sizeof(u64);
-	offset += 2 * CHUNK_SZ;
+	if (has_64K_pages) {
+		GEM_BUG_ON(!IS_ALIGNED(offset, SZ_2M));
+
+		offset /= SZ_2M;
+		offset *= SZ_64K;
+		offset += 3 * CHUNK_SZ;
+
+		if (is_lmem) {
+			page_size = I915_GTT_PAGE_SIZE_64K;
+			dword_length = 0x40;
+		}
+	} else {
+		offset >>= 12;
+		offset *= sizeof(u64);
+		offset += 2 * CHUNK_SZ;
+	}
+
 	offset += (u64)rq->engine->instance << 32;
 
 	cs = intel_ring_begin(rq, 6);
@@ -292,7 +368,7 @@ static int emit_pte(struct i915_request *rq,
 		return PTR_ERR(cs);
 
 	/* Pack as many PTE updates as possible into a single MI command */
-	pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
+	pkt = min_t(int, dword_length, ring->space / sizeof(u32) + 5);
 	pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
 
 	hdr = cs;
@@ -302,6 +378,8 @@ static int emit_pte(struct i915_request *rq,
 
 	do {
 		if (cs - hdr >= pkt) {
+			int dword_rem;
+
 			*hdr += cs - hdr - 2;
 			*cs++ = MI_NOOP;
 
@@ -313,7 +391,18 @@ static int emit_pte(struct i915_request *rq,
 			if (IS_ERR(cs))
 				return PTR_ERR(cs);
 
-			pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
+			dword_rem = dword_length;
+			if (has_64K_pages) {
+				if (IS_ALIGNED(total, SZ_2M)) {
+					offset = round_up(offset, SZ_64K);
+				} else {
+					dword_rem = SZ_2M - (total & (SZ_2M - 1));
+					dword_rem /= page_size;
+					dword_rem *= 2;
+				}
+			}
+
+			pkt = min_t(int, dword_rem, ring->space / sizeof(u32) + 5);
 			pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
 
 			hdr = cs;
@@ -322,13 +411,15 @@ static int emit_pte(struct i915_request *rq,
 			*cs++ = upper_32_bits(offset);
 		}
 
+		GEM_BUG_ON(!IS_ALIGNED(it->dma, page_size));
+
 		*cs++ = lower_32_bits(encode | it->dma);
 		*cs++ = upper_32_bits(encode | it->dma);
 
 		offset += 8;
-		total += I915_GTT_PAGE_SIZE;
+		total += page_size;
 
-		it->dma += I915_GTT_PAGE_SIZE;
+		it->dma += page_size;
 		if (it->dma >= it->max) {
 			it->sg = __sg_next(it->sg);
 			if (!it->sg || sg_dma_len(it->sg) == 0)
@@ -359,7 +450,8 @@ static bool wa_1209644611_applies(int ver, u32 size)
 	return height % 4 == 3 && height <= 8;
 }
 
-static int emit_copy(struct i915_request *rq, int size)
+static int emit_copy(struct i915_request *rq,
+		     u32 dst_offset, u32 src_offset, int size)
 {
 	const int ver = GRAPHICS_VER(rq->engine->i915);
 	u32 instance = rq->engine->instance;
@@ -374,31 +466,31 @@ static int emit_copy(struct i915_request *rq, int size)
 		*cs++ = BLT_DEPTH_32 | PAGE_SIZE;
 		*cs++ = 0;
 		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
-		*cs++ = CHUNK_SZ; /* dst offset */
+		*cs++ = dst_offset;
 		*cs++ = instance;
 		*cs++ = 0;
 		*cs++ = PAGE_SIZE;
-		*cs++ = 0; /* src offset */
+		*cs++ = src_offset;
 		*cs++ = instance;
 	} else if (ver >= 8) {
 		*cs++ = XY_SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (10 - 2);
 		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
 		*cs++ = 0;
 		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
-		*cs++ = CHUNK_SZ; /* dst offset */
+		*cs++ = dst_offset;
 		*cs++ = instance;
 		*cs++ = 0;
 		*cs++ = PAGE_SIZE;
-		*cs++ = 0; /* src offset */
+		*cs++ = src_offset;
 		*cs++ = instance;
 	} else {
 		GEM_BUG_ON(instance);
 		*cs++ = SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
 		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
 		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE;
-		*cs++ = CHUNK_SZ; /* dst offset */
+		*cs++ = dst_offset;
 		*cs++ = PAGE_SIZE;
-		*cs++ = 0; /* src offset */
+		*cs++ = src_offset;
 	}
 
 	intel_ring_advance(rq, cs);
@@ -426,6 +518,7 @@ intel_context_migrate_copy(struct intel_context *ce,
 	GEM_BUG_ON(ce->ring->size < SZ_64K);
 
 	do {
+		u32 src_offset, dst_offset;
 		int len;
 
 		rq = i915_request_create(ce);
@@ -453,15 +546,28 @@ intel_context_migrate_copy(struct intel_context *ce,
 		if (err)
 			goto out_rq;
 
-		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem, 0,
-			       CHUNK_SZ);
+		src_offset = 0;
+		dst_offset = CHUNK_SZ;
+		if (HAS_64K_PAGES(ce->engine->i915)) {
+			GEM_BUG_ON(!src_is_lmem && !dst_is_lmem);
+
+			src_offset = 0;
+			dst_offset = 0;
+			if (src_is_lmem)
+				src_offset = CHUNK_SZ;
+			if (dst_is_lmem)
+				dst_offset = 2 * CHUNK_SZ;
+		}
+
+		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem,
+			       src_offset, CHUNK_SZ);
 		if (len <= 0) {
 			err = len;
 			goto out_rq;
 		}
 
 		err = emit_pte(rq, &it_dst, dst_cache_level, dst_is_lmem,
-			       CHUNK_SZ, len);
+			       dst_offset, len);
 		if (err < 0)
 			goto out_rq;
 		if (err < len) {
@@ -473,7 +579,7 @@ intel_context_migrate_copy(struct intel_context *ce,
 		if (err)
 			goto out_rq;
 
-		err = emit_copy(rq, len);
+		err = emit_copy(rq, dst_offset, src_offset, len);
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
@@ -571,18 +677,20 @@ static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
 }
 
 static int emit_clear(struct i915_request *rq,
+		      u64 offset,
 		      int size,
 		      u32 value,
 		      bool is_lmem)
 {
-	const int ver = GRAPHICS_VER(rq->engine->i915);
-	u32 instance = rq->engine->instance;
-	u32 *cs;
 	struct drm_i915_private *i915 = rq->engine->i915;
+	const int ver = GRAPHICS_VER(rq->engine->i915);
 	u32 num_ccs_blks, ccs_ring_size;
+	u32 *cs;
 
 	GEM_BUG_ON(size >> PAGE_SHIFT > S16_MAX);
 
+	offset += (u64)rq->engine->instance << 32;
+
 	/* Clear flat css only when value is 0 */
 	ccs_ring_size = (is_lmem && !value) ?
 			 calc_ctrl_surf_instr_size(i915, size)
@@ -597,17 +705,17 @@ static int emit_clear(struct i915_request *rq,
 		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
 		*cs++ = 0;
 		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
-		*cs++ = 0; /* offset */
-		*cs++ = instance;
+		*cs++ = lower_32_bits(offset);
+		*cs++ = upper_32_bits(offset);
 		*cs++ = value;
 		*cs++ = MI_NOOP;
 	} else {
-		GEM_BUG_ON(instance);
+		GEM_BUG_ON(upper_32_bits(offset));
 		*cs++ = XY_COLOR_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
 		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
 		*cs++ = 0;
 		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
-		*cs++ = 0;
+		*cs++ = lower_32_bits(offset);
 		*cs++ = value;
 	}
 
@@ -623,17 +731,15 @@ static int emit_clear(struct i915_request *rq,
 		 * and use it as a source.
 		 */
 
-		cs = i915_flush_dw(cs, (u64)instance << 32,
-				   MI_FLUSH_LLC | MI_FLUSH_CCS);
+		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
 		cs = _i915_ctrl_surf_copy_blt(cs,
-					      (u64)instance << 32,
-					      (u64)instance << 32,
+					      offset,
+					      offset,
 					      DIRECT_ACCESS,
 					      INDIRECT_ACCESS,
 					      1, 1,
 					      num_ccs_blks);
-		cs = i915_flush_dw(cs, (u64)instance << 32,
-				   MI_FLUSH_LLC | MI_FLUSH_CCS);
+		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
 	}
 	intel_ring_advance(rq, cs);
 	return 0;
@@ -658,6 +764,7 @@ intel_context_migrate_clear(struct intel_context *ce,
 	GEM_BUG_ON(ce->ring->size < SZ_64K);
 
 	do {
+		u32 offset;
 		int len;
 
 		rq = i915_request_create(ce);
@@ -685,7 +792,11 @@ intel_context_migrate_clear(struct intel_context *ce,
 		if (err)
 			goto out_rq;
 
-		len = emit_pte(rq, &it, cache_level, is_lmem, 0, CHUNK_SZ);
+		offset = 0;
+		if (HAS_64K_PAGES(ce->engine->i915) && is_lmem)
+			offset = CHUNK_SZ;
+
+		len = emit_pte(rq, &it, cache_level, is_lmem, offset, CHUNK_SZ);
 		if (len <= 0) {
 			err = len;
 			goto out_rq;
@@ -695,7 +806,7 @@ intel_context_migrate_clear(struct intel_context *ce,
 		if (err)
 			goto out_rq;
 
-		err = emit_clear(rq, len, value, is_lmem);
+		err = emit_clear(rq, offset, len, value, is_lmem);
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-- 
2.31.1


* [PATCH v3 8/8] drm/i915/migrate: turn on acceleration for DG2
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 13:31   ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 13:31 UTC (permalink / raw)
  To: intel-gfx; +Cc: Thomas Hellström, dri-devel

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index fb658ae70a8d..0fb83d0bec91 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -243,8 +243,6 @@ int intel_migrate_init(struct intel_migrate *m, struct intel_gt *gt)
 
 	memset(m, 0, sizeof(*m));
 
-	return 0;
-
 	ce = pinned_context(gt);
 	if (IS_ERR(ce))
 		return PTR_ERR(ce);
-- 
2.31.1


* [Intel-gfx] ✗ Fi.CI.BUILD: failure for DG2 accelerated migration/clearing support (rev2)
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
                   ` (8 preceding siblings ...)
  (?)
@ 2021-12-06 14:05 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2021-12-06 14:05 UTC (permalink / raw)
  To: Matthew Auld; +Cc: intel-gfx

== Series Details ==

Series: DG2 accelerated migration/clearing support (rev2)
URL   : https://patchwork.freedesktop.org/series/97544/
State : failure

== Summary ==

Applying: drm/i915/migrate: don't check the scratch page
Applying: drm/i915/migrate: fix offset calculation
Applying: drm/i915/migrate: fix length calculation
Applying: drm/i915/selftests: handle object rounding
Applying: drm/i915/gtt: allow overriding the pt alignment
Applying: drm/i915/gtt: add xehpsdv_ppgtt_insert_entry
Applying: drm/i915/migrate: add acceleration support for DG2
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/gt/intel_migrate.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0007 drm/i915/migrate: add acceleration support for DG2
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] DG2 accelerated migration/clearing support
  2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
@ 2021-12-06 14:49   ` Daniel Stone
  -1 siblings, 0 replies; 31+ messages in thread
From: Daniel Stone @ 2021-12-06 14:49 UTC (permalink / raw)
  To: Matthew Auld; +Cc: intel-gfx, dri-devel

Hi Matthew,

On Mon, 6 Dec 2021 at 13:32, Matthew Auld <matthew.auld@intel.com> wrote:
> Enable accelerated moves and clearing on DG2. On such HW we have minimum page
> size restrictions when accessing LMEM from the GTT, where we now have to use 64K
> GTT pages or larger. With the ppGTT the page-table also has a slightly different
> layout from past generations when using the 64K GTT mode(which is still enabled
> on via some PDE bit), where it is now compacted down to 32 qword entries. Note
> that on discrete the paging structures must also be placed in LMEM, and we need
> to able to modify them via the GTT itself(see patch 7), which is one of the
> complications here.
>
> The series needs to be applied on top of the DG2 enabling branch:
> https://cgit.freedesktop.org/~ramaling/linux/log/?h=dg2_enabling_ww49.3

What are the changes to the v1/v2?

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] DG2 accelerated migration/clearing support
  2021-12-06 14:49   ` [Intel-gfx] " Daniel Stone
@ 2021-12-06 15:13     ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-06 15:13 UTC (permalink / raw)
  To: Daniel Stone; +Cc: intel-gfx, dri-devel

On 06/12/2021 14:49, Daniel Stone wrote:
> Hi Matthew,
> 
> On Mon, 6 Dec 2021 at 13:32, Matthew Auld <matthew.auld@intel.com> wrote:
>> Enable accelerated moves and clearing on DG2. On such HW we have minimum page
>> size restrictions when accessing LMEM from the GTT, where we now have to use 64K
>> GTT pages or larger. With the ppGTT the page-table also has a slightly different
>> layout from past generations when using the 64K GTT mode(which is still enabled
>> on via some PDE bit), where it is now compacted down to 32 qword entries. Note
>> that on discrete the paging structures must also be placed in LMEM, and we need
>> to able to modify them via the GTT itself(see patch 7), which is one of the
>> complications here.
>>
>> The series needs to be applied on top of the DG2 enabling branch:
>> https://cgit.freedesktop.org/~ramaling/linux/log/?h=dg2_enabling_ww49.3
> 
> What are the changes to the v1/v2?

Yeah, I should have added that somewhere. Sorry.

v2: Add missing cover letter
v3:
- Add some r-b tags
- Drop the GTT_MAPPABLE approach. We can instead simply pass along the 
required size/alignment using alloc_pt().

> 
> Cheers,
> Daniel
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 5/8] drm/i915/gtt: allow overriding the pt alignment
  2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
@ 2021-12-13 15:32     ` Ramalingam C
  -1 siblings, 0 replies; 31+ messages in thread
From: Ramalingam C @ 2021-12-13 15:32 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Thomas Hellström, intel-gfx, dri-devel

On 2021-12-06 at 13:31:37 +0000, Matthew Auld wrote:
> On some platforms we have alignment restrictions when accessing LMEM
> from the GTT. In the next patch few patches we need to be able to modify
probably an extra "patch" here

Apart from that, this looks good to me.

Reviewed-by: Ramalingam C <ramalingam.c@intel.com>

> the page-tables directly via the GTT itself.
> 
> Suggested-by: Ramalingam C <ramalingam.c@intel.com>
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Ramalingam C <ramalingam.c@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_gtt.h   | 10 +++++++++-
>  drivers/gpu/drm/i915/gt/intel_ppgtt.c | 16 ++++++++++++----
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
> index cbc0b5266cb4..a00d278d8175 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gtt.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
> @@ -196,6 +196,14 @@ void *__px_vaddr(struct drm_i915_gem_object *p);
>  struct i915_vm_pt_stash {
>  	/* preallocated chains of page tables/directories */
>  	struct i915_page_table *pt[2];
> +	/*
> +	 * Optionally override the alignment/size of the physical page that
> +	 * contains each PT. If not set defaults back to the usual
> +	 * I915_GTT_PAGE_SIZE_4K. This does not influence the other paging
> +	 * structures. MUST be a power-of-two. ONLY applicable on discrete
> +	 * platforms.
> +	 */
> +	int pt_sz;
>  };
>  
>  struct i915_vma_ops {
> @@ -583,7 +591,7 @@ void free_scratch(struct i915_address_space *vm);
>  
>  struct drm_i915_gem_object *alloc_pt_dma(struct i915_address_space *vm, int sz);
>  struct drm_i915_gem_object *alloc_pt_lmem(struct i915_address_space *vm, int sz);
> -struct i915_page_table *alloc_pt(struct i915_address_space *vm);
> +struct i915_page_table *alloc_pt(struct i915_address_space *vm, int sz);
>  struct i915_page_directory *alloc_pd(struct i915_address_space *vm);
>  struct i915_page_directory *__alloc_pd(int npde);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_ppgtt.c b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
> index b8238f5bc8b1..3c90aea25072 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ppgtt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ppgtt.c
> @@ -12,7 +12,7 @@
>  #include "gen6_ppgtt.h"
>  #include "gen8_ppgtt.h"
>  
> -struct i915_page_table *alloc_pt(struct i915_address_space *vm)
> +struct i915_page_table *alloc_pt(struct i915_address_space *vm, int sz)
>  {
>  	struct i915_page_table *pt;
>  
> @@ -20,7 +20,7 @@ struct i915_page_table *alloc_pt(struct i915_address_space *vm)
>  	if (unlikely(!pt))
>  		return ERR_PTR(-ENOMEM);
>  
> -	pt->base = vm->alloc_pt_dma(vm, I915_GTT_PAGE_SIZE_4K);
> +	pt->base = vm->alloc_pt_dma(vm, sz);
>  	if (IS_ERR(pt->base)) {
>  		kfree(pt);
>  		return ERR_PTR(-ENOMEM);
> @@ -219,17 +219,25 @@ int i915_vm_alloc_pt_stash(struct i915_address_space *vm,
>  			   u64 size)
>  {
>  	unsigned long count;
> -	int shift, n;
> +	int shift, n, pt_sz;
>  
>  	shift = vm->pd_shift;
>  	if (!shift)
>  		return 0;
>  
> +	pt_sz = stash->pt_sz;
> +	if (!pt_sz)
> +		pt_sz = I915_GTT_PAGE_SIZE_4K;
> +	else
> +		GEM_BUG_ON(!IS_DGFX(vm->i915));
> +
> +	GEM_BUG_ON(!is_power_of_2(pt_sz));
> +
>  	count = pd_count(size, shift);
>  	while (count--) {
>  		struct i915_page_table *pt;
>  
> -		pt = alloc_pt(vm);
> +		pt = alloc_pt(vm, pt_sz);
>  		if (IS_ERR(pt)) {
>  			i915_vm_free_pt_stash(vm, stash);
>  			return PTR_ERR(pt);
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 7/8] drm/i915/migrate: add acceleration support for DG2
  2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
@ 2021-12-14 10:56     ` Ramalingam C
  -1 siblings, 0 replies; 31+ messages in thread
From: Ramalingam C @ 2021-12-14 10:56 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Thomas Hellström, intel-gfx, dri-devel

On 2021-12-06 at 13:31:39 +0000, Matthew Auld wrote:
> This is all kinds of awkward since we now have to contend with using 64K
> GTT pages when mapping anything in LMEM(including the page-tables
> themselves).
> 
> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Ramalingam C <ramalingam.c@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_migrate.c | 189 +++++++++++++++++++-----
>  1 file changed, 150 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> index 0192b61ab541..fb658ae70a8d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> @@ -33,6 +33,38 @@ static bool engine_supports_migration(struct intel_engine_cs *engine)
>  	return true;
>  }
>  
> +static void xehpsdv_toggle_pdes(struct i915_address_space *vm,
> +				struct i915_page_table *pt,
> +				void *data)
> +{
> +	struct insert_pte_data *d = data;
> +
> +	/*
> +	 * Insert a dummy PTE into every PT that will map to LMEM to ensure
> +	 * we have a correctly setup PDE structure for later use.
> +	 */
> +	vm->insert_page(vm, 0, d->offset, I915_CACHE_NONE, PTE_LM);
This part I am not following. Why do we need to insert the dummy
PTE here?
> +	GEM_BUG_ON(!pt->is_compact);
> +	d->offset += SZ_2M;
> +}
> +
> +static void xehpsdv_insert_pte(struct i915_address_space *vm,
> +			       struct i915_page_table *pt,
> +			       void *data)
> +{
> +	struct insert_pte_data *d = data;
> +
> +	/*
> +	 * We are playing tricks here, since the actual pt, from the hw
> +	 * pov, is only 256bytes with 32 entries, or 4096bytes with 512
> +	 * entries, but we are still guaranteed that the physical
> +	 * alignment is 64K underneath for the pt, and we are careful
> +	 * not to access the space in the void.
> +	 */
> +	vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE, PTE_LM);
> +	d->offset += SZ_64K;
> +}
> +
>  static void insert_pte(struct i915_address_space *vm,
>  		       struct i915_page_table *pt,
>  		       void *data)
> @@ -75,7 +107,12 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>  	 * i.e. within the same non-preemptible window so that we do not switch
>  	 * to another migration context that overwrites the PTE.
>  	 *
> -	 * TODO: Add support for huge LMEM PTEs
> +	 * On platforms with HAS_64K_PAGES support we have three windows, and
> +	 * dedicate two windows just for mapping lmem pages(smem <-> smem is not
> +	 * a thing), since we are forced to use 64K GTT pages underneath which
> +	 * requires also modifying the PDE. An alternative might be to instead
> +	 * map the PD into the GTT, and then on the fly toggle the 4K/64K mode
> +	 * in the PDE from the same batch that also modifies the PTEs.
Could we also add a layout of the ppGTT, in case of HAS_64K_PAGES?
>  	 */
>  
>  	vm = i915_ppgtt_create(gt, I915_BO_ALLOC_PM_EARLY);
> @@ -87,6 +124,9 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>  		goto err_vm;
>  	}
>  
> +	if (HAS_64K_PAGES(gt->i915))
> +		stash.pt_sz = I915_GTT_PAGE_SIZE_64K;
> +
>  	/*
>  	 * Each engine instance is assigned its own chunk in the VM, so
>  	 * that we can run multiple instances concurrently
> @@ -106,14 +146,20 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>  		 * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
>  		 * 4x2 page directories for source/destination.
>  		 */
> -		sz = 2 * CHUNK_SZ;
> +		if (HAS_64K_PAGES(gt->i915))
> +			sz = 3 * CHUNK_SZ;
> +		else
> +			sz = 2 * CHUNK_SZ;
>  		d.offset = base + sz;
>  
>  		/*
>  		 * We need another page directory setup so that we can write
>  		 * the 8x512 PTE in each chunk.
>  		 */
> -		sz += (sz >> 12) * sizeof(u64);
> +		if (HAS_64K_PAGES(gt->i915))
> +			sz += (sz / SZ_2M) * SZ_64K;
> +		else
> +			sz += (sz >> 12) * sizeof(u64);
Here, for 4K page support, we assume a u64 per page as the required
length. But for 64K page support we calculate the number of PDEs, and
per PDE we allocate a 64K page so that we can map it for editing, right?

In this case I assume we have some unused space at the end, say after
32 * sizeof(u64).

Ram
>  
>  		err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
>  		if (err)
> @@ -134,7 +180,18 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>  			goto err_vm;
>  
>  		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
> -		vm->vm.foreach(&vm->vm, base, d.offset - base, insert_pte, &d);
> +		if (HAS_64K_PAGES(gt->i915)) {
> +			vm->vm.foreach(&vm->vm, base, d.offset - base,
> +				       xehpsdv_insert_pte, &d);
> +			d.offset = base + CHUNK_SZ;
> +			vm->vm.foreach(&vm->vm,
> +				       d.offset,
> +				       2 * CHUNK_SZ,
> +				       xehpsdv_toggle_pdes, &d);
> +		} else {
> +			vm->vm.foreach(&vm->vm, base, d.offset - base,
> +				       insert_pte, &d);
> +		}
>  	}
>  
>  	return &vm->vm;
> @@ -272,19 +329,38 @@ static int emit_pte(struct i915_request *rq,
>  		    u64 offset,
>  		    int length)
>  {
> +	bool has_64K_pages = HAS_64K_PAGES(rq->engine->i915);
>  	const u64 encode = rq->context->vm->pte_encode(0, cache_level,
>  						       is_lmem ? PTE_LM : 0);
>  	struct intel_ring *ring = rq->ring;
> -	int total = 0;
> +	int pkt, dword_length;
> +	u32 total = 0;
> +	u32 page_size;
>  	u32 *hdr, *cs;
> -	int pkt;
>  
>  	GEM_BUG_ON(GRAPHICS_VER(rq->engine->i915) < 8);
>  
> +	page_size = I915_GTT_PAGE_SIZE;
> +	dword_length = 0x400;
> +
>  	/* Compute the page directory offset for the target address range */
> -	offset >>= 12;
> -	offset *= sizeof(u64);
> -	offset += 2 * CHUNK_SZ;
> +	if (has_64K_pages) {
> +		GEM_BUG_ON(!IS_ALIGNED(offset, SZ_2M));
> +
> +		offset /= SZ_2M;
> +		offset *= SZ_64K;
> +		offset += 3 * CHUNK_SZ;
> +
> +		if (is_lmem) {
> +			page_size = I915_GTT_PAGE_SIZE_64K;
> +			dword_length = 0x40;
> +		}
> +	} else {
> +		offset >>= 12;
> +		offset *= sizeof(u64);
> +		offset += 2 * CHUNK_SZ;
> +	}
> +
>  	offset += (u64)rq->engine->instance << 32;
>  
>  	cs = intel_ring_begin(rq, 6);
> @@ -292,7 +368,7 @@ static int emit_pte(struct i915_request *rq,
>  		return PTR_ERR(cs);
>  
>  	/* Pack as many PTE updates as possible into a single MI command */
> -	pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
> +	pkt = min_t(int, dword_length, ring->space / sizeof(u32) + 5);
>  	pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
>  
>  	hdr = cs;
> @@ -302,6 +378,8 @@ static int emit_pte(struct i915_request *rq,
>  
>  	do {
>  		if (cs - hdr >= pkt) {
> +			int dword_rem;
> +
>  			*hdr += cs - hdr - 2;
>  			*cs++ = MI_NOOP;
>  
> @@ -313,7 +391,18 @@ static int emit_pte(struct i915_request *rq,
>  			if (IS_ERR(cs))
>  				return PTR_ERR(cs);
>  
> -			pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
> +			dword_rem = dword_length;
> +			if (has_64K_pages) {
> +				if (IS_ALIGNED(total, SZ_2M)) {
> +					offset = round_up(offset, SZ_64K);
> +				} else {
> +					dword_rem = SZ_2M - (total & (SZ_2M - 1));
> +					dword_rem /= page_size;
> +					dword_rem *= 2;
> +				}
> +			}
> +
> +			pkt = min_t(int, dword_rem, ring->space / sizeof(u32) + 5);
>  			pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
>  
>  			hdr = cs;
> @@ -322,13 +411,15 @@ static int emit_pte(struct i915_request *rq,
>  			*cs++ = upper_32_bits(offset);
>  		}
>  
> +		GEM_BUG_ON(!IS_ALIGNED(it->dma, page_size));
> +
>  		*cs++ = lower_32_bits(encode | it->dma);
>  		*cs++ = upper_32_bits(encode | it->dma);
>  
>  		offset += 8;
> -		total += I915_GTT_PAGE_SIZE;
> +		total += page_size;
>  
> -		it->dma += I915_GTT_PAGE_SIZE;
> +		it->dma += page_size;
>  		if (it->dma >= it->max) {
>  			it->sg = __sg_next(it->sg);
>  			if (!it->sg || sg_dma_len(it->sg) == 0)
> @@ -359,7 +450,8 @@ static bool wa_1209644611_applies(int ver, u32 size)
>  	return height % 4 == 3 && height <= 8;
>  }
>  
> -static int emit_copy(struct i915_request *rq, int size)
> +static int emit_copy(struct i915_request *rq,
> +		     u32 dst_offset, u32 src_offset, int size)
>  {
>  	const int ver = GRAPHICS_VER(rq->engine->i915);
>  	u32 instance = rq->engine->instance;
> @@ -374,31 +466,31 @@ static int emit_copy(struct i915_request *rq, int size)
>  		*cs++ = BLT_DEPTH_32 | PAGE_SIZE;
>  		*cs++ = 0;
>  		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> -		*cs++ = CHUNK_SZ; /* dst offset */
> +		*cs++ = dst_offset;
>  		*cs++ = instance;
>  		*cs++ = 0;
>  		*cs++ = PAGE_SIZE;
> -		*cs++ = 0; /* src offset */
> +		*cs++ = src_offset;
>  		*cs++ = instance;
>  	} else if (ver >= 8) {
>  		*cs++ = XY_SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (10 - 2);
>  		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
>  		*cs++ = 0;
>  		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> -		*cs++ = CHUNK_SZ; /* dst offset */
> +		*cs++ = dst_offset;
>  		*cs++ = instance;
>  		*cs++ = 0;
>  		*cs++ = PAGE_SIZE;
> -		*cs++ = 0; /* src offset */
> +		*cs++ = src_offset;
>  		*cs++ = instance;
>  	} else {
>  		GEM_BUG_ON(instance);
>  		*cs++ = SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
>  		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
>  		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE;
> -		*cs++ = CHUNK_SZ; /* dst offset */
> +		*cs++ = dst_offset;
>  		*cs++ = PAGE_SIZE;
> -		*cs++ = 0; /* src offset */
> +		*cs++ = src_offset;
>  	}
>  
>  	intel_ring_advance(rq, cs);
> @@ -426,6 +518,7 @@ intel_context_migrate_copy(struct intel_context *ce,
>  	GEM_BUG_ON(ce->ring->size < SZ_64K);
>  
>  	do {
> +		u32 src_offset, dst_offset;
>  		int len;
>  
>  		rq = i915_request_create(ce);
> @@ -453,15 +546,28 @@ intel_context_migrate_copy(struct intel_context *ce,
>  		if (err)
>  			goto out_rq;
>  
> -		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem, 0,
> -			       CHUNK_SZ);
> +		src_offset = 0;
> +		dst_offset = CHUNK_SZ;
> +		if (HAS_64K_PAGES(ce->engine->i915)) {
> +			GEM_BUG_ON(!src_is_lmem && !dst_is_lmem);
> +
> +			src_offset = 0;
> +			dst_offset = 0;
> +			if (src_is_lmem)
> +				src_offset = CHUNK_SZ;
> +			if (dst_is_lmem)
> +				dst_offset = 2 * CHUNK_SZ;
> +		}
> +
> +		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem,
> +			       src_offset, CHUNK_SZ);
>  		if (len <= 0) {
>  			err = len;
>  			goto out_rq;
>  		}
>  
>  		err = emit_pte(rq, &it_dst, dst_cache_level, dst_is_lmem,
> -			       CHUNK_SZ, len);
> +			       dst_offset, len);
>  		if (err < 0)
>  			goto out_rq;
>  		if (err < len) {
> @@ -473,7 +579,7 @@ intel_context_migrate_copy(struct intel_context *ce,
>  		if (err)
>  			goto out_rq;
>  
> -		err = emit_copy(rq, len);
> +		err = emit_copy(rq, dst_offset, src_offset, len);
>  
>  		/* Arbitration is re-enabled between requests. */
>  out_rq:
> @@ -571,18 +677,20 @@ static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
>  }
>  
>  static int emit_clear(struct i915_request *rq,
> +		      u64 offset,
>  		      int size,
>  		      u32 value,
>  		      bool is_lmem)
>  {
> -	const int ver = GRAPHICS_VER(rq->engine->i915);
> -	u32 instance = rq->engine->instance;
> -	u32 *cs;
>  	struct drm_i915_private *i915 = rq->engine->i915;
> +	const int ver = GRAPHICS_VER(rq->engine->i915);
>  	u32 num_ccs_blks, ccs_ring_size;
> +	u32 *cs;
>  
>  	GEM_BUG_ON(size >> PAGE_SHIFT > S16_MAX);
>  
> +	offset += (u64)rq->engine->instance << 32;
> +
>  	/* Clear flat css only when value is 0 */
>  	ccs_ring_size = (is_lmem && !value) ?
>  			 calc_ctrl_surf_instr_size(i915, size)
> @@ -597,17 +705,17 @@ static int emit_clear(struct i915_request *rq,
>  		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
>  		*cs++ = 0;
>  		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> -		*cs++ = 0; /* offset */
> -		*cs++ = instance;
> +		*cs++ = lower_32_bits(offset);
> +		*cs++ = upper_32_bits(offset);
>  		*cs++ = value;
>  		*cs++ = MI_NOOP;
>  	} else {
> -		GEM_BUG_ON(instance);
> +		GEM_BUG_ON(upper_32_bits(offset));
>  		*cs++ = XY_COLOR_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
>  		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
>  		*cs++ = 0;
>  		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> -		*cs++ = 0;
> +		*cs++ = lower_32_bits(offset);
>  		*cs++ = value;
>  	}
>  
> @@ -623,17 +731,15 @@ static int emit_clear(struct i915_request *rq,
>  		 * and use it as a source.
>  		 */
>  
> -		cs = i915_flush_dw(cs, (u64)instance << 32,
> -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
> +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
>  		cs = _i915_ctrl_surf_copy_blt(cs,
> -					      (u64)instance << 32,
> -					      (u64)instance << 32,
> +					      offset,
> +					      offset,
>  					      DIRECT_ACCESS,
>  					      INDIRECT_ACCESS,
>  					      1, 1,
>  					      num_ccs_blks);
> -		cs = i915_flush_dw(cs, (u64)instance << 32,
> -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
> +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
>  	}
>  	intel_ring_advance(rq, cs);
>  	return 0;
> @@ -658,6 +764,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>  	GEM_BUG_ON(ce->ring->size < SZ_64K);
>  
>  	do {
> +		u32 offset;
>  		int len;
>  
>  		rq = i915_request_create(ce);
> @@ -685,7 +792,11 @@ intel_context_migrate_clear(struct intel_context *ce,
>  		if (err)
>  			goto out_rq;
>  
> -		len = emit_pte(rq, &it, cache_level, is_lmem, 0, CHUNK_SZ);
> +		offset = 0;
> +		if (HAS_64K_PAGES(ce->engine->i915) && is_lmem)
> +			offset = CHUNK_SZ;
> +
> +		len = emit_pte(rq, &it, cache_level, is_lmem, offset, CHUNK_SZ);
>  		if (len <= 0) {
>  			err = len;
>  			goto out_rq;
> @@ -695,7 +806,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>  		if (err)
>  			goto out_rq;
>  
> -		err = emit_clear(rq, len, value, is_lmem);
> +		err = emit_clear(rq, offset, len, value, is_lmem);
>  
>  		/* Arbitration is re-enabled between requests. */
>  out_rq:
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 7/8] drm/i915/migrate: add acceleration support for DG2
  2021-12-14 10:56     ` [Intel-gfx] " Ramalingam C
@ 2021-12-14 12:32       ` Matthew Auld
  -1 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-14 12:32 UTC (permalink / raw)
  To: Ramalingam C; +Cc: Thomas Hellström, intel-gfx, dri-devel

On 14/12/2021 10:56, Ramalingam C wrote:
> On 2021-12-06 at 13:31:39 +0000, Matthew Auld wrote:
>> This is all kinds of awkward since we now have to contend with using 64K
>> GTT pages when mapping anything in LMEM(including the page-tables
>> themselves).
>>
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Ramalingam C <ramalingam.c@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_migrate.c | 189 +++++++++++++++++++-----
>>   1 file changed, 150 insertions(+), 39 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
>> index 0192b61ab541..fb658ae70a8d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
>> @@ -33,6 +33,38 @@ static bool engine_supports_migration(struct intel_engine_cs *engine)
>>   	return true;
>>   }
>>   
>> +static void xehpsdv_toggle_pdes(struct i915_address_space *vm,
>> +				struct i915_page_table *pt,
>> +				void *data)
>> +{
>> +	struct insert_pte_data *d = data;
>> +
>> +	/*
>> +	 * Insert a dummy PTE into every PT that will map to LMEM to ensure
>> +	 * we have a correctly setup PDE structure for later use.
>> +	 */
>> +	vm->insert_page(vm, 0, d->offset, I915_CACHE_NONE, PTE_LM);
> This part I am not following. Why do we need to insert the dummy
> PTE here?

We have three windows, each CHUNK_SZ in size. The first is reserved
for mapping system memory, and that just uses the usual 512-entry
layout with 4K GTT pages. The other two windows only map lmem pages
and must use the new compact 32-entry layout with 64K GTT pages, which
ensures we can address any lmem object that the user throws at us. The
insert_page() above is basically just toggling the PDE bit
(GEN12_PDE_64K) for us, to enable the compact layout for each of the
page-tables that fall within the 2 * CHUNK_SZ range starting at
CHUNK_SZ.
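
Just as an illustration (offset_uses_compact_pt() is not a real helper
anywhere in the series, only shorthand for the windowing described
above):

	/*
	 * Illustration only: which part of the migrate vm uses the
	 * compact 32-entry, 64K GTT page layout. Window 0 maps smem
	 * with the normal 512-entry PTs; windows 1 and 2 map the lmem
	 * src/dst and get flipped to the compact layout by the dummy
	 * insert_page() above.
	 */
	static bool offset_uses_compact_pt(u64 offset)
	{
		return offset >= CHUNK_SZ && offset < 3 * CHUNK_SZ;
	}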


>> +	GEM_BUG_ON(!pt->is_compact);
>> +	d->offset += SZ_2M;
>> +}
>> +
>> +static void xehpsdv_insert_pte(struct i915_address_space *vm,
>> +			       struct i915_page_table *pt,
>> +			       void *data)
>> +{
>> +	struct insert_pte_data *d = data;
>> +
>> +	/*
>> +	 * We are playing tricks here, since the actual pt, from the hw
>> +	 * pov, is only 256bytes with 32 entries, or 4096bytes with 512
>> +	 * entries, but we are still guaranteed that the physical
>> +	 * alignment is 64K underneath for the pt, and we are careful
>> +	 * not to access the space in the void.
>> +	 */
>> +	vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE, PTE_LM);
>> +	d->offset += SZ_64K;
>> +}
>> +
>>   static void insert_pte(struct i915_address_space *vm,
>>   		       struct i915_page_table *pt,
>>   		       void *data)
>> @@ -75,7 +107,12 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   	 * i.e. within the same non-preemptible window so that we do not switch
>>   	 * to another migration context that overwrites the PTE.
>>   	 *
>> -	 * TODO: Add support for huge LMEM PTEs
>> +	 * On platforms with HAS_64K_PAGES support we have three windows, and
>> +	 * dedicate two windows just for mapping lmem pages(smem <-> smem is not
>> +	 * a thing), since we are forced to use 64K GTT pages underneath which
>> +	 * requires also modifying the PDE. An alternative might be to instead
>> +	 * map the PD into the GTT, and then on the fly toggle the 4K/64K mode
>> +	 * in the PDE from the same batch that also modifies the PTEs.
> Could we also add a layout of the ppGTT, in case of HAS_64K_PAGES?

[0, CHUNK_SZ) -> first window, maps smem
[CHUNK_SZ, 2 * CHUNK_SZ) -> second window, maps lmem src
[2 * CHUNK_SZ, 3 * CHUNK_SZ) -> third window, maps lmem dst

It starts to get strange here, since each PTE must point at some 64K
page, one for each PT (since the PTs live in lmem), and yet each PT is
only <= 4096 bytes. But since the unused space within that 64K range
is never touched, this should be fine.

So basically each PT now needs 64K of virtual address space, instead
of 4K. So something like:

[3 * CHUNK_SZ, 3 * CHUNK_SZ + (3 * CHUNK_SZ / SZ_2M) * SZ_64K) -> window for mapping the PTs themselves

And then later, when writing out the PTEs, we know whether the layout
within a particular PT is 512 or 32 entries, depending on whether we
are mapping lmem or not.
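
Or, as a throwaway sketch (pt_window_addr() is made up here, but the
arithmetic is the same as the has_64K_pages branch in emit_pte()):

	/*
	 * Illustration only: the GTT address at which the PT covering
	 * 'offset' (an offset within the three CHUNK_SZ windows) is
	 * itself mapped, so that the blitter can rewrite its PTEs.
	 * Every 2M of window needs one PT, and every PT gets a 64K
	 * slot in the range starting at 3 * CHUNK_SZ.
	 */
	static u64 pt_window_addr(u64 offset)
	{
		return 3 * CHUNK_SZ + (offset / SZ_2M) * SZ_64K;
	}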

>>   	 */
>>   
>>   	vm = i915_ppgtt_create(gt, I915_BO_ALLOC_PM_EARLY);
>> @@ -87,6 +124,9 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   		goto err_vm;
>>   	}
>>   
>> +	if (HAS_64K_PAGES(gt->i915))
>> +		stash.pt_sz = I915_GTT_PAGE_SIZE_64K;
>> +
>>   	/*
>>   	 * Each engine instance is assigned its own chunk in the VM, so
>>   	 * that we can run multiple instances concurrently
>> @@ -106,14 +146,20 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   		 * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
>>   		 * 4x2 page directories for source/destination.
>>   		 */
>> -		sz = 2 * CHUNK_SZ;
>> +		if (HAS_64K_PAGES(gt->i915))
>> +			sz = 3 * CHUNK_SZ;
>> +		else
>> +			sz = 2 * CHUNK_SZ;
>>   		d.offset = base + sz;
>>   
>>   		/*
>>   		 * We need another page directory setup so that we can write
>>   		 * the 8x512 PTE in each chunk.
>>   		 */
>> -		sz += (sz >> 12) * sizeof(u64);
>> +		if (HAS_64K_PAGES(gt->i915))
>> +			sz += (sz / SZ_2M) * SZ_64K;
>> +		else
>> +			sz += (sz >> 12) * sizeof(u64);
> Here, for 4K page support, we assume a u64 per page as the required
> length. But for 64K page support we calculate the number of PDEs, and
> per PDE we allocate a 64K page so that we can map it for editing, right?
> 
> In this case I assume we have some unused space at the end, say after
> 32 * sizeof(u64).

For every PT (which covers a 2M VA range), we need a 64K VA chunk in
order to map it. Yes, there is some unused space at the end, but it is
already like that for the other PTEs. Also, it seems strange to call
alloc_va_range without also rounding up the size to the correct page
size.
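
To put rough numbers on it, assuming CHUNK_SZ is still the 8M from the
comment above: the three windows are 24M of VA, i.e. 12 PTs, so we set
aside 12 * 64K = 768K of VA just for mapping the PTs themselves, where
the old (sz >> 12) * sizeof(u64) style calculation would only have
reserved 48K of PTE space for the same range.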

> 
> Ram
>>   
>>   		err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
>>   		if (err)
>> @@ -134,7 +180,18 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   			goto err_vm;
>>   
>>   		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
>> -		vm->vm.foreach(&vm->vm, base, d.offset - base, insert_pte, &d);
>> +		if (HAS_64K_PAGES(gt->i915)) {
>> +			vm->vm.foreach(&vm->vm, base, d.offset - base,
>> +				       xehpsdv_insert_pte, &d);
>> +			d.offset = base + CHUNK_SZ;
>> +			vm->vm.foreach(&vm->vm,
>> +				       d.offset,
>> +				       2 * CHUNK_SZ,
>> +				       xehpsdv_toggle_pdes, &d);
>> +		} else {
>> +			vm->vm.foreach(&vm->vm, base, d.offset - base,
>> +				       insert_pte, &d);
>> +		}
>>   	}
>>   
>>   	return &vm->vm;
>> @@ -272,19 +329,38 @@ static int emit_pte(struct i915_request *rq,
>>   		    u64 offset,
>>   		    int length)
>>   {
>> +	bool has_64K_pages = HAS_64K_PAGES(rq->engine->i915);
>>   	const u64 encode = rq->context->vm->pte_encode(0, cache_level,
>>   						       is_lmem ? PTE_LM : 0);
>>   	struct intel_ring *ring = rq->ring;
>> -	int total = 0;
>> +	int pkt, dword_length;
>> +	u32 total = 0;
>> +	u32 page_size;
>>   	u32 *hdr, *cs;
>> -	int pkt;
>>   
>>   	GEM_BUG_ON(GRAPHICS_VER(rq->engine->i915) < 8);
>>   
>> +	page_size = I915_GTT_PAGE_SIZE;
>> +	dword_length = 0x400;
>> +
>>   	/* Compute the page directory offset for the target address range */
>> -	offset >>= 12;
>> -	offset *= sizeof(u64);
>> -	offset += 2 * CHUNK_SZ;
>> +	if (has_64K_pages) {
>> +		GEM_BUG_ON(!IS_ALIGNED(offset, SZ_2M));
>> +
>> +		offset /= SZ_2M;
>> +		offset *= SZ_64K;
>> +		offset += 3 * CHUNK_SZ;
>> +
>> +		if (is_lmem) {
>> +			page_size = I915_GTT_PAGE_SIZE_64K;
>> +			dword_length = 0x40;
>> +		}
>> +	} else {
>> +		offset >>= 12;
>> +		offset *= sizeof(u64);
>> +		offset += 2 * CHUNK_SZ;
>> +	}
>> +
>>   	offset += (u64)rq->engine->instance << 32;
>>   
>>   	cs = intel_ring_begin(rq, 6);
>> @@ -292,7 +368,7 @@ static int emit_pte(struct i915_request *rq,
>>   		return PTR_ERR(cs);
>>   
>>   	/* Pack as many PTE updates as possible into a single MI command */
>> -	pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
>> +	pkt = min_t(int, dword_length, ring->space / sizeof(u32) + 5);
>>   	pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
>>   
>>   	hdr = cs;
>> @@ -302,6 +378,8 @@ static int emit_pte(struct i915_request *rq,
>>   
>>   	do {
>>   		if (cs - hdr >= pkt) {
>> +			int dword_rem;
>> +
>>   			*hdr += cs - hdr - 2;
>>   			*cs++ = MI_NOOP;
>>   
>> @@ -313,7 +391,18 @@ static int emit_pte(struct i915_request *rq,
>>   			if (IS_ERR(cs))
>>   				return PTR_ERR(cs);
>>   
>> -			pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
>> +			dword_rem = dword_length;
>> +			if (has_64K_pages) {
>> +				if (IS_ALIGNED(total, SZ_2M)) {
>> +					offset = round_up(offset, SZ_64K);
>> +				} else {
>> +					dword_rem = SZ_2M - (total & (SZ_2M - 1));
>> +					dword_rem /= page_size;
>> +					dword_rem *= 2;
>> +				}
>> +			}
>> +
>> +			pkt = min_t(int, dword_rem, ring->space / sizeof(u32) + 5);
>>   			pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
>>   
>>   			hdr = cs;
>> @@ -322,13 +411,15 @@ static int emit_pte(struct i915_request *rq,
>>   			*cs++ = upper_32_bits(offset);
>>   		}
>>   
>> +		GEM_BUG_ON(!IS_ALIGNED(it->dma, page_size));
>> +
>>   		*cs++ = lower_32_bits(encode | it->dma);
>>   		*cs++ = upper_32_bits(encode | it->dma);
>>   
>>   		offset += 8;
>> -		total += I915_GTT_PAGE_SIZE;
>> +		total += page_size;
>>   
>> -		it->dma += I915_GTT_PAGE_SIZE;
>> +		it->dma += page_size;
>>   		if (it->dma >= it->max) {
>>   			it->sg = __sg_next(it->sg);
>>   			if (!it->sg || sg_dma_len(it->sg) == 0)
>> @@ -359,7 +450,8 @@ static bool wa_1209644611_applies(int ver, u32 size)
>>   	return height % 4 == 3 && height <= 8;
>>   }
>>   
>> -static int emit_copy(struct i915_request *rq, int size)
>> +static int emit_copy(struct i915_request *rq,
>> +		     u32 dst_offset, u32 src_offset, int size)
>>   {
>>   	const int ver = GRAPHICS_VER(rq->engine->i915);
>>   	u32 instance = rq->engine->instance;
>> @@ -374,31 +466,31 @@ static int emit_copy(struct i915_request *rq, int size)
>>   		*cs++ = BLT_DEPTH_32 | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = CHUNK_SZ; /* dst offset */
>> +		*cs++ = dst_offset;
>>   		*cs++ = instance;
>>   		*cs++ = 0;
>>   		*cs++ = PAGE_SIZE;
>> -		*cs++ = 0; /* src offset */
>> +		*cs++ = src_offset;
>>   		*cs++ = instance;
>>   	} else if (ver >= 8) {
>>   		*cs++ = XY_SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (10 - 2);
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = CHUNK_SZ; /* dst offset */
>> +		*cs++ = dst_offset;
>>   		*cs++ = instance;
>>   		*cs++ = 0;
>>   		*cs++ = PAGE_SIZE;
>> -		*cs++ = 0; /* src offset */
>> +		*cs++ = src_offset;
>>   		*cs++ = instance;
>>   	} else {
>>   		GEM_BUG_ON(instance);
>>   		*cs++ = SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE;
>> -		*cs++ = CHUNK_SZ; /* dst offset */
>> +		*cs++ = dst_offset;
>>   		*cs++ = PAGE_SIZE;
>> -		*cs++ = 0; /* src offset */
>> +		*cs++ = src_offset;
>>   	}
>>   
>>   	intel_ring_advance(rq, cs);
>> @@ -426,6 +518,7 @@ intel_context_migrate_copy(struct intel_context *ce,
>>   	GEM_BUG_ON(ce->ring->size < SZ_64K);
>>   
>>   	do {
>> +		u32 src_offset, dst_offset;
>>   		int len;
>>   
>>   		rq = i915_request_create(ce);
>> @@ -453,15 +546,28 @@ intel_context_migrate_copy(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem, 0,
>> -			       CHUNK_SZ);
>> +		src_offset = 0;
>> +		dst_offset = CHUNK_SZ;
>> +		if (HAS_64K_PAGES(ce->engine->i915)) {
>> +			GEM_BUG_ON(!src_is_lmem && !dst_is_lmem);
>> +
>> +			src_offset = 0;
>> +			dst_offset = 0;
>> +			if (src_is_lmem)
>> +				src_offset = CHUNK_SZ;
>> +			if (dst_is_lmem)
>> +				dst_offset = 2 * CHUNK_SZ;
>> +		}
>> +
>> +		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem,
>> +			       src_offset, CHUNK_SZ);
>>   		if (len <= 0) {
>>   			err = len;
>>   			goto out_rq;
>>   		}
>>   
>>   		err = emit_pte(rq, &it_dst, dst_cache_level, dst_is_lmem,
>> -			       CHUNK_SZ, len);
>> +			       dst_offset, len);
>>   		if (err < 0)
>>   			goto out_rq;
>>   		if (err < len) {
>> @@ -473,7 +579,7 @@ intel_context_migrate_copy(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		err = emit_copy(rq, len);
>> +		err = emit_copy(rq, dst_offset, src_offset, len);
>>   
>>   		/* Arbitration is re-enabled between requests. */
>>   out_rq:
>> @@ -571,18 +677,20 @@ static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
>>   }
>>   
>>   static int emit_clear(struct i915_request *rq,
>> +		      u64 offset,
>>   		      int size,
>>   		      u32 value,
>>   		      bool is_lmem)
>>   {
>> -	const int ver = GRAPHICS_VER(rq->engine->i915);
>> -	u32 instance = rq->engine->instance;
>> -	u32 *cs;
>>   	struct drm_i915_private *i915 = rq->engine->i915;
>> +	const int ver = GRAPHICS_VER(rq->engine->i915);
>>   	u32 num_ccs_blks, ccs_ring_size;
>> +	u32 *cs;
>>   
>>   	GEM_BUG_ON(size >> PAGE_SHIFT > S16_MAX);
>>   
>> +	offset += (u64)rq->engine->instance << 32;
>> +
>>   	/* Clear flat css only when value is 0 */
>>   	ccs_ring_size = (is_lmem && !value) ?
>>   			 calc_ctrl_surf_instr_size(i915, size)
>> @@ -597,17 +705,17 @@ static int emit_clear(struct i915_request *rq,
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = 0; /* offset */
>> -		*cs++ = instance;
>> +		*cs++ = lower_32_bits(offset);
>> +		*cs++ = upper_32_bits(offset);
>>   		*cs++ = value;
>>   		*cs++ = MI_NOOP;
>>   	} else {
>> -		GEM_BUG_ON(instance);
>> +		GEM_BUG_ON(upper_32_bits(offset));
>>   		*cs++ = XY_COLOR_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = 0;
>> +		*cs++ = lower_32_bits(offset);
>>   		*cs++ = value;
>>   	}
>>   
>> @@ -623,17 +731,15 @@ static int emit_clear(struct i915_request *rq,
>>   		 * and use it as a source.
>>   		 */
>>   
>> -		cs = i915_flush_dw(cs, (u64)instance << 32,
>> -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
>> +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
>>   		cs = _i915_ctrl_surf_copy_blt(cs,
>> -					      (u64)instance << 32,
>> -					      (u64)instance << 32,
>> +					      offset,
>> +					      offset,
>>   					      DIRECT_ACCESS,
>>   					      INDIRECT_ACCESS,
>>   					      1, 1,
>>   					      num_ccs_blks);
>> -		cs = i915_flush_dw(cs, (u64)instance << 32,
>> -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
>> +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
>>   	}
>>   	intel_ring_advance(rq, cs);
>>   	return 0;
>> @@ -658,6 +764,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>>   	GEM_BUG_ON(ce->ring->size < SZ_64K);
>>   
>>   	do {
>> +		u32 offset;
>>   		int len;
>>   
>>   		rq = i915_request_create(ce);
>> @@ -685,7 +792,11 @@ intel_context_migrate_clear(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		len = emit_pte(rq, &it, cache_level, is_lmem, 0, CHUNK_SZ);
>> +		offset = 0;
>> +		if (HAS_64K_PAGES(ce->engine->i915) && is_lmem)
>> +			offset = CHUNK_SZ;
>> +
>> +		len = emit_pte(rq, &it, cache_level, is_lmem, offset, CHUNK_SZ);
>>   		if (len <= 0) {
>>   			err = len;
>>   			goto out_rq;
>> @@ -695,7 +806,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		err = emit_clear(rq, len, value, is_lmem);
>> +		err = emit_clear(rq, offset, len, value, is_lmem);
>>   
>>   		/* Arbitration is re-enabled between requests. */
>>   out_rq:
>> -- 
>> 2.31.1
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH v3 7/8] drm/i915/migrate: add acceleration support for DG2
@ 2021-12-14 12:32       ` Matthew Auld
  0 siblings, 0 replies; 31+ messages in thread
From: Matthew Auld @ 2021-12-14 12:32 UTC (permalink / raw)
  To: Ramalingam C; +Cc: Thomas Hellström, intel-gfx, dri-devel

On 14/12/2021 10:56, Ramalingam C wrote:
> On 2021-12-06 at 13:31:39 +0000, Matthew Auld wrote:
>> This is all kinds of awkward since we now have to contend with using 64K
>> GTT pages when mapping anything in LMEM(including the page-tables
>> themselves).
>>
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Ramalingam C <ramalingam.c@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_migrate.c | 189 +++++++++++++++++++-----
>>   1 file changed, 150 insertions(+), 39 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
>> index 0192b61ab541..fb658ae70a8d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
>> @@ -33,6 +33,38 @@ static bool engine_supports_migration(struct intel_engine_cs *engine)
>>   	return true;
>>   }
>>   
>> +static void xehpsdv_toggle_pdes(struct i915_address_space *vm,
>> +				struct i915_page_table *pt,
>> +				void *data)
>> +{
>> +	struct insert_pte_data *d = data;
>> +
>> +	/*
>> +	 * Insert a dummy PTE into every PT that will map to LMEM to ensure
>> +	 * we have a correctly setup PDE structure for later use.
>> +	 */
>> +	vm->insert_page(vm, 0, d->offset, I915_CACHE_NONE, PTE_LM);
> This part i am not understanding. Why do we need to insert the dummy
> PTE here.?

We have three windows, each CHUNK_SZ in size. The first is reserved for
mapping system memory, and that just uses the normal 512-entry layout
with 4K GTT pages. The other two windows only ever map lmem pages and
must use the new compact 32-entry layout with 64K GTT pages, which
ensures we can address any lmem object that the user throws at us. The
above is basically just toggling the PDE bit (GEN12_PDE_64K) for us, to
enable the compact layout for each of the page-tables that fall within
the 2 * CHUNK_SZ range starting at CHUNK_SZ.
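
If it helps, something along these lines could be folded into the
comment above xehpsdv_toggle_pdes() -- just a sketch, the wording is
not final:

	/*
	 * Every PT in the two lmem windows must use the compact 32-entry
	 * 64K layout. Writing one dummy LMEM PTE per PT here forces
	 * GEN12_PDE_64K into the corresponding PDE up front, so the
	 * compact layout is already in place before the GPU later
	 * rewrites the PTEs on its own.
	 */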


>> +	GEM_BUG_ON(!pt->is_compact);
>> +	d->offset += SZ_2M;
>> +}
>> +
>> +static void xehpsdv_insert_pte(struct i915_address_space *vm,
>> +			       struct i915_page_table *pt,
>> +			       void *data)
>> +{
>> +	struct insert_pte_data *d = data;
>> +
>> +	/*
>> +	 * We are playing tricks here, since the actual pt, from the hw
>> +	 * pov, is only 256bytes with 32 entries, or 4096bytes with 512
>> +	 * entries, but we are still guaranteed that the physical
>> +	 * alignment is 64K underneath for the pt, and we are careful
>> +	 * not to access the space in the void.
>> +	 */
>> +	vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE, PTE_LM);
>> +	d->offset += SZ_64K;
>> +}
>> +
>>   static void insert_pte(struct i915_address_space *vm,
>>   		       struct i915_page_table *pt,
>>   		       void *data)
>> @@ -75,7 +107,12 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   	 * i.e. within the same non-preemptible window so that we do not switch
>>   	 * to another migration context that overwrites the PTE.
>>   	 *
>> -	 * TODO: Add support for huge LMEM PTEs
>> +	 * On platforms with HAS_64K_PAGES support we have three windows, and
>> +	 * dedicate two windows just for mapping lmem pages(smem <-> smem is not
>> +	 * a thing), since we are forced to use 64K GTT pages underneath which
>> +	 * requires also modifying the PDE. An alternative might be to instead
>> +	 * map the PD into the GTT, and then on the fly toggle the 4K/64K mode
>> +	 * in the PDE from the same batch that also modifies the PTEs.
> Could we also add a layout of the ppGTT, incase of HAS_64K_PAGES?

[0, CHUNK_SZ) -> first window, maps smem
[CHUNK_SZ, 2 * CHUNK_SZ) -> second window, maps lmem src
[2 * CHUNK_SZ, 3 * CHUNK_SZ) -> third window, maps lmem dst

It starts to get strange here, since each PT lives in lmem and so must
be mapped with its own 64K GTT page, even though the PT itself is at
most 4096 bytes; but since the unused space within that 64K range is
never touched, this should be fine.

So basically each PT now needs 64K of virtual memory to map it, instead
of 4K. Something like:

[3 * CHUNK_SZ, 3 * CHUNK_SZ + ((3 * CHUNK_SZ / SZ_2M) * SZ_64K)] -> PTE

And then later, when writing out the PTEs, we know whether the layout
within a particular PT is 512 or 32 entries, depending on whether we
are mapping lmem or not.
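
Expressed as code, the 64K slot that maps a given PT ends up at roughly
(hypothetical helper, for illustration only -- it mirrors the offset
maths in the emit_pte() hunk below, ignoring the per-engine instance
offset):

	static u64 pt_slot_offset(u64 gtt_offset)
	{
		/* one PT per 2M of VA, each PT mapped through a 64K slot */
		return 3 * CHUNK_SZ + (gtt_offset / SZ_2M) * SZ_64K;
	}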

>>   	 */
>>   
>>   	vm = i915_ppgtt_create(gt, I915_BO_ALLOC_PM_EARLY);
>> @@ -87,6 +124,9 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   		goto err_vm;
>>   	}
>>   
>> +	if (HAS_64K_PAGES(gt->i915))
>> +		stash.pt_sz = I915_GTT_PAGE_SIZE_64K;
>> +
>>   	/*
>>   	 * Each engine instance is assigned its own chunk in the VM, so
>>   	 * that we can run multiple instances concurrently
>> @@ -106,14 +146,20 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   		 * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
>>   		 * 4x2 page directories for source/destination.
>>   		 */
>> -		sz = 2 * CHUNK_SZ;
>> +		if (HAS_64K_PAGES(gt->i915))
>> +			sz = 3 * CHUNK_SZ;
>> +		else
>> +			sz = 2 * CHUNK_SZ;
>>   		d.offset = base + sz;
>>   
>>   		/*
>>   		 * We need another page directory setup so that we can write
>>   		 * the 8x512 PTE in each chunk.
>>   		 */
>> -		sz += (sz >> 12) * sizeof(u64);
>> +		if (HAS_64K_PAGES(gt->i915))
>> +			sz += (sz / SZ_2M) * SZ_64K;
>> +		else
>> +			sz += (sz >> 12) * sizeof(u64);
> Here for 4K page support, per page we assume the u64 as the length required. But
> for 64k page support we calculate the no of PDE and per PDE we allocate
> the 64k page so that we can map it for edit right?
> 
> In this case i assume we have the unused space at the end. say after
> 32*sizeof(u64)

For every PT (which covers a 2M VA range), we need a 64K VA chunk in
order to map it. Yes, there is some unused space at the end, but it is
already like that for the other PTEs. Also, it seems strange to call
alloc_va_range without also rounding the size up to the correct page
size.

> 
> Ram
>>   
>>   		err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
>>   		if (err)
>> @@ -134,7 +180,18 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>   			goto err_vm;
>>   
>>   		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
>> -		vm->vm.foreach(&vm->vm, base, d.offset - base, insert_pte, &d);
>> +		if (HAS_64K_PAGES(gt->i915)) {
>> +			vm->vm.foreach(&vm->vm, base, d.offset - base,
>> +				       xehpsdv_insert_pte, &d);
>> +			d.offset = base + CHUNK_SZ;
>> +			vm->vm.foreach(&vm->vm,
>> +				       d.offset,
>> +				       2 * CHUNK_SZ,
>> +				       xehpsdv_toggle_pdes, &d);
>> +		} else {
>> +			vm->vm.foreach(&vm->vm, base, d.offset - base,
>> +				       insert_pte, &d);
>> +		}
>>   	}
>>   
>>   	return &vm->vm;
>> @@ -272,19 +329,38 @@ static int emit_pte(struct i915_request *rq,
>>   		    u64 offset,
>>   		    int length)
>>   {
>> +	bool has_64K_pages = HAS_64K_PAGES(rq->engine->i915);
>>   	const u64 encode = rq->context->vm->pte_encode(0, cache_level,
>>   						       is_lmem ? PTE_LM : 0);
>>   	struct intel_ring *ring = rq->ring;
>> -	int total = 0;
>> +	int pkt, dword_length;
>> +	u32 total = 0;
>> +	u32 page_size;
>>   	u32 *hdr, *cs;
>> -	int pkt;
>>   
>>   	GEM_BUG_ON(GRAPHICS_VER(rq->engine->i915) < 8);
>>   
>> +	page_size = I915_GTT_PAGE_SIZE;
>> +	dword_length = 0x400;
>> +
>>   	/* Compute the page directory offset for the target address range */
>> -	offset >>= 12;
>> -	offset *= sizeof(u64);
>> -	offset += 2 * CHUNK_SZ;
>> +	if (has_64K_pages) {
>> +		GEM_BUG_ON(!IS_ALIGNED(offset, SZ_2M));
>> +
>> +		offset /= SZ_2M;
>> +		offset *= SZ_64K;
>> +		offset += 3 * CHUNK_SZ;
>> +
>> +		if (is_lmem) {
>> +			page_size = I915_GTT_PAGE_SIZE_64K;
>> +			dword_length = 0x40;
>> +		}
>> +	} else {
>> +		offset >>= 12;
>> +		offset *= sizeof(u64);
>> +		offset += 2 * CHUNK_SZ;
>> +	}
>> +
>>   	offset += (u64)rq->engine->instance << 32;
>>   
>>   	cs = intel_ring_begin(rq, 6);
>> @@ -292,7 +368,7 @@ static int emit_pte(struct i915_request *rq,
>>   		return PTR_ERR(cs);
>>   
>>   	/* Pack as many PTE updates as possible into a single MI command */
>> -	pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
>> +	pkt = min_t(int, dword_length, ring->space / sizeof(u32) + 5);
>>   	pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
>>   
>>   	hdr = cs;
>> @@ -302,6 +378,8 @@ static int emit_pte(struct i915_request *rq,
>>   
>>   	do {
>>   		if (cs - hdr >= pkt) {
>> +			int dword_rem;
>> +
>>   			*hdr += cs - hdr - 2;
>>   			*cs++ = MI_NOOP;
>>   
>> @@ -313,7 +391,18 @@ static int emit_pte(struct i915_request *rq,
>>   			if (IS_ERR(cs))
>>   				return PTR_ERR(cs);
>>   
>> -			pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
>> +			dword_rem = dword_length;
>> +			if (has_64K_pages) {
>> +				if (IS_ALIGNED(total, SZ_2M)) {
>> +					offset = round_up(offset, SZ_64K);
>> +				} else {
>> +					dword_rem = SZ_2M - (total & (SZ_2M - 1));
>> +					dword_rem /= page_size;
>> +					dword_rem *= 2;
>> +				}
>> +			}
>> +
>> +			pkt = min_t(int, dword_rem, ring->space / sizeof(u32) + 5);
>>   			pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
>>   
>>   			hdr = cs;
>> @@ -322,13 +411,15 @@ static int emit_pte(struct i915_request *rq,
>>   			*cs++ = upper_32_bits(offset);
>>   		}
>>   
>> +		GEM_BUG_ON(!IS_ALIGNED(it->dma, page_size));
>> +
>>   		*cs++ = lower_32_bits(encode | it->dma);
>>   		*cs++ = upper_32_bits(encode | it->dma);
>>   
>>   		offset += 8;
>> -		total += I915_GTT_PAGE_SIZE;
>> +		total += page_size;
>>   
>> -		it->dma += I915_GTT_PAGE_SIZE;
>> +		it->dma += page_size;
>>   		if (it->dma >= it->max) {
>>   			it->sg = __sg_next(it->sg);
>>   			if (!it->sg || sg_dma_len(it->sg) == 0)
>> @@ -359,7 +450,8 @@ static bool wa_1209644611_applies(int ver, u32 size)
>>   	return height % 4 == 3 && height <= 8;
>>   }
>>   
>> -static int emit_copy(struct i915_request *rq, int size)
>> +static int emit_copy(struct i915_request *rq,
>> +		     u32 dst_offset, u32 src_offset, int size)
>>   {
>>   	const int ver = GRAPHICS_VER(rq->engine->i915);
>>   	u32 instance = rq->engine->instance;
>> @@ -374,31 +466,31 @@ static int emit_copy(struct i915_request *rq, int size)
>>   		*cs++ = BLT_DEPTH_32 | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = CHUNK_SZ; /* dst offset */
>> +		*cs++ = dst_offset;
>>   		*cs++ = instance;
>>   		*cs++ = 0;
>>   		*cs++ = PAGE_SIZE;
>> -		*cs++ = 0; /* src offset */
>> +		*cs++ = src_offset;
>>   		*cs++ = instance;
>>   	} else if (ver >= 8) {
>>   		*cs++ = XY_SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (10 - 2);
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = CHUNK_SZ; /* dst offset */
>> +		*cs++ = dst_offset;
>>   		*cs++ = instance;
>>   		*cs++ = 0;
>>   		*cs++ = PAGE_SIZE;
>> -		*cs++ = 0; /* src offset */
>> +		*cs++ = src_offset;
>>   		*cs++ = instance;
>>   	} else {
>>   		GEM_BUG_ON(instance);
>>   		*cs++ = SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE;
>> -		*cs++ = CHUNK_SZ; /* dst offset */
>> +		*cs++ = dst_offset;
>>   		*cs++ = PAGE_SIZE;
>> -		*cs++ = 0; /* src offset */
>> +		*cs++ = src_offset;
>>   	}
>>   
>>   	intel_ring_advance(rq, cs);
>> @@ -426,6 +518,7 @@ intel_context_migrate_copy(struct intel_context *ce,
>>   	GEM_BUG_ON(ce->ring->size < SZ_64K);
>>   
>>   	do {
>> +		u32 src_offset, dst_offset;
>>   		int len;
>>   
>>   		rq = i915_request_create(ce);
>> @@ -453,15 +546,28 @@ intel_context_migrate_copy(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem, 0,
>> -			       CHUNK_SZ);
>> +		src_offset = 0;
>> +		dst_offset = CHUNK_SZ;
>> +		if (HAS_64K_PAGES(ce->engine->i915)) {
>> +			GEM_BUG_ON(!src_is_lmem && !dst_is_lmem);
>> +
>> +			src_offset = 0;
>> +			dst_offset = 0;
>> +			if (src_is_lmem)
>> +				src_offset = CHUNK_SZ;
>> +			if (dst_is_lmem)
>> +				dst_offset = 2 * CHUNK_SZ;
>> +		}
>> +
>> +		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem,
>> +			       src_offset, CHUNK_SZ);
>>   		if (len <= 0) {
>>   			err = len;
>>   			goto out_rq;
>>   		}
>>   
>>   		err = emit_pte(rq, &it_dst, dst_cache_level, dst_is_lmem,
>> -			       CHUNK_SZ, len);
>> +			       dst_offset, len);
>>   		if (err < 0)
>>   			goto out_rq;
>>   		if (err < len) {
>> @@ -473,7 +579,7 @@ intel_context_migrate_copy(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		err = emit_copy(rq, len);
>> +		err = emit_copy(rq, dst_offset, src_offset, len);
>>   
>>   		/* Arbitration is re-enabled between requests. */
>>   out_rq:
>> @@ -571,18 +677,20 @@ static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
>>   }
>>   
>>   static int emit_clear(struct i915_request *rq,
>> +		      u64 offset,
>>   		      int size,
>>   		      u32 value,
>>   		      bool is_lmem)
>>   {
>> -	const int ver = GRAPHICS_VER(rq->engine->i915);
>> -	u32 instance = rq->engine->instance;
>> -	u32 *cs;
>>   	struct drm_i915_private *i915 = rq->engine->i915;
>> +	const int ver = GRAPHICS_VER(rq->engine->i915);
>>   	u32 num_ccs_blks, ccs_ring_size;
>> +	u32 *cs;
>>   
>>   	GEM_BUG_ON(size >> PAGE_SHIFT > S16_MAX);
>>   
>> +	offset += (u64)rq->engine->instance << 32;
>> +
>>   	/* Clear flat css only when value is 0 */
>>   	ccs_ring_size = (is_lmem && !value) ?
>>   			 calc_ctrl_surf_instr_size(i915, size)
>> @@ -597,17 +705,17 @@ static int emit_clear(struct i915_request *rq,
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = 0; /* offset */
>> -		*cs++ = instance;
>> +		*cs++ = lower_32_bits(offset);
>> +		*cs++ = upper_32_bits(offset);
>>   		*cs++ = value;
>>   		*cs++ = MI_NOOP;
>>   	} else {
>> -		GEM_BUG_ON(instance);
>> +		GEM_BUG_ON(upper_32_bits(offset));
>>   		*cs++ = XY_COLOR_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
>>   		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
>>   		*cs++ = 0;
>>   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
>> -		*cs++ = 0;
>> +		*cs++ = lower_32_bits(offset);
>>   		*cs++ = value;
>>   	}
>>   
>> @@ -623,17 +731,15 @@ static int emit_clear(struct i915_request *rq,
>>   		 * and use it as a source.
>>   		 */
>>   
>> -		cs = i915_flush_dw(cs, (u64)instance << 32,
>> -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
>> +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
>>   		cs = _i915_ctrl_surf_copy_blt(cs,
>> -					      (u64)instance << 32,
>> -					      (u64)instance << 32,
>> +					      offset,
>> +					      offset,
>>   					      DIRECT_ACCESS,
>>   					      INDIRECT_ACCESS,
>>   					      1, 1,
>>   					      num_ccs_blks);
>> -		cs = i915_flush_dw(cs, (u64)instance << 32,
>> -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
>> +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
>>   	}
>>   	intel_ring_advance(rq, cs);
>>   	return 0;
>> @@ -658,6 +764,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>>   	GEM_BUG_ON(ce->ring->size < SZ_64K);
>>   
>>   	do {
>> +		u32 offset;
>>   		int len;
>>   
>>   		rq = i915_request_create(ce);
>> @@ -685,7 +792,11 @@ intel_context_migrate_clear(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		len = emit_pte(rq, &it, cache_level, is_lmem, 0, CHUNK_SZ);
>> +		offset = 0;
>> +		if (HAS_64K_PAGES(ce->engine->i915) && is_lmem)
>> +			offset = CHUNK_SZ;
>> +
>> +		len = emit_pte(rq, &it, cache_level, is_lmem, offset, CHUNK_SZ);
>>   		if (len <= 0) {
>>   			err = len;
>>   			goto out_rq;
>> @@ -695,7 +806,7 @@ intel_context_migrate_clear(struct intel_context *ce,
>>   		if (err)
>>   			goto out_rq;
>>   
>> -		err = emit_clear(rq, len, value, is_lmem);
>> +		err = emit_clear(rq, offset, len, value, is_lmem);
>>   
>>   		/* Arbitration is re-enabled between requests. */
>>   out_rq:
>> -- 
>> 2.31.1
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 7/8] drm/i915/migrate: add acceleration support for DG2
  2021-12-14 12:32       ` [Intel-gfx] " Matthew Auld
@ 2021-12-16 15:01         ` Ramalingam C
  -1 siblings, 0 replies; 31+ messages in thread
From: Ramalingam C @ 2021-12-16 15:01 UTC (permalink / raw)
  To: Matthew Auld; +Cc: Thomas Hellström, intel-gfx, dri-devel

On 2021-12-14 at 12:32:57 +0000, Matthew Auld wrote:
> On 14/12/2021 10:56, Ramalingam C wrote:
> > On 2021-12-06 at 13:31:39 +0000, Matthew Auld wrote:
> > > This is all kinds of awkward since we now have to contend with using 64K
> > > GTT pages when mapping anything in LMEM(including the page-tables
> > > themselves).
> > > 
> > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Ramalingam C <ramalingam.c@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_migrate.c | 189 +++++++++++++++++++-----
> > >   1 file changed, 150 insertions(+), 39 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > index 0192b61ab541..fb658ae70a8d 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > @@ -33,6 +33,38 @@ static bool engine_supports_migration(struct intel_engine_cs *engine)
> > >   	return true;
> > >   }
> > > +static void xehpsdv_toggle_pdes(struct i915_address_space *vm,
> > > +				struct i915_page_table *pt,
> > > +				void *data)
> > > +{
> > > +	struct insert_pte_data *d = data;
> > > +
> > > +	/*
> > > +	 * Insert a dummy PTE into every PT that will map to LMEM to ensure
> > > +	 * we have a correctly setup PDE structure for later use.
> > > +	 */
> > > +	vm->insert_page(vm, 0, d->offset, I915_CACHE_NONE, PTE_LM);
> > This part i am not understanding. Why do we need to insert the dummy
> > PTE here.?
> 
> We have three windows, each CHUNK_SIZE in size. The first is reserved for
> mapping system-memory, and that just uses the 512 entry layout using 4K GTT
> pages. The other two windows just map lmem pages and must use the new
> compact 32 entry layout using 64K GTT pages, which ensures we can address
> any lmem object that the user throws at us. The above is basically just
> toggling the PDE bit(GEN12_PDE_64K) for us, to enable the compact layout for
> each of these page-tables, that fall within the 2 * CHUNK_SIZE range
> starting at CHUNK_SIZE.

If we could summarize this in the comment, it would be helpful.
Apart from this, the change looks good to me.

Reviewed-by: Ramalingam C <ramalingam.c@intel.com>

> 
> 
> > > +	GEM_BUG_ON(!pt->is_compact);
> > > +	d->offset += SZ_2M;
> > > +}
> > > +
> > > +static void xehpsdv_insert_pte(struct i915_address_space *vm,
> > > +			       struct i915_page_table *pt,
> > > +			       void *data)
> > > +{
> > > +	struct insert_pte_data *d = data;
> > > +
> > > +	/*
> > > +	 * We are playing tricks here, since the actual pt, from the hw
> > > +	 * pov, is only 256bytes with 32 entries, or 4096bytes with 512
> > > +	 * entries, but we are still guaranteed that the physical
> > > +	 * alignment is 64K underneath for the pt, and we are careful
> > > +	 * not to access the space in the void.
> > > +	 */
> > > +	vm->insert_page(vm, px_dma(pt), d->offset, I915_CACHE_NONE, PTE_LM);
> > > +	d->offset += SZ_64K;
> > > +}
> > > +
> > >   static void insert_pte(struct i915_address_space *vm,
> > >   		       struct i915_page_table *pt,
> > >   		       void *data)
> > > @@ -75,7 +107,12 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> > >   	 * i.e. within the same non-preemptible window so that we do not switch
> > >   	 * to another migration context that overwrites the PTE.
> > >   	 *
> > > -	 * TODO: Add support for huge LMEM PTEs
> > > +	 * On platforms with HAS_64K_PAGES support we have three windows, and
> > > +	 * dedicate two windows just for mapping lmem pages(smem <-> smem is not
> > > +	 * a thing), since we are forced to use 64K GTT pages underneath which
> > > +	 * requires also modifying the PDE. An alternative might be to instead
> > > +	 * map the PD into the GTT, and then on the fly toggle the 4K/64K mode
> > > +	 * in the PDE from the same batch that also modifies the PTEs.
> > Could we also add a layout of the ppGTT, incase of HAS_64K_PAGES?
> 
> [0, CHUNK_SZ) -> first window, maps smem
> [CHUNK_SZ, 2 * CHUNK_SZ) -> second window, maps lmem src
> [2 * CHUNK_SZ, 3 * CHUNK_SZ) -> third window, maps lmem dst
> 
> It starts to get strange here, since each PTE must point to some 64K page,
> one for each PT(since it's in lmem), and yet each is only <= 4096bytes, but
> since the unused space within that PTE range is never touched, this should
> be fine.
> 
> So basically each PT now needs 64K of virtual memory, instead of 4K. So
> something like:
> 
> [3 * CHUNK_SZ, 3 * CHUNK_SZ + ((3 * CHUNK_SZ / SZ_2M) * SZ_64K)] -> PTE
> 
> And then later when writing out the PTEs we know if the layout within a
> particular PT is 512 vs 32 depending on if we are mapping lmem or not.
> 
> > >   	 */
> > >   	vm = i915_ppgtt_create(gt, I915_BO_ALLOC_PM_EARLY);
> > > @@ -87,6 +124,9 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> > >   		goto err_vm;
> > >   	}
> > > +	if (HAS_64K_PAGES(gt->i915))
> > > +		stash.pt_sz = I915_GTT_PAGE_SIZE_64K;
> > > +
> > >   	/*
> > >   	 * Each engine instance is assigned its own chunk in the VM, so
> > >   	 * that we can run multiple instances concurrently
> > > @@ -106,14 +146,20 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> > >   		 * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
> > >   		 * 4x2 page directories for source/destination.
> > >   		 */
> > > -		sz = 2 * CHUNK_SZ;
> > > +		if (HAS_64K_PAGES(gt->i915))
> > > +			sz = 3 * CHUNK_SZ;
> > > +		else
> > > +			sz = 2 * CHUNK_SZ;
> > >   		d.offset = base + sz;
> > >   		/*
> > >   		 * We need another page directory setup so that we can write
> > >   		 * the 8x512 PTE in each chunk.
> > >   		 */
> > > -		sz += (sz >> 12) * sizeof(u64);
> > > +		if (HAS_64K_PAGES(gt->i915))
> > > +			sz += (sz / SZ_2M) * SZ_64K;
> > > +		else
> > > +			sz += (sz >> 12) * sizeof(u64);
> > Here for 4K page support, per page we assume the u64 as the length required. But
> > for 64k page support we calculate the no of PDE and per PDE we allocate
> > the 64k page so that we can map it for edit right?
> > 
> > In this case i assume we have the unused space at the end. say after
> > 32*sizeof(u64)
> 
> For every PT(which is 2M va range), we need a 64K va chunk in order to map
> it. Yes, there is some unused space at the end, but it is already like that
> for the other PTEs. Also it seems strange to call alloc_va_range without
> also rounding up the size to the correct page size.
> 
> > 
> > Ram
> > >   		err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
> > >   		if (err)
> > > @@ -134,7 +180,18 @@ static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> > >   			goto err_vm;
> > >   		/* Now allow the GPU to rewrite the PTE via its own ppGTT */
> > > -		vm->vm.foreach(&vm->vm, base, d.offset - base, insert_pte, &d);
> > > +		if (HAS_64K_PAGES(gt->i915)) {
> > > +			vm->vm.foreach(&vm->vm, base, d.offset - base,
> > > +				       xehpsdv_insert_pte, &d);
> > > +			d.offset = base + CHUNK_SZ;
> > > +			vm->vm.foreach(&vm->vm,
> > > +				       d.offset,
> > > +				       2 * CHUNK_SZ,
> > > +				       xehpsdv_toggle_pdes, &d);
> > > +		} else {
> > > +			vm->vm.foreach(&vm->vm, base, d.offset - base,
> > > +				       insert_pte, &d);
> > > +		}
> > >   	}
> > >   	return &vm->vm;
> > > @@ -272,19 +329,38 @@ static int emit_pte(struct i915_request *rq,
> > >   		    u64 offset,
> > >   		    int length)
> > >   {
> > > +	bool has_64K_pages = HAS_64K_PAGES(rq->engine->i915);
> > >   	const u64 encode = rq->context->vm->pte_encode(0, cache_level,
> > >   						       is_lmem ? PTE_LM : 0);
> > >   	struct intel_ring *ring = rq->ring;
> > > -	int total = 0;
> > > +	int pkt, dword_length;
> > > +	u32 total = 0;
> > > +	u32 page_size;
> > >   	u32 *hdr, *cs;
> > > -	int pkt;
> > >   	GEM_BUG_ON(GRAPHICS_VER(rq->engine->i915) < 8);
> > > +	page_size = I915_GTT_PAGE_SIZE;
> > > +	dword_length = 0x400;
> > > +
> > >   	/* Compute the page directory offset for the target address range */
> > > -	offset >>= 12;
> > > -	offset *= sizeof(u64);
> > > -	offset += 2 * CHUNK_SZ;
> > > +	if (has_64K_pages) {
> > > +		GEM_BUG_ON(!IS_ALIGNED(offset, SZ_2M));
> > > +
> > > +		offset /= SZ_2M;
> > > +		offset *= SZ_64K;
> > > +		offset += 3 * CHUNK_SZ;
> > > +
> > > +		if (is_lmem) {
> > > +			page_size = I915_GTT_PAGE_SIZE_64K;
> > > +			dword_length = 0x40;
> > > +		}
> > > +	} else {
> > > +		offset >>= 12;
> > > +		offset *= sizeof(u64);
> > > +		offset += 2 * CHUNK_SZ;
> > > +	}
> > > +
> > >   	offset += (u64)rq->engine->instance << 32;
> > >   	cs = intel_ring_begin(rq, 6);
> > > @@ -292,7 +368,7 @@ static int emit_pte(struct i915_request *rq,
> > >   		return PTR_ERR(cs);
> > >   	/* Pack as many PTE updates as possible into a single MI command */
> > > -	pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
> > > +	pkt = min_t(int, dword_length, ring->space / sizeof(u32) + 5);
> > >   	pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
> > >   	hdr = cs;
> > > @@ -302,6 +378,8 @@ static int emit_pte(struct i915_request *rq,
> > >   	do {
> > >   		if (cs - hdr >= pkt) {
> > > +			int dword_rem;
> > > +
> > >   			*hdr += cs - hdr - 2;
> > >   			*cs++ = MI_NOOP;
> > > @@ -313,7 +391,18 @@ static int emit_pte(struct i915_request *rq,
> > >   			if (IS_ERR(cs))
> > >   				return PTR_ERR(cs);
> > > -			pkt = min_t(int, 0x400, ring->space / sizeof(u32) + 5);
> > > +			dword_rem = dword_length;
> > > +			if (has_64K_pages) {
> > > +				if (IS_ALIGNED(total, SZ_2M)) {
> > > +					offset = round_up(offset, SZ_64K);
> > > +				} else {
> > > +					dword_rem = SZ_2M - (total & (SZ_2M - 1));
> > > +					dword_rem /= page_size;
> > > +					dword_rem *= 2;
> > > +				}
> > > +			}
> > > +
> > > +			pkt = min_t(int, dword_rem, ring->space / sizeof(u32) + 5);
> > >   			pkt = min_t(int, pkt, (ring->size - ring->emit) / sizeof(u32) + 5);
> > >   			hdr = cs;
> > > @@ -322,13 +411,15 @@ static int emit_pte(struct i915_request *rq,
> > >   			*cs++ = upper_32_bits(offset);
> > >   		}
> > > +		GEM_BUG_ON(!IS_ALIGNED(it->dma, page_size));
> > > +
> > >   		*cs++ = lower_32_bits(encode | it->dma);
> > >   		*cs++ = upper_32_bits(encode | it->dma);
> > >   		offset += 8;
> > > -		total += I915_GTT_PAGE_SIZE;
> > > +		total += page_size;
> > > -		it->dma += I915_GTT_PAGE_SIZE;
> > > +		it->dma += page_size;
> > >   		if (it->dma >= it->max) {
> > >   			it->sg = __sg_next(it->sg);
> > >   			if (!it->sg || sg_dma_len(it->sg) == 0)
> > > @@ -359,7 +450,8 @@ static bool wa_1209644611_applies(int ver, u32 size)
> > >   	return height % 4 == 3 && height <= 8;
> > >   }
> > > -static int emit_copy(struct i915_request *rq, int size)
> > > +static int emit_copy(struct i915_request *rq,
> > > +		     u32 dst_offset, u32 src_offset, int size)
> > >   {
> > >   	const int ver = GRAPHICS_VER(rq->engine->i915);
> > >   	u32 instance = rq->engine->instance;
> > > @@ -374,31 +466,31 @@ static int emit_copy(struct i915_request *rq, int size)
> > >   		*cs++ = BLT_DEPTH_32 | PAGE_SIZE;
> > >   		*cs++ = 0;
> > >   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> > > -		*cs++ = CHUNK_SZ; /* dst offset */
> > > +		*cs++ = dst_offset;
> > >   		*cs++ = instance;
> > >   		*cs++ = 0;
> > >   		*cs++ = PAGE_SIZE;
> > > -		*cs++ = 0; /* src offset */
> > > +		*cs++ = src_offset;
> > >   		*cs++ = instance;
> > >   	} else if (ver >= 8) {
> > >   		*cs++ = XY_SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (10 - 2);
> > >   		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
> > >   		*cs++ = 0;
> > >   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> > > -		*cs++ = CHUNK_SZ; /* dst offset */
> > > +		*cs++ = dst_offset;
> > >   		*cs++ = instance;
> > >   		*cs++ = 0;
> > >   		*cs++ = PAGE_SIZE;
> > > -		*cs++ = 0; /* src offset */
> > > +		*cs++ = src_offset;
> > >   		*cs++ = instance;
> > >   	} else {
> > >   		GEM_BUG_ON(instance);
> > >   		*cs++ = SRC_COPY_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
> > >   		*cs++ = BLT_DEPTH_32 | BLT_ROP_SRC_COPY | PAGE_SIZE;
> > >   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE;
> > > -		*cs++ = CHUNK_SZ; /* dst offset */
> > > +		*cs++ = dst_offset;
> > >   		*cs++ = PAGE_SIZE;
> > > -		*cs++ = 0; /* src offset */
> > > +		*cs++ = src_offset;
> > >   	}
> > >   	intel_ring_advance(rq, cs);
> > > @@ -426,6 +518,7 @@ intel_context_migrate_copy(struct intel_context *ce,
> > >   	GEM_BUG_ON(ce->ring->size < SZ_64K);
> > >   	do {
> > > +		u32 src_offset, dst_offset;
> > >   		int len;
> > >   		rq = i915_request_create(ce);
> > > @@ -453,15 +546,28 @@ intel_context_migrate_copy(struct intel_context *ce,
> > >   		if (err)
> > >   			goto out_rq;
> > > -		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem, 0,
> > > -			       CHUNK_SZ);
> > > +		src_offset = 0;
> > > +		dst_offset = CHUNK_SZ;
> > > +		if (HAS_64K_PAGES(ce->engine->i915)) {
> > > +			GEM_BUG_ON(!src_is_lmem && !dst_is_lmem);
> > > +
> > > +			src_offset = 0;
> > > +			dst_offset = 0;
> > > +			if (src_is_lmem)
> > > +				src_offset = CHUNK_SZ;
> > > +			if (dst_is_lmem)
> > > +				dst_offset = 2 * CHUNK_SZ;
> > > +		}
> > > +
> > > +		len = emit_pte(rq, &it_src, src_cache_level, src_is_lmem,
> > > +			       src_offset, CHUNK_SZ);
> > >   		if (len <= 0) {
> > >   			err = len;
> > >   			goto out_rq;
> > >   		}
> > >   		err = emit_pte(rq, &it_dst, dst_cache_level, dst_is_lmem,
> > > -			       CHUNK_SZ, len);
> > > +			       dst_offset, len);
> > >   		if (err < 0)
> > >   			goto out_rq;
> > >   		if (err < len) {
> > > @@ -473,7 +579,7 @@ intel_context_migrate_copy(struct intel_context *ce,
> > >   		if (err)
> > >   			goto out_rq;
> > > -		err = emit_copy(rq, len);
> > > +		err = emit_copy(rq, dst_offset, src_offset, len);
> > >   		/* Arbitration is re-enabled between requests. */
> > >   out_rq:
> > > @@ -571,18 +677,20 @@ static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64 dst_addr,
> > >   }
> > >   static int emit_clear(struct i915_request *rq,
> > > +		      u64 offset,
> > >   		      int size,
> > >   		      u32 value,
> > >   		      bool is_lmem)
> > >   {
> > > -	const int ver = GRAPHICS_VER(rq->engine->i915);
> > > -	u32 instance = rq->engine->instance;
> > > -	u32 *cs;
> > >   	struct drm_i915_private *i915 = rq->engine->i915;
> > > +	const int ver = GRAPHICS_VER(rq->engine->i915);
> > >   	u32 num_ccs_blks, ccs_ring_size;
> > > +	u32 *cs;
> > >   	GEM_BUG_ON(size >> PAGE_SHIFT > S16_MAX);
> > > +	offset += (u64)rq->engine->instance << 32;
> > > +
> > >   	/* Clear flat css only when value is 0 */
> > >   	ccs_ring_size = (is_lmem && !value) ?
> > >   			 calc_ctrl_surf_instr_size(i915, size)
> > > @@ -597,17 +705,17 @@ static int emit_clear(struct i915_request *rq,
> > >   		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
> > >   		*cs++ = 0;
> > >   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> > > -		*cs++ = 0; /* offset */
> > > -		*cs++ = instance;
> > > +		*cs++ = lower_32_bits(offset);
> > > +		*cs++ = upper_32_bits(offset);
> > >   		*cs++ = value;
> > >   		*cs++ = MI_NOOP;
> > >   	} else {
> > > -		GEM_BUG_ON(instance);
> > > +		GEM_BUG_ON(upper_32_bits(offset));
> > >   		*cs++ = XY_COLOR_BLT_CMD | BLT_WRITE_RGBA | (6 - 2);
> > >   		*cs++ = BLT_DEPTH_32 | BLT_ROP_COLOR_COPY | PAGE_SIZE;
> > >   		*cs++ = 0;
> > >   		*cs++ = size >> PAGE_SHIFT << 16 | PAGE_SIZE / 4;
> > > -		*cs++ = 0;
> > > +		*cs++ = lower_32_bits(offset);
> > >   		*cs++ = value;
> > >   	}
> > > @@ -623,17 +731,15 @@ static int emit_clear(struct i915_request *rq,
> > >   		 * and use it as a source.
> > >   		 */
> > > -		cs = i915_flush_dw(cs, (u64)instance << 32,
> > > -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
> > > +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
> > >   		cs = _i915_ctrl_surf_copy_blt(cs,
> > > -					      (u64)instance << 32,
> > > -					      (u64)instance << 32,
> > > +					      offset,
> > > +					      offset,
> > >   					      DIRECT_ACCESS,
> > >   					      INDIRECT_ACCESS,
> > >   					      1, 1,
> > >   					      num_ccs_blks);
> > > -		cs = i915_flush_dw(cs, (u64)instance << 32,
> > > -				   MI_FLUSH_LLC | MI_FLUSH_CCS);
> > > +		cs = i915_flush_dw(cs, offset, MI_FLUSH_LLC | MI_FLUSH_CCS);
> > >   	}
> > >   	intel_ring_advance(rq, cs);
> > >   	return 0;
> > > @@ -658,6 +764,7 @@ intel_context_migrate_clear(struct intel_context *ce,
> > >   	GEM_BUG_ON(ce->ring->size < SZ_64K);
> > >   	do {
> > > +		u32 offset;
> > >   		int len;
> > >   		rq = i915_request_create(ce);
> > > @@ -685,7 +792,11 @@ intel_context_migrate_clear(struct intel_context *ce,
> > >   		if (err)
> > >   			goto out_rq;
> > > -		len = emit_pte(rq, &it, cache_level, is_lmem, 0, CHUNK_SZ);
> > > +		offset = 0;
> > > +		if (HAS_64K_PAGES(ce->engine->i915) && is_lmem)
> > > +			offset = CHUNK_SZ;
> > > +
> > > +		len = emit_pte(rq, &it, cache_level, is_lmem, offset, CHUNK_SZ);
> > >   		if (len <= 0) {
> > >   			err = len;
> > >   			goto out_rq;
> > > @@ -695,7 +806,7 @@ intel_context_migrate_clear(struct intel_context *ce,
> > >   		if (err)
> > >   			goto out_rq;
> > > -		err = emit_clear(rq, len, value, is_lmem);
> > > +		err = emit_clear(rq, offset, len, value, is_lmem);
> > >   		/* Arbitration is re-enabled between requests. */
> > >   out_rq:
> > > -- 
> > > 2.31.1
> > > 

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-12-16 15:03 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-06 13:31 [PATCH v3 0/8] DG2 accelerated migration/clearing support Matthew Auld
2021-12-06 13:31 ` [Intel-gfx] " Matthew Auld
2021-12-06 13:31 ` [PATCH v3 1/8] drm/i915/migrate: don't check the scratch page Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-06 13:31 ` [PATCH v3 2/8] drm/i915/migrate: fix offset calculation Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-06 13:31 ` [PATCH v3 3/8] drm/i915/migrate: fix length calculation Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-06 13:31 ` [PATCH v3 4/8] drm/i915/selftests: handle object rounding Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-06 13:31 ` [PATCH v3 5/8] drm/i915/gtt: allow overriding the pt alignment Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-13 15:32   ` Ramalingam C
2021-12-13 15:32     ` [Intel-gfx] " Ramalingam C
2021-12-06 13:31 ` [PATCH v3 6/8] drm/i915/gtt: add xehpsdv_ppgtt_insert_entry Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-06 13:31 ` [PATCH v3 7/8] drm/i915/migrate: add acceleration support for DG2 Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-14 10:56   ` Ramalingam C
2021-12-14 10:56     ` [Intel-gfx] " Ramalingam C
2021-12-14 12:32     ` Matthew Auld
2021-12-14 12:32       ` [Intel-gfx] " Matthew Auld
2021-12-16 15:01       ` Ramalingam C
2021-12-16 15:01         ` [Intel-gfx] " Ramalingam C
2021-12-06 13:31 ` [PATCH v3 8/8] drm/i915/migrate: turn on acceleration " Matthew Auld
2021-12-06 13:31   ` [Intel-gfx] " Matthew Auld
2021-12-06 14:05 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for DG2 accelerated migration/clearing support (rev2) Patchwork
2021-12-06 14:49 ` [PATCH v3 0/8] DG2 accelerated migration/clearing support Daniel Stone
2021-12-06 14:49   ` [Intel-gfx] " Daniel Stone
2021-12-06 15:13   ` Matthew Auld
2021-12-06 15:13     ` [Intel-gfx] " Matthew Auld
