* [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs
@ 2023-08-09 16:53 Boris Brezillon
  2023-08-09 16:53 ` [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t Boris Brezillon
                   ` (16 more replies)
  0 siblings, 17 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Hello,

This is the second version of the kernel driver meant to support new Mali
GPUs that delegate scheduling to firmware.

The RFC has been dropped as the major blocking points have been addressed
(request to use drm_sched, request to implement a VM_BIND-like ioctl,
request to use drm_gpuva_mgr for the VM logic, lack of PM/devfreq support).

This series is based on drm-misc-next and depends on some drm_sched [1]
and iommu [2] changes.

A branch containing all those dependencies is available here [3], and
here [4] is another one that additionally carries all the patches needed
to get a working GPU on rk3588. The CSF firmware binary can be found
here [5].

The mesa branch used to test this new driver is available here [6].
It's still under development and it's just a gallium driver right now,
but we are working on that ;-).

Here is a non-exhaustive changelog; check each commit for the detailed
list of changes.

v2:
- Rename the driver (pancsf -> panthor)
- Split the commit adding the driver to ease review
- Use drm_sched for dependency tracking/job submission
- Add a VM_BIND ioctl
- Add the concept of exclusive VM for BOs that are only ever mapped to a
  single VM
- Document the code and uAPI
- Add a DT binding doc

I tried to Cc anyone that was involved in any development of the code
I picked from panfrost, so they can acknowledge the GPL2 -> MIT+GPL2
change. If I missed someone, please let me know.

Best Regards,

Boris

[1]https://lore.kernel.org/dri-devel/20230801205103.627779-1-matthew.brost@intel.com/T/#t
[2]https://lore.kernel.org/linux-iommu/20230809121744.2341454-1-boris.brezillon@collabora.com/T/#t
[3]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor
[4]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor+rk3588-evb1
[5]https://gitlab.com/firefly-linux/external/libmali/-/raw/firefly/firmware/g610/mali_csffw.bin
[6]https://gitlab.freedesktop.org/panfrost/mesa/-/tree/v10+panthor

Boris Brezillon (14):
  drm/shmem-helper: Make pages_use_count an atomic_t
  drm/panthor: Add uAPI
  drm/panthor: Add GPU register definitions
  drm/panthor: Add the device logical block
  drm/panthor: Add the GPU logical block
  drm/panthor: Add GEM logical block
  drm/panthor: Add the devfreq logical block
  drm/panthor: Add the MMU/VM logical block
  drm/panthor: Add the FW logical block
  drm/panthor: Add the heap logical block
  drm/panthor: Add the scheduler logical block
  drm/panthor: Add the driver frontend block
  drm/panthor: Allow driver compilation
  drm/panthor: Add an entry to MAINTAINERS

Liviu Dudau (1):
  dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor
    driver

 .../bindings/gpu/arm,mali-valhall-csf.yaml    |  148 +
 Documentation/gpu/driver-uapi.rst             |    5 +
 MAINTAINERS                                   |    8 +
 drivers/gpu/drm/Kconfig                       |    2 +
 drivers/gpu/drm/Makefile                      |    1 +
 drivers/gpu/drm/drm_gem_shmem_helper.c        |   28 +-
 drivers/gpu/drm/lima/lima_gem.c               |    2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c       |    2 +-
 drivers/gpu/drm/panthor/Kconfig               |   16 +
 drivers/gpu/drm/panthor/Makefile              |   15 +
 drivers/gpu/drm/panthor/panthor_devfreq.c     |  281 ++
 drivers/gpu/drm/panthor/panthor_devfreq.h     |   25 +
 drivers/gpu/drm/panthor/panthor_device.c      |  479 +++
 drivers/gpu/drm/panthor/panthor_device.h      |  354 ++
 drivers/gpu/drm/panthor/panthor_drv.c         | 1540 ++++++++
 drivers/gpu/drm/panthor/panthor_fw.c          | 1417 +++++++
 drivers/gpu/drm/panthor/panthor_fw.h          |  505 +++
 drivers/gpu/drm/panthor/panthor_gem.c         |  229 ++
 drivers/gpu/drm/panthor/panthor_gem.h         |   96 +
 drivers/gpu/drm/panthor/panthor_gpu.c         |  463 +++
 drivers/gpu/drm/panthor/panthor_gpu.h         |   52 +
 drivers/gpu/drm/panthor/panthor_heap.c        |  550 +++
 drivers/gpu/drm/panthor/panthor_heap.h        |   36 +
 drivers/gpu/drm/panthor/panthor_mmu.c         | 2611 +++++++++++++
 drivers/gpu/drm/panthor/panthor_mmu.h         |   81 +
 drivers/gpu/drm/panthor/panthor_regs.h        |  229 ++
 drivers/gpu/drm/panthor/panthor_sched.c       | 3272 +++++++++++++++++
 drivers/gpu/drm/panthor/panthor_sched.h       |   50 +
 include/drm/drm_gem_shmem_helper.h            |    2 +-
 include/uapi/drm/panthor_drm.h                |  862 +++++
 30 files changed, 13345 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
 create mode 100644 drivers/gpu/drm/panthor/Kconfig
 create mode 100644 drivers/gpu/drm/panthor/Makefile
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
 create mode 100644 include/uapi/drm/panthor_drm.h

-- 
2.41.0



* [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-11 13:08   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

This way we can grab a pages ref without acquiring the resv lock when
pages_use_count > 0. This is needed to implement asynchronous maps using
the drm_gpuva_mgr, where a map/unmap operation triggering a mapping split
requires the new left/right regions to grab an additional page ref
to guarantee that the pages stay pinned when the middle section is
unmapped.
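
To illustrate, the lockless fast path this enables (matching the
drm_gem_shmem_pin()/unpin() hunks below) boils down to:

  /* Pages already pinned: take a ref without the resv lock. */
  if (atomic_inc_not_zero(&shmem->pages_use_count))
          return 0;

  /* The 0 -> 1 transition is still serialized by the resv lock, as before. */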

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/drm_gem_shmem_helper.c  | 28 +++++++++++++------------
 drivers/gpu/drm/lima/lima_gem.c         |  2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  2 +-
 include/drm/drm_gem_shmem_helper.h      |  2 +-
 4 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
index a783d2245599..ca6938ea1b82 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -155,7 +155,7 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
 		if (shmem->pages)
 			drm_gem_shmem_put_pages(shmem);
 
-		drm_WARN_ON(obj->dev, shmem->pages_use_count);
+		drm_WARN_ON(obj->dev, atomic_read(&shmem->pages_use_count));
 
 		dma_resv_unlock(shmem->base.resv);
 	}
@@ -172,14 +172,14 @@ static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
 
 	dma_resv_assert_held(shmem->base.resv);
 
-	if (shmem->pages_use_count++ > 0)
+	if (atomic_inc_return(&shmem->pages_use_count) > 1)
 		return 0;
 
 	pages = drm_gem_get_pages(obj);
 	if (IS_ERR(pages)) {
 		drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
 			    PTR_ERR(pages));
-		shmem->pages_use_count = 0;
+		atomic_set(&shmem->pages_use_count, 0);
 		return PTR_ERR(pages);
 	}
 
@@ -210,10 +210,10 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
 
 	dma_resv_assert_held(shmem->base.resv);
 
-	if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
+	if (drm_WARN_ON_ONCE(obj->dev, !atomic_read(&shmem->pages_use_count)))
 		return;
 
-	if (--shmem->pages_use_count > 0)
+	if (atomic_dec_return(&shmem->pages_use_count) > 0)
 		return;
 
 #ifdef CONFIG_X86
@@ -263,6 +263,10 @@ int drm_gem_shmem_pin(struct drm_gem_shmem_object *shmem)
 
 	drm_WARN_ON(obj->dev, obj->import_attach);
 
+	/* Fast path: pages already pinned, take a ref without the resv lock. */
+	if (atomic_inc_not_zero(&shmem->pages_use_count))
+		return 0;
+
 	ret = dma_resv_lock_interruptible(shmem->base.resv, NULL);
 	if (ret)
 		return ret;
@@ -286,6 +290,10 @@ void drm_gem_shmem_unpin(struct drm_gem_shmem_object *shmem)
 
 	drm_WARN_ON(obj->dev, obj->import_attach);
 
+	/* Fast path: if this is not the last ref, drop it without the resv lock. */
+	if (atomic_add_unless(&shmem->pages_use_count, -1, 1))
+		return;
+
 	dma_resv_lock(shmem->base.resv, NULL);
 	drm_gem_shmem_unpin_locked(shmem);
 	dma_resv_unlock(shmem->base.resv);
@@ -543,18 +551,12 @@ static void drm_gem_shmem_vm_open(struct vm_area_struct *vma)
 
 	drm_WARN_ON(obj->dev, obj->import_attach);
 
-	dma_resv_lock(shmem->base.resv, NULL);
-
 	/*
 	 * We should have already pinned the pages when the buffer was first
 	 * mmap'd, vm_open() just grabs an additional reference for the new
 	 * mm the vma is getting copied into (ie. on fork()).
 	 */
-	if (!drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
-		shmem->pages_use_count++;
-
-	dma_resv_unlock(shmem->base.resv);
-
+	drm_WARN_ON_ONCE(obj->dev, atomic_inc_return(&shmem->pages_use_count) == 1);
 	drm_gem_vm_open(vma);
 }
 
@@ -632,7 +634,7 @@ void drm_gem_shmem_print_info(const struct drm_gem_shmem_object *shmem,
 	if (shmem->base.import_attach)
 		return;
 
-	drm_printf_indent(p, indent, "pages_use_count=%u\n", shmem->pages_use_count);
+	drm_printf_indent(p, indent, "pages_use_count=%u\n", atomic_read(&shmem->pages_use_count));
 	drm_printf_indent(p, indent, "vmap_use_count=%u\n", shmem->vmap_use_count);
 	drm_printf_indent(p, indent, "vaddr=%p\n", shmem->vaddr);
 }
diff --git a/drivers/gpu/drm/lima/lima_gem.c b/drivers/gpu/drm/lima/lima_gem.c
index 4f9736e5f929..0116518b1601 100644
--- a/drivers/gpu/drm/lima/lima_gem.c
+++ b/drivers/gpu/drm/lima/lima_gem.c
@@ -47,7 +47,7 @@ int lima_heap_alloc(struct lima_bo *bo, struct lima_vm *vm)
 		}
 
 		bo->base.pages = pages;
-		bo->base.pages_use_count = 1;
+		atomic_set(&bo->base.pages_use_count, 1);
 
 		mapping_set_unevictable(mapping);
 	}
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index c0123d09f699..f66e63bf743e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -487,7 +487,7 @@ static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,
 			goto err_unlock;
 		}
 		bo->base.pages = pages;
-		bo->base.pages_use_count = 1;
+		atomic_set(&bo->base.pages_use_count, 1);
 	} else {
 		pages = bo->base.pages;
 		if (pages[page_offset]) {
diff --git a/include/drm/drm_gem_shmem_helper.h b/include/drm/drm_gem_shmem_helper.h
index bf0c31aa8fbe..0661f87d3bda 100644
--- a/include/drm/drm_gem_shmem_helper.h
+++ b/include/drm/drm_gem_shmem_helper.h
@@ -37,7 +37,7 @@ struct drm_gem_shmem_object {
 	 * Reference count on the pages table.
 	 * The pages are put when the count reaches zero.
 	 */
-	unsigned int pages_use_count;
+	atomic_t pages_use_count;
 
 	/**
 	 * @madv: State for madvise
-- 
2.41.0



* [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
  2023-08-09 16:53 ` [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-11 14:13   ` Steven Price
                     ` (4 more replies)
  2023-08-09 16:53 ` [PATCH v2 03/15] drm/panthor: Add GPU register definitions Boris Brezillon
                   ` (14 subsequent siblings)
  16 siblings, 5 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Panthor follows the lead of other recently submitted drivers with
ioctls allowing us to support modern Vulkan features, like sparse memory
binding:

- Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
  with the 'exclusive-VM' bit to speed up BO reservation on job submission
- VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
  ioctl is loosely based on the Xe model, and can handle both
  asynchronous and synchronous requests
- GPU execution context creation/destruction, tiler heap context creation
  and job submission. Those ioctls reflect how the hardware/scheduler
  works and are thus driver specific.

We also have a way to expose IO regions, such that the usermode driver
can directly access specific/well-isolated registers, like the
LATEST_FLUSH register used to implement cache-flush reduction.

This uAPI intentionally keeps usermode queues out of scope, which
explains why doorbell registers and command stream ring-buffers are not
directly exposed to userspace.
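
To give an idea of how those pieces fit together, here is a rough
userspace sketch (illustrative only, not part of the series; error
handling omitted, fd is an open panthor render node and gpu_va is a VA
picked by the UMD) creating a VM, allocating a BO owned by that VM, and
mapping it synchronously:

  struct drm_panthor_vm_create vm_create = {};

  ioctl(fd, DRM_IOCTL_PANTHOR_VM_CREATE, &vm_create);

  struct drm_panthor_bo_create bo_create = {
          .size = 1 << 20,
          .exclusive_vm_id = vm_create.id, /* BO only ever bound to this VM */
  };

  ioctl(fd, DRM_IOCTL_PANTHOR_BO_CREATE, &bo_create);

  struct drm_panthor_vm_bind_op op = {
          .flags = DRM_PANTHOR_VM_BIND_OP_TYPE_MAP,
          .bo_handle = bo_create.handle,
          .bo_offset = 0,
          .va = gpu_va,
          .size = bo_create.size,
  };
  struct drm_panthor_vm_bind bind = {
          .vm_id = vm_create.id,
          .ops = DRM_PANTHOR_OBJ_ARRAY(1, &op),
  };

  /* No DRM_PANTHOR_VM_BIND_ASYNC flag -> the request is executed
   * synchronously. The async variant sets that flag and attaches sync
   * operations through each op's syncs array.
   */
  ioctl(fd, DRM_IOCTL_PANTHOR_VM_BIND, &bind);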

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Turn the VM_{MAP,UNMAP} ioctls into a VM_BIND ioctl
- Add the concept of exclusive_vm at BO creation time
- Add missing padding fields
- Add documentation

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 Documentation/gpu/driver-uapi.rst |   5 +
 include/uapi/drm/panthor_drm.h    | 862 ++++++++++++++++++++++++++++++
 2 files changed, 867 insertions(+)
 create mode 100644 include/uapi/drm/panthor_drm.h

diff --git a/Documentation/gpu/driver-uapi.rst b/Documentation/gpu/driver-uapi.rst
index c08bcbb95fb3..7a667901830f 100644
--- a/Documentation/gpu/driver-uapi.rst
+++ b/Documentation/gpu/driver-uapi.rst
@@ -17,3 +17,8 @@ VM_BIND / EXEC uAPI
     :doc: Overview
 
 .. kernel-doc:: include/uapi/drm/nouveau_drm.h
+
+drm/panthor uAPI
+================
+
+.. kernel-doc:: include/uapi/drm/panthor_drm.h
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
new file mode 100644
index 000000000000..e217eb5ad198
--- /dev/null
+++ b/include/uapi/drm/panthor_drm.h
@@ -0,0 +1,862 @@
+/* SPDX-License-Identifier: MIT */
+/* Copyright (C) 2023 Collabora ltd. */
+#ifndef _PANTHOR_DRM_H_
+#define _PANTHOR_DRM_H_
+
+#include "drm.h"
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/**
+ * DOC: Introduction
+ *
+ * This documentation describes the Panthor IOCTLs.
+ *
+ * Just a few generic rules about the data passed to the Panthor IOCTLs:
+ *
+ * - Structures must be aligned on 64-bit/8-byte. If the object is not
+ *   naturally aligned, a padding field must be added.
+ * - Fields must be explicitly aligned to their natural type alignment with
+ *   pad[0..N] fields.
+ * - All padding fields will be checked by the driver to make sure they are
+ *   zeroed.
+ * - Flags can be added, but not removed/replaced.
+ * - New fields can be added to the main structures (the structures
+ *   directly passed to the ioctl). Those fields can be added at the end of
+ *   the structure, or replace existing padding fields. Any new field being
+ *   added must preserve the behavior that existed before those fields were
+ *   added when a value of zero is passed.
+ * - New fields can be added to indirect objects (objects pointed by the
+ *   main structure), iff those objects are passed a size to reflect the
+ *   size known by the userspace driver (see drm_panthor_obj_array::stride
+ *   or drm_panthor_dev_query::size).
+ * - If the kernel driver is too old to know some fields, those will
+ *   be ignored (input) and set back to zero (output).
+ * - If userspace is too old to know some fields, those will be zeroed
+ *   (input) before the structure is parsed by the kernel driver.
+ * - Each new flag/field addition must come with a driver version update so
+ *   the userspace driver doesn't have to trial and error to know which
+ *   flags are supported.
+ * - Structures should not contain unions, as this would defeat the
+ *   extensibility of such structures.
+ * - IOCTLs can't be removed or replaced. New IOCTL IDs should be placed
+ *   at the end of the drm_panthor_ioctl_id enum.
+ */
+
+/**
+ * DOC: MMIO regions exposed to userspace.
+ *
+ * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
+ *
+ * File offset for all MMIO regions being exposed to userspace. Don't use
+ * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
+ *
+ * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
+ *
+ * File offset for the LATEST_FLUSH_ID register. The userspace driver controls
+ * GPU cache flushing through CS instructions, but the flush reduction
+ * mechanism requires a flush_id. This flush_id could be queried with an
+ * ioctl, but Arm provides a well-isolated register page containing only this
+ * read-only register, so let's expose this page through a static mmap offset
+ * and allow direct mapping of this MMIO region so we can avoid the
+ * user <-> kernel round-trip.
+ */
+#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
+#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)
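+
+/*
+ * Illustrative use (userspace sketch, not part of this header; assumes the
+ * read-only register sits at the start of the mapped page):
+ *
+ *   flush_id_page = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd,
+ *                        DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET);
+ *   latest_flush = *(volatile uint32_t *)flush_id_page;
+ *
+ * The value is then passed in drm_panthor_queue_submit::latest_flush.
+ */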
+
+/**
+ * DOC: IOCTL IDs
+ *
+ * enum drm_panthor_ioctl_id - IOCTL IDs
+ *
+ * Place new ioctls at the end, don't re-order, don't replace or remove entries.
+ *
+ * These IDs are not meant to be used directly. Use the DRM_IOCTL_PANTHOR_xxx
+ * definitions instead.
+ */
+enum drm_panthor_ioctl_id {
+	/** @DRM_PANTHOR_DEV_QUERY: Query device information. */
+	DRM_PANTHOR_DEV_QUERY = 0,
+
+	/** @DRM_PANTHOR_VM_CREATE: Create a VM. */
+	DRM_PANTHOR_VM_CREATE,
+
+	/** @DRM_PANTHOR_VM_DESTROY: Destroy a VM. */
+	DRM_PANTHOR_VM_DESTROY,
+
+	/** @DRM_PANTHOR_VM_BIND: Bind/unbind memory to a VM. */
+	DRM_PANTHOR_VM_BIND,
+
+	/** @DRM_PANTHOR_BO_CREATE: Create a buffer object. */
+	DRM_PANTHOR_BO_CREATE,
+
+	/**
+	 * @DRM_PANTHOR_BO_MMAP_OFFSET: Get the file offset to pass to
+	 * mmap to map a GEM object.
+	 */
+	DRM_PANTHOR_BO_MMAP_OFFSET,
+
+	/** @DRM_PANTHOR_GROUP_CREATE: Create a scheduling group. */
+	DRM_PANTHOR_GROUP_CREATE,
+
+	/** @DRM_PANTHOR_GROUP_DESTROY: Destroy a scheduling group. */
+	DRM_PANTHOR_GROUP_DESTROY,
+
+	/**
+	 * @DRM_PANTHOR_GROUP_SUBMIT: Submit jobs to queues belonging
+	 * to a specific scheduling group.
+	 */
+	DRM_PANTHOR_GROUP_SUBMIT,
+
+	/** @DRM_PANTHOR_GROUP_GET_STATE: Get the state of a scheduling group. */
+	DRM_PANTHOR_GROUP_GET_STATE,
+
+	/** @DRM_PANTHOR_TILER_HEAP_CREATE: Create a tiler heap. */
+	DRM_PANTHOR_TILER_HEAP_CREATE,
+
+	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
+	DRM_PANTHOR_TILER_HEAP_DESTROY,
+};
+
+/**
+ * DRM_IOCTL_PANTHOR() - Build a Panthor IOCTL number
+ * @__access: Access type. Must be R, W or RW.
+ * @__id: One of the DRM_PANTHOR_xxx id.
+ * @__type: Suffix of the type being passed to the IOCTL.
+ *
+ * Don't use this macro directly, use the DRM_IOCTL_PANTHOR_xxx
+ * values instead.
+ *
+ * Return: An IOCTL number to be passed to ioctl() from userspace.
+ */
+#define DRM_IOCTL_PANTHOR(__access, __id, __type) \
+	DRM_IO ## __access(DRM_COMMAND_BASE + DRM_PANTHOR_ ## __id, \
+			   struct drm_panthor_ ## __type)
+
+#define DRM_IOCTL_PANTHOR_DEV_QUERY \
+	DRM_IOCTL_PANTHOR(WR, DEV_QUERY, dev_query)
+#define DRM_IOCTL_PANTHOR_VM_CREATE \
+	DRM_IOCTL_PANTHOR(WR, VM_CREATE, vm_create)
+#define DRM_IOCTL_PANTHOR_VM_DESTROY \
+	DRM_IOCTL_PANTHOR(WR, VM_DESTROY, vm_destroy)
+#define DRM_IOCTL_PANTHOR_VM_BIND \
+	DRM_IOCTL_PANTHOR(WR, VM_BIND, vm_bind)
+#define DRM_IOCTL_PANTHOR_BO_CREATE \
+	DRM_IOCTL_PANTHOR(WR, BO_CREATE, bo_create)
+#define DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET \
+	DRM_IOCTL_PANTHOR(WR, BO_MMAP_OFFSET, bo_mmap_offset)
+#define DRM_IOCTL_PANTHOR_GROUP_CREATE \
+	DRM_IOCTL_PANTHOR(WR, GROUP_CREATE, group_create)
+#define DRM_IOCTL_PANTHOR_GROUP_DESTROY \
+	DRM_IOCTL_PANTHOR(WR, GROUP_DESTROY, group_destroy)
+#define DRM_IOCTL_PANTHOR_GROUP_SUBMIT \
+	DRM_IOCTL_PANTHOR(WR, GROUP_SUBMIT, group_submit)
+#define DRM_IOCTL_PANTHOR_GROUP_GET_STATE \
+	DRM_IOCTL_PANTHOR(WR, GROUP_GET_STATE, group_get_state)
+#define DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE \
+	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
+#define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
+	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
+
+/**
+ * DOC: IOCTL arguments
+ */
+
+/**
+ * struct drm_panthor_obj_array - Object array.
+ *
+ * This object is used to pass an array of objects whose size is subject to change in
+ * future versions of the driver. In order to support this mutability, we pass a stride
+ * describing the size of the object as known by userspace.
+ *
+ * You shouldn't fill drm_panthor_obj_array fields directly. You should instead use
+ * the DRM_PANTHOR_OBJ_ARRAY() macro that takes care of initializing the stride to
+ * the object size.
+ */
+struct drm_panthor_obj_array {
+	/** @stride: Stride of object struct. Used for versioning. */
+	__u32 stride;
+
+	/** @count: Number of objects in the array. */
+	__u32 count;
+
+	/** @array: User pointer to an array of objects. */
+	__u64 array;
+};
+
+/**
+ * DRM_PANTHOR_OBJ_ARRAY() - Initialize a drm_panthor_obj_array field.
+ * @cnt: Number of elements in the array.
+ * @ptr: Pointer to the array to pass to the kernel.
+ *
+ * Macro initializing a drm_panthor_obj_array based on the object size as known
+ * by userspace.
+ */
+#define DRM_PANTHOR_OBJ_ARRAY(cnt, ptr) \
+	{ .stride = sizeof((ptr)[0]), .count = (cnt), .array = (__u64)(uintptr_t)(ptr) }
+
+/**
+ * enum drm_panthor_sync_op_flags - Synchronization operation flags.
+ */
+enum drm_panthor_sync_op_flags {
+	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK: Synchronization handle type mask. */
+	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK = 0xff,
+
+	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ: Synchronization object type. */
+	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ = 0,
+
+	/**
+	 * @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ: Timeline synchronization
+	 * object type.
+	 */
+	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ = 1,
+
+	/** @DRM_PANTHOR_SYNC_OP_WAIT: Wait operation. */
+	DRM_PANTHOR_SYNC_OP_WAIT = 0 << 31,
+
+	/** @DRM_PANTHOR_SYNC_OP_SIGNAL: Signal operation. */
+	DRM_PANTHOR_SYNC_OP_SIGNAL = (int)(1u << 31),
+};
+
+/**
+ * struct drm_panthor_sync_op - Synchronization operation.
+ */
+struct drm_panthor_sync_op {
+	/** @flags: Synchronization operation flags. Combination of DRM_PANTHOR_SYNC_OP values. */
+	__u32 flags;
+
+	/** @handle: Sync handle. */
+	__u32 handle;
+
+	/**
+	 * @timeline_value: MBZ if
+	 * (flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK) !=
+	 * DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ.
+	 */
+	__u64 timeline_value;
+};
+
+/**
+ * enum drm_panthor_dev_query_type - Query type
+ *
+ * Place new types at the end, don't re-order, don't remove or replace.
+ */
+enum drm_panthor_dev_query_type {
+	/** @DRM_PANTHOR_DEV_QUERY_GPU_INFO: Query GPU information. */
+	DRM_PANTHOR_DEV_QUERY_GPU_INFO = 0,
+
+	/** @DRM_PANTHOR_DEV_QUERY_CSIF_INFO: Query command-stream interface information. */
+	DRM_PANTHOR_DEV_QUERY_CSIF_INFO,
+};
+
+/**
+ * struct drm_panthor_gpu_info - GPU information
+ *
+ * Structure grouping all queryable information relating to the GPU.
+ */
+struct drm_panthor_gpu_info {
+	/** @gpu_id: GPU ID. */
+	__u32 gpu_id;
+#define DRM_PANTHOR_ARCH_MAJOR(x)		((x) >> 28)
+#define DRM_PANTHOR_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
+#define DRM_PANTHOR_ARCH_REV(x)			(((x) >> 20) & 0xf)
+#define DRM_PANTHOR_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
+#define DRM_PANTHOR_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
+#define DRM_PANTHOR_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
+#define DRM_PANTHOR_VERSION_STATUS(x)		((x) & 0xf)
+
+	/** @gpu_rev: GPU revision. */
+	__u32 gpu_rev;
+
+	/** @csf_id: Command stream frontend ID. */
+	__u32 csf_id;
+#define DRM_PANTHOR_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
+#define DRM_PANTHOR_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
+#define DRM_PANTHOR_CSHW_REV(x)			(((x) >> 16) & 0xf)
+#define DRM_PANTHOR_MCU_MAJOR(x)		(((x) >> 10) & 0x3f)
+#define DRM_PANTHOR_MCU_MINOR(x)		(((x) >> 4) & 0x3f)
+#define DRM_PANTHOR_MCU_REV(x)			((x) & 0xf)
+
+	/** @l2_features: L2-cache features. */
+	__u32 l2_features;
+
+	/** @tiler_features: Tiler features. */
+	__u32 tiler_features;
+
+	/** @mem_features: Memory features. */
+	__u32 mem_features;
+
+	/** @mmu_features: MMU features. */
+	__u32 mmu_features;
+#define DRM_PANTHOR_MMU_VA_BITS(x)		((x) & 0xff)
+
+	/** @thread_features: Thread features. */
+	__u32 thread_features;
+
+	/** @max_threads: Maximum number of threads. */
+	__u32 max_threads;
+
+	/** @thread_max_workgroup_size: Maximum workgroup size. */
+	__u32 thread_max_workgroup_size;
+
+	/**
+	 * @thread_max_barrier_size: Maximum number of threads that can wait
+	 * simultaneously on a barrier.
+	 */
+	__u32 thread_max_barrier_size;
+
+	/** @coherency_features: Coherency features. */
+	__u32 coherency_features;
+
+	/** @texture_features: Texture features. */
+	__u32 texture_features[4];
+
+	/** @as_present: Bitmask encoding the number of address spaces exposed by the MMU. */
+	__u32 as_present;
+
+	/** @core_group_count: Number of core groups. */
+	__u32 core_group_count;
+
+	/** @pad: Zero on return. */
+	__u32 pad;
+
+	/** @shader_present: Bitmask encoding the shader cores exposed by the GPU. */
+	__u64 shader_present;
+
+	/** @l2_present: Bitmask encoding the L2 caches exposed by the GPU. */
+	__u64 l2_present;
+
+	/** @tiler_present: Bitmask encoding the tiler units exposed by the GPU. */
+	__u64 tiler_present;
+};
+
+/**
+ * struct drm_panthor_csif_info - Command stream interface information
+ *
+ * Structure grouping all queryable information relating to the command stream interface.
+ */
+struct drm_panthor_csif_info {
+	/** @csg_slot_count: Number of command stream group slots exposed by the firmware. */
+	__u32 csg_slot_count;
+
+	/** @cs_slot_count: Number of command stream slots per group. */
+	__u32 cs_slot_count;
+
+	/** @cs_reg_count: Number of command stream registers. */
+	__u32 cs_reg_count;
+
+	/** @scoreboard_slot_count: Number of scoreboard slots. */
+	__u32 scoreboard_slot_count;
+
+	/**
+	 * @unpreserved_cs_reg_count: Number of command stream registers reserved by
+	 * the kernel driver to call a userspace command stream.
+	 *
+	 * All registers can be used by a userspace command stream, but the
+	 * [cs_reg_count - unpreserved_cs_reg_count .. cs_reg_count] registers are
+	 * used by the kernel when DRM_IOCTL_PANTHOR_GROUP_SUBMIT is called.
+	 */
+	__u32 unpreserved_cs_reg_count;
+
+	/**
+	 * @pad: Padding field, set to zero.
+	 */
+	__u32 pad;
+};
+
+/**
+ * struct drm_panthor_dev_query - Arguments passed to DRM_IOCTL_PANTHOR_DEV_QUERY
+ */
+struct drm_panthor_dev_query {
+	/** @type: the query type (see drm_panthor_dev_query_type). */
+	__u32 type;
+
+	/**
+	 * @size: size of the type being queried.
+	 *
+	 * If pointer is NULL, size is updated by the driver to provide the
+	 * output structure size. If pointer is not NULL, the driver will
+	 * only copy min(size, actual_structure_size) bytes to the pointer,
+	 * and update the size accordingly. This allows us to extend query
+	 * types without breaking userspace.
+	 */
+	__u32 size;
+
+	/**
+	 * @pointer: user pointer to a query type struct.
+	 *
+	 * Pointer can be NULL, in which case, nothing is copied, but the
+	 * actual structure size is returned. If not NULL, it must point to
+	 * a location that's large enough to hold size bytes.
+	 */
+	__u64 pointer;
+};
+
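+/*
+ * Illustrative query sequence (userspace sketch, not part of this header):
+ *
+ *   struct drm_panthor_gpu_info gpu_info = {};
+ *   struct drm_panthor_dev_query q = {
+ *           .type = DRM_PANTHOR_DEV_QUERY_GPU_INFO,
+ *   };
+ *
+ *   ioctl(fd, DRM_IOCTL_PANTHOR_DEV_QUERY, &q);  // pointer == NULL:
+ *                                                // q.size <- actual size
+ *   q.pointer = (__u64)(uintptr_t)&gpu_info;
+ *   q.size = sizeof(gpu_info);
+ *   ioctl(fd, DRM_IOCTL_PANTHOR_DEV_QUERY, &q);  // copies min(q.size, actual)
+ */
+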
+/**
+ * struct drm_panthor_vm_create - Arguments passed to DRM_IOCTL_PANTHOR_VM_CREATE
+ */
+struct drm_panthor_vm_create {
+	/** @flags: VM flags, MBZ. */
+	__u32 flags;
+
+	/** @id: Returned VM ID. */
+	__u32 id;
+
+	/**
+	 * @kernel_va_range: Size of the VA space reserved for kernel objects.
+	 *
+	 * If kernel_va_range is zero, we pick half of the VA space for kernel objects.
+	 *
+	 * Kernel VA space is always placed at the top of the supported VA range.
+	 */
+	__u64 kernel_va_range;
+};
+
+/**
+ * struct drm_panthor_vm_destroy - Arguments passed to DRM_IOCTL_PANTHOR_VM_DESTROY
+ */
+struct drm_panthor_vm_destroy {
+	/** @id: ID of the VM to destroy. */
+	__u32 id;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+};
+
+/**
+ * enum drm_panthor_vm_bind_op_flags - VM bind operation flags
+ */
+enum drm_panthor_vm_bind_op_flags {
+	/**
+	 * @DRM_PANTHOR_VM_BIND_OP_MAP_READONLY: Map the memory read-only.
+	 *
+	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
+	 */
+	DRM_PANTHOR_VM_BIND_OP_MAP_READONLY = 1 << 0,
+
+	/**
+	 * @DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC: Map the memory not-executable.
+	 *
+	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
+	 */
+	DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC = 1 << 1,
+
+	/**
+	 * @DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED: Map the memory uncached.
+	 *
+	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
+	 */
+	DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED = 1 << 2,
+
+	/**
+	 * @DRM_PANTHOR_VM_BIND_OP_TYPE_MASK: Mask used to determine the type of operation.
+	 */
+	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = (int)(0xfu << 28),
+
+	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_MAP: Map operation. */
+	DRM_PANTHOR_VM_BIND_OP_TYPE_MAP = 0 << 28,
+
+	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP: Unmap operation. */
+	DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP = 1 << 28,
+};
+
+/**
+ * struct drm_panthor_vm_bind_op - VM bind operation
+ */
+struct drm_panthor_vm_bind_op {
+	/** @flags: Combination of drm_panthor_vm_bind_op_flags flags. */
+	__u32 flags;
+
+	/**
+	 * @bo_handle: Handle of the buffer object to map.
+	 * MBZ for unmap operations.
+	 */
+	__u32 bo_handle;
+
+	/**
+	 * @bo_offset: Buffer object offset.
+	 * MBZ for unmap operations.
+	 */
+	__u64 bo_offset;
+
+	/**
+	 * @va: Virtual address to map/unmap.
+	 */
+	__u64 va;
+
+	/** @size: Size to map/unmap. */
+	__u64 size;
+
+	/**
+	 * @syncs: Array of synchronization operations.
+	 *
+	 * This array must be empty if %DRM_PANTHOR_VM_BIND_ASYNC is not set on
+	 * the drm_panthor_vm_bind object containing this VM bind operation.
+	 */
+	struct drm_panthor_obj_array syncs;
+
+};
+
+/**
+ * enum drm_panthor_vm_bind_flags - VM bind flags
+ */
+enum drm_panthor_vm_bind_flags {
+	/**
+	 * @DRM_PANTHOR_VM_BIND_ASYNC: VM bind operations are queued to the VM
+	 * queue instead of being executed synchronously.
+	 */
+	DRM_PANTHOR_VM_BIND_ASYNC = 1 << 0,
+};
+
+/**
+ * struct drm_panthor_vm_bind - Arguments passed to DRM_IOCTL_PANTHOR_VM_BIND
+ */
+struct drm_panthor_vm_bind {
+	/** @vm_id: VM targeted by the bind request. */
+	__u32 vm_id;
+
+	/** @flags: Combination of drm_panthor_vm_bind_flags flags. */
+	__u32 flags;
+
+	/** @ops: Array of bind operations. */
+	struct drm_panthor_obj_array ops;
+};
+
+/**
+ * enum drm_panthor_bo_flags - Buffer object flags, passed at creation time.
+ */
+enum drm_panthor_bo_flags {
+	/** @DRM_PANTHOR_BO_NO_MMAP: The buffer object will never be CPU-mapped in userspace. */
+	DRM_PANTHOR_BO_NO_MMAP = (1 << 0),
+};
+
+/**
+ * struct drm_panthor_bo_create - Arguments passed to DRM_IOCTL_PANTHOR_BO_CREATE.
+ */
+struct drm_panthor_bo_create {
+	/**
+	 * @size: Requested size for the object
+	 *
+	 * The (page-aligned) allocated size for the object will be returned.
+	 */
+	__u64 size;
+
+	/**
+	 * @flags: Flags. Must be a combination of drm_panthor_bo_flags flags.
+	 */
+	__u32 flags;
+
+	/**
+	 * @exclusive_vm_id: Exclusive VM this buffer object will be mapped to.
+	 *
+	 * If not zero, the field must refer to a valid VM ID, and implies that:
+	 *  - the buffer object will only ever be bound to that VM
+	 *  - it cannot be exported as a PRIME fd
+	 */
+	__u32 exclusive_vm_id;
+
+	/**
+	 * @handle: Returned handle for the object.
+	 *
+	 * Object handles are nonzero.
+	 */
+	__u32 handle;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+};
+
+/**
+ * struct drm_panthor_bo_mmap_offset - Arguments passed to DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET.
+ */
+struct drm_panthor_bo_mmap_offset {
+	/** @handle: Handle of the object we want an mmap offset for. */
+	__u32 handle;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+
+	/** @offset: The fake offset to use for subsequent mmap calls. */
+	__u64 offset;
+};
+
+/**
+ * struct drm_panthor_queue_create - Queue creation arguments.
+ */
+struct drm_panthor_queue_create {
+	/**
+	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
+	 * 15 being the highest priority.
+	 */
+	__u8 priority;
+
+	/** @pad: Padding fields, MBZ. */
+	__u8 pad[3];
+
+	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */
+	__u32 ringbuf_size;
+};
+
+/**
+ * enum drm_panthor_group_priority - Scheduling group priority
+ */
+enum drm_panthor_group_priority {
+	/** @PANTHOR_GROUP_PRIORITY_LOW: Low priority group. */
+	PANTHOR_GROUP_PRIORITY_LOW = 0,
+
+	/** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
+	PANTHOR_GROUP_PRIORITY_MEDIUM,
+
+	/** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
+	PANTHOR_GROUP_PRIORITY_HIGH,
+};
+
+/**
+ * struct drm_panthor_group_create - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_CREATE
+ */
+struct drm_panthor_group_create {
+	/** @queues: Array of drm_panthor_queue_create elements. */
+	struct drm_panthor_obj_array queues;
+
+	/**
+	 * @max_compute_cores: Maximum number of cores that can be used by compute
+	 * jobs across CS queues bound to this group.
+	 *
+	 * Must be less than or equal to the number of bits set in @compute_core_mask.
+	 */
+	__u8 max_compute_cores;
+
+	/**
+	 * @max_fragment_cores: Maximum number of cores that can be used by fragment
+	 * jobs across CS queues bound to this group.
+	 *
+	 * Must be less than or equal to the number of bits set in @fragment_core_mask.
+	 */
+	__u8 max_fragment_cores;
+
+	/**
+	 * @max_tiler_cores: Maximum number of tilers that can be used by tiler jobs
+	 * across CS queues bound to this group.
+	 *
+	 * Must be less than or equal to the number of bits set in @tiler_core_mask.
+	 */
+	__u8 max_tiler_cores;
+
+	/** @priority: Group priority (see enum drm_panthor_group_priority). */
+	__u8 priority;
+
+	/** @pad: Padding field, MBZ. */
+	__u32 pad;
+
+	/**
+	 * @compute_core_mask: Mask encoding cores that can be used for compute jobs.
+	 *
+	 * This field must have at least @max_compute_cores bits set.
+	 *
+	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
+	 */
+	__u64 compute_core_mask;
+
+	/**
+	 * @fragment_core_mask: Mask encoding cores that can be used for fragment jobs.
+	 *
+	 * This field must have at least @max_fragment_cores bits set.
+	 *
+	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
+	 */
+	__u64 fragment_core_mask;
+
+	/**
+	 * @tiler_core_mask: Mask encoding cores that can be used for tiler jobs.
+	 *
+	 * This field must have at least @max_tiler_cores bits set.
+	 *
+	 * The bits set here should also be set in drm_panthor_gpu_info::tiler_present.
+	 */
+	__u64 tiler_core_mask;
+
+	/**
+	 * @vm_id: VM ID to bind this group to.
+	 *
+	 * All submission to queues bound to this group will use this VM.
+	 */
+	__u32 vm_id;
+
+	/**
+	 * @group_handle: Returned group handle. Passed back when submitting jobs or
+	 * destroying a group.
+	 */
+	__u32 group_handle;
+};
+
+/**
+ * struct drm_panthor_group_destroy - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_DESTROY
+ */
+struct drm_panthor_group_destroy {
+	/** @group_handle: Group to destroy */
+	__u32 group_handle;
+
+	/** @pad: Padding field, MBZ. */
+	__u32 pad;
+};
+
+/**
+ * struct drm_panthor_queue_submit - Job submission arguments.
+ *
+ * This describes the userspace command stream to call from the kernel
+ * command stream ring-buffer. Queue submission is always part of a group
+ * submission, which takes one or more jobs to submit to the underlying queues.
+ */
+struct drm_panthor_queue_submit {
+	/** @queue_index: Index of the queue inside a group. */
+	__u32 queue_index;
+
+	/**
+	 * @stream_size: Size of the command stream to execute.
+	 *
+	 * Must be 64-bit/8-byte aligned (the size of a CS instruction)
+	 *
+	 * Can be zero if stream_addr is zero too.
+	 */
+	__u32 stream_size;
+
+	/**
+	 * @stream_addr: GPU address of the command stream to execute.
+	 *
+	 * Must be 64-byte aligned.
+	 *
+	 * Can be zero if stream_size is zero too.
+	 */
+	__u64 stream_addr;
+
+	/**
+	 * @latest_flush: FLUSH_ID read at the time the stream was built.
+	 *
+	 * This allows cache flush elimination for the automatic
+	 * flush+invalidate(all) done at submission time, which is needed to
+	 * ensure the GPU doesn't get garbage when reading the indirect command
+	 * stream buffers. If you want the cache flush to happen
+	 * unconditionally, pass a zero here.
+	 */
+	__u32 latest_flush;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+
+	/** @syncs: Array of sync operations. */
+	struct drm_panthor_obj_array syncs;
+};
+
+/**
+ * struct drm_panthor_group_submit - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_SUBMIT
+ */
+struct drm_panthor_group_submit {
+	/** @group_handle: Handle of the group to queue jobs to. */
+	__u32 group_handle;
+
+	/** @pad: MBZ. */
+	__u32 pad;
+
+	/** @queue_submits: Array of drm_panthor_queue_submit objects. */
+	struct drm_panthor_obj_array queue_submits;
+};
+
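+/*
+ * Illustrative submission (userspace sketch, not part of this header;
+ * group_handle, cs_gpu_va, cs_size and flush_id come from earlier steps):
+ *
+ *   struct drm_panthor_queue_submit qsubmit = {
+ *           .queue_index = 0,
+ *           .stream_addr = cs_gpu_va,
+ *           .stream_size = cs_size,
+ *           .latest_flush = flush_id,
+ *   };
+ *   struct drm_panthor_group_submit gsubmit = {
+ *           .group_handle = group_handle,
+ *           .queue_submits = DRM_PANTHOR_OBJ_ARRAY(1, &qsubmit),
+ *   };
+ *
+ *   ioctl(fd, DRM_IOCTL_PANTHOR_GROUP_SUBMIT, &gsubmit);
+ */
+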
+/**
+ * enum drm_panthor_group_state_flags - Group state flags
+ */
+enum drm_panthor_group_state_flags {
+	/**
+	 * @DRM_PANTHOR_GROUP_STATE_TIMEDOUT: Group had unfinished jobs.
+	 *
+	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
+	 */
+	DRM_PANTHOR_GROUP_STATE_TIMEDOUT = 1 << 0,
+
+	/**
+	 * @DRM_PANTHOR_GROUP_STATE_FATAL_FAULT: Group had fatal faults.
+	 *
+	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
+	 */
+	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
+};
+
+/**
+ * struct drm_panthor_group_get_state - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_GET_STATE
+ *
+ * Used to query the state of a group and decide whether a new group should be created to
+ * replace it.
+ */
+struct drm_panthor_group_get_state {
+	/** @group_handle: Handle of the group to query state on */
+	__u32 group_handle;
+
+	/**
+	 * @state: Combination of DRM_PANTHOR_GROUP_STATE_* flags encoding the
+	 * group state.
+	 */
+	__u32 state;
+
+	/** @fatal_queues: Bitmask of queues that faced fatal faults. */
+	__u32 fatal_queues;
+
+	/** @pad: MBZ */
+	__u32 pad;
+};
+
+/**
+ * struct drm_panthor_tiler_heap_create - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE
+ */
+struct drm_panthor_tiler_heap_create {
+	/** @vm_id: VM ID the tiler heap should be mapped to */
+	__u32 vm_id;
+
+	/** @initial_chunk_count: Initial number of chunks to allocate. */
+	__u32 initial_chunk_count;
+
+	/** @chunk_size: Chunk size. Must be a power of two, at least 256KB. */
+	__u32 chunk_size;
+
+	/** @max_chunks: Maximum number of chunks that can be allocated. */
+	__u32 max_chunks;
+
+	/**
+	 * @target_in_flight: Maximum number of in-flight render passes.
+	 *
+	 * If the heap has more than @target_in_flight render passes in flight, the
+	 * FW will wait for render passes to finish before queuing new tiler jobs.
+	 */
+	__u32 target_in_flight;
+
+	/** @handle: Returned heap handle. Passed back to DESTROY_TILER_HEAP. */
+	__u32 handle;
+
+	/** @tiler_heap_ctx_gpu_va: Returned GPU virtual address of the tiler heap context. */
+	__u64 tiler_heap_ctx_gpu_va;
+
+	/**
+	 * @first_heap_chunk_gpu_va: First heap chunk.
+	 *
+	 * The tiler heap is formed of heap chunks forming a singly-linked list. This
+	 * is the first element in the list.
+	 */
+	__u64 first_heap_chunk_gpu_va;
+};
+
+/**
+ * struct drm_panthor_tiler_heap_destroy - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY
+ */
+struct drm_panthor_tiler_heap_destroy {
+	/** @handle: Handle of the tiler heap to destroy */
+	__u32 handle;
+
+	/** @pad: Padding field, MBZ. */
+	__u32 pad;
+};
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif /* _PANTHOR_DRM_H_ */
-- 
2.41.0



* [PATCH v2 03/15] drm/panthor: Add GPU register definitions
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
  2023-08-09 16:53 ` [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t Boris Brezillon
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-11 14:13   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 04/15] drm/panthor: Add the device logical block Boris Brezillon
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Those are the registers directly accessible through the MMIO range.

FW registers are exposed in panthor_fw.h.
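
As an illustration of how these definitions end up being used (a sketch
based on the gpu_read()/gpu_write() helpers at the bottom of this file;
ptdev is the panthor_device introduced in a later patch):

  u32 gpu_id = gpu_read(ptdev, GPU_ID);
  u32 arch_major = gpu_id >> 28;            /* same layout as DRM_PANTHOR_ARCH_MAJOR() in the uAPI */
  u32 product_major = (gpu_id >> 16) & 0xf;

  gpu_write(ptdev, GPU_INT_CLEAR, GPU_IRQ_RESET_COMPLETED);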

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_regs.h | 229 +++++++++++++++++++++++++
 1 file changed, 229 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h

diff --git a/drivers/gpu/drm/panthor/panthor_regs.h b/drivers/gpu/drm/panthor/panthor_regs.h
new file mode 100644
index 000000000000..00e149cf9eab
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_regs.h
@@ -0,0 +1,229 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+/*
+ * Register definitions based on mali_kbase_gpu_regmap.h and
+ * mali_kbase_gpu_regmap_csf.h
+ * (C) COPYRIGHT 2010-2022 ARM Limited. All rights reserved.
+ */
+#ifndef __PANTHOR_REGS_H__
+#define __PANTHOR_REGS_H__
+
+#define GPU_ID						0x00
+#define GPU_L2_FEATURES					0x004
+#define GPU_TILER_FEATURES				0x00C
+#define GPU_MEM_FEATURES				0x010
+#define   GROUPS_L2_COHERENT				BIT(0)
+
+#define GPU_MMU_FEATURES				0x014
+#define  GPU_MMU_FEATURES_VA_BITS(x)			((x) & GENMASK(7, 0))
+#define  GPU_MMU_FEATURES_PA_BITS(x)			(((x) >> 8) & GENMASK(7, 0))
+#define GPU_AS_PRESENT					0x018
+#define GPU_CSF_ID					0x01C
+
+#define GPU_INT_RAWSTAT					0x20
+#define GPU_INT_CLEAR					0x24
+#define GPU_INT_MASK					0x28
+#define GPU_INT_STAT					0x2c
+#define   GPU_IRQ_FAULT					BIT(0)
+#define   GPU_IRQ_PROTM_FAULT				BIT(1)
+#define   GPU_IRQ_RESET_COMPLETED			BIT(8)
+#define   GPU_IRQ_POWER_CHANGED				BIT(9)
+#define   GPU_IRQ_POWER_CHANGED_ALL			BIT(10)
+#define   GPU_IRQ_CLEAN_CACHES_COMPLETED		BIT(17)
+#define   GPU_IRQ_DOORBELL_MIRROR			BIT(18)
+#define   GPU_IRQ_MCU_STATUS_CHANGED			BIT(19)
+#define GPU_CMD						0x30
+#define   GPU_CMD_DEF(type, payload)			((type) | ((payload) << 8))
+#define   GPU_SOFT_RESET				GPU_CMD_DEF(1, 1)
+#define   GPU_HARD_RESET				GPU_CMD_DEF(1, 2)
+#define   CACHE_CLEAN					BIT(0)
+#define   CACHE_INV					BIT(1)
+#define   GPU_FLUSH_CACHES(l2, lsc, oth)		\
+	  GPU_CMD_DEF(4, ((l2) << 0) | ((lsc) << 4) | ((oth) << 8))
+
+#define GPU_STATUS					0x34
+#define   GPU_STATUS_ACTIVE				BIT(0)
+#define   GPU_STATUS_PWR_ACTIVE				BIT(1)
+#define   GPU_STATUS_PAGE_FAULT				BIT(4)
+#define   GPU_STATUS_PROTM_ACTIVE			BIT(7)
+#define   GPU_STATUS_DBG_ENABLED			BIT(8)
+
+#define GPU_FAULT_STATUS				0x3C
+#define GPU_FAULT_ADDR_LO				0x40
+#define GPU_FAULT_ADDR_HI				0x44
+
+#define GPU_PWR_KEY					0x50
+#define  GPU_PWR_KEY_UNLOCK				0x2968A819
+#define GPU_PWR_OVERRIDE0				0x54
+#define GPU_PWR_OVERRIDE1				0x58
+
+#define GPU_TIMESTAMP_OFFSET_LO				0x88
+#define GPU_TIMESTAMP_OFFSET_HI				0x8C
+#define GPU_CYCLE_COUNT_LO				0x90
+#define GPU_CYCLE_COUNT_HI				0x94
+#define GPU_TIMESTAMP_LO				0x98
+#define GPU_TIMESTAMP_HI				0x9C
+
+#define GPU_THREAD_MAX_THREADS				0xA0
+#define GPU_THREAD_MAX_WORKGROUP_SIZE			0xA4
+#define GPU_THREAD_MAX_BARRIER_SIZE			0xA8
+#define GPU_THREAD_FEATURES				0xAC
+
+#define GPU_TEXTURE_FEATURES(n)				(0xB0 + ((n) * 4))
+
+#define GPU_SHADER_PRESENT_LO				0x100
+#define GPU_SHADER_PRESENT_HI				0x104
+#define GPU_TILER_PRESENT_LO				0x110
+#define GPU_TILER_PRESENT_HI				0x114
+#define GPU_L2_PRESENT_LO				0x120
+#define GPU_L2_PRESENT_HI				0x124
+
+#define SHADER_READY_LO					0x140
+#define SHADER_READY_HI					0x144
+#define TILER_READY_LO					0x150
+#define TILER_READY_HI					0x154
+#define L2_READY_LO					0x160
+#define L2_READY_HI					0x164
+
+#define SHADER_PWRON_LO					0x180
+#define SHADER_PWRON_HI					0x184
+#define TILER_PWRON_LO					0x190
+#define TILER_PWRON_HI					0x194
+#define L2_PWRON_LO					0x1A0
+#define L2_PWRON_HI					0x1A4
+
+#define SHADER_PWROFF_LO				0x1C0
+#define SHADER_PWROFF_HI				0x1C4
+#define TILER_PWROFF_LO					0x1D0
+#define TILER_PWROFF_HI					0x1D4
+#define L2_PWROFF_LO					0x1E0
+#define L2_PWROFF_HI					0x1E4
+
+#define SHADER_PWRTRANS_LO				0x200
+#define SHADER_PWRTRANS_HI				0x204
+#define TILER_PWRTRANS_LO				0x210
+#define TILER_PWRTRANS_HI				0x214
+#define L2_PWRTRANS_LO					0x220
+#define L2_PWRTRANS_HI					0x224
+
+#define SHADER_PWRACTIVE_LO				0x240
+#define SHADER_PWRACTIVE_HI				0x244
+#define TILER_PWRACTIVE_LO				0x250
+#define TILER_PWRACTIVE_HI				0x254
+#define L2_PWRACTIVE_LO					0x260
+#define L2_PWRACTIVE_HI					0x264
+
+#define GPU_REVID					0x280
+
+#define GPU_COHERENCY_FEATURES				0x300
+#define GPU_COHERENCY_PROT_BIT(name)			BIT(GPU_COHERENCY_  ## name)
+
+#define GPU_COHERENCY_PROTOCOL				0x304
+#define   GPU_COHERENCY_ACE				0
+#define   GPU_COHERENCY_ACE_LITE			1
+#define   GPU_COHERENCY_NONE				31
+
+#define MCU_CONTROL					0x700
+#define MCU_CONTROL_ENABLE				1
+#define MCU_CONTROL_AUTO				2
+#define MCU_CONTROL_DISABLE				0
+
+#define MCU_STATUS					0x704
+#define MCU_STATUS_DISABLED				0
+#define MCU_STATUS_ENABLED				1
+#define MCU_STATUS_HALT					2
+#define MCU_STATUS_FATAL				3
+
+/* Job Control regs */
+#define JOB_INT_RAWSTAT					0x1000
+#define JOB_INT_CLEAR					0x1004
+#define JOB_INT_MASK					0x1008
+#define JOB_INT_STAT					0x100c
+#define   JOB_INT_GLOBAL_IF				BIT(31)
+#define   JOB_INT_CSG_IF(x)				BIT(x)
+
+/* MMU regs */
+#define MMU_INT_RAWSTAT					0x2000
+#define MMU_INT_CLEAR					0x2004
+#define MMU_INT_MASK					0x2008
+#define MMU_INT_STAT					0x200c
+
+/* AS_COMMAND register commands */
+
+#define MMU_BASE					0x2400
+#define MMU_AS_SHIFT					6
+#define MMU_AS(as)					(MMU_BASE + ((as) << MMU_AS_SHIFT))
+
+#define AS_TRANSTAB_LO(as)				(MMU_AS(as) + 0x00)
+#define AS_TRANSTAB_HI(as)				(MMU_AS(as) + 0x04)
+#define AS_MEMATTR_LO(as)				(MMU_AS(as) + 0x08)
+#define AS_MEMATTR_HI(as)				(MMU_AS(as) + 0x0C)
+#define   AS_MEMATTR_AARCH64_INNER_ALLOC_IMPL		(2 << 2)
+#define   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(w, r)	((3 << 2) | \
+							 ((w) ? BIT(0) : 0) | \
+							 ((r) ? BIT(1) : 0))
+#define   AS_MEMATTR_AARCH64_SH_MIDGARD_INNER		(0 << 4)
+#define   AS_MEMATTR_AARCH64_SH_CPU_INNER		(1 << 4)
+#define   AS_MEMATTR_AARCH64_SH_CPU_INNER_SHADER_COH	(2 << 4)
+#define   AS_MEMATTR_AARCH64_SHARED			(0 << 6)
+#define   AS_MEMATTR_AARCH64_INNER_OUTER_NC		(1 << 6)
+#define   AS_MEMATTR_AARCH64_INNER_OUTER_WB		(2 << 6)
+#define   AS_MEMATTR_AARCH64_FAULT			(3 << 6)
+#define AS_LOCKADDR_LO(as)				(MMU_AS(as) + 0x10)
+#define AS_LOCKADDR_HI(as)				(MMU_AS(as) + 0x14)
+#define AS_COMMAND(as)					(MMU_AS(as) + 0x18)
+#define   AS_COMMAND_NOP				0
+#define   AS_COMMAND_UPDATE				1
+#define   AS_COMMAND_LOCK				2
+#define   AS_COMMAND_UNLOCK				3
+#define   AS_COMMAND_FLUSH_PT				4
+#define   AS_COMMAND_FLUSH_MEM				5
+#define   AS_LOCK_REGION_MIN_SIZE			(1ULL << 15)
+#define AS_FAULTSTATUS(as)				(MMU_AS(as) + 0x1C)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_MASK		(0x3 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_ATOMIC		(0x0 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_EX			(0x1 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_READ		(0x2 << 8)
+#define  AS_FAULTSTATUS_ACCESS_TYPE_WRITE		(0x3 << 8)
+#define AS_FAULTADDRESS_LO(as)				(MMU_AS(as) + 0x20)
+#define AS_FAULTADDRESS_HI(as)				(MMU_AS(as) + 0x24)
+#define AS_STATUS(as)					(MMU_AS(as) + 0x28)
+#define   AS_STATUS_AS_ACTIVE				BIT(0)
+#define AS_TRANSCFG_LO(as)				(MMU_AS(as) + 0x30)
+#define AS_TRANSCFG_HI(as)				(MMU_AS(as) + 0x34)
+#define   AS_TRANSCFG_ADRMODE_LEGACY			(0 << 0)
+#define   AS_TRANSCFG_ADRMODE_UNMAPPED			(1 << 0)
+#define   AS_TRANSCFG_ADRMODE_IDENTITY			(2 << 0)
+#define   AS_TRANSCFG_ADRMODE_AARCH64_4K		(6 << 0)
+#define   AS_TRANSCFG_ADRMODE_AARCH64_64K		(8 << 0)
+#define   AS_TRANSCFG_INA_BITS(x)			((x) << 6)
+#define   AS_TRANSCFG_OUTA_BITS(x)			((x) << 14)
+#define   AS_TRANSCFG_SL_CONCAT				BIT(22)
+#define   AS_TRANSCFG_PTW_MEMATTR_NC			(1 << 24)
+#define   AS_TRANSCFG_PTW_MEMATTR_WB			(2 << 24)
+#define   AS_TRANSCFG_PTW_SH_NS				(0 << 28)
+#define   AS_TRANSCFG_PTW_SH_OS				(2 << 28)
+#define   AS_TRANSCFG_PTW_SH_IS				(3 << 28)
+#define   AS_TRANSCFG_PTW_RA				BIT(30)
+#define   AS_TRANSCFG_DISABLE_HIER_AP			BIT(33)
+#define   AS_TRANSCFG_DISABLE_AF_FAULT			BIT(34)
+#define   AS_TRANSCFG_WXN				BIT(35)
+#define   AS_TRANSCFG_XREADABLE				BIT(36)
+#define AS_FAULTEXTRA_LO(as)				(MMU_AS(as) + 0x38)
+#define AS_FAULTEXTRA_HI(as)				(MMU_AS(as) + 0x3C)
+
+#define CSF_GPU_LATEST_FLUSH_ID				0x10000
+#define CSF_GPU_LATEST_FLUSH_ID_DEFAULT			0xffffe0
+
+#define CSF_DOORBELL(i)					(0x80000 + ((i) * 0x10000))
+#define CSF_GLB_DOORBELL_ID				0
+
+#define gpu_write(dev, reg, data) \
+	writel(data, (dev)->iomem + (reg))
+
+#define gpu_read(dev, reg) \
+	readl((dev)->iomem + (reg))
+
+#endif
-- 
2.41.0



* [PATCH v2 04/15] drm/panthor: Add the device logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (2 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 03/15] drm/panthor: Add GPU register definitions Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-11 15:47   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 05/15] drm/panthor: Add the GPU " Boris Brezillon
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

The panthor driver is designed in a modular way, where each logical
block deals with a specific HW block or software feature. In order
for those blocks to communicate with each other, we need a central
panthor_device collecting all the blocks, and exposing some common
features, like interrupt handling, power management, reset, ...

This is what the panthor_device logical block is about.
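
For reference, the central structure roughly looks like this (abridged
sketch based on the fields touched in panthor_device.c below; the full
definition lives in panthor_device.h in this patch):

  struct panthor_device {
          struct drm_device base;              /* embedded DRM device */
          void __iomem *iomem;                 /* GPU MMIO range */
          phys_addr_t phys_addr;
          bool coherent;
          struct {
                  struct clk *core;
                  struct clk *stacks;
                  struct clk *coregroup;
          } clks;
          struct panthor_scheduler *scheduler; /* set once the scheduler block is up */
          struct {
                  struct workqueue_struct *wq;
                  struct work_struct work;
                  atomic_t pending;
          } reset;
          /* plus a pm sub-struct: runtime-PM state, dummy LATEST_FLUSH page, ... */
  };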

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Add devfreq/PM support
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_device.c | 479 +++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_device.h | 354 +++++++++++++++++
 2 files changed, 833 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.h

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
new file mode 100644
index 000000000000..15f102116fa0
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -0,0 +1,479 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/clk.h>
+#include <linux/reset.h>
+#include <linux/platform_device.h>
+#include <linux/pm_domain.h>
+#include <linux/pm_runtime.h>
+#include <linux/regulator/consumer.h>
+
+#include <drm/drm_drv.h>
+#include <drm/drm_managed.h>
+
+#include "panthor_sched.h"
+#include "panthor_device.h"
+#include "panthor_devfreq.h"
+#include "panthor_gpu.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+#include "panthor_regs.h"
+
+static int panthor_clk_init(struct panthor_device *ptdev)
+{
+	ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
+	if (IS_ERR(ptdev->clks.core)) {
+		drm_err(&ptdev->base, "get 'core' clock failed %ld\n",
+			PTR_ERR(ptdev->clks.core));
+		return PTR_ERR(ptdev->clks.core);
+	}
+
+	ptdev->clks.stacks = devm_clk_get_optional(ptdev->base.dev, "stacks");
+	if (IS_ERR(ptdev->clks.stacks)) {
+		drm_err(&ptdev->base, "get 'stacks' clock failed %ld\n",
+			PTR_ERR(ptdev->clks.stacks));
+		return PTR_ERR(ptdev->clks.stacks);
+	}
+
+	ptdev->clks.coregroup = devm_clk_get_optional(ptdev->base.dev, "coregroup");
+	if (IS_ERR(ptdev->clks.coregroup)) {
+		drm_err(&ptdev->base, "get 'coregroup' clock failed %ld\n",
+			PTR_ERR(ptdev->clks.coregroup));
+		return PTR_ERR(ptdev->clks.coregroup);
+	}
+
+	drm_info(&ptdev->base, "clock rate = %lu\n", clk_get_rate(ptdev->clks.core));
+	return 0;
+}
+
+void panthor_device_unplug(struct panthor_device *ptdev)
+{
+	/* FIXME: This is racy. */
+	if (drm_dev_is_unplugged(&ptdev->base))
+		return;
+
+	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
+
+	/* Call drm_dev_unplug() so any access to the HW blocks happening after
+	 * that point gets rejected.
+	 */
+	drm_dev_unplug(&ptdev->base);
+
+	/* Now, try to cleanly shutdown the GPU before the device resources
+	 * get reclaimed.
+	 */
+	panthor_sched_unplug(ptdev);
+	panthor_fw_unplug(ptdev);
+	panthor_mmu_unplug(ptdev);
+	panthor_gpu_unplug(ptdev);
+
+	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
+	pm_runtime_put_sync_suspend(ptdev->base.dev);
+}
+
+static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
+{
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+
+	cancel_work_sync(&ptdev->reset.work);
+	destroy_workqueue(ptdev->reset.wq);
+}
+
+static void panthor_device_reset_work(struct work_struct *work)
+{
+	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
+	int ret, cookie;
+
+	if (!drm_dev_enter(&ptdev->base, &cookie))
+		return;
+
+	panthor_sched_pre_reset(ptdev);
+	panthor_fw_pre_reset(ptdev, true);
+	panthor_mmu_pre_reset(ptdev);
+	panthor_gpu_soft_reset(ptdev);
+	panthor_gpu_l2_power_on(ptdev);
+	panthor_mmu_post_reset(ptdev);
+	ret = panthor_fw_post_reset(ptdev);
+	if (ret)
+		goto out;
+
+	atomic_set(&ptdev->reset.pending, 0);
+	panthor_sched_post_reset(ptdev);
+	drm_dev_exit(cookie);
+
+out:
+	if (ret) {
+		panthor_device_unplug(ptdev);
+		drm_err(&ptdev->base, "Failed to boot MCU after reset, making device unusable.");
+	}
+}
+
+static bool panthor_device_is_initialized(struct panthor_device *ptdev)
+{
+	return !!ptdev->scheduler;
+}
+
+static void panthor_device_free_page(struct drm_device *ddev, void *data)
+{
+	free_page((unsigned long)data);
+}
+
+int panthor_device_init(struct panthor_device *ptdev)
+{
+	struct resource *res;
+	struct page *p;
+	int ret;
+
+	ptdev->coherent = device_get_dma_attr(ptdev->base.dev) == DEV_DMA_COHERENT;
+
+	drmm_mutex_init(&ptdev->base, &ptdev->pm.lock);
+	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
+	p = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!p)
+		return -ENOMEM;
+
+	ptdev->pm.dummy_latest_flush = page_address(p);
+	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_free_page,
+				       ptdev->pm.dummy_latest_flush);
+	if (ret)
+		return ret;
+
+	/* Set the dummy page to the default LATEST_FLUSH value. This
+	 * will be updated on the next suspend.
+	 */
+	*ptdev->pm.dummy_latest_flush = CSF_GPU_LATEST_FLUSH_ID_DEFAULT;
+
+	INIT_WORK(&ptdev->reset.work, panthor_device_reset_work);
+	ptdev->reset.wq = alloc_ordered_workqueue("panthor-reset-wq", 0);
+	if (!ptdev->reset.wq)
+		return -ENOMEM;
+
+	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_reset_cleanup, NULL);
+	if (ret)
+		return ret;
+
+	ret = panthor_clk_init(ptdev);
+	if (ret)
+		return ret;
+
+	ret = panthor_devfreq_init(ptdev);
+	if (ret)
+		return ret;
+
+	ptdev->iomem = devm_platform_get_and_ioremap_resource(to_platform_device(ptdev->base.dev),
+							      0, &res);
+	if (IS_ERR(ptdev->iomem))
+		return PTR_ERR(ptdev->iomem);
+
+	ptdev->phys_addr = res->start;
+
+	ret = devm_pm_runtime_enable(ptdev->base.dev);
+	if (ret)
+		return ret;
+
+	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	if (ret)
+		return ret;
+
+	ret = panthor_gpu_init(ptdev);
+	if (ret)
+		goto err_rpm_put;
+
+	ret = panthor_mmu_init(ptdev);
+	if (ret)
+		goto err_rpm_put;
+
+	ret = panthor_fw_init(ptdev);
+	if (ret)
+		goto err_rpm_put;
+
+	ret = panthor_sched_init(ptdev);
+	if (ret)
+		goto err_rpm_put;
+
+	/* ~3 frames */
+	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
+	pm_runtime_use_autosuspend(ptdev->base.dev);
+	pm_runtime_put_autosuspend(ptdev->base.dev);
+	return 0;
+
+err_rpm_put:
+	pm_runtime_put_sync_suspend(ptdev->base.dev);
+	return ret;
+}
+
+#define PANTHOR_EXCEPTION(id) \
+	[DRM_PANTHOR_EXCEPTION_ ## id] = { \
+		.name = #id, \
+	}
+
+struct panthor_exception_info {
+	const char *name;
+};
+
+static const struct panthor_exception_info panthor_exception_infos[] = {
+	PANTHOR_EXCEPTION(OK),
+	PANTHOR_EXCEPTION(TERMINATED),
+	PANTHOR_EXCEPTION(KABOOM),
+	PANTHOR_EXCEPTION(EUREKA),
+	PANTHOR_EXCEPTION(ACTIVE),
+	PANTHOR_EXCEPTION(CS_RES_TERM),
+	PANTHOR_EXCEPTION(CS_CONFIG_FAULT),
+	PANTHOR_EXCEPTION(CS_ENDPOINT_FAULT),
+	PANTHOR_EXCEPTION(CS_BUS_FAULT),
+	PANTHOR_EXCEPTION(CS_INSTR_INVALID),
+	PANTHOR_EXCEPTION(CS_CALL_STACK_OVERFLOW),
+	PANTHOR_EXCEPTION(CS_INHERIT_FAULT),
+	PANTHOR_EXCEPTION(INSTR_INVALID_PC),
+	PANTHOR_EXCEPTION(INSTR_INVALID_ENC),
+	PANTHOR_EXCEPTION(INSTR_BARRIER_FAULT),
+	PANTHOR_EXCEPTION(DATA_INVALID_FAULT),
+	PANTHOR_EXCEPTION(TILE_RANGE_FAULT),
+	PANTHOR_EXCEPTION(ADDR_RANGE_FAULT),
+	PANTHOR_EXCEPTION(IMPRECISE_FAULT),
+	PANTHOR_EXCEPTION(OOM),
+	PANTHOR_EXCEPTION(CSF_FW_INTERNAL_ERROR),
+	PANTHOR_EXCEPTION(CSF_RES_EVICTION_TIMEOUT),
+	PANTHOR_EXCEPTION(GPU_BUS_FAULT),
+	PANTHOR_EXCEPTION(GPU_SHAREABILITY_FAULT),
+	PANTHOR_EXCEPTION(SYS_SHAREABILITY_FAULT),
+	PANTHOR_EXCEPTION(GPU_CACHEABILITY_FAULT),
+	PANTHOR_EXCEPTION(TRANSLATION_FAULT_0),
+	PANTHOR_EXCEPTION(TRANSLATION_FAULT_1),
+	PANTHOR_EXCEPTION(TRANSLATION_FAULT_2),
+	PANTHOR_EXCEPTION(TRANSLATION_FAULT_3),
+	PANTHOR_EXCEPTION(TRANSLATION_FAULT_4),
+	PANTHOR_EXCEPTION(PERM_FAULT_0),
+	PANTHOR_EXCEPTION(PERM_FAULT_1),
+	PANTHOR_EXCEPTION(PERM_FAULT_2),
+	PANTHOR_EXCEPTION(PERM_FAULT_3),
+	PANTHOR_EXCEPTION(ACCESS_FLAG_1),
+	PANTHOR_EXCEPTION(ACCESS_FLAG_2),
+	PANTHOR_EXCEPTION(ACCESS_FLAG_3),
+	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_IN),
+	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
+	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
+	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
+	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
+	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_0),
+	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_1),
+	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_2),
+	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_3),
+};
+
+const char *panthor_exception_name(struct panthor_device *ptdev, u32 exception_code)
+{
+	if (drm_WARN_ON(&ptdev->base,
+			exception_code >= ARRAY_SIZE(panthor_exception_infos) ||
+			!panthor_exception_infos[exception_code].name))
+		return "Unknown exception type";
+
+	return panthor_exception_infos[exception_code].name;
+}
+
+static vm_fault_t panthor_mmio_vm_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct panthor_device *ptdev = vma->vm_private_data;
+	u64 id = vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long pfn;
+	pgprot_t pgprot;
+	vm_fault_t ret;
+	bool active;
+	int cookie;
+
+	if (!drm_dev_enter(&ptdev->base, &cookie))
+		return VM_FAULT_SIGBUS;
+
+	mutex_lock(&ptdev->pm.lock);
+	active = atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE;
+
+	switch (id) {
+	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
+		if (active)
+			pfn = __phys_to_pfn(ptdev->phys_addr + CSF_GPU_LATEST_FLUSH_ID);
+		else
+			pfn = virt_to_pfn(ptdev->pm.dummy_latest_flush);
+		break;
+
+	default:
+		ret = VM_FAULT_SIGBUS;
+		goto out_unlock;
+	}
+
+	pgprot = vma->vm_page_prot;
+	if (active)
+		pgprot = pgprot_noncached(pgprot);
+
+	ret = vmf_insert_pfn_prot(vma, vmf->address, pfn, pgprot);
+
+out_unlock:
+	mutex_unlock(&ptdev->pm.lock);
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+static const struct vm_operations_struct panthor_mmio_vm_ops = {
+	.fault = panthor_mmio_vm_fault,
+};
+
+int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *vma)
+{
+	u64 id = vma->vm_pgoff << PAGE_SHIFT;
+
+	switch (id) {
+	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
+		if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
+		    (vma->vm_flags & (VM_WRITE | VM_EXEC)))
+			return -EINVAL;
+
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	/* Defer actual mapping to the fault handler. */
+	vma->vm_private_data = ptdev;
+	vma->vm_ops = &panthor_mmio_vm_ops;
+	vm_flags_set(vma,
+		     VM_IO | VM_DONTCOPY | VM_DONTEXPAND |
+		     VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP);
+	return 0;
+}
+
+#ifdef CONFIG_PM
+int panthor_device_resume(struct device *dev)
+{
+	struct panthor_device *ptdev = dev_get_drvdata(dev);
+	int ret, cookie;
+
+	mutex_lock(&ptdev->pm.lock);
+	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_RESUMING);
+
+	ret = clk_prepare_enable(ptdev->clks.core);
+	if (ret)
+		goto err_unlock;
+
+	ret = clk_prepare_enable(ptdev->clks.stacks);
+	if (ret)
+		goto err_disable_core_clk;
+
+	ret = clk_prepare_enable(ptdev->clks.coregroup);
+	if (ret)
+		goto err_disable_stacks_clk;
+
+	ret = panthor_devfreq_resume(ptdev);
+	if (ret)
+		goto err_disable_coregroup_clk;
+
+	if (panthor_device_is_initialized(ptdev) &&
+	    drm_dev_enter(&ptdev->base, &cookie)) {
+		panthor_gpu_resume(ptdev);
+		panthor_mmu_resume(ptdev);
+		ret = panthor_fw_resume(ptdev);
+		drm_WARN_ON(&ptdev->base, ret);
+		if (!ret)
+			panthor_sched_resume(ptdev);
+
+		drm_dev_exit(cookie);
+
+		if (ret)
+			goto err_devfreq_suspend;
+	}
+
+	/* Clear all IOMEM mappings pointing to this device after we've
+	 * resumed. This way the fake mappings pointing to the dummy pages
+	 * are removed and the real iomem mapping will be restored on next
+	 * access.
+	 */
+	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
+			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
+	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
+	if (atomic_read(&ptdev->reset.pending))
+		queue_work(ptdev->reset.wq, &ptdev->reset.work);
+
+	mutex_unlock(&ptdev->pm.lock);
+	return 0;
+
+err_devfreq_suspend:
+	panthor_devfreq_suspend(ptdev);
+
+err_disable_coregroup_clk:
+	clk_disable_unprepare(ptdev->clks.coregroup);
+
+err_disable_stacks_clk:
+	clk_disable_unprepare(ptdev->clks.stacks);
+
+err_disable_core_clk:
+	clk_disable_unprepare(ptdev->clks.core);
+
+err_unlock:
+	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
+	mutex_unlock(&ptdev->pm.lock);
+	return ret;
+}
+
+int panthor_device_suspend(struct device *dev)
+{
+	struct panthor_device *ptdev = dev_get_drvdata(dev);
+	int ret, cookie;
+
+	if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
+		return 0;
+
+	mutex_lock(&ptdev->pm.lock);
+	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDING);
+
+	/* Clear all IOMEM mappings pointing to this device before we
+	 * shutdown the power-domain and clocks. Failing to do that results
+	 * in external aborts when the process accesses the iomem region.
+	 */
+	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
+			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
+
+	if (panthor_device_is_initialized(ptdev) &&
+	    drm_dev_enter(&ptdev->base, &cookie)) {
+		cancel_work_sync(&ptdev->reset.work);
+
+		/* We prepare everything as if we were resetting the GPU.
+		 * The end of the reset will happen in the resume path though.
+		 */
+		panthor_sched_suspend(ptdev);
+		panthor_fw_suspend(ptdev);
+		panthor_mmu_suspend(ptdev);
+		panthor_gpu_suspend(ptdev);
+		drm_dev_exit(cookie);
+	}
+
+	ret = panthor_devfreq_suspend(ptdev);
+	if (ret) {
+		if (panthor_device_is_initialized(ptdev) &&
+		    drm_dev_enter(&ptdev->base, &cookie)) {
+			panthor_gpu_resume(ptdev);
+			panthor_mmu_resume(ptdev);
+			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
+			panthor_sched_resume(ptdev);
+			drm_dev_exit(cookie);
+		}
+
+		atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
+		goto out_unlock;
+	}
+
+	/* Before we suspend, update the dummy_latest_flush page, so accesses
+	 * to this dummy page return the value the HW would have returned.
+	 */
+	*ptdev->pm.dummy_latest_flush = gpu_read(ptdev, CSF_GPU_LATEST_FLUSH_ID);
+
+	clk_disable_unprepare(ptdev->clks.coregroup);
+	clk_disable_unprepare(ptdev->clks.stacks);
+	clk_disable_unprepare(ptdev->clks.core);
+	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
+
+out_unlock:
+	mutex_unlock(&ptdev->pm.lock);
+	return ret;
+}
+#endif
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
new file mode 100644
index 000000000000..e0e1be263eb9
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -0,0 +1,354 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANTHOR_DEVICE_H__
+#define __PANTHOR_DEVICE_H__
+
+#include <linux/atomic.h>
+#include <linux/io-pgtable.h>
+#include <linux/regulator/consumer.h>
+#include <linux/spinlock.h>
+#include <drm/drm_device.h>
+#include <drm/drm_mm.h>
+#include <drm/gpu_scheduler.h>
+#include <drm/panthor_drm.h>
+
+struct panthor_csf;
+struct panthor_csf_ctx;
+struct panthor_device;
+struct panthor_gpu;
+struct panthor_group_pool;
+struct panthor_heap_pool;
+struct panthor_job;
+struct panthor_mmu;
+struct panthor_fw;
+struct panthor_perfcnt;
+struct panthor_vm;
+struct panthor_vm_pool;
+
+/**
+ * enum panthor_device_pm_state - PM state
+ */
+enum panthor_device_pm_state {
+	/** @PANTHOR_DEVICE_PM_STATE_SUSPENDED: Device is suspended. */
+	PANTHOR_DEVICE_PM_STATE_SUSPENDED = 0,
+
+	/** @PANTHOR_DEVICE_PM_STATE_RESUMING: Device is being resumed. */
+	PANTHOR_DEVICE_PM_STATE_RESUMING,
+
+	/** @PANTHOR_DEVICE_PM_STATE_ACTIVE: Device is active. */
+	PANTHOR_DEVICE_PM_STATE_ACTIVE,
+
+	/** @PANTHOR_DEVICE_PM_STATE_SUSPENDING: Device is being suspended. */
+	PANTHOR_DEVICE_PM_STATE_SUSPENDING,
+};
+
+/**
+ * struct panthor_irq - IRQ data
+ *
+ * Used to automate IRQ handling for the 3 different IRQs we have in this driver.
+ */
+struct panthor_irq {
+	/** @ptdev: Panthor device */
+	struct panthor_device *ptdev;
+
+	/** @irq: IRQ number. */
+	int irq;
+
+	/** @mask: Current mask being applied to xxx_INT_MASK. */
+	u32 mask;
+
+	/** @suspended: Set to true when the IRQ is suspended. */
+	atomic_t suspended;
+};
+
+/**
+ * struct panthor_device - Panthor device
+ */
+struct panthor_device {
+	/** @base: Base drm_device. */
+	struct drm_device base;
+
+	/** @phys_addr: Physical address of the iomem region. */
+	phys_addr_t phys_addr;
+
+	/** @iomem: CPU mapping of the IOMEM region. */
+	void __iomem *iomem;
+
+	/** @clks: GPU clocks. */
+	struct {
+		/** @core: Core clock. */
+		struct clk *core;
+
+		/** @stacks: Stacks clock. This clock is optional. */
+		struct clk *stacks;
+
+		/** @coregroup: Core group clock. This clock is optional. */
+		struct clk *coregroup;
+	} clks;
+
+	/** @coherent: True if the CPU/GPU are memory coherent. */
+	bool coherent;
+
+	/** @gpu_info: GPU information. */
+	struct drm_panthor_gpu_info gpu_info;
+
+	/** @csif_info: Command stream interface information. */
+	struct drm_panthor_csif_info csif_info;
+
+	/** @gpu: GPU management data. */
+	struct panthor_gpu *gpu;
+
+	/** @fw: FW management data. */
+	struct panthor_fw *fw;
+
+	/** @mmu: MMU management data. */
+	struct panthor_mmu *mmu;
+
+	/** @scheduler: Scheduler management data. */
+	struct panthor_scheduler *scheduler;
+
+	/** @devfreq: Device frequency scaling management data. */
+	struct panthor_devfreq *devfreq;
+
+	/** @reset: Reset related fields. */
+	struct {
+		/** @wq: Ordered workqueue used to schedule reset operations. */
+		struct workqueue_struct *wq;
+
+		/** @work: Reset work. */
+		struct work_struct work;
+
+		/** @pending: Set to true if a reset is pending. */
+		atomic_t pending;
+	} reset;
+
+	/** @pm: Power management related data. */
+	struct {
+		/** @state: Power state, see panthor_device_pm_state. */
+		atomic_t state;
+
+		/**
+		 * @lock: Lock protecting the suspend/resume operations.
+		 *
+		 * This is needed to ensure we map the dummy IO pages when
+		 * the device is being suspended, and the real IO pages when
+		 * the device is being resumed. The atomicity of the pm.state
+		 * field alone is not enough to deal with this race.
+		 */
+		struct mutex lock;
+
+		/**
+		 * @dummy_latest_flush: Dummy LATEST_FLUSH page.
+		 *
+		 * Used to replace the real LATEST_FLUSH page when the GPU
+		 * is suspended.
+		 */
+		u32 *dummy_latest_flush;
+	} pm;
+};
+
+/**
+ * struct panthor_file - Panthor file
+ */
+struct panthor_file {
+	/** @ptdev: Device attached to this file. */
+	struct panthor_device *ptdev;
+
+	/** @vms: VM pool attached to this file. */
+	struct panthor_vm_pool *vms;
+
+	/** @groups: Scheduling group pool attached to this file. */
+	struct panthor_group_pool *groups;
+};
+
+int panthor_device_init(struct panthor_device *ptdev);
+void panthor_device_unplug(struct panthor_device *ptdev);
+
+/**
+ * panthor_device_schedule_reset() - Schedules a reset operation
+ * @ptdev: Device.
+ */
+static inline void panthor_device_schedule_reset(struct panthor_device *ptdev)
+{
+	if (atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE &&
+	    !atomic_cmpxchg(&ptdev->reset.pending, 0, 1))
+		queue_work(ptdev->reset.wq, &ptdev->reset.work);
+}
+
+/**
+ * panthor_device_reset_is_pending() - Checks if a reset is pending.
+ * @ptdev: Device.
+ *
+ * Return: true if a reset is pending, false otherwise.
+ */
+static inline bool panthor_device_reset_is_pending(struct panthor_device *ptdev)
+{
+	return atomic_read(&ptdev->reset.pending) != 0;
+}
+
+int panthor_device_mmap_io(struct panthor_device *ptdev,
+			   struct vm_area_struct *vma);
+
+int panthor_device_resume(struct device *dev);
+int panthor_device_suspend(struct device *dev);
+
+enum drm_panthor_exception_type {
+	DRM_PANTHOR_EXCEPTION_OK = 0x00,
+	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
+	DRM_PANTHOR_EXCEPTION_KABOOM = 0x05,
+	DRM_PANTHOR_EXCEPTION_EUREKA = 0x06,
+	DRM_PANTHOR_EXCEPTION_ACTIVE = 0x08,
+	DRM_PANTHOR_EXCEPTION_CS_RES_TERM = 0x0f,
+	DRM_PANTHOR_EXCEPTION_MAX_NON_FAULT = 0x3f,
+	DRM_PANTHOR_EXCEPTION_CS_CONFIG_FAULT = 0x40,
+	DRM_PANTHOR_EXCEPTION_CS_ENDPOINT_FAULT = 0x44,
+	DRM_PANTHOR_EXCEPTION_CS_BUS_FAULT = 0x48,
+	DRM_PANTHOR_EXCEPTION_CS_INSTR_INVALID = 0x49,
+	DRM_PANTHOR_EXCEPTION_CS_CALL_STACK_OVERFLOW = 0x4a,
+	DRM_PANTHOR_EXCEPTION_CS_INHERIT_FAULT = 0x4b,
+	DRM_PANTHOR_EXCEPTION_INSTR_INVALID_PC = 0x50,
+	DRM_PANTHOR_EXCEPTION_INSTR_INVALID_ENC = 0x51,
+	DRM_PANTHOR_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
+	DRM_PANTHOR_EXCEPTION_DATA_INVALID_FAULT = 0x58,
+	DRM_PANTHOR_EXCEPTION_TILE_RANGE_FAULT = 0x59,
+	DRM_PANTHOR_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
+	DRM_PANTHOR_EXCEPTION_IMPRECISE_FAULT = 0x5b,
+	DRM_PANTHOR_EXCEPTION_OOM = 0x60,
+	DRM_PANTHOR_EXCEPTION_CSF_FW_INTERNAL_ERROR = 0x68,
+	DRM_PANTHOR_EXCEPTION_CSF_RES_EVICTION_TIMEOUT = 0x69,
+	DRM_PANTHOR_EXCEPTION_GPU_BUS_FAULT = 0x80,
+	DRM_PANTHOR_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
+	DRM_PANTHOR_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
+	DRM_PANTHOR_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
+	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
+	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
+	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
+	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
+	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
+	DRM_PANTHOR_EXCEPTION_PERM_FAULT_0 = 0xc8,
+	DRM_PANTHOR_EXCEPTION_PERM_FAULT_1 = 0xc9,
+	DRM_PANTHOR_EXCEPTION_PERM_FAULT_2 = 0xca,
+	DRM_PANTHOR_EXCEPTION_PERM_FAULT_3 = 0xcb,
+	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
+	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_2 = 0xda,
+	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
+	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_IN = 0xe0,
+	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
+	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
+	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
+	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
+	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
+	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
+	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
+	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
+};
+
+/**
+ * panthor_exception_is_fault() - Checks if an exception is a fault.
+ * @exception_code: Exception code to check.
+ *
+ * Return: true if the exception is a fault, false otherwise.
+ */
+static inline bool
+panthor_exception_is_fault(u32 exception_code)
+{
+	return exception_code > DRM_PANTHOR_EXCEPTION_MAX_NON_FAULT;
+}
+
+const char *panthor_exception_name(struct panthor_device *ptdev,
+				   u32 exception_code);
+
+/**
+ * PANTHOR_IRQ_HANDLER() - Define interrupt handlers and the interrupt
+ * registration function.
+ *
+ * The boilerplate needed to gracefully deal with shared interrupts is
+ * auto-generated. All you have to do is call PANTHOR_IRQ_HANDLER()
+ * just after your actual handler. The handler prototype is:
+ *
+ * void (*handler)(struct panthor_device *, u32 status);
+ */
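+/*
+ * Illustrative example (this is what the GPU logical block later in this
+ * series does):
+ *
+ *   static void panthor_gpu_irq_handler(struct panthor_device *ptdev, u32 status)
+ *   {
+ *           ...process the interrupt sources flagged in status...
+ *   }
+ *   PANTHOR_IRQ_HANDLER(gpu, GPU, panthor_gpu_irq_handler);
+ *
+ * This generates panthor_gpu_irq_raw_handler(), panthor_gpu_irq_threaded_handler(),
+ * panthor_gpu_irq_suspend(), panthor_gpu_irq_resume() and
+ * panthor_request_gpu_irq().
+ */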
+#define PANTHOR_IRQ_HANDLER(__name, __reg_prefix, __handler)					\
+static irqreturn_t panthor_ ## __name ## _irq_raw_handler(int irq, void *data)			\
+{												\
+	struct panthor_irq *pirq = data;							\
+	struct panthor_device *ptdev = pirq->ptdev;						\
+												\
+	if (!gpu_read(ptdev, __reg_prefix ## _INT_STAT))					\
+		return IRQ_NONE;								\
+												\
+	gpu_write(ptdev, __reg_prefix ## _INT_MASK, 0);						\
+	return IRQ_WAKE_THREAD;									\
+}												\
+												\
+static irqreturn_t panthor_ ## __name ## _irq_threaded_handler(int irq, void *data)		\
+{												\
+	struct panthor_irq *pirq = data;							\
+	struct panthor_device *ptdev = pirq->ptdev;						\
+	irqreturn_t ret = IRQ_NONE;								\
+												\
+	while (true) {										\
+		u32 status = gpu_read(ptdev, __reg_prefix ## _INT_RAWSTAT) & pirq->mask;	\
+												\
+		if (!status)									\
+			break;									\
+												\
+		gpu_write(ptdev, __reg_prefix ## _INT_CLEAR, status);				\
+												\
+		__handler(ptdev, status);							\
+		ret = IRQ_HANDLED;								\
+	}											\
+												\
+	if (!atomic_read(&pirq->suspended))							\
+		gpu_write(ptdev, __reg_prefix ## _INT_MASK, pirq->mask);			\
+												\
+	return ret;										\
+}												\
+												\
+static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)			\
+{												\
+	int cookie;										\
+												\
+	atomic_set(&pirq->suspended, true);							\
+												\
+	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
+		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\
+		synchronize_irq(pirq->irq);							\
+		drm_dev_exit(cookie);								\
+	}											\
+												\
+	pirq->mask = 0;										\
+}												\
+												\
+static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\
+{												\
+	int cookie;										\
+												\
+	atomic_set(&pirq->suspended, false);							\
+	pirq->mask = mask;									\
+												\
+	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
+		gpu_write(pirq->ptdev, __reg_prefix ## _INT_CLEAR, mask);			\
+		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, mask);			\
+		drm_dev_exit(cookie);								\
+	}											\
+}												\
+												\
+static int panthor_request_ ## __name ## _irq(struct panthor_device *ptdev,			\
+					      struct panthor_irq *pirq,				\
+					      int irq, u32 mask)				\
+{												\
+	pirq->ptdev = ptdev;									\
+	pirq->irq = irq;									\
+	panthor_ ## __name ## _irq_resume(pirq, mask);						\
+												\
+	return devm_request_threaded_irq(ptdev->base.dev, irq,					\
+					 panthor_ ## __name ## _irq_raw_handler,		\
+					 panthor_ ## __name ## _irq_threaded_handler,		\
+					 IRQF_SHARED, KBUILD_MODNAME "-" # __name,		\
+					 pirq);							\
+}
+
+extern struct workqueue_struct *panthor_cleanup_wq;
+
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 05/15] drm/panthor: Add the GPU logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (3 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 04/15] drm/panthor: Add the device logical block Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-14 10:54   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 06/15] drm/panthor: Add GEM " Boris Brezillon
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Handles everything that's not related to the FW, the MMU or the
scheduler. This is the block dealing with the GPU property retrieval,
the GPU block power on/off logic, and some global operations, like
global cache flushing.
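
Illustrative only (not part of the patch): a rough sketch of how another
logical block is expected to consume the helpers added here. The function
name below is made up; panthor_gpu_soft_reset() and panthor_gpu_l2_power_on()
are the helpers introduced in this patch, and the sequence mirrors what the
reset handler in the device logical block does:

  /* Hypothetical consumer: bring the GPU back to a usable state. */
  static int example_gpu_reinit(struct panthor_device *ptdev)
  {
  	int ret;

  	/* Reset the whole GPU, then power the L2 back on. The FW takes
  	 * care of powering the other blocks.
  	 */
  	ret = panthor_gpu_soft_reset(ptdev);
  	if (ret)
  		return ret;

  	return panthor_gpu_l2_power_on(ptdev);
  }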

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Use the panthor_irq layer to manage/process IRQs

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_gpu.c | 463 ++++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_gpu.h |  52 +++
 2 files changed, 515 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h

diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
new file mode 100644
index 000000000000..47d15334b46e
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -0,0 +1,463 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
+/* Copyright 2019 Collabora ltd. */
+
+#include <linux/bitfield.h>
+#include <linux/bitmap.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
+
+#include <drm/drm_drv.h>
+#include <drm/drm_managed.h>
+
+#include "panthor_device.h"
+#include "panthor_gpu.h"
+#include "panthor_regs.h"
+
+/**
+ * struct panthor_gpu - GPU block management data.
+ */
+struct panthor_gpu {
+	/** @irq: GPU irq. */
+	struct panthor_irq irq;
+
+	/** @reqs_lock: Lock protecting access to pending_reqs. */
+	spinlock_t reqs_lock;
+
+	/** @pending_reqs: Pending GPU requests. */
+	u32 pending_reqs;
+
+	/** @reqs_acked: GPU request wait queue. */
+	wait_queue_head_t reqs_acked;
+};
+
+/**
+ * struct panthor_model - GPU model description
+ */
+struct panthor_model {
+	/** @name: Model name. */
+	const char *name;
+
+	/** @id: Model ID. */
+	u32 id;
+};
+
+/**
+ * GPU_MODEL() - Define a GPU model.
+ */
+#define GPU_MODEL(_name, _id, ...) \
+{\
+	.name = __stringify(_name),				\
+	.id = _id,						\
+}
+
+#define GPU_MODEL_ID_MASK		0xf00f0000
+
+static const struct panthor_model gpu_models[] = {
+	GPU_MODEL(g610, 0xa0070000),
+	{},
+};
+
+#define GPU_INTERRUPTS_MASK	\
+	(GPU_IRQ_FAULT | \
+	 GPU_IRQ_PROTM_FAULT | \
+	 GPU_IRQ_RESET_COMPLETED | \
+	 GPU_IRQ_MCU_STATUS_CHANGED | \
+	 GPU_IRQ_CLEAN_CACHES_COMPLETED)
+
+static void panthor_gpu_init_info(struct panthor_device *ptdev)
+{
+	const struct panthor_model *model;
+	u32 major, minor, status;
+	unsigned int i;
+
+	ptdev->gpu_info.gpu_id = gpu_read(ptdev, GPU_ID);
+	ptdev->gpu_info.csf_id = gpu_read(ptdev, GPU_CSF_ID);
+	ptdev->gpu_info.gpu_rev = gpu_read(ptdev, GPU_REVID);
+	ptdev->gpu_info.l2_features = gpu_read(ptdev, GPU_L2_FEATURES);
+	ptdev->gpu_info.tiler_features = gpu_read(ptdev, GPU_TILER_FEATURES);
+	ptdev->gpu_info.mem_features = gpu_read(ptdev, GPU_MEM_FEATURES);
+	ptdev->gpu_info.mmu_features = gpu_read(ptdev, GPU_MMU_FEATURES);
+	ptdev->gpu_info.thread_features = gpu_read(ptdev, GPU_THREAD_FEATURES);
+	ptdev->gpu_info.max_threads = gpu_read(ptdev, GPU_THREAD_MAX_THREADS);
+	ptdev->gpu_info.thread_max_workgroup_size = gpu_read(ptdev, GPU_THREAD_MAX_WORKGROUP_SIZE);
+	ptdev->gpu_info.thread_max_barrier_size = gpu_read(ptdev, GPU_THREAD_MAX_BARRIER_SIZE);
+	ptdev->gpu_info.coherency_features = gpu_read(ptdev, GPU_COHERENCY_FEATURES);
+	for (i = 0; i < 4; i++)
+		ptdev->gpu_info.texture_features[i] = gpu_read(ptdev, GPU_TEXTURE_FEATURES(i));
+
+	ptdev->gpu_info.as_present = gpu_read(ptdev, GPU_AS_PRESENT);
+
+	ptdev->gpu_info.shader_present = gpu_read(ptdev, GPU_SHADER_PRESENT_LO);
+	ptdev->gpu_info.shader_present |= (u64)gpu_read(ptdev, GPU_SHADER_PRESENT_HI) << 32;
+
+	ptdev->gpu_info.tiler_present = gpu_read(ptdev, GPU_TILER_PRESENT_LO);
+	ptdev->gpu_info.tiler_present |= (u64)gpu_read(ptdev, GPU_TILER_PRESENT_HI) << 32;
+
+	ptdev->gpu_info.l2_present = gpu_read(ptdev, GPU_L2_PRESENT_LO);
+	ptdev->gpu_info.l2_present |= (u64)gpu_read(ptdev, GPU_L2_PRESENT_HI) << 32;
+	ptdev->gpu_info.core_group_count = hweight64(ptdev->gpu_info.l2_present);
+
+	major = (ptdev->gpu_info.gpu_id >> 12) & 0xf;
+	minor = (ptdev->gpu_info.gpu_id >> 4) & 0xff;
+	status = ptdev->gpu_info.gpu_id & 0xf;
+
+	for (model = gpu_models; model->name; model++) {
+		if (model->id == (ptdev->gpu_info.gpu_id & GPU_MODEL_ID_MASK))
+			break;
+	}
+
+	drm_info(&ptdev->base,
+		 "mali-%s id 0x%x major 0x%x minor 0x%x status 0x%x",
+		 model->name ?: "unknown", ptdev->gpu_info.gpu_id >> 16,
+		 major, minor, status);
+
+	drm_info(&ptdev->base,
+		 "Features: L2:0x%08x Tiler:0x%08x Mem:0x%0x MMU:0x%08x AS:0x%x",
+		 ptdev->gpu_info.l2_features,
+		 ptdev->gpu_info.tiler_features,
+		 ptdev->gpu_info.mem_features,
+		 ptdev->gpu_info.mmu_features,
+		 ptdev->gpu_info.as_present);
+
+	drm_info(&ptdev->base,
+		 "shader_present=0x%0llx l2_present=0x%0llx tiler_present=0x%0llx",
+		 ptdev->gpu_info.shader_present, ptdev->gpu_info.l2_present,
+		 ptdev->gpu_info.tiler_present);
+}
+
+static void panthor_gpu_irq_handler(struct panthor_device *ptdev, u32 status)
+{
+	if (status & (GPU_IRQ_FAULT | GPU_IRQ_PROTM_FAULT)) {
+		u32 fault_status = gpu_read(ptdev, GPU_FAULT_STATUS);
+		u64 address = ((u64)gpu_read(ptdev, GPU_FAULT_ADDR_HI) << 32) |
+			      gpu_read(ptdev, GPU_FAULT_ADDR_LO);
+
+		drm_warn(&ptdev->base, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
+			 fault_status, panthor_exception_name(ptdev, fault_status & 0xFF),
+			 address);
+	}
+
+	spin_lock(&ptdev->gpu->reqs_lock);
+	if (status & ptdev->gpu->pending_reqs) {
+		ptdev->gpu->pending_reqs &= ~status;
+		wake_up_all(&ptdev->gpu->reqs_acked);
+	}
+	spin_unlock(&ptdev->gpu->reqs_lock);
+}
+PANTHOR_IRQ_HANDLER(gpu, GPU, panthor_gpu_irq_handler);
+
+/**
+ * panthor_gpu_unplug() - Called when the GPU is unplugged.
+ * @ptdev: Device.
+ */
+void panthor_gpu_unplug(struct panthor_device *ptdev)
+{
+	unsigned long flags;
+
+	/* Make sure the IRQ handler is not running after that point. */
+	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
+
+	/* Wake-up all waiters. */
+	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
+	ptdev->gpu->pending_reqs = 0;
+	wake_up_all(&ptdev->gpu->reqs_acked);
+	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
+}
+
+/**
+ * panthor_gpu_init() - Initialize the GPU block
+ * @ptdev: Device.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_gpu_init(struct panthor_device *ptdev)
+{
+	struct panthor_gpu *gpu;
+	u32 pa_bits;
+	int ret, irq;
+
+	gpu = drmm_kzalloc(&ptdev->base, sizeof(*gpu), GFP_KERNEL);
+	if (!gpu)
+		return -ENOMEM;
+
+	spin_lock_init(&gpu->reqs_lock);
+	init_waitqueue_head(&gpu->reqs_acked);
+	ptdev->gpu = gpu;
+	panthor_gpu_init_info(ptdev);
+
+	dma_set_max_seg_size(ptdev->base.dev, UINT_MAX);
+	pa_bits = GPU_MMU_FEATURES_PA_BITS(ptdev->gpu_info.mmu_features);
+	ret = dma_set_mask_and_coherent(ptdev->base.dev, DMA_BIT_MASK(pa_bits));
+	if (ret)
+		return ret;
+
+	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "gpu");
+	if (irq <= 0)
+		return irq ?: -ENODEV;
+
+	ret = panthor_request_gpu_irq(ptdev, &ptdev->gpu->irq, irq, GPU_INTERRUPTS_MASK);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+/**
+ * panthor_gpu_block_power_off() - Power-off a specific block of the GPU
+ * @ptdev: Device.
+ * @blk_name: Block name.
+ * @pwroff_reg: Power-off register for this block.
+ * @pwrtrans_reg: Power transition register for this block.
+ * @mask: Sub-elements to power-off.
+ * @timeout_us: Timeout, in microseconds.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_gpu_block_power_off(struct panthor_device *ptdev,
+				const char *blk_name,
+				u32 pwroff_reg, u32 pwrtrans_reg,
+				u64 mask, u32 timeout_us)
+{
+	u32 val, i;
+	int ret;
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
+						 val, !(mask32 & val),
+						 100, timeout_us);
+		if (ret) {
+			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	if (mask & GENMASK(31, 0))
+		gpu_write(ptdev, pwroff_reg, mask);
+
+	if (mask >> 32)
+		gpu_write(ptdev, pwroff_reg + 4, mask >> 32);
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
+						 val, !(mask32 & val),
+						 100, timeout_us);
+		if (ret) {
+			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_gpu_block_power_on() - Power-on a specific block of the GPU
+ * @ptdev: Device.
+ * @blk_name: Block name.
+ * @pwron_reg: Power-on register for this block.
+ * @pwrtrans_reg: Power transition register for this block.
+ * @rdy_reg: Readiness register for this block.
+ * @mask: Sub-elements to power-on.
+ * @timeout_us: Timeout, in microseconds.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_gpu_block_power_on(struct panthor_device *ptdev,
+			       const char *blk_name,
+			       u32 pwron_reg, u32 pwrtrans_reg,
+			       u32 rdy_reg, u64 mask, u32 timeout_us)
+{
+	u32 val, i;
+	int ret;
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
+						 val, !(mask32 & val),
+						 100, timeout_us);
+		if (ret) {
+			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	if (mask & GENMASK(31, 0))
+		gpu_write(ptdev, pwron_reg, mask);
+
+	if (mask >> 32)
+		gpu_write(ptdev, pwron_reg + 4, mask >> 32);
+
+	for (i = 0; i < 2; i++) {
+		u32 mask32 = mask >> (i * 32);
+
+		if (!mask32)
+			continue;
+
+		ret = readl_relaxed_poll_timeout(ptdev->iomem + rdy_reg + (i * 4),
+						 val, (mask32 & val) == mask32,
+						 100, timeout_us);
+		if (ret) {
+			drm_err(&ptdev->base, "timeout waiting on %s:%llx readiness",
+				blk_name, mask);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_gpu_l2_power_on() - Power-on the L2-cache
+ * @ptdev: Device.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
+{
+	u64 core_mask = U64_MAX;
+
+	if (ptdev->gpu_info.l2_present != 1) {
+		/*
+		 * Only support one core group now.
+		 * ~(l2_present - 1) unsets all bits in l2_present except
+		 * the bottom bit. (l2_present - 2) has all the bits in
+		 * the first core group set. AND them together to generate
+		 * a mask of cores in the first core group.
+		 */
+		core_mask = ~(ptdev->gpu_info.l2_present - 1) &
+			     (ptdev->gpu_info.l2_present - 2);
+		drm_info_once(&ptdev->base, "using only 1st core group (%lu cores from %lu)\n",
+			      hweight64(core_mask),
+			      hweight64(ptdev->gpu_info.shader_present));
+	}
+
+	return panthor_gpu_power_on(ptdev, L2,
+				    ptdev->gpu_info.l2_present & core_mask,
+				    20000);
+}
+
+/**
+ * panthor_gpu_flush_caches() - Flush caches
+ * @ptdev: Device.
+ * @l2: L2 flush type.
+ * @lsc: LSC flush type.
+ * @other: Other flush type.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_gpu_flush_caches(struct panthor_device *ptdev,
+			     u32 l2, u32 lsc, u32 other)
+{
+	bool timedout = false;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
+	if (!drm_WARN_ON(&ptdev->base,
+			 ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
+		ptdev->gpu->pending_reqs |= GPU_IRQ_CLEAN_CACHES_COMPLETED;
+		gpu_write(ptdev, GPU_CMD, GPU_FLUSH_CACHES(l2, lsc, other));
+	}
+	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
+
+	if (!wait_event_timeout(ptdev->gpu->reqs_acked,
+				!(ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED),
+				msecs_to_jiffies(100))) {
+		spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
+		if ((ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED) != 0 &&
+		    !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_CLEAN_CACHES_COMPLETED))
+			timedout = true;
+		spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
+	}
+
+	if (timedout) {
+		drm_err(&ptdev->base, "Flush caches timeout");
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_gpu_soft_reset() - Issue a soft-reset
+ * @ptdev: Device.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_gpu_soft_reset(struct panthor_device *ptdev)
+{
+	bool timedout = false;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
+	if (!drm_WARN_ON(&ptdev->base,
+			 ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED)) {
+		ptdev->gpu->pending_reqs |= GPU_IRQ_RESET_COMPLETED;
+		gpu_write(ptdev, GPU_INT_CLEAR, GPU_IRQ_RESET_COMPLETED);
+		gpu_write(ptdev, GPU_CMD, GPU_SOFT_RESET);
+	}
+	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
+
+	if (!wait_event_timeout(ptdev->gpu->reqs_acked,
+				!(ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED),
+				msecs_to_jiffies(100))) {
+		spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
+		if ((ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED) != 0 &&
+		    !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_RESET_COMPLETED))
+			timedout = true;
+		spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
+	}
+
+	if (timedout) {
+		drm_err(&ptdev->base, "Soft reset timeout");
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_gpu_suspend() - Suspend the GPU block.
+ * @ptdev: Device.
+ *
+ * Soft-reset the GPU and suspend the GPU IRQ. This should be called last
+ * in the suspend procedure, after all other blocks have been suspended.
+ */
+void panthor_gpu_suspend(struct panthor_device *ptdev)
+{
+	panthor_gpu_soft_reset(ptdev);
+	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
+}
+
+/**
+ * panthor_gpu_resume() - Resume the GPU block.
+ * @ptdev: Device.
+ *
+ * Resume the IRQ handler and power-on the L2-cache.
+ * The FW takes care of powering the other blocks.
+ */
+void panthor_gpu_resume(struct panthor_device *ptdev)
+{
+	panthor_gpu_irq_resume(&ptdev->gpu->irq, GPU_INTERRUPTS_MASK);
+	panthor_gpu_l2_power_on(ptdev);
+}
diff --git a/drivers/gpu/drm/panthor/panthor_gpu.h b/drivers/gpu/drm/panthor/panthor_gpu.h
new file mode 100644
index 000000000000..bba7555dd3c6
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_gpu.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Collabora ltd. */
+
+#ifndef __PANTHOR_GPU_H__
+#define __PANTHOR_GPU_H__
+
+struct panthor_device;
+
+int panthor_gpu_init(struct panthor_device *ptdev);
+void panthor_gpu_unplug(struct panthor_device *ptdev);
+void panthor_gpu_suspend(struct panthor_device *ptdev);
+void panthor_gpu_resume(struct panthor_device *ptdev);
+
+int panthor_gpu_block_power_on(struct panthor_device *ptdev,
+			       const char *blk_name,
+			       u32 pwron_reg, u32 pwrtrans_reg,
+			       u32 rdy_reg, u64 mask, u32 timeout_us);
+int panthor_gpu_block_power_off(struct panthor_device *ptdev,
+				const char *blk_name,
+				u32 pwroff_reg, u32 pwrtrans_reg,
+				u64 mask, u32 timeout_us);
+
+/**
+ * panthor_gpu_power_on() - Power on the GPU block.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+#define panthor_gpu_power_on(ptdev, type, mask, timeout_us) \
+	panthor_gpu_block_power_on(ptdev, #type, \
+				  type ## _PWRON_LO, \
+				  type ## _PWRTRANS_LO, \
+				  type ## _READY_LO, \
+				  mask, timeout_us)
+
+/**
+ * panthor_gpu_power_off() - Power off the GPU block.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+#define panthor_gpu_power_off(ptdev, type, mask, timeout_us) \
+	panthor_gpu_block_power_off(ptdev, #type, \
+				   type ## _PWROFF_LO, \
+				   type ## _PWRTRANS_LO, \
+				   mask, timeout_us)
+
+int panthor_gpu_l2_power_on(struct panthor_device *ptdev);
+int panthor_gpu_flush_caches(struct panthor_device *ptdev,
+			     u32 l2, u32 lsc, u32 other);
+int panthor_gpu_soft_reset(struct panthor_device *ptdev);
+
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 06/15] drm/panthor: Add GEM logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (4 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 05/15] drm/panthor: Add the GPU " Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-14 13:40   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 07/15] drm/panthor: Add the devfreq " Boris Brezillon
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Anything related to GEM object management is placed here. Nothing
particularly interesting, given the implementation is based on
drm_gem_shmem_object, which does most of the work.
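
As a hedged illustration (not part of the patch), here is roughly how a
kernel-internal user is expected to allocate and map a BO with the helpers
added here. The flags (0) and size are placeholders, and ptdev/vm are assumed
to be in scope; only panthor_gem_create_and_map(), panthor_gem_unmap_and_put()
and PANTHOR_GEM_ALLOC_VA come from this patch:

  u64 gpu_va = PANTHOR_GEM_ALLOC_VA;	/* let the GEM logic pick a VA */
  void *cpu_va;
  struct panthor_gem_object *bo;

  bo = panthor_gem_create_and_map(ptdev, vm, SZ_4K, 0, 0,
  				  &gpu_va, &cpu_va);
  if (IS_ERR(bo))
  	return PTR_ERR(bo);

  /* ... access the buffer through cpu_va, point the GPU at gpu_va ... */

  panthor_gem_unmap_and_put(vm, bo, gpu_va, cpu_va);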

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_gem.c | 229 ++++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_gem.h |  96 +++++++++++
 2 files changed, 325 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h

diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
new file mode 100644
index 000000000000..a441a68822ca
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-mapping.h>
+
+#include <drm/panthor_drm.h>
+
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_mmu.h"
+
+static void panthor_gem_free_object(struct drm_gem_object *obj)
+{
+	struct panthor_gem_object *bo = to_panthor_bo(obj);
+
+	if (drm_WARN_ON(obj->dev, bo->va_node))
+		panthor_vm_free_va(bo->exclusive_vm, bo->va_node);
+
+	panthor_vm_put(bo->exclusive_vm);
+	drm_gem_free_mmap_offset(&bo->base.base);
+	mutex_destroy(&bo->gpuva_list_lock);
+	drm_gem_shmem_free(&bo->base);
+}
+
+/**
+ * panthor_gem_unmap_and_put() - Unmap and drop the reference on a GEM object
+ * @vm: VM to unmap the GEM from.
+ * @bo: GEM object to unmap/release.
+ * @gpu_va: GPU/MCU virtual address the GEM object was mapped at.
+ * @cpu_va: kernel mapping of the GEM object.
+ * Can be NULL if the GEM was not CPU mapped.
+ *
+ * Should be called to undo what was done in panthor_gem_create_and_map().
+ */
+void panthor_gem_unmap_and_put(struct panthor_vm *vm,
+			       struct panthor_gem_object *bo,
+			       u64 gpu_va, void *cpu_va)
+{
+	if (cpu_va) {
+		struct iosys_map map = IOSYS_MAP_INIT_VADDR(cpu_va);
+
+		drm_gem_vunmap_unlocked(&bo->base.base, &map);
+	}
+
+	drm_WARN_ON(bo->base.base.dev, panthor_vm_unmap_range(vm, gpu_va, bo->base.base.size));
+	panthor_vm_free_va(vm, bo->va_node);
+	bo->va_node = NULL;
+	drm_gem_object_put(&bo->base.base);
+}
+
+/**
+ * panthor_gem_create_and_map() - Create and map a GEM object to a VM
+ * @ptdev: Device.
+ * @vm: VM to map the GEM to.
+ * @bo_flags: Combination of drm_panthor_bo_flags flags.
+ * @vm_map_flags: Combination of drm_panthor_vm_bind_op_flags (only those
+ * that are related to map operations).
+ * @gpu_va: Pointer holding the GPU address assigned when mapping to the VM.
+ * If *gpu_va == PANTHOR_GEM_ALLOC_VA, a virtual address range will be allocated
+ * and the allocated address returned, otherwise *gpu_va is used directly.
+ * @cpu_va: Pointer holding the kernel CPU mapping. If NULL, the GEM object
+ * is not CPU-mapped.
+ *
+ * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
+ */
+struct panthor_gem_object *
+panthor_gem_create_and_map(struct panthor_device *ptdev, struct panthor_vm *vm,
+			   size_t size, u32 bo_flags, u32 vm_map_flags,
+			   u64 *gpu_va, void **cpu_va)
+{
+	struct drm_gem_shmem_object *obj;
+	struct panthor_gem_object *bo;
+	int ret;
+
+	obj = drm_gem_shmem_create(&ptdev->base, size);
+	if (IS_ERR(obj))
+		return ERR_CAST(obj);
+
+	bo = to_panthor_bo(&obj->base);
+	bo->flags = bo_flags;
+	bo->exclusive_vm = panthor_vm_get(vm);
+	bo->base.base.resv = panthor_vm_resv(vm);
+
+	if (*gpu_va == PANTHOR_GEM_ALLOC_VA) {
+		bo->va_node = panthor_vm_alloc_va(vm, obj->base.size);
+
+		if (IS_ERR(bo->va_node)) {
+			ret = PTR_ERR(bo->va_node);
+			bo->va_node = NULL;
+			goto err_put_obj;
+		}
+
+		*gpu_va = bo->va_node->start;
+	}
+
+	ret = panthor_vm_map_bo_range(vm, bo, 0, obj->base.size, *gpu_va, vm_map_flags);
+	if (ret)
+		goto err_put_obj;
+
+	if (cpu_va) {
+		struct iosys_map map;
+
+		ret = drm_gem_vmap_unlocked(&obj->base, &map);
+		if (ret)
+			goto err_vm_unmap_range;
+
+		*cpu_va = map.vaddr;
+	}
+
+	return bo;
+
+err_vm_unmap_range:
+	panthor_vm_unmap_range(vm, *gpu_va, obj->base.size);
+
+err_put_obj:
+	drm_gem_object_put(&obj->base);
+	return ERR_PTR(ret);
+}
+
+static int panthor_gem_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma)
+{
+	struct panthor_gem_object *bo = to_panthor_bo(obj);
+
+	/* Don't allow mmap on objects that have the NO_MMAP flag set. */
+	if (bo->flags & DRM_PANTHOR_BO_NO_MMAP)
+		return -EINVAL;
+
+	return drm_gem_shmem_object_mmap(obj, vma);
+}
+
+static struct dma_buf *
+panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
+{
+	/* We can't export GEMs that have an exclusive VM. */
+	if (to_panthor_bo(obj)->exclusive_vm)
+		return ERR_PTR(-EINVAL);
+
+	return drm_gem_prime_export(obj, flags);
+}
+
+static const struct drm_gem_object_funcs panthor_gem_funcs = {
+	.free = panthor_gem_free_object,
+	.print_info = drm_gem_shmem_object_print_info,
+	.pin = drm_gem_shmem_object_pin,
+	.unpin = drm_gem_shmem_object_unpin,
+	.get_sg_table = drm_gem_shmem_object_get_sg_table,
+	.vmap = drm_gem_shmem_object_vmap,
+	.vunmap = drm_gem_shmem_object_vunmap,
+	.mmap = panthor_gem_mmap,
+	.export = panthor_gem_prime_export,
+	.vm_ops = &drm_gem_shmem_vm_ops,
+};
+
+/**
+ * panthor_gem_create_object - Implementation of driver->gem_create_object.
+ * @dev: DRM device
+ * @size: Size in bytes of the memory the object will reference
+ *
+ * This lets the GEM helpers allocate object structs for us, and keep
+ * our BO stats correct.
+ */
+struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t size)
+{
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+	struct panthor_gem_object *obj;
+
+	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+
+	obj->base.base.funcs = &panthor_gem_funcs;
+	obj->base.map_wc = !ptdev->coherent;
+	mutex_init(&obj->gpuva_list_lock);
+	drm_gem_gpuva_set_lock(&obj->base.base, &obj->gpuva_list_lock);
+
+	return &obj->base.base;
+}
+
+/**
+ * panthor_gem_create_with_handle() - Create a GEM object and attach it to a handle.
+ * @file: DRM file.
+ * @ddev: DRM device.
+ * @exclusive_vm: Exclusive VM. Not NULL if the GEM object can't be shared.
+ * @size: Size of the GEM object to allocate.
+ * @flags: Combination of drm_panthor_bo_flags flags.
+ * @handle: Pointer holding the handle pointing to the new GEM object.
+ *
+ * Return: A valid pointer on success, an ERR_PTR() otherwise.
+ */
+struct panthor_gem_object *
+panthor_gem_create_with_handle(struct drm_file *file,
+			       struct drm_device *ddev,
+			       struct panthor_vm *exclusive_vm,
+			       size_t size,
+			       u32 flags, u32 *handle)
+{
+	int ret;
+	struct drm_gem_shmem_object *shmem;
+	struct panthor_gem_object *bo;
+
+	shmem = drm_gem_shmem_create(ddev, size);
+	if (IS_ERR(shmem))
+		return ERR_CAST(shmem);
+
+	bo = to_panthor_bo(&shmem->base);
+	bo->flags = flags;
+
+	if (exclusive_vm) {
+		bo->exclusive_vm = panthor_vm_get(exclusive_vm);
+		bo->base.base.resv = panthor_vm_resv(exclusive_vm);
+	}
+
+	/*
+	 * Allocate an id in the idr table where the obj is registered,
+	 * and return it through @handle so userspace can refer to this object.
+	 */
+	ret = drm_gem_handle_create(file, &shmem->base, handle);
+	/* drop reference from allocate - handle holds it now. */
+	drm_gem_object_put(&shmem->base);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return bo;
+}
diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
new file mode 100644
index 000000000000..07babadc7623
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_gem.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANTHOR_GEM_H__
+#define __PANTHOR_GEM_H__
+
+#include <drm/drm_gem_shmem_helper.h>
+#include <drm/drm_mm.h>
+
+#include <linux/rwsem.h>
+
+struct panthor_vm;
+
+/**
+ * struct panthor_gem_object - Driver specific GEM object.
+ */
+struct panthor_gem_object {
+	/** @base: Inherit from drm_gem_shmem_object. */
+	struct drm_gem_shmem_object base;
+
+	/**
+	 * @va_node: VA space allocated to this GEM.
+	 *
+	 * Should be NULL for all GEM objects managed by userspace.
+	 *
+	 * Not NULL when %PANTHOR_GEM_ALLOC_VA is passed as an address, in
+	 * which case the GEM logic will auto-allocate a VA range before mapping
+	 * to the VM.
+	 *
+	 * @exclusive_vm must be != NULL.
+	 */
+	struct drm_mm_node *va_node;
+
+	/**
+	 * @exclusive_vm: Exclusive VM this GEM object can be mapped to.
+	 *
+	 * If @exclusive_vm != NULL, any attempt to bind the GEM to a different
+	 * VM will fail.
+	 *
+	 * All FW memory objects have this field set to the MCU VM.
+	 */
+	struct panthor_vm *exclusive_vm;
+
+	/**
+	 * @gpuva_list_lock: Custom GPUVA lock.
+	 *
+	 * Used to protect insertion of drm_gpuva elements to the
+	 * drm_gem_object.gpuva.list list.
+	 *
+	 * We can't use the GEM resv for that, because drm_gpuva_link() is
+	 * called in a dma-signaling path, where we're not allowed to take
+	 * resv locks.
+	 */
+	struct mutex gpuva_list_lock;
+
+	/** @flags: Combination of drm_panthor_bo_flags flags. */
+	u32 flags;
+};
+
+static inline
+struct panthor_gem_object *to_panthor_bo(struct drm_gem_object *obj)
+{
+	return container_of(to_drm_gem_shmem_obj(obj), struct panthor_gem_object, base);
+}
+
+struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t size);
+
+struct drm_gem_object *
+panthor_gem_prime_import_sg_table(struct drm_device *ddev,
+				  struct dma_buf_attachment *attach,
+				  struct sg_table *sgt);
+
+struct panthor_gem_object *
+panthor_gem_create_with_handle(struct drm_file *file,
+			       struct drm_device *ddev,
+			       struct panthor_vm *exclusive_vm,
+			       size_t size,
+			       u32 flags,
+			       uint32_t *handle);
+
+void panthor_gem_unmap_and_put(struct panthor_vm *vm, struct panthor_gem_object *bo,
+			       u64 gpu_va, void *cpu_va);
+
+/*
+ * PANTHOR_GEM_ALLOC_VA: Use this magic address when you want the GEM
+ * logic to auto-allocate the virtual address in the reserved kernel VA range.
+ */
+#define PANTHOR_GEM_ALLOC_VA		~0ull
+
+struct panthor_gem_object *
+panthor_gem_create_and_map(struct panthor_device *ptdev, struct panthor_vm *vm,
+			   size_t size, u32 bo_flags, u32 vm_map_flags,
+			   u64 *gpu_va, void **cpu_va);
+
+#endif /* __PANTHOR_GEM_H__ */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 07/15] drm/panthor: Add the devfreq logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (5 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 06/15] drm/panthor: Add GEM " Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-14 13:45   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 08/15] drm/panthor: Add the MMU/VM " Boris Brezillon
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Everything related to devfreq is placed in panthor_devfreq.c, and
helpers that can be called by other logical blocks are exposed through
panthor_devfreq.h.

This implementation is loosely based on the panfrost implementation,
the only difference being that we don't count device users, because
the idle/active state will be managed by the scheduler logic.
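
As a minimal sketch (not in the patch itself), the scheduler is expected to
drive the busy/idle accounting through the two helpers exported here. The
hook names below are hypothetical; only panthor_devfreq_record_busy() and
panthor_devfreq_record_idle() come from this patch:

  /* Called when the first group becomes runnable. */
  static void example_sched_becomes_busy(struct panthor_device *ptdev)
  {
  	panthor_devfreq_record_busy(ptdev);
  }

  /* Called when the last group goes idle. */
  static void example_sched_becomes_idle(struct panthor_device *ptdev)
  {
  	panthor_devfreq_record_idle(ptdev);
  }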

v2:
- Added in v2

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_devfreq.c | 281 ++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_devfreq.h |  25 ++
 2 files changed, 306 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h

diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
new file mode 100644
index 000000000000..500ce34cccc2
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2019 Collabora ltd. */
+
+#include <linux/clk.h>
+#include <linux/devfreq.h>
+#include <linux/devfreq_cooling.h>
+#include <linux/platform_device.h>
+#include <linux/pm_opp.h>
+
+#include <drm/drm_managed.h>
+
+#include "panthor_device.h"
+#include "panthor_devfreq.h"
+
+/**
+ * struct panthor_devfreq - Device frequency management
+ */
+struct panthor_devfreq {
+	/** @devfreq: devfreq device. */
+	struct devfreq *devfreq;
+
+	/** @gov_data: Governor data. */
+	struct devfreq_simple_ondemand_data gov_data;
+
+	/** @busy_time: Busy time. */
+	ktime_t busy_time;
+
+	/** @idle_time: Idle time. */
+	ktime_t idle_time;
+
+	/** @time_last_update: Last update time. */
+	ktime_t time_last_update;
+
+	/** @last_busy_state: True if the GPU was busy last time we updated the state. */
+	bool last_busy_state;
+
+	/*
+	 * Protect busy_time, idle_time, time_last_update and last_busy_state
+	 * because these can be accessed concurrently by panthor_devfreq_get_dev_status()
+	 * and panthor_devfreq_record_{busy,idle}().
+	 */
+	spinlock_t lock;
+};
+
+static void panthor_devfreq_update_utilization(struct panthor_devfreq *pdevfreq)
+{
+	ktime_t now, last;
+
+	now = ktime_get();
+	last = pdevfreq->time_last_update;
+
+	if (pdevfreq->last_busy_state)
+		pdevfreq->busy_time += ktime_sub(now, last);
+	else
+		pdevfreq->idle_time += ktime_sub(now, last);
+
+	pdevfreq->time_last_update = now;
+}
+
+static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
+				  u32 flags)
+{
+	struct dev_pm_opp *opp;
+
+	opp = devfreq_recommended_opp(dev, freq, flags);
+	if (IS_ERR(opp))
+		return PTR_ERR(opp);
+	dev_pm_opp_put(opp);
+
+	return dev_pm_opp_set_rate(dev, *freq);
+}
+
+static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
+{
+	pdevfreq->busy_time = 0;
+	pdevfreq->idle_time = 0;
+	pdevfreq->time_last_update = ktime_get();
+}
+
+static int panthor_devfreq_get_dev_status(struct device *dev,
+					  struct devfreq_dev_status *status)
+{
+	struct panthor_device *ptdev = dev_get_drvdata(dev);
+	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
+	unsigned long irqflags;
+
+	status->current_frequency = clk_get_rate(ptdev->clks.core);
+
+	spin_lock_irqsave(&pdevfreq->lock, irqflags);
+
+	panthor_devfreq_update_utilization(pdevfreq);
+
+	status->total_time = ktime_to_ns(ktime_add(pdevfreq->busy_time,
+						   pdevfreq->idle_time));
+
+	status->busy_time = ktime_to_ns(pdevfreq->busy_time);
+
+	panthor_devfreq_reset(pdevfreq);
+
+	spin_unlock_irqrestore(&pdevfreq->lock, irqflags);
+
+	drm_dbg(&ptdev->base, "busy %lu total %lu %lu %% freq %lu MHz\n",
+		status->busy_time, status->total_time,
+		status->busy_time / (status->total_time / 100),
+		status->current_frequency / 1000 / 1000);
+
+	return 0;
+}
+
+static struct devfreq_dev_profile panthor_devfreq_profile = {
+	.timer = DEVFREQ_TIMER_DELAYED,
+	.polling_ms = 50, /* ~3 frames */
+	.target = panthor_devfreq_target,
+	.get_dev_status = panthor_devfreq_get_dev_status,
+};
+
+int panthor_devfreq_init(struct panthor_device *ptdev)
+{
+	/* There's actually 2 regulators (mali and sram), but the OPP core only
+	 * supports one.
+	 *
+	 * We assume the sram regulator is coupled with the mali one and let
+	 * the coupling logic deal with voltage updates.
+	 */
+	static const char *reg_names[] = { "mali", NULL };
+	struct thermal_cooling_device *cooling;
+	struct device *dev = ptdev->base.dev;
+	struct panthor_devfreq *pdevfreq;
+	struct dev_pm_opp *opp;
+	unsigned long cur_freq;
+	int ret;
+
+	pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
+	if (!pdevfreq)
+		return -ENOMEM;
+
+	ptdev->devfreq = pdevfreq;
+
+	ret = devm_pm_opp_set_regulators(dev, reg_names);
+	if (ret) {
+		if (ret != -EPROBE_DEFER)
+			DRM_DEV_ERROR(dev, "Couldn't set OPP regulators\n");
+
+		return ret;
+	}
+
+	ret = devm_pm_opp_of_add_table(dev);
+	if (ret)
+		return ret;
+
+	spin_lock_init(&pdevfreq->lock);
+
+	panthor_devfreq_reset(pdevfreq);
+
+	cur_freq = clk_get_rate(ptdev->clks.core);
+
+	opp = devfreq_recommended_opp(dev, &cur_freq, 0);
+	if (IS_ERR(opp))
+		return PTR_ERR(opp);
+
+	panthor_devfreq_profile.initial_freq = cur_freq;
+
+	/* Regulator coupling only takes care of synchronizing/balancing voltage
+	 * updates, but the coupled regulator needs to be enabled manually.
+	 *
+	 * We use devm_regulator_get_enable_optional() and keep the sram supply
+	 * enabled until the device is removed, just like we do for the mali
+	 * supply, which is enabled when dev_pm_opp_set_opp(dev, opp) is called,
+	 * and disabled when the opp_table is torn down, using the devm action.
+	 *
+	 * If we really care about disabling regulators on suspend, we should:
+	 * - use devm_regulator_get_optional() here
+	 * - call dev_pm_opp_set_opp(dev, NULL) before leaving this function
+	 *   (this disables the regulator passed to the OPP layer)
+	 * - call dev_pm_opp_set_opp(dev, NULL) and
+	 *   regulator_disable(ptdev->regulators.sram) in
+	 *   panthor_devfreq_suspend()
+	 * - call dev_pm_opp_set_opp(dev, default_opp) and
+	 *   regulator_enable(ptdev->regulators.sram) in
+	 *   panthor_devfreq_resume()
+	 *
+	 * But without knowing if it's beneficial or not (in term of power
+	 * consumption), or how much it slows down the suspend/resume steps,
+	 * let's just keep regulators enabled for the device lifetime.
+	 */
+	ret = devm_regulator_get_enable_optional(dev, "sram");
+	if (ret && ret != -ENODEV) {
+		if (ret != -EPROBE_DEFER)
+			DRM_DEV_ERROR(dev, "Couldn't retrieve/enable sram supply\n");
+		return ret;
+	}
+
+	/*
+	 * Set the recommended OPP. This will enable and configure the
+	 * regulator, if any, and will avoid a switch off by
+	 * regulator_late_cleanup().
+	 */
+	ret = dev_pm_opp_set_opp(dev, opp);
+	if (ret) {
+		DRM_DEV_ERROR(dev, "Couldn't set recommended OPP\n");
+		return ret;
+	}
+
+	dev_pm_opp_put(opp);
+
+	/*
+	 * Set up default thresholds for the simple_ondemand governor.
+	 * With these values, the governor ramps the frequency up when
+	 * utilization goes above 45%, and only considers scaling down
+	 * once it drops below 40% (upthreshold - downdifferential),
+	 * which provides some hysteresis. The values were chosen based
+	 * on experiments.
+	 */
+	pdevfreq->gov_data.upthreshold = 45;
+	pdevfreq->gov_data.downdifferential = 5;
+
+	pdevfreq->devfreq = devm_devfreq_add_device(dev, &panthor_devfreq_profile,
+						    DEVFREQ_GOV_SIMPLE_ONDEMAND,
+						    &pdevfreq->gov_data);
+	if (IS_ERR(pdevfreq->devfreq)) {
+		DRM_DEV_ERROR(dev, "Couldn't initialize GPU devfreq\n");
+		ret = PTR_ERR(pdevfreq->devfreq);
+		pdevfreq->devfreq = NULL;
+		return ret;
+	}
+
+	cooling = devfreq_cooling_em_register(pdevfreq->devfreq, NULL);
+	if (IS_ERR(cooling))
+		DRM_DEV_INFO(dev, "Failed to register cooling device\n");
+
+	return 0;
+}
+
+int panthor_devfreq_resume(struct panthor_device *ptdev)
+{
+	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
+
+	if (!pdevfreq->devfreq)
+		return 0;
+
+	panthor_devfreq_reset(pdevfreq);
+
+	return devfreq_resume_device(pdevfreq->devfreq);
+}
+
+int panthor_devfreq_suspend(struct panthor_device *ptdev)
+{
+	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
+
+	if (!pdevfreq->devfreq)
+		return 0;
+
+	return devfreq_suspend_device(pdevfreq->devfreq);
+}
+
+void panthor_devfreq_record_busy(struct panthor_device *ptdev)
+{
+	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
+	unsigned long irqflags;
+
+	if (!pdevfreq->devfreq)
+		return;
+
+	spin_lock_irqsave(&pdevfreq->lock, irqflags);
+
+	panthor_devfreq_update_utilization(pdevfreq);
+	pdevfreq->last_busy_state = true;
+
+	spin_unlock_irqrestore(&pdevfreq->lock, irqflags);
+}
+
+void panthor_devfreq_record_idle(struct panthor_device *ptdev)
+{
+	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
+	unsigned long irqflags;
+
+	if (!pdevfreq->devfreq)
+		return;
+
+	spin_lock_irqsave(&pdevfreq->lock, irqflags);
+
+	panthor_devfreq_update_utilization(pdevfreq);
+	pdevfreq->last_busy_state = false;
+
+	spin_unlock_irqrestore(&pdevfreq->lock, irqflags);
+}
diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
new file mode 100644
index 000000000000..875fbb5a1c1b
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2019 Collabora ltd. */
+
+#ifndef __PANTHOR_DEVFREQ_H__
+#define __PANTHOR_DEVFREQ_H__
+
+#include <linux/devfreq.h>
+#include <linux/spinlock.h>
+#include <linux/ktime.h>
+
+struct devfreq;
+struct thermal_cooling_device;
+
+struct panthor_device;
+struct panthor_devfreq;
+
+int panthor_devfreq_init(struct panthor_device *ptdev);
+
+int panthor_devfreq_resume(struct panthor_device *ptdev);
+int panthor_devfreq_suspend(struct panthor_device *ptdev);
+
+void panthor_devfreq_record_busy(struct panthor_device *ptdev);
+void panthor_devfreq_record_idle(struct panthor_device *ptdev);
+
+#endif /* __PANTHOR_DEVFREQ_H__ */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (6 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 07/15] drm/panthor: Add the devfreq " Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-14 15:53   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 09/15] drm/panthor: Add the FW " Boris Brezillon
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

MMU and VM management are closely related, which is why both live in the
same source file.

Page table updates are delegated to the io-pgtable-arm driver that's in
the iommu subsystem.

The VM management logic is based on drm_gpuva_mgr, and assumes the
VA space is mostly managed by the usermode driver, except for a reserved
portion of this VA space that's used for kernel objects (like the heap
contexts/chunks).

Both asynchronous and synchronous VM operations are supported, and
internal helpers are exposed to allow other logical blocks to map their
buffers in the GPU VA space.

There's one VM_BIND queue per-VM (meaning the Vulkan driver can only
expose one sparse-binding queue), and this bind queue is managed with
a 1:1 drm_sched_entity:drm_gpu_scheduler, such that each VM gets its own
independent execution queue, avoiding VM operation serialization at the
device level (things are still serialized at the VM level).

The rest is just implementation details that are hopefully well explained
in the documentation.
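
For reference, a synchronous MAP operation roughly flows through the
helpers defined in panthor_mmu.c like this (locking and error handling
omitted; the exact call sites may differ slightly in the actual code):

  struct panthor_vm_op_ctx op_ctx;

  /* Pre-allocate page tables, VMA objects and the sg-table upfront. */
  panthor_vm_prepare_map_op_ctx(&op_ctx, vm, bo, offset, size, va, flags);

  /* Walk the drm_gpuva state machine, which calls back into our
   * sm_step_{map,remap,unmap}() hooks to update the io-pgtable and
   * the VA tree.
   */
  panthor_vm_exec_op(vm, &op_ctx, false);

  /* Return unused pre-allocated page tables to the cache and release
   * the VMAs that were replaced/unmapped.
   */
  panthor_vm_cleanup_op_ctx(&op_ctx, vm);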

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_gpuva_mgr
- Replace VM_MAP/UNMAP by VM_BIND
- Add support for asynchronous VM_BIND (VM_BIND queue implemented with
  drm_sched)
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Use the panthor_irq layer to manage/process IRQs

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_mmu.c | 2611 +++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_mmu.h |   81 +
 2 files changed, 2692 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h

diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
new file mode 100644
index 000000000000..3ba784473023
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -0,0 +1,2611 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#include <drm/drm_debugfs.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_exec.h>
+#include <drm/drm_gpuva_mgr.h>
+#include <drm/drm_managed.h>
+#include <drm/gpu_scheduler.h>
+#include <drm/panthor_drm.h>
+
+#include <linux/atomic.h>
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/io-pgtable.h>
+#include <linux/iommu.h>
+#include <linux/kmemleak.h>
+#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
+#include <linux/rwsem.h>
+#include <linux/shmem_fs.h>
+#include <linux/sizes.h>
+
+#include "panthor_device.h"
+#include "panthor_heap.h"
+#include "panthor_mmu.h"
+#include "panthor_sched.h"
+#include "panthor_gem.h"
+#include "panthor_regs.h"
+
+#define MAX_AS_SLOTS			32
+
+struct panthor_vm;
+
+/**
+ * struct panthor_as_slot - Address space slot
+ */
+struct panthor_as_slot {
+	/** @vm: VM bound to this slot. NULL if no VM is bound. */
+	struct panthor_vm *vm;
+
+	/** @lock: Lock used to serialize access to the AS registers. */
+	spinlock_t lock;
+};
+
+/**
+ * struct panthor_mmu - MMU related data
+ */
+struct panthor_mmu {
+	/** @irq: The MMU irq. */
+	struct panthor_irq irq;
+
+	/** @as: Address space related fields.
+	 *
+	 * The GPU has a limited number of address space (AS) slots, forcing
+	 * us to re-assign slots on demand.
+	 */
+	struct {
+		/** @slots_lock: Lock protecting access to all other AS fields. */
+		struct mutex slots_lock;
+
+		/** @alloc_mask: Bitmask encoding the allocated slots. */
+		unsigned long alloc_mask;
+
+		/** @faulty_mask: Bitmask encoding the faulty slots. */
+		unsigned long faulty_mask;
+
+		/** @slots: VMs currently bound to the AS slots. */
+		struct panthor_as_slot slots[MAX_AS_SLOTS];
+
+		/**
+		 * @lru_list: List of least recently used VMs.
+		 *
+		 * We use this list to pick a VM to evict when all slots are
+		 * used.
+		 *
+		 * There should be no more active VMs than there are AS slots,
+		 * so this LRU is just here to keep VMs bound until there's
+		 * a need to release a slot, thus avoiding unnecessary TLB/cache
+		 * flushes.
+		 */
+		struct list_head lru_list;
+	} as;
+
+	/** @vm: VMs management fields */
+	struct {
+		/** @lock: Lock protecting access to @list. */
+		struct mutex lock;
+
+		/** @list: List containing all VMs. */
+		struct list_head list;
+
+		/** @reset_in_progress: True if a reset is in progress. */
+		bool reset_in_progress;
+
+		/** @wq: Workqueue used for the VM_BIND queues. */
+		struct workqueue_struct *wq;
+	} vm;
+};
+
+/**
+ * struct panthor_vm_pool - VM pool object
+ */
+struct panthor_vm_pool {
+	/** @xa: Array used for VM handle tracking. */
+	struct xarray xa;
+};
+
+/**
+ * struct panthor_vma - GPU mapping object
+ *
+ * This is used to track GEM mappings in GPU space.
+ */
+struct panthor_vma {
+	/** @base: Inherits from drm_gpuva. */
+	struct drm_gpuva base;
+
+	/** @node: Used to insert the mapping in the panthor_vm::shared_bos list. */
+	struct list_head node;
+
+	/**
+	 * @flags: Combination of drm_panthor_vm_bind_op_flags.
+	 *
+	 * Only map related flags are accepted.
+	 */
+	u32 flags;
+};
+
+/**
+ * struct panthor_vm_op_ctx - VM operation context
+ *
+ * With VM operations potentially taking place in a dma-signaling path, we
+ * need to make sure everything that might require resource allocation is
+ * pre-allocated upfront. This is what this operation context is for.
+ *
+ * We also collect resources that have been freed, so we can release them
+ * asynchronously, and let the VM_BIND scheduler process the next VM_BIND
+ * request.
+ */
+struct panthor_vm_op_ctx {
+	/** @rsvd_page_tables: Pages reserved for the MMU page table update. */
+	struct {
+		/** @count: Number of pages reserved. */
+		u32 count;
+
+		/** @ptr: Index of the first unused page in the @pages table. */
+		u32 ptr;
+
+		/**
+		 * @pages: Array of pages that can be used for an MMU page table update.
+		 *
+		 * After a VM operation, there might be free pages left in this array.
+		 * They should be returned to the pt_cache as part of the op_ctx cleanup.
+		 */
+		void **pages;
+	} rsvd_page_tables;
+
+	/** @flags: Combination of drm_panthor_vm_bind_op_flags. */
+	u32 flags;
+
+	/** @va: Virtual range targeted by the VM operation. */
+	struct {
+		/** @addr: Start address. */
+		u64 addr;
+
+		/** @range: Range size. */
+		u64 range;
+	} va;
+
+	/**
+	 * @returned_vmas: List of panthor_vma objects returned after a VM operation.
+	 *
+	 * For unmap operations, this will contain all VMAs that were covered by the
+	 * specified VA range.
+	 *
+	 * For map operations, this will contain all VMAs that previously mapped to
+	 * the specified VA range.
+	 *
+	 * Those VMAs, and the resources they point to will be released as part of
+	 * the op_ctx cleanup operation.
+	 */
+	struct list_head returned_vmas;
+
+	/** @map: Fields specific to a map operation. */
+	struct {
+		/** @gem: GEM object information. */
+		struct {
+			/** @obj: GEM object to map. */
+			struct drm_gem_object *obj;
+
+			/** @offset: Offset in the GEM object. */
+			u64 offset;
+		} gem;
+
+		/**
+		 * @sgt: sg-table pointing to pages backing the GEM object.
+		 *
+		 * This is gathered at job creation time, such that we don't have
+		 * to allocate in ::run_job().
+		 */
+		struct sg_table *sgt;
+
+		/**
+		 * @prev_vma: Pre-allocated VMA object to deal with a remap situation.
+		 *
+		 * If the map request covers a region that's inside another VMA, the
+		 * previous VMA will be split, requiring instantiation of a maximum of
+		 * two new VMA objects.
+		 */
+		struct panthor_vma *prev_vma;
+
+		/**
+		 * @new_vma: The new VMA object that will be inserted to the VA tree.
+		 */
+		struct panthor_vma *new_vma;
+
+		/**
+		 * @next_vma: Pre-allocated VMA object to deal with a remap situation.
+		 *
+		 * See @prev_vma.
+		 */
+		struct panthor_vma *next_vma;
+	} map;
+};
+
+/**
+ * struct panthor_vm - VM object
+ *
+ * A VM is an object representing a GPU (or MCU) virtual address space.
+ * It embeds the MMU page table for this address space, a tree containing
+ * all the virtual mappings of GEM objects, and other things needed to manage
+ * the VM.
+ *
+ * Except for the MCU VM, which is managed by the kernel, all other VMs are
+ * created by userspace and mostly managed by userspace, using the
+ * %DRM_IOCTL_PANTHOR_VM_BIND ioctl.
+ *
+ * A portion of the virtual address space is reserved for kernel objects,
+ * like heap chunks, and userspace gets to decide how much of the virtual
+ * address space is left to the kernel (half of the virtual address space
+ * by default).
+ */
+struct panthor_vm {
+	/**
+	 * @va_mgr: GPU VA manager.
+	 *
+	 * We delegate all the VA management to the common drm_gpuva_mgr framework
+	 * and only implement hooks to update the MMU page table.
+	 */
+	struct drm_gpuva_manager va_mgr;
+
+	/**
+	 * @sched: Scheduler used for asynchronous VM_BIND request.
+	 *
+	 * We use a 1:1 scheduler here.
+	 */
+	struct drm_gpu_scheduler sched;
+
+	/**
+	 * @entity: Scheduling entity representing the VM_BIND queue.
+	 *
+	 * There's currently one bind queue per VM. It doesn't make sense to
+	 * allow more given the VM operations are serialized anyway.
+	 */
+	struct drm_sched_entity entity;
+
+	/** @ptdev: Device. */
+	struct panthor_device *ptdev;
+
+	/** @refcount: Reference count. */
+	struct kref refcount;
+
+	/** @memattr: Value to program to the AS_MEMATTR register. */
+	u64 memattr;
+
+	/** @pgtbl_ops: Page table operations. */
+	struct io_pgtable_ops *pgtbl_ops;
+
+	/**
+	 * @dummy_gem: Used as a VM reservation object.
+	 *
+	 * We declare a drm_gem_object and not a dma_resv, so we can use drm_exec()
+	 * for the VM reservation.
+	 *
+	 * All private BOs use the resv of this dummy GEM object instead of
+	 * drm_gem_object::_resv, such that private GEM preparation is O(1)
+	 * instead of O(N).
+	 */
+	struct drm_gem_object dummy_gem;
+
+	/**
+	 * @op_lock: Lock used to serialize operations on a VM.
+	 *
+	 * The serialization of jobs queued to the VM_BIND queue is already
+	 * taken care of by drm_sched, but we need to serialize synchronous
+	 * and asynchronous VM_BIND requests. This is what this lock is for.
+	 */
+	struct mutex op_lock;
+
+	/**
+	 * @op_ctx: The context attached to the currently executing VM operation.
+	 *
+	 * NULL when no operation is in progress.
+	 */
+	struct panthor_vm_op_ctx *op_ctx;
+
+	/**
+	 * @shared_bos: List of shared BOs.
+	 *
+	 * Shared BOs don't use the VM resv, and need to be prepared
+	 * independently. This list keeps track of all VMAs that target
+	 * non-private BOs.
+	 *
+	 * There might be duplicates, but drm_exec and dma_resv should
+	 * handle that for us.
+	 *
+	 * TODO: This is not optimal. We should probably switch to the
+	 * drm_gpuva_mgr solution for handling shared BOs once it's
+	 * ready.
+	 */
+	struct list_head shared_bos;
+
+	/**
+	 * @mm: Memory management object representing the auto-VA/kernel-VA.
+	 *
+	 * Used to auto-allocate VA space for kernel-managed objects (tiler
+	 * heaps, ...).
+	 *
+	 * For the MCU VM, this is managing the VA range that's used to map
+	 * all shared interfaces.
+	 *
+	 * For user VMs, the range is specified by userspace, and must not
+	 * exceed half of the VA space addressable.
+	 */
+	struct drm_mm mm;
+
+	/** @mm_lock: Lock protecting the @mm field. */
+	struct mutex mm_lock;
+
+	/** @as: Address space related fields. */
+	struct {
+		/**
+		 * @id: ID of the address space this VM is bound to.
+		 *
+		 * A value of -1 means the VM is inactive/not bound.
+		 */
+		int id;
+
+		/**
+		 * @lru_node: Used to insert the VM in the panthor_mmu::as::lru_list.
+		 *
+		 * Active VMs should not be inserted in the LRU list.
+		 */
+		struct list_head lru_node;
+	} as;
+
+	/**
+	 * @heaps: Tiler heap related fields.
+	 */
+	struct {
+		/**
+		 * @pool: The heap pool attached to this VM.
+		 *
+		 * Will stay NULL until someone creates a heap context on this VM.
+		 */
+		struct panthor_heap_pool *pool;
+
+		/** @lock: Lock used to protect access to @pool. */
+		struct mutex lock;
+	} heaps;
+
+	/** @node: Used to insert the VM in the panthor_mmu::vm::list. */
+	struct list_head node;
+
+	/** @for_mcu: True if this is the MCU VM. */
+	bool for_mcu;
+
+	/**
+	 * @destroyed: True if the VM was destroyed.
+	 *
+	 * No further bind requests should be queued to a destroyed VM.
+	 */
+	bool destroyed;
+
+	/**
+	 * @unusable: True if the VM has turned unusable because something
+	 * bad happened during an asynchronous request.
+	 *
+	 * We don't try to recover from such failures, because this implies
+	 * informing userspace about the specific operation that failed, and
+	 * hoping the userspace driver can replay things from there. This all
+	 * sounds very complicated for little gain.
+	 *
+	 * Instead, we should just flag the VM as unusable, and fail any
+	 * further request targeting this VM.
+	 *
+	 * We also provide a way to query a VM state, so userspace can destroy
+	 * it and create a new one.
+	 *
+	 * As an analogy, this would be mapped to a VK_ERROR_DEVICE_LOST
+	 * situation, where the logical device needs to be re-created.
+	 */
+	bool unusable;
+};
+
+/**
+ * struct panthor_vm_bind_job - VM bind job
+ */
+struct panthor_vm_bind_job {
+	/** @base: Inherits from drm_sched_job. */
+	struct drm_sched_job base;
+
+	/** @refcount: Reference count. */
+	struct kref refcount;
+
+	/** @cleanup_op_ctx_work: Work used to cleanup the VM operation context. */
+	struct work_struct cleanup_op_ctx_work;
+
+	/** @vm: VM targeted by the VM operation. */
+	struct panthor_vm *vm;
+
+	/** @ctx: Operation context. */
+	struct panthor_vm_op_ctx ctx;
+};
+
+/**
+ * @pt_cache: Cache used to allocate MMU page tables.
+ *
+ * The pre-allocation pattern forces us to over-allocate to plan for
+ * the worst case scenario, and return the pages we didn't use.
+ *
+ * Having a kmem_cache allows us to speed up allocations.
+ */
+static struct kmem_cache *pt_cache;
+
+/**
+ * alloc_pt() - Custom page table allocator
+ * @cookie: Cookie passed at page table allocation time.
+ * @size: Size of the page table. This size should be fixed,
+ * and determined at creation time based on the granule size.
+ * @gfp: GFP flags.
+ *
+ * We want a custom allocator so we can use a cache for page table
+ * allocations and amortize the cost of the over-reservation that's
+ * done to allow asynchronous VM operations.
+ *
+ * Return: non-NULL on success, NULL if the allocation failed for any
+ * reason.
+ */
+static void *alloc_pt(void *cookie, size_t size, gfp_t gfp)
+{
+	struct panthor_vm *vm = cookie;
+	void *page;
+
+	/* We're not supposed to have anything bigger than 4k here, because we picked a
+	 * 4k granule size at init time.
+	 */
+	if (drm_WARN_ON(&vm->ptdev->base, size != SZ_4K))
+		return NULL;
+
+	/* Allocation of the root page table happening during init. */
+	if (!vm->pgtbl_ops) {
+		drm_WARN_ON(&vm->ptdev->base, vm->op_ctx);
+		page = kmem_cache_alloc(pt_cache, gfp);
+		goto out;
+	}
+
+	/* We must have some op_ctx attached to the VM and it must have at least one
+	 * free page.
+	 */
+	if (drm_WARN_ON(&vm->ptdev->base, !vm->op_ctx) ||
+	    drm_WARN_ON(&vm->ptdev->base,
+			vm->op_ctx->rsvd_page_tables.ptr >= vm->op_ctx->rsvd_page_tables.count))
+		return NULL;
+
+	page = vm->op_ctx->rsvd_page_tables.pages[vm->op_ctx->rsvd_page_tables.ptr++];
+	memset(page, 0, SZ_4K);
+
+out:
+	/* Page table entries don't use virtual addresses, which trips out
+	 * kmemleak. kmemleak_alloc_phys() might work, but physical addresses
+	 * are mixed with other fields, and I fear kmemleak won't detect that
+	 * either.
+	 *
+	 * Let's just ignore memory passed to the page-table driver for now.
+	 */
+	kmemleak_ignore(page);
+	return page;
+}
+
+/**
+ * free_pt() - Custom page table free function
+ * @cookie: Cookie passed at page table allocation time.
+ * @data: Page table to free.
+ * @size: Size of the page table. This size should be fixed,
+ * and determined at creation time based on the granule size.
+ */
+static void free_pt(void *cookie, void *data, size_t size)
+{
+	struct panthor_vm *vm = cookie;
+
+	if (drm_WARN_ON(&vm->ptdev->base, size != SZ_4K))
+		return;
+
+	/* Return the page to the pt_cache. */
+	kmem_cache_free(pt_cache, data);
+}
+
+static int wait_ready(struct panthor_device *ptdev, u32 as_nr)
+{
+	int ret;
+	u32 val;
+
+	/* Wait for the MMU status to indicate there is no active command, in
+	 * case one is pending.
+	 */
+	ret = readl_relaxed_poll_timeout_atomic(ptdev->iomem + AS_STATUS(as_nr),
+						val, !(val & AS_STATUS_AS_ACTIVE),
+						10, 100000);
+
+	if (ret) {
+		panthor_device_schedule_reset(ptdev);
+		drm_err(&ptdev->base, "AS_ACTIVE bit stuck\n");
+	}
+
+	return ret;
+}
+
+static int write_cmd(struct panthor_device *ptdev, u32 as_nr, u32 cmd)
+{
+	int status;
+
+	/* write AS_COMMAND when MMU is ready to accept another command */
+	status = wait_ready(ptdev, as_nr);
+	if (!status)
+		gpu_write(ptdev, AS_COMMAND(as_nr), cmd);
+
+	return status;
+}
+
+static void lock_region(struct panthor_device *ptdev, u32 as_nr,
+			u64 region_start, u64 size)
+{
+	u8 region_width;
+	u64 region;
+	u64 region_end = region_start + size;
+
+	if (!size)
+		return;
+
+	/*
+	 * The locked region is a naturally aligned power of 2 block encoded as
+	 * its log2 minus 1.
+	 * Calculate the desired start/end and look for the highest bit which
+	 * differs. The smallest naturally aligned block must include this bit
+	 * change, the desired region starts with this bit (and subsequent bits)
+	 * zeroed and ends with the bit (and subsequent bits) set to one.
+	 */
+	region_width = max(fls64(region_start ^ (region_end - 1)),
+			   const_ilog2(AS_LOCK_REGION_MIN_SIZE)) - 1;
+
+	/*
+	 * Mask off the low bits of region_start (which would be ignored by
+	 * the hardware anyway)
+	 */
+	region_start &= GENMASK_ULL(63, region_width);
+
+	region = region_width | region_start;
+
+	/* Lock the region that needs to be updated */
+	gpu_write(ptdev, AS_LOCKADDR_LO(as_nr), lower_32_bits(region));
+	gpu_write(ptdev, AS_LOCKADDR_HI(as_nr), upper_32_bits(region));
+	write_cmd(ptdev, as_nr, AS_COMMAND_LOCK);
+}
+
+static int mmu_hw_do_operation_locked(struct panthor_device *ptdev, int as_nr,
+				      u64 iova, u64 size, u32 op)
+{
+	if (as_nr < 0)
+		return 0;
+
+	if (op != AS_COMMAND_UNLOCK)
+		lock_region(ptdev, as_nr, iova, size);
+
+	/* Run the MMU operation */
+	write_cmd(ptdev, as_nr, op);
+
+	/* Wait for the flush to complete */
+	return wait_ready(ptdev, as_nr);
+}
+
+static int mmu_hw_do_operation(struct panthor_vm *vm,
+			       u64 iova, u64 size, u32 op)
+{
+	struct panthor_device *ptdev = vm->ptdev;
+	int ret;
+
+	spin_lock(&ptdev->mmu->as.slots[vm->as.id].lock);
+	ret = mmu_hw_do_operation_locked(ptdev, vm->as.id, iova, size, op);
+	spin_unlock(&ptdev->mmu->as.slots[vm->as.id].lock);
+	return ret;
+}
+
+static int panthor_mmu_as_enable(struct panthor_device *ptdev, u32 as_nr,
+				 u64 transtab, u64 transcfg, u64 memattr)
+{
+	int ret;
+
+	ret = mmu_hw_do_operation_locked(ptdev, as_nr, 0, ~0ULL, AS_COMMAND_FLUSH_MEM);
+	if (ret)
+		return ret;
+
+	gpu_write(ptdev, AS_TRANSTAB_LO(as_nr), lower_32_bits(transtab));
+	gpu_write(ptdev, AS_TRANSTAB_HI(as_nr), upper_32_bits(transtab));
+
+	gpu_write(ptdev, AS_MEMATTR_LO(as_nr), lower_32_bits(memattr));
+	gpu_write(ptdev, AS_MEMATTR_HI(as_nr), upper_32_bits(memattr));
+
+	gpu_write(ptdev, AS_TRANSCFG_LO(as_nr), lower_32_bits(transcfg));
+	gpu_write(ptdev, AS_TRANSCFG_HI(as_nr), upper_32_bits(transcfg));
+
+	return write_cmd(ptdev, as_nr, AS_COMMAND_UPDATE);
+}
+
+static int panthor_mmu_as_disable(struct panthor_device *ptdev, u32 as_nr)
+{
+	int ret;
+
+	ret = mmu_hw_do_operation_locked(ptdev, as_nr, 0, ~0ULL, AS_COMMAND_FLUSH_MEM);
+	if (ret)
+		return ret;
+
+	gpu_write(ptdev, AS_TRANSTAB_LO(as_nr), 0);
+	gpu_write(ptdev, AS_TRANSTAB_HI(as_nr), 0);
+
+	gpu_write(ptdev, AS_MEMATTR_LO(as_nr), 0);
+	gpu_write(ptdev, AS_MEMATTR_HI(as_nr), 0);
+
+	gpu_write(ptdev, AS_TRANSCFG_LO(as_nr), AS_TRANSCFG_ADRMODE_UNMAPPED);
+	gpu_write(ptdev, AS_TRANSCFG_HI(as_nr), 0);
+
+	return write_cmd(ptdev, as_nr, AS_COMMAND_UPDATE);
+}
+
+static u32 panthor_mmu_fault_mask(struct panthor_device *ptdev, u32 value)
+{
+	/* Bits 16 to 31 mean REQ_COMPLETE. */
+	return value & GENMASK(15, 0);
+}
+
+static u32 panthor_mmu_as_fault_mask(struct panthor_device *ptdev, u32 as)
+{
+	return BIT(as);
+}
+
+/**
+ * panthor_vm_active() - Flag a VM as active
+ * @vm: VM to flag as active.
+ *
+ * Assigns an address space to a VM so it can be used by the GPU/MCU.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_active(struct panthor_vm *vm)
+{
+	struct panthor_device *ptdev = vm->ptdev;
+	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg;
+	int ret = 0, as, cookie;
+	u64 transtab, transcfg;
+
+	if (!drm_dev_enter(&ptdev->base, &cookie))
+		return -ENODEV;
+
+	mutex_lock(&ptdev->mmu->as.slots_lock);
+
+	as = vm->as.id;
+	if (as >= 0) {
+		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
+
+		if (ptdev->mmu->as.faulty_mask & mask) {
+			/* Unhandled pagefault on this AS, the MMU was
+			 * disabled. We need to re-enable the MMU after
+			 * clearing+unmasking the AS interrupts.
+			 */
+			gpu_write(ptdev, MMU_INT_CLEAR, mask);
+			ptdev->mmu->as.faulty_mask &= ~mask;
+			gpu_write(ptdev, MMU_INT_MASK, ~ptdev->mmu->as.faulty_mask);
+			goto out_enable_as;
+		}
+
+		goto out_unlock;
+	}
+
+	/* Check for a free AS */
+	if (vm->for_mcu) {
+		drm_WARN_ON(&ptdev->base, ptdev->mmu->as.alloc_mask & BIT(0));
+		as = 0;
+	} else {
+		as = ffz(ptdev->mmu->as.alloc_mask | BIT(0));
+	}
+
+	if (!(BIT(as) & ptdev->gpu_info.as_present)) {
+		struct panthor_vm *lru_vm;
+
+		lru_vm = list_first_entry_or_null(&ptdev->mmu->as.lru_list,
+						  struct panthor_vm,
+						  as.lru_node);
+		if (drm_WARN_ON(&ptdev->base, !lru_vm)) {
+			ret = -EBUSY;
+			goto out_unlock;
+		}
+
+		list_del_init(&lru_vm->as.lru_node);
+		as = lru_vm->as.id;
+	} else {
+		set_bit(as, &ptdev->mmu->as.alloc_mask);
+	}
+
+	/* Assign the free or reclaimed AS to the FD */
+	vm->as.id = as;
+	ptdev->mmu->as.slots[as].vm = vm;
+
+out_enable_as:
+	transtab = cfg->arm_lpae_s1_cfg.ttbr;
+	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
+		   AS_TRANSCFG_PTW_RA |
+		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
+	if (ptdev->coherent)
+		transcfg |= AS_TRANSCFG_PTW_SH_OS;
+
+	ret = panthor_mmu_as_enable(vm->ptdev, vm->as.id, transtab, transcfg, vm->memattr);
+
+out_unlock:
+	mutex_unlock(&ptdev->mmu->as.slots_lock);
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+/**
+ * panthor_vm_idle() - Flag a VM idle
+ * @vm: VM to flag as idle.
+ *
+ * When we know the GPU is done with the VM (no more jobs to process),
+ * we can relinquish the AS slot attached to this VM, if any.
+ *
+ * We don't release the slot immediately, but instead place the VM in
+ * the LRU list, so it can be evicted if another VM needs an AS slot.
+ * This way, VMs stay attached to the AS they were given until we run
+ * out of free slots, limiting the number of MMU operations (TLB flush
+ * and other AS updates).
+ */
+void panthor_vm_idle(struct panthor_vm *vm)
+{
+	struct panthor_device *ptdev = vm->ptdev;
+
+	mutex_lock(&ptdev->mmu->as.slots_lock);
+	if (vm->as.id >= 0 && list_empty(&vm->as.lru_node))
+		list_add_tail(&vm->as.lru_node, &ptdev->mmu->as.lru_list);
+	mutex_unlock(&ptdev->mmu->as.slots_lock);
+}
+
+static void panthor_vm_stop(struct panthor_vm *vm)
+{
+	drm_sched_stop(&vm->sched, NULL);
+}
+
+static void panthor_vm_start(struct panthor_vm *vm)
+{
+	drm_sched_start(&vm->sched, true);
+}
+
+/**
+ * panthor_vm_as() - Get the AS slot attached to a VM
+ * @vm: VM to get the AS slot of.
+ *
+ * Return: -1 if the VM is not assigned an AS slot yet, >= 0 otherwise.
+ */
+int panthor_vm_as(struct panthor_vm *vm)
+{
+	return vm->as.id;
+}
+
+static size_t get_pgsize(u64 addr, size_t size, size_t *count)
+{
+	/*
+	 * io-pgtable only operates on multiple pages within a single table
+	 * entry, so we need to split at boundaries of the table size, i.e.
+	 * the next block size up. The distance from address A to the next
+	 * boundary of block size B is logically B - A % B, but in unsigned
+	 * two's complement where B is a power of two we get the equivalence
+	 * B - A % B == (B - A) % B == (n * B - A) % B, and choose n = 0 :)
+	 */
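+	/*
+	 * For example, walking 4M starting at 0x201000 takes three
+	 * iterations: the first call returns SZ_4K with *count = 511
+	 * (4K pages up to the next 2M boundary at 0x400000), the second
+	 * returns SZ_2M with *count = 1, and the last returns SZ_4K with
+	 * *count = 1 for the remaining 4K at 0x600000.
+	 */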
+	size_t blk_offset = -addr % SZ_2M;
+
+	if (blk_offset || size < SZ_2M) {
+		*count = min_not_zero(blk_offset, size) / SZ_4K;
+		return SZ_4K;
+	}
+	blk_offset = -addr % SZ_1G ?: SZ_1G;
+	*count = min(blk_offset, size) / SZ_2M;
+	return SZ_2M;
+}
+
+static int panthor_vm_flush_range(struct panthor_vm *vm, u64 iova, u64 size)
+{
+	struct panthor_device *ptdev = vm->ptdev;
+	int ret = 0, cookie;
+
+	if (vm->as.id < 0)
+		return 0;
+
+	/* If the device is unplugged, we just silently skip the flush. */
+	if (!drm_dev_enter(&ptdev->base, &cookie))
+		return 0;
+
+	/* Flush the PTs only if we're already awake */
+	if (pm_runtime_active(ptdev->base.dev))
+		ret = mmu_hw_do_operation(vm, iova, size, AS_COMMAND_FLUSH_PT);
+
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+static int panthor_vm_unmap_pages(struct panthor_vm *vm, u64 iova, size_t size)
+{
+	struct panthor_device *ptdev = vm->ptdev;
+	struct io_pgtable_ops *ops = vm->pgtbl_ops;
+	size_t offset = 0;
+
+	drm_dbg(&ptdev->base, "unmap: as=%d, iova=%llx, len=%zx", vm->as.id, iova, size);
+
+	while (offset < size) {
+		size_t unmapped_sz = 0, pgcount;
+		size_t pgsize = get_pgsize(iova + offset, size - offset, &pgcount);
+
+		unmapped_sz = ops->unmap_pages(ops, iova + offset, pgsize, pgcount, NULL);
+
+		if (drm_WARN_ON(&ptdev->base, unmapped_sz != pgsize * pgcount)) {
+			drm_err(&ptdev->base, "failed to unmap range %llx-%llx (requested range %llx-%llx)\n",
+				iova + offset + unmapped_sz,
+				iova + offset + pgsize * pgcount,
+				iova, iova + size);
+			panthor_vm_flush_range(vm, iova, offset + unmapped_sz);
+			return -EINVAL;
+		}
+		offset += unmapped_sz;
+	}
+
+	return panthor_vm_flush_range(vm, iova, size);
+}
+
+static int
+panthor_vm_map_pages(struct panthor_vm *vm, u64 iova, int prot,
+		     struct sg_table *sgt, u64 offset, ssize_t size)
+{
+	struct panthor_device *ptdev = vm->ptdev;
+	unsigned int count;
+	struct scatterlist *sgl;
+	struct io_pgtable_ops *ops = vm->pgtbl_ops;
+	u64 start_iova = iova;
+	int ret;
+
+	if (!size)
+		return 0;
+
+	for_each_sgtable_dma_sg(sgt, sgl, count) {
+		dma_addr_t paddr = sg_dma_address(sgl);
+		size_t len = sg_dma_len(sgl);
+
+		if (len <= offset) {
+			offset -= len;
+			continue;
+		}
+
+		paddr -= offset;
+		len -= offset;
+
+		if (size >= 0) {
+			len = min_t(size_t, len, size);
+			size -= len;
+		}
+
+		drm_dbg(&ptdev->base, "map: as=%d, iova=%llx, paddr=%llx, len=%zx",
+			vm->as.id, iova, paddr, len);
+
+		while (len) {
+			size_t pgcount, mapped = 0;
+			size_t pgsize = get_pgsize(iova | paddr, len, &pgcount);
+
+			ret = ops->map_pages(ops, iova, paddr, pgsize, pgcount, prot,
+					     GFP_KERNEL, &mapped);
+			iova += mapped;
+			paddr += mapped;
+			len -= mapped;
+
+			if (drm_WARN_ON(&ptdev->base, !ret && !mapped))
+				ret = -ENOMEM;
+
+			if (ret) {
+				/* If something failed, unmap what we've already mapped before
+				 * returning. The unmap call is not supposed to fail.
+				 */
+				drm_WARN_ON(&ptdev->base,
+					    panthor_vm_unmap_pages(vm, start_iova,
+								   iova - start_iova));
+				return ret;
+			}
+		}
+
+		if (!size)
+			break;
+	}
+
+	return panthor_vm_flush_range(vm, start_iova, iova - start_iova);
+}
+
+static int flags_to_prot(u32 flags)
+{
+	int prot = 0;
+
+	if (flags & DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC)
+		prot |= IOMMU_NOEXEC;
+
+	if (!(flags & DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED))
+		prot |= IOMMU_CACHE;
+
+	if (flags & DRM_PANTHOR_VM_BIND_OP_MAP_READONLY)
+		prot |= IOMMU_READ;
+	else
+		prot |= IOMMU_READ | IOMMU_WRITE;
+
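+	/*
+	 * E.g. a READONLY | NOEXEC mapping thus translates to
+	 * IOMMU_READ | IOMMU_NOEXEC | IOMMU_CACHE.
+	 */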
+	return prot;
+}
+
+/**
+ * panthor_vm_alloc_va() - Allocate a region in the auto-va space
+ * @vm: VM to allocate a region on.
+ * @size: Size of the region.
+ *
+ * Some GPU objects, like heap chunks, are fully managed by the kernel and
+ * need to be mapped to the userspace VM, in the region reserved for kernel
+ * objects.
+ *
+ * This function takes care of allocating a region in this reserved space.
+ *
+ * Return: A valid pointer on success, and ERR_PTR() otherwise.
+ */
+struct drm_mm_node *
+panthor_vm_alloc_va(struct panthor_vm *vm, size_t size)
+{
+	struct drm_mm_node *mm_node;
+	int ret;
+
+	if (!size || (size & ~PAGE_MASK))
+		return ERR_PTR(-EINVAL);
+
+	mm_node = kzalloc(sizeof(*mm_node), GFP_KERNEL);
+	if (!mm_node)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&vm->mm_lock);
+	ret = drm_mm_insert_node(&vm->mm, mm_node, size);
+	mutex_unlock(&vm->mm_lock);
+
+	if (ret) {
+		kfree(mm_node);
+		return ERR_PTR(ret);
+	}
+
+	return mm_node;
+}
+
+/**
+ * panthor_vm_free_va() - Free a region allocated with panthor_vm_alloc_va()
+ * @vm: VM to free the region on.
+ * @mm_node: Memory node representing the region to free.
+ */
+void panthor_vm_free_va(struct panthor_vm *vm, struct drm_mm_node *mm_node)
+{
+	if (!mm_node)
+		return;
+
+	mutex_lock(&vm->mm_lock);
+	drm_mm_remove_node(mm_node);
+	mutex_unlock(&vm->mm_lock);
+
+	kfree(mm_node);
+}
+
+static void panthor_vm_cleanup_op_ctx(struct panthor_vm_op_ctx *op_ctx,
+				      struct panthor_vm *vm)
+{
+	struct panthor_vma *vma, *tmp_vma;
+
+	u32 remaining_pt_count = op_ctx->rsvd_page_tables.count -
+				 op_ctx->rsvd_page_tables.ptr;
+
+	if (remaining_pt_count) {
+		kmem_cache_free_bulk(pt_cache, remaining_pt_count,
+				     op_ctx->rsvd_page_tables.pages +
+				     op_ctx->rsvd_page_tables.ptr);
+	}
+
+	kfree(op_ctx->rsvd_page_tables.pages);
+	memset(&op_ctx->rsvd_page_tables, 0, sizeof(op_ctx->rsvd_page_tables));
+
+	if (op_ctx->map.gem.obj) {
+		struct panthor_gem_object *bo = to_panthor_bo(op_ctx->map.gem.obj);
+
+		if (!bo->base.base.import_attach)
+			drm_gem_shmem_unpin(&bo->base);
+
+		drm_gem_object_put(&bo->base.base);
+	}
+
+	kfree(op_ctx->map.new_vma);
+	kfree(op_ctx->map.next_vma);
+	kfree(op_ctx->map.prev_vma);
+	memset(&op_ctx->map, 0, sizeof(op_ctx->map));
+
+	list_for_each_entry_safe(vma, tmp_vma, &op_ctx->returned_vmas, node) {
+		struct panthor_gem_object *bo = to_panthor_bo(vma->base.gem.obj);
+
+		if (!bo->base.base.import_attach)
+			drm_gem_shmem_unpin(&bo->base);
+
+		drm_gem_object_put(&bo->base.base);
+		list_del(&vma->node);
+		kfree(vma);
+	}
+}
+
+#define PANTHOR_VM_BIND_OP_MAP_FLAGS \
+	(DRM_PANTHOR_VM_BIND_OP_MAP_READONLY | \
+	 DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | \
+	 DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED | \
+	 DRM_PANTHOR_VM_BIND_OP_TYPE_MASK)
+
+static int panthor_vm_prepare_map_op_ctx(struct panthor_vm_op_ctx *op_ctx,
+					 struct panthor_vm *vm,
+					 struct panthor_gem_object *bo,
+					 u64 offset,
+					 size_t size, u64 va,
+					 u32 flags)
+{
+	struct sg_table *sgt = NULL;
+	u64 pt_count;
+	int ret;
+
+	if (!bo)
+		return -EINVAL;
+
+	if ((flags & ~PANTHOR_VM_BIND_OP_MAP_FLAGS) ||
+	    (flags & DRM_PANTHOR_VM_BIND_OP_TYPE_MASK) != DRM_PANTHOR_VM_BIND_OP_TYPE_MAP)
+		return -EINVAL;
+
+	/* Make sure the requested range stays within the BO. */
+	if (size > bo->base.base.size || offset > bo->base.base.size - size)
+		return -EINVAL;
+
+	/* If the BO has an exclusive VM attached, it can't be mapped to other VMs. */
+	if (bo->exclusive_vm && bo->exclusive_vm != vm)
+		return -EINVAL;
+
+	memset(op_ctx, 0, sizeof(*op_ctx));
+	INIT_LIST_HEAD(&op_ctx->returned_vmas);
+	op_ctx->flags = flags;
+	op_ctx->va.range = size;
+	op_ctx->va.addr = va;
+
+	op_ctx->map.new_vma = kzalloc(sizeof(*op_ctx->map.new_vma), GFP_KERNEL);
+	op_ctx->map.next_vma = kzalloc(sizeof(*op_ctx->map.next_vma), GFP_KERNEL);
+	op_ctx->map.prev_vma = kzalloc(sizeof(*op_ctx->map.prev_vma), GFP_KERNEL);
+	if (!op_ctx->map.new_vma || !op_ctx->map.next_vma || !op_ctx->map.prev_vma) {
+		ret = -ENOMEM;
+		goto err_cleanup;
+	}
+
+	if (!bo->base.base.import_attach) {
+		/* Pre-reserve the BO pages, so the map operation doesn't have to
+		 * allocate.
+		 */
+		ret = drm_gem_shmem_pin(&bo->base);
+		if (ret)
+			goto err_cleanup;
+	}
+
+	sgt = drm_gem_shmem_get_pages_sgt(&bo->base);
+	if (IS_ERR(sgt)) {
+		if (!bo->base.base.import_attach)
+			drm_gem_shmem_unpin(&bo->base);
+
+		ret = PTR_ERR(sgt);
+		goto err_cleanup;
+	}
+
+	op_ctx->map.sgt = sgt;
+	op_ctx->map.gem.obj = &bo->base.base;
+	op_ctx->map.gem.offset = offset;
+	drm_gem_object_get(op_ctx->map.gem.obj);
+
+	/* L1, L2 and L3 page tables.
+	 * We could optimize L3 allocation by iterating over the sgt and merging
+	 * 2M contiguous blocks, but it's simpler to over-provision and return
+	 * the pages if they're not used.
+	 */
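+	/*
+	 * For example, mapping 4M at VA 16M reserves one L1, one L2 and
+	 * two L3 page tables (the range spans two 2M blocks), so four
+	 * pages in total. Whatever ends up unused is handed back to
+	 * pt_cache when the op_ctx is cleaned up.
+	 */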
+	pt_count = ((ALIGN(va + size, 1ull << 39) - ALIGN_DOWN(va, 1ull << 39)) >> 39) +
+		   ((ALIGN(va + size, 1ull << 30) - ALIGN_DOWN(va, 1ull << 30)) >> 30) +
+		   ((ALIGN(va + size, 1ull << 21) - ALIGN_DOWN(va, 1ull << 21)) >> 21);
+
+	op_ctx->rsvd_page_tables.pages = kcalloc(pt_count,
+						 sizeof(*op_ctx->rsvd_page_tables.pages),
+						 GFP_KERNEL);
+	if (!op_ctx->rsvd_page_tables.pages) {
+		ret = -ENOMEM;
+		goto err_cleanup;
+	}
+
+	ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
+				    op_ctx->rsvd_page_tables.pages);
+	op_ctx->rsvd_page_tables.count = ret;
+	if (ret != pt_count) {
+		ret = -ENOMEM;
+		goto err_cleanup;
+	}
+
+	return 0;
+
+err_cleanup:
+	panthor_vm_cleanup_op_ctx(op_ctx, vm);
+	return ret;
+}
+
+static int panthor_vm_prepare_unmap_op_ctx(struct panthor_vm_op_ctx *op_ctx,
+					   struct panthor_vm *vm,
+					   u64 va, size_t size)
+{
+	u32 pt_count = 0;
+	int ret;
+
+	memset(op_ctx, 0, sizeof(*op_ctx));
+	INIT_LIST_HEAD(&op_ctx->returned_vmas);
+	op_ctx->va.range = size;
+	op_ctx->va.addr = va;
+	op_ctx->flags = DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP;
+
+	/* Pre-allocate L3 page tables to account for the split-2M-block
+	 * situation on unmap.
+	 */
+	if (va != ALIGN(va, SZ_2M))
+		pt_count++;
+
+	if (va + size != ALIGN(va + size, SZ_2M) &&
+	    ALIGN(va + size, SZ_2M) != ALIGN(va, SZ_2M))
+		pt_count++;
+
+	if (pt_count) {
+		op_ctx->rsvd_page_tables.pages = kcalloc(pt_count,
+							 sizeof(*op_ctx->rsvd_page_tables.pages),
+							 GFP_KERNEL);
+		if (!op_ctx->rsvd_page_tables.pages) {
+			ret = -ENOMEM;
+			goto err_cleanup;
+		}
+
+		ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
+					    op_ctx->rsvd_page_tables.pages);
+		if (ret != pt_count) {
+			ret = -ENOMEM;
+			goto err_cleanup;
+		}
+		op_ctx->rsvd_page_tables.count = pt_count;
+	}
+
+	return 0;
+
+err_cleanup:
+	panthor_vm_cleanup_op_ctx(op_ctx, vm);
+	return ret;
+}
+
+/**
+ * panthor_vm_get_bo_for_va() - Get the GEM object mapped at a virtual address
+ * @vm: VM to look into.
+ * @va: Virtual address to search for.
+ * @bo_offset: Offset of the GEM object mapped at this virtual address.
+ * Only valid on success.
+ *
+ * The object returned by this function might no longer be mapped when the
+ * function returns. It's the caller's responsibility to ensure there are no
+ * concurrent map/unmap operations making the returned value invalid, or
+ * make sure it doesn't matter if the object is no longer mapped.
+ *
+ * Return: A valid pointer on success, an ERR_PTR() otherwise.
+ */
+struct panthor_gem_object *
+panthor_vm_get_bo_for_va(struct panthor_vm *vm, u64 va, u64 *bo_offset)
+{
+	struct panthor_gem_object *bo = ERR_PTR(-ENOENT);
+	struct drm_gpuva *gpuva;
+	struct panthor_vma *vma;
+	int ret;
+
+	/* Take the VM lock to prevent concurrent map/unmap operation. */
+	ret = dma_resv_lock(vm->dummy_gem.resv, NULL);
+	if (drm_WARN_ON(&vm->ptdev->base, ret))
+		return ERR_PTR(ret);
+
+	gpuva = drm_gpuva_find_first(&vm->va_mgr, va, 1);
+	vma = gpuva ? container_of(gpuva, struct panthor_vma, base) : NULL;
+	if (vma && vma->base.gem.obj) {
+		drm_gem_object_get(vma->base.gem.obj);
+		bo = to_panthor_bo(vma->base.gem.obj);
+		*bo_offset = vma->base.gem.offset;
+	}
+	dma_resv_unlock(vm->dummy_gem.resv);
+
+	return bo;
+}
+
+/*
+ * Only 32 VMs per open file. If that becomes a limiting factor, we can
+ * increase this number.
+ */
+#define PANTHOR_MAX_VMS_PER_FILE	 32
+
+/**
+ * panthor_vm_pool_create_vm() - Create a VM
+ * @ptdev: The panthor device.
+ * @pool: The VM pool to create this VM in.
+ * @kernel_va_start: Start of the region reserved for kernel objects.
+ * @kernel_va_range: Size of the region reserved for kernel objects.
+ *
+ * Return: a positive VM ID on success, a negative error code otherwise.
+ */
+int panthor_vm_pool_create_vm(struct panthor_device *ptdev, struct panthor_vm_pool *pool,
+			      u64 kernel_va_start, u64 kernel_va_range)
+{
+	struct panthor_vm *vm;
+	int ret;
+	u32 id;
+
+	vm = panthor_vm_create(ptdev, false, kernel_va_start, kernel_va_range);
+	if (IS_ERR(vm))
+		return PTR_ERR(vm);
+
+	ret = xa_alloc(&pool->xa, &id, vm,
+		       XA_LIMIT(1, PANTHOR_MAX_VMS_PER_FILE), GFP_KERNEL);
+
+	if (ret) {
+		panthor_vm_put(vm);
+		return ret;
+	}
+
+	return id;
+}
+
+static void panthor_vm_destroy(struct panthor_vm *vm)
+{
+	if (!vm)
+		return;
+
+	vm->destroyed = true;
+
+	mutex_lock(&vm->heaps.lock);
+	panthor_heap_pool_destroy(vm->heaps.pool);
+	vm->heaps.pool = NULL;
+	mutex_unlock(&vm->heaps.lock);
+
+	drm_WARN_ON(&vm->ptdev->base,
+		    panthor_vm_unmap_range(vm, vm->va_mgr.mm_start, vm->va_mgr.mm_range));
+	panthor_vm_put(vm);
+}
+
+/**
+ * panthor_vm_pool_destroy_vm() - Destroy a VM.
+ * @pool: VM pool.
+ * @handle: VM handle.
+ *
+ * This function doesn't free the VM object or its resources; it just kills
+ * all mappings, and makes sure nothing can be mapped after that point.
+ *
+ * If there were any active jobs at the time this function is called, those
+ * jobs should experience page faults and be killed as a result.
+ *
+ * The VM resources are freed when the last reference on the VM object is
+ * dropped.
+ *
+ * Return: 0 on success, -EINVAL if @handle doesn't point to a valid VM.
+ */
+int panthor_vm_pool_destroy_vm(struct panthor_vm_pool *pool, u32 handle)
+{
+	struct panthor_vm *vm;
+
+	vm = xa_erase(&pool->xa, handle);
+
+	panthor_vm_destroy(vm);
+
+	return vm ? 0 : -EINVAL;
+}
+
+/**
+ * panthor_vm_pool_get_vm() - Retrieve VM object bound to a VM handle
+ * @pool: VM pool to check.
+ * @handle: Handle of the VM to retrieve.
+ *
+ * Return: A valid pointer if the VM exists, NULL otherwise.
+ */
+struct panthor_vm *
+panthor_vm_pool_get_vm(struct panthor_vm_pool *pool, u32 handle)
+{
+	struct panthor_vm *vm;
+
+	vm = panthor_vm_get(xa_load(&pool->xa, handle));
+
+	return vm;
+}
+
+/**
+ * panthor_vm_pool_destroy() - Destroy a VM pool.
+ * @pfile: File.
+ *
+ * Destroy all VMs in the pool, and release the pool resources.
+ *
+ * Note that VMs can outlive the pool they were created from if other
+ * objects hold a reference to these VMs.
+ */
+void panthor_vm_pool_destroy(struct panthor_file *pfile)
+{
+	struct panthor_vm *vm;
+	unsigned long i;
+
+	if (!pfile->vms)
+		return;
+
+	xa_for_each(&pfile->vms->xa, i, vm)
+		panthor_vm_destroy(vm);
+
+	xa_destroy(&pfile->vms->xa);
+	kfree(pfile->vms);
+}
+
+/**
+ * panthor_vm_pool_create() - Create a VM pool
+ * @pfile: File.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_pool_create(struct panthor_file *pfile)
+{
+	pfile->vms = kzalloc(sizeof(*pfile->vms), GFP_KERNEL);
+	if (!pfile->vms)
+		return -ENOMEM;
+
+	xa_init_flags(&pfile->vms->xa, XA_FLAGS_ALLOC1);
+	return 0;
+}
+
+/* dummy TLB ops, the real TLB flush happens in panthor_vm_flush_range() */
+static void mmu_tlb_flush_all(void *cookie)
+{
+}
+
+static void mmu_tlb_flush_walk(unsigned long iova, size_t size, size_t granule, void *cookie)
+{
+}
+
+static const struct iommu_flush_ops mmu_tlb_ops = {
+	.tlb_flush_all = mmu_tlb_flush_all,
+	.tlb_flush_walk = mmu_tlb_flush_walk,
+};
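+
+/*
+ * Illustrative sketch of how the dummy TLB ops and the custom page table
+ * allocators above are expected to be fed to io-pgtable when a VM page
+ * table is instantiated (the exact values below are made up, see the VM
+ * creation code for the authoritative setup):
+ *
+ *	struct io_pgtable_cfg cfg = {
+ *		.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
+ *		.ias		= va_bits,
+ *		.oas		= pa_bits,
+ *		.coherent_walk	= ptdev->coherent,
+ *		.tlb		= &mmu_tlb_ops,
+ *		.alloc		= alloc_pt,
+ *		.free		= free_pt,
+ *		.iommu_dev	= ptdev->base.dev,
+ *	};
+ *
+ *	vm->pgtbl_ops = alloc_io_pgtable_ops(ARM_64_LPAE_S1, &cfg, vm);
+ *
+ * The .alloc/.free hooks rely on the custom-allocator support added by
+ * the iommu series this driver depends on.
+ */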
+
+static const char *access_type_name(struct panthor_device *ptdev,
+				    u32 fault_status)
+{
+	switch (fault_status & AS_FAULTSTATUS_ACCESS_TYPE_MASK) {
+	case AS_FAULTSTATUS_ACCESS_TYPE_ATOMIC:
+		return "ATOMIC";
+	case AS_FAULTSTATUS_ACCESS_TYPE_READ:
+		return "READ";
+	case AS_FAULTSTATUS_ACCESS_TYPE_WRITE:
+		return "WRITE";
+	case AS_FAULTSTATUS_ACCESS_TYPE_EX:
+		return "EXECUTE";
+	default:
+		drm_WARN_ON(&ptdev->base, 1);
+		return NULL;
+	}
+}
+
+static void panthor_mmu_irq_handler(struct panthor_device *ptdev, u32 status)
+{
+	status = panthor_mmu_fault_mask(ptdev, status);
+	while (status) {
+		u32 as = ffs(status | (status >> 16)) - 1;
+		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
+		u32 new_int_mask;
+		u64 addr;
+		u32 fault_status;
+		u32 exception_type;
+		u32 access_type;
+		u32 source_id;
+
+		fault_status = gpu_read(ptdev, AS_FAULTSTATUS(as));
+		addr = gpu_read(ptdev, AS_FAULTADDRESS_LO(as));
+		addr |= (u64)gpu_read(ptdev, AS_FAULTADDRESS_HI(as)) << 32;
+
+		/* decode the fault status */
+		exception_type = fault_status & 0xFF;
+		access_type = (fault_status >> 8) & 0x3;
+		source_id = (fault_status >> 16);
+
+		/* Page fault only */
+		mutex_lock(&ptdev->mmu->as.slots_lock);
+
+		new_int_mask =
+			panthor_mmu_fault_mask(ptdev, ~ptdev->mmu->as.faulty_mask);
+
+		/* terminal fault, print info about the fault */
+		drm_err(&ptdev->base,
+			"Unhandled Page fault in AS%d at VA 0x%016llX\n"
+			"raw fault status: 0x%X\n"
+			"decoded fault status: %s\n"
+			"exception type 0x%X: %s\n"
+			"access type 0x%X: %s\n"
+			"source id 0x%X\n",
+			as, addr,
+			fault_status,
+			(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
+			exception_type, panthor_exception_name(ptdev, exception_type),
+			access_type, access_type_name(ptdev, fault_status),
+			source_id);
+
+		/* Ignore MMU interrupts on this AS until it's been
+		 * re-enabled.
+		 */
+		ptdev->mmu->irq.mask = new_int_mask;
+		gpu_write(ptdev, MMU_INT_MASK, new_int_mask);
+
+		/* Disable the MMU to kill jobs on this AS. */
+		panthor_mmu_as_disable(ptdev, as);
+		mutex_unlock(&ptdev->mmu->as.slots_lock);
+
+		status &= ~mask;
+	}
+}
+PANTHOR_IRQ_HANDLER(mmu, MMU, panthor_mmu_irq_handler);
+
+/**
+ * panthor_mmu_suspend() - Suspend the MMU logic
+ * @ptdev: Device.
+ *
+ * All we do here is de-assign the AS slots on all active VMs, so things
+ * get flushed to the main memory, and no further access to these VMs is
+ * possible.
+ *
+ * We also suspend the MMU IRQ.
+ */
+void panthor_mmu_suspend(struct panthor_device *ptdev)
+{
+	mutex_lock(&ptdev->mmu->as.slots_lock);
+	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
+		struct panthor_vm *vm = ptdev->mmu->as.slots[i].vm;
+
+		if (vm) {
+			drm_WARN_ON(&ptdev->base, panthor_mmu_as_disable(ptdev, i));
+			vm->as.id = -1;
+			list_del_init(&vm->as.lru_node);
+			ptdev->mmu->as.slots[i].vm = NULL;
+		}
+	}
+	mutex_unlock(&ptdev->mmu->as.slots_lock);
+
+	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
+}
+
+/**
+ * panthor_mmu_resume() - Resume the MMU logic
+ * @ptdev: Device.
+ *
+ * Resume the IRQ.
+ *
+ * We don't re-enable previously active VMs. We assume other parts of the
+ * driver will call panthor_vm_active() on the VMs they intend to use.
+ */
+void panthor_mmu_resume(struct panthor_device *ptdev)
+{
+	mutex_lock(&ptdev->mmu->as.slots_lock);
+	ptdev->mmu->as.alloc_mask = 0;
+	ptdev->mmu->as.faulty_mask = 0;
+	mutex_unlock(&ptdev->mmu->as.slots_lock);
+
+	panthor_mmu_irq_resume(&ptdev->mmu->irq, panthor_mmu_fault_mask(ptdev, ~0));
+}
+
+/**
+ * panthor_mmu_pre_reset() - Prepare for a reset
+ * @ptdev: Device.
+ *
+ * Suspend the IRQ, and make sure all VM_BIND queues are stopped, so we
+ * don't get asked to do a VM operation while the GPU is down.
+ *
+ * We don't cleanly shut down the AS slots here, because the reset might
+ * be a response to an AS_ACTIVE bit being stuck.
+ */
+void panthor_mmu_pre_reset(struct panthor_device *ptdev)
+{
+	struct panthor_vm *vm;
+
+	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
+
+	mutex_lock(&ptdev->mmu->vm.lock);
+	ptdev->mmu->vm.reset_in_progress = true;
+	list_for_each_entry(vm, &ptdev->mmu->vm.list, node)
+		panthor_vm_stop(vm);
+	mutex_unlock(&ptdev->mmu->vm.lock);
+}
+
+/**
+ * panthor_mmu_post_reset() - Restore things after a reset
+ * @ptdev: Device.
+ *
+ * Put the MMU logic back in action after a reset. That implies resuming the
+ * IRQ and re-enabling the VM_BIND queues.
+ */
+void panthor_mmu_post_reset(struct panthor_device *ptdev)
+{
+	struct panthor_vm *vm;
+
+	mutex_lock(&ptdev->mmu->as.slots_lock);
+
+	/* Now that the reset is effective, we can assume that none of the
+	 * AS slots are set up, and clear the faulty flags too.
+	 */
+	ptdev->mmu->as.alloc_mask = 0;
+	ptdev->mmu->as.faulty_mask = 0;
+
+	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
+		struct panthor_vm *vm = ptdev->mmu->as.slots[i].vm;
+
+		if (vm) {
+			vm->as.id = -1;
+			list_del_init(&vm->as.lru_node);
+			ptdev->mmu->as.slots[i].vm = NULL;
+		}
+	}
+
+	mutex_unlock(&ptdev->mmu->as.slots_lock);
+
+	panthor_mmu_irq_resume(&ptdev->mmu->irq, panthor_mmu_fault_mask(ptdev, ~0));
+
+	/* Restart the VM_BIND queues. */
+	mutex_lock(&ptdev->mmu->vm.lock);
+	list_for_each_entry(vm, &ptdev->mmu->vm.list, node) {
+		panthor_vm_start(vm);
+	}
+	ptdev->mmu->vm.reset_in_progress = false;
+	mutex_unlock(&ptdev->mmu->vm.lock);
+}
+
+static void panthor_vm_release(struct kref *kref)
+{
+	struct panthor_vm *vm = container_of(kref, struct panthor_vm, refcount);
+	struct panthor_device *ptdev = vm->ptdev;
+
+	mutex_lock(&vm->heaps.lock);
+	if (drm_WARN_ON(&ptdev->base, vm->heaps.pool))
+		panthor_heap_pool_destroy(vm->heaps.pool);
+	mutex_unlock(&vm->heaps.lock);
+	mutex_destroy(&vm->heaps.lock);
+
+	mutex_lock(&ptdev->mmu->vm.lock);
+	list_del(&vm->node);
+	/* Restore the scheduler state so we can call drm_sched_entity_destroy()
+	 * and drm_sched_fini(). If we get there, that means we have no jobs left
+	 * and no new jobs can be queued, so we can start the scheduler without
+	 * risking interfering with the reset.
+	 */
+	if (ptdev->mmu->vm.reset_in_progress)
+		panthor_vm_start(vm);
+	mutex_unlock(&ptdev->mmu->vm.lock);
+
+	drm_sched_entity_destroy(&vm->entity);
+	drm_sched_fini(&vm->sched);
+
+	mutex_lock(&ptdev->mmu->as.slots_lock);
+	if (vm->as.id >= 0) {
+		int cookie;
+
+		if (drm_dev_enter(&ptdev->base, &cookie)) {
+			panthor_mmu_as_disable(ptdev, vm->as.id);
+			drm_dev_exit(cookie);
+		}
+
+		ptdev->mmu->as.slots[vm->as.id].vm = NULL;
+		clear_bit(vm->as.id, &ptdev->mmu->as.alloc_mask);
+		list_del(&vm->as.lru_node);
+	}
+	mutex_unlock(&ptdev->mmu->as.slots_lock);
+
+	drm_WARN_ON(&ptdev->base,
+		    panthor_vm_unmap_range(vm, vm->va_mgr.mm_start, vm->va_mgr.mm_range));
+
+	free_io_pgtable_ops(vm->pgtbl_ops);
+
+	drm_mm_takedown(&vm->mm);
+	mutex_destroy(&vm->mm_lock);
+	drm_gpuva_manager_destroy(&vm->va_mgr);
+	drm_gem_private_object_fini(&vm->dummy_gem);
+	mutex_destroy(&vm->op_lock);
+	kfree(vm);
+}
+
+/**
+ * panthor_vm_put() - Release a reference on a VM
+ * @vm: VM to release the reference on. Can be NULL.
+ */
+void panthor_vm_put(struct panthor_vm *vm)
+{
+	if (vm)
+		kref_put(&vm->refcount, panthor_vm_release);
+}
+
+/**
+ * panthor_vm_get() - Get a VM reference
+ * @vm: VM to get the reference on. Can be NULL.
+ *
+ * Return: @vm value.
+ */
+struct panthor_vm *panthor_vm_get(struct panthor_vm *vm)
+{
+	if (vm)
+		kref_get(&vm->refcount);
+
+	return vm;
+}
+
+/**
+ * panthor_vm_get_heap_pool() - Get the heap pool attached to a VM
+ * @vm: VM to query the heap pool on.
+ * @create: True if the heap pool should be created when it doesn't exist.
+ *
+ * Heap pools are per-VM. This function allows one to retrieve the heap pool
+ * attached to a VM.
+ *
+ * If no heap pool exists yet, and @create is true, we create one.
+ *
+ * The returned panthor_heap_pool should be released with panthor_heap_pool_put().
+ *
+ * Return: A valid pointer on success, an ERR_PTR() otherwise.
+ */
+struct panthor_heap_pool *panthor_vm_get_heap_pool(struct panthor_vm *vm, bool create)
+{
+	struct panthor_heap_pool *pool;
+
+	mutex_lock(&vm->heaps.lock);
+	if (!vm->heaps.pool && create) {
+		if (vm->destroyed)
+			pool = ERR_PTR(-EINVAL);
+		else
+			pool = panthor_heap_pool_create(vm->ptdev, vm);
+
+		if (!IS_ERR(pool))
+			vm->heaps.pool = panthor_heap_pool_get(pool);
+	} else {
+		pool = panthor_heap_pool_get(vm->heaps.pool);
+	}
+	mutex_unlock(&vm->heaps.lock);
+
+	return pool;
+}
+
+static u64 mair_to_memattr(u64 mair)
+{
+	u64 memattr = 0;
+	u32 i;
+
+	for (i = 0; i < 8; i++) {
+		u8 in_attr = mair >> (8 * i), out_attr;
+		u8 outer = in_attr >> 4, inner = in_attr & 0xf;
+
+		/* For caching to be enabled, the inner and outer caching
+		 * policies both have to be write-back. If one of them is
+		 * write-through or non-cacheable, we just choose
+		 * non-cacheable. Device memory is also translated to
+		 * non-cacheable.
+		 */
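+		/*
+		 * For example, a Normal inner/outer write-back MAIR entry
+		 * (0xff) takes the else branch below (write-back, CPU inner
+		 * shareable, both allocation hints set), while a Device or
+		 * non-cacheable entry like 0x04 is turned into
+		 * AS_MEMATTR_AARCH64_INNER_OUTER_NC.
+		 */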
+		if (!(outer & 3) || !(outer & 4) || !(inner & 4)) {
+			out_attr = AS_MEMATTR_AARCH64_INNER_OUTER_NC |
+				   AS_MEMATTR_AARCH64_SH_MIDGARD_INNER |
+				   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(false, false);
+		} else {
+			/* Use SH_CPU_INNER mode so SH_IS, which is used when
+			 * IOMMU_CACHE is set, actually maps to the standard
+			 * definition of inner-shareable and not Mali's
+			 * internal-shareable mode.
+			 */
+			out_attr = AS_MEMATTR_AARCH64_INNER_OUTER_WB |
+				   AS_MEMATTR_AARCH64_SH_CPU_INNER |
+				   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(inner & 1, inner & 2);
+		}
+
+		memattr |= (u64)out_attr << (8 * i);
+	}
+
+	return memattr;
+}
+
+static void panthor_vma_link(struct panthor_vm *vm, struct panthor_vma *vma)
+{
+	struct panthor_gem_object *bo = to_panthor_bo(vma->base.gem.obj);
+
+	mutex_lock(&bo->gpuva_list_lock);
+	drm_gpuva_link(&vma->base);
+	mutex_unlock(&bo->gpuva_list_lock);
+
+	if (!bo->exclusive_vm)
+		list_add_tail(&vma->node, &vm->shared_bos);
+}
+
+static void panthor_vma_unlink(struct panthor_vm_op_ctx *op_ctx,
+			       struct panthor_vma *vma)
+{
+	struct panthor_gem_object *bo = to_panthor_bo(vma->base.gem.obj);
+
+	mutex_lock(&bo->gpuva_list_lock);
+	drm_gpuva_unlink(&vma->base);
+	mutex_unlock(&bo->gpuva_list_lock);
+
+	list_move_tail(&vma->node, &op_ctx->returned_vmas);
+}
+
+static void panthor_vma_init(struct panthor_vma *vma,
+			     struct drm_gem_object *obj,
+			     u64 offset,
+			     u64 va, u64 range, u32 flags)
+{
+	INIT_LIST_HEAD(&vma->node);
+	vma->flags = flags;
+	vma->base.gem.obj = obj;
+	vma->base.gem.offset = offset;
+	vma->base.va.addr = va;
+	vma->base.va.range = range;
+}
+
+#define PANTHOR_VM_MAP_FLAGS \
+	(DRM_PANTHOR_VM_BIND_OP_MAP_READONLY | \
+	 DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | \
+	 DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED)
+
+static int panthor_gpuva_sm_step_map(struct drm_gpuva_op *op, void *priv)
+{
+	struct panthor_vm *vm = priv;
+	struct panthor_vm_op_ctx *op_ctx = vm->op_ctx;
+	struct panthor_vma *vma = op_ctx->map.new_vma;
+	int ret;
+
+	panthor_vma_init(vma, op->map.gem.obj, op->map.gem.offset, op->map.va.addr,
+			 op->map.va.range, op_ctx->flags & PANTHOR_VM_MAP_FLAGS);
+
+	ret = panthor_vm_map_pages(vm, vma->base.va.addr, flags_to_prot(vma->flags),
+				   op_ctx->map.sgt, vma->base.gem.offset,
+				   vma->base.va.range);
+	if (ret)
+		return ret;
+
+	/* Ref owned by the mapping now, clear the obj field so we don't release the
+	 * pinning/obj ref behind GPUVA's back.
+	 */
+	drm_gpuva_map(&vm->va_mgr, &vma->base, &op->map);
+	panthor_vma_link(vm, op_ctx->map.new_vma);
+	op_ctx->map.gem.obj = NULL;
+	op_ctx->map.new_vma = NULL;
+	return 0;
+}
+
+static int panthor_gpuva_sm_step_remap(struct drm_gpuva_op *op,
+				       void *priv)
+{
+	struct panthor_vma *unmap_vma = container_of(op->remap.unmap->va, struct panthor_vma, base);
+	const u64 va_start = op->remap.prev ?
+			     op->remap.prev->va.addr + op->remap.prev->va.range :
+			     op->remap.unmap->va->va.addr;
+	const u64 va_end = op->remap.next ?
+			   op->remap.next->va.addr :
+			   op->remap.unmap->va->va.addr + op->remap.unmap->va->va.range;
+	struct panthor_vm *vm = priv;
+	struct panthor_vm_op_ctx *op_ctx = vm->op_ctx;
+	struct drm_gpuva *prev_va = NULL, *next_va = NULL;
+	int ret;
+
+	ret = panthor_vm_unmap_pages(vm, va_start, va_end - va_start);
+	if (ret)
+		return ret;
+
+	if (op->remap.prev) {
+		struct panthor_gem_object *bo = to_panthor_bo(op->remap.prev->gem.obj);
+
+		if (!bo->base.base.import_attach) {
+			ret = drm_gem_shmem_pin(&bo->base);
+			if (drm_WARN_ON(&vm->ptdev->base, ret))
+				return ret;
+		}
+
+		panthor_vma_init(op_ctx->map.prev_vma,
+				 op->remap.prev->gem.obj,
+				 op->remap.prev->gem.offset,
+				 op->remap.prev->va.addr,
+				 op->remap.prev->va.range,
+				 unmap_vma->flags);
+		prev_va = &op_ctx->map.prev_vma->base;
+	}
+
+	if (op->remap.next) {
+		struct panthor_gem_object *bo = to_panthor_bo(op->remap.next->gem.obj);
+
+		if (!bo->base.base.import_attach) {
+			ret = drm_gem_shmem_pin(&bo->base);
+			if (drm_WARN_ON(&vm->ptdev->base, ret))
+				return ret;
+		}
+
+		panthor_vma_init(op_ctx->map.next_vma,
+				 op->remap.next->gem.obj,
+				 op->remap.next->gem.offset,
+				 op->remap.next->va.addr,
+				 op->remap.next->va.range,
+				 unmap_vma->flags);
+		next_va = &op_ctx->map.next_vma->base;
+	}
+
+	drm_gpuva_remap(prev_va, next_va, &op->remap);
+
+	if (prev_va) {
+		drm_gem_object_get(prev_va->gem.obj);
+		panthor_vma_link(vm, op_ctx->map.prev_vma);
+		op_ctx->map.prev_vma = NULL;
+	}
+
+	if (next_va) {
+		drm_gem_object_get(next_va->gem.obj);
+		panthor_vma_link(vm, op_ctx->map.next_vma);
+		op_ctx->map.next_vma = NULL;
+	}
+
+	panthor_vma_unlink(op_ctx, unmap_vma);
+	return 0;
+}
+
+static int panthor_gpuva_sm_step_unmap(struct drm_gpuva_op *op,
+				       void *priv)
+{
+	struct panthor_vma *unmap_vma = container_of(op->unmap.va, struct panthor_vma, base);
+	struct panthor_vm *vm = priv;
+	struct panthor_vm_op_ctx *op_ctx = vm->op_ctx;
+	int ret;
+
+	ret = panthor_vm_unmap_pages(vm, unmap_vma->base.va.addr,
+				     unmap_vma->base.va.range);
+	if (drm_WARN_ON(&vm->ptdev->base, ret))
+		return ret;
+
+	drm_gpuva_unmap(&op->unmap);
+	panthor_vma_unlink(op_ctx, unmap_vma);
+	return 0;
+}
+
+static const struct drm_gpuva_fn_ops panthor_gpuva_ops = {
+	.sm_step_map = panthor_gpuva_sm_step_map,
+	.sm_step_remap = panthor_gpuva_sm_step_remap,
+	.sm_step_unmap = panthor_gpuva_sm_step_unmap,
+};
+
+/**
+ * panthor_vm_resv() - Get the dma_resv object attached to a VM.
+ * @vm: VM to get the dma_resv of.
+ *
+ * Return: A dma_resv object.
+ */
+struct dma_resv *panthor_vm_resv(struct panthor_vm *vm)
+{
+	return vm->dummy_gem.resv;
+}
+
+static int
+panthor_vm_exec_op(struct panthor_vm *vm, struct panthor_vm_op_ctx *op,
+		   bool flag_vm_unusable_on_failure)
+{
+	int ret;
+
+	mutex_lock(&vm->op_lock);
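+	/* vm->op_ctx is consumed by the panthor_gpuva_sm_step_*() callbacks
+	 * invoked from drm_gpuva_sm_map/unmap().
+	 */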
+	vm->op_ctx = op;
+	switch (op->flags & DRM_PANTHOR_VM_BIND_OP_TYPE_MASK) {
+	case DRM_PANTHOR_VM_BIND_OP_TYPE_MAP:
+		if (vm->unusable) {
+			ret = -EINVAL;
+			break;
+		}
+
+		ret = drm_gpuva_sm_map(&vm->va_mgr, vm, op->va.addr, op->va.range,
+				       op->map.gem.obj, op->map.gem.offset);
+		break;
+
+	case DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP:
+		ret = drm_gpuva_sm_unmap(&vm->va_mgr, vm, op->va.addr, op->va.range);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	if (ret && flag_vm_unusable_on_failure)
+		vm->unusable = true;
+
+	vm->op_ctx = NULL;
+	mutex_unlock(&vm->op_lock);
+
+	return ret;
+}
+
+static struct dma_fence *
+panthor_vm_bind_run_job(struct drm_sched_job *sched_job)
+{
+	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
+	bool cookie;
+	int ret;
+
+	/* Not only do we report an error whose result is propagated to the
+	 * drm_sched finished fence, but we also flag the VM as unusable, because
+	 * a failure in the async VM_BIND results in an inconsistent state: the VM
+	 * needs to be destroyed and recreated.
+	 */
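+	/* dma_fence_begin/end_signalling() annotate this section as being on
+	 * the dma-fence signalling critical path, so lockdep can flag unsafe
+	 * dependencies (e.g. taking dma-resv locks or doing blocking allocations).
+	 */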
+	cookie = dma_fence_begin_signalling();
+	ret = panthor_vm_exec_op(job->vm, &job->ctx, true);
+	dma_fence_end_signalling(cookie);
+
+	return ret ? ERR_PTR(ret) : NULL;
+}
+
+static void panthor_vm_bind_job_release(struct kref *kref)
+{
+	struct panthor_vm_bind_job *job = container_of(kref, struct panthor_vm_bind_job, refcount);
+
+	if (job->base.s_fence)
+		drm_sched_job_cleanup(&job->base);
+
+	panthor_vm_cleanup_op_ctx(&job->ctx, job->vm);
+	panthor_vm_put(job->vm);
+	kfree(job);
+}
+
+/**
+ * panthor_vm_bind_job_put() - Release a VM_BIND job reference
+ * @sched_job: Job to release the reference on.
+ */
+void panthor_vm_bind_job_put(struct drm_sched_job *sched_job)
+{
+	struct panthor_vm_bind_job *job =
+		container_of(sched_job, struct panthor_vm_bind_job, base);
+
+	if (sched_job)
+		kref_put(&job->refcount, panthor_vm_bind_job_release);
+}
+
+static void
+panthor_vm_bind_free_job(struct drm_sched_job *sched_job)
+{
+	struct panthor_vm_bind_job *job =
+		container_of(sched_job, struct panthor_vm_bind_job, base);
+
+	drm_sched_job_cleanup(sched_job);
+
+	/* Do the heavy cleanups asynchronously, so we're out of the
+	 * dma-signaling path and can acquire dma-resv locks safely.
+	 */
+	queue_work(panthor_cleanup_wq, &job->cleanup_op_ctx_work);
+}
+
+static enum drm_gpu_sched_stat
+panthor_vm_bind_timedout_job(struct drm_sched_job *sched_job)
+{
+	WARN(1, "VM_BIND ops are synchronous for now, there should be no timeout!");
+	return DRM_GPU_SCHED_STAT_NOMINAL;
+}
+
+static const struct drm_sched_backend_ops panthor_vm_bind_ops = {
+	.run_job = panthor_vm_bind_run_job,
+	.free_job = panthor_vm_bind_free_job,
+	.timedout_job = panthor_vm_bind_timedout_job,
+};
+
+/**
+ * panthor_vm_create() - Create a VM
+ * @ptdev: Device.
+ * @for_mcu: True if this is the FW MCU VM.
+ * @auto_va_start: Start of the auto-VA range.
+ * @auto_va_range: Size of the auto-VA range.
+ *
+ * Return: A valid pointer on success, an ERR_PTR() otherwise.
+ */
+struct panthor_vm *
+panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
+		  u64 auto_va_start, u64 auto_va_range)
+{
+	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
+	u32 pa_bits = GPU_MMU_FEATURES_PA_BITS(ptdev->gpu_info.mmu_features);
+	struct drm_gpu_scheduler *sched;
+	struct io_pgtable_cfg pgtbl_cfg;
+	u64 mair, min_va, va_range;
+	struct panthor_vm *vm;
+	int ret;
+
+	vm = kzalloc(sizeof(*vm), GFP_KERNEL);
+	if (!vm)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&vm->heaps.lock);
+	kref_init(&vm->refcount);
+	drm_gem_private_object_init(&ptdev->base, &vm->dummy_gem, 0);
+	vm->for_mcu = for_mcu;
+	vm->ptdev = ptdev;
+	INIT_LIST_HEAD(&vm->shared_bos);
+	mutex_init(&vm->op_lock);
+
+	if (for_mcu) {
+		/* The CSF MCU is a Cortex-M7 and can only address 4G */
+		min_va = 0;
+		va_range = SZ_4G;
+	} else {
+		min_va = 0;
+		va_range = (1ull << va_bits);
+
+		/* If the auto_va_range is zero, we reserve half of the VA
+		 * space for kernel-managed objects.
+		 */
+		if (!auto_va_range) {
+			auto_va_range = va_range / 2;
+			auto_va_start = va_range - auto_va_range;
+		}
+	}
+
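+	/* The drm_mm instance manages the auto-VA range, from which VA space
+	 * is allocated for kernel-owned mappings (see panthor_vm_alloc_va()).
+	 */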
+	mutex_init(&vm->mm_lock);
+	drm_mm_init(&vm->mm, auto_va_start, auto_va_range);
+
+	/* We intentionally leave the reserved range at zero, because we want kernel VMAs
+	 * to be handled the same way user VMAs are.
+	 */
+	drm_gpuva_manager_init(&vm->va_mgr,
+			       for_mcu ? "panthor-MCU-VA-manager" : "panthor-GPU-VA-manager",
+			       min_va, va_range, 0, 0,
+			       &panthor_gpuva_ops);
+	INIT_LIST_HEAD(&vm->node);
+	INIT_LIST_HEAD(&vm->as.lru_node);
+	vm->as.id = -1;
+
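+	/* Stage-1 ARMv8 LPAE page table, with input/output address sizes
+	 * derived from the GPU's reported MMU features.
+	 */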
+	pgtbl_cfg = (struct io_pgtable_cfg) {
+		.pgsize_bitmap	= SZ_4K | SZ_2M,
+		.ias		= va_bits,
+		.oas		= pa_bits,
+		.coherent_walk	= ptdev->coherent,
+		.tlb		= &mmu_tlb_ops,
+		.iommu_dev	= ptdev->base.dev,
+		.alloc		= alloc_pt,
+		.free		= free_pt,
+	};
+
+	vm->pgtbl_ops = alloc_io_pgtable_ops(ARM_64_LPAE_S1, &pgtbl_cfg, vm);
+	if (!vm->pgtbl_ops) {
+		ret = -EINVAL;
+		goto err_gpuva_destroy;
+	}
+
+	/* Bind operations are synchronous for now, no timeout needed. */
+	ret = drm_sched_init(&vm->sched, &panthor_vm_bind_ops, ptdev->mmu->vm.wq, 1, 0,
+			     MAX_SCHEDULE_TIMEOUT, NULL, NULL,
+			     "panthor-vm-bind", DRM_SCHED_POLICY_SINGLE_ENTITY,
+			     ptdev->base.dev);
+	if (ret)
+		goto err_free_io_pgtable;
+
+	sched = &vm->sched;
+	ret = drm_sched_entity_init(&vm->entity, DRM_SCHED_PRIORITY_NORMAL,
+				    &sched, 1, NULL);
+	if (ret)
+		goto err_sched_fini;
+
+	mair = io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg.arm_lpae_s1_cfg.mair;
+	vm->memattr = mair_to_memattr(mair);
+
+	mutex_lock(&ptdev->mmu->vm.lock);
+	list_add_tail(&vm->node, &ptdev->mmu->vm.list);
+
+	/* If a reset is in progress, stop the scheduler. */
+	if (ptdev->mmu->vm.reset_in_progress)
+		panthor_vm_stop(vm);
+	mutex_unlock(&ptdev->mmu->vm.lock);
+
+	return vm;
+
+err_sched_fini:
+	drm_sched_fini(&vm->sched);
+
+err_free_io_pgtable:
+	free_io_pgtable_ops(vm->pgtbl_ops);
+
+err_gpuva_destroy:
+	drm_mm_takedown(&vm->mm);
+	drm_gpuva_manager_destroy(&vm->va_mgr);
+	drm_gem_private_object_fini(&vm->dummy_gem);
+	kfree(vm);
+
+	return ERR_PTR(ret);
+}
+
+static int
+panthor_vm_bind_prepare_op_ctx(struct drm_file *file,
+			       struct panthor_vm *vm,
+			       const struct drm_panthor_vm_bind_op *op,
+			       struct panthor_vm_op_ctx *op_ctx)
+{
+	struct drm_gem_object *gem;
+	int ret;
+
+	/* Aligned on page size. */
+	if ((op->va | op->size) & ~PAGE_MASK)
+		return -EINVAL;
+
+	switch (op->flags & DRM_PANTHOR_VM_BIND_OP_TYPE_MASK) {
+	case DRM_PANTHOR_VM_BIND_OP_TYPE_MAP:
+		gem = drm_gem_object_lookup(file, op->bo_handle);
+		ret = panthor_vm_prepare_map_op_ctx(op_ctx, vm,
+						    gem ? to_panthor_bo(gem) : NULL,
+						    op->bo_offset,
+						    op->size,
+						    op->va,
+						    op->flags);
+		drm_gem_object_put(gem);
+		return ret;
+
+	case DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP:
+		return panthor_vm_prepare_unmap_op_ctx(op_ctx, vm, op->va, op->size);
+
+	default:
+		return -EINVAL;
+	}
+}
+
+static void panthor_vm_bind_job_cleanup_op_ctx_work(struct work_struct *work)
+{
+	struct panthor_vm_bind_job *job =
+		container_of(work, struct panthor_vm_bind_job, cleanup_op_ctx_work);
+
+	panthor_vm_cleanup_op_ctx(&job->ctx, job->vm);
+	panthor_vm_bind_job_put(&job->base);
+}
+
+/**
+ * panthor_vm_bind_job_create() - Create a VM_BIND job
+ * @file: File.
+ * @vm: VM targeted by the VM_BIND job.
+ * @op: VM operation data.
+ *
+ * Return: A valid pointer on success, an ERR_PTR() otherwise.
+ */
+struct drm_sched_job *
+panthor_vm_bind_job_create(struct drm_file *file,
+			   struct panthor_vm *vm,
+			   const struct drm_panthor_vm_bind_op *op)
+{
+	struct panthor_vm_bind_job *job;
+	int ret;
+
+	if (!vm)
+		return ERR_PTR(-EINVAL);
+
+	if (vm->destroyed || vm->unusable)
+		return ERR_PTR(-EINVAL);
+
+	job = kzalloc(sizeof(*job), GFP_KERNEL);
+	if (!job)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_WORK(&job->cleanup_op_ctx_work, panthor_vm_bind_job_cleanup_op_ctx_work);
+	kref_init(&job->refcount);
+	job->vm = panthor_vm_get(vm);
+
+	ret = panthor_vm_bind_prepare_op_ctx(file, vm, op, &job->ctx);
+	if (ret)
+		goto err_put_job;
+
+	ret = drm_sched_job_init(&job->base, &vm->entity, vm);
+	if (ret)
+		goto err_put_job;
+
+	return &job->base;
+
+err_put_job:
+	panthor_vm_bind_job_put(&job->base);
+	return ERR_PTR(ret);
+}
+
+/**
+ * panthor_vm_bind_job_prepare_resvs() - Prepare VM_BIND job dma_resvs
+ * @exec: The locking/preparation context.
+ * @sched_job: The job to prepare resvs on.
+ *
+ * Locks and prepare the VM resv.
+ *
+ * If this is a map operation, locks and prepares the GEM resv.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_bind_job_prepare_resvs(struct drm_exec *exec,
+				      struct drm_sched_job *sched_job)
+{
+	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
+	int ret;
+
+	/* Acquire the VM lock and reserve a slot for this VM bind job. */
+	ret = drm_exec_prepare_obj(exec, &job->vm->dummy_gem, 1);
+	if (ret)
+		return ret;
+
+	if (job->ctx.map.gem.obj) {
+		/* Lock/prepare the GEM being mapped. */
+		ret = drm_exec_prepare_obj(exec, job->ctx.map.gem.obj, 1);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_vm_bind_job_add_resvs_deps() - Add implicit deps to the VM_BIND job
+ * @sched_job: Job to add implicit deps on.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_bind_job_add_resvs_deps(struct drm_sched_job *sched_job)
+{
+	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
+	int ret;
+
+	/* We use explicit fencing, so the only implicit dependencies we wait on
+	 * are DMA_RESV_USAGE_KERNEL fences on the VM and the BO being mapped. Any
+	 * extra dependencies should be passed to the VM_BIND ioctl.
+	 */
+	ret = drm_sched_job_add_resv_dependencies(sched_job,
+						  job->vm->dummy_gem.resv,
+						  DMA_RESV_USAGE_KERNEL);
+	if (ret)
+		return ret;
+
+	if (job->ctx.map.gem.obj) {
+		ret = drm_sched_job_add_resv_dependencies(sched_job,
+							  job->ctx.map.gem.obj->resv,
+							  DMA_RESV_USAGE_KERNEL);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_vm_bind_job_update_resvs() - Update the resv objects touched by a job
+ * @sched_job: Job to update the resvs on.
+ */
+void panthor_vm_bind_job_update_resvs(struct drm_sched_job *sched_job)
+{
+	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
+
+	/* Explicit sync => we just register our job finished fence as bookkeep. */
+	dma_resv_add_fence(job->vm->dummy_gem.resv,
+			   &sched_job->s_fence->finished,
+			   DMA_RESV_USAGE_BOOKKEEP);
+
+	if (job->ctx.map.gem.obj) {
+		dma_resv_add_fence(job->ctx.map.gem.obj->resv,
+				   &sched_job->s_fence->finished,
+				   DMA_RESV_USAGE_BOOKKEEP);
+	}
+}
+
+/**
+ * panthor_vm_bind_exec_sync_op() - Execute a VM_BIND operation synchronously.
+ * @file: File.
+ * @vm: VM targeted by the VM operation.
+ * @op: Data describing the VM operation.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_bind_exec_sync_op(struct drm_file *file,
+				 struct panthor_vm *vm,
+				 struct drm_panthor_vm_bind_op *op)
+{
+	struct panthor_vm_op_ctx op_ctx;
+	int ret;
+
+	/* No sync objects allowed on synchronous operations. */
+	if (op->syncs.count)
+		return -EINVAL;
+
+	if (!op->size)
+		return 0;
+
+	ret = panthor_vm_bind_prepare_op_ctx(file, vm, op, &op_ctx);
+	if (ret)
+		return ret;
+
+	ret = panthor_vm_exec_op(vm, &op_ctx, false);
+	panthor_vm_cleanup_op_ctx(&op_ctx, vm);
+
+	return ret;
+}
+
+/**
+ * panthor_vm_map_bo_range() - Map a GEM object range to a VM
+ * @vm: VM to map the GEM to.
+ * @bo: GEM object to map.
+ * @offset: Offset in the GEM object.
+ * @size: Size to map.
+ * @va: Virtual address to map the object to.
+ * @flags: Combination of drm_panthor_vm_bind_op_flags flags.
+ * Only map-related flags are valid.
+ *
+ * Internal use only. For userspace requests, use
+ * panthor_vm_bind_exec_sync_op() instead.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_map_bo_range(struct panthor_vm *vm, struct panthor_gem_object *bo,
+			    u64 offset, size_t size, u64 va, u32 flags)
+{
+	struct panthor_vm_op_ctx op_ctx;
+	int ret;
+
+	ret = panthor_vm_prepare_map_op_ctx(&op_ctx, vm, bo, offset, size, va, flags);
+	if (ret)
+		return ret;
+
+	ret = panthor_vm_exec_op(vm, &op_ctx, false);
+	panthor_vm_cleanup_op_ctx(&op_ctx, vm);
+
+	return ret;
+}
+
+/**
+ * panthor_vm_unmap_range() - Unmap a portion of the VA space
+ * @vm: VM to unmap the region from.
+ * @va: Virtual address to unmap. Must be 4k aligned.
+ * @size: Size of the region to unmap. Must be 4k aligned.
+ *
+ * Internal use only. For userspace requests, use
+ * panthor_vm_bind_exec_sync_op() instead.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_unmap_range(struct panthor_vm *vm, u64 va, size_t size)
+{
+	struct panthor_vm_op_ctx op_ctx;
+	int ret;
+
+	ret = panthor_vm_prepare_unmap_op_ctx(&op_ctx, vm, va, size);
+	if (ret)
+		return ret;
+
+	ret = panthor_vm_exec_op(vm, &op_ctx, false);
+	panthor_vm_cleanup_op_ctx(&op_ctx, vm);
+
+	return ret;
+}
+
+/**
+ * panthor_vm_prepare_mapped_bos_resvs() - Prepare resvs on VM BOs.
+ * @exec: Locking/preparation context.
+ * @vm: VM targeted by the GPU job.
+ *
+ * GPU jobs assume all BOs bound to the VM at the time the job is submitted
+ * are available when the job is executed. In order to guarantee that, we
+ * need to reserve a slot on all BOs mapped to a VM and update this slot with
+ * the job fence after its submission.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm *vm)
+{
+	struct panthor_vma *vma;
+	int ret;
+
+	/* Acquire the VM lock and reserve a slot for this GPU job. */
+	ret = drm_exec_prepare_obj(exec, &vm->dummy_gem, 1);
+	if (ret)
+		return ret;
+
+	list_for_each_entry(vma, &vm->shared_bos, node) {
+		ret = drm_exec_prepare_obj(exec, vma->base.gem.obj, 1);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_vm_add_bos_resvs_deps_to_job() - Add implicit VM deps to a GPU job
+ * @vm: VM targeted by the GPU job.
+ * @job: GPU job.
+ *
+ * We just take care of kernel access. Other accesses should be passed as
+ * explicit dependencies to the job.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_vm_add_bos_resvs_deps_to_job(struct panthor_vm *vm,
+					 struct drm_sched_job *job)
+{
+	struct panthor_vma *vma;
+	int ret;
+
+	/* We use explicit fencing, so the only implicit dependencies we wait on
+	 * are DMA_RESV_USAGE_KERNEL fences on the VM and the BOs mapped to it. Any
+	 * extra dependencies should be passed explicitly to the job.
+	 */
+	ret = drm_sched_job_add_resv_dependencies(job,
+						  vm->dummy_gem.resv,
+						  DMA_RESV_USAGE_KERNEL);
+	if (ret)
+		return ret;
+
+	list_for_each_entry(vma, &vm->shared_bos, node) {
+		ret = drm_sched_job_add_resv_dependencies(job,
+							  vma->base.gem.obj->resv,
+							  DMA_RESV_USAGE_KERNEL);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_vm_add_job_fence_to_bos_resvs() - Add GPU job fence to GEM resvs
+ * @vm: VM targeted by the GPU job.
+ * @job: GPU job.
+ *
+ * Update the GEM resvs after a job has been submitted. All GEMs currently
+ * bound to the VMs get the job fence added to their resv as bookkeep. If
+ * another type of implicit dependency is needed, it should be updated
+ * with %DMA_BUF_IOCTL_IMPORT_SYNC_FILE after the
+ * %DRM_IOCTL_PANTHOR_GROUP_SUBMIT ioctl has returned.
+ */
+void panthor_vm_add_job_fence_to_bos_resvs(struct panthor_vm *vm,
+					   struct drm_sched_job *job)
+{
+	struct panthor_vma *vma;
+
+	/* Explicit sync => we just register our job finished fence as bookkeep. */
+	dma_resv_add_fence(vm->dummy_gem.resv,
+			   &job->s_fence->finished,
+			   DMA_RESV_USAGE_BOOKKEEP);
+
+	list_for_each_entry(vma, &vm->shared_bos, node) {
+		dma_resv_add_fence(vma->base.gem.obj->resv,
+				   &job->s_fence->finished,
+				   DMA_RESV_USAGE_BOOKKEEP);
+	}
+}
+
+/**
+ * panthor_mmu_unplug() - Unplug the MMU logic
+ * @ptdev: Device.
+ *
+ * No access to the MMU regs should be done after this function is called.
+ * We suspend the IRQ and disable all VMs to guarantee that.
+ */
+void panthor_mmu_unplug(struct panthor_device *ptdev)
+{
+	if (ptdev->mmu->irq.irq > 0)
+		panthor_mmu_irq_suspend(&ptdev->mmu->irq);
+
+	mutex_lock(&ptdev->mmu->as.slots_lock);
+	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
+		struct panthor_vm *vm = ptdev->mmu->as.slots[i].vm;
+
+		if (vm) {
+			drm_WARN_ON(&ptdev->base, panthor_mmu_as_disable(ptdev, i));
+			vm->as.id = -1;
+			list_del_init(&vm->as.lru_node);
+			clear_bit(i, &ptdev->mmu->as.alloc_mask);
+			ptdev->mmu->as.slots[i].vm = NULL;
+		}
+	}
+	mutex_unlock(&ptdev->mmu->as.slots_lock);
+}
+
+static void panthor_mmu_release_wq(struct drm_device *ddev, void *res)
+{
+	destroy_workqueue(res);
+}
+
+/**
+ * panthor_mmu_init() - Initialize the MMU logic.
+ * @ptdev: Device.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_mmu_init(struct panthor_device *ptdev)
+{
+	struct panthor_mmu *mmu;
+	int ret, irq;
+
+	mmu = drmm_kzalloc(&ptdev->base, sizeof(*mmu), GFP_KERNEL);
+	if (!mmu)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&mmu->as.lru_list);
+
+	for (u32 i = 0; i < ARRAY_SIZE(mmu->as.slots); i++)
+		spin_lock_init(&mmu->as.slots[i].lock);
+
+	drmm_mutex_init(&ptdev->base, &mmu->as.slots_lock);
+	INIT_LIST_HEAD(&mmu->vm.list);
+	drmm_mutex_init(&ptdev->base, &mmu->vm.lock);
+
+	ptdev->mmu = mmu;
+
+	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "mmu");
+	if (irq <= 0)
+		return -ENODEV;
+
+	ret = panthor_request_mmu_irq(ptdev, &mmu->irq, irq,
+				      panthor_mmu_fault_mask(ptdev, ~0));
+	if (ret)
+		return ret;
+
+	mmu->vm.wq = alloc_workqueue("panthor-vm-bind", WQ_UNBOUND, 0);
+	if (!mmu->vm.wq)
+		return -ENOMEM;
+
+	return drmm_add_action_or_reset(&ptdev->base, panthor_mmu_release_wq, mmu->vm.wq);
+}
+
+#ifdef CONFIG_DEBUG_FS
+static int show_vm_gpuvas(struct panthor_vm *vm, struct seq_file *m)
+{
+	int ret;
+
+	mutex_lock(&vm->op_lock);
+	ret = drm_debugfs_gpuva_info(m, &vm->va_mgr);
+	mutex_unlock(&vm->op_lock);
+
+	return ret;
+}
+
+static int show_each_vm(struct seq_file *m, void *arg)
+{
+	struct drm_info_node *node = (struct drm_info_node *)m->private;
+	struct drm_device *ddev = node->minor->dev;
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+	int (*show)(struct panthor_vm *, struct seq_file *) = node->info_ent->data;
+	struct panthor_vm *vm;
+	int ret = 0;
+
+	mutex_lock(&ptdev->mmu->vm.lock);
+	list_for_each_entry(vm, &ptdev->mmu->vm.list, node) {
+		ret = show(vm, m);
+		if (ret < 0)
+			break;
+
+		seq_puts(m, "\n");
+	}
+	mutex_unlock(&ptdev->mmu->vm.lock);
+
+	return ret;
+}
+
+static struct drm_info_list panthor_mmu_debugfs_list[] = {
+	DRM_DEBUGFS_GPUVA_INFO(show_each_vm, show_vm_gpuvas),
+};
+
+/**
+ * panthor_mmu_debugfs_init() - Initialize MMU debugfs entries
+ * @minor: Minor.
+ */
+void panthor_mmu_debugfs_init(struct drm_minor *minor)
+{
+	drm_debugfs_create_files(panthor_mmu_debugfs_list,
+				 ARRAY_SIZE(panthor_mmu_debugfs_list),
+				 minor->debugfs_root, minor);
+}
+#endif /* CONFIG_DEBUG_FS */
+
+/**
+ * panthor_mmu_pt_cache_init() - Initialize the page table cache.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_mmu_pt_cache_init(void)
+{
+	pt_cache = kmem_cache_create("panthor-mmu-pt", SZ_4K, SZ_4K, 0, NULL);
+	if (!pt_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/**
+ * panthor_mmu_pt_cache_fini() - Destroy the page table cache.
+ */
+void panthor_mmu_pt_cache_fini(void)
+{
+	kmem_cache_destroy(pt_cache);
+}
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.h b/drivers/gpu/drm/panthor/panthor_mmu.h
new file mode 100644
index 000000000000..d94925ccdc8c
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_mmu.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANTHOR_MMU_H__
+#define __PANTHOR_MMU_H__
+
+struct drm_exec;
+struct drm_sched_job;
+struct panthor_gem_object;
+struct panthor_heap_pool;
+struct panthor_vm;
+struct panthor_vma;
+struct panthor_mmu;
+
+int panthor_mmu_init(struct panthor_device *ptdev);
+void panthor_mmu_unplug(struct panthor_device *ptdev);
+void panthor_mmu_pre_reset(struct panthor_device *ptdev);
+void panthor_mmu_post_reset(struct panthor_device *ptdev);
+void panthor_mmu_suspend(struct panthor_device *ptdev);
+void panthor_mmu_resume(struct panthor_device *ptdev);
+
+int panthor_vm_map_bo_range(struct panthor_vm *vm, struct panthor_gem_object *bo,
+			    u64 offset, size_t size, u64 va, u32 flags);
+int panthor_vm_unmap_range(struct panthor_vm *vm, u64 va, size_t size);
+struct panthor_gem_object *
+panthor_vm_get_bo_for_va(struct panthor_vm *vm, u64 va, u64 *bo_offset);
+
+int panthor_vm_active(struct panthor_vm *vm);
+void panthor_vm_idle(struct panthor_vm *vm);
+int panthor_vm_as(struct panthor_vm *vm);
+
+struct panthor_heap_pool *
+panthor_vm_get_heap_pool(struct panthor_vm *vm, bool create);
+
+struct panthor_vm *panthor_vm_get(struct panthor_vm *vm);
+void panthor_vm_put(struct panthor_vm *vm);
+struct panthor_vm *panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
+				     u64 auto_va_start, u64 auto_va_range);
+
+int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec,
+					struct panthor_vm *vm);
+int panthor_vm_add_bos_resvs_deps_to_job(struct panthor_vm *vm,
+					 struct drm_sched_job *job);
+void panthor_vm_add_job_fence_to_bos_resvs(struct panthor_vm *vm,
+					   struct drm_sched_job *job);
+
+struct dma_resv *panthor_vm_resv(struct panthor_vm *vm);
+
+void panthor_vm_pool_destroy(struct panthor_file *pfile);
+int panthor_vm_pool_create(struct panthor_file *pfile);
+int panthor_vm_pool_create_vm(struct panthor_device *ptdev, struct panthor_vm_pool *pool,
+			      u64 kernel_va_start, u64 kernel_va_range);
+int panthor_vm_pool_destroy_vm(struct panthor_vm_pool *pool, u32 handle);
+struct panthor_vm *panthor_vm_pool_get_vm(struct panthor_vm_pool *pool, u32 handle);
+
+struct drm_mm_node *panthor_vm_alloc_va(struct panthor_vm *vm, size_t size);
+void panthor_vm_free_va(struct panthor_vm *vm, struct drm_mm_node *mm_node);
+
+int panthor_vm_bind_exec_sync_op(struct drm_file *file,
+				 struct panthor_vm *vm,
+				 struct drm_panthor_vm_bind_op *op);
+
+struct drm_sched_job *
+panthor_vm_bind_job_create(struct drm_file *file,
+			   struct panthor_vm *vm,
+			   const struct drm_panthor_vm_bind_op *op);
+void panthor_vm_bind_job_put(struct drm_sched_job *job);
+int panthor_vm_bind_job_prepare_resvs(struct drm_exec *exec,
+				      struct drm_sched_job *job);
+int panthor_vm_bind_job_add_resvs_deps(struct drm_sched_job *job);
+void panthor_vm_bind_job_update_resvs(struct drm_sched_job *job);
+
+int panthor_mmu_pt_cache_init(void);
+void panthor_mmu_pt_cache_fini(void);
+
+#ifdef CONFIG_DEBUG_FS
+void panthor_mmu_debugfs_init(struct drm_minor *minor);
+#endif
+
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 09/15] drm/panthor: Add the FW logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (7 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 08/15] drm/panthor: Add the MMU/VM " Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-16 16:01   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 10/15] drm/panthor: Add the heap " Boris Brezillon
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Contains everything that's FW related: the code dealing with the
microcontroller unit (MCU) that runs the FW, and anything related to
allocating memory shared between the FW and the CPU.

A few global FW events are processed in the IRQ handler; the rest is
forwarded to the scheduler, since scheduling is the primary reason the
FW exists, and also the main source of FW <-> kernel interactions.

v2:
- Rename the driver (pancsf -> panthor)
- Rename the file (_mcu -> _fw)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Use the panthor_irq layer to manage/process IRQs

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_fw.c | 1417 ++++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_fw.h |  505 +++++++++
 2 files changed, 1922 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h

diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
new file mode 100644
index 000000000000..359a68f7af03
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -0,0 +1,1417 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/clk.h>
+#include <linux/dma-mapping.h>
+#include <linux/firmware.h>
+#include <linux/iopoll.h>
+#include <linux/iosys-map.h>
+#include <linux/mutex.h>
+#include <linux/platform_device.h>
+
+#include <drm/drm_drv.h>
+#include <drm/drm_managed.h>
+
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_gpu.h"
+#include "panthor_regs.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+#include "panthor_sched.h"
+
+#define CSF_FW_NAME "mali_csffw.bin"
+
+#define PING_INTERVAL_MS			12000
+#define PROGRESS_TIMEOUT_CYCLES			(5ull * 500 * 1024 * 1024)
+#define PROGRESS_TIMEOUT_SCALE_SHIFT		10
+#define IDLE_HYSTERESIS_US			800
+#define PWROFF_HYSTERESIS_US			10000
+
+/**
+ * struct panthor_fw_mem - FW memory
+ */
+struct panthor_fw_mem {
+	/** @bo: Buffer object backing the FW memory. */
+	struct panthor_gem_object *bo;
+
+	/** @kmap: Kernel CPU mapping of the FW memory. */
+	void *kmap;
+
+	/** @va: MCU mapping of the FW memory. */
+	u64 va;
+};
+
+/**
+ * struct panthor_fw_binary_hdr - Firmware binary header.
+ */
+struct panthor_fw_binary_hdr {
+	/** @magic: Magic value to check binary validity. */
+	u32 magic;
+#define CSF_FW_BINARY_HEADER_MAGIC		0xc3f13a6e
+
+	/** @minor: Minor FW version. */
+	u8 minor;
+
+	/** @major: Major FW version. */
+	u8 major;
+#define CSF_FW_BINARY_HEADER_MAJOR_MAX		0
+
+	/** @padding1: MBZ. */
+	u16 padding1;
+
+	/** @version_hash: FW version hash. */
+	u32 version_hash;
+
+	/** @padding2: MBZ. */
+	u32 padding2;
+
+	/** @size: FW binary size. */
+	u32 size;
+};
+
+/**
+ * enum panthor_fw_binary_entry_type - Firmware binary entry type
+ */
+enum panthor_fw_binary_entry_type {
+	/** @CSF_FW_BINARY_ENTRY_TYPE_IFACE: Host <-> FW interface. */
+	CSF_FW_BINARY_ENTRY_TYPE_IFACE = 0,
+
+	/** @CSF_FW_BINARY_ENTRY_TYPE_CONFIG: FW config. */
+	CSF_FW_BINARY_ENTRY_TYPE_CONFIG = 1,
+
+	/** @CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST: Unit-tests. */
+	CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST = 2,
+
+	/** @CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER: Trace buffer interface. */
+	CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER = 3,
+
+	/** @CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA: Timeline metadata interface. */
+	CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA = 4,
+};
+
+#define CSF_FW_BINARY_ENTRY_TYPE(ehdr)					((ehdr) & 0xff)
+#define CSF_FW_BINARY_ENTRY_SIZE(ehdr)					(((ehdr) >> 8) & 0xff)
+#define CSF_FW_BINARY_ENTRY_UPDATE					BIT(30)
+#define CSF_FW_BINARY_ENTRY_OPTIONAL					BIT(31)
+
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_RD					BIT(0)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_WR					BIT(1)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_EX					BIT(2)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_NONE			(0 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED			(1 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_UNCACHED_COHERENT	(2 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED_COHERENT		(3 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK			GENMASK(4, 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_PROT				BIT(5)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED				BIT(30)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO				BIT(31)
+
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_SUPPORTED_FLAGS			\
+	(CSF_FW_BINARY_IFACE_ENTRY_RD_RD |				\
+	 CSF_FW_BINARY_IFACE_ENTRY_RD_WR |				\
+	 CSF_FW_BINARY_IFACE_ENTRY_RD_EX |				\
+	 CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK |			\
+	 CSF_FW_BINARY_IFACE_ENTRY_RD_PROT |				\
+	 CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED  |				\
+	 CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO)
+
+/**
+ * struct panthor_fw_binary_section_entry_hdr - Describes a section of FW binary
+ */
+struct panthor_fw_binary_section_entry_hdr {
+	/** @flags: Section flags. */
+	u32 flags;
+
+	/** @va: MCU virtual range to map this binary section to. */
+	struct {
+		/** @start: Start address. */
+		u32 start;
+
+		/** @end: End address. */
+		u32 end;
+	} va;
+
+	/** @data: Data to initialize the FW section with. */
+	struct {
+		/** @start: Start offset in the FW binary. */
+		u32 start;
+
+		/** @end: End offset in the FW binary. */
+		u32 end;
+	} data;
+};
+
+/**
+ * struct panthor_fw_binary_iter - Firmware binary iterator
+ *
+ * Used to parse a firmware binary.
+ */
+struct panthor_fw_binary_iter {
+	/** @data: FW binary data. */
+	const void *data;
+
+	/** @size: FW binary size. */
+	size_t size;
+
+	/** @offset: Iterator offset. */
+	size_t offset;
+};
+
+/**
+ * struct panthor_fw_section - FW section
+ */
+struct panthor_fw_section {
+	/** @node: Used to keep track of FW sections. */
+	struct list_head node;
+
+	/** @flags: Section flags, as encoded in the FW binary. */
+	u32 flags;
+
+	/** @mem: Section memory. */
+	struct panthor_fw_mem *mem;
+
+	/**
+	 * @name: Name of the section, as specified in the binary.
+	 *
+	 * Can be NULL.
+	 */
+	const char *name;
+
+	/**
+	 * @data: Initial data copied to the FW memory.
+	 *
+	 * We keep data around so we can reload sections after a reset.
+	 */
+	struct {
+		/** @buf: Buffer used to store init data. */
+		const void *buf;
+
+		/** @size: Size of @buf in bytes. */
+		size_t size;
+	} data;
+};
+
+#define CSF_MCU_SHARED_REGION_START		0x04000000ULL
+#define CSF_MCU_SHARED_REGION_SIZE		0x04000000ULL
+
+#define MIN_CS_PER_CSG				8
+#define MIN_CSGS				3
+#define MAX_CSG_PRIO				0xf
+
+#define CSF_IFACE_VERSION(major, minor, patch)	\
+	(((major) << 24) | ((minor) << 16) | (patch))
+#define CSF_IFACE_VERSION_MAJOR(v)		((v) >> 24)
+#define CSF_IFACE_VERSION_MINOR(v)		(((v) >> 16) & 0xff)
+#define CSF_IFACE_VERSION_PATCH(v)		((v) & 0xffff)
+
+#define CSF_GROUP_CONTROL_OFFSET		0x1000
+#define CSF_STREAM_CONTROL_OFFSET		0x40
+#define CSF_UNPRESERVED_REG_COUNT		4
+
+/**
+ * struct panthor_fw_iface - FW interfaces
+ */
+struct panthor_fw_iface {
+	/** @global: Global interface. */
+	struct panthor_fw_global_iface global;
+
+	/** @groups: Group slot interfaces. */
+	struct panthor_fw_csg_iface groups[MAX_CSGS];
+
+	/** @streams: Command stream slot interfaces. */
+	struct panthor_fw_cs_iface streams[MAX_CSGS][MAX_CS_PER_CSG];
+};
+
+/**
+ * struct panthor_fw - Firmware management
+ */
+struct panthor_fw {
+	/** @vm: MCU VM. */
+	struct panthor_vm *vm;
+
+	/** @sections: List of FW sections. */
+	struct list_head sections;
+
+	/** @shared_section: The section containing the FW interfaces. */
+	struct panthor_fw_section *shared_section;
+
+	/** @iface: FW interfaces. */
+	struct panthor_fw_iface iface;
+
+	/** @watchdog: Collection of fields relating to the FW watchdog. */
+	struct {
+		/** @ping_work: Delayed work used to ping the FW. */
+		struct delayed_work ping_work;
+	} watchdog;
+
+	/**
+	 * @waitqueues: Request waitqueues.
+	 *
+	 * Every time a request is sent to a command stream group or the global
+	 * interface, the caller will first busy wait for the request to be
+	 * acknowledged, and then fall back to a sleeping wait.
+	 *
+	 * Those wait queues are here to support the sleeping wait flavor.
+	 *
+	 * Entry 31 is the global waitqueue, the other ones are the command
+	 * stream group slot waitqueues.
+	 */
+	wait_queue_head_t waitqueues[32];
+
+	/** @booted: True if the FW is booted. */
+	bool booted;
+
+	/**
+	 * @fast_reset: True if the post_reset logic can proceed with a fast reset.
+	 *
+	 * A fast reset is just a reset where the driver doesn't reload the FW sections.
+	 *
+	 * Any time the firmware is properly suspended, a fast reset can take place.
+	 * On the other hand, if the halt operation failed, the driver will reload
+	 * all sections to make sure we start from a fresh state.
+	 */
+	bool fast_reset;
+
+	/** @irq: Job irq data. */
+	struct panthor_irq irq;
+};
+
+/**
+ * panthor_fw_get_glb_iface() - Get the global interface
+ * @ptdev: Device.
+ *
+ * Return: The global interface.
+ */
+struct panthor_fw_global_iface *
+panthor_fw_get_glb_iface(struct panthor_device *ptdev)
+{
+	return &ptdev->fw->iface.global;
+}
+
+/**
+ * panthor_fw_get_csg_iface() - Get a command stream group slot interface
+ * @ptdev: Device.
+ * @csg_slot: Index of the command stream group slot.
+ *
+ * Return: The command stream group slot interface.
+ */
+struct panthor_fw_csg_iface *
+panthor_fw_get_csg_iface(struct panthor_device *ptdev, u32 csg_slot)
+{
+	if (drm_WARN_ON(&ptdev->base, csg_slot >= MAX_CSGS))
+		return NULL;
+
+	return &ptdev->fw->iface.groups[csg_slot];
+}
+
+/**
+ * panthor_fw_get_cs_iface() - Get a command stream slot interface
+ * @ptdev: Device.
+ * @csg_slot: Index of the command stream group slot.
+ * @cs_slot: Index of the command stream slot.
+ *
+ * Return: The command stream slot interface.
+ */
+struct panthor_fw_cs_iface *
+panthor_fw_get_cs_iface(struct panthor_device *ptdev, u32 csg_slot, u32 cs_slot)
+{
+	if (drm_WARN_ON(&ptdev->base, csg_slot >= MAX_CSGS || cs_slot >= MAX_CS_PER_CSG))
+		return NULL;
+
+	return &ptdev->fw->iface.streams[csg_slot][cs_slot];
+}
+
+/**
+ * panthor_fw_conv_timeout() - Convert a timeout into a cycle-count
+ * @ptdev: Device.
+ * @timeout_us: Timeout expressed in micro-seconds.
+ *
+ * The FW has two timer sources: the GPU counter or arch-timer. We need
+ * to express timeouts in term of number of cycles and specify which
+ * timer source should be used.
+ *
+ * Return: A value suitable for timeout fields in the global interface.
+ */
+static u32 panthor_fw_conv_timeout(struct panthor_device *ptdev, u32 timeout_us)
+{
+	bool use_cycle_counter = false;
+	u32 timer_rate = 0;
+	u64 cycles;
+
+#ifdef CONFIG_ARM_ARCH_TIMER
+	timer_rate = arch_timer_get_cntfrq();
+#endif
+
+	if (!timer_rate) {
+		use_cycle_counter = true;
+		timer_rate = clk_get_rate(ptdev->clks.core);
+	}
+
+	if (drm_WARN_ON(&ptdev->base, !timer_rate)) {
+		/* We couldn't get a valid clock rate, let's just pick the
+		 * maximum value so the FW still handles the core
+		 * power on/off requests.
+		 */
+		return GLB_TIMER_VAL(0x7fffffff) |
+		       GLB_TIMER_SOURCE_GPU_COUNTER;
+	}
+
+	cycles = DIV_ROUND_UP_ULL((u64)timeout_us * timer_rate, 1000000);
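+	/* The timer value is expressed in units of 1024 cycles, hence the
+	 * right shift by 10.
+	 */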
+	return GLB_TIMER_VAL(cycles >> 10) |
+	       (use_cycle_counter ? GLB_TIMER_SOURCE_GPU_COUNTER : 0);
+}
+
+static int panthor_fw_binary_iter_read(struct panthor_device *ptdev,
+				       struct panthor_fw_binary_iter *iter,
+				       void *out, size_t size)
+{
+	size_t new_offset = iter->offset + size;
+
+	if (new_offset > iter->size || new_offset < iter->offset) {
+		drm_err(&ptdev->base, "Firmware too small\n");
+		return -EINVAL;
+	}
+
+	memcpy(out, iter->data + iter->offset, size);
+	iter->offset = new_offset;
+	return 0;
+}
+
+static void panthor_fw_init_section_mem(struct panthor_device *ptdev,
+					struct panthor_fw_section *section)
+{
+	bool was_mapped = !!section->mem->kmap;
+	void *kmap;
+
+	if (!section->data.size &&
+	    !(section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO))
+		return;
+
+	kmap = panthor_fw_mem_vmap(section->mem);
+	if (drm_WARN_ON(&ptdev->base, !kmap))
+		return;
+
+	memcpy(kmap, section->data.buf, section->data.size);
+	if (section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO) {
+		memset(kmap + section->data.size, 0,
+		       section->mem->bo->base.base.size - section->data.size);
+	}
+
+	if (!was_mapped)
+		panthor_fw_mem_vunmap(section->mem);
+}
+
+/**
+ * panthor_fw_mem_va() - Get the MCU address of a FW memory object.
+ * @mem: FW memory object.
+ *
+ * Return: The MCU virtual address of the FW memory object.
+ */
+u64 panthor_fw_mem_va(struct panthor_fw_mem *mem)
+{
+	return mem->va;
+}
+
+/**
+ * panthor_fw_mem_vunmap() - Kill kernel space mapping of a FW memory object
+ * @mem: FW memory object.
+ */
+void panthor_fw_mem_vunmap(struct panthor_fw_mem *mem)
+{
+	if (mem->kmap) {
+		struct iosys_map map = IOSYS_MAP_INIT_VADDR(mem->kmap);
+
+		drm_gem_vunmap_unlocked(&mem->bo->base.base, &map);
+		mem->kmap = NULL;
+	}
+}
+
+/**
+ * panthor_fw_mem_vmap() - Map a FW memory object in kernel space
+ * @mem: FW memory object.
+ *
+ * Return: a non-NULL pointer on success, NULL otherwise.
+ */
+void *panthor_fw_mem_vmap(struct panthor_fw_mem *mem)
+{
+	if (!mem->kmap) {
+		struct iosys_map map;
+		int ret;
+
+		ret = drm_gem_vmap_unlocked(&mem->bo->base.base, &map);
+		if (ret)
+			return NULL;
+
+		mem->kmap = map.vaddr;
+	}
+
+	return mem->kmap;
+}
+
+/**
+ * panthor_fw_mem_free() - Free a FW memory object.
+ * @ptdev: Device.
+ * @mem: FW memory object to free.
+ */
+void panthor_fw_mem_free(struct panthor_device *ptdev, struct panthor_fw_mem *mem)
+{
+	if (IS_ERR_OR_NULL(mem))
+		return;
+
+	if (mem->bo)
+		panthor_gem_unmap_and_put(ptdev->fw->vm, mem->bo, mem->va, mem->kmap);
+
+	kfree(mem);
+}
+
+/**
+ * panthor_fw_mem_alloc() - Allocate a FW memory object and map it to the MCU VM.
+ * @ptdev: Device.
+ * @size: Size of the memory block.
+ * @bo_flags: BO flags.
+ * @vm_map_flags: VM_MAP flags.
+ * @va: Virtual address of the MCU mapping.
+ * Set to PANTHOR_GEM_ALLOC_VA for automatic VA-assignment. In that case, the
+ * VA will be allocated in the shared VA space.
+ *
+ * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
+ */
+static struct panthor_fw_mem *
+panthor_fw_mem_alloc(struct panthor_device *ptdev, size_t size,
+		     u32 bo_flags, u32 vm_map_flags, u64 va)
+{
+	struct panthor_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	int ret;
+
+	if (!mem)
+		return ERR_PTR(-ENOMEM);
+
+	mem->bo = panthor_gem_create_and_map(ptdev, ptdev->fw->vm,
+					     size, bo_flags, vm_map_flags,
+					     &va, NULL);
+	if (IS_ERR(mem->bo)) {
+		ret = PTR_ERR(mem->bo);
+		mem->bo = NULL;
+		goto err_free_mem;
+	}
+
+	mem->va = va;
+	return mem;
+
+err_free_mem:
+	panthor_fw_mem_free(ptdev, mem);
+	return ERR_PTR(ret);
+}
+
+/**
+ * panthor_fw_alloc_queue_iface_mem() - Allocate the ring-buffer interfaces of a queue.
+ * @ptdev: Device.
+ * @input: Pointer holding the input interface on success.
+ * Should be ignored on failure.
+ * @output: Pointer holding the output interface on success.
+ * Should be ignored on failure.
+ *
+ * Allocates panthor_fw_ringbuf_{input,output}_iface interfaces. The input
+ * interface is at offset 0, and the output interface at offset 4096.
+ *
+ * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
+ */
+struct panthor_fw_mem *
+panthor_fw_alloc_queue_iface_mem(struct panthor_device *ptdev,
+				 struct panthor_fw_ringbuf_input_iface **input,
+				 const struct panthor_fw_ringbuf_output_iface **output)
+{
+	struct panthor_fw_mem *mem;
+	void *kmap;
+
+	mem = panthor_fw_mem_alloc(ptdev, 8192,
+				   DRM_PANTHOR_BO_NO_MMAP,
+				   DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
+				   DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+				   PANTHOR_GEM_ALLOC_VA);
+	if (IS_ERR(mem))
+		return mem;
+
+	kmap = panthor_fw_mem_vmap(mem);
+	if (!kmap) {
+		panthor_fw_mem_free(ptdev, mem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	memset(kmap, 0, mem->bo->base.base.size);
+	*input = kmap;
+	*output = kmap + 4096;
+	return mem;
+}
+
+/**
+ * panthor_fw_alloc_suspend_buf_mem() - Allocate a suspend buffer for a command stream group.
+ * @ptdev: Device.
+ * @size: Size of the suspend buffer.
+ *
+ * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
+ */
+struct panthor_fw_mem *
+panthor_fw_alloc_suspend_buf_mem(struct panthor_device *ptdev, size_t size)
+{
+	return panthor_fw_mem_alloc(ptdev, size,
+				    DRM_PANTHOR_BO_NO_MMAP,
+				    DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
+				    PANTHOR_GEM_ALLOC_VA);
+}
+
+static int panthor_fw_load_section_entry(struct panthor_device *ptdev,
+					 const struct firmware *fw,
+					 struct panthor_fw_binary_iter *iter,
+					 u32 ehdr)
+{
+	struct panthor_fw_binary_section_entry_hdr hdr;
+	struct panthor_fw_section *section;
+	u32 section_size;
+	u32 name_len;
+	int ret;
+
+	ret = panthor_fw_binary_iter_read(ptdev, iter, &hdr, sizeof(hdr));
+	if (ret)
+		return ret;
+
+	if (hdr.data.end < hdr.data.start) {
+		drm_err(&ptdev->base, "Firmware corrupted, data.end < data.start (0x%x < 0x%x)\n",
+			hdr.data.end, hdr.data.start);
+		return -EINVAL;
+	}
+
+	if (hdr.va.end < hdr.va.start) {
+		drm_err(&ptdev->base, "Firmware corrupted, hdr.va.end < hdr.va.start (0x%x < 0x%x)\n",
+			hdr.va.end, hdr.va.start);
+		return -EINVAL;
+	}
+
+	if (hdr.data.end > fw->size) {
+		drm_err(&ptdev->base, "Firmware corrupted, file truncated? data_end=0x%x > fw size=0x%zx\n",
+			hdr.data.end, fw->size);
+		return -EINVAL;
+	}
+
+	if ((hdr.va.start & ~PAGE_MASK) != 0 ||
+	    (hdr.va.end & ~PAGE_MASK) != 0) {
+		drm_err(&ptdev->base, "Firmware corrupted, virtual addresses not page aligned: 0x%x-0x%x\n",
+			hdr.va.start, hdr.va.end);
+		return -EINVAL;
+	}
+
+	if (hdr.flags & ~CSF_FW_BINARY_IFACE_ENTRY_RD_SUPPORTED_FLAGS) {
+		drm_err(&ptdev->base, "Firmware contains interface with unsupported flags (0x%x)\n",
+			hdr.flags);
+		return -EINVAL;
+	}
+
+	if (hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_PROT) {
+		drm_warn(&ptdev->base,
+			 "Firmware protected mode entry not supported, ignoring");
+		return 0;
+	}
+
+	if (hdr.va.start == CSF_MCU_SHARED_REGION_START &&
+	    !(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED)) {
+		drm_err(&ptdev->base,
+			"Interface at 0x%llx must be shared", CSF_MCU_SHARED_REGION_START);
+		return -EINVAL;
+	}
+
+	name_len = iter->size - iter->offset;
+
+	section = drmm_kzalloc(&ptdev->base, sizeof(*section), GFP_KERNEL);
+	if (!section)
+		return -ENOMEM;
+
+	list_add_tail(&section->node, &ptdev->fw->sections);
+	section->flags = hdr.flags;
+	section->data.size = hdr.data.end - hdr.data.start;
+
+	if (section->data.size > 0) {
+		void *data = drmm_kmalloc(&ptdev->base, section->data.size, GFP_KERNEL);
+
+		if (!data)
+			return -ENOMEM;
+
+		memcpy(data, fw->data + hdr.data.start, section->data.size);
+		section->data.buf = data;
+	}
+
+	if (name_len > 0) {
+		char *name = drmm_kmalloc(&ptdev->base, name_len + 1, GFP_KERNEL);
+
+		if (!name)
+			return -ENOMEM;
+
+		memcpy(name, iter->data + iter->offset, name_len);
+		name[name_len] = '\0';
+		section->name = name;
+	}
+
+	section_size = hdr.va.end - hdr.va.start;
+	if (section_size) {
+		u32 cache_mode = hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK;
+		u32 vm_map_flags = 0;
+		struct sg_table *sgt;
+		u64 va = hdr.va.start;
+
+		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
+			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_READONLY;
+
+		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_EX))
+			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC;
+
+		/* TODO: CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_*_COHERENT are mapped to
+		 * non-cacheable for now. We might want to introduce a new
+		 * IOMMU_xxx flag (or abuse IOMMU_MMIO, which maps to device
+		 * memory and is currently not used by our driver) for
+		 * AS_MEMATTR_AARCH64_SHARED memory, so we can take benefit
+		 * of IO-coherent systems.
+		 */
+		if (cache_mode != CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED)
+			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED;
+
+		/* Shared section is in the auto-VA range. We need to
+		 * reserve the VA range so it's not allocated to someone else.
+		 */
+		if (va >= CSF_MCU_SHARED_REGION_START &&
+		    va < CSF_MCU_SHARED_REGION_START + CSF_MCU_SHARED_REGION_SIZE)
+			va = PANTHOR_GEM_ALLOC_VA;
+
+		section->mem = panthor_fw_mem_alloc(ptdev, section_size,
+						    DRM_PANTHOR_BO_NO_MMAP,
+						    vm_map_flags, va);
+		if (IS_ERR(section->mem))
+			return PTR_ERR(section->mem);
+
+		if (drm_WARN_ON(&ptdev->base, section->mem->va != hdr.va.start))
+			return -EINVAL;
+
+		panthor_fw_init_section_mem(ptdev, section);
+
+		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
+		if (IS_ERR(sgt))
+			return PTR_ERR(sgt);
+
+		dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
+
+		if (section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED) {
+			if (!panthor_fw_mem_vmap(section->mem))
+				return -ENOMEM;
+		}
+	}
+
+	if (hdr.va.start == CSF_MCU_SHARED_REGION_START)
+		ptdev->fw->shared_section = section;
+
+	return 0;
+}
+
+static void
+panthor_reload_fw_sections(struct panthor_device *ptdev, bool full_reload)
+{
+	struct panthor_fw_section *section;
+
+	list_for_each_entry(section, &ptdev->fw->sections, node) {
+		struct sg_table *sgt;
+
+		if (!full_reload && !(section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
+			continue;
+
+		panthor_fw_init_section_mem(ptdev, section);
+		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
+		if (!drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(sgt)))
+			dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
+	}
+}
+
+static int panthor_fw_load_entry(struct panthor_device *ptdev,
+				 const struct firmware *fw,
+				 struct panthor_fw_binary_iter *iter)
+{
+	struct panthor_fw_binary_iter eiter;
+	u32 ehdr;
+	int ret;
+
+	ret = panthor_fw_binary_iter_read(ptdev, iter, &ehdr, sizeof(ehdr));
+	if (ret)
+		return ret;
+
+	if ((iter->offset % sizeof(u32)) ||
+	    (CSF_FW_BINARY_ENTRY_SIZE(ehdr) % sizeof(u32))) {
+		drm_err(&ptdev->base, "Firmware entry isn't 32 bit aligned, offset=0x%x size=0x%x\n",
+			(u32)(iter->offset - sizeof(u32)), CSF_FW_BINARY_ENTRY_SIZE(ehdr));
+		return -EINVAL;
+	}
+
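+	/* Build a sub-iterator covering only this entry's payload (the
+	 * entry header is not included in eiter.size).
+	 */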
+	eiter.offset = 0;
+	eiter.data = iter->data + iter->offset;
+	eiter.size = CSF_FW_BINARY_ENTRY_SIZE(ehdr) - sizeof(ehdr);
+	iter->offset += eiter.size;
+
+	switch (CSF_FW_BINARY_ENTRY_TYPE(ehdr)) {
+	case CSF_FW_BINARY_ENTRY_TYPE_IFACE:
+		return panthor_fw_load_section_entry(ptdev, fw, &eiter, ehdr);
+
+	/* FIXME: handle those entry types? */
+	case CSF_FW_BINARY_ENTRY_TYPE_CONFIG:
+	case CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST:
+	case CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER:
+	case CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA:
+		return 0;
+	default:
+		break;
+	}
+
+	if (ehdr & CSF_FW_BINARY_ENTRY_OPTIONAL)
+		return 0;
+
+	drm_err(&ptdev->base,
+		"Unsupported non-optional entry type %u in firmware\n",
+		CSF_FW_BINARY_ENTRY_TYPE(ehdr));
+	return -EINVAL;
+}
+
+static int panthor_fw_load(struct panthor_device *ptdev)
+{
+	const struct firmware *fw = NULL;
+	struct panthor_fw_binary_iter iter = {};
+	struct panthor_fw_binary_hdr hdr;
+	int ret;
+
+	ret = request_firmware(&fw, CSF_FW_NAME, ptdev->base.dev);
+	if (ret) {
+		drm_err(&ptdev->base, "Failed to load firmware image '%s'\n",
+			CSF_FW_NAME);
+		return ret;
+	}
+
+	iter.data = fw->data;
+	iter.size = fw->size;
+	ret = panthor_fw_binary_iter_read(ptdev, &iter, &hdr, sizeof(hdr));
+	if (ret)
+		goto out;
+
+	if (hdr.magic != CSF_FW_BINARY_HEADER_MAGIC) {
+		ret = -EINVAL;
+		drm_err(&ptdev->base, "Invalid firmware magic\n");
+		goto out;
+	}
+
+	if (hdr.major != CSF_FW_BINARY_HEADER_MAJOR_MAX) {
+		ret = -EINVAL;
+		drm_err(&ptdev->base, "Unsupported firmware binary header version %d.%d (expected %d.x)\n",
+			hdr.major, hdr.minor, CSF_FW_BINARY_HEADER_MAJOR_MAX);
+		goto out;
+	}
+
+	if (hdr.size > iter.size) {
+		drm_err(&ptdev->base, "Firmware image is truncated\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	iter.size = hdr.size;
+
+	while (iter.offset < hdr.size) {
+		ret = panthor_fw_load_entry(ptdev, fw, &iter);
+		if (ret)
+			goto out;
+	}
+
+	if (!ptdev->fw->shared_section) {
+		drm_err(&ptdev->base, "Shared interface region not found\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	release_firmware(fw);
+	return ret;
+}
+
+/**
+ * iface_fw_to_cpu_addr() - Turn an MCU address into a CPU address
+ * @ptdev: Device.
+ * @mcu_va: MCU address.
+ *
+ * Return: NULL if the address is not part of the shared section, non-NULL otherwise.
+ */
+static void *iface_fw_to_cpu_addr(struct panthor_device *ptdev, u32 mcu_va)
+{
+	u64 shared_mem_start = ptdev->fw->shared_section->mem->va;
+	u64 shared_mem_end = ptdev->fw->shared_section->mem->va +
+			     ptdev->fw->shared_section->mem->bo->base.base.size;
+
+	if (mcu_va < shared_mem_start || mcu_va >= shared_mem_end)
+		return NULL;
+
+	return ptdev->fw->shared_section->mem->kmap + (mcu_va - shared_mem_start);
+}
+
+static int panthor_init_cs_iface(struct panthor_device *ptdev,
+				 unsigned int csg_idx, unsigned int cs_idx)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, csg_idx);
+	struct panthor_fw_cs_iface *cs_iface = &ptdev->fw->iface.streams[csg_idx][cs_idx];
+	u64 shared_section_sz = ptdev->fw->shared_section->mem->bo->base.base.size;
+	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET +
+			   (csg_idx * glb_iface->control->group_stride) +
+			   CSF_STREAM_CONTROL_OFFSET +
+			   (cs_idx * csg_iface->control->stream_stride);
+
+	if (iface_offset + sizeof(*cs_iface) >= shared_section_sz)
+		return -EINVAL;
+
+	spin_lock_init(&cs_iface->lock);
+	cs_iface->control = ptdev->fw->shared_section->mem->kmap + iface_offset;
+	cs_iface->input = iface_fw_to_cpu_addr(ptdev, cs_iface->control->input_va);
+	cs_iface->output = iface_fw_to_cpu_addr(ptdev, cs_iface->control->output_va);
+
+	if (!cs_iface->input || !cs_iface->output) {
+		drm_err(&ptdev->base, "Invalid stream control interface input/output VA");
+		return -EINVAL;
+	}
+
+	if (csg_idx > 0 || cs_idx > 0) {
+		struct panthor_fw_cs_iface *first_cs_iface =
+			panthor_fw_get_cs_iface(ptdev, 0, 0);
+
+		if (cs_iface->control->features != first_cs_iface->control->features) {
+			drm_err(&ptdev->base, "Expecting identical CS slots");
+			return -EINVAL;
+		}
+	} else {
+		u32 reg_count = CS_FEATURES_WORK_REGS(cs_iface->control->features);
+
+		ptdev->csif_info.cs_reg_count = reg_count;
+		ptdev->csif_info.unpreserved_cs_reg_count = CSF_UNPRESERVED_REG_COUNT;
+	}
+
+	return 0;
+}
+
+static int panthor_init_csg_iface(struct panthor_device *ptdev,
+				  unsigned int csg_idx)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	struct panthor_fw_csg_iface *csg_iface = &ptdev->fw->iface.groups[csg_idx];
+	u64 shared_section_sz = ptdev->fw->shared_section->mem->bo->base.base.size;
+	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET + (csg_idx * glb_iface->control->group_stride);
+	unsigned int i;
+
+	if (iface_offset + sizeof(*csg_iface) >= shared_section_sz)
+		return -EINVAL;
+
+	spin_lock_init(&csg_iface->lock);
+	csg_iface->control = ptdev->fw->shared_section->mem->kmap + iface_offset;
+	csg_iface->input = iface_fw_to_cpu_addr(ptdev, csg_iface->control->input_va);
+	csg_iface->output = iface_fw_to_cpu_addr(ptdev, csg_iface->control->output_va);
+
+	if (csg_iface->control->stream_num < MIN_CS_PER_CSG ||
+	    csg_iface->control->stream_num > MAX_CS_PER_CSG)
+		return -EINVAL;
+
+	if (!csg_iface->input || !csg_iface->output) {
+		drm_err(&ptdev->base, "Invalid group control interface input/output VA");
+		return -EINVAL;
+	}
+
+	if (csg_idx > 0) {
+		struct panthor_fw_csg_iface *first_csg_iface =
+			panthor_fw_get_csg_iface(ptdev, 0);
+		u32 first_protm_suspend_size = first_csg_iface->control->protm_suspend_size;
+
+		if (first_csg_iface->control->features != csg_iface->control->features ||
+		    first_csg_iface->control->suspend_size != csg_iface->control->suspend_size ||
+		    first_protm_suspend_size != csg_iface->control->protm_suspend_size ||
+		    first_csg_iface->control->stream_num != csg_iface->control->stream_num) {
+			drm_err(&ptdev->base, "Expecting identical CSG slots");
+			return -EINVAL;
+		}
+	}
+
+	for (i = 0; i < csg_iface->control->stream_num; i++) {
+		int ret = panthor_init_cs_iface(ptdev, csg_idx, i);
+
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static u32 panthor_get_instr_features(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
+	if (glb_iface->control->version < CSF_IFACE_VERSION(1, 1, 0))
+		return 0;
+
+	return glb_iface->control->instr_features;
+}
+
+static int panthor_fw_init_ifaces(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = &ptdev->fw->iface.global;
+	unsigned int i;
+
+	if (!ptdev->fw->shared_section->mem->kmap)
+		return -EINVAL;
+
+	spin_lock_init(&glb_iface->lock);
+	glb_iface->control = ptdev->fw->shared_section->mem->kmap;
+
+	if (!glb_iface->control->version) {
+		drm_err(&ptdev->base, "Invalid CSF interface version %d.%d.%d (%x)",
+			CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
+			CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
+			CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
+			glb_iface->control->version);
+		return -EINVAL;
+	}
+
+	glb_iface->input = iface_fw_to_cpu_addr(ptdev, glb_iface->control->input_va);
+	glb_iface->output = iface_fw_to_cpu_addr(ptdev, glb_iface->control->output_va);
+	if (!glb_iface->input || !glb_iface->output) {
+		drm_err(&ptdev->base, "Invalid global control interface input/output VA");
+		return -EINVAL;
+	}
+
+	if (glb_iface->control->group_num > MAX_CSGS ||
+	    glb_iface->control->group_num < MIN_CSGS) {
+		drm_err(&ptdev->base, "Invalid number of control groups");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < glb_iface->control->group_num; i++) {
+		int ret = panthor_init_csg_iface(ptdev, i);
+
+		if (ret)
+			return ret;
+	}
+
+	drm_info(&ptdev->base, "CSF FW v%d.%d.%d, Features %x Instrumentation features %x",
+		 CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
+		 CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
+		 CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
+		 glb_iface->control->features,
+		 panthor_get_instr_features(ptdev));
+	return 0;
+}
+
+static void panthor_fw_init_global_iface(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
+	/* Enable all cores. */
+	glb_iface->input->core_en_mask = ptdev->gpu_info.shader_present;
+
+	/* Setup timers. */
+	glb_iface->input->poweroff_timer = panthor_fw_conv_timeout(ptdev, PWROFF_HYSTERESIS_US);
+	glb_iface->input->progress_timer = PROGRESS_TIMEOUT_CYCLES >> PROGRESS_TIMEOUT_SCALE_SHIFT;
+	glb_iface->input->idle_timer = panthor_fw_conv_timeout(ptdev, IDLE_HYSTERESIS_US);
+
+	/* Enable interrupts we care about. */
+	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
+					 GLB_PING |
+					 GLB_CFG_PROGRESS_TIMER |
+					 GLB_CFG_POWEROFF_TIMER |
+					 GLB_IDLE_EN |
+					 GLB_IDLE;
+
+	panthor_fw_update_reqs(glb_iface, req, GLB_IDLE_EN, GLB_IDLE_EN);
+	panthor_fw_toggle_reqs(glb_iface, req, ack,
+			       GLB_CFG_ALLOC_EN |
+			       GLB_CFG_POWEROFF_TIMER |
+			       GLB_CFG_PROGRESS_TIMER);
+
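+	/* Ring the global doorbell so the FW re-evaluates the updated req bits. */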
+	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+
+	/* Kick the watchdog. */
+	mod_delayed_work(ptdev->reset.wq, &ptdev->fw->watchdog.ping_work,
+			 msecs_to_jiffies(PING_INTERVAL_MS));
+}
+
+static void panthor_fw_process_global_irq(struct panthor_device *ptdev)
+{
+	/* If the FW is not booted, don't process IRQs, just flag the FW as booted. */
+	if (!ptdev->fw->booted)
+		ptdev->fw->booted = true;
+	else
+		panthor_sched_process_global_irq(ptdev);
+
+	wake_up_all(&ptdev->fw->waitqueues[31]);
+}
+
+static void panthor_fw_process_csg_irq(struct panthor_device *ptdev, u32 csg_slot)
+{
+	panthor_sched_process_csg_irq(ptdev, csg_slot);
+	wake_up_all(&ptdev->fw->waitqueues[csg_slot]);
+}
+
+static void panthor_job_irq_handler(struct panthor_device *ptdev, u32 status)
+{
+	if (status & JOB_INT_GLOBAL_IF) {
+		panthor_fw_process_global_irq(ptdev);
+		status &= ~JOB_INT_GLOBAL_IF;
+	}
+
+	while (status) {
+		u32 csg_id = ffs(status) - 1;
+
+		panthor_fw_process_csg_irq(ptdev, csg_id);
+		status &= ~BIT(csg_id);
+	}
+}
+PANTHOR_IRQ_HANDLER(job, JOB, panthor_job_irq_handler);
+
+static int panthor_fw_start(struct panthor_device *ptdev)
+{
+	bool timedout = false;
+
+	ptdev->fw->booted = false;
+	panthor_job_irq_resume(&ptdev->fw->irq, ~0);
+	gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_AUTO);
+
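+	/*
+	 * Boot completion is signaled through the global interface IRQ: the
+	 * handler sets fw->booted and wakes up the global waitqueue.
+	 */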
+	if (!wait_event_timeout(ptdev->fw->waitqueues[31],
+				ptdev->fw->booted,
+				msecs_to_jiffies(1000))) {
+		if (!ptdev->fw->booted &&
+		    !(gpu_read(ptdev, JOB_INT_STAT) & JOB_INT_GLOBAL_IF))
+			timedout = true;
+	}
+
+	if (timedout) {
+		drm_err(&ptdev->base, "Failed to boot MCU");
+		return -ETIMEDOUT;
+	}
+
+	return 0;
+}
+
+static void panthor_fw_stop(struct panthor_device *ptdev)
+{
+	u32 status;
+
+	gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_DISABLE);
+	if (readl_poll_timeout(ptdev->iomem + MCU_CONTROL, status,
+			       status == MCU_CONTROL_DISABLE, 10, 100000))
+		drm_err(&ptdev->base, "Failed to stop MCU");
+}
+
+/**
+ * panthor_fw_pre_reset() - Call before a reset.
+ * @ptdev: Device.
+ * @on_hang: true if the reset was triggered on a GPU hang.
+ *
+ * If the reset is not triggered on a hang, we try to gracefully halt the
+ * MCU, so we can do a fast-reset when panthor_fw_post_reset() is called.
+ */
+void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
+{
+	/* Make sure we won't be woken up by a ping. */
+	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
+
+	ptdev->fw->fast_reset = false;
+
+	if (!on_hang) {
+		struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+		u32 status;
+
+		panthor_fw_update_reqs(glb_iface, req, GLB_HALT, GLB_HALT);
+		gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+		if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
+					status == MCU_STATUS_HALT, 10, 100000) &&
+		    glb_iface->output->halt_status == PANTHOR_FW_HALT_OK) {
+			ptdev->fw->fast_reset = true;
+		} else {
+			drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
+		}
+
+		/* The FW detects 0 -> 1 transitions. Make sure we reset
+		 * the HALT bit before the FW is rebooted.
+		 */
+		panthor_fw_update_reqs(glb_iface, req, 0, GLB_HALT);
+	}
+
+	panthor_job_irq_suspend(&ptdev->fw->irq);
+}
+
+/**
+ * panthor_fw_post_reset() - Call after a reset.
+ * @ptdev: Device.
+ *
+ * Start the FW. If this is not a fast reset, all FW sections are reloaded to
+ * make sure we can recover from a memory corruption.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_fw_post_reset(struct panthor_device *ptdev)
+{
+	int ret;
+
+	/* Make the MCU VM active. */
+	ret = panthor_vm_active(ptdev->fw->vm);
+	if (ret)
+		return ret;
+
+	/* Reload all sections, including RO ones. We're not supposed
+	 * to end up here anyway, let's just assume the overhead of
+	 * reloading everything is acceptable.
+	 */
+	if (!ptdev->fw->fast_reset)
+		panthor_reload_fw_sections(ptdev, true);
+
+	ret = panthor_fw_start(ptdev);
+	if (ret)
+		return ret;
+
+	/* We must re-initialize the global interface even on fast-reset. */
+	panthor_fw_init_global_iface(ptdev);
+	return 0;
+}
+
+/**
+ * panthor_fw_unplug() - Called when the device is unplugged.
+ * @ptdev: Device.
+ *
+ * This function must make sure all pending operations are flushed before
+ * it releases the device resources, thus preventing any further interaction
+ * with the HW.
+ *
+ * If there are still FW-related works running after this function returns,
+ * they must use drm_dev_{enter,exit}() and skip any HW access when
+ * drm_dev_enter() returns false.
+ */
+void panthor_fw_unplug(struct panthor_device *ptdev)
+{
+	struct panthor_fw_section *section;
+
+	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
+
+	/* Make sure the IRQ handler cannot be called after that point. */
+	if (ptdev->fw->irq.irq)
+		panthor_job_irq_suspend(&ptdev->fw->irq);
+
+	panthor_fw_stop(ptdev);
+
+	if (ptdev->fw->vm)
+		panthor_vm_idle(ptdev->fw->vm);
+
+	list_for_each_entry(section, &ptdev->fw->sections, node) {
+		panthor_fw_mem_free(ptdev, section->mem);
+	}
+
+	panthor_vm_put(ptdev->fw->vm);
+
+	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
+}
+
+/**
+ * panthor_fw_wait_acks() - Wait for requests to be acknowledged by the FW.
+ * @req_ptr: Pointer to the req register.
+ * @ack_ptr: Pointer to the ack register.
+ * @wq: Wait queue to use for the sleeping wait.
+ * @req_mask: Mask of requests to wait for.
+ * @acked: Pointer to field that's updated with the acked requests.
+ * If the function returns 0, *acked == req_mask.
+ * @timeout_ms: Timeout expressed in milliseconds.
+ *
+ * Return: 0 on success, -ETIMEDOUT otherwise.
+ */
+static int panthor_fw_wait_acks(const u32 *req_ptr, const u32 *ack_ptr,
+				wait_queue_head_t *wq,
+				u32 req_mask, u32 *acked,
+				u32 timeout_ms)
+{
+	u32 ack, req = READ_ONCE(*req_ptr) & req_mask;
+	int ret;
+
+	/* Busy wait for a few µsecs before falling back to a sleeping wait. */
+	*acked = req_mask;
+	ret = read_poll_timeout_atomic(READ_ONCE, ack,
+				       (ack & req_mask) == req,
+				       0, 10, 0,
+				       *ack_ptr);
+	if (!ret)
+		return 0;
+
+	if (wait_event_timeout(*wq, (READ_ONCE(*ack_ptr) & req_mask) == req,
+			       msecs_to_jiffies(timeout_ms)))
+		return 0;
+
+	/* Check one last time, in case we were not woken up for some reason. */
+	ack = READ_ONCE(*ack_ptr);
+	if ((ack & req_mask) == req)
+		return 0;
+
+	*acked = ~(req ^ ack) & req_mask;
+	return -ETIMEDOUT;
+}
+
+/**
+ * panthor_fw_glb_wait_acks() - Wait for global requests to be acknowledged.
+ * @ptdev: Device.
+ * @req_mask: Mask of requests to wait for.
+ * @acked: Pointer to field that's updated with the acked requests.
+ * If the function returns 0, *acked == req_mask.
+ * @timeout_ms: Timeout expressed in milliseconds.
+ *
+ * Return: 0 on success, -ETIMEDOUT otherwise.
+ */
+int panthor_fw_glb_wait_acks(struct panthor_device *ptdev,
+			     u32 req_mask, u32 *acked,
+			     u32 timeout_ms)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
+	/* GLB_HALT doesn't get acked through the FW interface. */
+	if (drm_WARN_ON(&ptdev->base, req_mask & (~GLB_REQ_MASK | GLB_HALT)))
+		return -EINVAL;
+
+	return panthor_fw_wait_acks(&glb_iface->input->req,
+				    &glb_iface->output->ack,
+				    &ptdev->fw->waitqueues[31],
+				    req_mask, acked, timeout_ms);
+}
+
+/**
+ * panthor_fw_csg_wait_acks() - Wait for command stream group requests to be acknowledged.
+ * @ptdev: Device.
+ * @csg_slot: Command stream group slot.
+ * @req_mask: Mask of requests to wait for.
+ * @acked: Pointer to field that's updated with the acked requests.
+ * If the function returns 0, *acked == req_mask.
+ * @timeout_ms: Timeout expressed in milliseconds.
+ *
+ * Return: 0 on success, -ETIMEDOUT otherwise.
+ */
+int panthor_fw_csg_wait_acks(struct panthor_device *ptdev, u32 csg_slot,
+			     u32 req_mask, u32 *acked, u32 timeout_ms)
+{
+	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, csg_slot);
+	int ret;
+
+	if (drm_WARN_ON(&ptdev->base, req_mask & ~CSG_REQ_MASK))
+		return -EINVAL;
+
+	ret = panthor_fw_wait_acks(&csg_iface->input->req,
+				   &csg_iface->output->ack,
+				   &ptdev->fw->waitqueues[csg_slot],
+				   req_mask, acked, timeout_ms);
+
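+	/*
+	 * CSG_STATE is a multi-bit field: only report it as acked if all its
+	 * bits were acknowledged, otherwise clear it from the acked mask.
+	 */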
+	if (ret && (*acked & CSG_STATE_MASK) != CSG_STATE_MASK)
+		*acked &= ~CSG_STATE_MASK;
+
+	return ret;
+}
+
+/**
+ * panthor_fw_ring_csg_doorbells() - Ring command stream group doorbells.
+ * @ptdev: Device.
+ * @csg_mask: Bitmask encoding the command stream group doorbells to ring.
+ *
+ * This function toggles bits in doorbell_req and rings the
+ * global doorbell. It doesn't require a user doorbell to be attached to
+ * the group.
+ */
+void panthor_fw_ring_csg_doorbells(struct panthor_device *ptdev, u32 csg_mask)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
+	panthor_fw_toggle_reqs(glb_iface, doorbell_req, doorbell_ack, csg_mask);
+	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+}
+
+static void panthor_fw_ping_work(struct work_struct *work)
+{
+	struct panthor_fw *fw = container_of(work, struct panthor_fw, watchdog.ping_work.work);
+	struct panthor_device *ptdev = fw->irq.ptdev;
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	u32 acked;
+	int ret;
+
+	if (panthor_device_reset_is_pending(ptdev))
+		return;
+
+	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PING);
+	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+
+	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PING, &acked, 100);
+	if (ret) {
+		panthor_device_schedule_reset(ptdev);
+		drm_err(&ptdev->base, "FW ping timeout, scheduling a reset");
+	} else {
+		mod_delayed_work(ptdev->reset.wq, &fw->watchdog.ping_work,
+				 msecs_to_jiffies(PING_INTERVAL_MS));
+	}
+}
+
+/**
+ * panthor_fw_init() - Initialize FW related data.
+ * @ptdev: Device.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_fw_init(struct panthor_device *ptdev)
+{
+	struct panthor_fw *fw;
+	int ret, irq;
+
+	fw = drmm_kzalloc(&ptdev->base, sizeof(*fw), GFP_KERNEL);
+	if (!fw)
+		return -ENOMEM;
+
+	ptdev->fw = fw;
+	for (u32 i = 0; i < ARRAY_SIZE(fw->waitqueues); i++)
+		init_waitqueue_head(&fw->waitqueues[i]);
+
+	INIT_LIST_HEAD(&fw->sections);
+	INIT_DELAYED_WORK(&fw->watchdog.ping_work, panthor_fw_ping_work);
+
+	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "job");
+	if (irq <= 0)
+		return -ENODEV;
+
+	ret = panthor_request_job_irq(ptdev, &fw->irq, irq, 0);
+	if (ret) {
+		drm_err(&ptdev->base, "failed to request job irq");
+		return ret;
+	}
+
+	ret = panthor_gpu_l2_power_on(ptdev);
+	if (ret)
+		return ret;
+
+	fw->vm = panthor_vm_create(ptdev, true,
+				   CSF_MCU_SHARED_REGION_START,
+				   CSF_MCU_SHARED_REGION_SIZE);
+	if (IS_ERR(fw->vm)) {
+		ret = PTR_ERR(fw->vm);
+		fw->vm = NULL;
+		goto err_unplug_fw;
+	}
+
+	ret = panthor_fw_load(ptdev);
+	if (ret)
+		goto err_unplug_fw;
+
+	ret = panthor_vm_active(fw->vm);
+	if (ret)
+		goto err_unplug_fw;
+
+	ret = panthor_fw_start(ptdev);
+	if (ret)
+		goto err_unplug_fw;
+
+	ret = panthor_fw_init_ifaces(ptdev);
+	if (ret)
+		goto err_unplug_fw;
+
+	panthor_fw_init_global_iface(ptdev);
+	return 0;
+
+err_unplug_fw:
+	panthor_fw_unplug(ptdev);
+	return ret;
+}
diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
new file mode 100644
index 000000000000..929760c2a46b
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_fw.h
@@ -0,0 +1,505 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANTHOR_MCU_H__
+#define __PANTHOR_MCU_H__
+
+#include <linux/types.h>
+
+#include "panthor_device.h"
+
+struct panthor_fw_mem;
+
+#define MAX_CSGS				31
+#define MAX_CS_PER_CSG				32
+
+struct panthor_fw_ringbuf_input_iface {
+	u64 insert;
+	u64 extract;
+} __packed;
+
+struct panthor_fw_ringbuf_output_iface {
+	u64 extract;
+	u32 active;
+} __packed;
+
+struct panthor_fw_cs_control_iface {
+#define CS_FEATURES_WORK_REGS(x)		(((x) & GENMASK(7, 0)) + 1)
+#define CS_FEATURES_SCOREBOARDS(x)		(((x) & GENMASK(15, 8)) >> 8)
+#define CS_FEATURES_COMPUTE			BIT(16)
+#define CS_FEATURES_FRAGMENT			BIT(17)
+#define CS_FEATURES_TILER			BIT(18)
+	u32 features;
+	u32 input_va;
+	u32 output_va;
+} __packed;
+
+struct panthor_fw_cs_input_iface {
+#define CS_STATE_MASK				GENMASK(2, 0)
+#define CS_STATE_STOP				0
+#define CS_STATE_START				1
+#define CS_EXTRACT_EVENT			BIT(4)
+#define CS_IDLE_SYNC_WAIT			BIT(8)
+#define CS_IDLE_PROTM_PENDING			BIT(9)
+#define CS_IDLE_EMPTY				BIT(10)
+#define CS_IDLE_RESOURCE_REQ			BIT(11)
+#define CS_TILER_OOM				BIT(26)
+#define CS_PROTM_PENDING			BIT(27)
+#define CS_FATAL				BIT(30)
+#define CS_FAULT				BIT(31)
+#define CS_REQ_MASK				(CS_STATE_MASK | \
+						 CS_EXTRACT_EVENT | \
+						 CS_IDLE_SYNC_WAIT | \
+						 CS_IDLE_PROTM_PENDING | \
+						 CS_IDLE_EMPTY | \
+						 CS_IDLE_RESOURCE_REQ)
+#define CS_EVT_MASK				(CS_TILER_OOM | \
+						 CS_PROTM_PENDING | \
+						 CS_FATAL | \
+						 CS_FAULT)
+	u32 req;
+
+#define CS_CONFIG_PRIORITY(x)			((x) & GENMASK(3, 0))
+#define CS_CONFIG_DOORBELL(x)			(((x) << 8) & GENMASK(15, 8))
+	u32 config;
+	u32 reserved1;
+	u32 ack_irq_mask;
+	u64 ringbuf_base;
+	u32 ringbuf_size;
+	u32 reserved2;
+	u64 heap_start;
+	u64 heap_end;
+	u64 ringbuf_input;
+	u64 ringbuf_output;
+	u32 instr_config;
+	u32 instrbuf_size;
+	u64 instrbuf_base;
+	u64 instrbuf_offset_ptr;
+} __packed;
+
+struct panthor_fw_cs_output_iface {
+	u32 ack;
+	u32 reserved1[15];
+	u64 status_cmd_ptr;
+
+#define CS_STATUS_WAIT_SB_MASK			GENMASK(15, 0)
+#define CS_STATUS_WAIT_SB_SRC_MASK		GENMASK(19, 16)
+#define CS_STATUS_WAIT_SB_SRC_NONE		(0 << 16)
+#define CS_STATUS_WAIT_SB_SRC_WAIT		(8 << 16)
+#define CS_STATUS_WAIT_SYNC_COND_LE		(0 << 24)
+#define CS_STATUS_WAIT_SYNC_COND_GT		(1 << 24)
+#define CS_STATUS_WAIT_SYNC_COND_MASK		GENMASK(27, 24)
+#define CS_STATUS_WAIT_PROGRESS			BIT(28)
+#define CS_STATUS_WAIT_PROTM			BIT(29)
+#define CS_STATUS_WAIT_SYNC_64B			BIT(30)
+#define CS_STATUS_WAIT_SYNC			BIT(31)
+	u32 status_wait;
+	u32 status_req_resource;
+	u64 status_wait_sync_ptr;
+	u32 status_wait_sync_value;
+	u32 status_scoreboards;
+
+#define CS_STATUS_BLOCKED_REASON_UNBLOCKED	0
+#define CS_STATUS_BLOCKED_REASON_SB_WAIT	1
+#define CS_STATUS_BLOCKED_REASON_PROGRESS_WAIT	2
+#define CS_STATUS_BLOCKED_REASON_SYNC_WAIT	3
+#define CS_STATUS_BLOCKED_REASON_DEFERRED	5
+#define CS_STATUS_BLOCKED_REASON_RES		6
+#define CS_STATUS_BLOCKED_REASON_FLUSH		7
+#define CS_STATUS_BLOCKED_REASON_MASK		GENMASK(3, 0)
+	u32 status_blocked_reason;
+	u32 status_wait_sync_value_hi;
+	u32 reserved2[6];
+
+#define CS_EXCEPTION_TYPE(x)			((x) & GENMASK(7, 0))
+#define CS_EXCEPTION_DATA(x)			(((x) >> 8) & GENMASK(23, 0))
+	u32 fault;
+	u32 fatal;
+	u64 fault_info;
+	u64 fatal_info;
+	u32 reserved3[10];
+	u32 heap_vt_start;
+	u32 heap_vt_end;
+	u32 reserved4;
+	u32 heap_frag_end;
+	u64 heap_address;
+} __packed;
+
+struct panthor_fw_csg_control_iface {
+	u32 features;
+	u32 input_va;
+	u32 output_va;
+	u32 suspend_size;
+	u32 protm_suspend_size;
+	u32 stream_num;
+	u32 stream_stride;
+} __packed;
+
+struct panthor_fw_csg_input_iface {
+#define CSG_STATE_MASK				GENMASK(2, 0)
+#define CSG_STATE_TERMINATE			0
+#define CSG_STATE_START				1
+#define CSG_STATE_SUSPEND			2
+#define CSG_STATE_RESUME			3
+#define CSG_ENDPOINT_CONFIG			BIT(4)
+#define CSG_STATUS_UPDATE			BIT(5)
+#define CSG_SYNC_UPDATE				BIT(28)
+#define CSG_IDLE				BIT(29)
+#define CSG_DOORBELL				BIT(30)
+#define CSG_PROGRESS_TIMER_EVENT		BIT(31)
+#define CSG_REQ_MASK				(CSG_STATE_MASK | \
+						 CSG_ENDPOINT_CONFIG | \
+						 CSG_STATUS_UPDATE)
+#define CSG_EVT_MASK				(CSG_SYNC_UPDATE | \
+						 CSG_IDLE | \
+						 CSG_PROGRESS_TIMER_EVENT)
+	u32 req;
+	u32 ack_irq_mask;
+
+	u32 doorbell_req;
+	u32 cs_irq_ack;
+	u32 reserved1[4];
+	u64 allow_compute;
+	u64 allow_fragment;
+	u32 allow_other;
+
+#define CSG_EP_REQ_COMPUTE(x)			((x) & GENMASK(7, 0))
+#define CSG_EP_REQ_FRAGMENT(x)			(((x) << 8) & GENMASK(15, 8))
+#define CSG_EP_REQ_TILER(x)			(((x) << 16) & GENMASK(19, 16))
+#define CSG_EP_REQ_EXCL_COMPUTE			BIT(20)
+#define CSG_EP_REQ_EXCL_FRAGMENT		BIT(21)
+#define CSG_EP_REQ_PRIORITY(x)			(((x) << 28) & GENMASK(31, 28))
+#define CSG_EP_REQ_PRIORITY_MASK		GENMASK(31, 28)
+	u32 endpoint_req;
+	u32 reserved2[2];
+	u64 suspend_buf;
+	u64 protm_suspend_buf;
+	u32 config;
+	u32 iter_trace_config;
+} __packed;
+
+struct panthor_fw_csg_output_iface {
+	u32 ack;
+	u32 reserved1;
+	u32 doorbell_ack;
+	u32 cs_irq_req;
+	u32 status_endpoint_current;
+	u32 status_endpoint_req;
+
+#define CSG_STATUS_STATE_IS_IDLE		BIT(0)
+	u32 status_state;
+	u32 resource_dep;
+} __packed;
+
+struct panthor_fw_global_control_iface {
+	u32 version;
+	u32 features;
+	u32 input_va;
+	u32 output_va;
+	u32 group_num;
+	u32 group_stride;
+	u32 perfcnt_size;
+	u32 instr_features;
+} __packed;
+
+struct panthor_fw_global_input_iface {
+#define GLB_HALT				BIT(0)
+#define GLB_CFG_PROGRESS_TIMER			BIT(1)
+#define GLB_CFG_ALLOC_EN			BIT(2)
+#define GLB_CFG_POWEROFF_TIMER			BIT(3)
+#define GLB_PROTM_ENTER				BIT(4)
+#define GLB_PERFCNT_EN				BIT(5)
+#define GLB_PERFCNT_SAMPLER			BIT(6)
+#define GLB_COUNTER_EN				BIT(7)
+#define GLB_PING				BIT(8)
+#define GLB_FWCFG_UPDATE			BIT(9)
+#define GLB_IDLE_EN				BIT(10)
+#define GLB_SLEEP				BIT(12)
+#define GLB_INACTIVE_COMPUTE			BIT(20)
+#define GLB_INACTIVE_FRAGMENT			BIT(21)
+#define GLB_INACTIVE_TILER			BIT(22)
+#define GLB_PROTM_EXIT				BIT(23)
+#define GLB_PERFCNT_THRESHOLD			BIT(24)
+#define GLB_PERFCNT_OVERFLOW			BIT(25)
+#define GLB_IDLE				BIT(26)
+#define GLB_DBG_CSF				BIT(30)
+#define GLB_DBG_HOST				BIT(31)
+#define GLB_REQ_MASK				GENMASK(10, 0)
+#define GLB_EVT_MASK				GENMASK(26, 20)
+	u32 req;
+	u32 ack_irq_mask;
+	u32 doorbell_req;
+	u32 reserved1;
+	u32 progress_timer;
+
+#define GLB_TIMER_VAL(x)			((x) & GENMASK(30, 0))
+#define GLB_TIMER_SOURCE_GPU_COUNTER		BIT(31)
+	u32 poweroff_timer;
+	u64 core_en_mask;
+	u32 reserved2;
+	u32 perfcnt_as;
+	u64 perfcnt_base;
+	u32 perfcnt_extract;
+	u32 reserved3[3];
+	u32 perfcnt_config;
+	u32 perfcnt_csg_select;
+	u32 perfcnt_fw_enable;
+	u32 perfcnt_csg_enable;
+	u32 perfcnt_csf_enable;
+	u32 perfcnt_shader_enable;
+	u32 perfcnt_tiler_enable;
+	u32 perfcnt_mmu_l2_enable;
+	u32 reserved4[8];
+	u32 idle_timer;
+} __packed;
+
+enum panthor_fw_halt_status {
+	PANTHOR_FW_HALT_OK = 0,
+	PANTHOR_FW_HALT_ON_PANIC = 0x4e,
+	PANTHOR_FW_HALT_ON_WATCHDOG_EXPIRATION = 0x4f,
+};
+
+struct panthor_fw_global_output_iface {
+	u32 ack;
+	u32 reserved1;
+	u32 doorbell_ack;
+	u32 reserved2;
+	u32 halt_status;
+	u32 perfcnt_status;
+	u32 perfcnt_insert;
+} __packed;
+
+/**
+ * struct panthor_fw_cs_iface - Firmware command stream slot interface
+ */
+struct panthor_fw_cs_iface {
+	/**
+	 * @lock: Lock protecting access to the panthor_fw_cs_input_iface::req
+	 * field.
+	 *
+	 * Needed so we can update the req field concurrently from the interrupt
+	 * handler and the scheduler logic.
+	 *
+	 * TODO: Ideally we'd want to use a cmpxchg() to update the req, but FW
+	 * interface sections are mapped uncached/write-combined right now, and
+	 * using cmpxchg() on such mappings leads to SError faults. Revisit when
+	 * we have 'SHARED' GPU mappings hooked up.
+	 */
+	spinlock_t lock;
+
+	/**
+	 * @control: Command stream slot control interface.
+	 *
+	 * Used to expose command stream slot properties.
+	 *
+	 * This interface is read-only.
+	 */
+	struct panthor_fw_cs_control_iface *control;
+
+	/**
+	 * @input: Command stream slot input interface.
+	 *
+	 * Used for host updates/events.
+	 */
+	struct panthor_fw_cs_input_iface *input;
+
+	/**
+	 * @output: Command stream slot output interface.
+	 *
+	 * Used for FW updates/events.
+	 *
+	 * This interface is read-only.
+	 */
+	const struct panthor_fw_cs_output_iface *output;
+};
+
+/**
+ * struct panthor_fw_csg_iface - Firmware command stream group slot interface
+ */
+struct panthor_fw_csg_iface {
+	/**
+	 * @lock: Lock protecting access to the panthor_fw_csg_input_iface::req
+	 * field.
+	 *
+	 * Needed so we can update the req field concurrently from the interrupt
+	 * handler and the scheduler logic.
+	 *
+	 * TODO: Ideally we'd want to use a cmpxchg() to update the req, but FW
+	 * interface sections are mapped uncached/write-combined right now, and
+	 * using cmpxchg() on such mappings leads to SError faults. Revisit when
+	 * we have 'SHARED' GPU mappings hooked up.
+	 */
+	spinlock_t lock;
+
+	/**
+	 * @control: Command stream group slot control interface.
+	 *
+	 * Used to expose command stream group slot properties.
+	 *
+	 * This interface is read-only.
+	 */
+	const struct panthor_fw_csg_control_iface *control;
+
+	/**
+	 * @input: Command stream group slot input interface.
+	 *
+	 * Used for host updates/events.
+	 */
+	struct panthor_fw_csg_input_iface *input;
+
+	/**
+	 * @output: Command stream group slot output interface.
+	 *
+	 * Used for FW updates/events.
+	 *
+	 * This interface is read-only.
+	 */
+	const struct panthor_fw_csg_output_iface *output;
+};
+
+/**
+ * struct panthor_fw_global_iface - Firmware global interface
+ */
+struct panthor_fw_global_iface {
+	/**
+	 * @lock: Lock protecting access to the panthor_fw_global_input_iface::req
+	 * field.
+	 *
+	 * Needed so we can update the req field concurrently from the interrupt
+	 * handler and the scheduler/FW management logic.
+	 *
+	 * TODO: Ideally we'd want to use a cmpxchg() to update the req, but FW
+	 * interface sections are mapped uncached/write-combined right now, and
+	 * using cmpxchg() on such mappings leads to SError faults. Revisit when
+	 * we have 'SHARED' GPU mappings hooked up.
+	 */
+	spinlock_t lock;
+
+	/**
+	 * @control: Global control interface.
+	 *
+	 * Used to expose global FW properties.
+	 *
+	 * This interface is read-only.
+	 */
+	const struct panthor_fw_global_control_iface *control;
+
+	/**
+	 * @input: Global input interface.
+	 *
+	 * Used for host updates/events.
+	 */
+	struct panthor_fw_global_input_iface *input;
+
+	/**
+	 * @output: Global output interface.
+	 *
+	 * Used for FW updates/events.
+	 *
+	 * This interface is read-only.
+	 */
+	const struct panthor_fw_global_output_iface *output;
+};
+
+/**
+ * panthor_fw_toggle_reqs() - Toggle acknowledge bits to send an event to the FW
+ * @__iface: The interface to operate on.
+ * @__in_reg: Name of the register to update in the input section of the interface.
+ * @__out_reg: Name of the register to take as a reference in the output section of the
+ * interface.
+ * @__mask: Mask to apply to the update.
+ *
+ * The Host -> FW event/message passing was designed to be lockless, with each side of
+ * the channel having its writeable section. Events are signaled as a difference between
+ * the host and FW side in the req/ack registers (when a bit differs, there's an event
+ * pending; when they are the same, nothing needs attention).
+ *
+ * This helper allows one to update the req register based on the current value of the
+ * ack register managed by the FW. Toggling a specific bit will flag an event. In order
+ * for events to be re-evaluated, the interface doorbell needs to be rung.
+ *
+ * Concurrent accesses to the same req register are covered.
+ *
+ * Anything requiring atomic updates to multiple registers requires a dedicated lock.
+ */
+#define panthor_fw_toggle_reqs(__iface, __in_reg, __out_reg, __mask) \
+	do { \
+		u32 __cur_val, __new_val, __out_val; \
+		spin_lock(&(__iface)->lock); \
+		__cur_val = READ_ONCE((__iface)->input->__in_reg); \
+		__out_val = READ_ONCE((__iface)->output->__out_reg); \
+		__new_val = ((__out_val ^ (__mask)) & (__mask)) | (__cur_val & ~(__mask)); \
+		WRITE_ONCE((__iface)->input->__in_reg, __new_val); \
+		spin_unlock(&(__iface)->lock); \
+	} while (0)
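+
+/*
+ * Example (mirrors the ping path in panthor_fw.c): flag a GLB_PING event and
+ * ring the global doorbell so the FW processes it:
+ *
+ *	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PING);
+ *	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
+ */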
+
+/**
+ * panthor_fw_update_reqs() - Update bits to reflect a configuration change
+ * @__iface: The interface to operate on.
+ * @__in_reg: Name of the register to update in the input section of the interface.
+ * @__val: Value to set.
+ * @__mask: Mask to apply to the update.
+ *
+ * Some configuration values are passed through req registers that are also
+ * used to send events to the FW. Since those req registers can be updated from
+ * the interrupt handler, a dedicated helper is needed to update the
+ * configuration part without racing with the event updates.
+ *
+ * Concurrent accesses to the same req register are covered.
+ *
+ * Anything requiring atomic updates to multiple registers requires a dedicated lock.
+ */
+#define panthor_fw_update_reqs(__iface, __in_reg, __val, __mask) \
+	do { \
+		u32 __cur_val, __new_val; \
+		spin_lock(&(__iface)->lock); \
+		__cur_val = READ_ONCE((__iface)->input->__in_reg); \
+		__new_val = (__cur_val & ~(__mask)) | ((__val) & (__mask)); \
+		WRITE_ONCE((__iface)->input->__in_reg, __new_val); \
+		spin_unlock(&(__iface)->lock); \
+	} while (0)
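+
+/*
+ * Example (as done in the global interface init): set the GLB_IDLE_EN
+ * configuration bit without disturbing the event bits:
+ *
+ *	panthor_fw_update_reqs(glb_iface, req, GLB_IDLE_EN, GLB_IDLE_EN);
+ */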
+
+struct panthor_fw_global_iface *
+panthor_fw_get_glb_iface(struct panthor_device *ptdev);
+
+struct panthor_fw_csg_iface *
+panthor_fw_get_csg_iface(struct panthor_device *ptdev, u32 csg_slot);
+
+struct panthor_fw_cs_iface *
+panthor_fw_get_cs_iface(struct panthor_device *ptdev, u32 csg_slot, u32 cs_slot);
+
+int panthor_fw_csg_wait_acks(struct panthor_device *ptdev, u32 csg_slot, u32 req_mask,
+			     u32 *acked, u32 timeout_ms);
+
+int panthor_fw_glb_wait_acks(struct panthor_device *ptdev, u32 req_mask, u32 *acked,
+			     u32 timeout_ms);
+
+void panthor_fw_ring_csg_doorbells(struct panthor_device *ptdev, u32 csg_mask);
+
+void panthor_fw_mem_vunmap(struct panthor_fw_mem *mem);
+void *panthor_fw_mem_vmap(struct panthor_fw_mem *mem);
+u64 panthor_fw_mem_va(struct panthor_fw_mem *mem);
+void panthor_fw_mem_free(struct panthor_device *ptdev, struct panthor_fw_mem *mem);
+struct panthor_fw_mem *
+panthor_fw_alloc_queue_iface_mem(struct panthor_device *ptdev,
+				 struct panthor_fw_ringbuf_input_iface **input,
+				 const struct panthor_fw_ringbuf_output_iface **output);
+struct panthor_fw_mem *
+panthor_fw_alloc_suspend_buf_mem(struct panthor_device *ptdev, size_t size);
+
+void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang);
+int panthor_fw_post_reset(struct panthor_device *ptdev);
+
+static inline void panthor_fw_suspend(struct panthor_device *ptdev)
+{
+	panthor_fw_pre_reset(ptdev, false);
+}
+
+static inline int panthor_fw_resume(struct panthor_device *ptdev)
+{
+	return panthor_fw_post_reset(ptdev);
+}
+
+int panthor_fw_init(struct panthor_device *ptdev);
+void panthor_fw_unplug(struct panthor_device *ptdev);
+
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 10/15] drm/panthor: Add the heap logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (8 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 09/15] drm/panthor: Add the FW " Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-18 14:39   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 11/15] drm/panthor: Add the scheduler " Boris Brezillon
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Tiler heap growing requires some kernel driver involvement: when the
tiler runs out of heap memory, it will raise an exception which is
either directly handled by the firmware if some free heap chunks are
available in the heap context, or passed back to the kernel otherwise.
The heap helpers will be used by the scheduler logic to allocate more
heap chunks to a heap context, when such a situation happens.

Heap context creation is explicitly requested by userspace (using
the TILER_HEAP_CREATE ioctl), and the returned context is attached to a
queue through some command stream instruction.

All the kernel does is keep track of the heap chunks allocated to a
context, so they can be freed when TILER_HEAP_DESTROY is called, and
extend this list when the FW requests a new chunk.

v2:
- Rename the driver (pancsf -> panthor)
- Split the driver addition commit
- Document the code
- Fix various bugs

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_heap.c | 550 +++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_heap.h |  36 ++
 2 files changed, 586 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
new file mode 100644
index 000000000000..39244efc2eaa
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -0,0 +1,550 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2023 Collabora ltd. */
+
+#include <linux/iosys-map.h>
+#include <linux/rwsem.h>
+
+#include <drm/panthor_drm.h>
+
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_heap.h"
+#include "panthor_mmu.h"
+
+/**
+ * struct panthor_heap_gpu_ctx - Heap context used by the GPU/FW.
+ */
+struct panthor_heap_gpu_ctx {
+	/**
+	 * @first_heap_chunk: GPU VA of the first free heap chunk.
+	 *
+	 * This forms a singly-linked list, where each chunk points to the
+	 * next free chunk, and the last element points to NULL.
+	 *
+	 * Heap chunks get freed and returned to the heap context when fragment
+	 * jobs picking data from those heap chunks complete. When this happens,
+	 * this field is updated to insert the heap chunks that were freed.
+	 *
+	 * When the tiler runs out of memory, it will first check if there
+	 * are free heap chunks in the heap context, and pick those if there are.
+	 *
+	 * When there are no free heap chunks left, the FW will raise a TILER_OOM
+	 * interrupt, letting the kernel driver allocate more heap chunks.
+	 *
+	 * If the heap context reached its heap chunk limit, the FW will wait
+	 * for fragment jobs to consume some data and return chunks to the
+	 * context.
+	 *
+	 * As a last resort, if there is no in-flight fragment jobs, the FW
+	 * will try to execute the exception handler set on the command stream.
+	 * This exception handler is expected to issue a fragment job to store
+	 * the partial rendering results and free up some heap chunks.
+	 */
+	u64 first_heap_chunk;
+
+	/** @unused1: MBZ. */
+	u32 unused1[2];
+
+	/**
+	 * @vt_started_count: Number of vertex/tiling operations started.
+	 *
+	 * This marks the beginning of a render pass, and is explicitly
+	 * flagged with a HEAP_OPERATION.vt_start instruction. If the render pass
+	 * contains multiple vertex/tiler/IDVS jobs, this HEAP_OPERATION.vt_start
+	 * instruction is only issued once.
+	 */
+	u32 vt_started_count;
+
+	/**
+	 * @vt_completed_count: Number of completed vertex/tiler jobs.
+	 *
+	 * This marks the end of the geometry processing part of a render
+	 * pass, and is explicitly flagged by the user command stream with
+	 * a HEAP_OPERATION.vt_completed instruction. If the render pass contains
+	 * multiple vertex/tiler/IDVS jobs, this HEAP_OPERATION.vt_end
+	 * instruction is only issued once.
+	 */
+	u32 vt_completed_count;
+
+	/** @unused2: MBZ. */
+	u32 unused2;
+
+	/**
+	 * @frag_completed_count: Number of completed fragment jobs.
+	 *
+	 * @vt_started_count - @frag_completed_count is the number of in-flight
+	 * render passes, which the driver uses to determine whether it's worth
+	 * allocating a new chunk or if it should instead wait for fragment jobs
+	 * to complete.
+	 *
+	 * Fragment completion is explicitly flagged by the user command stream
+	 * with a HEAP_OPERATION.frag_end or FINISH_FRAGMENT.frag_end instruction.
+	 */
+	u32 frag_completed_count;
+};
+
+/**
+ * struct panthor_heap_chunk_header - Heap chunk header
+ */
+struct panthor_heap_chunk_header {
+	/**
+	 * @next: Next heap chunk in the list.
+	 *
+	 * This is a GPU VA.
+	 */
+	u64 next;
+
+	/** @unknown: MBZ. */
+	u32 unknown[14];
+};
+
+/**
+ * struct panthor_heap_chunk - Structure used to keep track of allocated heap chunks.
+ */
+struct panthor_heap_chunk {
+	/** @node: Used to insert the heap chunk in panthor_heap::chunks. */
+	struct list_head node;
+
+	/** @bo: Buffer object backing the heap chunk. */
+	struct panthor_gem_object *bo;
+
+	/** @gpu_va: GPU address of this heap chunk. */
+	u64 gpu_va;
+};
+
+/**
+ * struct panthor_heap - Structure used to manage tiler heap contexts.
+ */
+struct panthor_heap {
+	/** @chunks: List containing all heap chunks allocated so far. */
+	struct list_head chunks;
+
+	/** @chunk_size: Size of each chunk. */
+	u32 chunk_size;
+
+	/** @max_chunks: Maximum number of chunks. */
+	u32 max_chunks;
+
+	/**
+	 * @target_in_flight: Number of in-flight render passes after which
+	 * we'd let the FW wait for fragment jobs to finish instead of allocating new chunks.
+	 */
+	u32 target_in_flight;
+
+	/** @chunk_count: Number of heap chunks currently allocated. */
+	u32 chunk_count;
+};
+
+#define MAX_HEAPS_PER_POOL    128
+
+/**
+ * struct panthor_heap_pool - Pool of heap contexts
+ *
+ * The pool is attached to a panthor_file and can't be shared across processes.
+ */
+struct panthor_heap_pool {
+	/** @refcount: Reference count. */
+	struct kref refcount;
+
+	/** @ptdev: Device. */
+	struct panthor_device *ptdev;
+
+	/** @vm: VM this pool is bound to. */
+	struct panthor_vm *vm;
+
+	/** @lock: Lock protecting access to @xa. */
+	struct rw_semaphore lock;
+
+	/** @xa: Array storing panthor_heap objects. */
+	struct xarray xa;
+
+	/** @bo: Buffer object containing the GPU heap contexts. */
+	struct panthor_gem_object *bo;
+
+	/** @gpu_contexts: Array of GPU heap contexts. */
+	struct panthor_heap_gpu_ctx *gpu_contexts;
+
+	/** @gpu_va: GPU address of the heap contexts. */
+	u64 gpu_va;
+};
+
+static void panthor_free_heap_chunk(struct panthor_vm *vm,
+				    struct panthor_heap_chunk *chunk)
+{
+	if (!chunk)
+		return;
+
+	list_del(&chunk->node);
+	panthor_gem_unmap_and_put(vm, chunk->bo, chunk->gpu_va, NULL);
+	kfree(chunk);
+}
+
+static int panthor_alloc_heap_chunk(struct panthor_device *ptdev,
+				    struct panthor_vm *vm,
+				    struct panthor_heap *heap,
+				    bool initial_chunk)
+{
+	struct iosys_map map = IOSYS_MAP_INIT_VADDR(NULL);
+	struct panthor_heap_chunk *chunk;
+	struct panthor_heap_chunk_header *hdr;
+	int ret;
+
+	chunk = kmalloc(sizeof(*chunk), GFP_KERNEL);
+	if (!chunk)
+		return -ENOMEM;
+
+	chunk->gpu_va = PANTHOR_GEM_ALLOC_VA;
+	chunk->bo = panthor_gem_create_and_map(ptdev, vm, heap->chunk_size,
+					       DRM_PANTHOR_BO_NO_MMAP,
+					       DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
+					       &chunk->gpu_va,
+					       (void **)&hdr);
+	if (IS_ERR(chunk->bo)) {
+		ret = PTR_ERR(chunk->bo);
+		goto err_free_chunk;
+	}
+
+	memset(hdr, 0, sizeof(*hdr));
+
+	if (initial_chunk && !list_empty(&heap->chunks)) {
+		struct panthor_heap_chunk *prev_chunk;
+
+		prev_chunk = list_first_entry(&heap->chunks,
+					      struct panthor_heap_chunk,
+					      node);
+
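+		/*
+		 * Chunk links pack the chunk GPU VA in bits [63:12] and the
+		 * chunk size, in 4k units, in the low bits.
+		 */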
+		hdr->next = (prev_chunk->gpu_va & GENMASK_ULL(63, 12)) |
+			    (heap->chunk_size >> 12);
+	}
+
+	map.vaddr = hdr;
+	drm_gem_vunmap_unlocked(&chunk->bo->base.base, &map);
+
+	if (initial_chunk)
+		list_add(&chunk->node, &heap->chunks);
+	else
+		list_add_tail(&chunk->node, &heap->chunks);
+	heap->chunk_count++;
+
+	return 0;
+
+err_free_chunk:
+	kfree(chunk);
+
+	return ret;
+}
+
+static void panthor_free_heap_chunks(struct panthor_vm *vm,
+				     struct panthor_heap *heap)
+{
+	struct panthor_heap_chunk *chunk, *tmp;
+
+	list_for_each_entry_safe(chunk, tmp, &heap->chunks, node) {
+		panthor_free_heap_chunk(vm, chunk);
+	}
+
+	heap->chunk_count = 0;
+}
+
+static int panthor_alloc_heap_chunks(struct panthor_device *ptdev,
+				     struct panthor_vm *vm,
+				     struct panthor_heap *heap,
+				     u32 chunk_count)
+{
+	int ret;
+	u32 i;
+
+	for (i = 0; i < chunk_count; i++) {
+		ret = panthor_alloc_heap_chunk(ptdev, vm, heap, true);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int
+panthor_heap_destroy_locked(struct panthor_heap_pool *pool, u32 handle)
+{
+	struct panthor_heap *heap = NULL;
+
+	heap = xa_erase(&pool->xa, handle);
+	if (!heap)
+		return -EINVAL;
+
+	panthor_free_heap_chunks(pool->vm, heap);
+	kfree(heap);
+	return 0;
+}
+
+/**
+ * panthor_heap_destroy() - Destroy a heap context
+ * @pool: Pool this context belongs to.
+ * @handle: Handle returned by panthor_heap_create().
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_heap_destroy(struct panthor_heap_pool *pool, u32 handle)
+{
+	int ret;
+
+	down_write(&pool->lock);
+	ret = panthor_heap_destroy_locked(pool, handle);
+	up_write(&pool->lock);
+
+	return ret;
+}
+
+/**
+ * panthor_heap_create() - Create a heap context
+ * @pool: Pool to instantiate the heap context from.
+ * @initial_chunk_count: Number of chunk allocated at initialization time.
+ * Must be at least 1.
+ * @chunk_size: The size of each chunk. Must be a power of two between 256k
+ * and 2M.
+ * @max_chunks: Maximum number of chunks that can be allocated.
+ * @target_in_flight: Maximum number of in-flight render passes.
+ * @heap_ctx_gpu_va: Pointer holding the GPU address of the allocated heap
+ * context.
+ * @first_chunk_gpu_va: Pointer holding the GPU address of the first chunk
+ * assigned to the heap context.
+ *
+ * Return: a positive handle on success, a negative error otherwise.
+ */
+int panthor_heap_create(struct panthor_heap_pool *pool,
+			u32 initial_chunk_count,
+			u32 chunk_size,
+			u32 max_chunks,
+			u32 target_in_flight,
+			u64 *heap_ctx_gpu_va,
+			u64 *first_chunk_gpu_va)
+{
+	struct panthor_heap *heap;
+	struct panthor_heap_gpu_ctx *gpu_ctx;
+	struct panthor_heap_chunk *first_chunk;
+	int ret = 0;
+	u32 id;
+
+	if (initial_chunk_count == 0)
+		return -EINVAL;
+
+	if (hweight32(chunk_size) != 1 ||
+	    chunk_size < SZ_256K || chunk_size > SZ_2M)
+		return -EINVAL;
+
+	heap = kzalloc(sizeof(*heap), GFP_KERNEL);
+	if (!heap)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&heap->chunks);
+	heap->chunk_size = chunk_size;
+	heap->max_chunks = max_chunks;
+	heap->target_in_flight = target_in_flight;
+
+	down_write(&pool->lock);
+
+	/* The pool has been destroyed, we can't create a new heap. */
+	if (!pool->vm) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = xa_alloc(&pool->xa, &id, heap, XA_LIMIT(1, MAX_HEAPS_PER_POOL), GFP_KERNEL);
+	if (ret) {
+		kfree(heap);
+		goto out_unlock;
+	}
+
+	gpu_ctx = &pool->gpu_contexts[id];
+	memset(gpu_ctx, 0, sizeof(*gpu_ctx));
+
+	ret = panthor_alloc_heap_chunks(pool->ptdev, pool->vm, heap,
+					initial_chunk_count);
+	if (ret) {
+		panthor_heap_destroy_locked(pool, id);
+		goto out_unlock;
+	}
+
+	*heap_ctx_gpu_va = pool->gpu_va + (sizeof(*pool->gpu_contexts) * id);
+
+	first_chunk = list_first_entry(&heap->chunks,
+				       struct panthor_heap_chunk,
+				       node);
+	*first_chunk_gpu_va = first_chunk->gpu_va;
+	ret = id;
+
+out_unlock:
+	up_write(&pool->lock);
+	return ret;
+}
+
+/**
+ * panthor_heap_grow() - Make a heap context grow.
+ * @pool: The pool this heap belongs to.
+ * @heap_gpu_va: The GPU address of the heap context.
+ * @renderpasses_in_flight: Number of render passes currently in-flight.
+ * @pending_frag_count: Number of fragment jobs waiting for execution/completion.
+ * @new_chunk_gpu_va: Pointer used to return the chunk VA.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+int panthor_heap_grow(struct panthor_heap_pool *pool,
+		      u64 heap_gpu_va,
+		      u32 renderpasses_in_flight,
+		      u32 pending_frag_count,
+		      u64 *new_chunk_gpu_va)
+{
+	u64 heap_id = (heap_gpu_va - pool->gpu_va) /
+		      sizeof(struct panthor_heap_gpu_ctx);
+	struct panthor_heap_chunk *chunk;
+	struct panthor_heap *heap;
+	int ret;
+
+	down_read(&pool->lock);
+	heap = xa_load(&pool->xa, heap_id);
+	if (!heap) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* If we reached the target in-flight render passes, or if we
+	 * reached the maximum number of chunks, let the FW figure another way to
+	 * find some memory (wait for render passes to finish, or call the exception
+	 * handler provided by the userspace driver, if any).
+	 */
+	if (renderpasses_in_flight > heap->target_in_flight ||
+	    (pending_frag_count > 0 && heap->chunk_count >= heap->max_chunks)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	} else if (heap->chunk_count >= heap->max_chunks) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	ret = panthor_alloc_heap_chunk(pool->ptdev, pool->vm, heap, false);
+	if (ret)
+		goto out_unlock;
+
+	chunk = list_last_entry(&heap->chunks,
+				struct panthor_heap_chunk,
+				node);
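+	/* Same packed VA + size-in-4k-units format as the chunk headers. */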
+	*new_chunk_gpu_va = (chunk->gpu_va & GENMASK_ULL(63, 12)) |
+			    (heap->chunk_size >> 12);
+	ret = 0;
+
+out_unlock:
+	up_read(&pool->lock);
+	return ret;
+}
+
+static void panthor_heap_pool_release(struct kref *refcount)
+{
+	struct panthor_heap_pool *pool =
+		container_of(refcount, struct panthor_heap_pool, refcount);
+
+	xa_destroy(&pool->xa);
+	kfree(pool);
+}
+
+/**
+ * panthor_heap_pool_put() - Release a heap pool reference
+ * @pool: Pool to release the reference on. Can be NULL.
+ */
+void panthor_heap_pool_put(struct panthor_heap_pool *pool)
+{
+	if (pool)
+		kref_put(&pool->refcount, panthor_heap_pool_release);
+}
+
+/**
+ * panthor_heap_pool_get() - Get a heap pool reference
+ * @pool: Pool to get the reference on. Can be NULL.
+ *
+ * Return: @pool.
+ */
+struct panthor_heap_pool *
+panthor_heap_pool_get(struct panthor_heap_pool *pool)
+{
+	if (pool)
+		kref_get(&pool->refcount);
+
+	return pool;
+}
+
+/**
+ * panthor_heap_pool_create() - Create a heap pool
+ * @ptdev: Device.
+ * @vm: The VM this heap pool will be attached to.
+ *
+ * Heap pools might contain up to 128 heap contexts, and are per-VM.
+ *
+ * Return: A valid pointer on success, an ERR_PTR() otherwise.
+ */
+struct panthor_heap_pool *
+panthor_heap_pool_create(struct panthor_device *ptdev, struct panthor_vm *vm)
+{
+	size_t bosize = ALIGN(MAX_HEAPS_PER_POOL *
+			      sizeof(struct panthor_heap_gpu_ctx),
+			      4096);
+	struct panthor_heap_pool *pool;
+	int ret = 0;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+
+	/* We want a weak ref here: the heap pool belongs to the VM, so we're
+	 * sure that, as long as the heap pool exists, the VM exists too.
+	 */
+	pool->vm = vm;
+	pool->ptdev = ptdev;
+	init_rwsem(&pool->lock);
+	xa_init_flags(&pool->xa, XA_FLAGS_ALLOC1);
+	kref_init(&pool->refcount);
+
+	pool->gpu_va = PANTHOR_GEM_ALLOC_VA;
+	pool->bo = panthor_gem_create_and_map(ptdev, vm, bosize,
+					      DRM_PANTHOR_BO_NO_MMAP,
+					      DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
+					      &pool->gpu_va,
+					      (void *)&pool->gpu_contexts);
+	if (IS_ERR(pool->bo)) {
+		ret = PTR_ERR(pool->bo);
+		goto err_destroy_pool;
+	}
+
+	return pool;
+
+err_destroy_pool:
+	panthor_heap_pool_destroy(pool);
+	return ERR_PTR(ret);
+}
+
+/**
+ * panthor_heap_pool_destroy() - Destroy a heap pool.
+ * @pool: Pool to destroy.
+ *
+ * This function destroys all heap contexts and their resources, thus
+ * preventing any use of the heap contexts or the chunks attached to them
+ * after that point.
+ *
+ * If the GPU still has access to some heap contexts, a fault should be
+ * triggered, which should flag the command stream groups using these
+ * contexts as faulty.
+ *
+ * The heap pool object is only released when all references to this pool
+ * are released.
+ */
+void panthor_heap_pool_destroy(struct panthor_heap_pool *pool)
+{
+	struct panthor_heap *heap;
+	unsigned long i;
+
+	down_write(&pool->lock);
+	xa_for_each(&pool->xa, i, heap)
+		drm_WARN_ON(&pool->ptdev->base, panthor_heap_destroy_locked(pool, i));
+
+	if (!IS_ERR_OR_NULL(pool->bo))
+		panthor_gem_unmap_and_put(pool->vm, pool->bo, pool->gpu_va, pool->gpu_contexts);
+
+	/* Reflects the fact the pool has been destroyed. */
+	pool->vm = NULL;
+	up_write(&pool->lock);
+
+	panthor_heap_pool_put(pool);
+}
diff --git a/drivers/gpu/drm/panthor/panthor_heap.h b/drivers/gpu/drm/panthor/panthor_heap.h
new file mode 100644
index 000000000000..ff6ebdcd412e
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_heap.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANTHOR_HEAP_H__
+#define __PANTHOR_HEAP_H__
+
+#include <linux/types.h>
+
+struct panthor_device;
+struct panthor_heap_pool;
+struct panthor_vm;
+
+int panthor_heap_create(struct panthor_heap_pool *pool,
+			u32 initial_chunk_count,
+			u32 chunk_size,
+			u32 max_chunks,
+			u32 target_in_flight,
+			u64 *heap_ctx_gpu_va,
+			u64 *first_chunk_gpu_va);
+int panthor_heap_destroy(struct panthor_heap_pool *pool, u32 handle);
+
+struct panthor_heap_pool *
+panthor_heap_pool_create(struct panthor_device *ptdev, struct panthor_vm *vm);
+void panthor_heap_pool_destroy(struct panthor_heap_pool *pool);
+
+struct panthor_heap_pool *
+panthor_heap_pool_get(struct panthor_heap_pool *pool);
+void panthor_heap_pool_put(struct panthor_heap_pool *pool);
+
+int panthor_heap_grow(struct panthor_heap_pool *pool,
+		      u64 heap_gpu_va,
+		      u32 renderpasses_in_flight,
+		      u32 pending_frag_count,
+		      u64 *new_chunk_gpu_va);
+
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 11/15] drm/panthor: Add the scheduler logical block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (9 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 10/15] drm/panthor: Add the heap " Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-18 15:38   ` Steven Price
  2023-08-09 16:53 ` [PATCH v2 12/15] drm/panthor: Add the driver frontend block Boris Brezillon
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

This is the piece of software interacting with the FW scheduler, and
taking care of some scheduling aspects when the FW comes short of
scheduling slots. Indeed, the FW only exposes a few slots, and the kernel
has to give all submission contexts a chance to execute their jobs.

The kernel-side scheduler is timeslice-based, with a round-robin queue
per priority level.

Job submission is handled with a 1:1 drm_sched_entity:drm_gpu_scheduler,
allowing us to delegate the dependency tracking to the core.

All the gory details should be documented inline.

v2:
- Rename the driver (pancsf -> panthor)
- Rename the file (_mcu -> _fw)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Move the ping logic to panthor_fw.c
- Fix various bugs

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_sched.c | 3272 +++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_sched.h |   50 +
 2 files changed, 3322 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
new file mode 100644
index 000000000000..c1a516454e5d
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -0,0 +1,3272 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2023 Collabora ltd. */
+
+#ifdef CONFIG_ARM_ARCH_TIMER
+#include <asm/arch_timer.h>
+#endif
+
+#include <drm/panthor_drm.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_gem_shmem_helper.h>
+#include <drm/drm_managed.h>
+#include <drm/gpu_scheduler.h>
+
+#include <linux/build_bug.h>
+#include <linux/clk.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/firmware.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/iosys-map.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
+#include <linux/dma-resv.h>
+
+#include "panthor_sched.h"
+#include "panthor_devfreq.h"
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_heap.h"
+#include "panthor_regs.h"
+#include "panthor_gpu.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+
+/**
+ * DOC: Scheduler
+ *
+ * Mali CSF hardware adopts a firmware-assisted scheduling model, where
+ * the firmware takes care of scheduling aspects, to some extent.
+ *
+ * The scheduling happens at the scheduling group level; each group
+ * contains 1 to N queues (N is FW/hardware dependent, and exposed
+ * through the firmware interface). Each queue is assigned a command
+ * stream ring buffer, which serves as a way to get jobs submitted to
+ * the GPU, among other things.
+ *
+ * The firmware can schedule a maximum of M groups (M is FW/hardware
+ * dependent, and exposed through the firmware interface). Beyond
+ * this maximum number of groups, the kernel must take care of
+ * rotating the groups passed to the firmware so every group gets
+ * a chance to have its queues scheduled for execution.
+ *
+ * The current implementation only supports kernel-mode queues.
+ * In other words, userspace doesn't have access to the ring-buffer.
+ * Instead, userspace passes indirect command stream buffers that are
+ * called from the queue ring-buffer by the kernel using a pre-defined
+ * sequence of command stream instructions to ensure the userspace driver
+ * always gets consistent results (cache maintenance,
+ * synchronization, ...).
+ *
+ * We rely on the drm_gpu_scheduler framework to deal with job
+ * dependencies and submission. As any other driver dealing with a
+ * FW-scheduler, we use the 1:1 entity:scheduler mode, such that each
+ * entity has its own job scheduler. When a job is ready to be executed
+ * (all its dependencies are met), it is pushed to the appropriate
+ * queue ring-buffer, and the group is scheduled for execution if it
+ * wasn't already active.
+ *
+ * Kernel-side group scheduling is timeslice-based. When we have fewer
+ * groups than there are slots, the periodic tick is disabled and we
+ * just let the FW schedule the active groups. When there are more
+ * groups than slots, we give each group a chance to execute for
+ * a given amount of time, and then re-evaluate and pick new groups
+ * to schedule. The group selection algorithm is based on
+ * priority+round-robin.
+ *
+ * Even though user-mode queues are out of scope right now, the
+ * current design takes them into account by avoiding any guess on the
+ * group/queue state that would be based on information we wouldn't have
+ * if userspace was in charge of the ring-buffer. That's also one of the
+ * reasons we don't do 'cooperative' scheduling (encoding FW group slot
+ * reservation as a dma_fence that would be returned from the
+ * drm_gpu_scheduler::prepare_job() hook, and treating group rotation as
+ * a queue of waiters, ordered by job submission order). This approach
+ * would work for kernel-mode queues, but would make user-mode queues a
+ * lot more complicated to retrofit.
+ */
+
+#define JOB_TIMEOUT_MS				5000
+
+#define MIN_CS_PER_CSG				8
+
+#define MIN_CSGS				3
+#define MAX_CSG_PRIO				0xf
+
+struct panthor_group;
+
+/**
+ * struct panthor_csg_slot - Command stream group slot
+ *
+ * This represents a FW slot for a scheduling group.
+ */
+struct panthor_csg_slot {
+	/** @group: Scheduling group bound to this slot. */
+	struct panthor_group *group;
+
+	/** @priority: Group priority. */
+	u8 priority;
+
+	/**
+	 * @idle: True if the group bound to this slot is idle.
+	 *
+	 * A group is idle when it has nothing waiting for execution on
+	 * all its queues, or when queues are blocked waiting for something
+	 * to happen (synchronization object).
+	 */
+	bool idle;
+};
+
+/**
+ * enum panthor_csg_priority - Group priority
+ */
+enum panthor_csg_priority {
+	/** @PANTHOR_CSG_PRIORITY_LOW: Low priority group. */
+	PANTHOR_CSG_PRIORITY_LOW = 0,
+
+	/** @PANTHOR_CSG_PRIORITY_MEDIUM: Medium priority group. */
+	PANTHOR_CSG_PRIORITY_MEDIUM,
+
+	/** @PANTHOR_CSG_PRIORITY_HIGH: High priority group. */
+	PANTHOR_CSG_PRIORITY_HIGH,
+
+	/**
+	 * @PANTHOR_CSG_PRIORITY_RT: Real-time priority group.
+	 *
+	 * Real-time priority allows one to preempt the scheduling of other
+	 * non-real-time groups. When such a group becomes executable,
+	 * it will evict the group with the lowest non-rt priority if
+	 * there's no free group slot available.
+	 *
+	 * Currently not exposed to userspace.
+	 */
+	PANTHOR_CSG_PRIORITY_RT,
+
+	/** @PANTHOR_CSG_PRIORITY_COUNT: Number of priority levels. */
+	PANTHOR_CSG_PRIORITY_COUNT,
+};
+
+/**
+ * struct panthor_scheduler - Object used to manage the scheduler
+ */
+struct panthor_scheduler {
+	/** @ptdev: Device. */
+	struct panthor_device *ptdev;
+
+	/**
+	 * @wq: Workqueue passed to the drm_gpu_scheduler.
+	 *
+	 * Used to submit/cleanup jobs.
+	 */
+	struct workqueue_struct *wq;
+
+	/** @tick_work: Work executed on a scheduling tick. */
+	struct delayed_work tick_work;
+
+	/**
+	 * @sync_upd_work: Work used to process synchronization object updates.
+	 *
+	 * We use this work to unblock queues/groups that were waiting on a
+	 * synchronization object.
+	 */
+	struct work_struct sync_upd_work;
+
+	/**
+	 * @resched_target: When the next tick should occur.
+	 *
+	 * Expressed in jiffies.
+	 */
+	u64 resched_target;
+
+	/**
+	 * @last_tick: When the last tick occurred.
+	 *
+	 * Expressed in jiffies.
+	 */
+	u64 last_tick;
+
+	/** @tick_period: Tick period in jiffies. */
+	u64 tick_period;
+
+	/**
+	 * @lock: Lock protecting access to all the scheduler fields.
+	 *
+	 * Should be taken in the tick work, the irq handler, and anywhere the @groups
+	 * fields are touched.
+	 */
+	struct mutex lock;
+
+	/** @groups: Various lists used to classify groups. */
+	struct {
+		/**
+		 * @runnable: Runnable group lists.
+		 *
+		 * When a group has queues that want to execute something,
+		 * its panthor_group::run_node should be inserted here.
+		 *
+		 * One list per-priority.
+		 */
+		struct list_head runnable[PANTHOR_CSG_PRIORITY_COUNT];
+
+		/**
+		 * @idle: Idle group lists.
+		 *
+		 * When all queues of a group are idle (either because they
+		 * have nothing to execute, or because they are blocked), the
+		 * panthor_group::run_node field should be inserted here.
+		 *
+		 * One list per-priority.
+		 */
+		struct list_head idle[PANTHOR_CSG_PRIORITY_COUNT];
+
+		/**
+		 * @waiting: List of groups whose queues are blocked on a
+		 * synchronization object.
+		 *
+		 * Insert panthor_group::wait_node here when a group is waiting
+		 * for synchronization objects to be signaled.
+		 *
+		 * This list is evaluated in the @sync_upd_work work.
+		 */
+		struct list_head waiting;
+	} groups;
+
+	/**
+	 * @csg_slots: FW command stream group slots.
+	 */
+	struct panthor_csg_slot csg_slots[MAX_CSGS];
+
+	/** @csg_slot_count: Number of command stream group slots exposed by the FW. */
+	u32 csg_slot_count;
+
+	/** @cs_slot_count: Number of command stream slots per group slot exposed by the FW. */
+	u32 cs_slot_count;
+
+	/** @as_slot_count: Number of address space slots supported by the MMU. */
+	u32 as_slot_count;
+
+	/** @used_csg_slot_count: Number of command stream group slots currently used. */
+	u32 used_csg_slot_count;
+
+	/** @sb_slot_count: Number of scoreboard slots. */
+	u32 sb_slot_count;
+
+	/**
+	 * @might_have_idle_groups: True if an active group might have become idle.
+	 *
+	 * This will force a tick, so other runnable groups can be scheduled if one
+	 * or more active groups became idle.
+	 */
+	bool might_have_idle_groups;
+
+	/** @pm: Power management related fields. */
+	struct {
+		/** @has_ref: True if the scheduler owns a runtime PM reference. */
+		bool has_ref;
+	} pm;
+
+	/** @reset: Reset related fields. */
+	struct {
+		/** @lock: Lock protecting the other reset fields. */
+		struct mutex lock;
+
+		/**
+		 * @in_progress: True if a reset is in progress.
+		 *
+		 * Set to true in panthor_sched_pre_reset() and back to false in
+		 * panthor_sched_post_reset().
+		 */
+		bool in_progress;
+
+		/**
+		 * @stopped_groups: List containing all groups that were stopped
+		 * before a reset.
+		 *
+		 * Insert panthor_group::run_node in the pre_reset path.
+		 */
+		struct list_head stopped_groups;
+	} reset;
+};
+
+/**
+ * struct panthor_syncobj_32b - 32-bit FW synchronization object
+ */
+struct panthor_syncobj_32b {
+	/** @seqno: Sequence number. */
+	u32 seqno;
+
+	/**
+	 * @status: Status.
+	 *
+	 * Not zero on failure.
+	 */
+	u32 status;
+};
+
+/**
+ * struct panthor_syncobj_64b - 64-bit FW synchronization object
+ */
+struct panthor_syncobj_64b {
+	/** @seqno: Sequence number. */
+	u64 seqno;
+
+	/**
+	 * @status: Status.
+	 *
+	 * Not zero on failure.
+	 */
+	u32 status;
+
+	/** @pad: MBZ. */
+	u32 pad;
+};
+
+/**
+ * struct panthor_queue - Execution queue
+ */
+struct panthor_queue {
+	/** @scheduler: DRM scheduler used for this queue. */
+	struct drm_gpu_scheduler scheduler;
+
+	/** @entity: DRM scheduling entity used for this queue. */
+	struct drm_sched_entity entity;
+
+	/**
+	 * @remaining_time: Time remaining before the job timeout expires.
+	 *
+	 * The job timeout is suspended when the queue is not scheduled by the
+	 * FW. Every time we suspend the timer, we need to save the remaining
+	 * time so we can restore it later on.
+	 */
+	unsigned long remaining_time;
+
+	/** @timeout_suspended: True if the job timeout was suspended. */
+	bool timeout_suspended;
+
+	/**
+	 * @doorbell_id: Doorbell assigned to this queue.
+	 *
+	 * Right now, all queues of a group share the same doorbell, and the
+	 * doorbell ID is assigned to group_slot + 1 when the group is assigned
+	 * a slot. But we might decide to provide fine-grained doorbell
+	 * assignment at some point, so we don't have to wake up all queues in
+	 * a group every time one of them is updated.
+	 */
+	u8 doorbell_id;
+
+	/**
+	 * @priority: Priority of the queue inside the group.
+	 *
+	 * Must be less than 16 (Only 4 bits available).
+	 */
+	u8 priority;
+#define CSF_MAX_QUEUE_PRIO	GENMASK(3, 0)
+
+	/** @ringbuf: Command stream ring-buffer fields. */
+	struct {
+		/** @bo: Buffer object for the ring-buffer. */
+		struct panthor_gem_object *bo;
+
+		/** @gpu_va: GPU virtual address. */
+		u64 gpu_va;
+
+		/** @kmap: Kernel mapping of the ring buffer. */
+		u64 *kmap;
+	} ringbuf;
+
+	/** @iface: Firmware interface. */
+	struct {
+		/** @mem: FW memory allocated for this interface. */
+		struct panthor_fw_mem *mem;
+
+		/** @input: Input interface. */
+		struct panthor_fw_ringbuf_input_iface *input;
+
+		/** @output: Output interface. */
+		const struct panthor_fw_ringbuf_output_iface *output;
+	} iface;
+
+	/**
+	 * @syncwait: Stores information about the synchronization object this
+	 * queue is waiting on.
+	 */
+	struct {
+		/** @gpu_va: GPU address of the synchronization object. */
+		u64 gpu_va;
+
+		/** @ref: Reference value to compare against. */
+		u64 ref;
+
+		/** @gt: True if this is a greater-than test. */
+		bool gt;
+
+		/** @sync64: True if this is a 64-bit sync object. */
+		bool sync64;
+
+		/** @bo: Buffer object holding the synchronization object. */
+		struct panthor_gem_object *bo;
+
+		/** @offset: Offset of the synchronization object inside @bo. */
+		u64 offset;
+
+		/**
+		 * @kmap: Kernel mapping of the buffer object holding the
+		 * synchronization object.
+		 */
+		void *kmap;
+	} syncwait;
+
+	/** @fence_ctx: Fence context fields. */
+	struct {
+		/** @lock: Used to protect access to all fences allocated by this context. */
+		spinlock_t lock;
+
+		/**
+		 * @id: Fence context ID.
+		 *
+		 * Allocated with dma_fence_context_alloc().
+		 */
+		u64 id;
+
+		/** @seqno: Sequence number of the last initialized fence. */
+		atomic64_t seqno;
+
+		/**
+		 * @in_flight_jobs: List containing all in-flight jobs.
+		 *
+		 * Used to keep track of in-flight jobs so we can signal
+		 * panthor_job::done_fence when the synchronization object
+		 * attached to the queue is signaled.
+		 */
+		struct list_head in_flight_jobs;
+	} fence_ctx;
+};
+
+/**
+ * enum panthor_group_state - Scheduling group state.
+ */
+enum panthor_group_state {
+	/** @PANTHOR_CS_GROUP_CREATED: Group was created, but not scheduled yet. */
+	PANTHOR_CS_GROUP_CREATED,
+
+	/** @PANTHOR_CS_GROUP_ACTIVE: Group is currently scheduled. */
+	PANTHOR_CS_GROUP_ACTIVE,
+
+	/**
+	 * @PANTHOR_CS_GROUP_SUSPENDED: Group was scheduled at least once, but is
+	 * inactive/suspended right now.
+	 */
+	PANTHOR_CS_GROUP_SUSPENDED,
+
+	/**
+	 * @PANTHOR_CS_GROUP_TERMINATED: Group was terminated.
+	 *
+	 * Can no longer be scheduled. The only allowed action is a destruction.
+	 */
+	PANTHOR_CS_GROUP_TERMINATED,
+};
+
+/**
+ * struct panthor_group - Scheduling group object
+ */
+struct panthor_group {
+	/** @refcount: Reference count */
+	struct kref refcount;
+
+	/** @ptdev: Device. */
+	struct panthor_device *ptdev;
+
+	/** @vm: VM bound to the group. */
+	struct panthor_vm *vm;
+
+	/** @compute_core_mask: Mask of shader cores that can be used for compute jobs. */
+	u64 compute_core_mask;
+
+	/** @fragment_core_mask: Mask of shader cores that can be used for fragment jobs. */
+	u64 fragment_core_mask;
+
+	/** @tiler_core_mask: Mask of tiler cores that can be used for tiler jobs. */
+	u64 tiler_core_mask;
+
+	/** @max_compute_cores: Maximum number of shader cores used for compute jobs. */
+	u8 max_compute_cores;
+
+	/** @max_fragment_cores: Maximum number of shader cores used for fragment jobs. */
+	u8 max_fragment_cores;
+
+	/** @max_tiler_cores: Maximum number of tiler cores used for tiler jobs. */
+	u8 max_tiler_cores;
+
+	/** @priority: Group priority (check panthor_csg_priority). */
+	u8 priority;
+
+	/** @blocked_queues: Bitmask reflecting the blocked queues. */
+	u32 blocked_queues;
+
+	/** @idle_queues: Bitmask reflecting the idle queues. */
+	u32 idle_queues;
+
+	/** @fatal_lock: Lock used to protect access to fatal fields. */
+	spinlock_t fatal_lock;
+
+	/** @fatal_queues: Bitmask reflecting the queues that hit a fatal exception. */
+	u32 fatal_queues;
+
+	/** @queue_count: Number of queues in this group. */
+	u32 queue_count;
+
+	/** @queues: Queues owned by this group. */
+	struct panthor_queue *queues[MAX_CS_PER_CSG];
+
+	/**
+	 * @csg_id: ID of the FW group slot.
+	 *
+	 * -1 when the group is not scheduled/active.
+	 */
+	int csg_id;
+
+	/**
+	 * @destroyed: True when the group has been destroyed.
+	 *
+	 * If a group is destroyed it becomes useless: no further jobs can be submitted
+	 * to its queues. We simply wait for all references to be dropped so we can
+	 * release the group object.
+	 */
+	bool destroyed;
+
+	/**
+	 * @timedout: True when a timeout occurred on any of the queues owned by
+	 * this group.
+	 *
+	 * Timeouts can be reported by drm_sched or by the FW. In any case, any
+	 * timeout situation is unrecoverable, and the group becomes useless.
+	 * We simply wait for all references to be dropped so we can release the
+	 * group object.
+	 */
+	bool timedout;
+
+	/**
+	 * @syncobjs: Pool of per-queue synchronization objects.
+	 *
+	 * One sync object per queue. The position of the sync object is
+	 * determined by the queue index.
+	 */
+	struct {
+		/** @bo: Buffer object containing these synchronization objects. */
+		struct panthor_gem_object *bo;
+
+		/** @gpu_va: GPU address of the sync object pool */
+		u64 gpu_va;
+
+		/** @kmap: The kernel mapping of the sync object pool. */
+		void *kmap;
+	} syncobjs;
+
+	/** @state: Group state. */
+	enum panthor_group_state state;
+
+	/**
+	 * @suspend_buf: Suspend buffer.
+	 *
+	 * Stores the state of the group and its queues when a group is suspended.
+	 * Used at resume time to restore the group in its previous state.
+	 *
+	 * The size of the suspend buffer is exposed through the FW interface.
+	 */
+	struct panthor_fw_mem *suspend_buf;
+
+	/**
+	 * @protm_suspend_buf: Protection mode suspend buffer.
+	 *
+	 * Stores the state of the group and its queues when a group that's in
+	 * protection mode is suspended.
+	 *
+	 * Used at resume time to restore the group in its previous state.
+	 *
+	 * The size of the protection mode suspend buffer is exposed through the
+	 * FW interface.
+	 */
+	struct panthor_fw_mem *protm_suspend_buf;
+
+	/** @sync_upd_work: Work used to check/signal job fences. */
+	struct work_struct sync_upd_work;
+
+	/** @term_work: Work used to finish the group termination procedure. */
+	struct work_struct term_work;
+
+	/**
+	 * @release_work: Work used to release group resources.
+	 *
+	 * We need to postpone the group release to avoid a deadlock when
+	 * the last ref is released in the tick work.
+	 */
+	struct work_struct release_work;
+
+	/**
+	 * @run_node: Node used to insert the group in the
+	 * panthor_scheduler::groups::{runnable,idle} and
+	 * panthor_scheduler::reset.stopped_groups lists.
+	 */
+	struct list_head run_node;
+
+	/**
+	 * @wait_node: Node used to insert the group in the
+	 * panthor_scheduler::groups::waiting list.
+	 */
+	struct list_head wait_node;
+};
+
+/**
+ * group_queue_work() - Queue a group work
+ * @group: Group to queue the work for.
+ * @wname: Work name.
+ *
+ * Grabs a ref and queues a work item to the scheduler workqueue. If
+ * the work was already queued, we release the reference we grabbed.
+ *
+ * Work callbacks must release the reference we grabbed here.
+ */
+#define group_queue_work(group, wname) \
+	do { \
+		group_get(group); \
+		if (!queue_work((group)->ptdev->scheduler->wq, &(group)->wname ## _work)) \
+			group_put(group); \
+	} while (0)
+
+/**
+ * sched_queue_work() - Queue a scheduler work.
+ * @sched: Scheduler object.
+ * @wname: Work name.
+ *
+ * Conditionally queues a scheduler work if no reset is pending/in-progress.
+ */
+#define sched_queue_work(sched, wname) \
+	do { \
+		if (!(sched)->reset.in_progress && \
+		    !panthor_device_reset_is_pending((sched)->ptdev)) \
+			queue_work((sched)->wq, &(sched)->wname ## _work); \
+	} while (0)
+
+/**
+ * sched_queue_delayed_work() - Queue a scheduler delayed work.
+ * @sched: Scheduler object.
+ * @wname: Work name.
+ * @delay: Work delay in jiffies.
+ *
+ * Conditionally queues a scheduler delayed work if no reset is
+ * pending/in-progress.
+ */
+#define sched_queue_delayed_work(sched, wname, delay) \
+	do { \
+		if (!(sched)->reset.in_progress && \
+		    !panthor_device_reset_is_pending((sched)->ptdev)) \
+			mod_delayed_work((sched)->wq, &(sched)->wname ## _work, delay); \
+	} while (0)
+
+/*
+ * We currently set the maximum number of groups per file to an arbitrarily
+ * low value. This can be raised if we ever need more.
+ */
+#define MAX_GROUPS_PER_POOL 128
+
+/**
+ * struct panthor_group_pool - Group pool
+ *
+ * Each file gets assigned a group pool.
+ */
+struct panthor_group_pool {
+	/** @xa: Xarray used to manage group handles. */
+	struct xarray xa;
+};
+
+/**
+ * struct panthor_job - Used to manage a GPU job
+ */
+struct panthor_job {
+	/** @base: Inherit from drm_sched_job. */
+	struct drm_sched_job base;
+
+	/** @refcount: Reference count. */
+	struct kref refcount;
+
+	/** @group: Group of the queue this job will be pushed to. */
+	struct panthor_group *group;
+
+	/** @queue_idx: Index of the queue inside @group. */
+	u32 queue_idx;
+
+	/** @call_info: Information about the userspace command stream call. */
+	struct {
+		/** @start: GPU address of the userspace command stream. */
+		u64 start;
+
+		/** @size: Size of the userspace command stream. */
+		u32 size;
+
+		/**
+		 * @latest_flush: Flush ID at the time the userspace command
+		 * stream was built.
+		 *
+		 * Needed for the flush reduction mechanism.
+		 */
+		u32 latest_flush;
+	} call_info;
+
+	/** @ringbuf: Position of this job in the ring buffer. */
+	struct {
+		/** @start: Start offset. */
+		u64 start;
+
+		/** @end: End offset. */
+		u64 end;
+	} ringbuf;
+
+	/**
+	 * @node: Used to insert the job in the panthor_queue::fence_ctx::in_flight_jobs
+	 * list.
+	 */
+	struct list_head node;
+
+	/** @done_fence: Fence signaled when the job is finished or cancelled. */
+	struct dma_fence *done_fence;
+};
+
+static void group_free_queue(struct panthor_group *group, u32 idx)
+{
+	struct panthor_queue *queue = group->queues[idx];
+
+	if (IS_ERR_OR_NULL(queue))
+		return;
+
+	if (queue->entity.fence_context)
+		drm_sched_entity_destroy(&queue->entity);
+
+	if (queue->scheduler.ops)
+		drm_sched_fini(&queue->scheduler);
+
+	if (queue->syncwait.bo) {
+		panthor_gem_unmap_and_put(group->vm, queue->syncwait.bo,
+					  queue->syncwait.gpu_va,
+					  queue->syncwait.kmap);
+	}
+
+	if (!IS_ERR_OR_NULL(queue->ringbuf.bo)) {
+		panthor_gem_unmap_and_put(group->vm, queue->ringbuf.bo,
+					  queue->ringbuf.gpu_va,
+					  queue->ringbuf.kmap);
+	}
+
+	panthor_fw_mem_free(group->ptdev, queue->iface.mem);
+	kfree(queue);
+}
+
+static void group_release_work(struct work_struct *work)
+{
+	struct panthor_group *group = container_of(work,
+						   struct panthor_group,
+						   release_work);
+	struct panthor_device *ptdev = group->ptdev;
+	u32 i;
+
+	for (i = 0; i < group->queue_count; i++)
+		group_free_queue(group, i);
+
+	if (group->suspend_buf)
+		panthor_fw_mem_free(ptdev, group->suspend_buf);
+
+	if (group->protm_suspend_buf)
+		panthor_fw_mem_free(ptdev, group->protm_suspend_buf);
+
+	if (!IS_ERR_OR_NULL(group->syncobjs.bo)) {
+		panthor_gem_unmap_and_put(group->vm, group->syncobjs.bo,
+					  group->syncobjs.gpu_va, group->syncobjs.kmap);
+	}
+
+	panthor_vm_put(group->vm);
+	kfree(group);
+}
+
+static void group_release(struct kref *kref)
+{
+	struct panthor_group *group = container_of(kref,
+						   struct panthor_group,
+						   refcount);
+	struct panthor_device *ptdev = group->ptdev;
+
+	drm_WARN_ON(&ptdev->base, group->csg_id >= 0);
+	drm_WARN_ON(&ptdev->base, !list_empty(&group->run_node));
+	drm_WARN_ON(&ptdev->base, !list_empty(&group->wait_node));
+
+	queue_work(panthor_cleanup_wq, &group->release_work);
+}
+
+static void group_put(struct panthor_group *group)
+{
+	if (group)
+		kref_put(&group->refcount, group_release);
+}
+
+static struct panthor_group *
+group_get(struct panthor_group *group)
+{
+	if (group)
+		kref_get(&group->refcount);
+
+	return group;
+}
+
+/**
+ * group_bind_locked() - Bind a group to a group slot
+ * @group: Group.
+ * @csg_id: Slot.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+group_bind_locked(struct panthor_group *group, u32 csg_id)
+{
+	struct panthor_device *ptdev = group->ptdev;
+	struct panthor_csg_slot *csg_slot;
+	int ret;
+
+	if (drm_WARN_ON(&ptdev->base, group->csg_id != -1 || csg_id >= MAX_CSGS ||
+			ptdev->scheduler->csg_slots[csg_id].group))
+		return -EINVAL;
+
+	ret = panthor_vm_active(group->vm);
+	if (ret)
+		return ret;
+
+	csg_slot = &ptdev->scheduler->csg_slots[csg_id];
+	group_get(group);
+	group->csg_id = csg_id;
+
+	/* Dummy doorbell allocation: doorbell is assigned to the group and
+	 * all queues use the same doorbell.
+	 *
+	 * TODO: Implement LRU-based doorbell assignment, so the most often
+	 * updated queues get their own doorbell, thus avoiding useless checks
+	 * on queues belonging to the same group that are rarely updated.
+	 */
+	for (u32 i = 0; i < group->queue_count; i++)
+		group->queues[i]->doorbell_id = csg_id + 1;
+
+	csg_slot->group = group;
+
+	return 0;
+}
+
+/**
+ * group_unbind_locked() - Unbind a group from a slot.
+ * @group: Group to unbind.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+group_unbind_locked(struct panthor_group *group)
+{
+	struct panthor_device *ptdev = group->ptdev;
+	struct panthor_csg_slot *slot;
+
+	if (drm_WARN_ON(&ptdev->base, group->csg_id < 0 || group->csg_id >= MAX_CSGS))
+		return -EINVAL;
+
+	if (drm_WARN_ON(&ptdev->base, group->state == PANTHOR_CS_GROUP_ACTIVE))
+		return -EINVAL;
+
+	slot = &ptdev->scheduler->csg_slots[group->csg_id];
+	panthor_vm_idle(group->vm);
+	group->csg_id = -1;
+
+	for (u32 i = 0; i < group->queue_count; i++)
+		group->queues[i]->doorbell_id = -1;
+
+	slot->group = NULL;
+
+	group_put(group);
+	return 0;
+}
+
+/**
+ * cs_slot_prog_locked() - Program a queue slot
+ * @ptdev: Device.
+ * @csg_id: Group slot ID.
+ * @cs_id: Queue slot ID.
+ *
+ * Program a queue slot with the queue information so things can start being
+ * executed on this queue.
+ *
+ * The group slot must have a group bound to it already (group_bind_locked()).
+ */
+static void
+cs_slot_prog_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
+{
+	struct panthor_queue *queue = ptdev->scheduler->csg_slots[csg_id].group->queues[cs_id];
+	struct panthor_fw_cs_iface *cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
+
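+	/* Sync the ring-buffer extract pointer with the last value reported by
+	 * the FW before the queue is (re)started.
+	 */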
+	queue->iface.input->extract = queue->iface.output->extract;
+	drm_WARN_ON(&ptdev->base, queue->iface.input->insert < queue->iface.input->extract);
+
+	cs_iface->input->ringbuf_base = queue->ringbuf.gpu_va;
+	cs_iface->input->ringbuf_size = queue->ringbuf.bo->base.base.size;
+	cs_iface->input->ringbuf_input = panthor_fw_mem_va(queue->iface.mem);
+	cs_iface->input->ringbuf_output = panthor_fw_mem_va(queue->iface.mem) + PAGE_SIZE;
+	cs_iface->input->config = CS_CONFIG_PRIORITY(queue->priority) |
+				  CS_CONFIG_DOORBELL(queue->doorbell_id);
+	cs_iface->input->ack_irq_mask = ~0;
+	panthor_fw_update_reqs(cs_iface, req,
+			       CS_IDLE_SYNC_WAIT |
+			       CS_IDLE_EMPTY |
+			       CS_STATE_START |
+			       CS_EXTRACT_EVENT,
+			       CS_IDLE_SYNC_WAIT |
+			       CS_IDLE_EMPTY |
+			       CS_STATE_MASK |
+			       CS_EXTRACT_EVENT);
+	if (queue->iface.input->insert != queue->iface.input->extract && queue->timeout_suspended) {
+		drm_sched_resume_timeout(&queue->scheduler, queue->remaining_time);
+		queue->timeout_suspended = false;
+	}
+}
+
+/**
+ * cs_slot_reset_locked() - Reset a queue slot
+ * @ptdev: Device.
+ * @csg_id: Group slot.
+ * @cs_id: Queue slot.
+ *
+ * Change the queue slot state to STOP and suspend the queue timeout if
+ * the queue is not blocked.
+ *
+ * The group slot must have a group bound to it (group_bind_locked()).
+ */
+static int
+cs_slot_reset_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
+{
+	struct panthor_fw_cs_iface *cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
+	struct panthor_group *group = ptdev->scheduler->csg_slots[csg_id].group;
+	struct panthor_queue *queue = group->queues[cs_id];
+
+	panthor_fw_update_reqs(cs_iface, req,
+			       CS_STATE_STOP,
+			       CS_STATE_MASK);
+
+	/* If the queue is blocked, we want to keep the timeout running, so
+	 * we can detect unbounded waits and kill the group when that happens.
+	 */
+	if (!(group->blocked_queues & BIT(cs_id)) && !queue->timeout_suspended) {
+		queue->remaining_time = drm_sched_suspend_timeout(&queue->scheduler);
+		queue->timeout_suspended = true;
+		WARN_ON(queue->remaining_time > msecs_to_jiffies(JOB_TIMEOUT_MS));
+	}
+
+	return 0;
+}
+
+/**
+ * csg_slot_sync_priority_locked() - Synchronize the group slot priority
+ * @ptdev: Device.
+ * @csg_id: Group slot ID.
+ *
+ * Group slot priority update happens asynchronously. When we receive a
+ * %CSG_ENDPOINT_CONFIG, we know the update is effective, and can
+ * reflect it to our panthor_csg_slot object.
+ */
+static void
+csg_slot_sync_priority_locked(struct panthor_device *ptdev, u32 csg_id)
+{
+	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
+	struct panthor_fw_csg_iface *csg_iface;
+
+	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+	csg_slot->priority = (csg_iface->input->endpoint_req & CSG_EP_REQ_PRIORITY_MASK) >> 28;
+}
+
+/**
+ * cs_slot_sync_queue_state_locked() - Synchronize the queue slot state
+ * @ptdev: Device.
+ * @csg_id: Group slot.
+ * @cs_id: Queue slot.
+ *
+ * Queue state is updated on group suspend or STATUS_UPDATE event.
+ */
+static void
+cs_slot_sync_queue_state_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
+{
+	struct panthor_group *group = ptdev->scheduler->csg_slots[csg_id].group;
+	struct panthor_queue *queue = group->queues[cs_id];
+	struct panthor_fw_cs_iface *cs_iface =
+		panthor_fw_get_cs_iface(group->ptdev, csg_id, cs_id);
+	u32 status_wait_cond;
+
+	switch (cs_iface->output->status_blocked_reason) {
+	case CS_STATUS_BLOCKED_REASON_UNBLOCKED:
+		if (queue->iface.input->insert == queue->iface.output->extract &&
+		    cs_iface->output->status_scoreboards == 0)
+			group->idle_queues |= BIT(cs_id);
+		break;
+
+	case CS_STATUS_BLOCKED_REASON_SYNC_WAIT:
+		drm_WARN_ON(&ptdev->base, !list_empty(&group->wait_node));
+		list_move_tail(&group->wait_node, &group->ptdev->scheduler->groups.waiting);
+		group->blocked_queues |= BIT(cs_id);
+		queue->syncwait.gpu_va = cs_iface->output->status_wait_sync_ptr;
+		queue->syncwait.ref = cs_iface->output->status_wait_sync_value;
+		status_wait_cond = cs_iface->output->status_wait & CS_STATUS_WAIT_SYNC_COND_MASK;
+		queue->syncwait.gt = status_wait_cond == CS_STATUS_WAIT_SYNC_COND_GT;
+		if (cs_iface->output->status_wait & CS_STATUS_WAIT_SYNC_64B) {
+			u64 sync_val_hi = cs_iface->output->status_wait_sync_value_hi;
+
+			queue->syncwait.sync64 = true;
+			queue->syncwait.ref |= sync_val_hi << 32;
+		} else {
+			queue->syncwait.sync64 = false;
+		}
+		break;
+
+	default:
+		/* Other reasons are not blocking. Consider the queue as runnable
+		 * in those cases.
+		 */
+		break;
+	}
+}
+
+static void
+csg_slot_sync_queues_state_locked(struct panthor_device *ptdev, u32 csg_id)
+{
+	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
+	struct panthor_group *group = csg_slot->group;
+	u32 i;
+
+	group->idle_queues = 0;
+	group->blocked_queues = 0;
+
+	for (i = 0; i < group->queue_count; i++) {
+		if (group->queues[i])
+			cs_slot_sync_queue_state_locked(ptdev, csg_id, i);
+	}
+}
+
+static void
+csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
+{
+	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
+	struct panthor_fw_csg_iface *csg_iface;
+	struct panthor_group *group;
+	enum panthor_group_state new_state, old_state;
+
+	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+	group = csg_slot->group;
+
+	if (!group)
+		return;
+
+	old_state = group->state;
+	switch (csg_iface->output->ack & CSG_STATE_MASK) {
+	case CSG_STATE_START:
+	case CSG_STATE_RESUME:
+		new_state = PANTHOR_CS_GROUP_ACTIVE;
+		break;
+	case CSG_STATE_TERMINATE:
+		new_state = PANTHOR_CS_GROUP_TERMINATED;
+		break;
+	case CSG_STATE_SUSPEND:
+		new_state = PANTHOR_CS_GROUP_SUSPENDED;
+		break;
+	}
+
+	if (old_state == new_state)
+		return;
+
+	if (new_state == PANTHOR_CS_GROUP_SUSPENDED)
+		csg_slot_sync_queues_state_locked(ptdev, csg_id);
+
+	if (old_state == PANTHOR_CS_GROUP_ACTIVE) {
+		u32 i;
+
+		/* Reset the queue slots so we start from a clean
+		 * state when starting/resuming a new group on this
+		 * CSG slot. No wait needed here, and no need to ring
+		 * the doorbell either, since the CS slot will only be
+		 * re-used on the next CSG start operation.
+		 */
+		for (i = 0; i < group->queue_count; i++) {
+			if (group->queues[i])
+				cs_slot_reset_locked(ptdev, csg_id, i);
+		}
+	}
+
+	group->state = new_state;
+}
+
+static int
+csg_slot_prog_locked(struct panthor_device *ptdev, u32 csg_id, u32 priority)
+{
+	struct panthor_fw_csg_iface *csg_iface;
+	struct panthor_csg_slot *csg_slot;
+	struct panthor_group *group;
+	u32 queue_mask = 0, i;
+
+	if (priority > MAX_CSG_PRIO)
+		return -EINVAL;
+
+	if (drm_WARN_ON(&ptdev->base, csg_id >= MAX_CSGS))
+		return -EINVAL;
+
+	csg_slot = &ptdev->scheduler->csg_slots[csg_id];
+	group = csg_slot->group;
+	if (!group || group->state == PANTHOR_CS_GROUP_ACTIVE)
+		return 0;
+
+	csg_iface = panthor_fw_get_csg_iface(group->ptdev, csg_id);
+
+	for (i = 0; i < group->queue_count; i++) {
+		if (group->queues[i]) {
+			cs_slot_prog_locked(ptdev, csg_id, i);
+			queue_mask |= BIT(i);
+		}
+	}
+
+	csg_iface->input->allow_compute = group->compute_core_mask;
+	csg_iface->input->allow_fragment = group->fragment_core_mask;
+	csg_iface->input->allow_other = group->tiler_core_mask;
+	csg_iface->input->endpoint_req = CSG_EP_REQ_COMPUTE(group->max_compute_cores) |
+					 CSG_EP_REQ_FRAGMENT(group->max_fragment_cores) |
+					 CSG_EP_REQ_TILER(group->max_tiler_cores) |
+					 CSG_EP_REQ_PRIORITY(priority);
+	csg_iface->input->config = panthor_vm_as(group->vm);
+
+	if (group->suspend_buf)
+		csg_iface->input->suspend_buf = panthor_fw_mem_va(group->suspend_buf);
+	else
+		csg_iface->input->suspend_buf = 0;
+
+	if (group->protm_suspend_buf)
+		csg_iface->input->protm_suspend_buf = panthor_fw_mem_va(group->protm_suspend_buf);
+	else
+		csg_iface->input->protm_suspend_buf = 0;
+
+	csg_iface->input->ack_irq_mask = ~0;
+	panthor_fw_toggle_reqs(csg_iface, doorbell_req, doorbell_ack, queue_mask);
+	return 0;
+}
+
+static void
+cs_slot_process_fatal_event(struct panthor_device *ptdev,
+			    u32 csg_id, u32 cs_id)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+	struct panthor_group *group = csg_slot->group;
+	struct panthor_fw_cs_iface *cs_iface;
+	u32 fatal;
+	u64 info;
+
+	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
+	fatal = cs_iface->output->fatal;
+	info = cs_iface->output->fatal_info;
+	group->fatal_queues |= BIT(cs_id);
+	sched_queue_delayed_work(sched, tick, 0);
+	drm_warn(&ptdev->base,
+		 "CSG slot %d CS slot: %d\n"
+		 "CS_FATAL.EXCEPTION_TYPE: 0x%x (%s)\n"
+		 "CS_FATAL.EXCEPTION_DATA: 0x%x\n"
+		 "CS_FATAL_INFO.EXCEPTION_DATA: 0x%llx\n",
+		 csg_id, cs_id,
+		 (unsigned int)CS_EXCEPTION_TYPE(fatal),
+		 panthor_exception_name(ptdev, CS_EXCEPTION_TYPE(fatal)),
+		 (unsigned int)CS_EXCEPTION_DATA(fatal),
+		 info);
+}
+
+static void
+cs_slot_process_fault_event(struct panthor_device *ptdev,
+			    u32 csg_id, u32 cs_id)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+	struct panthor_group *group = csg_slot->group;
+	struct panthor_queue *queue = cs_id < group->queue_count ? group->queues[cs_id] : NULL;
+	struct panthor_fw_cs_iface *cs_iface;
+	u32 fault;
+	u64 info;
+
+	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
+	fault = cs_iface->output->fault;
+	info = cs_iface->output->fault_info;
+
+	if (queue && CS_EXCEPTION_TYPE(fault) == DRM_PANTHOR_EXCEPTION_CS_INHERIT_FAULT) {
+		u64 cs_extract = queue->iface.output->extract;
+		struct panthor_job *job;
+
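+		/* Flag the in-flight job covering the current extract position
+		 * as faulty, so its done fence carries an error back to the
+		 * submitter.
+		 */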
+		spin_lock(&queue->fence_ctx.lock);
+		list_for_each_entry(job, &queue->fence_ctx.in_flight_jobs, node) {
+			if (cs_extract >= job->ringbuf.end)
+				continue;
+
+			if (cs_extract < job->ringbuf.start)
+				break;
+
+			dma_fence_set_error(job->done_fence, -EINVAL);
+		}
+		spin_unlock(&queue->fence_ctx.lock);
+	}
+
+	drm_warn(&ptdev->base,
+		 "CSG slot %d CS slot: %d\n"
+		 "CS_FAULT.EXCEPTION_TYPE: 0x%x (%s)\n"
+		 "CS_FAULT.EXCEPTION_DATA: 0x%x\n"
+		 "CS_FAULT_INFO.EXCEPTION_DATA: 0x%llx\n",
+		 csg_id, cs_id,
+		 (unsigned int)CS_EXCEPTION_TYPE(fault),
+		 panthor_exception_name(ptdev, CS_EXCEPTION_TYPE(fault)),
+		 (unsigned int)CS_EXCEPTION_DATA(fault),
+		 info);
+}
+
+static void
+cs_slot_process_tiler_oom_event(struct panthor_device *ptdev,
+				u32 csg_id, u32 cs_id)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+	struct panthor_group *group = csg_slot->group;
+	struct panthor_fw_cs_iface *cs_iface;
+	struct panthor_heap_pool *heaps;
+	struct panthor_queue *queue;
+	u32 fault, vt_start, vt_end, frag_end;
+	u32 renderpasses_in_flight, pending_frag_count;
+	u64 info, heap_address, new_chunk_va;
+	int ret;
+
+	if (drm_WARN_ON(&ptdev->base, !group))
+		return;
+
+	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
+	queue = group->queues[cs_id];
+	heaps = panthor_vm_get_heap_pool(group->vm, false);
+	fault = cs_iface->output->fault;
+	info = cs_iface->output->fault_info;
+	heap_address = cs_iface->output->heap_address;
+	vt_start = cs_iface->output->heap_vt_start;
+	vt_end = cs_iface->output->heap_vt_end;
+	frag_end = cs_iface->output->heap_frag_end;
+	renderpasses_in_flight = vt_start - frag_end;
+	pending_frag_count = vt_end - frag_end;
+
+	if (!heaps || frag_end > vt_end || vt_end >= vt_start) {
+		ret = -EINVAL;
+	} else {
+		ret = panthor_heap_grow(heaps, heap_address,
+					renderpasses_in_flight,
+					pending_frag_count, &new_chunk_va);
+	}
+
+	if (!ret) {
+		cs_iface->input->heap_start = new_chunk_va;
+		cs_iface->input->heap_end = new_chunk_va;
+	} else if (ret == -EBUSY) {
+		cs_iface->input->heap_start = 0;
+		cs_iface->input->heap_end = 0;
+	} else {
+		group->fatal_queues |= BIT(cs_id);
+		sched_queue_delayed_work(sched, tick, 0);
+	}
+
+	panthor_heap_pool_put(heaps);
+}
+
+static bool cs_slot_process_irq(struct panthor_device *ptdev,
+				u32 csg_id, u32 cs_id)
+{
+	struct panthor_fw_cs_iface *cs_iface;
+	u32 req, ack, events;
+
+	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
+	req = cs_iface->input->req;
+	ack = cs_iface->output->ack;
+	events = (req ^ ack) & CS_EVT_MASK;
+
+	if (events & CS_FATAL)
+		cs_slot_process_fatal_event(ptdev, csg_id, cs_id);
+
+	if (events & CS_FAULT)
+		cs_slot_process_fault_event(ptdev, csg_id, cs_id);
+
+	if (events & CS_TILER_OOM)
+		cs_slot_process_tiler_oom_event(ptdev, csg_id, cs_id);
+
+	panthor_fw_update_reqs(cs_iface, req, ack,
+			       CS_FATAL | CS_FAULT | CS_TILER_OOM);
+
+	return (events & (CS_FAULT | CS_TILER_OOM)) != 0;
+}
+
+static void csg_slot_sync_idle_state_locked(struct panthor_device *ptdev, u32 csg_id)
+{
+	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
+	struct panthor_fw_csg_iface *csg_iface;
+
+	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+	csg_slot->idle = csg_iface->output->status_state & CSG_STATUS_STATE_IS_IDLE;
+}
+
+static void csg_slot_process_idle_event(struct panthor_device *ptdev, u32 csg_id)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+
+	mutex_lock(&sched->lock);
+	sched->might_have_idle_groups = true;
+	mutex_unlock(&sched->lock);
+
+	/* Schedule a tick so we can evict idle groups and schedule non-idle
+	 * ones. This will also update runtime PM and devfreq busy/idle states,
+	 * so the device can lower its frequency or get suspended.
+	 */
+	sched_queue_delayed_work(sched, tick, 0);
+}
+
+static void csg_slot_sync_update_locked(struct panthor_device *ptdev,
+					u32 csg_id)
+{
+	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
+	struct panthor_group *group = csg_slot->group;
+
+	if (group)
+		group_queue_work(group, sync_upd);
+
+	sched_queue_work(ptdev->scheduler, sync_upd);
+}
+
+static void csg_slot_process_sync_update_event(struct panthor_device *ptdev,
+					       u32 csg_id)
+{
+	mutex_lock(&ptdev->scheduler->lock);
+	csg_slot_sync_update_locked(ptdev, csg_id);
+	mutex_unlock(&ptdev->scheduler->lock);
+}
+
+static void
+csg_slot_process_progress_timer_event(struct panthor_device *ptdev, u32 csg_id)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+	struct panthor_group *group = csg_slot->group;
+
+	drm_warn(&ptdev->base, "CSG slot %d progress timeout\n", csg_id);
+
+	mutex_lock(&sched->lock);
+	group = csg_slot->group;
+	if (!drm_WARN_ON(&ptdev->base, !group))
+		group->timedout = true;
+	mutex_unlock(&sched->lock);
+
+	sched_queue_delayed_work(sched, tick, 0);
+}
+
+void panthor_sched_process_csg_irq(struct panthor_device *ptdev, u32 csg_id)
+{
+	u32 req, ack, cs_irq_req, cs_irq_ack, cs_irqs, csg_events;
+	struct panthor_fw_csg_iface *csg_iface;
+	u32 ring_cs_db_mask = 0;
+
+	if (drm_WARN_ON(&ptdev->base, csg_id >= ptdev->scheduler->csg_slot_count))
+		return;
+
+	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+	req = READ_ONCE(csg_iface->input->req);
+	ack = READ_ONCE(csg_iface->output->ack);
+	cs_irq_req = READ_ONCE(csg_iface->output->cs_irq_req);
+	cs_irq_ack = READ_ONCE(csg_iface->input->cs_irq_ack);
+	csg_events = (req ^ ack) & CSG_EVT_MASK;
+
+	/* There may not be any pending CSG/CS interrupts to process */
+	if (req == ack && cs_irq_req == cs_irq_ack)
+		return;
+
+	/* Immediately set the IRQ_ACK bits to be the same as the IRQ_REQ bits
+	 * before examining the CS_ACK & CS_REQ bits. This ensures the host
+	 * doesn't miss an interrupt for a CS in the race scenario where,
+	 * while the host is servicing an interrupt for that CS, the firmware
+	 * sends another interrupt for the same CS.
+	 */
+	csg_iface->input->cs_irq_ack = cs_irq_req;
+
+	panthor_fw_update_reqs(csg_iface, req, ack,
+			       CSG_SYNC_UPDATE |
+			       CSG_IDLE |
+			       CSG_PROGRESS_TIMER_EVENT);
+
+	if (csg_events & CSG_IDLE)
+		csg_slot_process_idle_event(ptdev, csg_id);
+
+	if (csg_events & CSG_PROGRESS_TIMER_EVENT)
+		csg_slot_process_progress_timer_event(ptdev, csg_id);
+
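+	/* CS slots with a pending interrupt are those whose IRQ_REQ and
+	 * IRQ_ACK bits differ.
+	 */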
+	cs_irqs = cs_irq_req ^ cs_irq_ack;
+	while (cs_irqs) {
+		u32 cs_id = ffs(cs_irqs) - 1;
+
+		if (cs_slot_process_irq(ptdev, csg_id, cs_id))
+			ring_cs_db_mask |= BIT(cs_id);
+
+		cs_irqs &= ~BIT(cs_id);
+	}
+
+	if (csg_events & CSG_SYNC_UPDATE)
+		csg_slot_process_sync_update_event(ptdev, csg_id);
+
+	if (ring_cs_db_mask)
+		panthor_fw_toggle_reqs(csg_iface, doorbell_req, doorbell_ack, ring_cs_db_mask);
+
+	panthor_fw_ring_csg_doorbells(ptdev, BIT(csg_id));
+}
+
+static void sched_process_idle_event(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
+	/* Acknowledge the idle event and schedule a tick. */
+	panthor_fw_update_reqs(glb_iface, req, glb_iface->output->ack, GLB_IDLE);
+	sched_queue_delayed_work(ptdev->scheduler, tick, 0);
+}
+
+/**
+ * panthor_sched_process_global_irq() - Process the scheduling part of a global IRQ
+ * @ptdev: Device.
+ */
+void panthor_sched_process_global_irq(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	u32 req, ack, evts;
+
+	req = READ_ONCE(glb_iface->input->req);
+	ack = READ_ONCE(glb_iface->output->ack);
+	evts = (req ^ ack) & GLB_EVT_MASK;
+
+	if (evts & GLB_IDLE)
+		sched_process_idle_event(ptdev);
+}
+
+static const char *fence_get_driver_name(struct dma_fence *fence)
+{
+	return "panthor";
+}
+
+static const char *queue_fence_get_timeline_name(struct dma_fence *fence)
+{
+	return "queue-fence";
+}
+
+static const struct dma_fence_ops panthor_queue_fence_ops = {
+	.get_driver_name = fence_get_driver_name,
+	.get_timeline_name = queue_fence_get_timeline_name,
+};
+
+/**
+ * struct panthor_csg_slots_upd_ctx - Context used to batch CSG slot update requests
+ */
+struct panthor_csg_slots_upd_ctx {
+	/** @update_mask: Bitmask of CSG slots that have a pending request. */
+	u32 update_mask;
+
+	/** @timedout_mask: Bitmask of CSG slots whose request timed out. */
+	u32 timedout_mask;
+
+	/** @requests: Pending request value/mask pairs, one per CSG slot. */
+	struct {
+		/** @value: Request value. */
+		u32 value;
+
+		/** @mask: Request mask. */
+		u32 mask;
+	} requests[MAX_CSGS];
+};
+
+static void csgs_upd_ctx_init(struct panthor_csg_slots_upd_ctx *ctx)
+{
+	memset(ctx, 0, sizeof(*ctx));
+}
+
+static void csgs_upd_ctx_queue_reqs(struct panthor_device *ptdev,
+				    struct panthor_csg_slots_upd_ctx *ctx,
+				    u32 csg_id, u32 value, u32 mask)
+{
+	if (drm_WARN_ON(&ptdev->base, !mask) ||
+	    drm_WARN_ON(&ptdev->base, csg_id >= ptdev->scheduler->csg_slot_count))
+		return;
+
+	ctx->requests[csg_id].value = (ctx->requests[csg_id].value & ~mask) | (value & mask);
+	ctx->requests[csg_id].mask |= mask;
+	ctx->update_mask |= BIT(csg_id);
+}
+
+static int csgs_upd_ctx_apply_locked(struct panthor_device *ptdev,
+				     struct panthor_csg_slots_upd_ctx *ctx)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	u32 update_slots = ctx->update_mask;
+
+	lockdep_assert_held(&sched->lock);
+
+	if (!ctx->update_mask)
+		return 0;
+
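+	/* First pass: push the queued request bits to every modified CSG
+	 * slot interface.
+	 */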
+	while (update_slots) {
+		struct panthor_fw_csg_iface *csg_iface;
+		u32 csg_id = ffs(update_slots) - 1;
+
+		update_slots &= ~BIT(csg_id);
+		csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+		panthor_fw_update_reqs(csg_iface, req,
+				       ctx->requests[csg_id].value,
+				       ctx->requests[csg_id].mask);
+	}
+
+	panthor_fw_ring_csg_doorbells(ptdev, ctx->update_mask);
+
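+	/* Second pass: wait for the FW to acknowledge each request, sync the
+	 * slot state accordingly, and flag slots that timed out.
+	 */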
+	update_slots = ctx->update_mask;
+	while (update_slots) {
+		struct panthor_fw_csg_iface *csg_iface;
+		u32 csg_id = ffs(update_slots) - 1;
+		u32 req_mask = ctx->requests[csg_id].mask, acked;
+		int ret;
+
+		update_slots &= ~BIT(csg_id);
+		csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+
+		ret = panthor_fw_csg_wait_acks(ptdev, csg_id, req_mask, &acked, 100);
+
+		if (acked & CSG_ENDPOINT_CONFIG)
+			csg_slot_sync_priority_locked(ptdev, csg_id);
+
+		if (acked & CSG_STATE_MASK)
+			csg_slot_sync_state_locked(ptdev, csg_id);
+
+		if (acked & CSG_STATUS_UPDATE) {
+			csg_slot_sync_queues_state_locked(ptdev, csg_id);
+			csg_slot_sync_idle_state_locked(ptdev, csg_id);
+		}
+
+		if (ret && acked != req_mask &&
+		    ((csg_iface->input->req ^ csg_iface->output->ack) & req_mask) != 0) {
+			drm_err(&ptdev->base, "CSG %d update request timedout", csg_id);
+			ctx->timedout_mask |= BIT(csg_id);
+		}
+	}
+
+	if (ctx->timedout_mask)
+		return -ETIMEDOUT;
+
+	return 0;
+}
+
+struct panthor_sched_tick_ctx {
+	struct list_head old_groups[PANTHOR_CSG_PRIORITY_COUNT];
+	struct list_head groups[PANTHOR_CSG_PRIORITY_COUNT];
+	u32 idle_group_count;
+	u32 group_count;
+	enum panthor_csg_priority min_priority;
+	struct panthor_vm *vms[MAX_CS_PER_CSG];
+	u32 as_count;
+	bool immediate_tick;
+	u32 csg_upd_failed_mask;
+};
+
+static bool
+tick_ctx_is_full(const struct panthor_scheduler *sched,
+		 const struct panthor_sched_tick_ctx *ctx)
+{
+	return ctx->group_count == sched->csg_slot_count;
+}
+
+static bool
+group_is_idle(struct panthor_group *group)
+{
+	struct panthor_device *ptdev = group->ptdev;
+	u32 inactive_queues;
+
+	if (group->csg_id >= 0)
+		return ptdev->scheduler->csg_slots[group->csg_id].idle;
+
+	inactive_queues = group->idle_queues | group->blocked_queues;
+	return hweight32(inactive_queues) == group->queue_count;
+}
+
+static bool
+group_can_run(struct panthor_group *group)
+{
+	return group->state != PANTHOR_CS_GROUP_TERMINATED &&
+	       !group->destroyed && group->fatal_queues == 0 &&
+	       !group->timedout;
+}
+
+static void
+tick_ctx_pick_groups_from_list(const struct panthor_scheduler *sched,
+			       struct panthor_sched_tick_ctx *ctx,
+			       struct list_head *queue,
+			       bool skip_idle_groups,
+			       bool owned_by_tick_ctx)
+{
+	struct panthor_group *group, *tmp;
+
+	if (tick_ctx_is_full(sched, ctx))
+		return;
+
+	list_for_each_entry_safe(group, tmp, queue, run_node) {
+		u32 i;
+
+		if (!group_can_run(group))
+			continue;
+
+		if (skip_idle_groups && group_is_idle(group))
+			continue;
+
+		for (i = 0; i < ctx->as_count; i++) {
+			if (ctx->vms[i] == group->vm)
+				break;
+		}
+
+		if (i == ctx->as_count && ctx->as_count == sched->as_slot_count)
+			continue;
+
+		if (!owned_by_tick_ctx)
+			group_get(group);
+
+		list_move_tail(&group->run_node, &ctx->groups[group->priority]);
+		ctx->group_count++;
+		if (group_is_idle(group))
+			ctx->idle_group_count++;
+
+		if (i == ctx->as_count)
+			ctx->vms[ctx->as_count++] = group->vm;
+
+		if (ctx->min_priority > group->priority)
+			ctx->min_priority = group->priority;
+
+		if (tick_ctx_is_full(sched, ctx))
+			return;
+	}
+}
+
+static void
+tick_ctx_insert_old_group(struct panthor_scheduler *sched,
+			  struct panthor_sched_tick_ctx *ctx,
+			  struct panthor_group *group,
+			  bool full_tick)
+{
+	struct panthor_csg_slot *csg_slot = &sched->csg_slots[group->csg_id];
+	struct panthor_group *other_group;
+
+	if (!full_tick) {
+		list_add_tail(&group->run_node, &ctx->old_groups[group->priority]);
+		return;
+	}
+
+	/* Rotate to make sure groups with lower CSG slot
+	 * priorities have a chance to get a higher CSG slot
+	 * priority next time they get picked. This priority
+	 * has an impact on resource request ordering, so it's
+	 * important to make sure we don't let one group starve
+	 * all other groups with the same group priority.
+	 */
+	list_for_each_entry(other_group,
+			    &ctx->old_groups[csg_slot->group->priority],
+			    run_node) {
+		struct panthor_csg_slot *other_csg_slot = &sched->csg_slots[other_group->csg_id];
+
+		if (other_csg_slot->priority > csg_slot->priority) {
+			list_add_tail(&csg_slot->group->run_node, &other_group->run_node);
+			return;
+		}
+	}
+
+	list_add_tail(&group->run_node, &ctx->old_groups[group->priority]);
+}
+
+static void
+tick_ctx_init(struct panthor_scheduler *sched,
+	      struct panthor_sched_tick_ctx *ctx,
+	      bool full_tick)
+{
+	struct panthor_device *ptdev = sched->ptdev;
+	struct panthor_csg_slots_upd_ctx upd_ctx;
+	int ret;
+	u32 i;
+
+	memset(ctx, 0, sizeof(*ctx));
+	csgs_upd_ctx_init(&upd_ctx);
+
+	ctx->min_priority = PANTHOR_CSG_PRIORITY_COUNT;
+	for (i = 0; i < ARRAY_SIZE(ctx->groups); i++) {
+		INIT_LIST_HEAD(&ctx->groups[i]);
+		INIT_LIST_HEAD(&ctx->old_groups[i]);
+	}
+
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		struct panthor_csg_slot *csg_slot = &sched->csg_slots[i];
+		struct panthor_fw_csg_iface *csg_iface;
+
+		csg_iface = panthor_fw_get_csg_iface(ptdev, i);
+		if (csg_slot->group) {
+			group_get(csg_slot->group);
+			tick_ctx_insert_old_group(sched, ctx, csg_slot->group, full_tick);
+			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, i,
+						csg_iface->output->ack ^ CSG_STATUS_UPDATE,
+						CSG_STATUS_UPDATE);
+		}
+	}
+
+	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
+	if (ret) {
+		panthor_device_schedule_reset(ptdev);
+		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
+	}
+}
+
+#define NUM_INSTRS_PER_SLOT		16
+
+static void
+group_term_post_processing(struct panthor_group *group)
+{
+	struct panthor_job *job, *tmp;
+	LIST_HEAD(faulty_jobs);
+	bool cookie;
+	u32 i = 0;
+
+	if (drm_WARN_ON(&group->ptdev->base, group_can_run(group)))
+		return;
+
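+	/* We are about to signal job fences: annotate the fence signalling
+	 * critical section for lockdep.
+	 */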
+	cookie = dma_fence_begin_signalling();
+	for (i = 0; i < group->queue_count; i++) {
+		struct panthor_queue *queue = group->queues[i];
+		struct panthor_syncobj_64b *syncobj;
+		int err;
+
+		if (group->fatal_queues & BIT(i))
+			err = -EINVAL;
+		else if (group->timedout)
+			err = -ETIMEDOUT;
+		else
+			err = -ECANCELED;
+
+		if (!queue)
+			continue;
+
+		spin_lock(&queue->fence_ctx.lock);
+		list_for_each_entry_safe(job, tmp, &queue->fence_ctx.in_flight_jobs, node) {
+			list_move_tail(&job->node, &faulty_jobs);
+			dma_fence_set_error(job->done_fence, err);
+			dma_fence_signal_locked(job->done_fence);
+		}
+		spin_unlock(&queue->fence_ctx.lock);
+
+		/* Manually update the syncobj seqno to unblock waiters. */
+		syncobj = group->syncobjs.kmap + (i * sizeof(*syncobj));
+		syncobj->status = ~0;
+		syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
+		sched_queue_work(group->ptdev->scheduler, sync_upd);
+	}
+	dma_fence_end_signalling(cookie);
+
+	list_for_each_entry_safe(job, tmp, &faulty_jobs, node) {
+		list_del_init(&job->node);
+		panthor_job_put(&job->base);
+	}
+}
+
+static void group_term_work(struct work_struct *work)
+{
+	struct panthor_group *group =
+		container_of(work, struct panthor_group, term_work);
+
+	group_term_post_processing(group);
+	group_put(group);
+}
+
+static void
+tick_ctx_cleanup(struct panthor_scheduler *sched,
+		 struct panthor_sched_tick_ctx *ctx)
+{
+	struct panthor_group *group, *tmp;
+	u32 i;
+
+	for (i = 0; i < ARRAY_SIZE(ctx->old_groups); i++) {
+		list_for_each_entry_safe(group, tmp, &ctx->old_groups[i], run_node) {
+			/* If everything went fine, we should only have groups
+			 * to be terminated in the old_groups lists.
+			 */
+			drm_WARN_ON(&group->ptdev->base, !ctx->csg_upd_failed_mask &&
+				    group_can_run(group));
+
+			if (!group_can_run(group)) {
+				list_del_init(&group->run_node);
+				list_del_init(&group->wait_node);
+				group_queue_work(group, term);
+			} else if (group->csg_id >= 0) {
+				list_del_init(&group->run_node);
+			} else {
+				list_move(&group->run_node,
+					  group_is_idle(group) ?
+					  &sched->groups.idle[group->priority] :
+					  &sched->groups.runnable[group->priority]);
+			}
+			group_put(group);
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(ctx->groups); i++) {
+		/* If everything went fine, the groups to schedule lists should
+		 * be empty.
+		 */
+		drm_WARN_ON(&sched->ptdev->base,
+			    !ctx->csg_upd_failed_mask && !list_empty(&ctx->groups[i]));
+
+		list_for_each_entry_safe(group, tmp, &ctx->groups[i], run_node) {
+			if (group->csg_id >= 0) {
+				list_del_init(&group->run_node);
+			} else {
+				list_move(&group->run_node,
+					  group_is_idle(group) ?
+					  &sched->groups.idle[group->priority] :
+					  &sched->groups.runnable[group->priority]);
+			}
+			group_put(group);
+		}
+	}
+}
+
+static void
+tick_ctx_apply(struct panthor_scheduler *sched, struct panthor_sched_tick_ctx *ctx)
+{
+	struct panthor_group *group, *tmp;
+	struct panthor_device *ptdev = sched->ptdev;
+	struct panthor_csg_slot *csg_slot;
+	int prio, new_csg_prio = MAX_CSG_PRIO, i;
+	u32 csg_mod_mask = 0, free_csg_slots = 0;
+	struct panthor_csg_slots_upd_ctx upd_ctx;
+	int ret;
+
+	csgs_upd_ctx_init(&upd_ctx);
+
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		/* Suspend or terminate evicted groups. */
+		list_for_each_entry(group, &ctx->old_groups[prio], run_node) {
+			struct panthor_fw_csg_iface *csg_iface;
+			bool term = !group_can_run(group);
+			int csg_id = group->csg_id;
+
+			if (drm_WARN_ON(&ptdev->base, csg_id < 0))
+				continue;
+
+			csg_slot = &sched->csg_slots[csg_id];
+			csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
+						term ? CSG_STATE_TERMINATE : CSG_STATE_SUSPEND,
+						CSG_STATE_MASK);
+		}
+
+		/* Update priorities on already running groups. */
+		list_for_each_entry(group, &ctx->groups[prio], run_node) {
+			struct panthor_fw_csg_iface *csg_iface;
+			int csg_id = group->csg_id;
+
+			if (csg_id < 0) {
+				new_csg_prio--;
+				continue;
+			}
+
+			csg_slot = &sched->csg_slots[csg_id];
+			csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+			if (csg_slot->priority == new_csg_prio) {
+				new_csg_prio--;
+				continue;
+			}
+
+			panthor_fw_update_reqs(csg_iface, endpoint_req,
+					       CSG_EP_REQ_PRIORITY(new_csg_prio),
+					       CSG_EP_REQ_PRIORITY_MASK);
+			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
+						csg_iface->output->ack ^ CSG_ENDPOINT_CONFIG,
+						CSG_ENDPOINT_CONFIG);
+			new_csg_prio--;
+		}
+	}
+
+	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
+	if (ret) {
+		panthor_device_schedule_reset(ptdev);
+		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
+		return;
+	}
+
+	/* Unbind evicted groups. */
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		list_for_each_entry(group, &ctx->old_groups[prio], run_node) {
+			group_unbind_locked(group);
+		}
+	}
+
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		if (!sched->csg_slots[i].group)
+			free_csg_slots |= BIT(i);
+	}
+
+	csgs_upd_ctx_init(&upd_ctx);
+	new_csg_prio = MAX_CSG_PRIO;
+
+	/* Start new groups. */
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		list_for_each_entry(group, &ctx->groups[prio], run_node) {
+			int csg_id = group->csg_id;
+			struct panthor_fw_csg_iface *csg_iface;
+
+			if (csg_id >= 0) {
+				new_csg_prio--;
+				continue;
+			}
+
+			csg_id = ffs(free_csg_slots) - 1;
+			if (drm_WARN_ON(&ptdev->base, csg_id < 0))
+				break;
+
+			csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
+			csg_slot = &sched->csg_slots[csg_id];
+			csg_mod_mask |= BIT(csg_id);
+			group_bind_locked(group, csg_id);
+			csg_slot_prog_locked(ptdev, csg_id, new_csg_prio--);
+			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
+						group->state == PANTHOR_CS_GROUP_SUSPENDED ?
+						CSG_STATE_RESUME : CSG_STATE_START,
+						CSG_STATE_MASK);
+			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
+						csg_iface->output->ack ^ CSG_ENDPOINT_CONFIG,
+						CSG_ENDPOINT_CONFIG);
+			free_csg_slots &= ~BIT(csg_id);
+		}
+	}
+
+	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
+	if (ret) {
+		panthor_device_schedule_reset(ptdev);
+		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
+		return;
+	}
+
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		list_for_each_entry_safe(group, tmp, &ctx->groups[prio], run_node) {
+			list_del_init(&group->run_node);
+
+			/* If the group has been destroyed while we were
+			 * scheduling, ask for an immediate tick to
+			 * re-evaluate as soon as possible and get rid of
+			 * this dangling group.
+			 */
+			if (group->destroyed)
+				ctx->immediate_tick = true;
+			group_put(group);
+		}
+
+		/* Return evicted groups to the idle or run queues. Groups
+		 * that can no longer be run (because they've been destroyed
+		 * or experienced an unrecoverable error) will be scheduled
+		 * for destruction in tick_ctx_cleanup().
+		 */
+		list_for_each_entry_safe(group, tmp, &ctx->old_groups[prio], run_node) {
+			if (!group_can_run(group))
+				continue;
+
+			if (group_is_idle(group))
+				list_move_tail(&group->run_node, &sched->groups.idle[prio]);
+			else
+				list_move_tail(&group->run_node, &sched->groups.runnable[prio]);
+			group_put(group);
+		}
+	}
+
+	sched->used_csg_slot_count = ctx->group_count;
+	sched->might_have_idle_groups = ctx->idle_group_count > 0;
+}
+
+static u64
+tick_ctx_update_resched_target(struct panthor_scheduler *sched,
+			       const struct panthor_sched_tick_ctx *ctx)
+{
+	/* We had space left, no need to reschedule until some external event happens. */
+	if (!tick_ctx_is_full(sched, ctx))
+		goto no_tick;
+
+	/* If idle groups were scheduled, no need to wake up until some external
+	 * event happens (group unblocked, new job submitted, ...).
+	 */
+	if (ctx->idle_group_count)
+		goto no_tick;
+
+	if (drm_WARN_ON(&sched->ptdev->base, ctx->min_priority >= PANTHOR_CSG_PRIORITY_COUNT))
+		goto no_tick;
+
+	/* If there are groups of the same priority waiting, we need to
+	 * keep the scheduler ticking, otherwise, we'll just wait for
+	 * new groups with higher priority to be queued.
+	 */
+	if (!list_empty(&sched->groups.runnable[ctx->min_priority])) {
+		u64 resched_target = sched->last_tick + sched->tick_period;
+
+		if (time_before64(sched->resched_target, sched->last_tick) ||
+		    time_before64(resched_target, sched->resched_target))
+			sched->resched_target = resched_target;
+
+		return sched->resched_target - sched->last_tick;
+	}
+
+no_tick:
+	sched->resched_target = U64_MAX;
+	return U64_MAX;
+}
+
+static void tick_work(struct work_struct *work)
+{
+	struct panthor_scheduler *sched = container_of(work, struct panthor_scheduler,
+						      tick_work.work);
+	struct panthor_device *ptdev = sched->ptdev;
+	struct panthor_sched_tick_ctx ctx;
+	u64 remaining_jiffies = 0, resched_delay;
+	u64 now = get_jiffies_64();
+	int prio, ret, cookie;
+
+	if (!drm_dev_enter(&ptdev->base, &cookie))
+		return;
+
+	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	if (drm_WARN_ON(&ptdev->base, ret))
+		goto out_dev_exit;
+
+	if (time_before64(now, sched->resched_target))
+		remaining_jiffies = sched->resched_target - now;
+
+	mutex_lock(&sched->lock);
+	if (panthor_device_reset_is_pending(sched->ptdev))
+		goto out_unlock;
+
+	tick_ctx_init(sched, &ctx, remaining_jiffies != 0);
+	if (ctx.csg_upd_failed_mask)
+		goto out_cleanup_ctx;
+
+	if (remaining_jiffies) {
+		/* Scheduling forced in the middle of a tick. Only RT groups
+		 * can preempt non-RT ones. Currently running RT groups can't be
+		 * preempted.
+		 */
+		for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1;
+		     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
+		     prio--) {
+			tick_ctx_pick_groups_from_list(sched, &ctx, &ctx.old_groups[prio],
+						       true, true);
+			if (prio == PANTHOR_CSG_PRIORITY_RT) {
+				tick_ctx_pick_groups_from_list(sched, &ctx,
+							       &sched->groups.runnable[prio],
+							       true, false);
+			}
+		}
+	}
+
+	/* First pick non-idle groups */
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1;
+	     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
+	     prio--) {
+		tick_ctx_pick_groups_from_list(sched, &ctx, &sched->groups.runnable[prio],
+					       true, false);
+		tick_ctx_pick_groups_from_list(sched, &ctx, &ctx.old_groups[prio], true, true);
+	}
+
+	/* If we have free CSG slots left, pick idle groups */
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1;
+	     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
+	     prio--) {
+		/* Check the old_group queue first to avoid reprogramming the slots */
+		tick_ctx_pick_groups_from_list(sched, &ctx, &ctx.old_groups[prio], false, true);
+		tick_ctx_pick_groups_from_list(sched, &ctx, &sched->groups.idle[prio],
+					       false, false);
+	}
+
+	tick_ctx_apply(sched, &ctx);
+	if (ctx.csg_upd_failed_mask)
+		goto out_cleanup_ctx;
+
+	if (ctx.idle_group_count == ctx.group_count) {
+		panthor_devfreq_record_idle(sched->ptdev);
+		if (sched->pm.has_ref) {
+			pm_runtime_put_autosuspend(ptdev->base.dev);
+			sched->pm.has_ref = false;
+		}
+	} else {
+		panthor_devfreq_record_busy(sched->ptdev);
+		if (!sched->pm.has_ref) {
+			pm_runtime_get(ptdev->base.dev);
+			sched->pm.has_ref = true;
+		}
+	}
+
+	sched->last_tick = now;
+	resched_delay = tick_ctx_update_resched_target(sched, &ctx);
+	if (ctx.immediate_tick)
+		resched_delay = 0;
+
+	if (resched_delay != U64_MAX)
+		sched_queue_delayed_work(sched, tick, resched_delay);
+
+out_cleanup_ctx:
+	tick_ctx_cleanup(sched, &ctx);
+
+out_unlock:
+	mutex_unlock(&sched->lock);
+	pm_runtime_mark_last_busy(ptdev->base.dev);
+	pm_runtime_put_autosuspend(ptdev->base.dev);
+
+out_dev_exit:
+	drm_dev_exit(cookie);
+}
+
+static void *
+panthor_queue_get_syncwait_obj(struct panthor_group *group, struct panthor_queue *queue)
+{
+	struct panthor_device *ptdev = group->ptdev;
+	struct iosys_map map;
+	int ret;
+
+	if (queue->syncwait.kmap)
+		return queue->syncwait.kmap + queue->syncwait.offset;
+
+	if (!queue->syncwait.bo) {
+		queue->syncwait.bo = panthor_vm_get_bo_for_va(group->vm,
+							      queue->syncwait.gpu_va,
+							      &queue->syncwait.offset);
+		if (drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(queue->syncwait.bo)))
+			return NULL;
+	}
+
+	ret = drm_gem_vmap_unlocked(&queue->syncwait.bo->base.base, &map);
+	if (drm_WARN_ON(&ptdev->base, ret))
+		return NULL;
+
+	queue->syncwait.kmap = map.vaddr;
+	if (drm_WARN_ON(&ptdev->base, !queue->syncwait.kmap))
+		return NULL;
+
+	return queue->syncwait.kmap + queue->syncwait.offset;
+}
+
+static int panthor_queue_eval_syncwait(struct panthor_group *group, u8 queue_idx)
+{
+	struct panthor_queue *queue = group->queues[queue_idx];
+	union {
+		struct panthor_syncobj_64b sync64;
+		struct panthor_syncobj_32b sync32;
+	} *syncobj;
+	bool result;
+	u64 value;
+
+	syncobj = panthor_queue_get_syncwait_obj(group, queue);
+	if (!syncobj)
+		return -EINVAL;
+
+	value = queue->syncwait.sync64 ?
+		syncobj->sync64.seqno :
+		syncobj->sync32.seqno;
+
+	if (queue->syncwait.gt)
+		result = value > queue->syncwait.ref;
+	else
+		result = value <= queue->syncwait.ref;
+
+	if (result) {
+		panthor_gem_unmap_and_put(group->vm, queue->syncwait.bo,
+					  queue->syncwait.gpu_va,
+					  queue->syncwait.kmap);
+		/* Clear the cached pointers so the released mapping/BO don't
+		 * get re-used or released a second time.
+		 */
+		queue->syncwait.bo = NULL;
+		queue->syncwait.kmap = NULL;
+		return 1;
+	}
+
+	return 0;
+}
+
+static void sync_upd_work(struct work_struct *work)
+{
+	struct panthor_scheduler *sched = container_of(work,
+						      struct panthor_scheduler,
+						      sync_upd_work);
+	struct panthor_group *group, *tmp;
+	bool immediate_tick = false;
+
+	mutex_lock(&sched->lock);
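+	/* Re-evaluate the sync-wait condition of every blocked queue, and move
+	 * groups whose queues got unblocked back to the runnable lists.
+	 */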
+	list_for_each_entry_safe(group, tmp, &sched->groups.waiting, wait_node) {
+		u32 tested_queues = group->blocked_queues;
+		u32 unblocked_queues = 0;
+
+		while (tested_queues) {
+			u32 cs_id = ffs(tested_queues) - 1;
+			int ret;
+
+			ret = panthor_queue_eval_syncwait(group, cs_id);
+			drm_WARN_ON(&group->ptdev->base, ret < 0);
+			if (ret)
+				unblocked_queues |= BIT(cs_id);
+
+			tested_queues &= ~BIT(cs_id);
+		}
+
+		if (unblocked_queues) {
+			group->blocked_queues &= ~unblocked_queues;
+
+			if (group->csg_id < 0) {
+				list_move(&group->run_node,
+					  &sched->groups.runnable[group->priority]);
+				if (group->priority == PANTHOR_CSG_PRIORITY_RT)
+					immediate_tick = true;
+			}
+		}
+
+		if (!group->blocked_queues)
+			list_del_init(&group->wait_node);
+	}
+	mutex_unlock(&sched->lock);
+
+	if (immediate_tick)
+		sched_queue_delayed_work(sched, tick, 0);
+}
+
+static void group_schedule_locked(struct panthor_group *group, u32 queue_mask)
+{
+	struct panthor_device *ptdev = group->ptdev;
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct list_head *queue = &sched->groups.runnable[group->priority];
+	u64 delay_jiffies = 0;
+	bool was_idle;
+	u64 now;
+
+	if (!group_can_run(group))
+		return;
+
+	/* All updated queues are blocked, no need to wake up the scheduler. */
+	if ((queue_mask & group->blocked_queues) == queue_mask)
+		return;
+
+	was_idle = group_is_idle(group);
+	group->idle_queues &= ~queue_mask;
+	if (was_idle && !group_is_idle(group))
+		list_move_tail(&group->run_node, queue);
+
+	/* RT groups are preemptive. */
+	if (group->priority == PANTHOR_CSG_PRIORITY_RT) {
+		sched_queue_delayed_work(sched, tick, 0);
+		return;
+	}
+
+	/* Some groups might be idle, force an immediate tick to
+	 * re-evaluate.
+	 */
+	if (sched->might_have_idle_groups) {
+		sched_queue_delayed_work(sched, tick, 0);
+		return;
+	}
+
+	/* Scheduler is ticking, nothing to do. */
+	if (sched->resched_target != U64_MAX) {
+		/* If there are free slots, force immediate ticking. */
+		if (sched->used_csg_slot_count < sched->csg_slot_count)
+			sched_queue_delayed_work(sched, tick, 0);
+
+		return;
+	}
+
+	/* Scheduler tick was off, recalculate the resched_target based on the
+	 * last tick event, and queue the scheduler work.
+	 */
+	now = get_jiffies_64();
+	sched->resched_target = sched->last_tick + sched->tick_period;
+	if (sched->used_csg_slot_count == sched->csg_slot_count &&
+	    time_before64(now, sched->resched_target))
+		delay_jiffies = min_t(unsigned long, sched->resched_target - now, ULONG_MAX);
+
+	sched_queue_delayed_work(sched, tick, delay_jiffies);
+}
+
+static void queue_stop(struct panthor_queue *queue,
+		       struct panthor_job *bad_job)
+{
+	drm_sched_stop(&queue->scheduler, bad_job ? &bad_job->base : NULL);
+}
+
+static void queue_start(struct panthor_queue *queue)
+{
+	struct panthor_job *job;
+
+	/* Re-assign the parent fences. */
+	list_for_each_entry(job, &queue->scheduler.pending_list, base.list)
+		job->base.s_fence->parent = dma_fence_get(job->done_fence);
+
+	drm_sched_start(&queue->scheduler, true);
+}
+
+static void panthor_group_stop(struct panthor_group *group)
+{
+	struct panthor_scheduler *sched = group->ptdev->scheduler;
+
+	lockdep_assert_held(&sched->reset.lock);
+
+	for (u32 i = 0; i < group->queue_count; i++)
+		queue_stop(group->queues[i], NULL);
+
+	group_get(group);
+	list_move_tail(&group->run_node, &sched->reset.stopped_groups);
+}
+
+static void panthor_group_start(struct panthor_group *group)
+{
+	struct panthor_scheduler *sched = group->ptdev->scheduler;
+
+	lockdep_assert_held(&group->ptdev->scheduler->reset.lock);
+
+	for (u32 i = 0; i < group->queue_count; i++)
+		queue_start(group->queues[i]);
+
+	if (group_can_run(group)) {
+		list_move_tail(&group->run_node,
+			       group_is_idle(group) ?
+			       &sched->groups.idle[group->priority] :
+			       &sched->groups.runnable[group->priority]);
+	} else {
+		list_del_init(&group->run_node);
+		list_del_init(&group->wait_node);
+		group_queue_work(group, term);
+	}
+
+	group_put(group);
+}
+
+void panthor_sched_resume(struct panthor_device *ptdev)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+
+	/* Force a tick to re-evaluate after a resume. */
+	sched_queue_delayed_work(sched, tick, 0);
+}
+
+void panthor_sched_suspend(struct panthor_device *ptdev)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_csg_slots_upd_ctx upd_ctx;
+	u64 suspended_slots, faulty_slots;
+	struct panthor_group *group;
+	int ret;
+	u32 i;
+
+	mutex_lock(&sched->lock);
+	csgs_upd_ctx_init(&upd_ctx);
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		struct panthor_csg_slot *csg_slot = &sched->csg_slots[i];
+
+		if (csg_slot->group) {
+			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, i,
+						CSG_STATE_SUSPEND,
+						CSG_STATE_MASK);
+		}
+	}
+
+	suspended_slots = upd_ctx.update_mask;
+
+	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
+	suspended_slots &= ~upd_ctx.timedout_mask;
+	faulty_slots = upd_ctx.timedout_mask;
+
+	if (faulty_slots) {
+		u32 slot_mask = faulty_slots;
+
+		drm_err(&ptdev->base, "CSG suspend failed, escalating to termination");
+		csgs_upd_ctx_init(&upd_ctx);
+		while (slot_mask) {
+			u32 csg_id = ffs(slot_mask) - 1;
+
+			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
+						CSG_STATE_TERMINATE,
+						CSG_STATE_MASK);
+			slot_mask &= ~BIT(csg_id);
+		}
+
+		csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
+
+		slot_mask = upd_ctx.timedout_mask;
+		while (slot_mask) {
+			u32 csg_id = ffs(slot_mask) - 1;
+			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+
+			/* The terminate command timed out, but the soft-reset
+			 * will automatically terminate all active groups, so
+			 * let's force the state to terminated here.
+			 */
+			if (csg_slot->group->state != PANTHOR_CS_GROUP_TERMINATED)
+				csg_slot->group->state = PANTHOR_CS_GROUP_TERMINATED;
+			slot_mask &= ~BIT(csg_id);
+		}
+	}
+
+	/* Flush L2 and LSC caches to make sure suspend state is up-to-date.
+	 * If the flush fails, flag all queues for termination.
+	 */
+	if (suspended_slots) {
+		bool flush_caches_failed = false;
+		u32 slot_mask = suspended_slots;
+
+		if (panthor_gpu_flush_caches(ptdev, CACHE_CLEAN, CACHE_CLEAN, 0))
+			flush_caches_failed = true;
+
+		while (slot_mask) {
+			u32 csg_id = ffs(slot_mask) - 1;
+			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
+
+			if (flush_caches_failed)
+				csg_slot->group->state = PANTHOR_CS_GROUP_TERMINATED;
+			else
+				csg_slot_sync_update_locked(ptdev, csg_id);
+
+			slot_mask &= ~BIT(csg_id);
+		}
+
+		if (flush_caches_failed)
+			faulty_slots |= suspended_slots;
+	}
+
+	for (i = 0; i < sched->csg_slot_count; i++) {
+		struct panthor_csg_slot *csg_slot = &sched->csg_slots[i];
+
+		group = csg_slot->group;
+		if (!group)
+			continue;
+
+		group_get(group);
+		group_unbind_locked(group);
+
+		drm_WARN_ON(&group->ptdev->base, !list_empty(&group->run_node));
+
+		if (group_can_run(group)) {
+			list_add(&group->run_node,
+				 group_is_idle(group) ?
+				 &sched->groups.idle[group->priority] :
+				 &sched->groups.runnable[group->priority]);
+		} else {
+			/* We don't bother stopping the scheduler if the group is
+			 * faulty, the group termination work will finish the job.
+			 */
+			list_del_init(&group->wait_node);
+			group_queue_work(group, term);
+		}
+		group_put(group);
+	}
+	mutex_unlock(&sched->lock);
+}
+
+void panthor_sched_pre_reset(struct panthor_device *ptdev)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_group *group, *group_tmp;
+	u32 i;
+
+	mutex_lock(&sched->reset.lock);
+
+	/* Cancel all scheduler works. Once this is done, these works can't be
+	 * scheduled again until the reset operation is complete.
+	 */
+	sched->reset.in_progress = true;
+	cancel_work_sync(&sched->sync_upd_work);
+	cancel_delayed_work_sync(&sched->tick_work);
+
+	panthor_sched_suspend(ptdev);
+
+	/* Stop all groups that might still accept jobs, so we don't get passed
+	 * new jobs while we're resetting.
+	 */
+	for (i = 0; i < ARRAY_SIZE(sched->groups.runnable); i++) {
+		list_for_each_entry_safe(group, group_tmp, &sched->groups.runnable[i], run_node)
+			panthor_group_stop(group);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(sched->groups.idle); i++) {
+		list_for_each_entry_safe(group, group_tmp, &sched->groups.idle[i], run_node)
+			panthor_group_stop(group);
+	}
+
+	mutex_unlock(&sched->reset.lock);
+}
+
+void panthor_sched_post_reset(struct panthor_device *ptdev)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_group *group, *group_tmp;
+
+	mutex_lock(&sched->reset.lock);
+
+	list_for_each_entry_safe(group, group_tmp, &sched->reset.stopped_groups, run_node)
+		panthor_group_start(group);
+
+	/* We're done resetting the GPU, clear the reset.in_progress bit so we can
+	 * kick the scheduler.
+	 */
+	sched->reset.in_progress = false;
+	mutex_unlock(&sched->reset.lock);
+
+	sched_queue_delayed_work(sched, tick, 0);
+
+	sched_queue_work(sched, sync_upd);
+}
+
+static void group_sync_upd_work(struct work_struct *work)
+{
+	struct panthor_group *group =
+		container_of(work, struct panthor_group, sync_upd_work);
+	struct panthor_job *job, *job_tmp;
+	LIST_HEAD(done_jobs);
+	u32 queue_idx;
+	bool cookie;
+
+	cookie = dma_fence_begin_signalling();
+	for (queue_idx = 0; queue_idx < group->queue_count; queue_idx++) {
+		struct panthor_queue *queue = group->queues[queue_idx];
+		struct panthor_syncobj_64b *syncobj;
+
+		if (!queue)
+			continue;
+
+		syncobj = group->syncobjs.kmap + (queue_idx * sizeof(*syncobj));
+
+		spin_lock(&queue->fence_ctx.lock);
+		list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
+			if (!job->call_info.size)
+				continue;
+
+			if (syncobj->seqno < job->done_fence->seqno)
+				break;
+
+			list_move_tail(&job->node, &done_jobs);
+			dma_fence_signal_locked(job->done_fence);
+		}
+		spin_unlock(&queue->fence_ctx.lock);
+	}
+	dma_fence_end_signalling(cookie);
+
+	list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
+		list_del_init(&job->node);
+		panthor_job_put(&job->base);
+	}
+
+	group_put(group);
+}
+
+static struct dma_fence *
+queue_run_job(struct drm_sched_job *sched_job)
+{
+	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+	struct panthor_group *group = job->group;
+	struct panthor_queue *queue = group->queues[job->queue_idx];
+	struct panthor_device *ptdev = group->ptdev;
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	u32 ringbuf_size = queue->ringbuf.bo->base.base.size;
+	u32 ringbuf_insert = queue->iface.input->insert % ringbuf_size;
+	u64 addr_reg = ptdev->csif_info.cs_reg_count -
+		       ptdev->csif_info.unpreserved_cs_reg_count;
+	u64 val_reg = addr_reg + 2;
+	u64 sync_addr = group->syncobjs.gpu_va +
+			job->queue_idx * sizeof(struct panthor_syncobj_64b);
+	u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
+	struct dma_fence *done_fence;
+	int ret;
+
+	u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
+		/* MOV32 rX+2, cs.latest_flush */
+		(2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
+
+		/* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
+		(36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
+
+		/* MOV48 rX:rX+1, cs.start */
+		(1ull << 56) | (addr_reg << 48) | job->call_info.start,
+
+		/* MOV32 rX+2, cs.size */
+		(2ull << 56) | (val_reg << 48) | job->call_info.size,
+
+		/* WAIT(0) => waits for FLUSH_CACHE2 instruction */
+		(3ull << 56) | (1 << 16),
+
+		/* CALL rX:rX+1, rX+2 */
+		(32ull << 56) | (addr_reg << 40) | (val_reg << 32),
+
+		/* MOV48 rX:rX+1, sync_addr */
+		(1ull << 56) | (addr_reg << 48) | sync_addr,
+
+		/* MOV32 rX+2, #1 */
+		(1ull << 56) | (val_reg << 48) | 1,
+
+		/* WAIT(all) */
+		(3ull << 56) | (waitall_mask << 16),
+
+		/* SYNC_ADD64.system_scope.propagate_err.nowait rX:rX+1, rX+2 */
+		(51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
+
+		/* ERROR_BARRIER, so we can recover from faults at job
+		 * boundaries.
+		 */
+		(47ull << 56),
+	};
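+
+	/* Each entry above is a single 64-bit CS instruction: the opcode is
+	 * encoded in the top byte (bits 63:56) and the operands (registers,
+	 * immediates, scoreboard masks) are packed in the lower bits, as
+	 * detailed in the per-instruction comments.
+	 */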
+
+	/* Need to be cacheline aligned to please the prefetcher. */
+	static_assert(sizeof(call_instrs) % 64 == 0,
+		      "call_instrs is not aligned on a cacheline");
+
+	/* Stream size is zero, nothing to do => return a NULL fence and let
+	 * drm_sched signal the parent.
+	 */
+	if (!job->call_info.size)
+		return NULL;
+
+	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	if (drm_WARN_ON(&ptdev->base, ret))
+		return ERR_PTR(ret);
+
+	mutex_lock(&sched->lock);
+	if (!group_can_run(group)) {
+		done_fence = ERR_PTR(-ECANCELED);
+		goto out_unlock;
+	}
+
+	dma_fence_init(job->done_fence,
+		       &panthor_queue_fence_ops,
+		       &queue->fence_ctx.lock,
+		       queue->fence_ctx.id,
+		       atomic64_inc_return(&queue->fence_ctx.seqno));
+
+	memcpy((u8 *)queue->ringbuf.kmap + ringbuf_insert,
+	       call_instrs, sizeof(call_instrs));
+
+	panthor_job_get(&job->base);
+	spin_lock(&queue->fence_ctx.lock);
+	list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
+	spin_unlock(&queue->fence_ctx.lock);
+
+	job->ringbuf.start = queue->iface.input->insert;
+	job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
+
+	/* Make sure the ring buffer is updated before the INSERT
+	 * register.
+	 */
+	wmb();
+
+	queue->iface.input->extract = queue->iface.output->extract;
+	queue->iface.input->insert = job->ringbuf.end;
+
+	if (group->csg_id < 0) {
+		/* If the queue is blocked, we want to keep the timeout running, so we
+		 * can detect unbounded waits and kill the group when that happens.
+		 * Otherwise, we suspend the timeout so the time we spend waiting for
+		 * a CSG slot is not counted.
+		 */
+		if (!(group->blocked_queues & BIT(job->queue_idx)) &&
+		    !queue->timeout_suspended) {
+			queue->remaining_time = drm_sched_suspend_timeout(&queue->scheduler);
+			queue->timeout_suspended = true;
+		}
+
+		group_schedule_locked(group, BIT(job->queue_idx));
+	} else {
+		gpu_write(ptdev, CSF_DOORBELL(queue->doorbell_id), 1);
+		if (!sched->pm.has_ref &&
+		    !(group->blocked_queues & BIT(job->queue_idx))) {
+			pm_runtime_get(ptdev->base.dev);
+			sched->pm.has_ref = true;
+		}
+	}
+
+	done_fence = dma_fence_get(job->done_fence);
+
+out_unlock:
+	mutex_unlock(&sched->lock);
+	pm_runtime_mark_last_busy(ptdev->base.dev);
+	pm_runtime_put_autosuspend(ptdev->base.dev);
+
+	return done_fence;
+}
+
+static enum drm_gpu_sched_stat
+queue_timedout_job(struct drm_sched_job *sched_job)
+{
+	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+	struct panthor_group *group = job->group;
+	struct panthor_device *ptdev = group->ptdev;
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_queue *queue = group->queues[job->queue_idx];
+
+	drm_warn(&ptdev->base, "job timeout\n");
+
+	WARN_ON(sched->reset.in_progress);
+
+	queue_stop(queue, job);
+
+	mutex_lock(&sched->lock);
+	group->timedout = true;
+	if (group->csg_id >= 0) {
+		sched_queue_delayed_work(ptdev->scheduler, tick, 0);
+	} else {
+		/* Remove from the run queues, so the scheduler can't
+		 * pick the group on the next tick.
+		 */
+		WARN_ON(list_empty(&group->run_node));
+		list_del_init(&group->run_node);
+		list_del_init(&group->wait_node);
+
+		group_queue_work(group, term);
+	}
+	mutex_unlock(&sched->lock);
+
+	queue_start(queue);
+
+	return DRM_GPU_SCHED_STAT_NOMINAL;
+}
+
+static void queue_free_job(struct drm_sched_job *sched_job)
+{
+	drm_sched_job_cleanup(sched_job);
+	panthor_job_put(sched_job);
+}
+
+static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
+	.run_job = queue_run_job,
+	.timedout_job = queue_timedout_job,
+	.free_job = queue_free_job,
+};
+
+static struct panthor_queue *
+group_create_queue(struct panthor_group *group,
+		   const struct drm_panthor_queue_create *args)
+{
+	struct panthor_scheduler *scheduler = group->ptdev->scheduler;
+	struct drm_gpu_scheduler *drm_sched;
+	struct panthor_queue *queue;
+	int ret;
+
+	if (args->pad[0] || args->pad[1] || args->pad[2])
+		return ERR_PTR(-EINVAL);
+
+	if (!IS_ALIGNED(args->ringbuf_size, PAGE_SIZE) || args->ringbuf_size > SZ_64K)
+		return ERR_PTR(-EINVAL);
+
+	if (args->priority > CSF_MAX_QUEUE_PRIO)
+		return ERR_PTR(-EINVAL);
+
+	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+	if (!queue)
+		return ERR_PTR(-ENOMEM);
+
+	queue->fence_ctx.id = dma_fence_context_alloc(1);
+	spin_lock_init(&queue->fence_ctx.lock);
+	INIT_LIST_HEAD(&queue->fence_ctx.in_flight_jobs);
+
+	queue->priority = args->priority;
+
+	queue->ringbuf.gpu_va = PANTHOR_GEM_ALLOC_VA;
+	queue->ringbuf.bo = panthor_gem_create_and_map(group->ptdev, group->vm,
+						       args->ringbuf_size,
+						       DRM_PANTHOR_BO_NO_MMAP,
+						       DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
+						       DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+						       &queue->ringbuf.gpu_va,
+						       (void **)&queue->ringbuf.kmap);
+	if (IS_ERR(queue->ringbuf.bo)) {
+		ret = PTR_ERR(queue->ringbuf.bo);
+		goto out;
+	}
+
+	queue->iface.mem = panthor_fw_alloc_queue_iface_mem(group->ptdev,
+							    &queue->iface.input,
+							    &queue->iface.output);
+	if (IS_ERR(queue->iface.mem)) {
+		ret = PTR_ERR(queue->iface.mem);
+		goto out;
+	}
+
+	ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
+			     scheduler->wq,
+			     args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
+			     0, msecs_to_jiffies(JOB_TIMEOUT_MS),
+			     group->ptdev->reset.wq,
+			     NULL, "panthor-queue", DRM_SCHED_POLICY_SINGLE_ENTITY,
+			     group->ptdev->base.dev);
+	if (ret)
+		goto out;
+
+	drm_sched = &queue->scheduler;
+	ret = drm_sched_entity_init(&queue->entity, DRM_SCHED_PRIORITY_NORMAL,
+				    &drm_sched, 1, NULL);
+
+out:
+	if (ret)
+		return ERR_PTR(ret);
+
+	return queue;
+}
+
+int panthor_group_create(struct panthor_file *pfile,
+			 const struct drm_panthor_group_create *group_args,
+			 const struct drm_panthor_queue_create *queue_args)
+{
+	struct panthor_device *ptdev = pfile->ptdev;
+	struct panthor_group_pool *gpool = pfile->groups;
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
+	struct panthor_group *group = NULL;
+	u32 gid, i, suspend_size;
+	int ret;
+
+	if (group_args->pad)
+		return -EINVAL;
+
+	if (group_args->priority > PANTHOR_CSG_PRIORITY_HIGH)
+		return -EINVAL;
+
+	if ((group_args->compute_core_mask & ~ptdev->gpu_info.shader_present) ||
+	    (group_args->fragment_core_mask & ~ptdev->gpu_info.shader_present) ||
+	    (group_args->tiler_core_mask & ~ptdev->gpu_info.tiler_present))
+		return -EINVAL;
+
+	if (hweight64(group_args->compute_core_mask) < group_args->max_compute_cores ||
+	    hweight64(group_args->fragment_core_mask) < group_args->max_fragment_cores ||
+	    hweight64(group_args->tiler_core_mask) < group_args->max_tiler_cores)
+		return -EINVAL;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	spin_lock_init(&group->fatal_lock);
+	kref_init(&group->refcount);
+	group->state = PANTHOR_CS_GROUP_CREATED;
+	group->csg_id = -1;
+
+	group->ptdev = ptdev;
+	group->max_compute_cores = group_args->max_compute_cores;
+	group->compute_core_mask = group_args->compute_core_mask;
+	group->max_fragment_cores = group_args->max_fragment_cores;
+	group->fragment_core_mask = group_args->fragment_core_mask;
+	group->max_tiler_cores = group_args->max_tiler_cores;
+	group->tiler_core_mask = group_args->tiler_core_mask;
+	group->priority = group_args->priority;
+
+	INIT_LIST_HEAD(&group->wait_node);
+	INIT_LIST_HEAD(&group->run_node);
+	INIT_WORK(&group->term_work, group_term_work);
+	INIT_WORK(&group->sync_upd_work, group_sync_upd_work);
+	INIT_WORK(&group->release_work, group_release_work);
+
+	group->vm = panthor_vm_pool_get_vm(pfile->vms, group_args->vm_id);
+	if (!group->vm) {
+		ret = -EINVAL;
+		goto err_put_group;
+	}
+
+	suspend_size = csg_iface->control->suspend_size;
+	group->suspend_buf = panthor_fw_alloc_suspend_buf_mem(ptdev, suspend_size);
+	if (IS_ERR(group->suspend_buf)) {
+		ret = PTR_ERR(group->suspend_buf);
+		group->suspend_buf = NULL;
+		goto err_put_group;
+	}
+
+	suspend_size = csg_iface->control->protm_suspend_size;
+	group->protm_suspend_buf = panthor_fw_alloc_suspend_buf_mem(ptdev, suspend_size);
+	if (IS_ERR(group->protm_suspend_buf)) {
+		ret = PTR_ERR(group->protm_suspend_buf);
+		group->protm_suspend_buf = NULL;
+		goto err_put_group;
+	}
+
+	group->syncobjs.gpu_va = PANTHOR_GEM_ALLOC_VA;
+	group->syncobjs.bo = panthor_gem_create_and_map(ptdev, group->vm,
+							group_args->queues.count *
+							sizeof(struct panthor_syncobj_64b),
+							DRM_PANTHOR_BO_NO_MMAP,
+							DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
+							DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
+							&group->syncobjs.gpu_va,
+							(void **)&group->syncobjs.kmap);
+	if (IS_ERR(group->syncobjs.bo)) {
+		ret = PTR_ERR(group->syncobjs.bo);
+		goto err_put_group;
+	}
+
+	memset(group->syncobjs.kmap, 0,
+	       group_args->queues.count * sizeof(struct panthor_syncobj_64b));
+
+	for (i = 0; i < group_args->queues.count; i++) {
+		group->queues[i] = group_create_queue(group, &queue_args[i]);
+		if (IS_ERR(group->queues[i])) {
+			ret = PTR_ERR(group->queues[i]);
+			group->queues[i] = NULL;
+			goto err_put_group;
+		}
+
+		group->queue_count++;
+	}
+
+	group->idle_queues = GENMASK(group->queue_count - 1, 0);
+
+	ret = xa_alloc(&gpool->xa, &gid, group, XA_LIMIT(1, sched->csg_slot_count), GFP_KERNEL);
+	if (ret)
+		goto err_put_group;
+
+	mutex_lock(&sched->reset.lock);
+	if (sched->reset.in_progress) {
+		panthor_group_stop(group);
+	} else {
+		mutex_lock(&sched->lock);
+		list_add_tail(&group->run_node,
+			      &sched->groups.idle[group->priority]);
+		mutex_unlock(&sched->lock);
+	}
+	mutex_unlock(&sched->reset.lock);
+
+	return gid;
+
+err_put_group:
+	group_put(group);
+	return ret;
+}
+
+int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle)
+{
+	struct panthor_group_pool *gpool = pfile->groups;
+	struct panthor_device *ptdev = pfile->ptdev;
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_group *group;
+
+	group = xa_erase(&gpool->xa, group_handle);
+	if (!group)
+		return -EINVAL;
+
+	for (u32 i = 0; i < group->queue_count; i++) {
+		if (group->queues[i])
+			drm_sched_entity_destroy(&group->queues[i]->entity);
+	}
+
+	mutex_lock(&sched->reset.lock);
+	mutex_lock(&sched->lock);
+	group->destroyed = true;
+	if (group->csg_id >= 0) {
+		sched_queue_delayed_work(sched, tick, 0);
+	} else if (!sched->reset.in_progress) {
+		/* Remove from the run queues, so the scheduler can't
+		 * pick the group on the next tick.
+		 */
+		list_del_init(&group->run_node);
+		list_del_init(&group->wait_node);
+		group_queue_work(group, term);
+	}
+	mutex_unlock(&sched->lock);
+	mutex_unlock(&sched->reset.lock);
+
+	group_put(group);
+	return 0;
+}
+
+int panthor_group_get_state(struct panthor_file *pfile,
+			    struct drm_panthor_group_get_state *get_state)
+{
+	struct panthor_group_pool *gpool = pfile->groups;
+	struct panthor_device *ptdev = pfile->ptdev;
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	struct panthor_group *group;
+
+	if (get_state->pad)
+		return -EINVAL;
+
+	group = group_get(xa_load(&gpool->xa, get_state->group_handle));
+	if (!group)
+		return -EINVAL;
+
+	memset(get_state, 0, sizeof(*get_state));
+
+	mutex_lock(&sched->lock);
+	if (group->timedout)
+		get_state->state |= DRM_PANTHOR_GROUP_STATE_TIMEDOUT;
+	if (group->fatal_queues) {
+		get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
+		get_state->fatal_queues = group->fatal_queues;
+	}
+	mutex_unlock(&sched->lock);
+
+	group_put(group);
+	return 0;
+}
+
+int panthor_group_pool_create(struct panthor_file *pfile)
+{
+	struct panthor_group_pool *gpool;
+
+	gpool = kzalloc(sizeof(*gpool), GFP_KERNEL);
+	if (!gpool)
+		return -ENOMEM;
+
+	xa_init_flags(&gpool->xa, XA_FLAGS_ALLOC1);
+	pfile->groups = gpool;
+	return 0;
+}
+
+void panthor_group_pool_destroy(struct panthor_file *pfile)
+{
+	struct panthor_group_pool *gpool = pfile->groups;
+	struct panthor_group *group;
+	unsigned long i;
+
+	if (IS_ERR_OR_NULL(gpool))
+		return;
+
+	xa_for_each(&gpool->xa, i, group)
+		panthor_group_destroy(pfile, i);
+
+	xa_destroy(&gpool->xa);
+	kfree(gpool);
+	pfile->groups = NULL;
+}
+
+static void job_release(struct kref *ref)
+{
+	struct panthor_job *job = container_of(ref, struct panthor_job, refcount);
+
+	drm_WARN_ON(&job->group->ptdev->base, !list_empty(&job->node));
+
+	if (job->base.s_fence)
+		drm_sched_job_cleanup(&job->base);
+
+	if (job->done_fence && job->done_fence->ops)
+		dma_fence_put(job->done_fence);
+	else
+		dma_fence_free(job->done_fence);
+
+	group_put(job->group);
+
+	kfree(job);
+}
+
+struct drm_sched_job *panthor_job_get(struct drm_sched_job *sched_job)
+{
+	if (sched_job) {
+		struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+
+		kref_get(&job->refcount);
+	}
+
+	return sched_job;
+}
+
+void panthor_job_put(struct drm_sched_job *sched_job)
+{
+	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+
+	if (sched_job)
+		kref_put(&job->refcount, job_release);
+}
+
+struct drm_sched_job *
+panthor_job_create(struct panthor_file *pfile,
+		   u16 group_handle,
+		   const struct drm_panthor_queue_submit *qsubmit)
+{
+	struct panthor_group_pool *gpool = pfile->groups;
+	struct panthor_job *job;
+	int ret;
+
+	if (qsubmit->pad)
+		return ERR_PTR(-EINVAL);
+
+	/* If stream_addr is zero, stream_size must be zero too. */
+	if ((qsubmit->stream_size == 0) != (qsubmit->stream_addr == 0))
+		return ERR_PTR(-EINVAL);
+
+	/* Make sure the address is aligned on 64-byte (cacheline) and the size is
+	 * aligned on 8-byte (instruction size).
+	 */
+	if ((qsubmit->stream_addr & 63) || (qsubmit->stream_size & 7))
+		return ERR_PTR(-EINVAL);
+
+	/* bits 24:30 must be zero. */
+	if (qsubmit->latest_flush & GENMASK(30, 24))
+		return ERR_PTR(-EINVAL);
+
+	job = kzalloc(sizeof(*job), GFP_KERNEL);
+	if (!job)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&job->refcount);
+	job->queue_idx = qsubmit->queue_index;
+	job->call_info.size = qsubmit->stream_size;
+	job->call_info.start = qsubmit->stream_addr;
+	job->call_info.latest_flush = qsubmit->latest_flush;
+	INIT_LIST_HEAD(&job->node);
+
+	job->group = group_get(xa_load(&gpool->xa, group_handle));
+	if (!job->group) {
+		ret = -EINVAL;
+		goto err_put_job;
+	}
+
+	if (job->queue_idx >= job->group->queue_count ||
+	    !job->group->queues[job->queue_idx]) {
+		ret = -EINVAL;
+		goto err_put_job;
+	}
+
+	job->done_fence = kzalloc(sizeof(*job->done_fence), GFP_KERNEL);
+	if (!job->done_fence) {
+		ret = -ENOMEM;
+		goto err_put_job;
+	}
+
+	ret = drm_sched_job_init(&job->base,
+				 &job->group->queues[job->queue_idx]->entity,
+				 job->group);
+	if (ret)
+		goto err_put_job;
+
+	return &job->base;
+
+err_put_job:
+	panthor_job_put(&job->base);
+	return ERR_PTR(ret);
+}
+
+int panthor_job_prepare_resvs(struct drm_exec *exec,
+			      struct drm_sched_job *sched_job)
+{
+	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+
+	return panthor_vm_prepare_mapped_bos_resvs(exec, job->group->vm);
+}
+
+int panthor_job_add_resvs_deps(struct drm_sched_job *sched_job)
+{
+	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+
+	return panthor_vm_add_bos_resvs_deps_to_job(job->group->vm, sched_job);
+}
+
+void panthor_job_update_resvs(struct drm_sched_job *sched_job)
+{
+	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
+
+	panthor_vm_add_job_fence_to_bos_resvs(job->group->vm, sched_job);
+}
+
+void panthor_sched_unplug(struct panthor_device *ptdev)
+{
+	struct panthor_scheduler *sched = ptdev->scheduler;
+
+	cancel_delayed_work_sync(&sched->tick_work);
+
+	mutex_lock(&sched->lock);
+	if (sched->pm.has_ref) {
+		pm_runtime_put(ptdev->base.dev);
+		sched->pm.has_ref = false;
+	}
+	mutex_unlock(&sched->lock);
+}
+
+static void panthor_sched_fini(struct drm_device *ddev, void *res)
+{
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+	struct panthor_scheduler *sched = ptdev->scheduler;
+	int prio;
+
+	if (!sched || !sched->csg_slot_count)
+		return;
+
+	cancel_delayed_work_sync(&sched->tick_work);
+
+	if (sched->wq) {
+		drain_workqueue(sched->wq);
+		destroy_workqueue(sched->wq);
+	}
+
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		drm_WARN_ON(ddev, !list_empty(&sched->groups.runnable[prio]));
+		drm_WARN_ON(ddev, !list_empty(&sched->groups.idle[prio]));
+	}
+
+	drm_WARN_ON(ddev, !list_empty(&sched->groups.waiting));
+}
+
+int panthor_sched_init(struct panthor_device *ptdev)
+{
+	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
+	struct panthor_fw_cs_iface *cs_iface = panthor_fw_get_cs_iface(ptdev, 0, 0);
+	struct panthor_scheduler *sched;
+	u32 gpu_as_count, num_groups;
+	int prio;
+
+	sched = drmm_kzalloc(&ptdev->base, sizeof(*sched), GFP_KERNEL);
+	if (!sched)
+		return -ENOMEM;
+
+	/* The highest bit in JOB_INT_* is reserved for global IRQs. That
+	 * leaves 31 bits for CSG IRQs, hence the MAX_CSGS clamp here.
+	 */
+	num_groups = min_t(u32, MAX_CSGS, glb_iface->control->group_num);
+
+	/* The FW-side scheduler might deadlock if two groups with the same
+	 * priority try to access a set of resources that overlaps, with part
+	 * of the resources being allocated to one group and the other part to
+	 * the other group, both groups waiting for the remaining resources to
+	 * be allocated. To avoid that, it is recommended to assign each CSG a
+	 * different priority. In theory we could allow several groups to have
+	 * the same CSG priority if they don't request the same resources, but
+	 * that makes the scheduling logic more complicated, so let's clamp
+	 * the number of CSG slots to MAX_CSG_PRIO + 1 for now.
+	 */
+	num_groups = min_t(u32, MAX_CSG_PRIO + 1, num_groups);
+
+	/* We need at least one AS for the MCU and one for the GPU contexts. */
+	gpu_as_count = hweight32(ptdev->gpu_info.as_present & GENMASK(31, 1));
+	if (!gpu_as_count) {
+		drm_err(&ptdev->base, "Not enough AS (%d, expected at least 2)",
+			gpu_as_count + 1);
+		return -EINVAL;
+	}
+
+	sched->ptdev = ptdev;
+	sched->sb_slot_count = CS_FEATURES_SCOREBOARDS(cs_iface->control->features);
+	sched->csg_slot_count = num_groups;
+	sched->cs_slot_count = csg_iface->control->stream_num;
+	sched->as_slot_count = gpu_as_count;
+	ptdev->csif_info.csg_slot_count = sched->csg_slot_count;
+	ptdev->csif_info.cs_slot_count = sched->cs_slot_count;
+	ptdev->csif_info.scoreboard_slot_count = sched->sb_slot_count;
+
+	sched->last_tick = 0;
+	sched->resched_target = U64_MAX;
+	sched->tick_period = msecs_to_jiffies(10);
+	INIT_DELAYED_WORK(&sched->tick_work, tick_work);
+	INIT_WORK(&sched->sync_upd_work, sync_upd_work);
+
+	drmm_mutex_init(&ptdev->base, &sched->lock);
+	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
+		INIT_LIST_HEAD(&sched->groups.runnable[prio]);
+		INIT_LIST_HEAD(&sched->groups.idle[prio]);
+	}
+	INIT_LIST_HEAD(&sched->groups.waiting);
+
+	drmm_mutex_init(&ptdev->base, &sched->reset.lock);
+	INIT_LIST_HEAD(&sched->reset.stopped_groups);
+
+	ptdev->scheduler = sched;
+
+	sched->wq = alloc_workqueue("panthor-csf-sched", WQ_UNBOUND, 0);
+	if (!sched->wq) {
+		panthor_sched_fini(&ptdev->base, NULL);
+		drm_err(&ptdev->base, "Failed to allocate the workqueues");
+		return -ENOMEM;
+	}
+
+	return drmm_add_action_or_reset(&ptdev->base, panthor_sched_fini, NULL);
+}
diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
new file mode 100644
index 000000000000..ecdd9dd41ad9
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_sched.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2023 Collabora ltd. */
+
+#ifndef __PANTHOR_SCHED_H__
+#define __PANTHOR_SCHED_H__
+
+#include <drm/panthor_drm.h>
+
+struct drm_exec;
+struct dma_fence;
+struct drm_file;
+struct drm_gem_object;
+struct drm_sched_job;
+struct panthor_device;
+struct panthor_file;
+struct panthor_group_pool;
+struct panthor_job;
+
+int panthor_group_create(struct panthor_file *pfile,
+			 const struct drm_panthor_group_create *group_args,
+			 const struct drm_panthor_queue_create *queue_args);
+int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle);
+int panthor_group_get_state(struct panthor_file *pfile,
+			    struct drm_panthor_group_get_state *get_state);
+
+struct drm_sched_job *
+panthor_job_create(struct panthor_file *pfile,
+		   u16 group_handle,
+		   const struct drm_panthor_queue_submit *qsubmit);
+struct drm_sched_job *panthor_job_get(struct drm_sched_job *job);
+void panthor_job_put(struct drm_sched_job *job);
+int panthor_job_prepare_resvs(struct drm_exec *exec,
+			      struct drm_sched_job *job);
+int panthor_job_add_resvs_deps(struct drm_sched_job *job);
+void panthor_job_update_resvs(struct drm_sched_job *job);
+
+int panthor_group_pool_create(struct panthor_file *pfile);
+void panthor_group_pool_destroy(struct panthor_file *pfile);
+
+void panthor_sched_process_csg_irq(struct panthor_device *ptdev, u32 csg_slot);
+void panthor_sched_process_global_irq(struct panthor_device *ptdev);
+
+int panthor_sched_init(struct panthor_device *ptdev);
+void panthor_sched_unplug(struct panthor_device *ptdev);
+void panthor_sched_pre_reset(struct panthor_device *ptdev);
+void panthor_sched_post_reset(struct panthor_device *ptdev);
+void panthor_sched_suspend(struct panthor_device *ptdev);
+void panthor_sched_resume(struct panthor_device *ptdev);
+
+#endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [PATCH v2 12/15] drm/panthor: Add the driver frontend block
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (10 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 11/15] drm/panthor: Add the scheduler " Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-21 11:31   ` Steven Price
  2023-09-06 12:38   ` Ketil Johnsen
  2023-08-09 16:53 ` [PATCH v2 13/15] drm/panthor: Allow driver compilation Boris Brezillon
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

This is the last piece missing to expose the driver to the outside
world.

This is basically a wrapper between the ioctls and the other logical
blocks.

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Fix various bugs
- Refactored the code to make job submission re-usable for VM_BIND
  jobs
- Add user object copy helpers

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_drv.c | 1540 +++++++++++++++++++++++++
 1 file changed, 1540 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c

diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
new file mode 100644
index 000000000000..377ebea4c0e8
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -0,0 +1,1540 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
+/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
+/* Copyright 2019 Collabora ltd. */
+
+#include <linux/module.h>
+#include <linux/of_platform.h>
+#include <linux/pagemap.h>
+#include <linux/pm_runtime.h>
+#include <linux/xarray.h>
+
+#include <drm/drm_drv.h>
+#include <drm/drm_exec.h>
+#include <drm/drm_ioctl.h>
+#include <drm/drm_syncobj.h>
+#include <drm/drm_utils.h>
+#include <drm/drm_debugfs.h>
+#include <drm/gpu_scheduler.h>
+#include <drm/panthor_drm.h>
+
+#include "panthor_sched.h"
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_heap.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+#include "panthor_gpu.h"
+#include "panthor_regs.h"
+
+/**
+ * DOC: user <-> kernel object copy helpers.
+ */
+
+/**
+ * panthor_set_uobj() - Copy kernel object to user object.
+ * @usr_ptr: User pointer.
+ * @usr_size: Size of the user object.
+ * @min_size: Minimum size for this object.
+ * @kern_size: Size of the kernel object.
+ * @in: Address of the kernel object to copy.
+ *
+ * Helper automating kernel -> user object copies.
+ *
+ * Don't use this function directly, use PANTHOR_UOBJ_SET() instead.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const void *in)
+{
+	/* User size shouldn't be smaller than the minimal object size. */
+	if (usr_size < min_size)
+		return -EINVAL;
+
+	if (copy_to_user(u64_to_user_ptr(usr_ptr), in, min_t(u32, usr_size, kern_size)))
+		return -EFAULT;
+
+	/* When the kernel object is smaller than the user object, we fill the gap with
+	 * zeros.
+	 */
+	if (usr_size > kern_size &&
+	    clear_user(u64_to_user_ptr(usr_ptr + kern_size), usr_size - kern_size)) {
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_get_uobj_array() - Copy a user object array into a kernel accessible object array.
+ * @in: The object array to copy.
+ * @min_stride: Minimum array stride.
+ * @obj_size: Kernel object size.
+ * @out: Pointer to a variable that will hold the newly allocated object array.
+ *
+ * Helper automating user -> kernel object copies.
+ *
+ * Don't use this function directly, use PANTHOR_UOBJ_ARRAY_GET() instead.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
+		       u32 obj_size, void **out)
+{
+	int ret = 0;
+	void *out_alloc;
+
+	/* User stride must be at least the minimum object size, otherwise it might
+	 * lack useful information.
+	 */
+	if (in->stride < min_stride)
+		return -EINVAL;
+
+	if (!in->count)
+		return 0;
+
+	out_alloc = kvmalloc_array(in->count, obj_size, GFP_KERNEL);
+	if (!out_alloc)
+		return -ENOMEM;
+
+	if (obj_size == in->stride) {
+		/* Fast path when user/kernel have the same uAPI header version. */
+		if (copy_from_user(out_alloc, u64_to_user_ptr(in->array),
+				   (unsigned long)obj_size * in->count))
+			ret = -EFAULT;
+	} else {
+		void __user *in_ptr = u64_to_user_ptr(in->array);
+		void *out_ptr = out_alloc;
+
+		/* If the sizes differ, we need to copy elements one by one. */
+		for (u32 i = 0; i < in->count; i++) {
+			ret = copy_struct_from_user(out_ptr, obj_size, in_ptr, in->stride);
+			if (ret)
+				break;
+
+			out_ptr += obj_size;
+			in_ptr += in->stride;
+		}
+	}
+
+	if (ret) {
+		kvfree(out_alloc);
+		return ret;
+	}
+
+	*out = out_alloc;
+	return 0;
+}
+
+/**
+ * PANTHOR_UOBJ_MIN_SIZE_INTERNAL() - Get the minimum user object size
+ * @_typename: Object type.
+ * @_last_mandatory_field: Last mandatory field.
+ *
+ * Get the minimum user object size based on the last mandatory field name,
+ * i.e. the name of the last field of the structure at the time this
+ * structure was added to the uAPI.
+ *
+ * Don't use directly, use PANTHOR_UOBJ_DECL() instead.
+ */
+#define PANTHOR_UOBJ_MIN_SIZE_INTERNAL(_typename, _last_mandatory_field) \
+	(offsetof(_typename, _last_mandatory_field) + \
+	 sizeof(((_typename *)NULL)->_last_mandatory_field))
+
+/**
+ * PANTHOR_UOBJ_DECL() - Declare a new uAPI object that is subject to
+ * evolutions.
+ * @_typename: Object type.
+ * @_last_mandatory_field: Last mandatory field.
+ *
+ * Should be used to extend the PANTHOR_UOBJ_MIN_SIZE() list.
+ */
+#define PANTHOR_UOBJ_DECL(_typename, _last_mandatory_field) \
+	_typename : PANTHOR_UOBJ_MIN_SIZE_INTERNAL(_typename, _last_mandatory_field)
+
+/**
+ * PANTHOR_UOBJ_MIN_SIZE() - Get the minimum size of a given uAPI object
+ * @_obj_name: Object to get the minimum size of.
+ *
+ * Don't use this macro directly, it's automatically called by
+ * PANTHOR_UOBJ_{SET,GET_ARRAY}().
+ */
+#define PANTHOR_UOBJ_MIN_SIZE(_obj_name) \
+	_Generic(_obj_name, \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_gpu_info, tiler_present), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_csif_info, pad), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
+		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs))
+
+/**
+ * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
+ * @_dest_usr_ptr: User pointer to copy to.
+ * @_usr_size: Size of the user object.
+ * @_src_obj: Kernel object to copy (not a pointer).
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+#define PANTHOR_UOBJ_SET(_dest_usr_ptr, _usr_size, _src_obj) \
+	panthor_set_uobj(_dest_usr_ptr, _usr_size, \
+			 PANTHOR_UOBJ_MIN_SIZE(_src_obj), \
+			 sizeof(_src_obj), &(_src_obj))
+
+/**
+ * PANTHOR_UOBJ_GET_ARRAY() - Copy a user object array to a kernel accessible
+ * object array.
+ * @_dest_array: Local variable that will hold the newly allocated kernel
+ * object array.
+ * @_uobj_array: The drm_panthor_obj_array object describing the user object
+ * array.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) \
+	panthor_get_uobj_array(_uobj_array, \
+			       PANTHOR_UOBJ_MIN_SIZE((_dest_array)[0]), \
+			       sizeof((_dest_array)[0]), (void **)&(_dest_array))
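+
+/*
+ * To summarize the kernel -> user copy rules implemented by
+ * panthor_set_uobj() for an object declared with PANTHOR_UOBJ_DECL():
+ *
+ *	usr_size <  min_size               -> -EINVAL (user object too small)
+ *	min_size <= usr_size <= kern_size  -> copy usr_size bytes (older
+ *					      userspace; newer kernel fields
+ *					      are dropped)
+ *	usr_size >  kern_size              -> copy kern_size bytes and
+ *					      zero-fill the remainder (newer
+ *					      userspace; older kernel)
+ */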
+
+/**
+ * DOC: Job submission helpers.
+ *
+ * Here is the workflow for atomic submission of multiple jobs. By atomic,
+ * we mean that we either submit the whole batch, or nothing. This requires
+ * doing things in multiple steps, each step operating on all jobs belonging
+ * to a batch.
+ *
+ * int xxx_submit_ioctl(...)
+ * {
+ *	...
+ *
+ *	// Initialize the submission context.
+ *	ret = panthor_submit_ctx_init(&ctx, file, job_count);
+ *	if (ret)
+ *		return ret;
+ *
+ *	// Create jobs and attach sync operations.
+ *	for (u32 i = 0; i < job_count; i++) {
+ *		...
+ *
+ *		// Create job
+ *		job = job_create(pfile, ...);
+ *		if (IS_ERR(job)) {
+ *			ret = PTR_ERR(job);
+ *			goto out_cleanup_submit_ctx;
+ *		}
+ *
+ *		// Add job to the submit context
+ *		ret = panthor_submit_ctx_add_job(&ctx, i, job, sync_ops);
+ *		if (ret)
+ *			goto out_cleanup_submit_ctx;
+ *	}
+ *
+ *	// Collect signal operations on all jobs, such that each job can pick
+ *	// from it for its dependencies and update the fence to signal when
+ *	// the job is submitted.
+ *	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
+ *	if (ret)
+ *		goto out_cleanup_submit_ctx;
+ *
+ *	// We acquire/prepare resvs on all jobs before proceeding with the
+ *	// dependency registration.
+ *	//
+ *	// This is solving two problems:
+ *	// 1. drm_sched_job_arm() and drm_sched_entity_push_job() must be protected
+ *	//    by a lock to make sure no concurrent access to the same entity get
+ *	//    interleaved, which would mess up with the fence seqno ordering.
+ *	//    Luckily, one of the resv being acquired is the VM resv, and a scheduling
+ *	//    entity is only bound to a single VM. As soon as we acquire the VM resv,
+ *	//    we should be safe.
+ *	// 2. Jobs might depend on fences that were issued by previous jobs in the
+ *	//    same batch, so we can't add dependencies on all jobs before arming
+ *	//    previous jobs and registering the fence to the signal array, otherwise
+ *	//    we might miss dependencies, or point to an outdated fence.
+ *	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_job_prepare_resvs);
+ *	if (ret)
+ *		goto out_cleanup_submit_ctx;
+ *
+ *	// Now that resvs are locked/prepared, we can iterate over each job to add
+ *	// the dependencies, arm the job fence, register the job fence to the signal
+ *	// array.
+ *	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_job_add_resvs_deps);
+ *	if (ret)
+ *		goto out_cleanup_submit_ctx;
+ *
+ *	// Nothing can fail after that point, so we can make our job fences visible to the
+ *	// outside world. Push jobs and set the job fences to the resv slots we reserved.
+ *	// This also pushes the fences to the syncobjs that are part of the signal array.
+ *	panthor_submit_ctx_push_jobs(&ctx, panthor_job_update_resvs);
+ *
+ * out_cleanup_submit_ctx:
+ *	// Cleanup the context.
+ *	panthor_submit_ctx_cleanup(&ctx, panthor_job_put);
+ *	...
+ *	return ret;
+ *}
+ */
+
+/**
+ * struct panthor_sync_signal - Represent a synchronization object point to attach
+ * our job fence to.
+ *
+ * This structure is here to keep track of fences that are currently bound to
+ * a specific syncobj point.
+ *
+ * At the beginning of a job submission, the fence
+ * is retrieved from the syncobj itself, and can be NULL if no fence was attached
+ * to this point.
+ *
+ * At the end, it points to the fence of the last job that had a
+ * %DRM_PANTHOR_SYNC_OP_SIGNAL on this syncobj.
+ *
+ * With jobs being submitted in batches, the fence might change several times during
+ * the process, allowing one job to wait on a job that's part of the same submission
+ * but appears earlier in the drm_panthor_group_submit::queue_submits array.
+ */
+struct panthor_sync_signal {
+	/** @handle: The syncobj handle. */
+	u32 handle;
+
+	/**
+	 * @point: The syncobj point.
+	 *
+	 * Zero for regular syncobjs, and non-zero for timeline syncobjs.
+	 */
+	u64 point;
+
+	/**
+	 * @syncobj: The sync object pointed by @handle.
+	 */
+	struct drm_syncobj *syncobj;
+
+	/**
+	 * @chain: Chain object used to link the new fence to an existing
+	 * timeline syncobj.
+	 *
+	 * NULL for regular syncobj, non-NULL for timeline syncobjs.
+	 */
+	struct dma_fence_chain *chain;
+
+	/**
+	 * @fence: The fence to assign to the syncobj or syncobj-point.
+	 */
+	struct dma_fence *fence;
+};
+
+/**
+ * struct panthor_job_ctx - Job context
+ */
+struct panthor_job_ctx {
+	/** @job: The job that is about to be submitted to drm_sched. */
+	struct drm_sched_job *job;
+
+	/** @syncops: Array of sync operations. */
+	struct drm_panthor_sync_op *syncops;
+
+	/** @syncop_count: Number of sync operations. */
+	u32 syncop_count;
+};
+
+/**
+ * struct panthor_submit_ctx - Submission context
+ *
+ * Anything that's related to a submission (%DRM_IOCTL_PANTHOR_VM_BIND or
+ * %DRM_IOCTL_PANTHOR_GROUP_SUBMIT) is kept here, so we can automate the
+ * initialization and cleanup steps.
+ */
+struct panthor_submit_ctx {
+	/** @file: DRM file this submission happens on. */
+	struct drm_file *file;
+
+	/**
+	 * @signal: Array of panthor_sync_signal objects.
+	 *
+	 * %DRM_PANTHOR_SYNC_OP_SIGNAL operations will be recorded here,
+	 * and %DRM_PANTHOR_SYNC_OP_WAIT will first check if an entry
+	 * matching the syncobj+point exists before calling
+	 * drm_syncobj_find_fence(). This allows us to describe dependencies
+	 * existing between jobs that are part of the same batch.
+	 */
+	struct xarray signal;
+
+	/** @jobs: Array of jobs. */
+	struct panthor_job_ctx *jobs;
+
+	/** @job_count: Number of entries in the @jobs array. */
+	u32 job_count;
+
+	/** @exec: drm_exec context used to acquire and prepare resv objects. */
+	struct drm_exec exec;
+};
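+
+/*
+ * Example of the intra-batch dependency resolution enabled by the signal
+ * array (hypothetical handles, for illustration only): if queue_submits[0]
+ * carries a DRM_PANTHOR_SYNC_OP_SIGNAL on syncobj 5 and queue_submits[1]
+ * carries a wait on the same syncobj, the wait resolves to the finished
+ * fence of job 0 recorded in the signal array, not to whatever fence was
+ * attached to the syncobj before the submission started.
+ */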
+
+#define PANTHOR_SYNC_OP_FLAGS_MASK \
+	(DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK | DRM_PANTHOR_SYNC_OP_SIGNAL)
+
+/**
+ * panthor_check_sync_op() - Check drm_panthor_sync_op fields
+ * @sync_op: The sync operation to check.
+ *
+ * Return: 0 on success, -EINVAL otherwise.
+ */
+static int
+panthor_check_sync_op(const struct drm_panthor_sync_op *sync_op)
+{
+	u8 handle_type;
+
+	if (sync_op->flags & ~PANTHOR_SYNC_OP_FLAGS_MASK)
+		return -EINVAL;
+
+	handle_type = sync_op->flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK;
+	if (handle_type != DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ &&
+	    handle_type != DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ)
+		return -EINVAL;
+
+	if (handle_type == DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ &&
+	    sync_op->timeline_value != 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
+ * panthor_sync_signal_free() - Release resources and free a panthor_sync_signal object
+ * @sig_sync: Signal object to free.
+ */
+static void
+panthor_sync_signal_free(struct panthor_sync_signal *sig_sync)
+{
+	if (!sig_sync)
+		return;
+
+	drm_syncobj_put(sig_sync->syncobj);
+	dma_fence_chain_free(sig_sync->chain);
+	dma_fence_put(sig_sync->fence);
+	kfree(sig_sync);
+}
+
+/**
+ * panthor_submit_ctx_add_sync_signal() - Add a signal operation to a submit context
+ * @ctx: Context to add the signal operation to.
+ * @handle: Syncobj handle.
+ * @point: Syncobj point.
+ *
+ * Return: A valid panthor_sync_signal object on success, an ERR_PTR() otherwise.
+ */
+static struct panthor_sync_signal *
+panthor_submit_ctx_add_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
+{
+	struct panthor_sync_signal *sig_sync;
+	struct dma_fence *cur_fence;
+	int ret;
+	u32 id;
+
+	sig_sync = kzalloc(sizeof(*sig_sync), GFP_KERNEL);
+	if (!sig_sync)
+		return ERR_PTR(-ENOMEM);
+
+	sig_sync->handle = handle;
+	sig_sync->point = point;
+
+	if (point > 0) {
+		sig_sync->chain = dma_fence_chain_alloc();
+		if (!sig_sync->chain) {
+			ret = -ENOMEM;
+			goto err_free_sig_sync;
+		}
+	}
+
+	sig_sync->syncobj = drm_syncobj_find(ctx->file, handle);
+	if (!sig_sync->syncobj) {
+		ret = -EINVAL;
+		goto err_free_sig_sync;
+	}
+
+	/* Retrieve the current fence attached to that point. It's
+	 * perfectly fine to get a NULL fence here, it just means there's
+	 * no fence attached to that point yet.
+	 */
+	if (!drm_syncobj_find_fence(ctx->file, handle, point, 0, &cur_fence))
+		sig_sync->fence = cur_fence;
+
+	ret = xa_alloc(&ctx->signal, &id, sig_sync, xa_limit_32b, GFP_KERNEL);
+	if (ret)
+		goto err_free_sig_sync;
+
+	return sig_sync;
+
+err_free_sig_sync:
+	panthor_sync_signal_free(sig_sync);
+	return ERR_PTR(ret);
+}
+
+/**
+ * panthor_submit_ctx_search_sync_signal() - Search an existing signal operation in a
+ * submit context.
+ * @ctx: Context to search the signal operation in.
+ * @handle: Syncobj handle.
+ * @point: Syncobj point.
+ *
+ * Return: A valid panthor_sync_signal object if found, NULL otherwise.
+ */
+static struct panthor_sync_signal *
+panthor_submit_ctx_search_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
+{
+	struct panthor_sync_signal *sig_sync;
+	unsigned long i;
+
+	xa_for_each(&ctx->signal, i, sig_sync) {
+		if (handle == sig_sync->handle && point == sig_sync->point)
+			return sig_sync;
+	}
+
+	return NULL;
+}
+
+/**
+ * panthor_submit_ctx_add_job() - Add a job to a submit context
+ * @ctx: Context to add the job to.
+ * @idx: Index of the job in the context.
+ * @job: Job to add.
+ * @syncs: Sync operations provided by userspace.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_submit_ctx_add_job(struct panthor_submit_ctx *ctx, u32 idx,
+			   struct drm_sched_job *job,
+			   const struct drm_panthor_obj_array *syncs)
+{
+	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
+						    struct panthor_device,
+						    base);
+	int ret;
+
+	if (drm_WARN_ON(&ptdev->base,
+			idx >= ctx->job_count ||
+			ctx->jobs[idx].job ||
+			ctx->jobs[idx].syncops ||
+			ctx->jobs[idx].syncop_count))
+		return -EINVAL;
+
+	ctx->jobs[idx].job = job;
+
+	ret = PANTHOR_UOBJ_GET_ARRAY(ctx->jobs[idx].syncops, syncs);
+	if (ret)
+		return ret;
+
+	ctx->jobs[idx].syncop_count = syncs->count;
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_get_sync_signal() - Search signal operation and add one if none was found.
+ * @ctx: Context to search the signal operation in.
+ * @handle: Syncobj handle.
+ * @point: Syncobj point.
+ *
+ * Return: A valid panthor_sync_signal object on success, an ERR_PTR() otherwise.
+ */
+static struct panthor_sync_signal *
+panthor_submit_ctx_get_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
+{
+	struct panthor_sync_signal *sig_sync;
+
+	sig_sync = panthor_submit_ctx_search_sync_signal(ctx, handle, point);
+	if (sig_sync)
+		return sig_sync;
+
+	return panthor_submit_ctx_add_sync_signal(ctx, handle, point);
+}
+
+/**
+ * panthor_submit_ctx_update_job_sync_signal_fences() - Update fences
+ * on the signal operations specified by a job.
+ * @ctx: Context to search the signal operation in.
+ * @job_idx: Index of the job to operate on.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_submit_ctx_update_job_sync_signal_fences(struct panthor_submit_ctx *ctx,
+						 u32 job_idx)
+{
+	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
+						    struct panthor_device,
+						    base);
+	struct dma_fence *done_fence = &ctx->jobs[job_idx].job->s_fence->finished;
+	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
+	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
+
+	for (u32 i = 0; i < sync_op_count; i++) {
+		struct dma_fence *old_fence;
+		struct panthor_sync_signal *sig_sync;
+
+		if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))
+			continue;
+
+		sig_sync = panthor_submit_ctx_search_sync_signal(ctx, sync_ops[i].handle,
+								 sync_ops[i].timeline_value);
+		if (drm_WARN_ON(&ptdev->base, !sig_sync))
+			return -EINVAL;
+
+		old_fence = sig_sync->fence;
+		sig_sync->fence = dma_fence_get(done_fence);
+		dma_fence_put(old_fence);
+
+		if (drm_WARN_ON(&ptdev->base, !sig_sync->fence))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_collect_job_signal_ops() - Iterate over all job signal operations
+ * and add them to the context.
+ * @ctx: Context to search the signal operation in.
+ * @job_idx: Index of the job to operate on.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_submit_ctx_collect_job_signal_ops(struct panthor_submit_ctx *ctx,
+					  u32 job_idx)
+{
+	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
+	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
+
+	for (u32 i = 0; i < sync_op_count; i++) {
+		struct panthor_sync_signal *sig_sync;
+		int ret;
+
+		if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))
+			continue;
+
+		ret = panthor_check_sync_op(&sync_ops[i]);
+		if (ret)
+			return ret;
+
+		sig_sync = panthor_submit_ctx_get_sync_signal(ctx,
+							      sync_ops[i].handle,
+							      sync_ops[i].timeline_value);
+		if (IS_ERR(sig_sync))
+			return PTR_ERR(sig_sync);
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_push_fences() - Iterate over the signal array, and for each entry, push
+ * the currently assigned fence to the associated syncobj.
+ * @ctx: Context to push fences on.
+ *
+ * This is the last step of a submission procedure, and is done once we know the submission
+ * is effective and job fences are guaranteed to be signaled in finite time.
+ */
+static void
+panthor_submit_ctx_push_fences(struct panthor_submit_ctx *ctx)
+{
+	struct panthor_sync_signal *sig_sync;
+	unsigned long i;
+
+	xa_for_each(&ctx->signal, i, sig_sync) {
+		if (sig_sync->chain) {
+			drm_syncobj_add_point(sig_sync->syncobj, sig_sync->chain,
+					      sig_sync->fence, sig_sync->point);
+			sig_sync->chain = NULL;
+		} else {
+			drm_syncobj_replace_fence(sig_sync->syncobj, sig_sync->fence);
+		}
+	}
+}
+
+/**
+ * panthor_submit_ctx_add_sync_deps_to_job() - Add sync wait operations as
+ * job dependencies.
+ * @ctx: Submit context.
+ * @job_idx: Index of the job to operate on.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_submit_ctx_add_sync_deps_to_job(struct panthor_submit_ctx *ctx,
+					u32 job_idx)
+{
+	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
+						    struct panthor_device,
+						    base);
+	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
+	struct drm_sched_job *job = ctx->jobs[job_idx].job;
+	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
+	int ret = 0;
+
+	if (!sync_op_count)
+		return 0;
+
+	for (u32 i = 0; i < sync_op_count; i++) {
+		struct panthor_sync_signal *sig_sync;
+		struct dma_fence *fence;
+
+		if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL)
+			continue;
+
+		ret = panthor_check_sync_op(&sync_ops[i]);
+		if (ret)
+			return ret;
+
+		sig_sync = panthor_submit_ctx_search_sync_signal(ctx, sync_ops[i].handle,
+								 sync_ops[i].timeline_value);
+		if (sig_sync) {
+			if (drm_WARN_ON(&ptdev->base, !sig_sync->fence))
+				return -EINVAL;
+
+			fence = dma_fence_get(sig_sync->fence);
+		} else {
+			ret = drm_syncobj_find_fence(ctx->file, sync_ops[i].handle,
+						     sync_ops[i].timeline_value,
+						     0, &fence);
+			if (ret)
+				return ret;
+		}
+
+		ret = drm_sched_job_add_dependency(job, fence);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_collect_jobs_signal_ops() - Collect all signal operations
+ * and add them to the submit context.
+ * @ctx: Submit context.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_submit_ctx_collect_jobs_signal_ops(struct panthor_submit_ctx *ctx)
+{
+	for (u32 i = 0; i < ctx->job_count; i++) {
+		int ret;
+
+		ret = panthor_submit_ctx_collect_job_signal_ops(ctx, i);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_add_deps_and_arm_jobs() - Add jobs dependencies and arm jobs
+ * @ctx: Submit context.
+ * @add_resvs_deps: Callback used to add implicit job dependencies.
+ *
+ * Must be called after panthor_submit_ctx_prepare_resvs().
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_submit_ctx_add_deps_and_arm_jobs(struct panthor_submit_ctx *ctx,
+					 int (*add_resvs_deps)(struct drm_sched_job *))
+{
+	for (u32 i = 0; i < ctx->job_count; i++) {
+		int ret;
+
+		ret = add_resvs_deps(ctx->jobs[i].job);
+		if (ret)
+			return ret;
+
+		ret = panthor_submit_ctx_add_sync_deps_to_job(ctx, i);
+		if (ret)
+			return ret;
+
+		drm_sched_job_arm(ctx->jobs[i].job);
+
+		ret = panthor_submit_ctx_update_job_sync_signal_fences(ctx, i);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_prepare_resvs() - Lock/prepare reservation objects for all jobs.
+ * @ctx: Submit context.
+ * @prep_resvs: Callback used to prepare reservation objects associated to a job.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_submit_ctx_prepare_resvs(struct panthor_submit_ctx *ctx,
+				 int (*prep_resvs)(struct drm_exec *, struct drm_sched_job *))
+{
+	drm_exec_until_all_locked(&ctx->exec) {
+		for (u32 i = 0; i < ctx->job_count; i++) {
+			int ret = prep_resvs(&ctx->exec, ctx->jobs[i].job);
+
+			drm_exec_retry_on_contention(&ctx->exec);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_push_jobs() - Push jobs to their scheduling entities.
+ * @ctx: Submit context.
+ * @upd_resvs: Callback used to update reservation objects that were prepared in
+ * panthor_submit_ctx_prepare_resvs().
+ */
+static void
+panthor_submit_ctx_push_jobs(struct panthor_submit_ctx *ctx,
+			     void (*upd_resvs)(struct drm_sched_job *))
+{
+	for (u32 i = 0; i < ctx->job_count; i++) {
+		upd_resvs(ctx->jobs[i].job);
+		drm_sched_entity_push_job(ctx->jobs[i].job);
+
+		/* Job is owned by the scheduler now. */
+		ctx->jobs[i].job = NULL;
+	}
+
+	panthor_submit_ctx_push_fences(ctx);
+}
+
+/**
+ * panthor_submit_ctx_init() - Initializes a submission context
+ * @ctx: Submit context to initialize.
+ * @file: drm_file this submission happens on.
+ * @job_count: Number of jobs that will be submitted.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int panthor_submit_ctx_init(struct panthor_submit_ctx *ctx,
+				   struct drm_file *file, u32 job_count)
+{
+	ctx->jobs = kvmalloc_array(job_count, sizeof(*ctx->jobs),
+				   GFP_KERNEL | __GFP_ZERO);
+	if (!ctx->jobs)
+		return -ENOMEM;
+
+	ctx->file = file;
+	ctx->job_count = job_count;
+	xa_init_flags(&ctx->signal, XA_FLAGS_ALLOC);
+	drm_exec_init(&ctx->exec, DRM_EXEC_INTERRUPTIBLE_WAIT | DRM_EXEC_IGNORE_DUPLICATES);
+	return 0;
+}
+
+/**
+ * panthor_submit_ctx_cleanup() - Cleanup a submission context
+ * @ctx: Submit context to cleanup.
+ * @job_put: Callback used to release the job references held by the context.
+ */
+static void panthor_submit_ctx_cleanup(struct panthor_submit_ctx *ctx,
+				       void (*job_put)(struct drm_sched_job *))
+{
+	struct panthor_sync_signal *sig_sync;
+	unsigned long i;
+
+	drm_exec_fini(&ctx->exec);
+
+	xa_for_each(&ctx->signal, i, sig_sync)
+		panthor_sync_signal_free(sig_sync);
+
+	xa_destroy(&ctx->signal);
+
+	for (i = 0; i < ctx->job_count; i++) {
+		job_put(ctx->jobs[i].job);
+		kvfree(ctx->jobs[i].syncops);
+	}
+
+	kvfree(ctx->jobs);
+}
+
+static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct drm_file *file)
+{
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+	struct drm_panthor_dev_query *args = data;
+
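+	/* A NULL pointer is a request for the size of the queried object, so
+	 * userspace can allocate a buffer of the right size and call the
+	 * ioctl again.
+	 */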
+	if (!args->pointer) {
+		switch (args->type) {
+		case DRM_PANTHOR_DEV_QUERY_GPU_INFO:
+			args->size = sizeof(ptdev->gpu_info);
+			return 0;
+
+		case DRM_PANTHOR_DEV_QUERY_CSIF_INFO:
+			args->size = sizeof(ptdev->csif_info);
+			return 0;
+
+		default:
+			return -EINVAL;
+		}
+	}
+
+	switch (args->type) {
+	case DRM_PANTHOR_DEV_QUERY_GPU_INFO:
+		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->gpu_info);
+
+	case DRM_PANTHOR_DEV_QUERY_CSIF_INFO:
+		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->csif_info);
+
+	default:
+		return -EINVAL;
+	}
+}
+
+#define PANTHOR_VM_CREATE_FLAGS			0
+
+static int panthor_ioctl_vm_create(struct drm_device *ddev, void *data,
+				   struct drm_file *file)
+{
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_vm_create *args = data;
+	u64 kernel_va_start = 0;
+	int cookie, ret;
+
+	if (!drm_dev_enter(ddev, &cookie))
+		return -ENODEV;
+
+	if (args->flags & ~PANTHOR_VM_CREATE_FLAGS) {
+		ret = -EINVAL;
+		goto out_dev_exit;
+	}
+
+	if (drm_WARN_ON(ddev, !va_bits) || args->kernel_va_range > (1ull << (va_bits - 1))) {
+		ret = -EINVAL;
+		goto out_dev_exit;
+	}
+
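+	/* Carve the kernel VA region out of the top of the lower half of the
+	 * VA space: it ends at the 1ull << (va_bits - 1) boundary.
+	 */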
+	if (args->kernel_va_range)
+		kernel_va_start = (1ull << (va_bits - 1)) - args->kernel_va_range;
+
+	ret = panthor_vm_pool_create_vm(ptdev, pfile->vms,
+					kernel_va_start, args->kernel_va_range);
+	if (ret >= 0) {
+		args->id = ret;
+		ret = 0;
+	}
+
+out_dev_exit:
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+static int panthor_ioctl_vm_destroy(struct drm_device *ddev, void *data,
+				    struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_vm_destroy *args = data;
+
+	if (args->pad)
+		return -EINVAL;
+
+	return panthor_vm_pool_destroy_vm(pfile->vms, args->id);
+}
+
+#define PANTHOR_BO_FLAGS		DRM_PANTHOR_BO_NO_MMAP
+
+static int panthor_ioctl_bo_create(struct drm_device *ddev, void *data,
+				   struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct panthor_gem_object *bo;
+	struct drm_panthor_bo_create *args = data;
+	struct panthor_vm *vm = NULL;
+	int cookie, ret;
+
+	if (!drm_dev_enter(ddev, &cookie))
+		return -ENODEV;
+
+	if (!args->size || args->pad ||
+	    (args->flags & ~PANTHOR_BO_FLAGS)) {
+		ret = -EINVAL;
+		goto out_dev_exit;
+	}
+
+	if (args->exclusive_vm_id) {
+		vm = panthor_vm_pool_get_vm(pfile->vms, args->exclusive_vm_id);
+		if (!vm) {
+			ret = -EINVAL;
+			goto out_dev_exit;
+		}
+	}
+
+	bo = panthor_gem_create_with_handle(file, ddev, vm, args->size, args->flags,
+					    &args->handle);
+
+	panthor_vm_put(vm);
+
+	if (IS_ERR(bo))
+		ret = PTR_ERR(bo);
+	else
+		ret = 0;
+
+out_dev_exit:
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+static int panthor_ioctl_bo_mmap_offset(struct drm_device *ddev, void *data,
+					struct drm_file *file)
+{
+	struct drm_panthor_bo_mmap_offset *args = data;
+	struct drm_gem_object *obj;
+	int ret;
+
+	if (args->pad)
+		return -EINVAL;
+
+	obj = drm_gem_object_lookup(file, args->handle);
+	if (!obj)
+		return -ENOENT;
+
+	ret = drm_gem_create_mmap_offset(obj);
+	if (ret)
+		goto out;
+
+	args->offset = drm_vma_node_offset_addr(&obj->vma_node);
+
+out:
+	drm_gem_object_put(obj);
+	return ret;
+}
+
+static int panthor_ioctl_group_submit(struct drm_device *ddev, void *data,
+				      struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_group_submit *args = data;
+	struct drm_panthor_queue_submit *jobs_args;
+	struct panthor_submit_ctx ctx;
+	int ret = 0, cookie;
+
+	if (args->pad)
+		return -EINVAL;
+
+	if (!drm_dev_enter(ddev, &cookie))
+		return -ENODEV;
+
+	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->queue_submits);
+	if (ret)
+		goto out_dev_exit;
+
+	ret = panthor_submit_ctx_init(&ctx, file, args->queue_submits.count);
+	if (ret)
+		goto out_free_jobs_args;
+
+	for (u32 i = 0; i < args->queue_submits.count; i++) {
+		const struct drm_panthor_queue_submit *qsubmit = &jobs_args[i];
+		struct drm_sched_job *job;
+
+		job = panthor_job_create(pfile, args->group_handle, qsubmit);
+		if (IS_ERR(job)) {
+			ret = PTR_ERR(job);
+			goto out_cleanup_submit_ctx;
+		}
+
+		ret = panthor_submit_ctx_add_job(&ctx, i, job, &qsubmit->syncs);
+		if (ret)
+			goto out_cleanup_submit_ctx;
+	}
+
+	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
+	if (ret)
+		goto out_cleanup_submit_ctx;
+
+	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_job_prepare_resvs);
+	if (ret)
+		goto out_cleanup_submit_ctx;
+
+	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_job_add_resvs_deps);
+	if (ret)
+		goto out_cleanup_submit_ctx;
+
+	/* Nothing can fail after that point. */
+	panthor_submit_ctx_push_jobs(&ctx, panthor_job_update_resvs);
+
+out_cleanup_submit_ctx:
+	panthor_submit_ctx_cleanup(&ctx, panthor_job_put);
+
+out_free_jobs_args:
+	kvfree(jobs_args);
+
+out_dev_exit:
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+static int panthor_ioctl_group_destroy(struct drm_device *ddev, void *data,
+				       struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_group_destroy *args = data;
+
+	if (args->pad)
+		return -EINVAL;
+
+	return panthor_group_destroy(pfile, args->group_handle);
+}
+
+static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
+				      struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_group_create *args = data;
+	struct drm_panthor_queue_create *queue_args;
+	int ret;
+
+	if (!args->queues.count)
+		return -EINVAL;
+
+	ret = PANTHOR_UOBJ_GET_ARRAY(queue_args, &args->queues);
+	if (ret)
+		return ret;
+
+	ret = panthor_group_create(pfile, args, queue_args);
+	if (ret >= 0) {
+		args->group_handle = ret;
+		ret = 0;
+	}
+
+	kvfree(queue_args);
+	return ret;
+}
+
+static int panthor_ioctl_group_get_state(struct drm_device *ddev, void *data,
+					 struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_group_get_state *args = data;
+
+	return panthor_group_get_state(pfile, args);
+}
+
+static int panthor_ioctl_tiler_heap_create(struct drm_device *ddev, void *data,
+					   struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_tiler_heap_create *args = data;
+	struct panthor_heap_pool *pool;
+	struct panthor_vm *vm;
+	int ret;
+
+	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
+	if (!vm)
+		return -EINVAL;
+
+	pool = panthor_vm_get_heap_pool(vm, true);
+	if (IS_ERR(pool)) {
+		ret = PTR_ERR(pool);
+		goto out_put_vm;
+	}
+
+	ret = panthor_heap_create(pool,
+				  args->initial_chunk_count,
+				  args->chunk_size,
+				  args->max_chunks,
+				  args->target_in_flight,
+				  &args->tiler_heap_ctx_gpu_va,
+				  &args->first_heap_chunk_gpu_va);
+	if (ret < 0)
+		goto out_put_heap_pool;
+
+	/* Heap pools are per-VM. We combine the VM and HEAP id to make
+	 * a unique heap handle.
+	 */
+	args->handle = (args->vm_id << 16) | ret;
+	ret = 0;
+
+out_put_heap_pool:
+	panthor_heap_pool_put(pool);
+
+out_put_vm:
+	panthor_vm_put(vm);
+	return ret;
+}
+
+static int panthor_ioctl_tiler_heap_destroy(struct drm_device *ddev, void *data,
+					    struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_tiler_heap_destroy *args = data;
+	struct panthor_heap_pool *pool;
+	struct panthor_vm *vm;
+	int ret;
+
+	if (args->pad)
+		return -EINVAL;
+
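+	/* The heap handle was built as (vm_id << 16) | heap_id in
+	 * panthor_ioctl_tiler_heap_create().
+	 */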
+	vm = panthor_vm_pool_get_vm(pfile->vms, args->handle >> 16);
+	if (!vm)
+		return -EINVAL;
+
+	pool = panthor_vm_get_heap_pool(vm, false);
+	if (!pool) {
+		ret = -EINVAL;
+		goto out_put_vm;
+	}
+
+	ret = panthor_heap_destroy(pool, args->handle & GENMASK(15, 0));
+	panthor_heap_pool_put(pool);
+
+out_put_vm:
+	panthor_vm_put(vm);
+	return ret;
+}
+
+static int panthor_ioctl_vm_bind_async(struct drm_device *ddev,
+				       struct drm_panthor_vm_bind *args,
+				       struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_vm_bind_op *jobs_args;
+	struct panthor_submit_ctx ctx;
+	struct panthor_vm *vm;
+	int ret = 0;
+
+	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
+	if (!vm)
+		return -EINVAL;
+
+	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->ops);
+	if (ret)
+		goto out_put_vm;
+
+	ret = panthor_submit_ctx_init(&ctx, file, args->ops.count);
+	if (ret)
+		goto out_free_jobs_args;
+
+	for (u32 i = 0; i < args->ops.count; i++) {
+		struct drm_panthor_vm_bind_op *op = &jobs_args[i];
+		struct drm_sched_job *job;
+
+		job = panthor_vm_bind_job_create(file, vm, op);
+		if (IS_ERR(job)) {
+			ret = PTR_ERR(job);
+			goto out_cleanup_submit_ctx;
+		}
+
+		ret = panthor_submit_ctx_add_job(&ctx, i, job, &op->syncs);
+		if (ret)
+			goto out_cleanup_submit_ctx;
+	}
+
+	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
+	if (ret)
+		goto out_cleanup_submit_ctx;
+
+	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_vm_bind_job_prepare_resvs);
+	if (ret)
+		goto out_cleanup_submit_ctx;
+
+	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_vm_bind_job_add_resvs_deps);
+	if (ret)
+		goto out_cleanup_submit_ctx;
+
+	/* Nothing can fail after that point. */
+	panthor_submit_ctx_push_jobs(&ctx, panthor_vm_bind_job_update_resvs);
+
+out_cleanup_submit_ctx:
+	panthor_submit_ctx_cleanup(&ctx, panthor_vm_bind_job_put);
+
+out_free_jobs_args:
+	kvfree(jobs_args);
+
+out_put_vm:
+	panthor_vm_put(vm);
+	return ret;
+}
+
+static int panthor_ioctl_vm_bind_sync(struct drm_device *ddev,
+				      struct drm_panthor_vm_bind *args,
+				      struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+	struct drm_panthor_vm_bind_op *jobs_args;
+	struct panthor_vm *vm;
+	int ret;
+
+	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
+	if (!vm)
+		return -EINVAL;
+
+	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->ops);
+	if (ret)
+		goto out_put_vm;
+
+	for (u32 i = 0; i < args->ops.count; i++) {
+		ret = panthor_vm_bind_exec_sync_op(file, vm, &jobs_args[i]);
+		if (ret) {
+			/* Update ops.count so the user knows where things failed. */
+			args->ops.count = i;
+			break;
+		}
+	}
+
+	kvfree(jobs_args);
+
+out_put_vm:
+	panthor_vm_put(vm);
+	return ret;
+}
+
+#define PANTHOR_VM_BIND_FLAGS DRM_PANTHOR_VM_BIND_ASYNC
+
+static int panthor_ioctl_vm_bind(struct drm_device *ddev, void *data,
+				 struct drm_file *file)
+{
+	struct drm_panthor_vm_bind *args = data;
+	int cookie, ret;
+
+	if (!drm_dev_enter(ddev, &cookie))
+		return -ENODEV;
+
+	if (args->flags & DRM_PANTHOR_VM_BIND_ASYNC)
+		ret = panthor_ioctl_vm_bind_async(ddev, args, file);
+	else
+		ret = panthor_ioctl_vm_bind_sync(ddev, args, file);
+
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+static int
+panthor_open(struct drm_device *ddev, struct drm_file *file)
+{
+	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
+	struct panthor_file *pfile;
+	int ret;
+
+	if (!try_module_get(THIS_MODULE))
+		return -EINVAL;
+
+	pfile = kzalloc(sizeof(*pfile), GFP_KERNEL);
+	if (!pfile) {
+		ret = -ENOMEM;
+		goto err_put_mod;
+	}
+
+	pfile->ptdev = ptdev;
+
+	ret = panthor_vm_pool_create(pfile);
+	if (ret)
+		goto err_free_file;
+
+	ret = panthor_group_pool_create(pfile);
+	if (ret)
+		goto err_destroy_vm_pool;
+
+	file->driver_priv = pfile;
+	return 0;
+
+err_destroy_vm_pool:
+	panthor_vm_pool_destroy(pfile);
+
+err_free_file:
+	kfree(pfile);
+
+err_put_mod:
+	module_put(THIS_MODULE);
+	return ret;
+}
+
+static void
+panthor_postclose(struct drm_device *ddev, struct drm_file *file)
+{
+	struct panthor_file *pfile = file->driver_priv;
+
+	panthor_group_pool_destroy(pfile);
+	panthor_vm_pool_destroy(pfile);
+
+	kfree(pfile);
+	module_put(THIS_MODULE);
+}
+
+static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
+#define PANTHOR_IOCTL(n, func, flags) \
+	DRM_IOCTL_DEF_DRV(PANTHOR_##n, panthor_ioctl_##func, flags)
+
+	PANTHOR_IOCTL(DEV_QUERY, dev_query, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(VM_CREATE, vm_create, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(VM_DESTROY, vm_destroy, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(VM_BIND, vm_bind, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(BO_CREATE, bo_create, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(BO_MMAP_OFFSET, bo_mmap_offset, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(GROUP_CREATE, group_create, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(GROUP_DESTROY, group_destroy, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(GROUP_GET_STATE, group_get_state, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
+	PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
+};
+
+static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct drm_file *file = filp->private_data;
+	struct panthor_file *pfile = file->driver_priv;
+	struct panthor_device *ptdev = pfile->ptdev;
+	int ret, cookie;
+
+	if (!drm_dev_enter(file->minor->dev, &cookie))
+		return -ENODEV;
+
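+	/* Fake offsets at or above DRM_PANTHOR_USER_MMIO_OFFSET map the MMIO
+	 * regions exposed to userspace (e.g. the LATEST_FLUSH page); anything
+	 * below is a regular GEM object mapping.
+	 */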
+	if (vma->vm_pgoff >= (DRM_PANTHOR_USER_MMIO_OFFSET >> PAGE_SHIFT))
+		ret = panthor_device_mmap_io(ptdev, vma);
+	else
+		ret = drm_gem_mmap(filp, vma);
+
+	drm_dev_exit(cookie);
+	return ret;
+}
+
+static const struct file_operations panthor_drm_driver_fops = {
+	.open = drm_open,
+	.release = drm_release,
+	.unlocked_ioctl = drm_ioctl,
+	.compat_ioctl = drm_compat_ioctl,
+	.poll = drm_poll,
+	.read = drm_read,
+	.llseek = noop_llseek,
+	.mmap = panthor_mmap,
+};
+
+#ifdef CONFIG_DEBUG_FS
+static void panthor_debugfs_init(struct drm_minor *minor)
+{
+	panthor_mmu_debugfs_init(minor);
+}
+#endif
+
+/*
+ * PanCSF driver version:
+ * - 1.0 - initial interface
+ */
+static const struct drm_driver panthor_drm_driver = {
+	.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
+			   DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
+	.open = panthor_open,
+	.postclose = panthor_postclose,
+	.ioctls = panthor_drm_driver_ioctls,
+	.num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
+	.fops = &panthor_drm_driver_fops,
+	.name = "panthor",
+	.desc = "Panthor DRM driver",
+	.date = "20230801",
+	.major = 1,
+	.minor = 0,
+
+	.gem_create_object = panthor_gem_create_object,
+	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
+#ifdef CONFIG_DEBUG_FS
+	.debugfs_init = panthor_debugfs_init,
+#endif
+};
+
+static int panthor_probe(struct platform_device *pdev)
+{
+	struct panthor_device *ptdev;
+	int ret;
+
+	ptdev = devm_drm_dev_alloc(&pdev->dev, &panthor_drm_driver,
+				   struct panthor_device, base);
+	if (IS_ERR(ptdev))
+		return PTR_ERR(ptdev);
+
+	platform_set_drvdata(pdev, ptdev);
+
+	ret = panthor_device_init(ptdev);
+	if (ret)
+		return ret;
+
+	return drm_dev_register(&ptdev->base, 0);
+}
+
+static void panthor_remove(struct platform_device *pdev)
+{
+	struct panthor_device *ptdev = platform_get_drvdata(pdev);
+
+	panthor_device_unplug(ptdev);
+}
+
+static const struct of_device_id dt_match[] = {
+	{ .compatible = "rockchip,rk3588-mali" },
+	{ .compatible = "arm,mali-valhall-csf" },
+	{}
+};
+MODULE_DEVICE_TABLE(of, dt_match);
+
+static DEFINE_RUNTIME_DEV_PM_OPS(panthor_pm_ops,
+				 panthor_device_suspend,
+				 panthor_device_resume,
+				 NULL);
+
+static struct platform_driver panthor_driver = {
+	.probe = panthor_probe,
+	.remove_new = panthor_remove,
+	.driver = {
+		.name = "panthor",
+		.pm = &panthor_pm_ops,
+		.of_match_table = dt_match,
+	},
+};
+
+/**
+ * panthor_cleanup_wq - Workqueue used for deferred cleanup work.
+ *
+ * We create a dedicated workqueue so we can drain on unplug and
+ * make sure all resources are freed before the module is unloaded.
+ */
+struct workqueue_struct *panthor_cleanup_wq;
+
+static int __init panthor_init(void)
+{
+	int ret;
+
+	ret = panthor_mmu_pt_cache_init();
+	if (ret)
+		return ret;
+
+	panthor_cleanup_wq = alloc_workqueue("panthor-cleanup", WQ_UNBOUND, 0);
+	if (!panthor_cleanup_wq) {
+		pr_err("panthor: Failed to allocate the workqueue\n");
+		ret = -ENOMEM;
+		goto err_mmu_pt_cache_fini;
+	}
+
+	ret = platform_driver_register(&panthor_driver);
+	if (ret)
+		goto err_destroy_cleanup_wq;
+
+	return ret;
+
+err_destroy_cleanup_wq:
+	destroy_workqueue(panthor_cleanup_wq);
+
+err_mmu_pt_cache_fini:
+	panthor_mmu_pt_cache_fini();
+	return ret;
+}
+module_init(panthor_init);
+
+static void __exit panthor_exit(void)
+{
+	platform_driver_unregister(&panthor_driver);
+	destroy_workqueue(panthor_cleanup_wq);
+	panthor_mmu_pt_cache_fini();
+}
+module_exit(panthor_exit);
+
+MODULE_AUTHOR("Panthor Project Developers");
+MODULE_DESCRIPTION("Panthor DRM Driver");
+MODULE_LICENSE("Dual MIT/GPL");
-- 
2.41.0



* [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (11 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 12/15] drm/panthor: Add the driver frontend block Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-11 16:35   ` Robin Murphy
  2023-08-21 12:47   ` Steven Price
  2023-08-09 16:53   ` Boris Brezillon
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Now that all blocks are available, we can add/update Kconfig/Makefile
files to allow compilation.

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Add new dependencies on GPUVA and DRM_SCHED

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/Kconfig          |  2 ++
 drivers/gpu/drm/Makefile         |  1 +
 drivers/gpu/drm/panthor/Kconfig  | 16 ++++++++++++++++
 drivers/gpu/drm/panthor/Makefile | 14 ++++++++++++++
 4 files changed, 33 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/Kconfig
 create mode 100644 drivers/gpu/drm/panthor/Makefile

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 2a44b9419d4d..bddfbdb2ffee 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -358,6 +358,8 @@ source "drivers/gpu/drm/lima/Kconfig"
 
 source "drivers/gpu/drm/panfrost/Kconfig"
 
+source "drivers/gpu/drm/panthor/Kconfig"
+
 source "drivers/gpu/drm/aspeed/Kconfig"
 
 source "drivers/gpu/drm/mcde/Kconfig"
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 215e78e79125..0a260727505f 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -188,6 +188,7 @@ obj-$(CONFIG_DRM_TVE200) += tve200/
 obj-$(CONFIG_DRM_XEN) += xen/
 obj-$(CONFIG_DRM_VBOXVIDEO) += vboxvideo/
 obj-$(CONFIG_DRM_LIMA)  += lima/
+obj-$(CONFIG_DRM_PANTHOR) += panthor/
 obj-$(CONFIG_DRM_PANFROST) += panfrost/
 obj-$(CONFIG_DRM_ASPEED_GFX) += aspeed/
 obj-$(CONFIG_DRM_MCDE) += mcde/
diff --git a/drivers/gpu/drm/panthor/Kconfig b/drivers/gpu/drm/panthor/Kconfig
new file mode 100644
index 000000000000..a9d17b1bbb75
--- /dev/null
+++ b/drivers/gpu/drm/panthor/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0 or MIT
+
+config DRM_PANTHOR
+	tristate "Panthor (DRM support for ARM Mali CSF-based GPUs)"
+	depends on DRM
+	depends on ARM || ARM64 || (COMPILE_TEST && !GENERIC_ATOMIC64)
+	depends on MMU
+	select DRM_EXEC
+	select DRM_SCHED
+	select IOMMU_SUPPORT
+	select IOMMU_IO_PGTABLE_LPAE
+	select DRM_GEM_SHMEM_HELPER
+	select PM_DEVFREQ
+	select DEVFREQ_GOV_SIMPLE_ONDEMAND
+	help
+	  DRM driver for ARM Mali CSF-based GPUs.
diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
new file mode 100644
index 000000000000..64193a484879
--- /dev/null
+++ b/drivers/gpu/drm/panthor/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0 or MIT
+
+panthor-y := \
+	panthor_devfreq.o \
+	panthor_device.o \
+	panthor_drv.o \
+	panthor_gem.o \
+	panthor_gpu.o \
+	panthor_heap.o \
+	panthor_fw.o \
+	panthor_mmu.o \
+	panthor_sched.o
+
+obj-$(CONFIG_DRM_PANTHOR) += panthor.o
-- 
2.41.0



* [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
@ 2023-08-09 16:53   ` Boris Brezillon
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Daniel Vetter, Marty E . Plummer, Rob Herring,
	Clément Péron, Nicolas Boichat, Neil Armstrong,
	Faith Ekstrand, Daniel Stone, Liviu Dudau, Steven Price,
	Robin Murphy, Liviu Dudau, Krzysztof Kozlowski, Rob Herring,
	Conor Dooley, devicetree

From: Liviu Dudau <liviu.dudau@arm.com>

Arm has introduced a new v10 GPU architecture that replaces the Job Manager
interface with a new Command Stream Frontend. It adds firmware driven
command stream queues that can be used by kernel and user space to submit
jobs to the GPU.

Add the initial schema for the device tree that is based on support for
RK3588 SoC. The minimum number of clocks is one for the IP, but Rockchip
platforms tend to expose the three semi-independent clocks for better
power management.

v2:
- New commit

Signed-off-by: Liviu Dudau <liviu.dudau@arm.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Conor Dooley <conor+dt@kernel.org>
Cc: devicetree@vger.kernel.org
---
 .../bindings/gpu/arm,mali-valhall-csf.yaml    | 148 ++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml

diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
new file mode 100644
index 000000000000..2b9f77aa0b7a
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
@@ -0,0 +1,148 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpu/arm,mali-valhall-csf.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM Mali Valhall GPU
+
+maintainers:
+  - Liviu Dudau <liviu.dudau@arm.com>
+  - Boris Brezillon <boris.brezillon@collabora.com>
+
+properties:
+  $nodename:
+    pattern: '^gpu@[a-f0-9]+$'
+
+  compatible:
+    oneOf:
+      - items:
+          - enum:
+              - rockchip,rk3588-mali
+          - const: arm,mali-valhall-csf   # Mali Valhall GPU model/revision is fully discoverable
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    items:
+      - description: Job interrupt
+      - description: MMU interrupt
+      - description: GPU interrupt
+
+  interrupt-names:
+    items:
+      - const: job
+      - const: mmu
+      - const: gpu
+
+  clocks:
+    minItems: 1
+    maxItems: 3
+
+  clock-names:
+    minItems: 1
+    items:
+      - const: core
+      - const: coregroup
+      - const: stacks
+
+  mali-supply: true
+
+  sram-supply: true
+
+  operating-points-v2: true
+
+  power-domains:
+    minItems: 1
+    maxItems: 5
+
+  power-domain-names:
+    minItems: 1
+    maxItems: 5
+
+  "#cooling-cells":
+    const: 2
+
+  dynamic-power-coefficient:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      A u32 value that represents the running time dynamic
+      power coefficient in units of uW/MHz/V^2. The
+      coefficient can either be calculated from power
+      measurements or derived by analysis.
+
+      The dynamic power consumption of the GPU is
+      proportional to the square of the Voltage (V) and
+      the clock frequency (f). The coefficient is used to
+      calculate the dynamic power as below -
+
+      Pdyn = dynamic-power-coefficient * V^2 * f
+
+      where voltage is in V, frequency is in MHz.
+
+  dma-coherent: true
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - interrupt-names
+  - clocks
+  - mali-supply
+
+additionalProperties: false
+
+allOf:
+  - if:
+      properties:
+        compatible:
+          contains:
+            const: rockchip,rk3588-mali
+    then:
+      properties:
+        clocks:
+          minItems: 3
+        clock-names:
+          items:
+            - const: core
+            - const: coregroup
+            - const: stacks
+
+examples:
+  - |
+    #include <dt-bindings/clock/rockchip,rk3588-cru.h>
+    #include <dt-bindings/interrupt-controller/irq.h>
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    #include <dt-bindings/power/rk3588-power.h>
+
+    gpu: gpu@fb000000 {
+        compatible = "rockchip,rk3588-mali", "arm,mali-valhall-csf";
+        reg = <0xfb000000 0x200000>;
+        interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH 0>,
+                     <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH 0>,
+                     <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH 0>;
+        interrupt-names = "job", "mmu", "gpu";
+        clock-names = "core", "coregroup", "stacks";
+        clocks = <&cru CLK_GPU>, <&cru CLK_GPU_COREGROUP>,
+                 <&cru CLK_GPU_STACKS>;
+        power-domains = <&power RK3588_PD_GPU>;
+        operating-points-v2 = <&gpu_opp_table>;
+        mali-supply = <&vdd_gpu_s0>;
+        sram-supply = <&vdd_gpu_mem_s0>;
+        status = "disabled";
+    };
+
+    gpu_opp_table: opp-table {
+        compatible = "operating-points-v2";
+        opp-300000000 {
+            opp-hz = /bits/ 64 <300000000>;
+            opp-microvolt = <675000 675000 850000>;
+        };
+        opp-400000000 {
+            opp-hz = /bits/ 64 <400000000>;
+            opp-microvolt = <675000 675000 850000>;
+        };
+    };
+
+...
-- 
2.41.0



* [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (13 preceding siblings ...)
  2023-08-09 16:53   ` Boris Brezillon
@ 2023-08-09 16:53 ` Boris Brezillon
  2023-08-11 16:08   ` Steven Price
  2023-08-31 13:18   ` Liviu Dudau
  2023-08-09 20:22 ` [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Rob Herring
  2023-09-27 15:47 ` Steven Price
  16 siblings, 2 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-09 16:53 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Boris Brezillon, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Add an entry for the Panthor driver to the MAINTAINERS file.

v2:
- New commit

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---

If anyone from Arm wants to volunteer to become a co-maintainer, that
would be highly appreciated.
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index cd882b87a3c6..6149ab68d461 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1624,6 +1624,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	drivers/gpu/drm/panfrost/
 F:	include/uapi/drm/panfrost_drm.h
 
+ARM MALI PANTHOR DRM DRIVER
+M:	Boris Brezillon <boris.brezillon@collabora.com>
+L:	dri-devel@lists.freedesktop.org
+S:	Supported
+T:	git git://anongit.freedesktop.org/drm/drm-misc
+F:	drivers/gpu/drm/panthor/
+F:	include/uapi/drm/panthor_drm.h
+
 ARM MALI-DP DRM DRIVER
 M:	Liviu Dudau <liviu.dudau@arm.com>
 S:	Supported
-- 
2.41.0



* Re: [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (14 preceding siblings ...)
  2023-08-09 16:53 ` [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS Boris Brezillon
@ 2023-08-09 20:22 ` Rob Herring
  2023-08-10 15:44   ` Boris Brezillon
  2023-09-27 15:47 ` Steven Price
  16 siblings, 1 reply; 93+ messages in thread
From: Rob Herring @ 2023-08-09 20:22 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Steven Price, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On Wed, Aug 9, 2023 at 10:53 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> Hello,
>
> This is the second version of the kernel driver meant to support new Mali
> GPUs which are delegating the scheduling to a firmware.
>
> The RFC has been dropped as the major blocking points have been addressed
> (request to use drm_sched, request to implement a VM_BIND-like ioctl,
> request to use drm_gpuva_mgr for the VM logic, lack of PM/devfreq support).
>
> This series is based on drm-misc-next and depends on some drm_sched [1]
> and iommu [2] changes.
>
> A branch containing all those dependencies is available here[3], and
> here [4] is another one containing all the patches needed to have
> a working GPU on rk3588 on top. The CSF firmware binary can be found
> here[5].
>
> The mesa branch used to test this new driver is available here [6].
> It's still under development and it's just a gallium driver right now,
> but we are working on that ;-).
>
> Here is a non-exaustive changelog, check each commit for a detailed
> changelog.
>
> v2:
> - Rename the driver (pancsf -> panthor)
> - Split the commit adding the driver to ease review
> - Use drm_sched for dependency tracking/job submission
> - Add a VM_BIND ioctl
> - Add the concept of exclusive VM for BOs that are only ever mapped to a
>   single VM
> - Document the code and uAPI
> - Add a DT binding doc
>
> I tried to Cc anyone that was involved in any development of the code
> I picked from panfrost, so they can acknowledge the GPL2 -> MIT+GPL2
> change. If I missed someone, please let me know.

Panfrost was largely based on etnaviv, vc4, v3d, and msm. Those are
all GPL2 (or 2+) only. How is relicensing that code okay? Also,
panfrost depends on drm_gem_shmem_helper.c (at least) which is GPL2.
Does that get re-implemented in a MIT licensed environment?

Maybe some drivers are enough of a silo to get away with MIT
licensing, but I wouldn't be comfortable claiming it.

Rob


* Re: [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs
  2023-08-09 20:22 ` [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Rob Herring
@ 2023-08-10 15:44   ` Boris Brezillon
  2023-08-21 14:01     ` Rob Herring
  0 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-10 15:44 UTC (permalink / raw)
  To: Rob Herring
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Steven Price, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Hello Rob,

On Wed, 9 Aug 2023 14:22:59 -0600
Rob Herring <robh@kernel.org> wrote:

> On Wed, Aug 9, 2023 at 10:53 AM Boris Brezillon
> <boris.brezillon@collabora.com> wrote:
> >
> > I tried to Cc anyone that was involved in any development of the code
> > I picked from panfrost, so they can acknowledge the GPL2 -> MIT+GPL2
> > change. If I missed someone, please let me know.  
> 
> Panfrost was largely based on etnaviv, vc4, v3d, and msm. Those are
> all GPL2 (or 2+) only.

Uh, I must have missed some copyright headers then. Note that not all
panfrost files were taken as a base for panthor:

- Makefile/Kconfig. I honestly hope there's nothing copyright-able in
  there, given there's no other way to define your driver and
  compilation rules.
- panthor_device.{c,h} copied from panfrost_device.{c,h} with quite a
  few modifications in the process. This one has your copyright, and
  Marty's one.
- a tiny part of panthor_drv.c was copied from panfrost_drv.c, but let's
  be honest, the part that was copied (ioctl wrappers, mostly), can't
  really be done differently. This one has your copyright, Marty's one,
  and Collabora's one.
- panthor_regs.h copied from panfrost_regs.h. This one has your
  copyright, Marty's one and Arm's one (definitions extracted from
  kbase). But again, I'm not even sure register definitions are
  copyright-able, given there's no other way to define them. If that
  makes a difference, I changed the prefix, and dropped definitions that
  do not exist on CSF HW.
- panthor_gpu.{c,h} copied from panfrost_gpu.{c,h}. These files have
  your copyright, Marty's one, and Collabora's one.
- panthor_{gem,mmu}.{c,h} copied from panfrost_{gem,mmu}.{c,h}. Those
  ones have your copyright only.
- panthor_devfreq.{c,h} copied from panfrost_devfreq.{c,h}. Collabora's
  copyright only.
- panthor_{heap,fw,sched}.{c,h}. Those are brand new files, that were
  written from scratch.

I also git-blamed the lines I copied to Cc any contributors to the
above files. I might have omitted someone, but I did my best to
try and spot people that have a say in this decision.

> How is relicensing that code okay?

Sorry, the copyright headers of the files I copied didn't mention that
:-/. If that's an omission, it would be good to have the headers updated
to reflect the actual chain of copyrights.

> Also,
> panfrost depends on drm_gem_shmem_helper.c (at least) which is GPL2.
> Does that get re-implemented in a MIT licensed environment?

Not only drm_gem_shmem, but drm_gpuva_mgr and drm_sched too. And yes,
any helper function/lib that's not GPL+MIT will have to be
re-implemented or replaced by something else.

> 
> Maybe some drivers are enough of a silo to get away with MIT
> licensing, but I wouldn't be comfortable claiming it.

Well, yes, re-using the code as-is is almost impossible, unless
someone rewrites the various GPL components we depend on. But if
someone wants to pick, say, the scheduling logic, and replace drm_sched
by something else, they can. Not saying it's worth it, just saying it's
possible.

Regards,

Boris



* Re: [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t
  2023-08-09 16:53 ` [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t Boris Brezillon
@ 2023-08-11 13:08   ` Steven Price
  2023-08-19  2:13     ` Dmitry Osipenko
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-11 13:08 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> This way we can grab a pages ref without acquiring the resv lock when
> pages_use_count > 0. Need to implement asynchronous map using the

NIT: s/Need/This is needed/

> drm_gpuva_mgr when the map/unmap operation triggers a mapping split,
> requiring the new left/right regions to grab an additional page ref
> to guarantee that the pages stay pinned when the middle section is
> unmapped.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c  | 28 +++++++++++++------------
>  drivers/gpu/drm/lima/lima_gem.c         |  2 +-
>  drivers/gpu/drm/panfrost/panfrost_mmu.c |  2 +-
>  include/drm/drm_gem_shmem_helper.h      |  2 +-
>  4 files changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index a783d2245599..ca6938ea1b82 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -155,7 +155,7 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
>  		if (shmem->pages)
>  			drm_gem_shmem_put_pages(shmem);
>  
> -		drm_WARN_ON(obj->dev, shmem->pages_use_count);
> +		drm_WARN_ON(obj->dev, atomic_read(&shmem->pages_use_count));
>  
>  		dma_resv_unlock(shmem->base.resv);
>  	}
> @@ -172,14 +172,14 @@ static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
>  
>  	dma_resv_assert_held(shmem->base.resv);
>  
> -	if (shmem->pages_use_count++ > 0)
> +	if (atomic_inc_return(&shmem->pages_use_count) > 1)
>  		return 0;
>  
>  	pages = drm_gem_get_pages(obj);
>  	if (IS_ERR(pages)) {
>  		drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
>  			    PTR_ERR(pages));
> -		shmem->pages_use_count = 0;
> +		atomic_set(&shmem->pages_use_count, 0);
>  		return PTR_ERR(pages);
>  	}
>  
> @@ -210,10 +210,10 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
>  
>  	dma_resv_assert_held(shmem->base.resv);
>  
> -	if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
> +	if (drm_WARN_ON_ONCE(obj->dev, !atomic_read(&shmem->pages_use_count)))
>  		return;
>  
> -	if (--shmem->pages_use_count > 0)
> +	if (atomic_dec_return(&shmem->pages_use_count) > 0)
>  		return;
>  
>  #ifdef CONFIG_X86
> @@ -263,6 +263,10 @@ int drm_gem_shmem_pin(struct drm_gem_shmem_object *shmem)
>  
>  	drm_WARN_ON(obj->dev, obj->import_attach);
>  
> +	/* If we are the first owner, we need to grab the lock. */
> +	if (atomic_inc_not_zero(&shmem->pages_use_count))
> +		return 0;
> +

Unless I'm misunderstanding I think this introduces a race where two
threads call drm_gem_shmem_pin() at the same time:

Thread1				| Thread 2
--------------------------------+------------------------------
drm_gem_shmem_pin()		|
 - pages_use_count == 0 so not  |
   incremented                  |
 - lock taken			|
drm_gem_shmem_pin_locked()	|
drm_gem_shmem_get_pages()	|
 - pages_use_count incremented	|
<thread descheduled>            | drm_gem_shmem_pin()
                                |  - pages_use_count == 1 so is it
				|    incremented and returns early
				|    without taking the lock
				| Code tries to use shmem->pages
<thread rescheduled>		| and blows up
drm_gem_get_pages()		|
shmem->pages populated		|
lock released			|

I think you need to modify drm_gem_shmem_get_pages() to only increment
pages_use_count when shmem->pages has been populated. That also gets rid
of the atomic_set() in that function which scares me.
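
To illustrate, here's a rough sketch of that reordering (just the idea,
not the actual helper code, with the WC handling and other details
trimmed):

static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
{
	struct drm_gem_object *obj = &shmem->base;
	struct page **pages;

	dma_resv_assert_held(shmem->base.resv);

	/* The final put also happens under the resv lock, so the count
	 * can't drop to zero behind our back here.
	 */
	if (atomic_read(&shmem->pages_use_count) > 0) {
		atomic_inc(&shmem->pages_use_count);
		return 0;
	}

	pages = drm_gem_get_pages(obj);
	if (IS_ERR(pages)) {
		drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
			    PTR_ERR(pages));
		return PTR_ERR(pages);
	}

	shmem->pages = pages;

	/* Publish the count only once shmem->pages is populated, so the
	 * lockless atomic_inc_not_zero() fast path in drm_gem_shmem_pin()
	 * never sees a half-initialized object.
	 */
	atomic_set(&shmem->pages_use_count, 1);

	return 0;
}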

Steve

>  	ret = dma_resv_lock_interruptible(shmem->base.resv, NULL);
>  	if (ret)
>  		return ret;
> @@ -286,6 +290,10 @@ void drm_gem_shmem_unpin(struct drm_gem_shmem_object *shmem)
>  
>  	drm_WARN_ON(obj->dev, obj->import_attach);
>  
> +	/* If we are the last owner, we need to grab the lock. */
> +	if (atomic_add_unless(&shmem->pages_use_count, -1, 1))
> +		return;
> +
>  	dma_resv_lock(shmem->base.resv, NULL);
>  	drm_gem_shmem_unpin_locked(shmem);
>  	dma_resv_unlock(shmem->base.resv);
> @@ -543,18 +551,12 @@ static void drm_gem_shmem_vm_open(struct vm_area_struct *vma)
>  
>  	drm_WARN_ON(obj->dev, obj->import_attach);
>  
> -	dma_resv_lock(shmem->base.resv, NULL);
> -
>  	/*
>  	 * We should have already pinned the pages when the buffer was first
>  	 * mmap'd, vm_open() just grabs an additional reference for the new
>  	 * mm the vma is getting copied into (ie. on fork()).
>  	 */
> -	if (!drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
> -		shmem->pages_use_count++;
> -
> -	dma_resv_unlock(shmem->base.resv);
> -
> +	drm_WARN_ON_ONCE(obj->dev, atomic_inc_return(&shmem->pages_use_count) == 1);
>  	drm_gem_vm_open(vma);
>  }
>  
> @@ -632,7 +634,7 @@ void drm_gem_shmem_print_info(const struct drm_gem_shmem_object *shmem,
>  	if (shmem->base.import_attach)
>  		return;
>  
> -	drm_printf_indent(p, indent, "pages_use_count=%u\n", shmem->pages_use_count);
> +	drm_printf_indent(p, indent, "pages_use_count=%u\n", atomic_read(&shmem->pages_use_count));
>  	drm_printf_indent(p, indent, "vmap_use_count=%u\n", shmem->vmap_use_count);
>  	drm_printf_indent(p, indent, "vaddr=%p\n", shmem->vaddr);
>  }
> diff --git a/drivers/gpu/drm/lima/lima_gem.c b/drivers/gpu/drm/lima/lima_gem.c
> index 4f9736e5f929..0116518b1601 100644
> --- a/drivers/gpu/drm/lima/lima_gem.c
> +++ b/drivers/gpu/drm/lima/lima_gem.c
> @@ -47,7 +47,7 @@ int lima_heap_alloc(struct lima_bo *bo, struct lima_vm *vm)
>  		}
>  
>  		bo->base.pages = pages;
> -		bo->base.pages_use_count = 1;
> +		atomic_set(&bo->base.pages_use_count, 1);
>  
>  		mapping_set_unevictable(mapping);
>  	}
> diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> index c0123d09f699..f66e63bf743e 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> @@ -487,7 +487,7 @@ static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,
>  			goto err_unlock;
>  		}
>  		bo->base.pages = pages;
> -		bo->base.pages_use_count = 1;
> +		atomic_set(&bo->base.pages_use_count, 1);
>  	} else {
>  		pages = bo->base.pages;
>  		if (pages[page_offset]) {
> diff --git a/include/drm/drm_gem_shmem_helper.h b/include/drm/drm_gem_shmem_helper.h
> index bf0c31aa8fbe..0661f87d3bda 100644
> --- a/include/drm/drm_gem_shmem_helper.h
> +++ b/include/drm/drm_gem_shmem_helper.h
> @@ -37,7 +37,7 @@ struct drm_gem_shmem_object {
>  	 * Reference count on the pages table.
>  	 * The pages are put when the count reaches zero.
>  	 */
> -	unsigned int pages_use_count;
> +	atomic_t pages_use_count;
>  
>  	/**
>  	 * @madv: State for madvise



* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
@ 2023-08-11 14:13   ` Steven Price
  2023-09-01 13:59   ` Liviu Dudau
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-11 14:13 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Panthor follows the lead of other recently submitted drivers with
> ioctls allowing us to support modern Vulkan features, like sparse memory
> binding:
> 
> - Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
>   with the 'exclusive-VM' bit to speed-up BO reservation on job submission
> - VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
>   ioctl is loosely based on the Xe model, and can handle both
>   asynchronous and synchronous requests
> - GPU execution context creation/destruction, tiler heap context creation
>   and job submission. Those ioctls reflect how the hardware/scheduler
>   works and are thus driver specific.
> 
> We also have a way to expose IO regions, such that the usermode driver
> can directly access specific/well-isolate registers, like the
> LATEST_FLUSH register used to implement cache-flush reduction.
> 
> This uAPI intentionally keeps usermode queues out of the scope, which
> explains why doorbell registers and command stream ring-buffers are not
> directly exposed to userspace.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Turn the VM_{MAP,UNMAP} ioctls into a VM_BIND ioctl
> - Add the concept of exclusive_vm at BO creation time
> - Add missing padding fields
> - Add documentation
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Looks good, just documentation typos/corrections below. With those fixed

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  Documentation/gpu/driver-uapi.rst |   5 +
>  include/uapi/drm/panthor_drm.h    | 862 ++++++++++++++++++++++++++++++
>  2 files changed, 867 insertions(+)
>  create mode 100644 include/uapi/drm/panthor_drm.h
> 
> diff --git a/Documentation/gpu/driver-uapi.rst b/Documentation/gpu/driver-uapi.rst
> index c08bcbb95fb3..7a667901830f 100644
> --- a/Documentation/gpu/driver-uapi.rst
> +++ b/Documentation/gpu/driver-uapi.rst
> @@ -17,3 +17,8 @@ VM_BIND / EXEC uAPI
>      :doc: Overview
>  
>  .. kernel-doc:: include/uapi/drm/nouveau_drm.h
> +
> +drm/panthor uAPI
> +================
> +
> +.. kernel-doc:: include/uapi/drm/panthor_drm.h
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> new file mode 100644
> index 000000000000..e217eb5ad198
> --- /dev/null
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -0,0 +1,862 @@
> +/* SPDX-License-Identifier: MIT */
> +/* Copyright (C) 2023 Collabora ltd. */
> +#ifndef _PANTHOR_DRM_H_
> +#define _PANTHOR_DRM_H_
> +
> +#include "drm.h"
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/**
> + * DOC: Introduction
> + *
> + * This documentation decribes the Panthor IOCTLs.
                         ^^^^^^^^ describes
> + *
> + * Just a few generic rules about the data passed to the Panthor IOCTLs:
> + *
> + * - Structures must be aligned on 64-bit/8-byte. If the object is not
> + *   naturally aligned, a padding field must be added.
> + * - Fields must be explicity aligned to their natural type alignment with
                       ^^^^^^^^^ explicitly

> + *   pad[0..N] fields.
> + * - All padding fields will be checked by the driver to make sure they are
> + *   zeroed.
> + * - Flags can be added, but not removed/replaced.
> + * - New fields can be added to the main structures (the structures
> + *   directly passed to the ioctl). Those fiels can be added at the end of
					     ^^^^^ fields

> + *   the structure, or replace existing padding fields. Any new field being
> + *   added must preserve the behavior that existed before those fields were
> + *   added when a value of zero is passed.
> + * - New fields can be added to indirect objects (objects pointed by the
> + *   main structure), iff those objects are passed a size to reflect the
> + *   size known by the userspace driver (see drm_panthor_obj_array::stride
> + *   or drm_panthor_dev_query::size).
> + * - If the kernel driver is too old to know some fields, those will
> + *   be ignored (input) and set back to zero (output).

I presume this should be "will be ignored if zero (input)" and rejected
if non-zero?

> + * - If userspace is too old to know some fields, those will be zeroed
> + *   (input) before the structure is parsed by the kernel driver.
> + * - Each new flag/field addition must come with a driver version update so
> + *   the userspace driver doesn't have to trial and error to know which
> + *   flags are supported.
> + * - Structures should not contain unions, as this would defeat the
> + *   extensibility of such structures.
> + * - IOCTLs can't be removed or replaced. New IOCTL IDs should be placed
> + *   at the end of the drm_panthor_ioctl_id enum.
> + */
> +
> +/**
> + * DOC: MMIO regions exposed to userspace.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> + *
> + * File offset for all MMIO regions being exposed to userspace. Don't use
> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> + *
> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> + * GPU cache flushling through CS instructions, but the flush reduction
                ^^^^^^^^^ flushing

> + * mechanism requires a flush_id. This flush_id could be queried with an
> + * ioctl, but Arm provides a well-isolated register page containing only this
> + * read-only register, so let's expose this page through a static mmap offset
> + * and allow direct mapping of this MMIO region so we can avoid the
> + * user <-> kernel round-trip.
> + */
> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)
> +
> +/**
> + * DOC: IOCTL IDs
> + *
> + * enum drm_panthor_ioctl_id - IOCTL IDs
> + *
> + * Place new ioctls at the end, don't re-oder, don't replace or remove entries.
                                         ^^^^^^^ re-order

> + *
> + * These IDs are not meant to be used directly. Use the DRM_IOCTL_PANTHOR_xxx
> + * definitions instead.
> + */
> +enum drm_panthor_ioctl_id {
> +	/** @DRM_PANTHOR_DEV_QUERY: Query device information. */
> +	DRM_PANTHOR_DEV_QUERY = 0,
> +
> +	/** @DRM_PANTHOR_VM_CREATE: Create a VM. */
> +	DRM_PANTHOR_VM_CREATE,
> +
> +	/** @DRM_PANTHOR_VM_DESTROY: Destroy a VM. */
> +	DRM_PANTHOR_VM_DESTROY,
> +
> +	/** @DRM_PANTHOR_VM_BIND: Bind/unbind memory to a VM. */
> +	DRM_PANTHOR_VM_BIND,
> +
> +	/** @DRM_PANTHOR_BO_CREATE: Create a buffer object. */
> +	DRM_PANTHOR_BO_CREATE,
> +
> +	/**
> +	 * @DRM_PANTHOR_BO_MMAP_OFFSET: Get the file offset to pass to
> +	 * mmap to map a GEM object.
> +	 */
> +	DRM_PANTHOR_BO_MMAP_OFFSET,
> +
> +	/** @DRM_PANTHOR_GROUP_CREATE: Create a scheduling group. */
> +	DRM_PANTHOR_GROUP_CREATE,
> +
> +	/** @DRM_PANTHOR_GROUP_DESTROY: Destroy a scheduling group. */
> +	DRM_PANTHOR_GROUP_DESTROY,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_SUBMIT: Submit jobs to queues belonging
> +	 * to a specific scheduling group.
> +	 */
> +	DRM_PANTHOR_GROUP_SUBMIT,
> +
> +	/** @DRM_PANTHOR_GROUP_GET_STATE: Get the state of a scheduling group. */
> +	DRM_PANTHOR_GROUP_GET_STATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_CREATE: Create a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_CREATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_DESTROY,
> +};
> +
> +/**
> + * DRM_IOCTL_PANTHOR() - Build a Panthor IOCTL number
> + * @__access: Access type. Must be R, W or RW.
> + * @__id: One of the DRM_PANTHOR_xxx id.
> + * @__type: Suffix of the type being passed to the IOCTL.
> + *
> + * Don't use this macro directly, use the DRM_IOCTL_PANTHOR_xxx
> + * values instead.
> + *
> + * Return: An IOCTL number to be passed to ioctl() from userspace.
> + */
> +#define DRM_IOCTL_PANTHOR(__access, __id, __type) \
> +	DRM_IO ## __access(DRM_COMMAND_BASE + DRM_PANTHOR_ ## __id, \
> +			   struct drm_panthor_ ## __type)
> +
> +#define DRM_IOCTL_PANTHOR_DEV_QUERY \
> +	DRM_IOCTL_PANTHOR(WR, DEV_QUERY, dev_query)
> +#define DRM_IOCTL_PANTHOR_VM_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, VM_CREATE, vm_create)
> +#define DRM_IOCTL_PANTHOR_VM_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, VM_DESTROY, vm_destroy)
> +#define DRM_IOCTL_PANTHOR_VM_BIND \
> +	DRM_IOCTL_PANTHOR(WR, VM_BIND, vm_bind)
> +#define DRM_IOCTL_PANTHOR_BO_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, BO_CREATE, bo_create)
> +#define DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET \
> +	DRM_IOCTL_PANTHOR(WR, BO_MMAP_OFFSET, bo_mmap_offset)
> +#define DRM_IOCTL_PANTHOR_GROUP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_CREATE, group_create)
> +#define DRM_IOCTL_PANTHOR_GROUP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_DESTROY, group_destroy)
> +#define DRM_IOCTL_PANTHOR_GROUP_SUBMIT \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_SUBMIT, group_submit)
> +#define DRM_IOCTL_PANTHOR_GROUP_GET_STATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_GET_STATE, group_get_state)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
> +
> +/**
> + * DOC: IOCTL arguments
> + */
> +
> +/**
> + * struct drm_panthor_obj_array - Object array.
> + *
> + * This object is used to pass an array of objects whose size is subject to change in
> + * future versions of the driver. In order to support this mutability, we pass a stride
> + * describing the size of the object as known by userspace.
> + *
> + * You shouldn't fill drm_panthor_obj_array fields directly. You should instead use
> + * the DRM_PANTHOR_OBJ_ARRAY() macro that takes care of initializing the stride to
> + * the object size.
> + */
> +struct drm_panthor_obj_array {
> +	/** @stride: Stride of object struct. Used for versioning. */
> +	__u32 stride;
> +
> +	/** @count: Number of objects in the array. */
> +	__u32 count;
> +
> +	/** @array: User pointer to an array of objects. */
> +	__u64 array;
> +};
> +
> +/**
> + * DRM_PANTHOR_OBJ_ARRAY() - Initialize a drm_panthor_obj_array field.
> + * @cnt: Number of elements in the array.
> + * @ptr: Pointer to the array to pass to the kernel.
> + *
> + * Macro initializing a drm_panthor_obj_array based on the object size as known
> + * by userspace.
> + */
> +#define DRM_PANTHOR_OBJ_ARRAY(cnt, ptr) \
> +	{ .stride = sizeof((ptr)[0]), .count = (cnt), .array = (__u64)(uintptr_t)(ptr) }
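
A hedged usage sketch, relying on the sync-operation type defined just
below:

	struct drm_panthor_sync_op waits[4] = {0};
	struct drm_panthor_obj_array arr = DRM_PANTHOR_OBJ_ARRAY(4, waits);

	/* arr.stride is sizeof(struct drm_panthor_sync_op) as seen by this
	 * build of userspace; the kernel uses it to walk the array safely
	 * even if its own idea of the element size differs.
	 */
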
> +
> +/**
> + * enum drm_panthor_sync_op_flags - Synchronization operation flags.
> + */
> +enum drm_panthor_sync_op_flags {
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK: Synchronization handle type mask. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK = 0xff,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ: Synchronization object type. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ = 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ: Timeline synchronization
> +	 * object type.
> +	 */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ = 1,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_WAIT: Wait operation. */
> +	DRM_PANTHOR_SYNC_OP_WAIT = 0 << 31,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_SIGNAL: Signal operation. */
> +	DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,
> +};
> +
> +/**
> + * struct drm_panthor_sync_op - Synchronization operation.
> + */
> +struct drm_panthor_sync_op {
> +	/** @flags: Synchronization operation flags. Combination of DRM_PANTHOR_SYNC_OP values. */
> +	__u32 flags;
> +
> +	/** @handle: Sync handle. */
> +	__u32 handle;
> +
> +	/**
> +	 * @timeline_value: MBZ if
> +	 * (flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK) !=
> +	 * DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ.
> +	 */
> +	__u64 timeline_value;
> +};
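
A minimal sketch of one wait plus one signal entry (the syncobj handles
and the timeline point are placeholders):

	struct drm_panthor_sync_op syncs[2] = {
		{
			/* Wait on a binary syncobj before running. */
			.flags = DRM_PANTHOR_SYNC_OP_WAIT |
				 DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ,
			.handle = in_syncobj,
		},
		{
			/* Signal a timeline point on completion. */
			.flags = DRM_PANTHOR_SYNC_OP_SIGNAL |
				 DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ,
			.handle = out_timeline,
			.timeline_value = point,
		},
	};
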
> +
> +/**
> + * enum drm_panthor_dev_query_type - Query type
> + *
> + * Place new types at the end, don't re-oder, don't remove or replace.
s/re-oder/re-order/

> + */
> +enum drm_panthor_dev_query_type {
> +	/** @DRM_PANTHOR_DEV_QUERY_GPU_INFO: Query GPU information. */
> +	DRM_PANTHOR_DEV_QUERY_GPU_INFO = 0,
> +
> +	/** @DRM_PANTHOR_DEV_QUERY_CSIF_INFO: Query command-stream interface information. */
> +	DRM_PANTHOR_DEV_QUERY_CSIF_INFO,
> +};
> +
> +/**
> + * struct drm_panthor_gpu_info - GPU information
> + *
> + * Structure grouping all queryable information relating to the GPU.
> + */
> +struct drm_panthor_gpu_info {
> +	/** @gpu_id : GPU ID. */
> +	__u32 gpu_id;
> +#define DRM_PANTHOR_ARCH_MAJOR(x)		((x) >> 28)
> +#define DRM_PANTHOR_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
> +#define DRM_PANTHOR_ARCH_REV(x)			(((x) >> 20) & 0xf)
> +#define DRM_PANTHOR_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
> +#define DRM_PANTHOR_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
> +#define DRM_PANTHOR_VERSION_STATUS(x)		((x) & 0xf)
> +
> +	/** @gpu_rev: GPU revision. */
> +	__u32 gpu_rev;
> +
> +	/** @csf_id: Command stream frontend ID. */
> +	__u32 csf_id;
> +#define DRM_PANTHOR_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
> +#define DRM_PANTHOR_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
> +#define DRM_PANTHOR_CSHW_REV(x)			(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_MCU_MAJOR(x)		(((x) >> 10) & 0x3f)
> +#define DRM_PANTHOR_MCU_MINOR(x)		(((x) >> 4) & 0x3f)
> +#define DRM_PANTHOR_MCU_REV(x)			((x) & 0xf)
> +
> +	/** @l2_features: L2-cache features. */
> +	__u32 l2_features;
> +
> +	/** @tiler_features: Tiler features. */
> +	__u32 tiler_features;
> +
> +	/** @mem_features: Memory features. */
> +	__u32 mem_features;
> +
> +	/** @mmu_features: MMU features. */
> +	__u32 mmu_features;
> +#define DRM_PANTHOR_MMU_VA_BITS(x)		((x) & 0xff)
> +
> +	/** @thread_features: Thread features. */
> +	__u32 thread_features;
> +
> +	/** @max_threads: Maximum number of threads. */
> +	__u32 max_threads;
> +
> +	/** @thread_max_workgroup_size: Maximum workgroup size. */
> +	__u32 thread_max_workgroup_size;
> +
> +	/**
> +	 * @thread_max_barrier_size: Maximum number of threads that can wait
> +	 * simultaneously on a barrier.
> +	 */
> +	__u32 thread_max_barrier_size;
> +
> +	/** @coherency_features: Coherency features. */
> +	__u32 coherency_features;
> +
> +	/** @texture_features: Texture features. */
> +	__u32 texture_features[4];
> +
> +	/** @as_present: Bitmask encoding the number of address spaces exposed by the MMU. */
> +	__u32 as_present;
> +
> +	/** @core_group_count: Number of core groups. */
> +	__u32 core_group_count;
> +
> +	/** @pad: Zero on return. */
> +	__u32 pad;
> +
> +	/** @shader_present: Bitmask encoding the shader cores exposed by the GPU. */
> +	__u64 shader_present;
> +
> +	/** @l2_present: Bitmask encoding the L2 caches exposed by the GPU. */
> +	__u64 l2_present;
> +
> +	/** @tiler_present: Bitmask encoding the tiler unit exposed by the GPU. */
s/unit/units/

> +	__u64 tiler_present;
> +};
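
A small sketch of how the packed fields above might be decoded, assuming
a struct drm_panthor_gpu_info filled through the DEV_QUERY ioctl shown
further down:

	unsigned int arch_major = DRM_PANTHOR_ARCH_MAJOR(info.gpu_id);
	unsigned int va_bits = DRM_PANTHOR_MMU_VA_BITS(info.mmu_features);
	/* Number of bits set in shader_present (compiler builtin assumed). */
	int shader_cores = __builtin_popcountll(info.shader_present);
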
> +
> +/**
> + * struct drm_panthor_csif_info - Command stream interface information
> + *
> + * Structure grouping all queryable information relating to the command stream interface.
> + */
> +struct drm_panthor_csif_info {
> +	/** @csg_slot_count: Number of command stream group slots exposed by the firmware. */
> +	__u32 csg_slot_count;
> +
> +	/** @cs_slot_count: Number of command stream slot per group. */
s/slot/slots/

> +	__u32 cs_slot_count;
> +
> +	/** @cs_reg_count: Number of command stream register. */
s/register/registers/

> +	__u32 cs_reg_count;
> +
> +	/** @scoreboard_slot_count: Number of scoreboard slot. */
s/slot/slots/

> +	__u32 scoreboard_slot_count;
> +
> +	/**
> +	 * @unpreserved_cs_reg_count: Number of command stream registers reserved by
> +	 * the kernel driver to call a userspace command stream.
> +	 *
> +	 * All registers can be used by a userspace command stream, but the
> +	 * [cs_reg_count - unpreserved_cs_reg_count .. cs_reg_count] registers are
> +	 * used by the kernel when DRM_PANTHOR_IOCTL_GROUP_SUBMIT is called.
> +	 */
> +	__u32 unpreserved_cs_reg_count;
> +
> +	/**
> +	 * @pad: Padding field, set to zero.
> +	 */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_dev_query - Arguments passed to DRM_PANTHOR_IOCTL_DEV_QUERY
> + */
> +struct drm_panthor_dev_query {
> +	/** @type: the query type (see drm_panthor_dev_query_type). */
> +	__u32 type;
> +
> +	/**
> +	 * @size: size of the type being queried.
> +	 *
> +	 * If pointer is NULL, size is updated by the driver to provide the
> +	 * output structure size. If pointer is not NULL, the driver will
> +	 * only copy min(size, actual_structure_size) bytes to the pointer,
> +	 * and update the size accordingly. This allows us to extend query
> +	 * types without breaking userspace.
> +	 */
> +	__u32 size;
> +
> +	/**
> +	 * @pointer: user pointer to a query type struct.
> +	 *
> +	 * Pointer can be NULL, in which case, nothing is copied, but the
> +	 * actual structure size is returned. If not NULL, it must point to
> +	 * a location that's large enough to hold size bytes.
> +	 */
> +	__u64 pointer;
> +};
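
A hedged sketch of both query patterns (fd is an open panthor render
node; <sys/ioctl.h>, <stdint.h> and the uAPI header are assumed to be
included):

	struct drm_panthor_gpu_info info = {0};
	struct drm_panthor_dev_query q = {
		.type = DRM_PANTHOR_DEV_QUERY_GPU_INFO,
	};

	/* Optional discovery pass: with pointer == NULL the kernel only
	 * reports the size of its drm_panthor_gpu_info in q.size.
	 */
	ioctl(fd, DRM_IOCTL_PANTHOR_DEV_QUERY, &q);

	/* Actual query: the kernel copies min(q.size, its struct size). */
	q.size = sizeof(info);
	q.pointer = (uintptr_t)&info;
	ioctl(fd, DRM_IOCTL_PANTHOR_DEV_QUERY, &q);
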
> +
> +/**
> + * struct drm_panthor_vm_create - Arguments passed to DRM_PANTHOR_IOCTL_VM_CREATE
> + */
> +struct drm_panthor_vm_create {
> +	/** @flags: VM flags, MBZ. */
> +	__u32 flags;
> +
> +	/** @id: Returned VM ID. */
> +	__u32 id;
> +
> +	/**
> +	 * @kernel_va_range: Size of the VA space reserved for kernel objects.
> +	 *
> +	 * If kernel_va_range is zero, we pick half of the VA space for kernel objects.
> +	 *
> +	 * Kernel VA space is always placed at the top of the supported VA range.
> +	 */
> +	__u64 kernel_va_range;
> +};
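
A usage sketch with the default kernel VA split (same fd/include
assumptions as the query sketch above):

	struct drm_panthor_vm_create vm_args = {
		.flags = 0,
		.kernel_va_range = 0,	/* 0: kernel objects get half the VA space */
	};

	ioctl(fd, DRM_IOCTL_PANTHOR_VM_CREATE, &vm_args);
	/* vm_args.id is the VM handle consumed by VM_BIND, BO_CREATE
	 * (exclusive_vm_id) and GROUP_CREATE below.
	 */
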
> +
> +/**
> + * struct drm_panthor_vm_destroy - Arguments passed to DRM_PANTHOR_IOCTL_VM_DESTROY
> + */
> +struct drm_panthor_vm_destroy {
> +	/** @id: ID of the VM to destroy. */
> +	__u32 id;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_op_flags - VM bind operation flags
> + */
> +enum drm_panthor_vm_bind_op_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_READONLY: Map the memory read-only.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_READONLY = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC: Map the memory not-executable.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC = 1 << 1,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED: Map the memory uncached.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED = 1 << 2,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_TYPE_MASK: Mask used to determine the type of operation.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = 0xf << 28,
> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_MAP: Map operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MAP = 0 << 28,
> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP: Unmap operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP = 1 << 28,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind_op - VM bind operation
> + */
> +struct drm_panthor_vm_bind_op {
> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags flags. */
> +	__u32 flags;
> +
> +	/**
> +	 * @bo_handle: Handle of the buffer object to map.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u32 bo_handle;
> +
> +	/**
> +	 * @bo_offset: Buffer object offset.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u64 bo_offset;
> +
> +	/**
> +	 * @va: Virtual address to map/unmap.
> +	 */
> +	__u64 va;
> +
> +	/** @size: Size to map/unmap. */
> +	__u64 size;
> +
> +	/**
> +	 * @syncs: Array of synchronization operations.
> +	 *
> +	 * This array must be empty if %DRM_PANTHOR_VM_BIND_ASYNC is not set on
> +	 * the drm_panthor_vm_bind object containing this VM bind operation.

You should state this is an array of struct drm_panthor_sync_op.

> +	 */
> +	struct drm_panthor_obj_array syncs;
> +
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_flags - VM bind flags
> + */
> +enum drm_panthor_vm_bind_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_ASYNC: VM bind operations are queued to the VM
> +	 * queue instead of being executed synchronously.
> +	 */
> +	DRM_PANTHOR_VM_BIND_ASYNC = 1 << 0,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind - Arguments passed to DRM_IOCTL_PANTHOR_VM_BIND
> + */
> +struct drm_panthor_vm_bind {
> +	/** @vm_id: VM targeted by the bind request. */
> +	__u32 vm_id;
> +
> +	/** @flags: Combination of drm_panthor_vm_bind_flags flags. */
> +	__u32 flags;
> +
> +	/** @ops: Array of bind operations. */

Array of struct drm_panthor_vm_bind_op

> +	struct drm_panthor_obj_array ops;
> +};
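
A hedged sketch of a synchronous map of a whole BO (bo_handle, vm_id,
gpu_va and bo_size are placeholders):

	struct drm_panthor_vm_bind_op op = {
		.flags = DRM_PANTHOR_VM_BIND_OP_TYPE_MAP,
		.bo_handle = bo_handle,
		.bo_offset = 0,
		.va = gpu_va,
		.size = bo_size,
		/* .syncs left empty: required when DRM_PANTHOR_VM_BIND_ASYNC
		 * is not set on the parent request.
		 */
	};
	struct drm_panthor_vm_bind bind = {
		.vm_id = vm_id,
		.flags = 0,		/* synchronous bind */
		.ops = DRM_PANTHOR_OBJ_ARRAY(1, &op),
	};

	ioctl(fd, DRM_IOCTL_PANTHOR_VM_BIND, &bind);
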
> +
> +/**
> + * enum drm_panthor_bo_flags - Buffer object flags, passed at creation time.
> + */
> +enum drm_panthor_bo_flags {
> +	/** @DRM_PANTHOR_BO_NO_MMAP: The buffer object will never be CPU-mapped in userspace. */
> +	DRM_PANTHOR_BO_NO_MMAP = (1 << 0),
> +};
> +
> +/**
> + * struct drm_panthor_bo_create - Arguments passed to DRM_IOCTL_PANTHOR_BO_CREATE.
> + */
> +struct drm_panthor_bo_create {
> +	/**
> +	 * @size: Requested size for the object
> +	 *
> +	 * The (page-aligned) allocated size for the object will be returned.
> +	 */
> +	__u64 size;
> +
> +	/**
> +	 * @flags: Flags. Must be a combination of drm_panthor_bo_flags flags.
> +	 */
> +	__u32 flags;
> +
> +	/**
> +	 * @exclusive_vm_id: Exclusive VM this buffer object will be mapped to.
> +	 *
> +	 * If not zero, the field must refer to a valid VM ID, and implies that:
> +	 *  - the buffer object will only ever be bound to that VM
> +	 *  - cannot be exported as a PRIME fd
> +	 */
> +	__u32 exclusive_vm_id;
> +
> +	/**
> +	 * @handle: Returned handle for the object.
> +	 *
> +	 * Object handles are nonzero.
> +	 */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_bo_mmap_offset - Arguments passed to DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET.
> + */
> +struct drm_panthor_bo_mmap_offset {
> +	/** @handle: Handle of the object we want an mmap offset for. */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @offset: The fake offset to use for subsequent mmap calls. */
> +	__u64 offset;
> +};
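
A sketch chaining BO_CREATE and BO_MMAP_OFFSET into a CPU mapping (error
handling omitted; <sys/mman.h> assumed):

	struct drm_panthor_bo_create bo_args = {
		.size = 4096,		/* rounded up to a page by the kernel */
		.flags = 0,		/* CPU mapping allowed */
	};
	ioctl(fd, DRM_IOCTL_PANTHOR_BO_CREATE, &bo_args);

	struct drm_panthor_bo_mmap_offset off_args = {
		.handle = bo_args.handle,
	};
	ioctl(fd, DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET, &off_args);

	void *cpu = mmap(NULL, bo_args.size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, off_args.offset);
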
> +
> +/**
> + * struct drm_panthor_queue_create - Queue creation arguments.
> + */
> +struct drm_panthor_queue_create {
> +	/**
> +	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
> +	 * 15 being the highest priority.
> +	 */
> +	__u8 priority;
> +
> +	/** @pad: Padding fields, MBZ. */
> +	__u8 pad[3];
> +
> +	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */
> +	__u32 ringbuf_size;
> +};
> +
> +/**
> + * enum drm_panthor_group_priority - Scheduling group priority
> + */
> +enum drm_panthor_group_priority {
> +	/** @PANTHOR_GROUP_PRIORITY_LOW: Low priority group. */
> +	PANTHOR_GROUP_PRIORITY_LOW = 0,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
> +	PANTHOR_GROUP_PRIORITY_MEDIUM,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
> +	PANTHOR_GROUP_PRIORITY_HIGH,
> +};
> +
> +/**
> + * struct drm_panthor_group_create - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_CREATE
> + */
> +struct drm_panthor_group_create {
> +	/** @queues: Array of drm_panthor_create_cs_queue elements. */

s/drm_panthor_create_cs_queue/drm_panthor_queue_create/

> +	struct drm_panthor_obj_array queues;
> +
> +	/**
> +	 * @max_compute_cores: Maximum number of cores that can be used by compute
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @compute_core_mask.
> +	 */
> +	__u8 max_compute_cores;
> +
> +	/**
> +	 * @max_fragment_cores: Maximum number of cores that can be used by fragment
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @fragment_core_mask.
> +	 */
> +	__u8 max_fragment_cores;
> +
> +	/**
> +	 * @max_tiler_cores: Maximum number of tilers that can be used by tiler jobs
> +	 * across CS queues bound to this group.
> +	 *
> +	 * Must be less or equal to the number of bits set in @tiler_core_mask.
> +	 */
> +	__u8 max_tiler_cores;
> +
> +	/** @priority: Group priority (see drm_drm_panthor_cs_group_priority). */

s/drm_drm_panthor_cs_group_priority/enum drm_panthor_group_priority/

> +	__u8 priority;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +
> +	/**
> +	 * @compute_core_mask: Mask encoding cores that can be used for compute jobs.
> +	 *
> +	 * This field must have at least @max_compute_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 compute_core_mask;
> +
> +	/**
> +	 * @fragment_core_mask: Mask encoding cores that can be used for fragment jobs.
> +	 *
> +	 * This field must have at least @max_fragment_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 fragment_core_mask;
> +
> +	/**
> +	 * @tiler_core_mask: Mask encoding cores that can be used for tiler jobs.
> +	 *
> +	 * This field must have at least @max_tiler_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::tiler_present.
> +	 */
> +	__u64 tiler_core_mask;
> +
> +	/**
> +	 * @vm_id: VM ID to bind this group to.
> +	 *
> +	 * All submission to queues bound to this group will use this VM.
> +	 */
> +	__u32 vm_id;
> +
> +	/**
> +	 * @group_handle: Returned group handle. Passed back when submitting jobs or
> +	 * destroying a group.
> +	 */
> +	__u32 group_handle;
> +};
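
A hedged sketch of a single-queue group; the core masks come from the
gpu_info query above, vm_id from the VM_CREATE sketch, and the ring
buffer size is a placeholder:

	struct drm_panthor_queue_create qc = {
		.priority = 15,			/* highest in-group priority */
		.ringbuf_size = 64 * 1024,
	};
	struct drm_panthor_group_create gc = {
		.queues = DRM_PANTHOR_OBJ_ARRAY(1, &qc),
		.max_compute_cores = __builtin_popcountll(info.shader_present),
		.max_fragment_cores = __builtin_popcountll(info.shader_present),
		.max_tiler_cores = 1,
		.priority = PANTHOR_GROUP_PRIORITY_MEDIUM,
		.compute_core_mask = info.shader_present,
		.fragment_core_mask = info.shader_present,
		.tiler_core_mask = info.tiler_present,
		.vm_id = vm_args.id,
	};

	ioctl(fd, DRM_IOCTL_PANTHOR_GROUP_CREATE, &gc);
	/* gc.group_handle is used by GROUP_SUBMIT/GROUP_DESTROY. */
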
> +
> +/**
> + * struct drm_panthor_group_destroy - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_DESTROY
> + */
> +struct drm_panthor_group_destroy {
> +	/** @group_handle: Group to destroy */
> +	__u32 group_handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_queue_submit - Job submission arguments.
> + *
> + * This is describing the userspace command stream to call from the kernel
> + * command stream ring-buffer. Queue submission is always part of a group
> + * submission, taking one or more jobs to submit to the underlying queues.
> + */
> +struct drm_panthor_queue_submit {
> +	/** @queue_index: Index of the queue inside a group. */
> +	__u32 queue_index;
> +
> +	/**
> +	 * @stream_size: Size of the command stream to execute.
> +	 *
> +	 * Must be 64-bit/8-byte aligned (the size of a CS instruction)
> +	 *
> +	 * Can be zero if stream_addr is zero too.
> +	 */
> +	__u32 stream_size;
> +
> +	/**
> +	 * @stream_addr: GPU address of the command stream to execute.
> +	 *
> +	 * Must be aligned on 64-byte.
> +	 *
> +	 * Can be zero if stream_size is zero too.
> +	 */
> +	__u64 stream_addr;
> +
> +	/**
> +	 * @latest_flush: FLUSH_ID read at the time the stream was built.
> +	 *
> +	 * This allows cache flush elimination for the automatic
> +	 * flush+invalidate(all) done at submission time, which is needed to
> +	 * ensure the GPU doesn't get garbage when reading the indirect command
> +	 * stream buffers. If you want the cache flush to happen
> +	 * unconditionally, pass a zero here.
> +	 */
> +	__u32 latest_flush;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @syncs: Array of sync operations. */

Array of struct drm_panthor_sync_op.

Steve

> +	struct drm_panthor_obj_array syncs;
> +};
> +
> +/**
> + * struct drm_panthor_group_submit - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_SUBMIT
> + */
> +struct drm_panthor_group_submit {
> +	/** @group_handle: Handle of the group to queue jobs to. */
> +	__u32 group_handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @queue_submits: Array of drm_panthor_queue_submit objects. */
> +	struct drm_panthor_obj_array queue_submits;
> +};
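
A sketch tying the pieces together: one submission on queue 0, reusing
the syncs[] array from the sync-op sketch, the group from the
GROUP_CREATE sketch and the flush-ID mapping from the MMIO section
(stream_va/stream_bytes are placeholders for a 64-byte-aligned command
stream buffer already mapped through VM_BIND):

	struct drm_panthor_queue_submit qsubmit = {
		.queue_index = 0,
		.stream_size = stream_bytes,	/* multiple of 8 */
		.stream_addr = stream_va,	/* 64-byte aligned GPU VA */
		.latest_flush = latest_flush ? *latest_flush : 0,
		.syncs = DRM_PANTHOR_OBJ_ARRAY(2, syncs),
	};
	struct drm_panthor_group_submit gsubmit = {
		.group_handle = gc.group_handle,
		.queue_submits = DRM_PANTHOR_OBJ_ARRAY(1, &qsubmit),
	};

	ioctl(fd, DRM_IOCTL_PANTHOR_GROUP_SUBMIT, &gsubmit);
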
> +
> +/**
> + * enum drm_panthor_group_state_flags - Group state flags
> + */
> +enum drm_panthor_group_state_flags {
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_TIMEDOUT: Group had unfinished jobs.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_TIMEDOUT = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_FATAL_FAULT: Group had fatal faults.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> +};
> +
> +/**
> + * struct drm_panthor_group_get_state - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_GET_STATE
> + *
> + * Used to query the state of a group and decide whether a new group should be created to
> + * replace it.
> + */
> +struct drm_panthor_group_get_state {
> +	/** @group_handle: Handle of the group to query state on */
> +	__u32 group_handle;
> +
> +	/**
> +	 * @state: Combination of DRM_PANTHOR_GROUP_STATE_* flags encoding the
> +	 * group state.
> +	 */
> +	__u32 state;
> +
> +	/** @fatal_queues: Bitmask of queues that faced fatal faults. */
> +	__u32 fatal_queues;
> +
> +	/** @pad: MBZ */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_tiler_heap_create - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE
> + */
> +struct drm_panthor_tiler_heap_create {
> +	/** @vm_id: VM ID the tiler heap should be mapped to */
> +	__u32 vm_id;
> +
> +	/** @initial_chunk_count: Initial number of chunks to allocate. */
> +	__u32 initial_chunk_count;
> +
> +	/** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
> +	__u32 chunk_size;
> +
> +	/** @max_chunks: Maximum number of chunks that can be allocated. */
> +	__u32 max_chunks;
> +
> +	/**
> +	 * @target_in_flight: Maximum number of in-flight render passes.
> +	 *
> +	 * If the heap has more than @target_in_flight tiler jobs in-flight, the FW will wait for render
> +	 * passes to finish before queuing new tiler jobs.
> +	 */
> +	__u32 target_in_flight;
> +
> +	/** @handle: Returned heap handle. Passed back to DESTROY_TILER_HEAP. */
> +	__u32 handle;
> +
> +	/** @tiler_heap_ctx_gpu_va: Returned GPU virtual address of the heap context. */
> +	__u64 tiler_heap_ctx_gpu_va;
> +
> +	/**
> +	 * @first_heap_chunk_gpu_va: First heap chunk.
> +	 *
> +	 * The tiler heap is formed of heap chunks forming a singly-linked list. This
> +	 * is the first element in the list.
> +	 */
> +	__u64 first_heap_chunk_gpu_va;
> +};
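
A sketch of a tiler heap creation (all sizes and counts are placeholders;
the chunk size just has to respect the power-of-two / 256 KiB-minimum
rule above):

	struct drm_panthor_tiler_heap_create heap_args = {
		.vm_id = vm_args.id,
		.initial_chunk_count = 4,
		.chunk_size = 2 * 1024 * 1024,	/* 2 MiB: power of two, >= 256 KiB */
		.max_chunks = 64,
		.target_in_flight = 65535,
	};

	ioctl(fd, DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE, &heap_args);
	/* heap_args.handle, .tiler_heap_ctx_gpu_va and
	 * .first_heap_chunk_gpu_va are then consumed by the userspace
	 * driver when setting up tiler work.
	 */
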
> +
> +/**
> + * struct drm_panthor_tiler_heap_destroy - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY
> + */
> +struct drm_panthor_tiler_heap_destroy {
> +	/** @handle: Handle of the tiler heap to destroy */
> +	__u32 handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif /* _PANTHOR_DRM_H_ */


* Re: [PATCH v2 03/15] drm/panthor: Add GPU register definitions
  2023-08-09 16:53 ` [PATCH v2 03/15] drm/panthor: Add GPU register definitions Boris Brezillon
@ 2023-08-11 14:13   ` Steven Price
  2023-08-29 13:00     ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-11 14:13 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Those are the registers directly accessible through the MMIO range.
> 
> FW registers are exposed in panthor_fw.h.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Two possible redundant defines (see below), but otherwise:

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_regs.h | 229 +++++++++++++++++++++++++
>  1 file changed, 229 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_regs.h b/drivers/gpu/drm/panthor/panthor_regs.h
> new file mode 100644
> index 000000000000..00e149cf9eab
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_regs.h
> @@ -0,0 +1,229 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> +/* Copyright 2023 Collabora ltd. */
> +/*
> + * Register definitions based on mali_kbase_gpu_regmap.h and
> + * mali_kbase_gpu_regmap_csf.h
> + * (C) COPYRIGHT 2010-2022 ARM Limited. All rights reserved.
> + */
> +#ifndef __PANTHOR_REGS_H__
> +#define __PANTHOR_REGS_H__
> +
> +#define GPU_ID						0x00
> +#define GPU_L2_FEATURES					0x004
> +#define GPU_TILER_FEATURES				0x00C
> +#define GPU_MEM_FEATURES				0x010
> +#define   GROUPS_L2_COHERENT				BIT(0)
> +
> +#define GPU_MMU_FEATURES				0x014
> +#define  GPU_MMU_FEATURES_VA_BITS(x)			((x) & GENMASK(7, 0))
> +#define  GPU_MMU_FEATURES_PA_BITS(x)			(((x) >> 8) & GENMASK(7, 0))
> +#define GPU_AS_PRESENT					0x018
> +#define GPU_CSF_ID					0x01C
> +
> +#define GPU_INT_RAWSTAT					0x20
> +#define GPU_INT_CLEAR					0x24
> +#define GPU_INT_MASK					0x28
> +#define GPU_INT_STAT					0x2c
> +#define   GPU_IRQ_FAULT					BIT(0)
> +#define   GPU_IRQ_PROTM_FAULT				BIT(1)
> +#define   GPU_IRQ_RESET_COMPLETED			BIT(8)
> +#define   GPU_IRQ_POWER_CHANGED				BIT(9)
> +#define   GPU_IRQ_POWER_CHANGED_ALL			BIT(10)
> +#define   GPU_IRQ_CLEAN_CACHES_COMPLETED		BIT(17)
> +#define   GPU_IRQ_DOORBELL_MIRROR			BIT(18)
> +#define   GPU_IRQ_MCU_STATUS_CHANGED			BIT(19)
> +#define GPU_CMD						0x30
> +#define   GPU_CMD_DEF(type, payload)			((type) | ((payload) << 8))
> +#define   GPU_SOFT_RESET				GPU_CMD_DEF(1, 1)
> +#define   GPU_HARD_RESET				GPU_CMD_DEF(1, 2)
> +#define   CACHE_CLEAN					BIT(0)
> +#define   CACHE_INV					BIT(1)
> +#define   GPU_FLUSH_CACHES(l2, lsc, oth)		\
> +	  GPU_CMD_DEF(4, ((l2) << 0) | ((lsc) << 4) | ((oth) << 8))
> +
> +#define GPU_STATUS					0x34
> +#define   GPU_STATUS_ACTIVE				BIT(0)
> +#define   GPU_STATUS_PWR_ACTIVE				BIT(1)
> +#define   GPU_STATUS_PAGE_FAULT				BIT(4)
> +#define   GPU_STATUS_PROTM_ACTIVE			BIT(7)
> +#define   GPU_STATUS_DBG_ENABLED			BIT(8)
> +
> +#define GPU_FAULT_STATUS				0x3C
> +#define GPU_FAULT_ADDR_LO				0x40
> +#define GPU_FAULT_ADDR_HI				0x44
> +
> +#define GPU_PWR_KEY					0x50
> +#define  GPU_PWR_KEY_UNLOCK				0x2968A819
> +#define GPU_PWR_OVERRIDE0				0x54
> +#define GPU_PWR_OVERRIDE1				0x58
> +
> +#define GPU_TIMESTAMP_OFFSET_LO				0x88
> +#define GPU_TIMESTAMP_OFFSET_HI				0x8C
> +#define GPU_CYCLE_COUNT_LO				0x90
> +#define GPU_CYCLE_COUNT_HI				0x94
> +#define GPU_TIMESTAMP_LO				0x98
> +#define GPU_TIMESTAMP_HI				0x9C
> +
> +#define GPU_THREAD_MAX_THREADS				0xA0
> +#define GPU_THREAD_MAX_WORKGROUP_SIZE			0xA4
> +#define GPU_THREAD_MAX_BARRIER_SIZE			0xA8
> +#define GPU_THREAD_FEATURES				0xAC
> +
> +#define GPU_TEXTURE_FEATURES(n)				(0xB0 + ((n) * 4))
> +
> +#define GPU_SHADER_PRESENT_LO				0x100
> +#define GPU_SHADER_PRESENT_HI				0x104
> +#define GPU_TILER_PRESENT_LO				0x110
> +#define GPU_TILER_PRESENT_HI				0x114
> +#define GPU_L2_PRESENT_LO				0x120
> +#define GPU_L2_PRESENT_HI				0x124
> +
> +#define SHADER_READY_LO					0x140
> +#define SHADER_READY_HI					0x144
> +#define TILER_READY_LO					0x150
> +#define TILER_READY_HI					0x154
> +#define L2_READY_LO					0x160
> +#define L2_READY_HI					0x164
> +
> +#define SHADER_PWRON_LO					0x180
> +#define SHADER_PWRON_HI					0x184
> +#define TILER_PWRON_LO					0x190
> +#define TILER_PWRON_HI					0x194
> +#define L2_PWRON_LO					0x1A0
> +#define L2_PWRON_HI					0x1A4
> +
> +#define SHADER_PWROFF_LO				0x1C0
> +#define SHADER_PWROFF_HI				0x1C4
> +#define TILER_PWROFF_LO					0x1D0
> +#define TILER_PWROFF_HI					0x1D4
> +#define L2_PWROFF_LO					0x1E0
> +#define L2_PWROFF_HI					0x1E4
> +
> +#define SHADER_PWRTRANS_LO				0x200
> +#define SHADER_PWRTRANS_HI				0x204
> +#define TILER_PWRTRANS_LO				0x210
> +#define TILER_PWRTRANS_HI				0x214
> +#define L2_PWRTRANS_LO					0x220
> +#define L2_PWRTRANS_HI					0x224
> +
> +#define SHADER_PWRACTIVE_LO				0x240
> +#define SHADER_PWRACTIVE_HI				0x244
> +#define TILER_PWRACTIVE_LO				0x250
> +#define TILER_PWRACTIVE_HI				0x254
> +#define L2_PWRACTIVE_LO					0x260
> +#define L2_PWRACTIVE_HI					0x264
> +
> +#define GPU_REVID					0x280
> +
> +#define GPU_COHERENCY_FEATURES				0x300
> +#define GPU_COHERENCY_PROT_BIT(name)			BIT(GPU_COHERENCY_  ## name)
> +
> +#define GPU_COHERENCY_PROTOCOL				0x304
> +#define   GPU_COHERENCY_ACE				0
> +#define   GPU_COHERENCY_ACE_LITE			1
> +#define   GPU_COHERENCY_NONE				31
> +
> +#define MCU_CONTROL					0x700
> +#define MCU_CONTROL_ENABLE				1
> +#define MCU_CONTROL_AUTO				2
> +#define MCU_CONTROL_DISABLE				0
> +
> +#define MCU_STATUS					0x704
> +#define MCU_STATUS_DISABLED				0
> +#define MCU_STATUS_ENABLED				1
> +#define MCU_STATUS_HALT					2
> +#define MCU_STATUS_FATAL				3
> +
> +/* Job Control regs */
> +#define JOB_INT_RAWSTAT					0x1000
> +#define JOB_INT_CLEAR					0x1004
> +#define JOB_INT_MASK					0x1008
> +#define JOB_INT_STAT					0x100c
> +#define   JOB_INT_GLOBAL_IF				BIT(31)
> +#define   JOB_INT_CSG_IF(x)				BIT(x)
> +
> +/* MMU regs */
> +#define MMU_INT_RAWSTAT					0x2000
> +#define MMU_INT_CLEAR					0x2004
> +#define MMU_INT_MASK					0x2008
> +#define MMU_INT_STAT					0x200c
> +
> +/* AS_COMMAND register commands */
> +
> +#define MMU_BASE					0x2400
> +#define MMU_AS_SHIFT					6
> +#define MMU_AS(as)					(MMU_BASE + ((as) << MMU_AS_SHIFT))
> +
> +#define AS_TRANSTAB_LO(as)				(MMU_AS(as) + 0x00)
> +#define AS_TRANSTAB_HI(as)				(MMU_AS(as) + 0x04)
> +#define AS_MEMATTR_LO(as)				(MMU_AS(as) + 0x08)
> +#define AS_MEMATTR_HI(as)				(MMU_AS(as) + 0x0C)
> +#define   AS_MEMATTR_AARCH64_INNER_ALLOC_IMPL		(2 << 2)
> +#define   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(w, r)	((3 << 2) | \
> +							 ((w) ? BIT(0) : 0) | \
> +							 ((r) ? BIT(1) : 0))
> +#define   AS_MEMATTR_AARCH64_SH_MIDGARD_INNER		(0 << 4)
> +#define   AS_MEMATTR_AARCH64_SH_CPU_INNER		(1 << 4)
> +#define   AS_MEMATTR_AARCH64_SH_CPU_INNER_SHADER_COH	(2 << 4)
> +#define   AS_MEMATTR_AARCH64_SHARED			(0 << 6)
> +#define   AS_MEMATTR_AARCH64_INNER_OUTER_NC		(1 << 6)
> +#define   AS_MEMATTR_AARCH64_INNER_OUTER_WB		(2 << 6)
> +#define   AS_MEMATTR_AARCH64_FAULT			(3 << 6)
> +#define AS_LOCKADDR_LO(as)				(MMU_AS(as) + 0x10)
> +#define AS_LOCKADDR_HI(as)				(MMU_AS(as) + 0x14)
> +#define AS_COMMAND(as)					(MMU_AS(as) + 0x18)
> +#define   AS_COMMAND_NOP				0
> +#define   AS_COMMAND_UPDATE				1
> +#define   AS_COMMAND_LOCK				2
> +#define   AS_COMMAND_UNLOCK				3
> +#define   AS_COMMAND_FLUSH_PT				4
> +#define   AS_COMMAND_FLUSH_MEM				5
> +#define   AS_LOCK_REGION_MIN_SIZE			(1ULL << 15)
> +#define AS_FAULTSTATUS(as)				(MMU_AS(as) + 0x1C)
> +#define  AS_FAULTSTATUS_ACCESS_TYPE_MASK		(0x3 << 8)
> +#define  AS_FAULTSTATUS_ACCESS_TYPE_ATOMIC		(0x0 << 8)
> +#define  AS_FAULTSTATUS_ACCESS_TYPE_EX			(0x1 << 8)
> +#define  AS_FAULTSTATUS_ACCESS_TYPE_READ		(0x2 << 8)
> +#define  AS_FAULTSTATUS_ACCESS_TYPE_WRITE		(0x3 << 8)
> +#define AS_FAULTADDRESS_LO(as)				(MMU_AS(as) + 0x20)
> +#define AS_FAULTADDRESS_HI(as)				(MMU_AS(as) + 0x24)
> +#define AS_STATUS(as)					(MMU_AS(as) + 0x28)
> +#define   AS_STATUS_AS_ACTIVE				BIT(0)
> +#define AS_TRANSCFG_LO(as)				(MMU_AS(as) + 0x30)
> +#define AS_TRANSCFG_HI(as)				(MMU_AS(as) + 0x34)
> +#define   AS_TRANSCFG_ADRMODE_LEGACY			(0 << 0)

I don't believe legacy mode exists any more (it's not in my copy of the
spec).

> +#define   AS_TRANSCFG_ADRMODE_UNMAPPED			(1 << 0)
> +#define   AS_TRANSCFG_ADRMODE_IDENTITY			(2 << 0)
> +#define   AS_TRANSCFG_ADRMODE_AARCH64_4K		(6 << 0)
> +#define   AS_TRANSCFG_ADRMODE_AARCH64_64K		(8 << 0)
> +#define   AS_TRANSCFG_INA_BITS(x)			((x) << 6)
> +#define   AS_TRANSCFG_OUTA_BITS(x)			((x) << 14)
> +#define   AS_TRANSCFG_SL_CONCAT				BIT(22)
> +#define   AS_TRANSCFG_PTW_MEMATTR_NC			(1 << 24)
> +#define   AS_TRANSCFG_PTW_MEMATTR_WB			(2 << 24)
> +#define   AS_TRANSCFG_PTW_SH_NS				(0 << 28)
> +#define   AS_TRANSCFG_PTW_SH_OS				(2 << 28)
> +#define   AS_TRANSCFG_PTW_SH_IS				(3 << 28)
> +#define   AS_TRANSCFG_PTW_RA				BIT(30)
> +#define   AS_TRANSCFG_DISABLE_HIER_AP			BIT(33)
> +#define   AS_TRANSCFG_DISABLE_AF_FAULT			BIT(34)
> +#define   AS_TRANSCFG_WXN				BIT(35)
> +#define   AS_TRANSCFG_XREADABLE				BIT(36)
> +#define AS_FAULTEXTRA_LO(as)				(MMU_AS(as) + 0x38)
> +#define AS_FAULTEXTRA_HI(as)				(MMU_AS(as) + 0x3C)
> +
> +#define CSF_GPU_LATEST_FLUSH_ID				0x10000
> +#define CSF_GPU_LATEST_FLUSH_ID_DEFAULT			0xffffe0

I'm not sure why we need the default value of this register? Seems an
odd thing to include.

Steve

> +
> +#define CSF_DOORBELL(i)					(0x80000 + ((i) * 0x10000))
> +#define CSF_GLB_DOORBELL_ID				0
> +
> +#define gpu_write(dev, reg, data) \
> +	writel(data, (dev)->iomem + (reg))
> +
> +#define gpu_read(dev, reg) \
> +	readl((dev)->iomem + (reg))
> +
> +#endif
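
As an illustrative (non-patch) usage sketch of these helpers and of the
interrupt bits above, assuming a struct panthor_device *ptdev as
introduced later in the series:

	/* Ack a completed reset and keep fault + reset interrupts enabled. */
	gpu_write(ptdev, GPU_INT_CLEAR, GPU_IRQ_RESET_COMPLETED);
	gpu_write(ptdev, GPU_INT_MASK, GPU_IRQ_RESET_COMPLETED | GPU_IRQ_FAULT);

	if (gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_FAULT)
		drm_warn(&ptdev->base, "GPU fault, status=%x",
			 gpu_read(ptdev, GPU_FAULT_STATUS));
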


* Re: [PATCH v2 04/15] drm/panthor: Add the device logical block
  2023-08-09 16:53 ` [PATCH v2 04/15] drm/panthor: Add the device logical block Boris Brezillon
@ 2023-08-11 15:47   ` Steven Price
  2023-08-29 14:00     ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-11 15:47 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> The panthor driver is designed in a modular way, where each logical
> block is dealing with a specific HW-block or software feature. In order
> for those blocks to communicate with each other, we need a central
> panthor_device collecting all the blocks, and exposing some common
> features, like interrupt handling, power management, reset, ...
> 
> + * This is what this panthor_device logical block is about.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Add devfreq/PM support
> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c | 479 +++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_device.h | 354 +++++++++++++++++
>  2 files changed, 833 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> new file mode 100644
> index 000000000000..15f102116fa0
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -0,0 +1,479 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <linux/clk.h>
> +#include <linux/reset.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_domain.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/regulator/consumer.h>
> +
> +#include <drm/drm_drv.h>
> +#include <drm/drm_managed.h>
> +
> +#include "panthor_sched.h"
> +#include "panthor_device.h"
> +#include "panthor_devfreq.h"
> +#include "panthor_gpu.h"
> +#include "panthor_fw.h"
> +#include "panthor_mmu.h"
> +#include "panthor_regs.h"
> +
> +static int panthor_clk_init(struct panthor_device *ptdev)
> +{
> +	ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
> +	if (IS_ERR(ptdev->clks.core)) {
> +		drm_err(&ptdev->base, "get 'core' clock failed %ld\n",
> +			PTR_ERR(ptdev->clks.core));

I suspect it would be a good idea to use dev_err_probe() here (and
below) as I believe devm_clk_get can return -EPROBE_DEFER.

> +		return PTR_ERR(ptdev->clks.core);
> +	}
> +
> +	ptdev->clks.stacks = devm_clk_get_optional(ptdev->base.dev, "stacks");
> +	if (IS_ERR(ptdev->clks.stacks)) {
> +		drm_err(&ptdev->base, "get 'stacks' clock failed %ld\n",
> +			PTR_ERR(ptdev->clks.stacks));
> +		return PTR_ERR(ptdev->clks.stacks);
> +	}
> +
> +	ptdev->clks.coregroup = devm_clk_get_optional(ptdev->base.dev, "coregroup");
> +	if (IS_ERR(ptdev->clks.coregroup)) {
> +		drm_err(&ptdev->base, "get 'coregroup' clock failed %ld\n",
> +			PTR_ERR(ptdev->clks.coregroup));
> +		return PTR_ERR(ptdev->clks.coregroup);
> +	}
> +
> +	drm_info(&ptdev->base, "clock rate = %lu\n", clk_get_rate(ptdev->clks.core));
> +	return 0;
> +}
> +
> +void panthor_device_unplug(struct panthor_device *ptdev)
> +{
> +	/* FIXME: This is racy. */

Can we fix this? From a quick look it seems like a sequence like below
should avoid the race.

	if (!drm_dev_enter())
		/* Already unplugged */
		return;
	ptdev->base.unplugged = true;
	drm_dev_exit();

Although possibly that should be in the DRM core rather than open-coded
here.

> +	if (drm_dev_is_unplugged(&ptdev->base))
> +		return;
> +
> +	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> +
> +	/* Call drm_dev_unplug() so any access to HW block happening after
> +	 * that point get rejected.
> +	 */
> +	drm_dev_unplug(&ptdev->base);
> +
> +	/* Now, try to cleanly shutdown the GPU before the device resources
> +	 * get reclaimed.
> +	 */
> +	panthor_sched_unplug(ptdev);
> +	panthor_fw_unplug(ptdev);
> +	panthor_mmu_unplug(ptdev);
> +	panthor_gpu_unplug(ptdev);
> +
> +	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> +	pm_runtime_put_sync_suspend(ptdev->base.dev);
> +}
> +
> +static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +
> +	cancel_work_sync(&ptdev->reset.work);
> +	destroy_workqueue(ptdev->reset.wq);
> +}
> +
> +static void panthor_device_reset_work(struct work_struct *work)
> +{
> +	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
> +	int ret, cookie;
> +
> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> +		return;
> +
> +	panthor_sched_pre_reset(ptdev);
> +	panthor_fw_pre_reset(ptdev, true);
> +	panthor_mmu_pre_reset(ptdev);
> +	panthor_gpu_soft_reset(ptdev);
> +	panthor_gpu_l2_power_on(ptdev);
> +	panthor_mmu_post_reset(ptdev);
> +	ret = panthor_fw_post_reset(ptdev);
> +	if (ret)
> +		goto out;
> +
> +	atomic_set(&ptdev->reset.pending, 0);
> +	panthor_sched_post_reset(ptdev);
> +	drm_dev_exit(cookie);
> +
> +out:
> +	if (ret) {

This looks like a race condition too - is there a need for a
drm_dev_exit_and_unplug() function?

> +		panthor_device_unplug(ptdev);
> +		drm_err(&ptdev->base, "Failed to boot MCU after reset, making device unusable.");
> +	}
> +}
> +
> +static bool panthor_device_is_initialized(struct panthor_device *ptdev)
> +{
> +	return !!ptdev->scheduler;
> +}
> +
> +static void panthor_device_free_page(struct drm_device *ddev, void *data)
> +{
> +	free_page((unsigned long)data);
> +}
> +
> +int panthor_device_init(struct panthor_device *ptdev)
> +{
> +	struct resource *res;
> +	struct page *p;
> +	int ret;
> +
> +	ptdev->coherent = device_get_dma_attr(ptdev->base.dev) == DEV_DMA_COHERENT;
> +
> +	drmm_mutex_init(&ptdev->base, &ptdev->pm.lock);
> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> +	p = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!p)
> +		return -ENOMEM;
> +
> +	ptdev->pm.dummy_latest_flush = page_address(p);
> +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_free_page,
> +				       ptdev->pm.dummy_latest_flush);
> +	if (ret)
> +		return ret;
> +
> +	/* Set the dummy page to the default LATEST_FLUSH value. This
> +	 * will be updated on the next suspend.
> +	 */
> +	*ptdev->pm.dummy_latest_flush = CSF_GPU_LATEST_FLUSH_ID_DEFAULT;

I see why this register default value was defined. Although I'm not sure
it has any benefit over just using zero... If the GPU is off when user
space reads the FLUSH_ID then the GPU's caches are definitely empty so
any flush ID is valid.

Interestingly looking at kbase it seems to use an initial value of 1
(POWER_DOWN_LATEST_FLUSH_VALUE). I guess zero is less ideal because
FLUSH_CACHE2 would then unconditionally do a flush.

> +
> +	INIT_WORK(&ptdev->reset.work, panthor_device_reset_work);
> +	ptdev->reset.wq = alloc_ordered_workqueue("panthor-reset-wq", 0);
> +	if (!ptdev->reset.wq)
> +		return -ENOMEM;
> +
> +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_reset_cleanup, NULL);
> +	if (ret)
> +		return ret;
> +
> +	ret = panthor_clk_init(ptdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = panthor_devfreq_init(ptdev);
> +	if (ret)
> +		return ret;
> +
> +	ptdev->iomem = devm_platform_get_and_ioremap_resource(to_platform_device(ptdev->base.dev),
> +							      0, &res);
> +	if (IS_ERR(ptdev->iomem))
> +		return PTR_ERR(ptdev->iomem);
> +
> +	ptdev->phys_addr = res->start;
> +
> +	ret = devm_pm_runtime_enable(ptdev->base.dev);
> +	if (ret)
> +		return ret;
> +
> +	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	if (ret)
> +		return ret;
> +
> +	ret = panthor_gpu_init(ptdev);
> +	if (ret)
> +		goto err_rpm_put;
> +
> +	ret = panthor_mmu_init(ptdev);
> +	if (ret)
> +		goto err_rpm_put;
> +
> +	ret = panthor_fw_init(ptdev);
> +	if (ret)
> +		goto err_rpm_put;
> +
> +	ret = panthor_sched_init(ptdev);
> +	if (ret)
> +		goto err_rpm_put;
> +
> +	/* ~3 frames */
> +	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
> +	pm_runtime_use_autosuspend(ptdev->base.dev);
> +	pm_runtime_put_autosuspend(ptdev->base.dev);
> +	return 0;
> +
> +err_rpm_put:
> +	pm_runtime_put_sync_suspend(ptdev->base.dev);
> +	return ret;
> +}
> +
> +#define PANTHOR_EXCEPTION(id) \
> +	[DRM_PANTHOR_EXCEPTION_ ## id] = { \
> +		.name = #id, \
> +	}
> +
> +struct panthor_exception_info {
> +	const char *name;
> +};
> +
> +static const struct panthor_exception_info panthor_exception_infos[] = {
> +	PANTHOR_EXCEPTION(OK),
> +	PANTHOR_EXCEPTION(TERMINATED),
> +	PANTHOR_EXCEPTION(KABOOM),
> +	PANTHOR_EXCEPTION(EUREKA),
> +	PANTHOR_EXCEPTION(ACTIVE),
> +	PANTHOR_EXCEPTION(CS_RES_TERM),
> +	PANTHOR_EXCEPTION(CS_CONFIG_FAULT),
> +	PANTHOR_EXCEPTION(CS_ENDPOINT_FAULT),
> +	PANTHOR_EXCEPTION(CS_BUS_FAULT),
> +	PANTHOR_EXCEPTION(CS_INSTR_INVALID),
> +	PANTHOR_EXCEPTION(CS_CALL_STACK_OVERFLOW),
> +	PANTHOR_EXCEPTION(CS_INHERIT_FAULT),
> +	PANTHOR_EXCEPTION(INSTR_INVALID_PC),
> +	PANTHOR_EXCEPTION(INSTR_INVALID_ENC),
> +	PANTHOR_EXCEPTION(INSTR_BARRIER_FAULT),
> +	PANTHOR_EXCEPTION(DATA_INVALID_FAULT),
> +	PANTHOR_EXCEPTION(TILE_RANGE_FAULT),
> +	PANTHOR_EXCEPTION(ADDR_RANGE_FAULT),
> +	PANTHOR_EXCEPTION(IMPRECISE_FAULT),
> +	PANTHOR_EXCEPTION(OOM),
> +	PANTHOR_EXCEPTION(CSF_FW_INTERNAL_ERROR),
> +	PANTHOR_EXCEPTION(CSF_RES_EVICTION_TIMEOUT),
> +	PANTHOR_EXCEPTION(GPU_BUS_FAULT),
> +	PANTHOR_EXCEPTION(GPU_SHAREABILITY_FAULT),
> +	PANTHOR_EXCEPTION(SYS_SHAREABILITY_FAULT),
> +	PANTHOR_EXCEPTION(GPU_CACHEABILITY_FAULT),
> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_0),
> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_1),
> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_2),
> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_3),
> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_4),
> +	PANTHOR_EXCEPTION(PERM_FAULT_0),
> +	PANTHOR_EXCEPTION(PERM_FAULT_1),
> +	PANTHOR_EXCEPTION(PERM_FAULT_2),
> +	PANTHOR_EXCEPTION(PERM_FAULT_3),
> +	PANTHOR_EXCEPTION(ACCESS_FLAG_1),
> +	PANTHOR_EXCEPTION(ACCESS_FLAG_2),
> +	PANTHOR_EXCEPTION(ACCESS_FLAG_3),
> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_IN),
> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_0),
> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_1),
> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_2),
> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_3),
> +};
> +
> +const char *panthor_exception_name(struct panthor_device *ptdev, u32 exception_code)
> +{
> +	if (drm_WARN_ON(&ptdev->base,

I'm not convinced this should be a WARN_ON as I suspect it's probably
possible to inject values from user space (although I'm not completely
sure on that). It's certainly not a driver error as such if we can't
decode the value.

> +			exception_code >= ARRAY_SIZE(panthor_exception_infos) ||
> +			!panthor_exception_infos[exception_code].name))
> +		return "Unknown exception type";
> +
> +	return panthor_exception_infos[exception_code].name;
> +}
> +
> +static vm_fault_t panthor_mmio_vm_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct panthor_device *ptdev = vma->vm_private_data;
> +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
> +	unsigned long pfn;
> +	pgprot_t pgprot;
> +	vm_fault_t ret;
> +	bool active;
> +	int cookie;
> +
> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> +		return VM_FAULT_SIGBUS;
> +
> +	mutex_lock(&ptdev->pm.lock);
> +	active = atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE;
> +
> +	switch (id) {
> +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
> +		if (active)
> +			pfn = __phys_to_pfn(ptdev->phys_addr + CSF_GPU_LATEST_FLUSH_ID);
> +		else
> +			pfn = virt_to_pfn(ptdev->pm.dummy_latest_flush);
> +		break;
> +
> +	default:
> +		ret = VM_FAULT_SIGBUS;
> +		goto out_unlock;
> +	}
> +
> +	pgprot = vma->vm_page_prot;
> +	if (active)
> +		pgprot = pgprot_noncached(pgprot);
> +
> +	ret = vmf_insert_pfn_prot(vma, vmf->address, pfn, pgprot);
> +
> +out_unlock:
> +	mutex_unlock(&ptdev->pm.lock);
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct panthor_mmio_vm_ops = {
> +	.fault = panthor_mmio_vm_fault,
> +};
> +
> +int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *vma)
> +{
> +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
> +
> +	switch (id) {
> +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
> +		if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
> +		    (vma->vm_flags & (VM_WRITE | VM_EXEC)))
> +			return -EINVAL;
> +
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	/* Defer actual mapping to the fault handler. */
> +	vma->vm_private_data = ptdev;
> +	vma->vm_ops = &panthor_mmio_vm_ops;
> +	vm_flags_set(vma,
> +		     VM_IO | VM_DONTCOPY | VM_DONTEXPAND |
> +		     VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP);
> +	return 0;
> +}
> +
> +#ifdef CONFIG_PM
> +int panthor_device_resume(struct device *dev)
> +{
> +	struct panthor_device *ptdev = dev_get_drvdata(dev);
> +	int ret, cookie;
> +
> +	mutex_lock(&ptdev->pm.lock);
> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_RESUMING);
> +
> +	ret = clk_prepare_enable(ptdev->clks.core);
> +	if (ret)
> +		goto err_unlock;
> +
> +	ret = clk_prepare_enable(ptdev->clks.stacks);
> +	if (ret)
> +		goto err_disable_core_clk;
> +
> +	ret = clk_prepare_enable(ptdev->clks.coregroup);
> +	if (ret)
> +		goto err_disable_stacks_clk;
> +
> +	ret = panthor_devfreq_resume(ptdev);
> +	if (ret)
> +		goto err_disable_coregroup_clk;
> +
> +	if (panthor_device_is_initialized(ptdev) &&
> +	    drm_dev_enter(&ptdev->base, &cookie)) {
> +		panthor_gpu_resume(ptdev);
> +		panthor_mmu_resume(ptdev);
> +		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> +		if (!ret)
> +			panthor_sched_resume(ptdev);
> +
> +		drm_dev_exit(cookie);
> +
> +		if (ret)
> +			goto err_devfreq_suspend;
> +	}
> +
> +	/* Clear all IOMEM mappings pointing to this device after we've
> +	 * resumed. This way the fake mappings pointing to the dummy pages
> +	 * are removed and the real iomem mapping will be restored on next
> +	 * access.
> +	 */
> +	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> +			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);

Is the ordering here correct? I think we need to set ACTIVE before the
unmap_mapping_range otherwise there is a (very small) race where user
space could fault the page (and get the dummy mapping) before the
atomic_set.

Hmm, actually we have the pm.lock, so no this isn't racy. In which case
is there a good reason that you're using atomics? I can see two accesses
which aren't protected by pm.lock:

  * the early out in panthor_device_suspend() - which could easily be
moved inside the lock.

  * panthor_device_schedule_reset() - this looks racy (the power down
could happen immediately after the atomic_read()), so I suspect it would
be better moving the check into panthor_device_reset_work() and
performing it with the pm.lock held.

> +	if (atomic_read(&ptdev->reset.pending))
> +		queue_work(ptdev->reset.wq, &ptdev->reset.work);
> +
> +	mutex_unlock(&ptdev->pm.lock);
> +	return 0;
> +
> +err_devfreq_suspend:
> +	panthor_devfreq_suspend(ptdev);
> +
> +err_disable_coregroup_clk:
> +	clk_disable_unprepare(ptdev->clks.coregroup);
> +
> +err_disable_stacks_clk:
> +	clk_disable_unprepare(ptdev->clks.stacks);
> +
> +err_disable_core_clk:
> +	clk_disable_unprepare(ptdev->clks.core);
> +
> +err_unlock:
> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> +	mutex_unlock(&ptdev->pm.lock);
> +	return ret;
> +}
> +
> +int panthor_device_suspend(struct device *dev)
> +{
> +	struct panthor_device *ptdev = dev_get_drvdata(dev);
> +	int ret, cookie;
> +
> +	if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
> +		return 0;
> +
> +	mutex_lock(&ptdev->pm.lock);
> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDING);
> +
> +	/* Clear all IOMEM mappings pointing to this device before we
> +	 * shutdown the power-domain and clocks. Failing to do that results
> +	 * in external aborts when the process accesses the iomem region.
> +	 */
> +	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> +			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> +
> +	if (panthor_device_is_initialized(ptdev) &&
> +	    drm_dev_enter(&ptdev->base, &cookie)) {
> +		cancel_work_sync(&ptdev->reset.work);
> +
> +		/* We prepare everything as if we were resetting the GPU.
> +		 * The end of the reset will happen in the resume path though.
> +		 */
> +		panthor_sched_suspend(ptdev);
> +		panthor_fw_suspend(ptdev);
> +		panthor_mmu_suspend(ptdev);
> +		panthor_gpu_suspend(ptdev);
> +		drm_dev_exit(cookie);
> +	}
> +
> +	ret = panthor_devfreq_suspend(ptdev);
> +	if (ret) {
> +		if (panthor_device_is_initialized(ptdev) &&
> +		    drm_dev_enter(&ptdev->base, &cookie)) {
> +			panthor_gpu_resume(ptdev);
> +			panthor_mmu_resume(ptdev);
> +			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> +			panthor_sched_resume(ptdev);
> +			drm_dev_exit(cookie);
> +		}
> +
> +		atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
> +		goto out_unlock;
> +	}
> +
> +	/* Before we suspend, update the dummy_latest_flush page, so accesses
> +	 * to this dummy page return the value the HW would have returned.
> +	 */
> +	*ptdev->pm.dummy_latest_flush = gpu_read(ptdev, CSF_GPU_LATEST_FLUSH_ID);

As above, I don't believe it is important for user space to know the
value the HW would have returned during a suspend. Indeed if the
hardware was successfully suspended the flush ID is likely to be reset -
so this would be inaccurate. However any value should be safe if the
work was prepared while the GPU was off as the caches will be empty.

> +
> +	clk_disable_unprepare(ptdev->clks.coregroup);
> +	clk_disable_unprepare(ptdev->clks.stacks);
> +	clk_disable_unprepare(ptdev->clks.core);
> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> +
> +out_unlock:
> +	mutex_unlock(&ptdev->pm.lock);
> +	return ret;
> +}
> +#endif
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> new file mode 100644
> index 000000000000..e0e1be263eb9
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -0,0 +1,354 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#ifndef __PANTHOR_DEVICE_H__
> +#define __PANTHOR_DEVICE_H__
> +
> +#include <linux/atomic.h>
> +#include <linux/io-pgtable.h>
> +#include <linux/regulator/consumer.h>
> +#include <linux/spinlock.h>
> +#include <drm/drm_device.h>
> +#include <drm/drm_mm.h>
> +#include <drm/gpu_scheduler.h>
> +#include <drm/panthor_drm.h>
> +
> +struct panthor_csf;
> +struct panthor_csf_ctx;
> +struct panthor_device;
> +struct panthor_gpu;
> +struct panthor_group_pool;
> +struct panthor_heap_pool;
> +struct panthor_job;
> +struct panthor_mmu;
> +struct panthor_fw;
> +struct panthor_perfcnt;
> +struct panthor_vm;
> +struct panthor_vm_pool;
> +
> +/**
> + * enum panthor_device_pm_state - PM state
> + */
> +enum panthor_device_pm_state {
> +	/** @PANTHOR_DEVICE_PM_STATE_SUSPENDED: Device is suspended. */
> +	PANTHOR_DEVICE_PM_STATE_SUSPENDED = 0,
> +
> +	/** @PANTHOR_DEVICE_PM_STATE_RESUMING: Device is being resumed. */
> +	PANTHOR_DEVICE_PM_STATE_RESUMING,
> +
> +	/** @PANTHOR_DEVICE_PM_STATE_ACTIVE: Device is active. */
> +	PANTHOR_DEVICE_PM_STATE_ACTIVE,
> +
> +	/** @PANTHOR_DEVICE_PM_STATE_SUSPENDING: Device is being suspended. */
> +	PANTHOR_DEVICE_PM_STATE_SUSPENDING,
> +};
> +
> +/**
> + * struct panthor_irq - IRQ data
> + *
> + * Used to automate IRQ handling for the 3 different IRQs we have in this driver.
> + */
> +struct panthor_irq {
> +	/** @ptdev: Panthor device */
> +	struct panthor_device *ptdev;
> +
> +	/** @irq: IRQ number. */
> +	int irq;
> +
> +	/** @mask: Current mask being applied to xxx_INT_MASK. */
> +	u32 mask;
> +
> +	/** @suspended: Set to true when the IRQ is suspended. */
> +	atomic_t suspended;
> +};
> +
> +/**
> + * struct panthor_device - Panthor device
> + */
> +struct panthor_device {
> +	/** @base: Base drm_device. */
> +	struct drm_device base;
> +
> +	/** @phys_addr: Physical address of the iomem region. */
> +	phys_addr_t phys_addr;
> +
> +	/** @iomem: CPU mapping of the IOMEM region. */
> +	void __iomem *iomem;
> +
> +	/** @clks: GPU clocks. */
> +	struct {
> +		/** @core: Core clock. */
> +		struct clk *core;
> +
> +		/** @stacks: Stacks clock. This clock is optional. */
> +		struct clk *stacks;
> +
> +		/** @coregroup: Core group clock. This clock is optional. */
> +		struct clk *coregroup;
> +	} clks;
> +
> +	/** @coherent: True if the CPU/GPU are memory coherent. */
> +	bool coherent;
> +
> +	/** @gpu_info: GPU information. */
> +	struct drm_panthor_gpu_info gpu_info;
> +
> +	/** @csif_info: Command stream interface information. */
> +	struct drm_panthor_csif_info csif_info;
> +
> +	/** @gpu: GPU management data. */
> +	struct panthor_gpu *gpu;
> +
> +	/** @fw: FW management data. */
> +	struct panthor_fw *fw;
> +
> +	/** @mmu: MMU management data. */
> +	struct panthor_mmu *mmu;
> +
> +	/** @scheduler: Scheduler management data. */
> +	struct panthor_scheduler *scheduler;
> +
> +	/** @devfreq: Device frequency scaling management data. */
> +	struct panthor_devfreq *devfreq;
> +
> +	/** @reset: Reset related fields. */
> +	struct {
> +		/** @wq: Ordered workqueue used to schedule reset operations. */
> +		struct workqueue_struct *wq;
> +
> +		/** @work: Reset work. */
> +		struct work_struct work;
> +
> +		/** @pending: Set to true if a reset is pending. */
> +		atomic_t pending;
> +	} reset;
> +
> +	/** @pm: Power management related data. */
> +	struct {
> +		/** @state: Power state, see panthor_device_pm_state. */
> +		atomic_t state;
> +
> +		/**
> +		 * @lock: Lock protecting the suspend/resume operations.
> +		 *
> +		 * This is needed to ensure we map the dummy IO pages when
> +		 * the device is being suspended, and the real IO pages when
> +		 * the device is being resumed. We can't just do with the
> +		 * state atomicity to deal with this race.
> +		 */
> +		struct mutex lock;
> +
> +		/**
> +		 * @dummy_latest_flush: Dummy LATEST_FLUSH page.
> +		 *
> +		 * Used to replace the real LATEST_FLUSH page when the GPU
> +		 * is suspended.
> +		 */
> +		u32 *dummy_latest_flush;
> +	} pm;
> +};
> +
> +/**
> + * struct panthor_file - Panthor file
> + */
> +struct panthor_file {
> +	/** @ptdev: Device attached to this file. */
> +	struct panthor_device *ptdev;
> +
> +	/** @vms: VM pool attached to this file. */
> +	struct panthor_vm_pool *vms;
> +
> +	/** @groups: Scheduling group pool attached to this file. */
> +	struct panthor_group_pool *groups;
> +};
> +
> +int panthor_device_init(struct panthor_device *ptdev);
> +void panthor_device_unplug(struct panthor_device *ptdev);
> +
> +/**
> + * panthor_device_schedule_reset() - Schedules a reset operation
> + */
> +static inline void panthor_device_schedule_reset(struct panthor_device *ptdev)
> +{
> +	if (atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE &&

As above - this is a racy check. Although it might be safe because of
the cancel_work_sync() call in panthor_device_suspend(). But if we get
rid of this check we don't need the atomic variable.

> +	    !atomic_cmpxchg(&ptdev->reset.pending, 0, 1))
> +		queue_work(ptdev->reset.wq, &ptdev->reset.work);
> +}
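
To illustrate the simplification I have in mind (untested, and assuming the
cancel_work_sync() in panthor_device_suspend() is what really closes the
race), this could boil down to:

	static inline void panthor_device_schedule_reset(struct panthor_device *ptdev)
	{
		/* reset.pending is the only guard; a request racing with
		 * suspend is harmless because the work gets cancelled there.
		 */
		if (!atomic_cmpxchg(&ptdev->reset.pending, 0, 1))
			queue_work(ptdev->reset.wq, &ptdev->reset.work);
	}

which would also mean the pm.state read isn't needed here.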
> +
> +/**
> + * panthor_device_reset_is_pending() - Checks if a reset is pending.
> + *
> + * Return: true if a reset is pending, false otherwise.
> + */
> +static inline bool panthor_device_reset_is_pending(struct panthor_device *ptdev)
> +{
> +	return atomic_read(&ptdev->reset.pending) != 0;
> +}
> +
> +int panthor_device_mmap_io(struct panthor_device *ptdev,
> +			   struct vm_area_struct *vma);
> +
> +int panthor_device_resume(struct device *dev);
> +int panthor_device_suspend(struct device *dev);
> +
> +enum drm_panthor_exception_type {
> +	DRM_PANTHOR_EXCEPTION_OK = 0x00,
> +	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> +	DRM_PANTHOR_EXCEPTION_KABOOM = 0x05,
> +	DRM_PANTHOR_EXCEPTION_EUREKA = 0x06,
> +	DRM_PANTHOR_EXCEPTION_ACTIVE = 0x08,
> +	DRM_PANTHOR_EXCEPTION_CS_RES_TERM = 0x0f,
> +	DRM_PANTHOR_EXCEPTION_MAX_NON_FAULT = 0x3f,
> +	DRM_PANTHOR_EXCEPTION_CS_CONFIG_FAULT = 0x40,
> +	DRM_PANTHOR_EXCEPTION_CS_ENDPOINT_FAULT = 0x44,
> +	DRM_PANTHOR_EXCEPTION_CS_BUS_FAULT = 0x48,
> +	DRM_PANTHOR_EXCEPTION_CS_INSTR_INVALID = 0x49,
> +	DRM_PANTHOR_EXCEPTION_CS_CALL_STACK_OVERFLOW = 0x4a,
> +	DRM_PANTHOR_EXCEPTION_CS_INHERIT_FAULT = 0x4b,
> +	DRM_PANTHOR_EXCEPTION_INSTR_INVALID_PC = 0x50,
> +	DRM_PANTHOR_EXCEPTION_INSTR_INVALID_ENC = 0x51,
> +	DRM_PANTHOR_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
> +	DRM_PANTHOR_EXCEPTION_DATA_INVALID_FAULT = 0x58,
> +	DRM_PANTHOR_EXCEPTION_TILE_RANGE_FAULT = 0x59,
> +	DRM_PANTHOR_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
> +	DRM_PANTHOR_EXCEPTION_IMPRECISE_FAULT = 0x5b,
> +	DRM_PANTHOR_EXCEPTION_OOM = 0x60,
> +	DRM_PANTHOR_EXCEPTION_CSF_FW_INTERNAL_ERROR = 0x68,
> +	DRM_PANTHOR_EXCEPTION_CSF_RES_EVICTION_TIMEOUT = 0x69,
> +	DRM_PANTHOR_EXCEPTION_GPU_BUS_FAULT = 0x80,
> +	DRM_PANTHOR_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
> +	DRM_PANTHOR_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
> +	DRM_PANTHOR_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
> +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
> +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
> +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
> +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
> +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
> +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_0 = 0xc8,
> +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_1 = 0xc9,
> +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_2 = 0xca,
> +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_3 = 0xcb,
> +	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
> +	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_2 = 0xda,
> +	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
> +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_IN = 0xe0,
> +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
> +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
> +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
> +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
> +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
> +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
> +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
> +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
> +};
> +
> +/**
> + * panthor_exception_is_fault() - Checks if an exception is a fault.
> + *
> + * Return: true if the exception is a fault, false otherwise.
> + */
> +static inline bool
> +panthor_exception_is_fault(u32 exception_code)
> +{
> +	return exception_code > DRM_PANTHOR_EXCEPTION_MAX_NON_FAULT;
> +}
> +
> +const char *panthor_exception_name(struct panthor_device *ptdev,
> +				   u32 exception_code);
> +
> +/**
> + * PANTHOR_IRQ_HANDLER() - Define interrupt handlers and the interrupt
> + * registration function.
> + *
> + * The boiler-plate to gracefully deal with shared interrupts is
> + * auto-generated. All you have to do is call PANTHOR_IRQ_HANDLER()
> + * just after you actual handler. The handler prototype is:
s/you/your/ or probably s/you/the/ since we don't expect people to be
adding more ;)

> + *
> + * void (*handler)(struct panthor_device *, u32 status);
> + */
> +#define PANTHOR_IRQ_HANDLER(__name, __reg_prefix, __handler)					\
> +static irqreturn_t panthor_ ## __name ## _irq_raw_handler(int irq, void *data)			\
> +{												\
> +	struct panthor_irq *pirq = data;							\
> +	struct panthor_device *ptdev = pirq->ptdev;						\

Maybe I'm missing something, but I was expecting a check here for whether
the irq has been suspended, to avoid the register reads if it was (rough
sketch at the end of this header below). Otherwise I'm not entirely sure I
follow what all this code is for.

Steve

> +												\
> +	if (!gpu_read(ptdev, __reg_prefix ## _INT_STAT))					\
> +		return IRQ_NONE;								\
> +												\
> +	gpu_write(ptdev, __reg_prefix ## _INT_MASK, 0);						\
> +	return IRQ_WAKE_THREAD;									\
> +}												\
> +												\
> +static irqreturn_t panthor_ ## __name ## _irq_threaded_handler(int irq, void *data)		\
> +{												\
> +	struct panthor_irq *pirq = data;							\
> +	struct panthor_device *ptdev = pirq->ptdev;						\
> +	irqreturn_t ret = IRQ_NONE;								\
> +												\
> +	while (true) {										\
> +		u32 status = gpu_read(ptdev, __reg_prefix ## _INT_RAWSTAT) & pirq->mask;	\
> +												\
> +		if (!status)									\
> +			break;									\
> +												\
> +		gpu_write(ptdev, __reg_prefix ## _INT_CLEAR, status);				\
> +												\
> +		__handler(ptdev, status);							\
> +		ret = IRQ_HANDLED;								\
> +	}											\
> +												\
> +	if (!atomic_read(&pirq->suspended))							\
> +		gpu_write(ptdev, __reg_prefix ## _INT_MASK, pirq->mask);			\
> +												\
> +	return ret;										\
> +}												\
> +												\
> +static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)			\
> +{												\
> +	int cookie;										\
> +												\
> +	atomic_set(&pirq->suspended, true);							\
> +												\
> +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\
> +		synchronize_irq(pirq->irq);							\
> +		drm_dev_exit(cookie);								\
> +	}											\
> +												\
> +	pirq->mask = 0;										\
> +}												\
> +												\
> +static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\
> +{												\
> +	int cookie;										\
> +												\
> +	atomic_set(&pirq->suspended, false);							\
> +	pirq->mask = mask;									\
> +												\
> +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_CLEAR, mask);			\
> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, mask);			\
> +		drm_dev_exit(cookie);								\
> +	}											\
> +}												\
> +												\
> +static int panthor_request_ ## __name ## _irq(struct panthor_device *ptdev,			\
> +					      struct panthor_irq *pirq,				\
> +					      int irq, u32 mask)				\
> +{												\
> +	pirq->ptdev = ptdev;									\
> +	pirq->irq = irq;									\
> +	panthor_ ## __name ## _irq_resume(pirq, mask);						\
> +												\
> +	return devm_request_threaded_irq(ptdev->base.dev, irq,					\
> +					 panthor_ ## __name ## _irq_raw_handler,		\
> +					 panthor_ ## __name ## _irq_threaded_handler,		\
> +					 IRQF_SHARED, KBUILD_MODNAME "-" # __name,		\
> +					 pirq);							\
> +}
> +
> +extern struct workqueue_struct *panthor_cleanup_wq;
> +
> +#endif
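
Coming back to my question on the raw handler above, here's a rough,
untested sketch of the early-out I was expecting ("xxx" standing in for the
register prefix the macro pastes in):

	static irqreturn_t panthor_xxx_irq_raw_handler(int irq, void *data)
	{
		struct panthor_irq *pirq = data;
		struct panthor_device *ptdev = pirq->ptdev;

		/* Don't touch the registers at all once the IRQ is suspended. */
		if (atomic_read(&pirq->suspended))
			return IRQ_NONE;

		if (!gpu_read(ptdev, xxx_INT_STAT))
			return IRQ_NONE;

		gpu_write(ptdev, xxx_INT_MASK, 0);
		return IRQ_WAKE_THREAD;
	}

Whether that's actually needed depends on how the suspend path orders
atomic_set() against synchronize_irq(), so take it as a question rather than
a request.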


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS
  2023-08-09 16:53 ` [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS Boris Brezillon
@ 2023-08-11 16:08   ` Steven Price
  2023-08-29 17:48     ` Boris Brezillon
  2023-08-31 13:18   ` Liviu Dudau
  1 sibling, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-11 16:08 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Add an entry for the Panthor driver to the MAINTAINERS file.
> 
> v2:
> - New commit
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
> 
> If anyone from Arm wants to volunteer to become a co-maintainer, that
> would be highly appreciated

*sticks his hand up* me me! ;) Seriously though I'm happy to help out
with the maintenance.

And I'll try to finish reviewing the patches next week. I gave it a
quick spin on my Rock 5B and the GPU seems to work fine. I also need to
rebase my user space submission work. And recover from coming back from
holiday! Plus I'm sure I wasn't full-time on GPU related things before I
went on holiday... ;)

Thanks,

Steve

> ---
>  MAINTAINERS | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cd882b87a3c6..6149ab68d461 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1624,6 +1624,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
>  F:	drivers/gpu/drm/panfrost/
>  F:	include/uapi/drm/panfrost_drm.h
>  
> +ARM MALI PANTHOR DRM DRIVER
> +M:	Boris Brezillon <boris.brezillon@collabora.com>
> +L:	dri-devel@lists.freedesktop.org
> +S:	Supported
> +T:	git git://anongit.freedesktop.org/drm/drm-misc
> +F:	drivers/gpu/drm/panthor/
> +F:	include/uapi/drm/panthor_drm.h
> +
>  ARM MALI-DP DRM DRIVER
>  M:	Liviu Dudau <liviu.dudau@arm.com>
>  S:	Supported


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-09 16:53 ` [PATCH v2 13/15] drm/panthor: Allow driver compilation Boris Brezillon
@ 2023-08-11 16:35   ` Robin Murphy
  2023-08-11 16:56     ` Daniel Stone
  2023-08-21 12:47   ` Steven Price
  1 sibling, 1 reply; 93+ messages in thread
From: Robin Murphy @ 2023-08-11 16:35 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Clément Péron, Marty E . Plummer,
	Faith Ekstrand

On 2023-08-09 17:53, Boris Brezillon wrote:
> Now that all blocks are available, we can add/update Kconfig/Makefile
> files to allow compilation.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Add new dependencies on GPUVA and DRM_SCHED
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>   drivers/gpu/drm/Kconfig          |  2 ++
>   drivers/gpu/drm/Makefile         |  1 +
>   drivers/gpu/drm/panthor/Kconfig  | 16 ++++++++++++++++
>   drivers/gpu/drm/panthor/Makefile | 15 +++++++++++++++
>   4 files changed, 34 insertions(+)
>   create mode 100644 drivers/gpu/drm/panthor/Kconfig
>   create mode 100644 drivers/gpu/drm/panthor/Makefile
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 2a44b9419d4d..bddfbdb2ffee 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -358,6 +358,8 @@ source "drivers/gpu/drm/lima/Kconfig"
>   
>   source "drivers/gpu/drm/panfrost/Kconfig"
>   
> +source "drivers/gpu/drm/panthor/Kconfig"
> +
>   source "drivers/gpu/drm/aspeed/Kconfig"
>   
>   source "drivers/gpu/drm/mcde/Kconfig"
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index 215e78e79125..0a260727505f 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -188,6 +188,7 @@ obj-$(CONFIG_DRM_TVE200) += tve200/
>   obj-$(CONFIG_DRM_XEN) += xen/
>   obj-$(CONFIG_DRM_VBOXVIDEO) += vboxvideo/
>   obj-$(CONFIG_DRM_LIMA)  += lima/
> +obj-$(CONFIG_DRM_PANTHOR) += panthor/
>   obj-$(CONFIG_DRM_PANFROST) += panfrost/
>   obj-$(CONFIG_DRM_ASPEED_GFX) += aspeed/
>   obj-$(CONFIG_DRM_MCDE) += mcde/
> diff --git a/drivers/gpu/drm/panthor/Kconfig b/drivers/gpu/drm/panthor/Kconfig
> new file mode 100644
> index 000000000000..a9d17b1bbb75
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/Kconfig
> @@ -0,0 +1,16 @@
> +# SPDX-License-Identifier: GPL-2.0 or MIT
> +
> +config DRM_PANTHOR
> +	tristate "Panthor (DRM support for ARM Mali CSF-based GPUs)"
> +	depends on DRM
> +	depends on ARM || ARM64 || (COMPILE_TEST && !GENERIC_ATOMIC64)
> +	depends on MMU
> +	select DRM_EXEC
> +	select DRM_SCHED
> +	select IOMMU_SUPPORT
> +	select IOMMU_IO_PGTABLE_LPAE
> +	select DRM_GEM_SHMEM_HELPER
> +	select PM_DEVFREQ
> +	select DEVFREQ_GOV_SIMPLE_ONDEMAND
> +	help
> +	  DRM driver for ARM Mali CSF-based GPUs.
> diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
> new file mode 100644
> index 000000000000..64193a484879
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/Makefile
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0 or MIT
> +
> +panthor-y := \
> +	panthor_devfreq.o \
> +	panthor_device.o \
> +	panthor_drv.o \
> +	panthor_gem.o \
> +	panthor_gpu.o \
> +	panthor_heap.o \
> +	panthor_heap.o \
> +	panthor_fw.o \
> +	panthor_mmu.o \
> +	panthor_sched.o
> +
> +obj-$(CONFIG_DRM_PANTHOR) += panthor.o

FWIW I still think it would be nice to have a minor 
directory/Kconfig/Makefile reshuffle and a trivial bit of extra 
registration glue to build both drivers into a single module. It seems 
like it could be a perpetual source of confusion to end users where Mesa 
"panfrost" is the right option but kernel "panfrost" is the wrong one. 
Especially when pretty much every other GPU driver is also just one big 
top-level module to load for many different generations of hardware. 
Plus it would mean that if someone did want to have a go at 
deduplicating the resource-wrangling boilerplate for OPPs etc. in 
future, there's more chance of being able to do so meaningfully.
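
To put it concretely, the registration glue could be as small as something
like this (module and symbol names entirely made up, and it assumes both
drivers expose their platform_driver instead of registering it themselves,
plus the matching Kconfig/Makefile shuffle):

	static struct platform_driver * const panmali_drivers[] = {
		&panfrost_platform_driver,
		&panthor_platform_driver,
	};

	static int __init panmali_init(void)
	{
		return platform_register_drivers(panmali_drivers,
						 ARRAY_SIZE(panmali_drivers));
	}
	module_init(panmali_init);

	static void __exit panmali_exit(void)
	{
		platform_unregister_drivers(panmali_drivers,
					    ARRAY_SIZE(panmali_drivers));
	}
	module_exit(panmali_exit);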

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-11 16:35   ` Robin Murphy
@ 2023-08-11 16:56     ` Daniel Stone
  2023-08-11 19:26       ` Robin Murphy
  0 siblings, 1 reply; 93+ messages in thread
From: Daniel Stone @ 2023-08-11 16:56 UTC (permalink / raw)
  To: Robin Murphy, Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Neil Armstrong, Liviu Dudau, Steven Price,
	Clément Péron, Marty E . Plummer, Faith Ekstrand

Hi,

On 11/08/2023 17:35, Robin Murphy wrote:
> On 2023-08-09 17:53, Boris Brezillon wrote:
>> +obj-$(CONFIG_DRM_PANTHOR) += panthor.o
>
> FWIW I still think it would be nice to have a minor 
> directory/Kconfig/Makefile reshuffle and a trivial bit of extra 
> registration glue to build both drivers into a single module. It seems 
> like it could be a perpetual source of confusion to end users where 
> Mesa "panfrost" is the right option but kernel "panfrost" is the wrong 
> one. Especially when pretty much every other GPU driver is also just 
> one big top-level module to load for many different generations of 
> hardware. Plus it would mean that if someone did want to have a go at 
> deduplicating the resource-wrangling boilerplate for OPPs etc. in 
> future, there's more chance of being able to do so meaningfully.

It might be nice to point it out, but to be fair Intel and AMD both have 
two (or more) drivers, as does Broadcom/RPi. As does, err ... Mali.

I can see the point, but otoh if someone's managed to build all the 
right regulator/clock/etc modules to get a working system, they'll 
probably manage to figure the GPU side out?

Cheers,

Daniel


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-11 16:56     ` Daniel Stone
@ 2023-08-11 19:26       ` Robin Murphy
  2023-08-14 11:18         ` Steven Price
  0 siblings, 1 reply; 93+ messages in thread
From: Robin Murphy @ 2023-08-11 19:26 UTC (permalink / raw)
  To: Daniel Stone, Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Neil Armstrong, Liviu Dudau, Steven Price,
	Clément Péron, Marty E . Plummer, Faith Ekstrand

On 2023-08-11 17:56, Daniel Stone wrote:
> Hi,
> 
> On 11/08/2023 17:35, Robin Murphy wrote:
>> On 2023-08-09 17:53, Boris Brezillon wrote:
>>> +obj-$(CONFIG_DRM_PANTHOR) += panthor.o
>>
>> FWIW I still think it would be nice to have a minor 
>> directory/Kconfig/Makefile reshuffle and a trivial bit of extra 
>> registration glue to build both drivers into a single module. It seems 
>> like it could be a perpetual source of confusion to end users where 
>> Mesa "panfrost" is the right option but kernel "panfrost" is the wrong 
>> one. Especially when pretty much every other GPU driver is also just 
>> one big top-level module to load for many different generations of 
>> hardware. Plus it would mean that if someone did want to have a go at 
>> deduplicating the resource-wrangling boilerplate for OPPs etc. in 
>> future, there's more chance of being able to do so meaningfully.
> 
> It might be nice to point it out, but to be fair Intel and AMD both have 
> two (or more) drivers, as does Broadcom/RPi. As does, err ... Mali.

Indeed, I didn't mean to imply that I'm not aware that e.g. gma500 is to 
i915 what lima is to panfrost. It was more that unlike the others where 
there's a pretty clear line in the sand between "driver for old 
hardware" and "driver for the majority of recent hardware", this one 
happens to fall splat in the middle of the current major generation such 
that panfrost is the correct module for Mali Bifrost but also the wrong 
one for Mali Bifrost... :/

> I can see the point, but otoh if someone's managed to build all the 
> right regulator/clock/etc modules to get a working system, they'll 
> probably manage to figure the GPU side out?

Maybe; either way I guess it's not really my concern, since I'm the only 
user that *I* have to support, and I do already understand it. From the 
upstream perspective I mostly just want to hold on to the hope of not 
having to write my io-pgtable bugs twice over if at all possible :)

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 05/15] drm/panthor: Add the GPU logical block
  2023-08-09 16:53 ` [PATCH v2 05/15] drm/panthor: Add the GPU " Boris Brezillon
@ 2023-08-14 10:54   ` Steven Price
  2023-08-21 16:09     ` Robin Murphy
  2023-08-29 14:40     ` Boris Brezillon
  0 siblings, 2 replies; 93+ messages in thread
From: Steven Price @ 2023-08-14 10:54 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Handles everything that's not related to the FW, the MMU or the
> scheduler. This is the block dealing with the GPU property retrieval,
> the GPU block power on/off logic, and some global operations, like
> global cache flushing.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> - Use the panthor_irq layer to manage/process IRQs
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_gpu.c | 463 ++++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_gpu.h |  52 +++
>  2 files changed, 515 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> new file mode 100644
> index 000000000000..47d15334b46e
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> @@ -0,0 +1,463 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> +/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
> +/* Copyright 2019 Collabora ltd. */
> +
> +#include <linux/bitfield.h>
> +#include <linux/bitmap.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/iopoll.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
> +
> +#include <drm/drm_drv.h>
> +#include <drm/drm_managed.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_gpu.h"
> +#include "panthor_regs.h"
> +
> +/**
> + * struct panthor_gpu - GPU block management data.
> + */
> +struct panthor_gpu {
> +	/** @irq: GPU irq. */
> +	struct panthor_irq irq;
> +
> +	/** @reqs_lock: Lock protecting access to pending_reqs. */
> +	spinlock_t reqs_lock;
> +
> +	/** @pending_reqs: Pending GPU requests. */
> +	u32 pending_reqs;
> +
> +	/** @reqs_acked: GPU request wait queue. */
> +	wait_queue_head_t reqs_acked;
> +};
> +
> +/**
> + * struct panthor_model - GPU model description
> + */
> +struct panthor_model {
> +	/** @name: Model name. */
> +	const char *name;
> +
> +	/** @id: Model ID. */
> +	u32 id;
> +};
> +
> +/**
> + * GPU_MODEL() - Define a GPU model.
> + */
> +#define GPU_MODEL(_name, _id, ...) \
> +{\
> +	.name = __stringify(_name),				\
> +	.id = _id,						\
> +}
> +
> +#define GPU_MODEL_ID_MASK		0xf00f0000

It would be nice if we had defines for the two components that make this
up (ARCH_MAJOR | PRODUCT_MAJOR). It might even be easier to read the
model list below if we split ID into arch/product combinations (which
can then be written in decimal rather than hex).
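
Something along these lines, purely illustrative (the names are made up and
the GPU_ID field positions would need checking against the spec):

	#define GPU_ARCH_MAJOR_SHIFT		28
	#define GPU_PROD_MAJOR_SHIFT		16
	#define GPU_MODEL_ID(arch_major, prod_major) \
		(((u32)(arch_major) << GPU_ARCH_MAJOR_SHIFT) | \
		 ((u32)(prod_major) << GPU_PROD_MAJOR_SHIFT))
	#define GPU_MODEL_ID_MASK		GPU_MODEL_ID(0xf, 0xf)

	static const struct panthor_model gpu_models[] = {
		GPU_MODEL(g610, GPU_MODEL_ID(10, 7)),
		{},
	};

which keeps 0xa0070000 for the G610 while making the arch/product split
obvious.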

> +
> +static const struct panthor_model gpu_models[] = {
> +	GPU_MODEL(g610, 0xa0070000),
> +	{},
> +};
> +
> +#define GPU_INTERRUPTS_MASK	\
> +	(GPU_IRQ_FAULT | \
> +	 GPU_IRQ_PROTM_FAULT | \
> +	 GPU_IRQ_RESET_COMPLETED | \
> +	 GPU_IRQ_MCU_STATUS_CHANGED | \

The code doesn't seem to use the MCU_STATUS_CHANGED interrupt; if it's
not used then it doesn't make sense for it to be in the mask.

> +	 GPU_IRQ_CLEAN_CACHES_COMPLETED)
> +
> +static void panthor_gpu_init_info(struct panthor_device *ptdev)
> +{
> +	const struct panthor_model *model;
> +	u32 major, minor, status;
> +	unsigned int i;
> +
> +	ptdev->gpu_info.gpu_id = gpu_read(ptdev, GPU_ID);
> +	ptdev->gpu_info.csf_id = gpu_read(ptdev, GPU_CSF_ID);
> +	ptdev->gpu_info.gpu_rev = gpu_read(ptdev, GPU_REVID);
> +	ptdev->gpu_info.l2_features = gpu_read(ptdev, GPU_L2_FEATURES);
> +	ptdev->gpu_info.tiler_features = gpu_read(ptdev, GPU_TILER_FEATURES);
> +	ptdev->gpu_info.mem_features = gpu_read(ptdev, GPU_MEM_FEATURES);
> +	ptdev->gpu_info.mmu_features = gpu_read(ptdev, GPU_MMU_FEATURES);
> +	ptdev->gpu_info.thread_features = gpu_read(ptdev, GPU_THREAD_FEATURES);
> +	ptdev->gpu_info.max_threads = gpu_read(ptdev, GPU_THREAD_MAX_THREADS);
> +	ptdev->gpu_info.thread_max_workgroup_size = gpu_read(ptdev, GPU_THREAD_MAX_WORKGROUP_SIZE);
> +	ptdev->gpu_info.thread_max_barrier_size = gpu_read(ptdev, GPU_THREAD_MAX_BARRIER_SIZE);
> +	ptdev->gpu_info.coherency_features = gpu_read(ptdev, GPU_COHERENCY_FEATURES);
> +	for (i = 0; i < 4; i++)
> +		ptdev->gpu_info.texture_features[i] = gpu_read(ptdev, GPU_TEXTURE_FEATURES(i));
> +
> +	ptdev->gpu_info.as_present = gpu_read(ptdev, GPU_AS_PRESENT);
> +
> +	ptdev->gpu_info.shader_present = gpu_read(ptdev, GPU_SHADER_PRESENT_LO);
> +	ptdev->gpu_info.shader_present |= (u64)gpu_read(ptdev, GPU_SHADER_PRESENT_HI) << 32;
> +
> +	ptdev->gpu_info.tiler_present = gpu_read(ptdev, GPU_TILER_PRESENT_LO);
> +	ptdev->gpu_info.tiler_present |= (u64)gpu_read(ptdev, GPU_TILER_PRESENT_HI) << 32;
> +
> +	ptdev->gpu_info.l2_present = gpu_read(ptdev, GPU_L2_PRESENT_LO);
> +	ptdev->gpu_info.l2_present |= (u64)gpu_read(ptdev, GPU_L2_PRESENT_HI) << 32;
> +	ptdev->gpu_info.core_group_count = hweight64(ptdev->gpu_info.l2_present);

Do we want to expose 'computed' properties like this? My experience in
the past with kbase is that they can cause problems and are practically
impossible to kill off once added.

AFAICT it isn't used by the current Mesa driver so I would suggest
dropping core_group_count (which also enables us to drop the 'pad' field
which is a nice side-effect).

> +
> +	major = (ptdev->gpu_info.gpu_id >> 12) & 0xf;
> +	minor = (ptdev->gpu_info.gpu_id >> 4) & 0xff;
> +	status = ptdev->gpu_info.gpu_id & 0xf;
> +
> +	for (model = gpu_models; model->name; model++) {
> +		if (model->id == (ptdev->gpu_info.gpu_id & GPU_MODEL_ID_MASK))
> +			break;
> +	}
> +
> +	drm_info(&ptdev->base,
> +		 "mali-%s id 0x%x major 0x%x minor 0x%x status 0x%x",
> +		 model->name ?: "unknown", ptdev->gpu_info.gpu_id >> 16,
> +		 major, minor, status);
> +
> +	drm_info(&ptdev->base,
> +		 "Features: L2:0x%08x Tiler:0x%08x Mem:0x%0x MMU:0x%08x AS:0x%x",

There's an odd mix of format strings here. "%0x" for Mem and just "%x"
for AS.

> +		 ptdev->gpu_info.l2_features,
> +		 ptdev->gpu_info.tiler_features,
> +		 ptdev->gpu_info.mem_features,
> +		 ptdev->gpu_info.mmu_features,
> +		 ptdev->gpu_info.as_present);
> +
> +	drm_info(&ptdev->base,
> +		 "shader_present=0x%0llx l2_present=0x%0llx tiler_present=0x%0llx",
> +		 ptdev->gpu_info.shader_present, ptdev->gpu_info.l2_present,
> +		 ptdev->gpu_info.tiler_present);
> +}
> +
> +static void panthor_gpu_irq_handler(struct panthor_device *ptdev, u32 status)
> +{
> +	if (status & (GPU_IRQ_FAULT | GPU_IRQ_PROTM_FAULT)) {

The spec states that GPU_FAULTSTATUS "does not update for
GPU_PROTECTED_FAULT interrupts" - so I don't think we want
GPU_IRQ_PROTM_FAULT in that condition. Or at least printing the
exception information should ideally be avoided.

If I understand correctly a protected fault interrupt is basically
saying the fault is the same as a GPU_IRQ_FAULT but the GPU isn't going
to tell us the details because it was in protected mode (and it doesn't want
to accidentally leak the 'super secret' content). A rough sketch of what I
mean follows the handler below.

> +		u32 fault_status = gpu_read(ptdev, GPU_FAULT_STATUS);
> +		u64 address = ((u64)gpu_read(ptdev, GPU_FAULT_ADDR_HI) << 32) |
> +			      gpu_read(ptdev, GPU_FAULT_ADDR_LO);
> +
> +		drm_warn(&ptdev->base, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
> +			 fault_status, panthor_exception_name(ptdev, fault_status & 0xFF),
> +			 address);
> +	}
> +
> +	spin_lock(&ptdev->gpu->reqs_lock);
> +	if (status & ptdev->gpu->pending_reqs) {
> +		ptdev->gpu->pending_reqs &= ~status;
> +		wake_up_all(&ptdev->gpu->reqs_acked);
> +	}
> +	spin_unlock(&ptdev->gpu->reqs_lock);
> +}
> +PANTHOR_IRQ_HANDLER(gpu, GPU, panthor_gpu_irq_handler);
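
The rough sketch I mentioned above, assuming the only useful thing to report
on a protected-mode fault is that it happened:

	if (status & GPU_IRQ_PROTM_FAULT)
		drm_warn(&ptdev->base, "GPU Fault in protected mode\n");

	if (status & GPU_IRQ_FAULT) {
		u32 fault_status = gpu_read(ptdev, GPU_FAULT_STATUS);
		u64 address = ((u64)gpu_read(ptdev, GPU_FAULT_ADDR_HI) << 32) |
			      gpu_read(ptdev, GPU_FAULT_ADDR_LO);

		drm_warn(&ptdev->base, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
			 fault_status,
			 panthor_exception_name(ptdev, fault_status & 0xFF),
			 address);
	}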
> +
> +/**
> + * panthor_gpu_unplug() - Called when the GPU is unplugged.
> + */
> +void panthor_gpu_unplug(struct panthor_device *ptdev)
> +{
> +	unsigned long flags;
> +
> +	/* Make sure the IRQ handler is not running after that point. */
> +	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> +
> +	/* Wake-up all waiters. */
> +	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> +	ptdev->gpu->pending_reqs = 0;
> +	wake_up_all(&ptdev->gpu->reqs_acked);
> +	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> +}
> +
> +/**
> + * panthor_gpu_init() - Initialize the GPU block
> + * @ptdev: Device.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_gpu_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_gpu *gpu;
> +	u32 pa_bits;
> +	int ret, irq;
> +
> +	gpu = drmm_kzalloc(&ptdev->base, sizeof(*gpu), GFP_KERNEL);
> +	if (!gpu)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&gpu->reqs_lock);
> +	init_waitqueue_head(&gpu->reqs_acked);
> +	ptdev->gpu = gpu;
> +	panthor_gpu_init_info(ptdev);
> +
> +	dma_set_max_seg_size(ptdev->base.dev, UINT_MAX);
> +	pa_bits = GPU_MMU_FEATURES_PA_BITS(ptdev->gpu_info.mmu_features);
> +	ret = dma_set_mask_and_coherent(ptdev->base.dev, DMA_BIT_MASK(pa_bits));
> +	if (ret)
> +		return ret;
> +
> +	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "gpu");
> +	if (irq < 0)
> +		return irq;
> +
> +	ret = panthor_request_gpu_irq(ptdev, &ptdev->gpu->irq, irq, GPU_INTERRUPTS_MASK);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_gpu_block_power_off() - Power-off a specific block of the GPU
> + * @ptdev: Device.
> + * @blk_name: Block name.
> + * @pwroff_reg: Power-off register for this block.
> + * @pwrtrans_reg: Power transition register for this block.
> + * @mask: Sub-elements to power-off.
> + * @timeout_us: Timeout in microseconds.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_gpu_block_power_off(struct panthor_device *ptdev,
> +				const char *blk_name,
> +				u32 pwroff_reg, u32 pwrtrans_reg,
> +				u64 mask, u32 timeout_us)
> +{
> +	u32 val, i;
> +	int ret;
> +
> +	for (i = 0; i < 2; i++) {
> +		u32 mask32 = mask >> (i * 32);
> +
> +		if (!mask32)
> +			continue;
> +
> +		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
> +						 val, !(mask32 & val),
> +						 100, timeout_us);
> +		if (ret) {
> +			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
> +				blk_name, mask);
> +			return ret;
> +		}
> +	}
> +
> +	if (mask & GENMASK(31, 0))
> +		gpu_write(ptdev, pwroff_reg, mask);
> +
> +	if (mask >> 32)
> +		gpu_write(ptdev, pwroff_reg, mask >> 32);

This should be pwroff_reg + 4.
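
i.e. matching what the power-on path does for the upper half:

	if (mask & GENMASK(31, 0))
		gpu_write(ptdev, pwroff_reg, mask);

	if (mask >> 32)
		gpu_write(ptdev, pwroff_reg + 4, mask >> 32);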

> +
> +	for (i = 0; i < 2; i++) {
> +		u32 mask32 = mask >> (i * 32);
> +
> +		if (!mask32)
> +			continue;
> +
> +		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
> +						 val, !(mask & val),
> +						 100, timeout_us);
> +		if (ret) {
> +			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
> +				blk_name, mask);
> +			return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_gpu_block_power_on() - Power-on a specific block of the GPU
> + * @ptdev: Device.
> + * @blk_name: Block name.
> + * @pwron_reg: Power-on register for this block.
> + * @pwrtrans_reg: Power transition register for this block.
> + * @mask: Sub-elements to power-on.
> + * @timeout_us: Timeout in microseconds.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_gpu_block_power_on(struct panthor_device *ptdev,
> +			       const char *blk_name,
> +			       u32 pwron_reg, u32 pwrtrans_reg,
> +			       u32 rdy_reg, u64 mask, u32 timeout_us)
> +{
> +	u32 val, i;
> +	int ret;
> +
> +	for (i = 0; i < 2; i++) {
> +		u32 mask32 = mask >> (i * 32);
> +
> +		if (!mask32)
> +			continue;
> +
> +		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
> +						 val, !(mask32 & val),
> +						 100, timeout_us);
> +		if (ret) {
> +			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
> +				blk_name, mask);
> +			return ret;
> +		}
> +	}
> +
> +	if (mask & GENMASK(31, 0))
> +		gpu_write(ptdev, pwron_reg, mask);
> +
> +	if (mask >> 32)
> +		gpu_write(ptdev, pwron_reg + 4, mask >> 32);
> +
> +	for (i = 0; i < 2; i++) {
> +		u32 mask32 = mask >> (i * 32);
> +
> +		if (!mask32)
> +			continue;
> +
> +		ret = readl_relaxed_poll_timeout(ptdev->iomem + rdy_reg + (i * 4),
> +						 val, (mask32 & val) == mask32,
> +						 100, timeout_us);
> +		if (ret) {
> +			drm_err(&ptdev->base, "timeout waiting on %s:%llx readiness",
> +				blk_name, mask);
> +			return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_gpu_l2_power_on() - Power-on the L2-cache
> + * @ptdev: Device.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
> +{
> +	u64 core_mask = U64_MAX;
> +
> +	if (ptdev->gpu_info.l2_present != 1) {
> +		/*
> +		 * Only support one core group now.
> +		 * ~(l2_present - 1) unsets all bits in l2_present except
> +		 * the bottom bit. (l2_present - 2) has all the bits in
> +		 * the first core group set. AND them together to generate
> +		 * a mask of cores in the first core group.
> +		 */
> +		core_mask = ~(ptdev->gpu_info.l2_present - 1) &
> +			     (ptdev->gpu_info.l2_present - 2);
> +		drm_info_once(&ptdev->base, "using only 1st core group (%lu cores from %lu)\n",
> +			      hweight64(core_mask),
> +			      hweight64(ptdev->gpu_info.shader_present));

I'm not sure what the point of this complexity is. This boils down to
the equivalent of:

	if (ptdev->gpu_info.l2_present != 1)
		core_mask = 1;

If we were doing shader-core power management manually (like on pre-CSF
GPUs, rather than letting the firmware control it) then the computed
core_mask would be useful. So I guess it comes down to the
drm_info_once() output and counting the cores - which is nice to have
but it took me some time figuring out what was going on here.

> +	}
> +
> +	return panthor_gpu_power_on(ptdev, L2,
> +				    ptdev->gpu_info.l2_present & core_mask,
> +				    20000);
> +}
> +
> +/**
> + * panthor_gpu_flush_caches() - Flush caches
> + * @ptdev: Device.
> + * @l2: L2 flush type.
> + * @lsc: LSC flush type.
> + * @other: Other flush type.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_gpu_flush_caches(struct panthor_device *ptdev,
> +			     u32 l2, u32 lsc, u32 other)
> +{
> +	bool timedout = false;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> +	if (!drm_WARN_ON(&ptdev->base,
> +			 ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
> +		ptdev->gpu->pending_reqs |= GPU_IRQ_CLEAN_CACHES_COMPLETED;
> +		gpu_write(ptdev, GPU_CMD, GPU_FLUSH_CACHES(l2, lsc, other));
> +	}
> +	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> +
> +	if (!wait_event_timeout(ptdev->gpu->reqs_acked,
> +				!(ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED),
> +				msecs_to_jiffies(100))) {
> +		spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> +		if ((ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED) != 0 &&
> +		    !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_CLEAN_CACHES_COMPLETED))
> +			timedout = true;
> +		spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> +	}
> +
> +	if (timedout) {
> +		drm_err(&ptdev->base, "Flush caches timeout");
> +		return -ETIMEDOUT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_gpu_soft_reset() - Issue a soft-reset
> + * @ptdev: Device.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_gpu_soft_reset(struct panthor_device *ptdev)
> +{
> +	bool timedout = false;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> +	if (!drm_WARN_ON(&ptdev->base,
> +			 ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED)) {
> +		ptdev->gpu->pending_reqs |= GPU_IRQ_RESET_COMPLETED;
> +		gpu_write(ptdev, GPU_INT_CLEAR, GPU_IRQ_RESET_COMPLETED);
> +		gpu_write(ptdev, GPU_CMD, GPU_SOFT_RESET);
> +	}
> +	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> +
> +	if (!wait_event_timeout(ptdev->gpu->reqs_acked,
> +				!(ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED),
> +				msecs_to_jiffies(100))) {
> +		spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> +		if ((ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED) != 0 &&
> +		    !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_RESET_COMPLETED))
> +			timedout = true;
> +		spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> +	}
> +
> +	if (timedout) {
> +		drm_err(&ptdev->base, "Soft reset timeout");
> +		return -ETIMEDOUT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_gpu_suspend() - Suspend the GPU block.
> + * @ptdev: Device.
> + *
> + * Soft reset and suspend the GPU irq. This should be called last
> + * in the suspend procedure, after all other blocks have been suspended.
> + */
> +void panthor_gpu_suspend(struct panthor_device *ptdev)
> +{
> +	panthor_gpu_soft_reset(ptdev);

I'm not sure why we need to soft-reset when suspending? I guess this is
instead of manually powering off the L2? It might be the right action,
but it would be good to have a comment explaining why.

Steve

> +	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> +}
> +
> +/**
> + * panthor_gpu_resume() - Resume the GPU block.
> + *
> + * Resume the IRQ handler and power-on the L2-cache.
> + * The FW takes care of powering the other blocks.
> + */
> +void panthor_gpu_resume(struct panthor_device *ptdev)
> +{
> +	panthor_gpu_irq_resume(&ptdev->gpu->irq, GPU_INTERRUPTS_MASK);
> +	panthor_gpu_l2_power_on(ptdev);
> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.h b/drivers/gpu/drm/panthor/panthor_gpu.h
> new file mode 100644
> index 000000000000..bba7555dd3c6
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.h
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> +/* Copyright 2019 Collabora ltd. */
> +
> +#ifndef __PANTHOR_GPU_H__
> +#define __PANTHOR_GPU_H__
> +
> +struct panthor_device;
> +
> +int panthor_gpu_init(struct panthor_device *ptdev);
> +void panthor_gpu_unplug(struct panthor_device *ptdev);
> +void panthor_gpu_suspend(struct panthor_device *ptdev);
> +void panthor_gpu_resume(struct panthor_device *ptdev);
> +
> +int panthor_gpu_block_power_on(struct panthor_device *ptdev,
> +			       const char *blk_name,
> +			       u32 pwron_reg, u32 pwrtrans_reg,
> +			       u32 rdy_reg, u64 mask, u32 timeout_us);
> +int panthor_gpu_block_power_off(struct panthor_device *ptdev,
> +				const char *blk_name,
> +				u32 pwroff_reg, u32 pwrtrans_reg,
> +				u64 mask, u32 timeout_us);
> +
> +/**
> + * panthor_gpu_power_on() - Power on the GPU block.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +#define panthor_gpu_power_on(ptdev, type, mask, timeout_us) \
> +	panthor_gpu_block_power_on(ptdev, #type, \
> +				  type ## _PWRON_LO, \
> +				  type ## _PWRTRANS_LO, \
> +				  type ## _READY_LO, \
> +				  mask, timeout_us)
> +
> +/**
> + * panthor_gpu_power_off() - Power off the GPU block.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +#define panthor_gpu_power_off(ptdev, type, mask, timeout_us) \
> +	panthor_gpu_block_power_off(ptdev, #type, \
> +				   type ## _PWROFF_LO, \
> +				   type ## _PWRTRANS_LO, \
> +				   mask, timeout_us)
> +
> +int panthor_gpu_l2_power_on(struct panthor_device *ptdev);
> +int panthor_gpu_flush_caches(struct panthor_device *ptdev,
> +			     u32 l2, u32 lsc, u32 other);
> +int panthor_gpu_soft_reset(struct panthor_device *ptdev);
> +
> +#endif


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-11 19:26       ` Robin Murphy
@ 2023-08-14 11:18         ` Steven Price
  2023-08-21 17:56           ` Robin Murphy
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-14 11:18 UTC (permalink / raw)
  To: Robin Murphy, Daniel Stone, Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Faith Ekstrand

On 11/08/2023 20:26, Robin Murphy wrote:
> On 2023-08-11 17:56, Daniel Stone wrote:
>> Hi,
>>
>> On 11/08/2023 17:35, Robin Murphy wrote:
>>> On 2023-08-09 17:53, Boris Brezillon wrote:
>>>> +obj-$(CONFIG_DRM_PANTHOR) += panthor.o
>>>
>>> FWIW I still think it would be nice to have a minor
>>> directory/Kconfig/Makefile reshuffle and a trivial bit of extra
>>> registration glue to build both drivers into a single module. It
>>> seems like it could be a perpetual source of confusion to end users
>>> where Mesa "panfrost" is the right option but kernel "panfrost" is
>>> the wrong one. Especially when pretty much every other GPU driver is
>>> also just one big top-level module to load for many different
>>> generations of hardware. Plus it would mean that if someone did want
>>> to have a go at deduplicating the resource-wrangling boilerplate for
>>> OPPs etc. in future, there's more chance of being able to do so
>>> meaningfully.
>>
>> It might be nice to point it out, but to be fair Intel and AMD both
>> have two (or more) drivers, as does Broadcom/RPi. As does, err ... Mali.
> 
> Indeed, I didn't mean to imply that I'm not aware that e.g. gma500 is to
> i915 what lima is to panfrost. It was more that unlike the others where
> there's a pretty clear line in the sand between "driver for old
> hardware" and "driver for the majority of recent hardware", this one
> happens to fall splat in the middle of the current major generation such
> that panfrost is the correct module for Mali Bifrost but also the wrong
> one for Mali Bifrost... :/

Well panfrost.ko is the correct module for all Bifrost ;) It's Valhall
that's the confusing one.

I would hope that for most users they can just build both panfrost and
panthor and everything will "Just Work (tm)". I'm not sure how much
users are actually aware of the architecture family of their GPU.

I think at the moment (until marketing mess it up) there's also the
'simple' rule:

* Mali T* is Midgard and supported by panfrost.ko
* Mali Gxx (two digits) is Bifrost or first-generation Valhall and
supported by panfrost.ko
* Mali Gxxx (three digits) is Valhall CSF and supported by panthor.

(and Immortalis is always three digits and Valhall CSF).

> 
>> I can see the point, but otoh if someone's managed to build all the
>> right regulator/clock/etc modules to get a working system, they'll
>> probably manage to figure the GPU side out?
> 
> Maybe; either way I guess it's not really my concern, since I'm the only
> user that *I* have to support, and I do already understand it. From the
> upstream perspective I mostly just want to hold on to the hope of not
> having to write my io-pgtable bugs twice over if at all possible :)

I agree it would be nice to merge some of the common code, I'm hoping
this is something that might be possible in the future. But at the
moment the focus is on trying to get basic support for the new GPUs
without the danger of regressing the old GPUs.

And, to be honest, a fair bit of the common code in panfrost/panthor is
common to a few other drivers too. So the correct answer might well be to
try to add more generic helpers (devfreq, clocks, power domains all spring
to mind - there's a lot of boilerplate and nothing very special about Mali).

Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 06/15] drm/panthor: Add GEM logical block
  2023-08-09 16:53 ` [PATCH v2 06/15] drm/panthor: Add GEM " Boris Brezillon
@ 2023-08-14 13:40   ` Steven Price
  2023-08-29 14:45     ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-14 13:40 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Anything relating to GEM object management is placed here. Nothing
> particularly interesting here, given the implementation is based on
> drm_gem_shmem_object, which is doing most of the work.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Document the code
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

One minor comment below, but otherwise:

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_gem.c | 229 ++++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_gem.h |  96 +++++++++++
>  2 files changed, 325 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> new file mode 100644
> index 000000000000..a441a68822ca
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -0,0 +1,229 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/dma-buf.h>
> +#include <linux/dma-mapping.h>
> +
> +#include <drm/panthor_drm.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_gem.h"
> +#include "panthor_mmu.h"
> +
> +static void panthor_gem_free_object(struct drm_gem_object *obj)
> +{
> +	struct panthor_gem_object *bo = to_panthor_bo(obj);
> +
> +	if (drm_WARN_ON(obj->dev, bo->va_node))
> +		panthor_vm_free_va(bo->exclusive_vm, bo->va_node);
> +
> +	panthor_vm_put(bo->exclusive_vm);
> +	drm_gem_free_mmap_offset(&bo->base.base);
> +	mutex_destroy(&bo->gpuva_list_lock);
> +	drm_gem_shmem_free(&bo->base);
> +}
> +
> +/**
> + * panthor_gem_unmap_and_put() - Unmap and drop the reference on a GEM object
> + * @vm: VM to unmap the GEM from.
> + * @bo: GEM object to unmap/release.
> + * @gpu_va: GPU/MCU virtual address the GEM object was mapped at.
> + * @cpu_va: kernel mapping of the GEM object.
> + * Can be NULL if the GEM was not CPU mapped.
> + *
> + * Should be called to undo what was done in panthor_gem_create_and_map().
> + */
> +void panthor_gem_unmap_and_put(struct panthor_vm *vm,
> +			       struct panthor_gem_object *bo,
> +			       u64 gpu_va, void *cpu_va)
> +{
> +	if (cpu_va) {
> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(cpu_va);
> +
> +		drm_gem_vunmap_unlocked(&bo->base.base, &map);
> +	}
> +
> +	drm_WARN_ON(bo->base.base.dev, panthor_vm_unmap_range(vm, gpu_va, bo->base.base.size));
> +	panthor_vm_free_va(vm, bo->va_node);
> +	bo->va_node = NULL;
> +	drm_gem_object_put(&bo->base.base);
> +}
> +
> +/**
> + * panthor_gem_create_and_map() - Create and map a GEM object to a VM
> + * @ptdev: Device.
> + * @vm: VM to map the GEM to.
> + * @bo_flags: Combination of drm_panthor_bo_flags flags.
> + * @vm_map_flags: Combination of drm_panthor_vm_bind_op_flags (only those
> + * that are related to map operations).
> + * @gpu_va: Pointer holding the GPU address assigned when mapping to the VM.
> + * If *gpu_va == PANTHOR_GEM_ALLOC_VA, a virtual address range will be allocated
> + * and the allocated address returned, otherwise *gpu_va is used directly.
> + * @cpu_va: Pointer holding the kernel CPU mapping. If NULL, the GEM object
> + * is not CPU-mapped.
> + *
> + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
> + */
> +struct panthor_gem_object *
> +panthor_gem_create_and_map(struct panthor_device *ptdev, struct panthor_vm *vm,
> +			   size_t size, u32 bo_flags, u32 vm_map_flags,
> +			   u64 *gpu_va, void **cpu_va)
> +{
> +	struct drm_gem_shmem_object *obj;
> +	struct panthor_gem_object *bo;
> +	int ret;
> +
> +	obj = drm_gem_shmem_create(&ptdev->base, size);
> +	if (IS_ERR(obj))
> +		return ERR_CAST(obj);
> +
> +	bo = to_panthor_bo(&obj->base);
> +	bo->flags = bo_flags;
> +	bo->exclusive_vm = panthor_vm_get(vm);
> +	bo->base.base.resv = panthor_vm_resv(vm);
> +
> +	if (*gpu_va == PANTHOR_GEM_ALLOC_VA) {
> +		bo->va_node = panthor_vm_alloc_va(vm, obj->base.size);
> +
> +		if (IS_ERR(bo->va_node)) {
> +			ret = PTR_ERR(bo->va_node);
> +			bo->va_node = NULL;
> +			goto err_put_obj;
> +		}
> +
> +		*gpu_va = bo->va_node->start;
> +	}
> +
> +	ret = panthor_vm_map_bo_range(vm, bo, 0, obj->base.size, *gpu_va, vm_map_flags);
> +	if (ret)
> +		goto err_put_obj;
> +
> +	if (cpu_va) {
> +		struct iosys_map map;
> +		int ret;
> +
> +		ret = drm_gem_vmap_unlocked(&obj->base, &map);
> +		if (ret)
> +			goto err_vm_unmap_range;
> +
> +		*cpu_va = map.vaddr;
> +	}
> +
> +	return bo;
> +
> +err_vm_unmap_range:
> +	panthor_vm_unmap_range(vm, *gpu_va, obj->base.size);
> +
> +err_put_obj:
> +	drm_gem_object_put(&obj->base);
> +	return ERR_PTR(ret);
> +}
> +
> +static int panthor_gem_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma)
> +{
> +	struct panthor_gem_object *bo = to_panthor_bo(obj);
> +
> +	/* Don't allow mmap on objects that have the NO_MMAP flag set. */
> +	if (bo->flags & DRM_PANTHOR_BO_NO_MMAP)
> +		return -EINVAL;
> +
> +	return drm_gem_shmem_object_mmap(obj, vma);
> +}
> +
> +static struct dma_buf *
> +panthor_gem_prime_export(struct drm_gem_object *obj, int flags)
> +{
> +	/* We can't export GEMs that have an exclusive VM. */
> +	if (to_panthor_bo(obj)->exclusive_vm)
> +		return ERR_PTR(-EINVAL);
> +
> +	return drm_gem_prime_export(obj, flags);
> +}
> +
> +static const struct drm_gem_object_funcs panthor_gem_funcs = {
> +	.free = panthor_gem_free_object,
> +	.print_info = drm_gem_shmem_object_print_info,
> +	.pin = drm_gem_shmem_object_pin,
> +	.unpin = drm_gem_shmem_object_unpin,
> +	.get_sg_table = drm_gem_shmem_object_get_sg_table,
> +	.vmap = drm_gem_shmem_object_vmap,
> +	.vunmap = drm_gem_shmem_object_vunmap,
> +	.mmap = panthor_gem_mmap,
> +	.export = panthor_gem_prime_export,
> +	.vm_ops = &drm_gem_shmem_vm_ops,
> +};
> +
> +/**
> + * panthor_gem_create_object - Implementation of driver->gem_create_object.
> + * @dev: DRM device
> + * @size: Size in bytes of the memory the object will reference
> + *
> + * This lets the GEM helpers allocate object structs for us, and keep
> + * our BO stats correct.
> + */
> +struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t size)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	struct panthor_gem_object *obj;
> +
> +	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
> +	if (!obj)
> +		return ERR_PTR(-ENOMEM);
> +
> +	obj->base.base.funcs = &panthor_gem_funcs;
> +	obj->base.map_wc = !ptdev->coherent;
> +	mutex_init(&obj->gpuva_list_lock);
> +	drm_gem_gpuva_set_lock(&obj->base.base, &obj->gpuva_list_lock);
> +
> +	return &obj->base.base;
> +}
> +
> +/**
> + * panthor_gem_create_with_handle() - Create a GEM object and attach it to a handle.
> + * @file: DRM file.
> + * @ddev: DRM device.
> + * @exclusive_vm: Exclusive VM. Not NULL if the GEM object can't be shared.
> + * @size: Size of the GEM object to allocate.
> + * @flags: Combination of drm_panthor_bo_flags flags.
> + * @handle: Pointer holding the handle pointing to the new GEM object.
> + *
> + * Return: A valid pointer on success, an ERR_PTR() otherwise.
> + */
> +struct panthor_gem_object *
> +panthor_gem_create_with_handle(struct drm_file *file,
> +			       struct drm_device *ddev,
> +			       struct panthor_vm *exclusive_vm,
> +			       size_t size,
> +			       u32 flags, u32 *handle)
> +{
> +	int ret;
> +	struct drm_gem_shmem_object *shmem;
> +	struct panthor_gem_object *bo;
> +
> +	shmem = drm_gem_shmem_create(ddev, size);
> +	if (IS_ERR(shmem))
> +		return ERR_CAST(shmem);
> +
> +	bo = to_panthor_bo(&shmem->base);
> +	bo->flags = flags;
> +
> +	if (exclusive_vm) {
> +		bo->exclusive_vm = panthor_vm_get(exclusive_vm);
> +		bo->base.base.resv = panthor_vm_resv(exclusive_vm);
> +	}
> +
> +	/*
> +	 * Allocate an ID in the handle IDR where the object is registered;
> +	 * the handle holds the ID userspace sees.
> +	 */
> +	ret = drm_gem_handle_create(file, &shmem->base, handle);
> +	/* drop reference from allocate - handle holds it now. */
> +	drm_gem_object_put(&shmem->base);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return bo;
> +}

This function might be better just returning a simple int. The
"with_handle" approach means that doing anything much with the returned
object is dodgy (because another user space thread could have already
guessed the handle), and anyway the only caller
(panthor_ioctl_bo_create()) doesn't use the object and just extracts the
error code (if any).
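
Roughly (untested) something like this, so the caller just propagates the
return code and the handle keeps the only reference:

	static int
	panthor_gem_create_with_handle(struct drm_file *file,
				       struct drm_device *ddev,
				       struct panthor_vm *exclusive_vm,
				       size_t size, u32 flags, u32 *handle)
	{
		struct drm_gem_shmem_object *shmem;
		struct panthor_gem_object *bo;
		int ret;

		shmem = drm_gem_shmem_create(ddev, size);
		if (IS_ERR(shmem))
			return PTR_ERR(shmem);

		bo = to_panthor_bo(&shmem->base);
		bo->flags = flags;

		if (exclusive_vm) {
			bo->exclusive_vm = panthor_vm_get(exclusive_vm);
			bo->base.base.resv = panthor_vm_resv(exclusive_vm);
		}

		ret = drm_gem_handle_create(file, &shmem->base, handle);
		/* Drop the allocation reference - the handle owns it now. */
		drm_gem_object_put(&shmem->base);
		return ret;
	}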

Steve

> diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
> new file mode 100644
> index 000000000000..07babadc7623
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_gem.h
> @@ -0,0 +1,96 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#ifndef __PANTHOR_GEM_H__
> +#define __PANTHOR_GEM_H__
> +
> +#include <drm/drm_gem_shmem_helper.h>
> +#include <drm/drm_mm.h>
> +
> +#include <linux/rwsem.h>
> +
> +struct panthor_vm;
> +
> +/**
> + * struct panthor_gem_object - Driver specific GEM object.
> + */
> +struct panthor_gem_object {
> +	/** @base: Inherit from drm_gem_shmem_object. */
> +	struct drm_gem_shmem_object base;
> +
> +	/**
> +	 * @va_node: VA space allocated to this GEM.
> +	 *
> +	 * Should be NULL for all GEM objects managed by userspace.
> +	 *
> +	 * Not NULL when %PANTHOR_GEM_ALLOC_VA is passed as an address, in
> +	 * which case the GEM logic will auto-allocate a VA range before mapping
> +	 * to the VM.
> +	 *
> +	 * @exclusive_vm must be != NULL.
> +	 */
> +	struct drm_mm_node *va_node;
> +
> +	/**
> +	 * @exclusive_vm: Exclusive VM this GEM object can be mapped to.
> +	 *
> +	 * If @exclusive_vm != NULL, any attempt to bind the GEM to a different
> +	 * VM will fail.
> +	 *
> +	 * All FW memory objects have this field set to the MCU VM.
> +	 */
> +	struct panthor_vm *exclusive_vm;
> +
> +	/**
> +	 * @gpuva_list_lock: Custom GPUVA lock.
> +	 *
> +	 * Used to protect insertion of drm_gpuva elements to the
> +	 * drm_gem_object.gpuva.list list.
> +	 *
> +	 * We can't use the GEM resv for that, because drm_gpuva_link() is
> +	 * called in a dma-signaling path, where we're not allowed to take
> +	 * resv locks.
> +	 */
> +	struct mutex gpuva_list_lock;
> +
> +	/** @flags: Combination of drm_panthor_bo_flags flags. */
> +	u32 flags;
> +};
> +
> +static inline
> +struct panthor_gem_object *to_panthor_bo(struct drm_gem_object *obj)
> +{
> +	return container_of(to_drm_gem_shmem_obj(obj), struct panthor_gem_object, base);
> +}
> +
> +struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t size);
> +
> +struct drm_gem_object *
> +panthor_gem_prime_import_sg_table(struct drm_device *ddev,
> +				  struct dma_buf_attachment *attach,
> +				  struct sg_table *sgt);
> +
> +struct panthor_gem_object *
> +panthor_gem_create_with_handle(struct drm_file *file,
> +			       struct drm_device *ddev,
> +			       struct panthor_vm *exclusive_vm,
> +			       size_t size,
> +			       u32 flags,
> +			       uint32_t *handle);
> +
> +void panthor_gem_unmap_and_put(struct panthor_vm *vm, struct panthor_gem_object *bo,
> +			       u64 gpu_va, void *cpu_va);
> +
> +/*
> + * PANTHOR_GEM_ALLOC_VA: Use this magic address when you want the GEM
> + * logic to auto-allocate the virtual address in the reserved kernel VA range.
> + */
> +#define PANTHOR_GEM_ALLOC_VA		~0ull
> +
> +struct panthor_gem_object *
> +panthor_gem_create_and_map(struct panthor_device *ptdev, struct panthor_vm *vm,
> +			   size_t size, u32 bo_flags, u32 vm_map_flags,
> +			   u64 *gpu_va, void **cpu_va);
> +
> +#endif /* __PANTHOR_GEM_H__ */


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 07/15] drm/panthor: Add the devfreq logical block
  2023-08-09 16:53 ` [PATCH v2 07/15] drm/panthor: Add the devfreq " Boris Brezillon
@ 2023-08-14 13:45   ` Steven Price
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-14 13:45 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Everything related to devfreq is placed in panthor_devfreq.c, and
> helpers that can be called by other logical blocks are exposed through
> panthor_devfreq.h.
> 
> This implementation is loosely based on the panfrost implementation,
> the only difference being that we don't count device users, because
> the idle/active state will be managed by the scheduler logic.
> 
> v2:
> - Added in v2
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_devfreq.c | 281 ++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_devfreq.h |  25 ++
>  2 files changed, 306 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
> new file mode 100644
> index 000000000000..500ce34cccc2
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -0,0 +1,281 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2019 Collabora ltd. */
> +
> +#include <linux/clk.h>
> +#include <linux/devfreq.h>
> +#include <linux/devfreq_cooling.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_opp.h>
> +
> +#include <drm/drm_managed.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_devfreq.h"
> +
> +/**
> + * struct panthor_devfreq - Device frequency management
> + */
> +struct panthor_devfreq {
> +	/** @devfreq: devfreq device. */
> +	struct devfreq *devfreq;
> +
> +	/** @gov_data: Governor data. */
> +	struct devfreq_simple_ondemand_data gov_data;
> +
> +	/** @busy_time: Busy time. */
> +	ktime_t busy_time;
> +
> +	/** @idle_time: Idle time. */
> +	ktime_t idle_time;
> +
> +	/** @time_last_update: Last update time. */
> +	ktime_t time_last_update;
> +
> +	/** @last_busy_state: True if the GPU was busy last time we updated the state. */
> +	bool last_busy_state;
> +
> +	/*
> +	 * Protect busy_time, idle_time, time_last_update and last_busy_state
> +	 * because these can be accessed concurrently by panthor_devfreq_get_dev_status()
> +	 * and panthor_devfreq_record_{busy,idle}().
> +	 */
> +	spinlock_t lock;
> +};
> +
> +static void panthor_devfreq_update_utilization(struct panthor_devfreq *pdevfreq)
> +{
> +	ktime_t now, last;
> +
> +	now = ktime_get();
> +	last = pdevfreq->time_last_update;
> +
> +	if (pdevfreq->last_busy_state)
> +		pdevfreq->busy_time += ktime_sub(now, last);
> +	else
> +		pdevfreq->idle_time += ktime_sub(now, last);
> +
> +	pdevfreq->time_last_update = now;
> +}
> +
> +static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
> +				  u32 flags)
> +{
> +	struct dev_pm_opp *opp;
> +
> +	opp = devfreq_recommended_opp(dev, freq, flags);
> +	if (IS_ERR(opp))
> +		return PTR_ERR(opp);
> +	dev_pm_opp_put(opp);
> +
> +	return dev_pm_opp_set_rate(dev, *freq);
> +}
> +
> +static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
> +{
> +	pdevfreq->busy_time = 0;
> +	pdevfreq->idle_time = 0;
> +	pdevfreq->time_last_update = ktime_get();
> +}
> +
> +static int panthor_devfreq_get_dev_status(struct device *dev,
> +					  struct devfreq_dev_status *status)
> +{
> +	struct panthor_device *ptdev = dev_get_drvdata(dev);
> +	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
> +	unsigned long irqflags;
> +
> +	status->current_frequency = clk_get_rate(ptdev->clks.core);
> +
> +	spin_lock_irqsave(&pdevfreq->lock, irqflags);
> +
> +	panthor_devfreq_update_utilization(pdevfreq);
> +
> +	status->total_time = ktime_to_ns(ktime_add(pdevfreq->busy_time,
> +						   pdevfreq->idle_time));
> +
> +	status->busy_time = ktime_to_ns(pdevfreq->busy_time);
> +
> +	panthor_devfreq_reset(pdevfreq);
> +
> +	spin_unlock_irqrestore(&pdevfreq->lock, irqflags);
> +
> +	drm_dbg(&ptdev->base, "busy %lu total %lu %lu %% freq %lu MHz\n",
> +		status->busy_time, status->total_time,
> +		status->busy_time / (status->total_time / 100),
> +		status->current_frequency / 1000 / 1000);
> +
> +	return 0;
> +}
> +
> +static struct devfreq_dev_profile panthor_devfreq_profile = {
> +	.timer = DEVFREQ_TIMER_DELAYED,
> +	.polling_ms = 50, /* ~3 frames */
> +	.target = panthor_devfreq_target,
> +	.get_dev_status = panthor_devfreq_get_dev_status,
> +};
> +
> +int panthor_devfreq_init(struct panthor_device *ptdev)
> +{
> +	/* There are actually two regulators (mali and sram), but the OPP core only
> +	 * supports one.
> +	 *
> +	 * We assume the sram regulator is coupled with the mali one and let
> +	 * the coupling logic deal with voltage updates.
> +	 */
> +	static const char *reg_names[] = { "mali", NULL };
> +	struct thermal_cooling_device *cooling;
> +	struct device *dev = ptdev->base.dev;
> +	struct panthor_devfreq *pdevfreq;
> +	struct dev_pm_opp *opp;
> +	unsigned long cur_freq;
> +	int ret;
> +
> +	pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), GFP_KERNEL);
> +	if (!pdevfreq)
> +		return -ENOMEM;
> +
> +	ptdev->devfreq = pdevfreq;
> +
> +	ret = devm_pm_opp_set_regulators(dev, reg_names);
> +	if (ret) {
> +		if (ret != -EPROBE_DEFER)
> +			DRM_DEV_ERROR(dev, "Couldn't set OPP regulators\n");
> +
> +		return ret;
> +	}
> +
> +	ret = devm_pm_opp_of_add_table(dev);
> +	if (ret)
> +		return ret;
> +
> +	spin_lock_init(&pdevfreq->lock);
> +
> +	panthor_devfreq_reset(pdevfreq);
> +
> +	cur_freq = clk_get_rate(ptdev->clks.core);
> +
> +	opp = devfreq_recommended_opp(dev, &cur_freq, 0);
> +	if (IS_ERR(opp))
> +		return PTR_ERR(opp);
> +
> +	panthor_devfreq_profile.initial_freq = cur_freq;
> +
> +	/* Regulator coupling only takes care of synchronizing/balancing voltage
> +	 * updates, but the coupled regulator needs to be enabled manually.
> +	 *
> +	 * We use devm_regulator_get_enable_optional() and keep the sram supply
> +	 * enabled until the device is removed, just like we do for the mali
> +	 * supply, which is enabled when dev_pm_opp_set_opp(dev, opp) is called,
> +	 * and disabled when the opp_table is torn down, using the devm action.
> +	 *
> +	 * If we really care about disabling regulators on suspend, we should:
> +	 * - use devm_regulator_get_optional() here
> +	 * - call dev_pm_opp_set_opp(dev, NULL) before leaving this function
> +	 *   (this disables the regulator passed to the OPP layer)
> +	 * - call dev_pm_opp_set_opp(dev, NULL) and
> +	 *   regulator_disable(ptdev->regulators.sram) in
> +	 *   panthor_devfreq_suspend()
> +	 * - call dev_pm_opp_set_opp(dev, default_opp) and
> +	 *   regulator_enable(ptdev->regulators.sram) in
> +	 *   panthor_devfreq_resume()
> +	 *
> +	 * But without knowing if it's beneficial or not (in terms of power
> +	 * consumption), or how much it slows down the suspend/resume steps,
> +	 * let's just keep regulators enabled for the device lifetime.
> +	 */
> +	ret = devm_regulator_get_enable_optional(dev, "sram");
> +	if (ret && ret != -ENODEV) {
> +		if (ret != -EPROBE_DEFER)
> +			DRM_DEV_ERROR(dev, "Couldn't retrieve/enable sram supply\n");
> +		return ret;
> +	}
> +
> +	/*
> +	 * Set the recommended OPP. This will enable and configure the regulator,
> +	 * if any, and will avoid a switch-off by regulator_late_cleanup().
> +	 */
> +	ret = dev_pm_opp_set_opp(dev, opp);
> +	if (ret) {
> +		DRM_DEV_ERROR(dev, "Couldn't set recommended OPP\n");
> +		return ret;
> +	}
> +
> +	dev_pm_opp_put(opp);
> +
> +	/*
> +	 * Set up default thresholds for the simple_ondemand governor.
> +	 * The values are chosen based on experiments.
> +	 */
> +	pdevfreq->gov_data.upthreshold = 45;
> +	pdevfreq->gov_data.downdifferential = 5;
> +
> +	pdevfreq->devfreq = devm_devfreq_add_device(dev, &panthor_devfreq_profile,
> +						    DEVFREQ_GOV_SIMPLE_ONDEMAND,
> +						    &pdevfreq->gov_data);
> +	if (IS_ERR(pdevfreq->devfreq)) {
> +		DRM_DEV_ERROR(dev, "Couldn't initialize GPU devfreq\n");
> +		ret = PTR_ERR(pdevfreq->devfreq);
> +		pdevfreq->devfreq = NULL;
> +		return ret;
> +	}
> +
> +	cooling = devfreq_cooling_em_register(pdevfreq->devfreq, NULL);
> +	if (IS_ERR(cooling))
> +		DRM_DEV_INFO(dev, "Failed to register cooling device\n");
> +
> +	return 0;
> +}
> +
> +int panthor_devfreq_resume(struct panthor_device *ptdev)
> +{
> +	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
> +
> +	if (!pdevfreq->devfreq)
> +		return 0;
> +
> +	panthor_devfreq_reset(pdevfreq);
> +
> +	return devfreq_resume_device(pdevfreq->devfreq);
> +}
> +
> +int panthor_devfreq_suspend(struct panthor_device *ptdev)
> +{
> +	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
> +
> +	if (!pdevfreq->devfreq)
> +		return 0;
> +
> +	return devfreq_suspend_device(pdevfreq->devfreq);
> +}
> +
> +void panthor_devfreq_record_busy(struct panthor_device *ptdev)
> +{
> +	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
> +	unsigned long irqflags;
> +
> +	if (!pdevfreq->devfreq)
> +		return;
> +
> +	spin_lock_irqsave(&pdevfreq->lock, irqflags);
> +
> +	panthor_devfreq_update_utilization(pdevfreq);
> +	pdevfreq->last_busy_state = true;
> +
> +	spin_unlock_irqrestore(&pdevfreq->lock, irqflags);
> +}
> +
> +void panthor_devfreq_record_idle(struct panthor_device *ptdev)
> +{
> +	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
> +	unsigned long irqflags;
> +
> +	if (!pdevfreq->devfreq)
> +		return;
> +
> +	spin_lock_irqsave(&pdevfreq->lock, irqflags);
> +
> +	panthor_devfreq_update_utilization(pdevfreq);
> +	pdevfreq->last_busy_state = false;
> +
> +	spin_unlock_irqrestore(&pdevfreq->lock, irqflags);
> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
> new file mode 100644
> index 000000000000..875fbb5a1c1b
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2019 Collabora ltd. */
> +
> +#ifndef __PANTHOR_DEVFREQ_H__
> +#define __PANTHOR_DEVFREQ_H__
> +
> +#include <linux/devfreq.h>
> +#include <linux/spinlock.h>
> +#include <linux/ktime.h>
> +
> +struct devfreq;
> +struct thermal_cooling_device;
> +
> +struct panthor_device;
> +struct panthor_devfreq;
> +
> +int panthor_devfreq_init(struct panthor_device *ptdev);
> +
> +int panthor_devfreq_resume(struct panthor_device *ptdev);
> +int panthor_devfreq_suspend(struct panthor_device *ptdev);
> +
> +void panthor_devfreq_record_busy(struct panthor_device *ptdev);
> +void panthor_devfreq_record_idle(struct panthor_device *ptdev);
> +
> +#endif /* __PANTHOR_DEVFREQ_H__ */


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block
  2023-08-09 16:53 ` [PATCH v2 08/15] drm/panthor: Add the MMU/VM " Boris Brezillon
@ 2023-08-14 15:53   ` Steven Price
  2023-08-29 15:33     ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-14 15:53 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> MMU and VM management are closely related, so both are placed in the same
> source file.
> 
> Page table updates are delegated to the io-pgtable-arm driver that's in
> the iommu subsystem.
> 
> The VM management logic is based on drm_gpuva_mgr, and assumes the
> VA space is mostly managed by the usermode driver, except for a reserved
> portion of this VA-space that's used for kernel objects (like the heap
> contexts/chunks).
> 
> Both asynchronous and synchronous VM operations are supported, and
> internal helpers are exposed to allow other logical blocks to map their
> buffers in the GPU VA space.
> 
> There's one VM_BIND queue per-VM (meaning the Vulkan driver can only
> expose one sparse-binding queue), and this bind queue is managed with
> a 1:1 drm_sched_entity:drm_gpu_scheduler, such that each VM gets its own
> independent execution queue, avoiding VM operation serialization at the
> device level (things are still serialized at the VM level).
> 
> The rest is just implementation details that are hopefully well explained
> in the documentation.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Document the code
> - Use drm_gpuva_mgr
> - Replace VM_MAP/UNMAP by VM_BIND
> - Add support for asynchronous VM_BIND (VM_BIND queue implemented with
>   drm_sched)
> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> - Use the panthor_irq layer to manage/process IRQs
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_mmu.c | 2611 +++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_mmu.h |   81 +
>  2 files changed, 2692 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> new file mode 100644
> index 000000000000..3ba784473023
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -0,0 +1,2611 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <drm/drm_debugfs.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_exec.h>
> +#include <drm/drm_gpuva_mgr.h>
> +#include <drm/drm_managed.h>
> +#include <drm/gpu_scheduler.h>
> +#include <drm/panthor_drm.h>
> +
> +#include <linux/atomic.h>
> +#include <linux/bitfield.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/iopoll.h>
> +#include <linux/io-pgtable.h>
> +#include <linux/iommu.h>
> +#include <linux/kmemleak.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/rwsem.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/sizes.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_heap.h"
> +#include "panthor_mmu.h"
> +#include "panthor_sched.h"
> +#include "panthor_gem.h"
> +#include "panthor_regs.h"
> +
> +#define MAX_AS_SLOTS			32
> +
> +struct panthor_vm;
> +
> +/**
> + * struct panthor_as_slot - Address space slot
> + */
> +struct panthor_as_slot {
> +	/** @vm: VM bound to this slot. NULL if no VM is bound. */
> +	struct panthor_vm *vm;
> +
> +	/** @lock: Lock used to serialize access to the AS registers. */
> +	spinlock_t lock;
> +};
> +
> +/**
> + * struct panthor_mmu - MMU related data
> + */
> +struct panthor_mmu {
> +	/** @irq: The MMU irq. */
> +	struct panthor_irq irq;
> +
> +	/** @as: Address space related fields.
> +	 *
> +	 * The GPU has a limited number of address space (AS) slots, forcing
> +	 * us to re-assign slots to VMs on-demand.
> +	 */
> +	struct {
> +		/** @slots_lock: Lock protecting access to all other AS fields. */
> +		struct mutex slots_lock;
> +
> +		/** @alloc_mask: Bitmask encoding the allocated slots. */
> +		unsigned long alloc_mask;
> +
> +		/** @faulty_mask: Bitmask encoding the faulty slots. */
> +		unsigned long faulty_mask;
> +
> +		/** @slots: VMs currently bound to the AS slots. */
> +		struct panthor_as_slot slots[MAX_AS_SLOTS];
> +
> +		/**
> +		 * @lru_list: List of least recently used VMs.
> +		 *
> +		 * We use this list to pick a VM to evict when all slots are
> +		 * used.
> +		 *
> +		 * There should be no more active VMs than there are AS slots,
> +		 * so this LRU is just here to keep VMs bound until there's
> +		 * a need to release a slot, thus avoiding unnecessary TLB/cache
> +		 * flushes.
> +		 */
> +		struct list_head lru_list;
> +	} as;
> +
> +	/** @vm: VMs management fields */
> +	struct {
> +		/** @lock: Lock protecting access to @list. */
> +		struct mutex lock;
> +
> +		/** @list: List containing all VMs. */
> +		struct list_head list;
> +
> +		/** @reset_in_progress: True if a reset is in progress. */
> +		bool reset_in_progress;
> +
> +		/** @wq: Workqueue used for the VM_BIND queues. */
> +		struct workqueue_struct *wq;
> +	} vm;
> +};
> +
> +/**
> + * struct panthor_vm_pool - VM pool object
> + */
> +struct panthor_vm_pool {
> +	/** @xa: Array used for VM handle tracking. */
> +	struct xarray xa;
> +};
> +
> +/**
> + * struct panthor_vma - GPU mapping object
> + *
> + * This is used to track GEM mappings in GPU space.
> + */
> +struct panthor_vma {
> +	/** @base: Inherits from drm_gpuva. */
> +	struct drm_gpuva base;
> +
> +	/** @node: Used to insert the mapping in the panthor_vm::shared_bos list. */
> +	struct list_head node;
> +
> +	/**
> +	 * @flags: Combination of drm_panthor_vm_bind_op_flags.
> +	 *
> +	 * Only map related flags are accepted.
> +	 */
> +	u32 flags;
> +};
> +
> +/**
> + * struct panthor_vm_op_ctx - VM operation context
> + *
> + * With VM operations potentially taking place in a dma-signaling path, we
> + * need to make sure everything that might require resource allocation is
> + * pre-allocated upfront. This is what this operation context is for.
> + *
> + * We also collect resources that have been freed, so we can release them
> + * asynchronously, and let the VM_BIND scheduler process the next VM_BIND
> + * request.
> + */
> +struct panthor_vm_op_ctx {
> +	/** @rsvd_page_tables: Pages reserved for the MMU page table update. */
> +	struct {
> +		/** @count: Number of pages reserved. */
> +		u32 count;
> +
> +		/** @ptr: Points to the first unused page in the @pages table. */
> +		u32 ptr;
> +
> +		/**
> +		 * @pages: Array of pages that can be used for an MMU page table update.
> +		 *
> +		 * After a VM operation, there might be free pages left in this array.
> +		 * They should be returned to the pt_cache as part of the op_ctx cleanup.
> +		 */
> +		void **pages;
> +	} rsvd_page_tables;

Two questions:

1) Would a mempool simplify the implementation? It looks like a
reasonable match.

2) Does it really make sense to have a separate pool of memory for every
operation? Instead of having a separate pool for each operation, it
would be possible to just keep track of the total number needed for all
outstanding operations. Then a single (per device or maybe per-VM if
necessary) mempool could be resized to ensure it has the right amount of
space.

I'm also a little wary that the VM_BIND infrastructure could potentially
be abused to trigger a large amount of kernel allocation as it allocates
up-front for the worst case but those pages are not charged to the
process (AFAICT). But I haven't fully got my head round that yet.

> +
> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags. */
> +	u32 flags;
> +
> +	/** @va: Virtual range targeted by the VM operation. */
> +	struct {
> +		/** @addr: Start address. */
> +		u64 addr;
> +
> +		/** @range: Range size. */
> +		u64 range;
> +	} va;
> +
> +	/**
> +	 * @returned_vmas: List of panthor_vma objects returned after a VM operation.
> +	 *
> +	 * For unmap operations, this will contain all VMAs that were covered by the
> +	 * specified VA range.
> +	 *
> +	 * For map operations, this will contain all VMAs that previously mapped to
> +	 * the specified VA range.
> +	 *
> +	 * Those VMAs, and the resources they point to will be released as part of
> +	 * the op_ctx cleanup operation.
> +	 */
> +	struct list_head returned_vmas;
> +
> +	/** @map: Fields specific to a map operation. */
> +	struct {
> +		/** @gem: GEM object information. */
> +		struct {
> +			/** @obj: GEM object to map. */
> +			struct drm_gem_object *obj;
> +
> +			/** @offset: Offset in the GEM object. */
> +			u64 offset;
> +		} gem;
> +
> +		/**
> +		 * @sgt: sg-table pointing to pages backing the GEM object.
> +		 *
> +		 * This is gathered at job creation time, such that we don't have
> +		 * to allocate in ::run_job().
> +		 */
> +		struct sg_table *sgt;
> +
> +		/**
> +		 * @prev_vma: Pre-allocated VMA object to deal with a remap situation.
> +		 *
> +		 * If the map request covers a region that's inside another VMA, the
> +		 * previous VMA will be split, requiring instantiation of a maximum of
> +		 * two new VMA objects.
> +		 */
> +		struct panthor_vma *prev_vma;
> +
> +		/**
> +		 * @new_vma: The new VMA object that will be inserted to the VA tree.
> +		 */
> +		struct panthor_vma *new_vma;
> +
> +		/**
> +		 * @next_vma: Pre-allocated VMA object to deal with a remap situation.
> +		 *
> +		 * See @prev_vma.
> +		 */
> +		struct panthor_vma *next_vma;

It's probably premature optimization, but it feels like having a cache
of these VMA structures might be an idea. I'm also struggling to
understand how both a new prev and new next VMA are needed - but I
haven't dug into the GPU VA manager.
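
Thinking about it a bit more, I guess the case where both are needed is a
map request landing strictly inside an existing mapping, which (if I
understand the VA manager correctly) is expressed as a remap keeping both
ends of the old VMA, e.g.:

	/* Existing VMA:                  [0x0000_0000, 0x0040_0000)
	 * New map request:               [0x0010_0000, 0x0020_0000)
	 * => remap op keeping:
	 *      start of the old VMA      [0x0000_0000, 0x0010_0000) -> map.prev_vma
	 *      end of the old VMA        [0x0020_0000, 0x0040_0000) -> map.next_vma
	 * => plus the new mapping itself [0x0010_0000, 0x0020_0000) -> map.new_vma
	 */

If that's right, a one-line note in the kerneldoc here would help.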

> +	} map;
> +};
> +
> +/**
> + * struct panthor_vm - VM object
> + *
> + * A VM is an object representing a GPU (or MCU) virtual address space.
> + * It embeds the MMU page table for this address space, a tree containing
> + * all the virtual mappings of GEM objects, and other things needed to manage
> + * the VM.
> + *
> + * Except for the MCU VM, which is managed by the kernel, all other VMs are
> + * created by userspace and mostly managed by userspace, using the
> + * %DRM_IOCTL_PANTHOR_VM_BIND ioctl.
> + *
> + * A portion of the virtual address space is reserved for kernel objects,
> + * like heap chunks, and userspace gets to decide how much of the virtual
> + * address space is left to the kernel (half of the virtual address space
> + * by default).
> + */
> +struct panthor_vm {
> +	/**
> +	 * @va_mgr: GPU VA manager.
> +	 *
> +	 * We delegate all the VA management to the common drm_gpuva_mgr framework
> +	 * and only implement hooks to update the MMU page table.
> +	 */
> +	struct drm_gpuva_manager va_mgr;
> +
> +	/**
> +	 * @sched: Scheduler used for asynchronous VM_BIND requests.
> +	 *
> +	 * We use a 1:1 scheduler here.
> +	 */
> +	struct drm_gpu_scheduler sched;
> +
> +	/**
> +	 * @entity: Scheduling entity representing the VM_BIND queue.
> +	 *
> +	 * There's currently one bind queue per VM. It doesn't make sense to
> +	 * allow more given the VM operations are serialized anyway.
> +	 */
> +	struct drm_sched_entity entity;
> +
> +	/** @ptdev: Device. */
> +	struct panthor_device *ptdev;
> +
> +	/** @refcount: Reference count. */
> +	struct kref refcount;
> +
> +	/** @memattr: Value to program to the AS_MEMATTR register. */
> +	u64 memattr;
> +
> +	/** @pgtbl_ops: Page table operations. */
> +	struct io_pgtable_ops *pgtbl_ops;
> +
> +	/**
> +	 * @dummy_gem: Used as a VM reservation object.
> +	 *
> +	 * We declare a drm_gem_object and not a dma_resv, so we can use drm_exec()
> +	 * for the VM reservation.
> +	 *
> +	 * All private BOs use the resv of this dummy GEM object instead of
> +	 * drm_gem_object::_resv, such that private GEM preparation is O(1)
> +	 * instead of O(N).
> +	 */
> +	struct drm_gem_object dummy_gem;
> +
> +	/**
> +	 * @op_lock: Lock used to serialize operations on a VM.
> +	 *
> +	 * The serialization of jobs queued to the VM_BIND queue is already
> +	 * taken care of by drm_sched, but we need to serialize synchronous
> +	 * and asynchronous VM_BIND requests. This is what this lock is for.
> +	 */
> +	struct mutex op_lock;
> +
> +	/**
> +	 * @op_ctx: The context attached to the currently executing VM operation.
> +	 *
> +	 * NULL when no operation is in progress.
> +	 */
> +	struct panthor_vm_op_ctx *op_ctx;
> +
> +	/**
> +	 * @shared_bos: List of shared BOs.
> +	 *
> +	 * Shared BOs don't use the VM resv, and need to be prepared
> +	 * independently. This list keeps track of all VMAs that target
> +	 * non-private BOs.
> +	 *
> +	 * There might be duplicates, but drm_exec and dma_resv should
> +	 * handle that for us.
> +	 *
> +	 * TODO: This is not optimal. We should probably switch to the
> +	 * drm_gpuva_mgr solution for handling shared BOs once it's
> +	 * ready.
> +	 */
> +	struct list_head shared_bos;
> +
> +	/**
> +	 * @mm: Memory management object representing the auto-VA/kernel-VA.
> +	 *
> +	 * Used to auto-allocate VA space for kernel-managed objects (tiler
> +	 * heaps, ...).
> +	 *
> +	 * For the MCU VM, this is managing the VA range that's used to map
> +	 * all shared interfaces.
> +	 *
> +	 * For user VMs, the range is specified by userspace, and must not
> +	 * exceed half of the addressable VA space.
> +	 */
> +	struct drm_mm mm;
> +
> +	/** @mm_lock: Lock protecting the @mm field. */
> +	struct mutex mm_lock;
> +
> +	/** @as: Address space related fields. */
> +	struct {
> +		/**
> +		 * @id: ID of the address space this VM is bound to.
> +		 *
> +		 * A value of -1 means the VM is inactive/not bound.
> +		 */
> +		int id;
> +
> +		/**
> +		 * @lru_node: Used to insert the VM in the panthor_mmu::as::lru_list.
> +		 *
> +		 * Active VMs should not be inserted in the LRU list.
> +		 */
> +		struct list_head lru_node;
> +	} as;
> +
> +	/**
> +	 * @heaps: Tiler heap related fields.
> +	 */
> +	struct {
> +		/**
> +		 * @pool: The heap pool attached to this VM.
> +		 *
> +		 * Will stay NULL until someone creates a heap context on this VM.
> +		 */
> +		struct panthor_heap_pool *pool;
> +
> +		/** @lock: Lock used to protect access to @pool. */
> +		struct mutex lock;
> +	} heaps;
> +
> +	/** @node: Used to insert the VM in the panthor_mmu::vm::list. */
> +	struct list_head node;
> +
> +	/** @for_mcu: True if this is the MCU VM. */
> +	bool for_mcu;
> +
> +	/**
> +	 * @destroyed: True if the VM was destroyed.
> +	 *
> +	 * No further bind requests should be queued to a destroyed VM.
> +	 */
> +	bool destroyed;
> +
> +	/**
> +	 * @unusable: True if the VM has turned unusable because something
> +	 * bad happened during an asynchronous request.
> +	 *
> +	 * We don't try to recover from such failures, because this implies
> +	 * informing userspace about the specific operation that failed, and
> +	 * hoping the userspace driver can replay things from there. This all
> +	 * sounds very complicated for little gain.
> +	 *
> +	 * Instead, we should just flag the VM as unusable, and fail any
> +	 * further request targeting this VM.
> +	 *
> +	 * We also provide a way to query a VM state, so userspace can destroy
> +	 * it and create a new one.
> +	 *
> +	 * As an analogy, this would be mapped to a VK_ERROR_DEVICE_LOST
> +	 * situation, where the logical device needs to be re-created.
> +	 */
> +	bool unusable;
> +};
> +
> +/**
> + * struct panthor_vm_bind_job - VM bind job
> + */
> +struct panthor_vm_bind_job {
> +	/** @base: Inherit from drm_sched_job. */
> +	struct drm_sched_job base;
> +
> +	/** @refcount: Reference count. */
> +	struct kref refcount;
> +
> +	/** @cleanup_op_ctx_work: Work used to cleanup the VM operation context. */
> +	struct work_struct cleanup_op_ctx_work;
> +
> +	/** @vm: VM targeted by the VM operation. */
> +	struct panthor_vm *vm;
> +
> +	/** @ctx: Operation context. */
> +	struct panthor_vm_op_ctx ctx;
> +};
> +
> +/**
> + * @pt_cache: Cache used to allocate MMU page tables.
> + *
> + * The pre-allocation pattern forces us to over-allocate to plan for
> + * the worst case scenario, and return the pages we didn't use.
> + *
> + * Having a kmem_cache allows us to speed up allocations.
> + */
> +static struct kmem_cache *pt_cache;
> +
> +/**
> + * alloc_pt() - Custom page table allocator
> + * @cookie: Cookie passed at page table allocation time.
> + * @size: Size of the page table. This size should be fixed,
> + * and determined at creation time based on the granule size.
> + * @gfp: GFP flags.
> + *
> + * We want a custom allocator so we can use a cache for page table
> + * allocations and amortize the cost of the over-reservation that's
> + * done to allow asynchronous VM operations.
> + *
> + * Return: non-NULL on success, NULL if the allocation failed for any
> + * reason.
> + */
> +static void *alloc_pt(void *cookie, size_t size, gfp_t gfp)
> +{
> +	struct panthor_vm *vm = cookie;
> +	void *page;
> +
> +	/* We're not supposed to have anything bigger than 4k here, because we picked a
> +	 * 4k granule size at init time.
> +	 */
> +	if (drm_WARN_ON(&vm->ptdev->base, size != SZ_4K))
> +		return NULL;
> +
> +	/* The root page table allocation happens during init. */
> +	if (!vm->pgtbl_ops) {
> +		drm_WARN_ON(&vm->ptdev->base, vm->op_ctx);
> +		page = kmem_cache_alloc(pt_cache, gfp);
> +		goto out;
> +	}
> +
> +	/* We must have some op_ctx attached to the VM and it must have at least one
> +	 * free page.
> +	 */
> +	if (drm_WARN_ON(&vm->ptdev->base, !vm->op_ctx) ||
> +	    drm_WARN_ON(&vm->ptdev->base,
> +			vm->op_ctx->rsvd_page_tables.ptr >= vm->op_ctx->rsvd_page_tables.count))
> +		return NULL;
> +
> +	page = vm->op_ctx->rsvd_page_tables.pages[vm->op_ctx->rsvd_page_tables.ptr++];
> +	memset(page, 0, SZ_4K);
> +
> +out:
> +	/* Page table entries don't use virtual addresses, which trips out
> +	 * kmemleak. kmemleak_alloc_phys() might work, but physical addresses
> +	 * are mixed with other fields, and I fear kmemleak won't detect that
> +	 * either.
> +	 *
> +	 * Let's just ignore memory passed to the page-table driver for now.
> +	 */
> +	kmemleak_ignore(page);
> +	return page;
> +}
> +
> +/**
> + * free_pt() - Custom page table free function
> + * @cookie: Cookie passed at page table allocation time.
> + * @data: Page table to free.
> + * @size: Size of the page table. This size should be fixed,
> + * and determined at creation time based on the granule size.
> + */
> +static void free_pt(void *cookie, void *data, size_t size)
> +{
> +	struct panthor_vm *vm = cookie;
> +
> +	if (drm_WARN_ON(&vm->ptdev->base, size != SZ_4K))
> +		return;
> +
> +	/* Return the page to the pt_cache. */
> +	kmem_cache_free(pt_cache, data);
> +}
> +
> +static int wait_ready(struct panthor_device *ptdev, u32 as_nr)
> +{
> +	int ret;
> +	u32 val;
> +
> +	/* Wait for the MMU status to indicate there is no active command, in
> +	 * case one is pending.
> +	 */
> +	ret = readl_relaxed_poll_timeout_atomic(ptdev->iomem + AS_STATUS(as_nr),
> +						val, !(val & AS_STATUS_AS_ACTIVE),
> +						10, 100000);
> +
> +	if (ret) {
> +		panthor_device_schedule_reset(ptdev);
> +		drm_err(&ptdev->base, "AS_ACTIVE bit stuck\n");
> +	}
> +
> +	return ret;
> +}
> +
> +static int write_cmd(struct panthor_device *ptdev, u32 as_nr, u32 cmd)
> +{
> +	int status;
> +
> +	/* write AS_COMMAND when MMU is ready to accept another command */
> +	status = wait_ready(ptdev, as_nr);
> +	if (!status)
> +		gpu_write(ptdev, AS_COMMAND(as_nr), cmd);
> +
> +	return status;
> +}
> +
> +static void lock_region(struct panthor_device *ptdev, u32 as_nr,
> +			u64 region_start, u64 size)
> +{
> +	u8 region_width;
> +	u64 region;
> +	u64 region_end = region_start + size;
> +
> +	if (!size)
> +		return;
> +
> +	/*
> +	 * The locked region is a naturally aligned power of 2 block encoded as
> +	 * log2 minus 1.
> +	 * Calculate the desired start/end and look for the highest bit which
> +	 * differs. The smallest naturally aligned block must include this bit
> +	 * change, the desired region starts with this bit (and subsequent bits)
> +	 * zeroed and ends with the bit (and subsequent bits) set to one.
> +	 */
> +	region_width = max(fls64(region_start ^ (region_end - 1)),
> +			   const_ilog2(AS_LOCK_REGION_MIN_SIZE)) - 1;
> +
> +	/*
> +	 * Mask off the low bits of region_start (which would be ignored by
> +	 * the hardware anyway)
> +	 */
> +	region_start &= GENMASK_ULL(63, region_width);
> +
> +	region = region_width | region_start;
> +
> +	/* Lock the region that needs to be updated */
> +	gpu_write(ptdev, AS_LOCKADDR_LO(as_nr), lower_32_bits(region));
> +	gpu_write(ptdev, AS_LOCKADDR_HI(as_nr), upper_32_bits(region));
> +	write_cmd(ptdev, as_nr, AS_COMMAND_LOCK);
> +}
> +
> +static int mmu_hw_do_operation_locked(struct panthor_device *ptdev, int as_nr,
> +				      u64 iova, u64 size, u32 op)
> +{
> +	if (as_nr < 0)
> +		return 0;
> +
> +	if (op != AS_COMMAND_UNLOCK)
> +		lock_region(ptdev, as_nr, iova, size);
> +
> +	/* Run the MMU operation */
> +	write_cmd(ptdev, as_nr, op);
> +
> +	/* Wait for the flush to complete */
> +	return wait_ready(ptdev, as_nr);
> +}
> +
> +static int mmu_hw_do_operation(struct panthor_vm *vm,
> +			       u64 iova, u64 size, u32 op)
> +{
> +	struct panthor_device *ptdev = vm->ptdev;
> +	int ret;
> +
> +	spin_lock(&ptdev->mmu->as.slots[vm->as.id].lock);
> +	ret = mmu_hw_do_operation_locked(ptdev, vm->as.id, iova, size, op);
> +	spin_unlock(&ptdev->mmu->as.slots[vm->as.id].lock);
> +	return ret;
> +}
> +
> +static int panthor_mmu_as_enable(struct panthor_device *ptdev, u32 as_nr,
> +				 u64 transtab, u64 transcfg, u64 memattr)
> +{
> +	int ret;
> +
> +	ret = mmu_hw_do_operation_locked(ptdev, as_nr, 0, ~0ULL, AS_COMMAND_FLUSH_MEM);
> +	if (ret)
> +		return ret;
> +
> +	gpu_write(ptdev, AS_TRANSTAB_LO(as_nr), lower_32_bits(transtab));
> +	gpu_write(ptdev, AS_TRANSTAB_HI(as_nr), upper_32_bits(transtab));
> +
> +	gpu_write(ptdev, AS_MEMATTR_LO(as_nr), lower_32_bits(memattr));
> +	gpu_write(ptdev, AS_MEMATTR_HI(as_nr), upper_32_bits(memattr));
> +
> +	gpu_write(ptdev, AS_TRANSCFG_LO(as_nr), lower_32_bits(transcfg));
> +	gpu_write(ptdev, AS_TRANSCFG_HI(as_nr), upper_32_bits(transcfg));
> +
> +	return write_cmd(ptdev, as_nr, AS_COMMAND_UPDATE);
> +}
> +
> +static int panthor_mmu_as_disable(struct panthor_device *ptdev, u32 as_nr)
> +{
> +	int ret;
> +
> +	ret = mmu_hw_do_operation_locked(ptdev, as_nr, 0, ~0ULL, AS_COMMAND_FLUSH_MEM);
> +	if (ret)
> +		return ret;
> +
> +	gpu_write(ptdev, AS_TRANSTAB_LO(as_nr), 0);
> +	gpu_write(ptdev, AS_TRANSTAB_HI(as_nr), 0);
> +
> +	gpu_write(ptdev, AS_MEMATTR_LO(as_nr), 0);
> +	gpu_write(ptdev, AS_MEMATTR_HI(as_nr), 0);
> +
> +	gpu_write(ptdev, AS_TRANSCFG_LO(as_nr), AS_TRANSCFG_ADRMODE_UNMAPPED);
> +	gpu_write(ptdev, AS_TRANSCFG_HI(as_nr), 0);
> +
> +	return write_cmd(ptdev, as_nr, AS_COMMAND_UPDATE);
> +}
> +
> +static u32 panthor_mmu_fault_mask(struct panthor_device *ptdev, u32 value)
> +{
> +	/* Bits 16 to 31 mean REQ_COMPLETE. */
> +	return value & GENMASK(15, 0);
> +}
> +
> +static u32 panthor_mmu_as_fault_mask(struct panthor_device *ptdev, u32 as)
> +{
> +	return BIT(as);
> +}
> +
> +/**
> + * panthor_vm_active() - Flag a VM as active
> + * @vm: VM to flag as active.
> + *
> + * Assigns an address space to a VM so it can be used by the GPU/MCU.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_active(struct panthor_vm *vm)
> +{
> +	struct panthor_device *ptdev = vm->ptdev;
> +	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg;
> +	int ret = 0, as, cookie;
> +	u64 transtab, transcfg;
> +
> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> +		return -ENODEV;
> +
> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> +
> +	as = vm->as.id;
> +	if (as >= 0) {
> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> +
> +		if (ptdev->mmu->as.faulty_mask & mask) {
> +			/* Unhandled pagefault on this AS, the MMU was
> +			 * disabled. We need to re-enable the MMU after
> +			 * clearing+unmasking the AS interrupts.
> +			 */
> +			gpu_write(ptdev, MMU_INT_CLEAR, mask);
> +			ptdev->mmu->as.faulty_mask &= ~mask;
> +			gpu_write(ptdev, MMU_INT_MASK, ~ptdev->mmu->as.faulty_mask);
> +			goto out_enable_as;
> +		}
> +
> +		goto out_unlock;
> +	}
> +
> +	/* Check for a free AS */
> +	if (vm->for_mcu) {
> +		drm_WARN_ON(&ptdev->base, ptdev->mmu->as.alloc_mask & BIT(0));
> +		as = 0;
> +	} else {
> +		as = ffz(ptdev->mmu->as.alloc_mask | BIT(0));
> +	}
> +
> +	if (!(BIT(as) & ptdev->gpu_info.as_present)) {
> +		struct panthor_vm *lru_vm;
> +
> +		lru_vm = list_first_entry_or_null(&ptdev->mmu->as.lru_list,
> +						  struct panthor_vm,
> +						  as.lru_node);
> +		if (drm_WARN_ON(&ptdev->base, !lru_vm)) {
> +			ret = -EBUSY;
> +			goto out_unlock;
> +		}
> +
> +		list_del_init(&lru_vm->as.lru_node);
> +		as = lru_vm->as.id;

Should this not set lru_vm->as.id = -1, so that the code knows the VM no
longer has an address space?
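
i.e. something like:

		list_del_init(&lru_vm->as.lru_node);
		as = lru_vm->as.id;
		lru_vm->as.id = -1;

otherwise the evicted VM would still think it owns the slot the next time
panthor_vm_active() is called on it.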

> +	} else {
> +		set_bit(as, &ptdev->mmu->as.alloc_mask);
> +	}
> +
> +	/* Assign the free or reclaimed AS to the VM */
> +	vm->as.id = as;
> +	ptdev->mmu->as.slots[as].vm = vm;
> +
> +out_enable_as:
> +	transtab = cfg->arm_lpae_s1_cfg.ttbr;
> +	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
> +		   AS_TRANSCFG_PTW_RA |
> +		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
> +	if (ptdev->coherent)
> +		transcfg |= AS_TRANSCFG_PTW_SH_OS;
> +
> +	ret = panthor_mmu_as_enable(vm->ptdev, vm->as.id, transtab, transcfg, vm->memattr);
> +
> +out_unlock:
> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +/**
> + * panthor_vm_idle() - Flag a VM as idle
> + * @vm: VM to flag as idle.
> + *
> + * When we know the GPU is done with the VM (no more jobs to process),
> + * we can relinquish the AS slot attached to this VM, if any.
> + *
> + * We don't release the slot immediately, but instead place the VM in
> + * the LRU list, so it can be evicted if another VM needs an AS slot.
> + * This way, VMs stay attached to the AS they were given until we run
> + * out of free slots, limiting the number of MMU operations (TLB flush
> + * and other AS updates).
> + */
> +void panthor_vm_idle(struct panthor_vm *vm)
> +{
> +	struct panthor_device *ptdev = vm->ptdev;
> +
> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> +	if (vm->as.id >= 0 && list_empty(&vm->as.lru_node))
> +		list_add_tail(&vm->as.lru_node, &ptdev->mmu->as.lru_list);
> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> +}
> +
> +static void panthor_vm_stop(struct panthor_vm *vm)
> +{
> +	drm_sched_stop(&vm->sched, NULL);
> +}
> +
> +static void panthor_vm_start(struct panthor_vm *vm)
> +{
> +	drm_sched_start(&vm->sched, true);
> +}
> +
> +/**
> + * panthor_vm_as() - Get the AS slot attached to a VM
> + * @vm: VM to get the AS slot of.
> + *
> + * Return: -1 if the VM is not assigned an AS slot yet, >= 0 otherwise.
> + */
> +int panthor_vm_as(struct panthor_vm *vm)
> +{
> +	return vm->as.id;
> +}
> +
> +static size_t get_pgsize(u64 addr, size_t size, size_t *count)
> +{
> +	/*
> +	 * io-pgtable only operates on multiple pages within a single table
> +	 * entry, so we need to split at boundaries of the table size, i.e.
> +	 * the next block size up. The distance from address A to the next
> +	 * boundary of block size B is logically B - A % B, but in unsigned
> +	 * two's complement where B is a power of two we get the equivalence
> +	 * B - A % B == (B - A) % B == (n * B - A) % B, and choose n = 0 :)
> +	 */
> +	size_t blk_offset = -addr % SZ_2M;
> +
> +	if (blk_offset || size < SZ_2M) {
> +		*count = min_not_zero(blk_offset, size) / SZ_4K;
> +		return SZ_4K;
> +	}
> +	blk_offset = -addr % SZ_1G ?: SZ_1G;
> +	*count = min(blk_offset, size) / SZ_2M;
> +	return SZ_2M;
> +}
> +
> +static int panthor_vm_flush_range(struct panthor_vm *vm, u64 iova, u64 size)
> +{
> +	struct panthor_device *ptdev = vm->ptdev;
> +	int ret = 0, cookie;
> +
> +	if (vm->as.id < 0)
> +		return 0;
> +
> +	/* If the device is unplugged, we just silently skip the flush. */
> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> +		return 0;
> +
> +	/* Flush the PTs only if we're already awake */
> +	if (pm_runtime_active(ptdev->base.dev))
> +		ret = mmu_hw_do_operation(vm, iova, size, AS_COMMAND_FLUSH_PT);
> +
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +static int panthor_vm_unmap_pages(struct panthor_vm *vm, u64 iova, size_t size)
> +{
> +	struct panthor_device *ptdev = vm->ptdev;
> +	struct io_pgtable_ops *ops = vm->pgtbl_ops;
> +	size_t offset = 0;
> +
> +	drm_dbg(&ptdev->base, "unmap: as=%d, iova=%llx, len=%zx", vm->as.id, iova, size);
> +
> +	while (offset < size) {
> +		size_t unmapped_sz = 0, pgcount;
> +		size_t pgsize = get_pgsize(iova + offset, size - offset, &pgcount);
> +
> +		unmapped_sz = ops->unmap_pages(ops, iova + offset, pgsize, pgcount, NULL);
> +
> +		if (drm_WARN_ON(&ptdev->base, unmapped_sz != pgsize * pgcount)) {
> +			drm_err(&ptdev->base, "failed to unmap range %llx-%llx (requested range %llx-%llx)\n",
> +				iova + offset + unmapped_sz,
> +				iova + offset + pgsize * pgcount,
> +				iova, iova + size);
> +			panthor_vm_flush_range(vm, iova, offset + unmapped_sz);
> +			return  -EINVAL;
> +		}
> +		offset += unmapped_sz;
> +	}
> +
> +	return panthor_vm_flush_range(vm, iova, size);
> +}
> +
> +static int
> +panthor_vm_map_pages(struct panthor_vm *vm, u64 iova, int prot,
> +		     struct sg_table *sgt, u64 offset, ssize_t size)
> +{
> +	struct panthor_device *ptdev = vm->ptdev;
> +	unsigned int count;
> +	struct scatterlist *sgl;
> +	struct io_pgtable_ops *ops = vm->pgtbl_ops;
> +	u64 start_iova = iova;
> +	int ret;
> +
> +	if (!size)
> +		return 0;
> +
> +	for_each_sgtable_dma_sg(sgt, sgl, count) {
> +		dma_addr_t paddr = sg_dma_address(sgl);
> +		size_t len = sg_dma_len(sgl);
> +
> +		if (len <= offset) {
> +			offset -= len;
> +			continue;
> +		}
> +
> +		paddr -= offset;
> +		len -= offset;
> +
> +		if (size >= 0) {
> +			len = min_t(size_t, len, size);
> +			size -= len;
> +		}
> +
> +		drm_dbg(&ptdev->base, "map: as=%d, iova=%llx, paddr=%llx, len=%zx",
> +			vm->as.id, iova, paddr, len);
> +
> +		while (len) {
> +			size_t pgcount, mapped = 0;
> +			size_t pgsize = get_pgsize(iova | paddr, len, &pgcount);
> +
> +			ret = ops->map_pages(ops, iova, paddr, pgsize, pgcount, prot,
> +					     GFP_KERNEL, &mapped);
> +			iova += mapped;
> +			paddr += mapped;
> +			len -= mapped;
> +
> +			if (drm_WARN_ON(&ptdev->base, !ret && !mapped))
> +				ret = -ENOMEM;
> +
> +			if (ret) {
> +				/* If something failed, unmap what we've already mapped before
> +				 * returning. The unmap call is not supposed to fail.
> +				 */
> +				drm_WARN_ON(&ptdev->base,
> +					    panthor_vm_unmap_pages(vm, start_iova,
> +								   iova - start_iova));
> +				return ret;
> +			}
> +		}
> +
> +		if (!size)
> +			break;
> +	}
> +
> +	return panthor_vm_flush_range(vm, start_iova, iova - start_iova);
> +}
> +
> +static int flags_to_prot(u32 flags)
> +{
> +	int prot = 0;
> +
> +	if (flags & DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC)
> +		prot |= IOMMU_NOEXEC;
> +
> +	if (!(flags & DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED))
> +		prot |= IOMMU_CACHE;
> +
> +	if (flags & DRM_PANTHOR_VM_BIND_OP_MAP_READONLY)
> +		prot |= IOMMU_READ;
> +	else
> +		prot |= IOMMU_READ | IOMMU_WRITE;
> +
> +	return prot;
> +}
> +
> +/**
> + * panthor_vm_alloc_va() - Allocate a region in the auto-va space
> + * @vm: VM to allocate a region on.
> + * @size: Size of the region.
> + *
> + * Some GPU objects, like heap chunks, are fully managed by the kernel and
> + * need to be mapped to the userspace VM, in the region reserved for kernel
> + * objects.
> + *
> + * This function takes care of allocating a region in this reserved space.
> + *
> + * Return: A valid pointer on success, and ERR_PTR() otherwise.
> + */
> +struct drm_mm_node *
> +panthor_vm_alloc_va(struct panthor_vm *vm, size_t size)
> +{
> +	struct drm_mm_node *mm_node;
> +	int ret;
> +
> +	if (!size || (size & ~PAGE_MASK))
> +		return ERR_PTR(-EINVAL);
> +
> +	mm_node = kzalloc(sizeof(*mm_node), GFP_KERNEL);
> +	if (!mm_node)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mutex_lock(&vm->mm_lock);
> +	ret = drm_mm_insert_node(&vm->mm, mm_node, size);
> +	mutex_unlock(&vm->mm_lock);
> +
> +	if (ret) {
> +		kfree(mm_node);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return mm_node;
> +}
> +
> +/**
> + * panthor_vm_free_va() - Free a region allocated with panthor_vm_alloc_va()
> + * @vm: VM to free the region on.
> + * @mm_node: Memory node representing the region to free.
> + */
> +void panthor_vm_free_va(struct panthor_vm *vm, struct drm_mm_node *mm_node)
> +{
> +	if (!mm_node)
> +		return;
> +
> +	mutex_lock(&vm->mm_lock);
> +	drm_mm_remove_node(mm_node);
> +	mutex_unlock(&vm->mm_lock);
> +
> +	kfree(mm_node);
> +}
> +
> +static void panthor_vm_cleanup_op_ctx(struct panthor_vm_op_ctx *op_ctx,
> +				      struct panthor_vm *vm)
> +{
> +	struct panthor_vma *vma, *tmp_vma;
> +
> +	u32 remaining_pt_count = op_ctx->rsvd_page_tables.count -
> +				 op_ctx->rsvd_page_tables.ptr;
> +
> +	if (remaining_pt_count) {
> +		kmem_cache_free_bulk(pt_cache, remaining_pt_count,
> +				     op_ctx->rsvd_page_tables.pages +
> +				     op_ctx->rsvd_page_tables.ptr);
> +	}
> +
> +	kfree(op_ctx->rsvd_page_tables.pages);
> +	memset(&op_ctx->rsvd_page_tables, 0, sizeof(op_ctx->rsvd_page_tables));
> +
> +	if (op_ctx->map.gem.obj) {
> +		struct panthor_gem_object *bo = to_panthor_bo(op_ctx->map.gem.obj);
> +
> +		if (!bo->base.base.import_attach)
> +			drm_gem_shmem_unpin(&bo->base);
> +
> +		drm_gem_object_put(&bo->base.base);
> +	}
> +
> +	kfree(op_ctx->map.new_vma);
> +	kfree(op_ctx->map.next_vma);
> +	kfree(op_ctx->map.prev_vma);
> +	memset(&op_ctx->map, 0, sizeof(op_ctx->map));
> +
> +	list_for_each_entry_safe(vma, tmp_vma, &op_ctx->returned_vmas, node) {
> +		struct panthor_gem_object *bo = to_panthor_bo(vma->base.gem.obj);
> +
> +		if (!bo->base.base.import_attach)
> +			drm_gem_shmem_unpin(&bo->base);
> +
> +		drm_gem_object_put(&bo->base.base);
> +		list_del(&vma->node);
> +		kfree(vma);
> +	}
> +}
> +
> +#define PANTHOR_VM_BIND_OP_MAP_FLAGS \
> +	(DRM_PANTHOR_VM_BIND_OP_MAP_READONLY | \
> +	 DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | \
> +	 DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED | \
> +	 DRM_PANTHOR_VM_BIND_OP_TYPE_MASK)
> +
> +static int panthor_vm_prepare_map_op_ctx(struct panthor_vm_op_ctx *op_ctx,
> +					 struct panthor_vm *vm,
> +					 struct panthor_gem_object *bo,
> +					 u64 offset,
> +					 size_t size, u64 va,
> +					 u32 flags)
> +{
> +	struct sg_table *sgt = NULL;
> +	u64 pt_count;
> +	int ret;
> +
> +	if (!bo)
> +		return -EINVAL;
> +
> +	if ((flags & ~PANTHOR_VM_BIND_OP_MAP_FLAGS) ||
> +	    (flags & DRM_PANTHOR_VM_BIND_OP_TYPE_MASK) != DRM_PANTHOR_VM_BIND_OP_TYPE_MAP)
> +		return -EINVAL;
> +
> +	/* Make sure the VA and size are aligned and in-bounds. */
> +	if (size > bo->base.base.size || offset > bo->base.base.size - size)
> +		return -EINVAL;
> +
> +	/* If the BO has an exclusive VM attached, it can't be mapped to other VMs. */
> +	if (bo->exclusive_vm && bo->exclusive_vm != vm)
> +		return -EINVAL;
> +
> +	memset(op_ctx, 0, sizeof(*op_ctx));
> +	INIT_LIST_HEAD(&op_ctx->returned_vmas);
> +	op_ctx->flags = flags;
> +	op_ctx->va.range = size;
> +	op_ctx->va.addr = va;
> +
> +	op_ctx->map.new_vma = kzalloc(sizeof(*op_ctx->map.new_vma), GFP_KERNEL);
> +	op_ctx->map.next_vma = kzalloc(sizeof(*op_ctx->map.next_vma), GFP_KERNEL);
> +	op_ctx->map.prev_vma = kzalloc(sizeof(*op_ctx->map.prev_vma), GFP_KERNEL);
> +	if (!op_ctx->map.new_vma || !op_ctx->map.next_vma || !op_ctx->map.prev_vma) {
> +		ret = -ENOMEM;
> +		goto err_cleanup;
> +	}
> +
> +	if (!bo->base.base.import_attach) {
> +		/* Pre-reserve the BO pages, so the map operation doesn't have to
> +		 * allocate.
> +		 */
> +		ret = drm_gem_shmem_pin(&bo->base);
> +		if (ret)
> +			goto err_cleanup;
> +	}
> +
> +	sgt = drm_gem_shmem_get_pages_sgt(&bo->base);
> +	if (IS_ERR(sgt)) {
> +		if (!bo->base.base.import_attach)
> +			drm_gem_shmem_unpin(&bo->base);
> +
> +		ret = PTR_ERR(sgt);
> +		goto err_cleanup;
> +	}
> +
> +	op_ctx->map.sgt = sgt;
> +	op_ctx->map.gem.obj = &bo->base.base;
> +	op_ctx->map.gem.offset = offset;
> +	drm_gem_object_get(op_ctx->map.gem.obj);
> +
> +	/* L1, L2 and L3 page tables.
> +	 * We could optimize L3 allocation by iterating over the sgt and merging
> +	 * 2M contiguous blocks, but it's simpler to over-provision and return
> +	 * the pages if they're not used.
> +	 */
> +	pt_count = ((ALIGN(va + size, 1ull << 39) - ALIGN_DOWN(va, 1ull << 39)) >> 39) +
> +		   ((ALIGN(va + size, 1ull << 30) - ALIGN_DOWN(va, 1ull << 30)) >> 30) +
> +		   ((ALIGN(va + size, 1ull << 21) - ALIGN_DOWN(va, 1ull << 21)) >> 21);
> +
> +	op_ctx->rsvd_page_tables.pages = kcalloc(pt_count,
> +						 sizeof(*op_ctx->rsvd_page_tables.pages),
> +						 GFP_KERNEL);
> +	if (!op_ctx->rsvd_page_tables.pages)
> +		goto err_cleanup;
> +
> +	ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
> +				    op_ctx->rsvd_page_tables.pages);
> +	op_ctx->rsvd_page_tables.count = ret;
> +	if (ret != pt_count) {
> +		ret = -ENOMEM;
> +		goto err_cleanup;
> +	}
> +
> +	return 0;
> +
> +err_cleanup:
> +	panthor_vm_cleanup_op_ctx(op_ctx, vm);
> +	return ret;
> +}
> +
> +static int panthor_vm_prepare_unmap_op_ctx(struct panthor_vm_op_ctx *op_ctx,
> +					   struct panthor_vm *vm,
> +					   u64 va, size_t size)
> +{
> +	u32 pt_count = 0;
> +	int ret;
> +
> +	memset(op_ctx, 0, sizeof(*op_ctx));
> +	INIT_LIST_HEAD(&op_ctx->returned_vmas);
> +	op_ctx->va.range = size;
> +	op_ctx->va.addr = va;
> +	op_ctx->flags = DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP;
> +
> +	/* Pre-allocate L3 page tables to account for the split-2M-block
> +	 * situation on unmap.
> +	 */
> +	if (va != ALIGN(va, SZ_2M))
> +		pt_count++;
> +
> +	if (va + size != ALIGN(va + size, SZ_2M) &&
> +	    ALIGN(va + size, SZ_2M) != ALIGN(va, SZ_2M))
> +		pt_count++;
> +
> +	if (pt_count) {
> +		op_ctx->rsvd_page_tables.pages = kcalloc(pt_count,
> +							 sizeof(*op_ctx->rsvd_page_tables.pages),
> +							 GFP_KERNEL);
> +		if (!op_ctx->rsvd_page_tables.pages)
> +			goto err_cleanup;
> +
> +		ret = kmem_cache_alloc_bulk(pt_cache, GFP_KERNEL, pt_count,
> +					    op_ctx->rsvd_page_tables.pages);
> +		if (ret != pt_count) {
> +			ret = -ENOMEM;
> +			goto err_cleanup;
> +		}
> +		op_ctx->rsvd_page_tables.count = pt_count;
> +	}
> +
> +	return 0;
> +
> +err_cleanup:
> +	panthor_vm_cleanup_op_ctx(op_ctx, vm);
> +	return ret;
> +}
> +
> +/**
> + * panthor_vm_get_bo_for_va() - Get the GEM object mapped at a virtual address
> + * @vm: VM to look into.
> + * @va: Virtual address to search for.
> + * @bo_offset: Offset of the GEM object mapped at this virtual address.
> + * Only valid on success.
> + *
> + * The object returned by this function might no longer be mapped when the
> + * function returns. It's the caller's responsibility to ensure there's no
> + * concurrent map/unmap operations making the returned value invalid, or
> + * make sure it doesn't matter if the object is no longer mapped.
> + *
> + * Return: A valid pointer on success, an ERR_PTR() otherwise.
> + */
> +struct panthor_gem_object *
> +panthor_vm_get_bo_for_va(struct panthor_vm *vm, u64 va, u64 *bo_offset)
> +{
> +	struct panthor_gem_object *bo = ERR_PTR(-ENOENT);
> +	struct drm_gpuva *gpuva;
> +	struct panthor_vma *vma;
> +	int ret;
> +
> +	/* Take the VM lock to prevent concurrent map/unmap operation. */
> +	ret = dma_resv_lock(vm->dummy_gem.resv, NULL);
> +	if (drm_WARN_ON(&vm->ptdev->base, ret))
> +		return NULL;
> +
> +	gpuva = drm_gpuva_find_first(&vm->va_mgr, va, 1);
> +	vma = gpuva ? container_of(gpuva, struct panthor_vma, base) : NULL;
> +	if (vma && vma->base.gem.obj) {
> +		drm_gem_object_get(vma->base.gem.obj);
> +		bo = to_panthor_bo(vma->base.gem.obj);
> +		*bo_offset = vma->base.gem.offset;
> +	}
> +	dma_resv_unlock(vm->dummy_gem.resv);
> +
> +	return bo;
> +}
> +
> +/*
> + * Only 32 VMs per open file. If that becomes a limiting factor, we can
> + * increase this number.
> + */
> +#define PANTHOR_MAX_VMS_PER_FILE	 32
> +
> +/**
> + * panthor_vm_pool_create_vm() - Create a VM
> + * @ptdev: The panthor device.
> + * @pool: The VM pool to create this VM on.
> + * @kernel_va_start: Start of the region reserved for kernel objects.
> + * @kernel_va_range: Size of the region reserved for kernel objects.
> + *
> + * Return: a positive VM handle on success, a negative error code otherwise.
> + */
> +int panthor_vm_pool_create_vm(struct panthor_device *ptdev, struct panthor_vm_pool *pool,
> +			      u64 kernel_va_start, u64 kernel_va_range)
> +{
> +	struct panthor_vm *vm;
> +	int ret;
> +	u32 id;
> +
> +	vm = panthor_vm_create(ptdev, false, kernel_va_start, kernel_va_range);
> +	if (IS_ERR(vm))
> +		return PTR_ERR(vm);
> +
> +	ret = xa_alloc(&pool->xa, &id, vm,
> +		       XA_LIMIT(1, PANTHOR_MAX_VMS_PER_FILE), GFP_KERNEL);
> +
> +	if (ret) {
> +		panthor_vm_put(vm);
> +		return ret;
> +	}
> +
> +	return id;
> +}
> +
> +static void panthor_vm_destroy(struct panthor_vm *vm)
> +{
> +	if (!vm)
> +		return;
> +
> +	vm->destroyed = true;
> +
> +	mutex_lock(&vm->heaps.lock);
> +	panthor_heap_pool_destroy(vm->heaps.pool);
> +	vm->heaps.pool = NULL;
> +	mutex_unlock(&vm->heaps.lock);
> +
> +	drm_WARN_ON(&vm->ptdev->base,
> +		    panthor_vm_unmap_range(vm, vm->va_mgr.mm_start, vm->va_mgr.mm_range));
> +	panthor_vm_put(vm);
> +}
> +
> +/**
> + * panthor_vm_pool_destroy_vm() - Destroy a VM.
> + * @pool: VM pool.
> + * @handle: VM handle.
> + *
> + * This function doesn't free the VM object or its resources, it just kills
> + * all mappings, and makes sure nothing can be mapped after that point.
> + *
> + * If there were any active jobs at the time this function is called, these
> + * jobs should experience page faults and be killed as a result.
> + *
> + * The VM resources are freed when the last reference on the VM object is
> + * dropped.
> + */
> +int panthor_vm_pool_destroy_vm(struct panthor_vm_pool *pool, u32 handle)
> +{
> +	struct panthor_vm *vm;
> +
> +	vm = xa_erase(&pool->xa, handle);
> +
> +	panthor_vm_destroy(vm);
> +
> +	return vm ? 0 : -EINVAL;
> +}
> +
> +/**
> + * panthor_vm_pool_get_vm() - Retrieve VM object bound to a VM handle
> + * @pool: VM pool to check.
> + * @handle: Handle of the VM to retrieve.
> + *
> + * Return: A valid pointer if the VM exists, NULL otherwise.
> + */
> +struct panthor_vm *
> +panthor_vm_pool_get_vm(struct panthor_vm_pool *pool, u32 handle)
> +{
> +	struct panthor_vm *vm;
> +
> +	vm = panthor_vm_get(xa_load(&pool->xa, handle));
> +
> +	return vm;
> +}
> +
> +/**
> + * panthor_vm_pool_destroy() - Destroy a VM pool.
> + * @pfile: File.
> + *
> + * Destroy all VMs in the pool, and release the pool resources.
> + *
> + * Note that VMs can outlive the pool they were created from if other
> + * objects hold a reference to these VMs.
> + */
> +void panthor_vm_pool_destroy(struct panthor_file *pfile)
> +{
> +	struct panthor_vm *vm;
> +	unsigned long i;
> +
> +	if (!pfile->vms)
> +		return;
> +
> +	xa_for_each(&pfile->vms->xa, i, vm)
> +		panthor_vm_destroy(vm);
> +
> +	xa_destroy(&pfile->vms->xa);
> +	kfree(pfile->vms);
> +}
> +
> +/**
> + * panthor_vm_pool_create() - Create a VM pool
> + * @pfile: File.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_pool_create(struct panthor_file *pfile)
> +{
> +	pfile->vms = kzalloc(sizeof(*pfile->vms), GFP_KERNEL);
> +	if (!pfile->vms)
> +		return -ENOMEM;
> +
> +	xa_init_flags(&pfile->vms->xa, XA_FLAGS_ALLOC1);
> +	return 0;
> +}
> +
> +/* dummy TLB ops, the real TLB flush happens in panthor_vm_flush_range() */
> +static void mmu_tlb_flush_all(void *cookie)
> +{
> +}
> +
> +static void mmu_tlb_flush_walk(unsigned long iova, size_t size, size_t granule, void *cookie)
> +{
> +}
> +
> +static const struct iommu_flush_ops mmu_tlb_ops = {
> +	.tlb_flush_all = mmu_tlb_flush_all,
> +	.tlb_flush_walk = mmu_tlb_flush_walk,
> +};
> +
> +static const char *access_type_name(struct panthor_device *ptdev,
> +				    u32 fault_status)
> +{
> +	switch (fault_status & AS_FAULTSTATUS_ACCESS_TYPE_MASK) {
> +	case AS_FAULTSTATUS_ACCESS_TYPE_ATOMIC:
> +		return "ATOMIC";
> +	case AS_FAULTSTATUS_ACCESS_TYPE_READ:
> +		return "READ";
> +	case AS_FAULTSTATUS_ACCESS_TYPE_WRITE:
> +		return "WRITE";
> +	case AS_FAULTSTATUS_ACCESS_TYPE_EX:
> +		return "EXECUTE";
> +	default:
> +		drm_WARN_ON(&ptdev->base, 1);
> +		return NULL;
> +	}
> +}
> +
> +static void panthor_mmu_irq_handler(struct panthor_device *ptdev, u32 status)
> +{
> +	status = panthor_mmu_fault_mask(ptdev, status);
> +	while (status) {
> +		u32 as = ffs(status | (status >> 16)) - 1;
> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> +		u32 new_int_mask;
> +		u64 addr;
> +		u32 fault_status;
> +		u32 exception_type;
> +		u32 access_type;
> +		u32 source_id;
> +
> +		fault_status = gpu_read(ptdev, AS_FAULTSTATUS(as));
> +		addr = gpu_read(ptdev, AS_FAULTADDRESS_LO(as));
> +		addr |= (u64)gpu_read(ptdev, AS_FAULTADDRESS_HI(as)) << 32;
> +
> +		/* decode the fault status */
> +		exception_type = fault_status & 0xFF;
> +		access_type = (fault_status >> 8) & 0x3;
> +		source_id = (fault_status >> 16);
> +
> +		/* Page fault only */

This comment makes no sense - it looks like it's copied over from panfrost.

If I understand correctly we don't (currently) support growing on page
fault - and it's not really needed now the MCU can handle the tiler heaps.

> +		mutex_lock(&ptdev->mmu->as.slots_lock);
> +
> +		new_int_mask =
> +			panthor_mmu_fault_mask(ptdev, ~ptdev->mmu->as.faulty_mask);
> +
> +		/* terminal fault, print info about the fault */
> +		drm_err(&ptdev->base,
> +			"Unhandled Page fault in AS%d at VA 0x%016llX\n"
> +			"raw fault status: 0x%X\n"
> +			"decoded fault status: %s\n"
> +			"exception type 0x%X: %s\n"
> +			"access type 0x%X: %s\n"
> +			"source id 0x%X\n",
> +			as, addr,
> +			fault_status,
> +			(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
> +			exception_type, panthor_exception_name(ptdev, exception_type),
> +			access_type, access_type_name(ptdev, fault_status),
> +			source_id);
> +
> +		/* Ignore MMU interrupts on this AS until it's been
> +		 * re-enabled.
> +		 */
> +		ptdev->mmu->irq.mask = new_int_mask;
> +		gpu_write(ptdev, MMU_INT_MASK, new_int_mask);
> +
> +		/* Disable the MMU to kill jobs on this AS. */
> +		panthor_mmu_as_disable(ptdev, as);
> +		mutex_unlock(&ptdev->mmu->as.slots_lock);
> +
> +		status &= ~mask;
> +	}
> +}
> +PANTHOR_IRQ_HANDLER(mmu, MMU, panthor_mmu_irq_handler);
> +
> +/**
> + * panthor_mmu_suspend() - Suspend the MMU logic
> + * @ptdev: Device.
> + *
> + * All we do here is de-assign the AS slots on all active VMs, so things
> + * get flushed to main memory, and no further access to these VMs is
> + * possible.
> + *
> + * We also suspend the MMU IRQ.
> + */
> +void panthor_mmu_suspend(struct panthor_device *ptdev)
> +{
> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> +	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
> +		struct panthor_vm *vm = ptdev->mmu->as.slots[i].vm;
> +
> +		if (vm) {
> +			drm_WARN_ON(&ptdev->base, panthor_mmu_as_disable(ptdev, i));
> +			vm->as.id = -1;
> +			list_del_init(&vm->as.lru_node);
> +			ptdev->mmu->as.slots[i].vm = NULL;
> +		}
> +	}
> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> +
> +	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
> +}
> +
> +/**
> + * panthor_mmu_resume() - Resume the MMU logic
> + * @ptdev: Device.
> + *
> + * Resume the IRQ.
> + *
> + * We don't re-enable previously active VMs. We assume other parts of the
> + * driver will call panthor_vm_active() on the VMs they intend to use.
> + */
> +void panthor_mmu_resume(struct panthor_device *ptdev)
> +{
> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> +	ptdev->mmu->as.alloc_mask = 0;
> +	ptdev->mmu->as.faulty_mask = 0;
> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> +
> +	panthor_mmu_irq_resume(&ptdev->mmu->irq, panthor_mmu_fault_mask(ptdev, ~0));
> +}
> +
> +/**
> + * panthor_mmu_pre_reset() - Prepare for a reset
> + * @ptdev: Device.
> + *
> + * Suspend the IRQ, and make sure all VM_BIND queues are stopped, so we
> + * don't get asked to do a VM operation while the GPU is down.
> + *
> + * We don't cleanly shut down the AS slots here, because the reset might
> + * come from a stuck AS_ACTIVE_BIT situation.
> + */
> +void panthor_mmu_pre_reset(struct panthor_device *ptdev)
> +{
> +	struct panthor_vm *vm;
> +
> +	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
> +
> +	mutex_lock(&ptdev->mmu->vm.lock);
> +	ptdev->mmu->vm.reset_in_progress = true;
> +	list_for_each_entry(vm, &ptdev->mmu->vm.list, node)
> +		panthor_vm_stop(vm);
> +	mutex_unlock(&ptdev->mmu->vm.lock);
> +}
> +
> +/**
> + * panthor_mmu_post_reset() - Restore things after a reset
> + * @ptdev: Device.
> + *
> + * Put the MMU logic back in action after a reset. That implies resuming the
> + * IRQ and re-enabling the VM_BIND queues.
> + */
> +void panthor_mmu_post_reset(struct panthor_device *ptdev)
> +{
> +	struct panthor_vm *vm;
> +
> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> +
> +	/* Now that the reset is effective, we can assume that none of the
> +	 * AS slots are set up, and clear the faulty flags too.
> +	 */
> +	ptdev->mmu->as.alloc_mask = 0;
> +	ptdev->mmu->as.faulty_mask = 0;
> +
> +	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
> +		struct panthor_vm *vm = ptdev->mmu->as.slots[i].vm;
> +
> +		if (vm) {
> +			vm->as.id = -1;
> +			list_del_init(&vm->as.lru_node);
> +			ptdev->mmu->as.slots[i].vm = NULL;
> +		}
> +	}
> +
> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> +
> +	panthor_mmu_irq_resume(&ptdev->mmu->irq, panthor_mmu_fault_mask(ptdev, ~0));
> +
> +	/* Restart the VM_BIND queues. */
> +	mutex_lock(&ptdev->mmu->vm.lock);
> +	list_for_each_entry(vm, &ptdev->mmu->vm.list, node) {
> +		panthor_vm_start(vm);
> +	}
> +	ptdev->mmu->vm.reset_in_progress = false;
> +	mutex_unlock(&ptdev->mmu->vm.lock);
> +}
> +
> +static void panthor_vm_release(struct kref *kref)
> +{
> +	struct panthor_vm *vm = container_of(kref, struct panthor_vm, refcount);
> +	struct panthor_device *ptdev = vm->ptdev;
> +
> +	mutex_lock(&vm->heaps.lock);
> +	if (drm_WARN_ON(&ptdev->base, vm->heaps.pool))
> +		panthor_heap_pool_destroy(vm->heaps.pool);
> +	mutex_unlock(&vm->heaps.lock);
> +	mutex_destroy(&vm->heaps.lock);
> +
> +	mutex_lock(&ptdev->mmu->vm.lock);
> +	list_del(&vm->node);
> +	/* Restore the scheduler state so we can call drm_sched_entity_destroy()
> +	 * and drm_sched_fini(). If we get there, that means we have no job left
> +	 * and no new jobs can be queued, so we can start the scheduler without
> +	 * risking interfering with the reset.
> +	 */
> +	if (ptdev->mmu->vm.reset_in_progress)
> +		panthor_vm_start(vm);
> +	mutex_unlock(&ptdev->mmu->vm.lock);
> +
> +	drm_sched_entity_destroy(&vm->entity);
> +	drm_sched_fini(&vm->sched);
> +
> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> +	if (vm->as.id >= 0) {
> +		int cookie;
> +
> +		if (drm_dev_enter(&ptdev->base, &cookie)) {
> +			panthor_mmu_as_disable(ptdev, vm->as.id);
> +			drm_dev_exit(cookie);
> +		}
> +
> +		ptdev->mmu->as.slots[vm->as.id].vm = NULL;
> +		clear_bit(vm->as.id, &ptdev->mmu->as.alloc_mask);
> +		list_del(&vm->as.lru_node);
> +	}
> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> +
> +	drm_WARN_ON(&ptdev->base,
> +		    panthor_vm_unmap_range(vm, vm->va_mgr.mm_start, vm->va_mgr.mm_range));
> +
> +	free_io_pgtable_ops(vm->pgtbl_ops);
> +
> +	drm_mm_takedown(&vm->mm);
> +	mutex_destroy(&vm->mm_lock);
> +	drm_gpuva_manager_destroy(&vm->va_mgr);
> +	drm_gem_private_object_fini(&vm->dummy_gem);
> +	mutex_destroy(&vm->op_lock);
> +	kfree(vm);
> +}
> +
> +/**
> + * panthor_vm_put() - Release a reference on a VM
> + * @vm: VM to release the reference on. Can be NULL.
> + */
> +void panthor_vm_put(struct panthor_vm *vm)
> +{
> +	if (vm)
> +		kref_put(&vm->refcount, panthor_vm_release);
> +}
> +
> +/**
> + * panthor_vm_get() - Get a VM reference
> + * @vm: VM to get the reference on. Can be NULL.
> + *
> + * Return: @vm value.
> + */
> +struct panthor_vm *panthor_vm_get(struct panthor_vm *vm)
> +{
> +	if (vm)
> +		kref_get(&vm->refcount);
> +
> +	return vm;
> +}
> +
> +/**
> + * panthor_vm_get_heap_pool() - Get the heap pool attached to a VM
> + * @vm: VM to query the heap pool on.
> + * @create: True if the heap pool should be created when it doesn't exist.
> + *
> + * Heap pools are per-VM. This function allows one to retrieve the heap pool
> + * attached to a VM.
> + *
> + * If no heap pool exists yet, and @create is true, we create one.
> + *
> + * The returned panthor_heap_pool should be released with panthor_heap_pool_put().
> + *
> + * Return: A valid pointer on success, an ERR_PTR() otherwise.
> + */
> +struct panthor_heap_pool *panthor_vm_get_heap_pool(struct panthor_vm *vm, bool create)
> +{
> +	struct panthor_heap_pool *pool;
> +
> +	mutex_lock(&vm->heaps.lock);
> +	if (!vm->heaps.pool && create) {
> +		if (vm->destroyed)
> +			pool = ERR_PTR(-EINVAL);
> +		else
> +			pool = panthor_heap_pool_create(vm->ptdev, vm);
> +
> +		if (!IS_ERR(pool))
> +			vm->heaps.pool = panthor_heap_pool_get(pool);
> +	} else {
> +		pool = panthor_heap_pool_get(vm->heaps.pool);
> +	}
> +	mutex_unlock(&vm->heaps.lock);
> +
> +	return pool;
> +}
> +
> +static u64 mair_to_memattr(u64 mair)
> +{
> +	u64 memattr = 0;
> +	u32 i;
> +
> +	for (i = 0; i < 8; i++) {
> +		u8 in_attr = mair >> (8 * i), out_attr;
> +		u8 outer = in_attr >> 4, inner = in_attr & 0xf;
> +
> +		/* For caching to be enabled, the inner and outer caching
> +		 * policies both have to be write-back. If one of them is
> +		 * write-through or non-cacheable, we just choose
> +		 * non-cacheable. Device memory is also translated to
> +		 * non-cacheable.
> +		 */
> +		if (!(outer & 3) || !(outer & 4) || !(inner & 4)) {
> +			out_attr = AS_MEMATTR_AARCH64_INNER_OUTER_NC |
> +				   AS_MEMATTR_AARCH64_SH_MIDGARD_INNER |
> +				   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(false, false);
> +		} else {
> +			/* Use SH_CPU_INNER mode so SH_IS, which is used when
> +			 * IOMMU_CACHE is set, actually maps to the standard
> +			 * definition of inner-shareable and not Mali's
> +			 * internal-shareable mode.
> +			 */
> +			out_attr = AS_MEMATTR_AARCH64_INNER_OUTER_WB |
> +				   AS_MEMATTR_AARCH64_SH_CPU_INNER |
> +				   AS_MEMATTR_AARCH64_INNER_ALLOC_EXPL(inner & 1, inner & 2);
> +		}
> +
> +		memattr |= (u64)out_attr << (8 * i);
> +	}
> +
> +	return memattr;
> +}
> +
> +static void panthor_vma_link(struct panthor_vm *vm, struct panthor_vma *vma)
> +{
> +	struct panthor_gem_object *bo = to_panthor_bo(vma->base.gem.obj);
> +
> +	mutex_lock(&bo->gpuva_list_lock);
> +	drm_gpuva_link(&vma->base);
> +	mutex_unlock(&bo->gpuva_list_lock);
> +
> +	if (!bo->exclusive_vm)
> +		list_add_tail(&vma->node, &vm->shared_bos);
> +}
> +
> +static void panthor_vma_unlink(struct panthor_vm_op_ctx *op_ctx,
> +			       struct panthor_vma *vma)
> +{
> +	struct panthor_gem_object *bo = to_panthor_bo(vma->base.gem.obj);
> +
> +	mutex_lock(&bo->gpuva_list_lock);
> +	drm_gpuva_unlink(&vma->base);
> +	mutex_unlock(&bo->gpuva_list_lock);
> +
> +	list_move_tail(&vma->node, &op_ctx->returned_vmas);
> +}
> +
> +static void panthor_vma_init(struct panthor_vma *vma,
> +			     struct drm_gem_object *obj,
> +			     u64 offset,
> +			     u64 va, u64 range, u32 flags)
> +{
> +	INIT_LIST_HEAD(&vma->node);
> +	vma->flags = flags;
> +	vma->base.gem.obj = obj;
> +	vma->base.gem.offset = offset;
> +	vma->base.va.addr = va;
> +	vma->base.va.range = range;
> +}
> +
> +#define PANTHOR_VM_MAP_FLAGS \
> +	(DRM_PANTHOR_VM_BIND_OP_MAP_READONLY | \
> +	 DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC | \
> +	 DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED)
> +
> +static int panthor_gpuva_sm_step_map(struct drm_gpuva_op *op, void *priv)
> +{
> +	struct panthor_vm *vm = priv;
> +	struct panthor_vm_op_ctx *op_ctx = vm->op_ctx;
> +	struct panthor_vma *vma = op_ctx->map.new_vma;
> +	int ret;
> +
> +	panthor_vma_init(vma, op->map.gem.obj, op->map.gem.offset, op->map.va.addr,
> +			 op->map.va.range, op_ctx->flags & PANTHOR_VM_MAP_FLAGS);
> +
> +	ret = panthor_vm_map_pages(vm, vma->base.va.addr, flags_to_prot(vma->flags),
> +				   op_ctx->map.sgt, vma->base.gem.offset,
> +				   vma->base.va.range);
> +	if (ret)
> +		return ret;
> +
> +	/* Ref owned by the mapping now, clear the obj field so we don't release the
> +	 * pinning/obj ref behind GPUVA's back.
> +	 */
> +	drm_gpuva_map(&vm->va_mgr, &vma->base, &op->map);
> +	panthor_vma_link(vm, op_ctx->map.new_vma);
> +	op_ctx->map.gem.obj = NULL;
> +	op_ctx->map.new_vma = NULL;
> +	return 0;
> +}
> +
> +static int panthor_gpuva_sm_step_remap(struct drm_gpuva_op *op,
> +				       void *priv)
> +{
> +	struct panthor_vma *unmap_vma = container_of(op->remap.unmap->va, struct panthor_vma, base);
> +	const u64 va_start = op->remap.prev ?
> +			     op->remap.prev->va.addr + op->remap.prev->va.range :
> +			     op->remap.unmap->va->va.addr;
> +	const u64 va_end = op->remap.next ?
> +			   op->remap.next->va.addr :
> +			   op->remap.unmap->va->va.addr + op->remap.unmap->va->va.range;
> +	struct panthor_vm *vm = priv;
> +	struct panthor_vm_op_ctx *op_ctx = vm->op_ctx;
> +	struct drm_gpuva *prev_va = NULL, *next_va = NULL;
> +	int ret;
> +
> +	ret = panthor_vm_unmap_pages(vm, va_start, va_end - va_start);
> +	if (ret)
> +		return ret;
> +
> +	if (op->remap.prev) {
> +		struct panthor_gem_object *bo = to_panthor_bo(op->remap.prev->gem.obj);
> +
> +		if (!bo->base.base.import_attach) {
> +			ret = drm_gem_shmem_pin(&bo->base);
> +			if (drm_WARN_ON(&vm->ptdev->base, ret))
> +				return ret;
> +		}
> +
> +		panthor_vma_init(op_ctx->map.prev_vma,
> +				 op->remap.prev->gem.obj,
> +				 op->remap.prev->gem.offset,
> +				 op->remap.prev->va.addr,
> +				 op->remap.prev->va.range,
> +				 unmap_vma->flags);
> +		prev_va = &op_ctx->map.prev_vma->base;
> +	}
> +
> +	if (op->remap.next) {
> +		struct panthor_gem_object *bo = to_panthor_bo(op->remap.next->gem.obj);
> +
> +		if (!bo->base.base.import_attach) {
> +			ret = drm_gem_shmem_pin(&bo->base);
> +			if (drm_WARN_ON(&vm->ptdev->base, ret))
> +				return ret;
> +		}
> +
> +		panthor_vma_init(op_ctx->map.next_vma,
> +				 op->remap.next->gem.obj,
> +				 op->remap.next->gem.offset,
> +				 op->remap.next->va.addr,
> +				 op->remap.next->va.range,
> +				 unmap_vma->flags);
> +		next_va = &op_ctx->map.next_vma->base;
> +	}
> +
> +	drm_gpuva_remap(prev_va, next_va, &op->remap);
> +
> +	if (prev_va) {
> +		drm_gem_object_get(prev_va->gem.obj);
> +		panthor_vma_link(vm, op_ctx->map.prev_vma);
> +		op_ctx->map.prev_vma = NULL;
> +	}
> +
> +	if (next_va) {
> +		drm_gem_object_get(next_va->gem.obj);
> +		panthor_vma_link(vm, op_ctx->map.next_vma);
> +		op_ctx->map.next_vma = NULL;
> +	}
> +
> +	panthor_vma_unlink(op_ctx, unmap_vma);
> +	return 0;
> +}
> +
> +static int panthor_gpuva_sm_step_unmap(struct drm_gpuva_op *op,
> +				       void *priv)
> +{
> +	struct panthor_vma *unmap_vma = container_of(op->unmap.va, struct panthor_vma, base);
> +	struct panthor_vm *vm = priv;
> +	struct panthor_vm_op_ctx *op_ctx = vm->op_ctx;
> +	int ret;
> +
> +	ret = panthor_vm_unmap_pages(vm, unmap_vma->base.va.addr,
> +				     unmap_vma->base.va.range);
> +	if (drm_WARN_ON(&vm->ptdev->base, ret))
> +		return ret;
> +
> +	drm_gpuva_unmap(&op->unmap);
> +	panthor_vma_unlink(op_ctx, unmap_vma);
> +	return 0;
> +}
> +
> +static const struct drm_gpuva_fn_ops panthor_gpuva_ops = {
> +	.sm_step_map = panthor_gpuva_sm_step_map,
> +	.sm_step_remap = panthor_gpuva_sm_step_remap,
> +	.sm_step_unmap = panthor_gpuva_sm_step_unmap,
> +};
> +
> +/**
> + * panthor_vm_resv() - Get the dma_resv object attached to a VM.
> + * @vm: VM to get the dma_resv of.
> + *
> + * Return: A dma_resv object.
> + */
> +struct dma_resv *panthor_vm_resv(struct panthor_vm *vm)
> +{
> +	return vm->dummy_gem.resv;
> +}
> +
> +static int
> +panthor_vm_exec_op(struct panthor_vm *vm, struct panthor_vm_op_ctx *op,
> +		   bool flag_vm_unusable_on_failure)
> +{
> +	int ret;
> +
> +	mutex_lock(&vm->op_lock);
> +	vm->op_ctx = op;
> +	switch (op->flags & DRM_PANTHOR_VM_BIND_OP_TYPE_MASK) {
> +	case DRM_PANTHOR_VM_BIND_OP_TYPE_MAP:
> +		if (vm->unusable) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		ret = drm_gpuva_sm_map(&vm->va_mgr, vm, op->va.addr, op->va.range,
> +				       op->map.gem.obj, op->map.gem.offset);
> +		break;
> +
> +	case DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP:
> +		ret = drm_gpuva_sm_unmap(&vm->va_mgr, vm, op->va.addr, op->va.range);
> +		break;
> +
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	if (ret && flag_vm_unusable_on_failure)
> +		vm->unusable = true;
> +
> +	vm->op_ctx = NULL;
> +	mutex_unlock(&vm->op_lock);
> +
> +	return ret;
> +}
> +
> +static struct dma_fence *
> +panthor_vm_bind_run_job(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
> +	bool cookie;
> +	int ret;
> +
> +	/* Not only do we report an error whose result is propagated to the
> +	 * drm_sched finished fence, but we also flag the VM as unusable, because
> +	 * a failure in the async VM_BIND results in an inconsistent state: the VM
> +	 * needs to be destroyed and recreated.
> +	 */
> +	cookie = dma_fence_begin_signalling();
> +	ret = panthor_vm_exec_op(job->vm, &job->ctx, true);
> +	dma_fence_end_signalling(cookie);
> +
> +	return ret ? ERR_PTR(ret) : NULL;
> +}
> +
> +static void panthor_vm_bind_job_release(struct kref *kref)
> +{
> +	struct panthor_vm_bind_job *job = container_of(kref, struct panthor_vm_bind_job, refcount);
> +
> +	if (job->base.s_fence)
> +		drm_sched_job_cleanup(&job->base);
> +
> +	panthor_vm_cleanup_op_ctx(&job->ctx, job->vm);
> +	panthor_vm_put(job->vm);
> +	kfree(job);
> +}
> +
> +/**
> + * panthor_vm_bind_job_put() - Release a VM_BIND job reference
> + * @sched_job: Job to release the reference on.
> + */
> +void panthor_vm_bind_job_put(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_vm_bind_job *job =
> +		container_of(sched_job, struct panthor_vm_bind_job, base);
> +
> +	if (sched_job)
> +		kref_put(&job->refcount, panthor_vm_bind_job_release);
> +}
> +
> +static void
> +panthor_vm_bind_free_job(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_vm_bind_job *job =
> +		container_of(sched_job, struct panthor_vm_bind_job, base);
> +
> +	drm_sched_job_cleanup(sched_job);
> +
> +	/* Do the heavy cleanups asynchronously, so we're out of the
> +	 * dma-signaling path and can acquire dma-resv locks safely.
> +	 */
> +	queue_work(panthor_cleanup_wq, &job->cleanup_op_ctx_work);
> +}
> +
> +static enum drm_gpu_sched_stat
> +panthor_vm_bind_timedout_job(struct drm_sched_job *sched_job)
> +{
> +	WARN(1, "VM_BIND ops are synchronous for now, there should be no timeout!");
> +	return DRM_GPU_SCHED_STAT_NOMINAL;
> +}
> +
> +static const struct drm_sched_backend_ops panthor_vm_bind_ops = {
> +	.run_job = panthor_vm_bind_run_job,
> +	.free_job = panthor_vm_bind_free_job,
> +	.timedout_job = panthor_vm_bind_timedout_job,
> +};
> +
> +/**
> + * panthor_vm_create() - Create a VM
> + * @ptdev: Device.
> + * @for_mcu: True if this is the FW MCU VM.
> + * @auto_va_start: Start of the auto-VA range.
> + * @auto_va_range: Size of the auto-VA range.
> + *
> + * Return: A valid pointer on success, an ERR_PTR() otherwise.
> + */
> +struct panthor_vm *
> +panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
> +		  u64 auto_va_start, u64 auto_va_range)
> +{
> +	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
> +	u32 pa_bits = GPU_MMU_FEATURES_PA_BITS(ptdev->gpu_info.mmu_features);
> +	struct drm_gpu_scheduler *sched;
> +	struct io_pgtable_cfg pgtbl_cfg;
> +	u64 mair, min_va, va_range;
> +	struct panthor_vm *vm;
> +	int ret;
> +
> +	vm = kzalloc(sizeof(*vm), GFP_KERNEL);
> +	if (!vm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mutex_init(&vm->heaps.lock);
> +	kref_init(&vm->refcount);
> +	drm_gem_private_object_init(&ptdev->base, &vm->dummy_gem, 0);
> +	vm->for_mcu = for_mcu;
> +	vm->ptdev = ptdev;
> +	INIT_LIST_HEAD(&vm->shared_bos);
> +	mutex_init(&vm->op_lock);
> +
> +	if (for_mcu) {
> +		/* The CSF MCU is a Cortex-M7 and can only address 4G. */
> +		min_va = 0;
> +		va_range = SZ_4G;
> +	} else {
> +		min_va = 0;
> +		va_range = (1ull << va_bits);
> +
> +		/* If the auto_va_range is zero, we reserve half of the VA
> +		 * space for kernel stuff.
> +		 */
> +		if (!auto_va_range) {
> +			auto_va_range = va_range / 2;
> +			auto_va_start = va_range - auto_va_range;
> +		}
> +	}
> +
> +	mutex_init(&vm->mm_lock);
> +	drm_mm_init(&vm->mm, auto_va_start, auto_va_range);
> +
> +	/* We intentionally set the reserved range to zero, because we want kernel VMAs
> +	 * to be handled the same way user VMAs are.
> +	 */
> +	drm_gpuva_manager_init(&vm->va_mgr,
> +			       for_mcu ? "panthor-MCU-VA-manager" : "panthor-GPU-VA-manager",
> +			       min_va, va_range, 0, 0,
> +			       &panthor_gpuva_ops);
> +	INIT_LIST_HEAD(&vm->node);
> +	INIT_LIST_HEAD(&vm->as.lru_node);
> +	vm->as.id = -1;
> +
> +	pgtbl_cfg = (struct io_pgtable_cfg) {
> +		.pgsize_bitmap	= SZ_4K | SZ_2M,
> +		.ias		= va_bits,
> +		.oas		= pa_bits,
> +		.coherent_walk	= ptdev->coherent,
> +		.tlb		= &mmu_tlb_ops,
> +		.iommu_dev	= ptdev->base.dev,
> +		.alloc		= alloc_pt,
> +		.free		= free_pt,
> +	};
> +
> +	vm->pgtbl_ops = alloc_io_pgtable_ops(ARM_64_LPAE_S1, &pgtbl_cfg, vm);
> +	if (!vm->pgtbl_ops) {
> +		ret = -EINVAL;
> +		goto err_gpuva_destroy;
> +	}
> +
> +	/* Bind operations are synchronous for now, no timeout needed. */
> +	ret = drm_sched_init(&vm->sched, &panthor_vm_bind_ops, ptdev->mmu->vm.wq, 1, 0,
> +			     MAX_SCHEDULE_TIMEOUT, NULL, NULL,
> +			     "panthor-vm-bind", DRM_SCHED_POLICY_SINGLE_ENTITY,
> +			     ptdev->base.dev);
> +	if (ret)
> +		goto err_free_io_pgtable;
> +
> +	sched = &vm->sched;
> +	ret = drm_sched_entity_init(&vm->entity, DRM_SCHED_PRIORITY_NORMAL,
> +				    &sched, 1, NULL);
> +	if (ret)
> +		goto err_sched_fini;
> +
> +	mair = io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg.arm_lpae_s1_cfg.mair;
> +	vm->memattr = mair_to_memattr(mair);
> +
> +	mutex_lock(&ptdev->mmu->vm.lock);
> +	list_add_tail(&vm->node, &ptdev->mmu->vm.list);
> +
> +	/* If a reset is in progress, stop the scheduler. */
> +	if (ptdev->mmu->vm.reset_in_progress)
> +		panthor_vm_stop(vm);
> +	mutex_unlock(&ptdev->mmu->vm.lock);
> +
> +	return vm;
> +
> +err_sched_fini:
> +	drm_sched_fini(&vm->sched);
> +
> +err_free_io_pgtable:
> +	free_io_pgtable_ops(vm->pgtbl_ops);
> +
> +err_gpuva_destroy:
> +	drm_mm_takedown(&vm->mm);
> +	drm_gpuva_manager_destroy(&vm->va_mgr);
> +	drm_gem_private_object_fini(&vm->dummy_gem);
> +	kfree(vm);
> +
> +	return ERR_PTR(ret);
> +}
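
As a quick illustration of the auto-VA defaulting above, assuming the MMU
reports a 48-bit VA space (just an example value):

	va_range      = 1ull << 48;               /* 256 TiB of GPU VA */
	auto_va_range = va_range / 2;             /* upper half, 128 TiB */
	auto_va_start = va_range - auto_va_range; /* 0x0000800000000000 */
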
> +
> +static int
> +panthor_vm_bind_prepare_op_ctx(struct drm_file *file,
> +			       struct panthor_vm *vm,
> +			       const struct drm_panthor_vm_bind_op *op,
> +			       struct panthor_vm_op_ctx *op_ctx)
> +{
> +	struct drm_gem_object *gem;
> +	int ret;
> +
> +	/* Aligned on page size. */
> +	if ((op->va | op->size) & ~PAGE_MASK)
> +		return -EINVAL;
> +
> +	switch (op->flags & DRM_PANTHOR_VM_BIND_OP_TYPE_MASK) {
> +	case DRM_PANTHOR_VM_BIND_OP_TYPE_MAP:
> +		gem = drm_gem_object_lookup(file, op->bo_handle);
> +		ret = panthor_vm_prepare_map_op_ctx(op_ctx, vm,
> +						    gem ? to_panthor_bo(gem) : NULL,
> +						    op->bo_offset,
> +						    op->size,
> +						    op->va,
> +						    op->flags);
> +		drm_gem_object_put(gem);
> +		return ret;
> +
> +	case DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP:
> +		return panthor_vm_prepare_unmap_op_ctx(op_ctx, vm, op->va, op->size);
> +
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static void panthor_vm_bind_job_cleanup_op_ctx_work(struct work_struct *work)
> +{
> +	struct panthor_vm_bind_job *job =
> +		container_of(work, struct panthor_vm_bind_job, cleanup_op_ctx_work);
> +
> +	panthor_vm_cleanup_op_ctx(&job->ctx, job->vm);
> +	panthor_vm_bind_job_put(&job->base);
> +}
> +
> +/**
> + * panthor_vm_bind_job_create() - Create a VM_BIND job
> + * @file: File.
> + * @vm: VM targeted by the VM_BIND job.
> + * @op: VM operation data.
> + *
> + * Return: A valid pointer on success, an ERR_PTR() otherwise.
> + */
> +struct drm_sched_job *
> +panthor_vm_bind_job_create(struct drm_file *file,
> +			   struct panthor_vm *vm,
> +			   const struct drm_panthor_vm_bind_op *op)
> +{
> +	struct panthor_vm_bind_job *job;
> +	int ret;
> +
> +	if (!vm)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (vm->destroyed || vm->unusable)
> +		return ERR_PTR(-EINVAL);
> +
> +	job = kzalloc(sizeof(*job), GFP_KERNEL);
> +	if (!job)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_WORK(&job->cleanup_op_ctx_work, panthor_vm_bind_job_cleanup_op_ctx_work);
> +	kref_init(&job->refcount);
> +	job->vm = panthor_vm_get(vm);
> +
> +	ret = panthor_vm_bind_prepare_op_ctx(file, vm, op, &job->ctx);
> +	if (ret)
> +		goto err_put_job;
> +
> +	ret = drm_sched_job_init(&job->base, &vm->entity, vm);
> +	if (ret)
> +		goto err_put_job;
> +
> +	return &job->base;
> +
> +err_put_job:
> +	panthor_vm_bind_job_put(&job->base);
> +	return ERR_PTR(ret);
> +}
> +
> +/**
> + * panthor_vm_bind_job_prepare_resvs() - Prepare VM_BIND job dma_resvs
> + * @exec: The locking/preparation context.
> + * @sched_job: The job to prepare resvs on.
> + *
> + * Locks and prepare the VM resv.
> + *
> + * If this is a map operation, locks and prepares the GEM resv.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_bind_job_prepare_resvs(struct drm_exec *exec,
> +				      struct drm_sched_job *sched_job)
> +{
> +	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
> +	int ret;
> +
> +	/* Acquire the VM lock and reserve a slot for this VM bind job. */
> +	ret = drm_exec_prepare_obj(exec, &job->vm->dummy_gem, 1);
> +	if (ret)
> +		return ret;
> +
> +	if (job->ctx.map.gem.obj) {
> +		/* Lock/prepare the GEM being mapped. */
> +		ret = drm_exec_prepare_obj(exec, job->ctx.map.gem.obj, 1);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_vm_bind_job_add_resvs_deps() - Add implicit deps to the VM_BIND job
> + * @sched_job: Job to add implicit deps on.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_bind_job_add_resvs_deps(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
> +	int ret;
> +
> +	/* We use explicit fencing, so no need to wait for anything else but
> +	 * DMA_RESV_USAGE_KERNEL fences on the VM and the BO being mapped. If
> +	 * there are extra dependencies, they should be passed to the VM_BIND ioctl.
> +	 */
> +	ret = drm_sched_job_add_resv_dependencies(sched_job,
> +						  job->vm->dummy_gem.resv,
> +						  DMA_RESV_USAGE_KERNEL);
> +	if (ret)
> +		return ret;
> +
> +	if (job->ctx.map.gem.obj) {
> +		ret = drm_sched_job_add_resv_dependencies(sched_job,
> +							  job->ctx.map.gem.obj->resv,
> +							  DMA_RESV_USAGE_KERNEL);
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * panthor_vm_bind_job_update_resvs() - Update the resv objects touched by a job
> + * @sched_job: Job to update the resvs on.
> + */
> +void panthor_vm_bind_job_update_resvs(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_vm_bind_job *job = container_of(sched_job, struct panthor_vm_bind_job, base);
> +
> +	/* Explicit sync => we just register our job finished fence as bookkeep. */
> +	dma_resv_add_fence(job->vm->dummy_gem.resv,
> +			   &sched_job->s_fence->finished,
> +			   DMA_RESV_USAGE_BOOKKEEP);
> +
> +	if (job->ctx.map.gem.obj) {
> +		dma_resv_add_fence(job->ctx.map.gem.obj->resv,
> +				   &sched_job->s_fence->finished,
> +				   DMA_RESV_USAGE_BOOKKEEP);
> +	}
> +}
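
To make the expected calling order of the panthor_vm_bind_job_* helpers above
more concrete, here is a rough sketch of an async submission path. The real
ioctl handler lives in the driver frontend patch and also deals with the user
sync operations; the drm_exec and drm_sched calls below are assumptions based
on the dependency branches, not code from this series:

	static int panthor_vm_bind_async_sketch(struct drm_file *file,
						struct panthor_vm *vm,
						const struct drm_panthor_vm_bind_op *op)
	{
		struct drm_sched_job *job;
		struct drm_exec exec;
		int ret;

		job = panthor_vm_bind_job_create(file, vm, op);
		if (IS_ERR(job))
			return PTR_ERR(job);

		drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT);
		drm_exec_until_all_locked(&exec) {
			/* Lock the VM resv (and the GEM resv for map ops) and
			 * reserve a fence slot on each.
			 */
			ret = panthor_vm_bind_job_prepare_resvs(&exec, job);
			drm_exec_retry_on_contention(&exec);
			if (ret)
				goto err_exec_fini;
		}

		/* Wait for in-flight kernel operations on the VM/GEM resvs. */
		ret = panthor_vm_bind_job_add_resvs_deps(job);
		if (ret)
			goto err_exec_fini;

		drm_sched_job_arm(job);

		/* Publish the finished fence as a bookkeep fence while the
		 * resvs are still locked, then hand the job to the scheduler.
		 */
		panthor_vm_bind_job_update_resvs(job);
		drm_sched_entity_push_job(job);
		drm_exec_fini(&exec);

		/* The initial job reference is released from the free_job
		 * path (see panthor_vm_bind_free_job() above), so no put here
		 * on success.
		 */
		return 0;

	err_exec_fini:
		drm_exec_fini(&exec);
		panthor_vm_bind_job_put(job);
		return ret;
	}
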
> +
> +/**
> + * panthor_vm_bind_exec_sync_op() - Execute a VM_BIND operation synchronously.
> + * @file: File.
> + * @vm: VM targeted by the VM operation.
> + * @op: Data describing the VM operation.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_bind_exec_sync_op(struct drm_file *file,
> +				 struct panthor_vm *vm,
> +				 struct drm_panthor_vm_bind_op *op)
> +{
> +	struct panthor_vm_op_ctx op_ctx;
> +	int ret;
> +
> +	/* No sync objects allowed on synchronous operations. */
> +	if (op->syncs.count)
> +		return -EINVAL;
> +
> +	if (!op->size)
> +		return 0;
> +
> +	ret = panthor_vm_bind_prepare_op_ctx(file, vm, op, &op_ctx);
> +	if (ret)
> +		return ret;
> +
> +	ret = panthor_vm_exec_op(vm, &op_ctx, false);
> +	panthor_vm_cleanup_op_ctx(&op_ctx, vm);
> +
> +	return ret;
> +}
> +
> +/**
> + * panthor_vm_map_bo_range() - Map a GEM object range to a VM
> + * @vm: VM to map the GEM to.
> + * @bo: GEM object to map.
> + * @offset: Offset in the GEM object.
> + * @size: Size to map.
> + * @va: Virtual address to map the object to.
> + * @flags: Combination of drm_panthor_vm_bind_op_flags flags.
> + * Only map-related flags are valid.
> + *
> + * Internal use only. For userspace requests, use
> + * panthor_vm_bind_exec_sync_op() instead.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_map_bo_range(struct panthor_vm *vm, struct panthor_gem_object *bo,
> +			    u64 offset, size_t size, u64 va, u32 flags)
> +{
> +	struct panthor_vm_op_ctx op_ctx;
> +	int ret;
> +
> +	ret = panthor_vm_prepare_map_op_ctx(&op_ctx, vm, bo, offset, size, va, flags);
> +	if (ret)
> +		return ret;
> +
> +	ret = panthor_vm_exec_op(vm, &op_ctx, false);
> +	panthor_vm_cleanup_op_ctx(&op_ctx, vm);
> +
> +	return ret;
> +}
> +
> +/**
> + * panthor_vm_unmap_range() - Unmap a portion of the VA space
> + * @vm: VM to unmap the region from.
> + * @va: Virtual address to unmap. Must be 4k aligned.
> + * @size: Size of the region to unmap. Must be 4k aligned.
> + *
> + * Internal use only. For userspace requests, use
> + * panthor_vm_bind_exec_sync_op() instead.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_unmap_range(struct panthor_vm *vm, u64 va, size_t size)
> +{
> +	struct panthor_vm_op_ctx op_ctx;
> +	int ret;
> +
> +	ret = panthor_vm_prepare_unmap_op_ctx(&op_ctx, vm, va, size);
> +	if (ret)
> +		return ret;
> +
> +	ret = panthor_vm_exec_op(vm, &op_ctx, false);
> +	panthor_vm_cleanup_op_ctx(&op_ctx, vm);
> +
> +	return ret;
> +}
> +
> +/**
> + * panthor_vm_prepare_mapped_bos_resvs() - Prepare resvs on VM BOs.
> + * @exec: Locking/preparation context.
> + * @vm: VM targeted by the GPU job.
> + *
> + * GPU jobs assume all BOs bound to the VM at the time the job is submitted
> + * are available when the job is executed. In order to guarantee that, we
> + * need to reserve a slot on all BOs mapped to a VM and update this slot with
> + * the job fence after its submission.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm *vm)
> +{
> +	struct panthor_vma *vma;
> +	int ret;
> +
> +	/* Acquire the VM lock and reserve a slot for this GPU job. */
> +	ret = drm_exec_prepare_obj(exec, &vm->dummy_gem, 1);
> +	if (ret)
> +		return ret;
> +
> +	list_for_each_entry(vma, &vm->shared_bos, node) {
> +		ret = drm_exec_prepare_obj(exec, vma->base.gem.obj, 1);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_vm_add_bos_resvs_deps_to_job() - Add implicit VM deps to a GPU job
> + * @vm: VM targeted by the GPU job.
> + * @job: GPU job.
> + *
> + * We just take care of kernel access. Other accesses should be passed as
> + * explicit dependencies to the job.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_vm_add_bos_resvs_deps_to_job(struct panthor_vm *vm,
> +					 struct drm_sched_job *job)
> +{
> +	struct panthor_vma *vma;
> +	int ret;
> +
> +	/* We use explicit fencing, so no need to wait for anything else but
> +	 * DMA_RESV_USAGE_KERNEL fences on the VM and the BOs mapped to it. If
> +	 * there are extra dependencies, they should be passed to the job as
> +	 * explicit dependencies.
> +	 */
> +	ret = drm_sched_job_add_resv_dependencies(job,
> +						  vm->dummy_gem.resv,
> +						  DMA_RESV_USAGE_KERNEL);
> +	if (ret)
> +		return ret;
> +
> +	list_for_each_entry(vma, &vm->shared_bos, node) {
> +		ret = drm_sched_job_add_resv_dependencies(job,
> +							  vma->base.gem.obj->resv,
> +							  DMA_RESV_USAGE_KERNEL);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_vm_add_job_fence_to_bos_resvs() - Add GPU job fence to GEM resvs
> + * @vm: VM targeted by the GPU job.
> + * @job: GPU job.
> + *
> + * Update the GEM resvs after a job has been submitted. All GEMs currently
> + * bound to the VMs get the job fence added to their resv as bookkeep. If
> + * another type of implicit dependency is needed, it should be updated
> + * with %DMA_BUF_IOCTL_IMPORT_SYNC_FILE after the
> + * %DRM_IOCTL_PANTHOR_GROUP_SUBMIT ioctl has returned.
> + */
> +void panthor_vm_add_job_fence_to_bos_resvs(struct panthor_vm *vm,
> +					   struct drm_sched_job *job)
> +{
> +	struct panthor_vma *vma;
> +
> +	/* Explicit sync => we just register our job finished fence as bookkeep. */
> +	dma_resv_add_fence(vm->dummy_gem.resv,
> +			   &job->s_fence->finished,
> +			   DMA_RESV_USAGE_BOOKKEEP);
> +
> +	list_for_each_entry(vma, &vm->shared_bos, node) {
> +		dma_resv_add_fence(vma->base.gem.obj->resv,
> +				   &job->s_fence->finished,
> +				   DMA_RESV_USAGE_BOOKKEEP);
> +	}
> +}
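
The three helpers above are meant to be chained from the GPU-job submission
path. Roughly, and only as a sketch (here `job` stands for the GPU job object
from the scheduler block, which is not part of this patch, and error handling
is elided):

	/* 1. Lock the VM resv and the resvs of all BOs mapped to it, and
	 *    reserve a fence slot on each.
	 */
	drm_exec_until_all_locked(&exec) {
		ret = panthor_vm_prepare_mapped_bos_resvs(&exec, vm);
		drm_exec_retry_on_contention(&exec);
	}

	/* 2. Wait for DMA_RESV_USAGE_KERNEL fences on the VM and its BOs. */
	ret = panthor_vm_add_bos_resvs_deps_to_job(vm, &job->base);

	/* 3. After drm_sched_job_arm(), publish the job finished fence as a
	 *    bookkeep fence on the VM and all mapped BOs.
	 */
	panthor_vm_add_job_fence_to_bos_resvs(vm, &job->base);
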
> +
> +/**
> + * panthor_mmu_unplug() - Unplug the MMU logic
> + * @ptdev: Device.
> + *
> + * No access to the MMU regs should be done after this function is called.
> + * We suspend the IRQ and disable all VMs to guarantee that.
> + */
> +void panthor_mmu_unplug(struct panthor_device *ptdev)
> +{
> +	if (ptdev->mmu->irq.irq > 0)

In what situation is this not true? AFAICT the driver probe will fail if
the IRQ can't be obtained.

Steve

> +		panthor_mmu_irq_suspend(&ptdev->mmu->irq);
> +
> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> +	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
> +		struct panthor_vm *vm = ptdev->mmu->as.slots[i].vm;
> +
> +		if (vm) {
> +			drm_WARN_ON(&ptdev->base, panthor_mmu_as_disable(ptdev, i));
> +			vm->as.id = -1;
> +			list_del_init(&vm->as.lru_node);
> +			clear_bit(i, &ptdev->mmu->as.alloc_mask);
> +			ptdev->mmu->as.slots[i].vm = NULL;
> +		}
> +	}
> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> +}
> +
> +static void panthor_mmu_release_wq(struct drm_device *ddev, void *res)
> +{
> +	destroy_workqueue(res);
> +}
> +
> +/**
> + * panthor_mmu_init() - Initialize the MMU logic.
> + * @ptdev: Device.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_mmu_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_mmu *mmu;
> +	int ret, irq;
> +
> +	mmu = drmm_kzalloc(&ptdev->base, sizeof(*mmu), GFP_KERNEL);
> +	if (!mmu)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&mmu->as.lru_list);
> +
> +	for (u32 i = 0; i < ARRAY_SIZE(mmu->as.slots); i++)
> +		spin_lock_init(&mmu->as.slots[i].lock);
> +
> +	drmm_mutex_init(&ptdev->base, &mmu->as.slots_lock);
> +	INIT_LIST_HEAD(&mmu->vm.list);
> +	drmm_mutex_init(&ptdev->base, &mmu->vm.lock);
> +
> +	ptdev->mmu = mmu;
> +
> +	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "mmu");
> +	if (irq <= 0)
> +		return -ENODEV;
> +
> +	ret = panthor_request_mmu_irq(ptdev, &mmu->irq, irq,
> +				      panthor_mmu_fault_mask(ptdev, ~0));
> +	if (ret)
> +		return ret;
> +
> +	mmu->vm.wq = alloc_workqueue("panthor-vm-bind", WQ_UNBOUND, 0);
> +	if (!mmu->vm.wq)
> +		return -ENOMEM;
> +
> +	return drmm_add_action_or_reset(&ptdev->base, panthor_mmu_release_wq, mmu->vm.wq);
> +}
> +
> +#ifdef CONFIG_DEBUG_FS
> +static int show_vm_gpuvas(struct panthor_vm *vm, struct seq_file *m)
> +{
> +	int ret;
> +
> +	mutex_lock(&vm->op_lock);
> +	ret = drm_debugfs_gpuva_info(m, &vm->va_mgr);
> +	mutex_unlock(&vm->op_lock);
> +
> +	return ret;
> +}
> +
> +static int show_each_vm(struct seq_file *m, void *arg)
> +{
> +	struct drm_info_node *node = (struct drm_info_node *)m->private;
> +	struct drm_device *ddev = node->minor->dev;
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	int (*show)(struct panthor_vm *, struct seq_file *) = node->info_ent->data;
> +	struct panthor_vm *vm;
> +	int ret = 0;
> +
> +	mutex_lock(&ptdev->mmu->vm.lock);
> +	list_for_each_entry(vm, &ptdev->mmu->vm.list, node) {
> +		ret = show(vm, m);
> +		if (ret < 0)
> +			break;
> +
> +		seq_puts(m, "\n");
> +	}
> +	mutex_unlock(&ptdev->mmu->vm.lock);
> +
> +	return ret;
> +}
> +
> +static struct drm_info_list panthor_mmu_debugfs_list[] = {
> +	DRM_DEBUGFS_GPUVA_INFO(show_each_vm, show_vm_gpuvas),
> +};
> +
> +/**
> + * panthor_mmu_debugfs_init() - Initialize MMU debugfs entries
> + * @minor: Minor.
> + */
> +void panthor_mmu_debugfs_init(struct drm_minor *minor)
> +{
> +	drm_debugfs_create_files(panthor_mmu_debugfs_list,
> +				 ARRAY_SIZE(panthor_mmu_debugfs_list),
> +				 minor->debugfs_root, minor);
> +}
> +#endif /* CONFIG_DEBUG_FS */
> +
> +/**
> + * panthor_mmu_pt_cache_init() - Initialize the page table cache.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_mmu_pt_cache_init(void)
> +{
> +	pt_cache = kmem_cache_create("panthor-mmu-pt", SZ_4K, SZ_4K, 0, NULL);
> +	if (!pt_cache)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_mmu_pt_cache_fini() - Destroy the page table cache.
> + */
> +void panthor_mmu_pt_cache_fini(void)
> +{
> +	kmem_cache_destroy(pt_cache);
> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.h b/drivers/gpu/drm/panthor/panthor_mmu.h
> new file mode 100644
> index 000000000000..d94925ccdc8c
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.h
> @@ -0,0 +1,81 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#ifndef __PANTHOR_MMU_H__
> +#define __PANTHOR_MMU_H__
> +
> +struct drm_exec;
> +struct drm_sched_job;
> +struct panthor_gem_object;
> +struct panthor_heap_pool;
> +struct panthor_vm;
> +struct panthor_vma;
> +struct panthor_mmu;
> +
> +int panthor_mmu_init(struct panthor_device *ptdev);
> +void panthor_mmu_unplug(struct panthor_device *ptdev);
> +void panthor_mmu_pre_reset(struct panthor_device *ptdev);
> +void panthor_mmu_post_reset(struct panthor_device *ptdev);
> +void panthor_mmu_suspend(struct panthor_device *ptdev);
> +void panthor_mmu_resume(struct panthor_device *ptdev);
> +
> +int panthor_vm_map_bo_range(struct panthor_vm *vm, struct panthor_gem_object *bo,
> +			    u64 offset, size_t size, u64 va, u32 flags);
> +int panthor_vm_unmap_range(struct panthor_vm *vm, u64 va, size_t size);
> +struct panthor_gem_object *
> +panthor_vm_get_bo_for_va(struct panthor_vm *vm, u64 va, u64 *bo_offset);
> +
> +int panthor_vm_active(struct panthor_vm *vm);
> +void panthor_vm_idle(struct panthor_vm *vm);
> +int panthor_vm_as(struct panthor_vm *vm);
> +
> +struct panthor_heap_pool *
> +panthor_vm_get_heap_pool(struct panthor_vm *vm, bool create);
> +
> +struct panthor_vm *panthor_vm_get(struct panthor_vm *vm);
> +void panthor_vm_put(struct panthor_vm *vm);
> +struct panthor_vm *panthor_vm_create(struct panthor_device *ptdev, bool for_mcu,
> +				     u64 auto_va_start, u64 auto_va_range);
> +
> +int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec,
> +					struct panthor_vm *vm);
> +int panthor_vm_add_bos_resvs_deps_to_job(struct panthor_vm *vm,
> +					 struct drm_sched_job *job);
> +void panthor_vm_add_job_fence_to_bos_resvs(struct panthor_vm *vm,
> +					   struct drm_sched_job *job);
> +
> +struct dma_resv *panthor_vm_resv(struct panthor_vm *vm);
> +
> +void panthor_vm_pool_destroy(struct panthor_file *pfile);
> +int panthor_vm_pool_create(struct panthor_file *pfile);
> +int panthor_vm_pool_create_vm(struct panthor_device *ptdev, struct panthor_vm_pool *pool,
> +			      u64 kernel_va_start, u64 kernel_va_range);
> +int panthor_vm_pool_destroy_vm(struct panthor_vm_pool *pool, u32 handle);
> +struct panthor_vm *panthor_vm_pool_get_vm(struct panthor_vm_pool *pool, u32 handle);
> +
> +struct drm_mm_node *panthor_vm_alloc_va(struct panthor_vm *vm, size_t size);
> +void panthor_vm_free_va(struct panthor_vm *vm, struct drm_mm_node *mm_node);
> +
> +int panthor_vm_bind_exec_sync_op(struct drm_file *file,
> +				 struct panthor_vm *vm,
> +				 struct drm_panthor_vm_bind_op *op);
> +
> +struct drm_sched_job *
> +panthor_vm_bind_job_create(struct drm_file *file,
> +			   struct panthor_vm *vm,
> +			   const struct drm_panthor_vm_bind_op *op);
> +void panthor_vm_bind_job_put(struct drm_sched_job *job);
> +int panthor_vm_bind_job_prepare_resvs(struct drm_exec *exec,
> +				      struct drm_sched_job *job);
> +int panthor_vm_bind_job_add_resvs_deps(struct drm_sched_job *job);
> +void panthor_vm_bind_job_update_resvs(struct drm_sched_job *job);
> +
> +int panthor_mmu_pt_cache_init(void);
> +void panthor_mmu_pt_cache_fini(void);
> +
> +#ifdef CONFIG_DEBUG_FS
> +void panthor_mmu_debugfs_init(struct drm_minor *minor);
> +#endif
> +
> +#endif


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 09/15] drm/panthor: Add the FW logical block
  2023-08-09 16:53 ` [PATCH v2 09/15] drm/panthor: Add the FW " Boris Brezillon
@ 2023-08-16 16:01   ` Steven Price
  2023-08-29 16:15     ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-16 16:01 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Contains everything that's FW related: the code dealing with the
> microcontroller unit (MCU) that runs the FW, and anything related to
> allocating memory shared between the FW and the CPU.
> 
> A few global FW events are processed in the IRQ handler, the rest is
> forwarded to the scheduler, since scheduling is the primary reason for
> the FW existence, and also the main source of FW <-> kernel
> interactions.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Rename the file (_mcu -> _fw)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Document the code
> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> - Use the panthor_irq layer to manage/process IRQs
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_fw.c | 1417 ++++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_fw.h |  505 +++++++++
>  2 files changed, 1922 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> new file mode 100644
> index 000000000000..359a68f7af03
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -0,0 +1,1417 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <linux/clk.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/firmware.h>
> +#include <linux/iopoll.h>
> +#include <linux/iosys-map.h>
> +#include <linux/mutex.h>
> +#include <linux/platform_device.h>
> +
> +#include <drm/drm_drv.h>
> +#include <drm/drm_managed.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_gem.h"
> +#include "panthor_gpu.h"
> +#include "panthor_regs.h"
> +#include "panthor_fw.h"
> +#include "panthor_mmu.h"
> +#include "panthor_sched.h"
> +
> +#define CSF_FW_NAME "mali_csffw.bin"
> +
> +#define PING_INTERVAL_MS			12000
> +#define PROGRESS_TIMEOUT_CYCLES			(5ull * 500 * 1024 * 1024)
> +#define PROGRESS_TIMEOUT_SCALE_SHIFT		10
> +#define IDLE_HYSTERESIS_US			800
> +#define PWROFF_HYSTERESIS_US			10000
> +
> +/**
> + * struct panthor_fw_mem - FW memory
> + */
> +struct panthor_fw_mem {
> +	/** @bo: Buffer object backing the FW memory. */
> +	struct panthor_gem_object *bo;
> +
> +	/** @kmap: Kernel CPU mapping of the FW memory. */
> +	void *kmap;
> +
> +	/** @va: MCU mapping of the FW memory. */
> +	u64 va;
> +};
> +
> +/**
> + * struct panthor_fw_binary_hdr - Firmware binary header.
> + */
> +struct panthor_fw_binary_hdr {
> +	/** @magic: Magic value to check binary validity. */
> +	u32 magic;
> +#define CSF_FW_BINARY_HEADER_MAGIC		0xc3f13a6e
> +
> +	/** @minor: Minor FW version. */
> +	u8 minor;
> +
> +	/** @major: Major FW version. */
> +	u8 major;
> +#define CSF_FW_BINARY_HEADER_MAJOR_MAX		0
> +
> +	/** @padding1: MBZ. */
> +	u16 padding1;
> +
> +	/** @version_hash: FW version hash. */
> +	u32 version_hash;
> +
> +	/** @padding2: MBZ. */
> +	u32 padding2;
> +
> +	/** @size: FW binary size. */
> +	u32 size;
> +};
> +
> +/**
> + * enum panthor_fw_binary_entry_type - Firmware binary entry type
> + */
> +enum panthor_fw_binary_entry_type {
> +	/** @CSF_FW_BINARY_ENTRY_TYPE_IFACE: Host <-> FW interface. */
> +	CSF_FW_BINARY_ENTRY_TYPE_IFACE = 0,
> +
> +	/** @CSF_FW_BINARY_ENTRY_TYPE_CONFIG: FW config. */
> +	CSF_FW_BINARY_ENTRY_TYPE_CONFIG = 1,
> +
> +	/** @CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST: Unit-tests. */
> +	CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST = 2,
> +
> +	/** @CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER: Trace buffer interface. */
> +	CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER = 3,
> +
> +	/** @CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA: Timeline metadata interface. */
> +	CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA = 4,
> +};
> +
> +#define CSF_FW_BINARY_ENTRY_TYPE(ehdr)					((ehdr) & 0xff)
> +#define CSF_FW_BINARY_ENTRY_SIZE(ehdr)					(((ehdr) >> 8) & 0xff)
> +#define CSF_FW_BINARY_ENTRY_UPDATE					BIT(30)
> +#define CSF_FW_BINARY_ENTRY_OPTIONAL					BIT(31)
> +
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_RD					BIT(0)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_WR					BIT(1)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_EX					BIT(2)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_NONE			(0 << 3)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED			(1 << 3)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_UNCACHED_COHERENT	(2 << 3)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED_COHERENT		(3 << 3)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK			GENMASK(4, 3)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_PROT				BIT(5)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED				BIT(30)
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO				BIT(31)
> +
> +#define CSF_FW_BINARY_IFACE_ENTRY_RD_SUPPORTED_FLAGS			\
> +	(CSF_FW_BINARY_IFACE_ENTRY_RD_RD |				\
> +	 CSF_FW_BINARY_IFACE_ENTRY_RD_WR |				\
> +	 CSF_FW_BINARY_IFACE_ENTRY_RD_EX |				\
> +	 CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK |			\
> +	 CSF_FW_BINARY_IFACE_ENTRY_RD_PROT |				\
> +	 CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED  |				\
> +	 CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO)
> +
> +/**
> + * struct panthor_fw_binary_section_entry_hdr - Describes a section of FW binary
> + */
> +struct panthor_fw_binary_section_entry_hdr {
> +	/** @flags: Section flags. */
> +	u32 flags;
> +
> +	/** @va: MCU virtual range to map this binary section to. */
> +	struct {
> +		/** @start: Start address. */
> +		u32 start;
> +
> +		/** @end: End address. */
> +		u32 end;
> +	} va;
> +
> +	/** @data: Data to initialize the FW section with. */
> +	struct {
> +		/** @start: Start offset in the FW binary. */
> +		u32 start;
> +
> +		/** @end: End offset in the FW binary. */
> +		u32 end;
> +	} data;
> +};
> +
> +/**
> + * struct panthor_fw_binary_iter - Firmware binary iterator
> + *
> + * Used to parse a firmware binary.
> + */
> +struct panthor_fw_binary_iter {
> +	/** @data: FW binary data. */
> +	const void *data;
> +
> +	/** @size: FW binary size. */
> +	size_t size;
> +
> +	/** @offset: Iterator offset. */
> +	size_t offset;
> +};
> +
> +/**
> + * struct panthor_fw_section - FW section
> + */
> +struct panthor_fw_section {
> +	/** @node: Used to keep track of FW sections. */
> +	struct list_head node;
> +
> +	/** @flags: Section flags, as encoded in the FW binary. */
> +	u32 flags;
> +
> +	/** @mem: Section memory. */
> +	struct panthor_fw_mem *mem;
> +
> +	/**
> +	 * @name: Name of the section, as specified in the binary.
> +	 *
> +	 * Can be NULL.
> +	 */
> +	const char *name;
> +
> +	/**
> +	 * @data: Initial data copied to the FW memory.
> +	 *
> +	 * We keep data around so we can reload sections after a reset.
> +	 */
> +	struct {
> +		/** @buf: Buffer used to store init data. */
> +		const void *buf;
> +
> +		/** @size: Size of @buf in bytes. */
> +		size_t size;
> +	} data;
> +};
> +
> +#define CSF_MCU_SHARED_REGION_START		0x04000000ULL
> +#define CSF_MCU_SHARED_REGION_SIZE		0x04000000ULL
> +
> +#define MIN_CS_PER_CSG				8
> +#define MIN_CSGS				3
> +#define MAX_CSG_PRIO				0xf
> +
> +#define CSF_IFACE_VERSION(major, minor, patch)	\
> +	(((major) << 24) | ((minor) << 16) | (patch))
> +#define CSF_IFACE_VERSION_MAJOR(v)		((v) >> 24)
> +#define CSF_IFACE_VERSION_MINOR(v)		(((v) >> 16) & 0xff)
> +#define CSF_IFACE_VERSION_PATCH(v)		((v) & 0xffff)
> +
> +#define CSF_GROUP_CONTROL_OFFSET		0x1000
> +#define CSF_STREAM_CONTROL_OFFSET		0x40
> +#define CSF_UNPRESERVED_REG_COUNT		4
> +
> +/**
> + * struct panthor_fw_iface - FW interfaces
> + */
> +struct panthor_fw_iface {
> +	/** @global: Global interface. */
> +	struct panthor_fw_global_iface global;
> +
> +	/** @groups: Group slot interfaces. */
> +	struct panthor_fw_csg_iface groups[MAX_CSGS];
> +
> +	/** @streams: Command stream slot interfaces. */
> +	struct panthor_fw_cs_iface streams[MAX_CSGS][MAX_CS_PER_CSG];
> +};
> +
> +/**
> + * struct panthor_fw - Firmware management
> + */
> +struct panthor_fw {
> +	/** @vm: MCU VM. */
> +	struct panthor_vm *vm;
> +
> +	/** @sections: List of FW sections. */
> +	struct list_head sections;
> +
> +	/** @shared_section: The section containing the FW interfaces. */
> +	struct panthor_fw_section *shared_section;
> +
> +	/** @iface: FW interfaces. */
> +	struct panthor_fw_iface iface;
> +
> +	/** @watchdog: Collection of fields relating to the FW watchdog. */
> +	struct {
> +		/** @ping_work: Delayed work used to ping the FW. */
> +		struct delayed_work ping_work;
> +	} watchdog;
> +
> +	/**
> +	 * @waitqueues: Request waitqueues.
> +	 *
> +	 * Every time a request is sent to a command stream group or the global
> +	 * interface, the caller will first busy-wait for the request to be
> +	 * acknowledged, and then fall back to a sleeping wait.
> +	 *
> +	 * Those wait queues are here to support the sleeping wait flavor.
> +	 *
> +	 * Entry 31 is the global waitqueue, the other ones are the command
> +	 * stream group slot waitqueues.
> +	 */
> +	wait_queue_head_t waitqueues[32];
> +
> +	/** @booted: True if the FW is booted. */
> +	bool booted;
> +
> +	/**
> +	 * @fast_reset: True if the post_reset logic can proceed with a fast reset.
> +	 *
> +	 * A fast reset is just a reset where the driver doesn't reload the FW sections.
> +	 *
> +	 * Any time the firmware is properly suspended, a fast reset can take place.
> +	 * On the other hand, if the halt operation failed, the driver will reload
> +	 * all sections to make sure we start from a fresh state.
> +	 */
> +	bool fast_reset;
> +
> +	/** @irq: Job irq data. */
> +	struct panthor_irq irq;
> +};
> +
> +/**
> + * panthor_fw_get_glb_iface() - Get the global interface
> + * @ptdev: Device.
> + *
> + * Return: The global interface.
> + */
> +struct panthor_fw_global_iface *
> +panthor_fw_get_glb_iface(struct panthor_device *ptdev)
> +{
> +	return &ptdev->fw->iface.global;
> +}
> +
> +/**
> + * panthor_fw_get_csg_iface() - Get a command stream group slot interface
> + * @ptdev: Device.
> + * @csg_slot: Index of the command stream group slot.
> + *
> + * Return: The command stream group slot interface.
> + */
> +struct panthor_fw_csg_iface *
> +panthor_fw_get_csg_iface(struct panthor_device *ptdev, u32 csg_slot)
> +{
> +	if (drm_WARN_ON(&ptdev->base, csg_slot >= MAX_CSGS))
> +		return NULL;
> +
> +	return &ptdev->fw->iface.groups[csg_slot];
> +}
> +
> +/**
> + * panthor_fw_get_cs_iface() - Get a command stream slot interface
> + * @ptdev: Device.
> + * @csg_slot: Index of the command stream group slot.
> + * @cs_slot: Index of the command stream slot.
> + *
> + * Return: The command stream slot interface.
> + */
> +struct panthor_fw_cs_iface *
> +panthor_fw_get_cs_iface(struct panthor_device *ptdev, u32 csg_slot, u32 cs_slot)
> +{
> +	if (drm_WARN_ON(&ptdev->base, csg_slot >= MAX_CSGS || cs_slot >= MAX_CS_PER_CSG))
> +		return NULL;
> +
> +	return &ptdev->fw->iface.streams[csg_slot][cs_slot];
> +}
> +
> +/**
> + * panthor_fw_conv_timeout() - Convert a timeout into a cycle-count
> + * @ptdev: Device.
> + * @timeout_us: Timeout expressed in micro-seconds.
> + *
> + * The FW has two timer sources: the GPU counter or the arch timer. We need
> + * to express timeouts in terms of number of cycles and specify which
> + * timer source should be used.
> + *
> + * Return: A value suitable for timeout fields in the global interface.
> + */
> +static u32 panthor_fw_conv_timeout(struct panthor_device *ptdev, u32 timeout_us)
> +{
> +	bool use_cycle_counter = false;
> +	u32 timer_rate = 0;
> +	u64 cycles;
> +
> +#ifdef CONFIG_ARM_ARCH_TIMER
> +	timer_rate = arch_timer_get_cntfrq();
> +#endif
> +
> +	if (!timer_rate) {
> +		use_cycle_counter = true;
> +		timer_rate = clk_get_rate(ptdev->clks.core);
> +	}
> +
> +	if (drm_WARN_ON(&ptdev->base, !timer_rate)) {
> +		/* We couldn't get a valid clock rate, let's just pick the
> +		 * maximum value so the FW still handles the core
> +		 * power on/off requests.
> +		 */
> +		return GLB_TIMER_VAL(0x7fffffff) |

NIT: This feels like a magic number that could be included in the
header. Or it could be rewritten as GLB_TIMER_VAL(~0) to more clearly
represent 'maximum'.

> +		       GLB_TIMER_SOURCE_GPU_COUNTER;
> +	}
> +
> +	cycles = DIV_ROUND_UP_ULL((u64)timeout_us * timer_rate, 1000000);
> +	return GLB_TIMER_VAL(cycles >> 10) |

NIT: This isn't quite as ideal as it could be. The round up is done
before the shift. Plus it's technically possible to overflow the 31 bits
available (although that requires a several minute timeout and the
fastest possible clock).

I'd be tempted to rewrite as:

	mod_cycles = DIV_ROUND_UP_ULL((u64)timeout_us * timer_rate,
				      1000000 << 10);

I'm not sure if the theoretical overflow is worth considering, but it
can be handled as:

	if (drm_WARN_ON(&ptdev->base, mod_cycles >= (1 << 31)))
		mod_cycles = (1 << 31) - 1;

or following the style I suggested above:

	if (drm_WARN_ON(&ptdev->base, mod_cycles > GLB_TIMER_VAL(~0)))
		mod_cycles = GLB_TIMER_VAL(~0);

> +	       (use_cycle_counter ? GLB_TIMER_SOURCE_GPU_COUNTER : 0);
> +}
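
For reference, a worked example of the conversion above, assuming a 24 MHz
arch timer (the rate is only an example value):

	timeout_us = 10000;	/* PWROFF_HYSTERESIS_US */
	cycles = DIV_ROUND_UP_ULL(10000ull * 24000000, 1000000);	/* 240000 */
	/* GLB_TIMER_VAL(240000 >> 10) == GLB_TIMER_VAL(234), arch-timer source */

Note that the final shift truncates, which is the rounding point raised above.
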
> +
> +static int panthor_fw_binary_iter_read(struct panthor_device *ptdev,
> +				       struct panthor_fw_binary_iter *iter,
> +				       void *out, size_t size)
> +{
> +	size_t new_offset = iter->offset + size;
> +
> +	if (new_offset > iter->size || new_offset < iter->offset) {
> +		drm_err(&ptdev->base, "Firmware too small\n");
> +		return -EINVAL;
> +	}
> +
> +	memcpy(out, iter->data + iter->offset, size);
> +	iter->offset = new_offset;
> +	return 0;
> +}
> +
> +static void panthor_fw_init_section_mem(struct panthor_device *ptdev,
> +					struct panthor_fw_section *section)
> +{
> +	bool was_mapped = !!section->mem->kmap;
> +	void *kmap;
> +
> +	if (!section->data.size &&
> +	    !(section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO))
> +		return;
> +
> +	kmap = panthor_fw_mem_vmap(section->mem);
> +	if (drm_WARN_ON(&ptdev->base, !kmap))
> +		return;
> +
> +	memcpy(kmap, section->data.buf, section->data.size);
> +	if (section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO) {
> +		memset(kmap + section->data.size, 0,
> +		       section->mem->bo->base.base.size - section->data.size);
> +	}
> +
> +	if (!was_mapped)
> +		panthor_fw_mem_vunmap(section->mem);
> +}
> +
> +/**
> + * panthor_fw_mem_va() - Get the MCU address of a FW memory object.
> + * @mem: FW memory object.
> + *
> + * Return: The MCU virtual address of the FW memory object.
> + */
> +u64 panthor_fw_mem_va(struct panthor_fw_mem *mem)
> +{
> +	return mem->va;
> +}
> +
> +/**
> + * panthor_fw_mem_vunmap() - Kill kernel space mapping of a FW memory object
> + * @mem: FW memory object.
> + */
> +void panthor_fw_mem_vunmap(struct panthor_fw_mem *mem)
> +{
> +	if (mem->kmap) {
> +		struct iosys_map map = IOSYS_MAP_INIT_VADDR(mem->kmap);
> +
> +		drm_gem_vunmap_unlocked(&mem->bo->base.base, &map);
> +		mem->kmap = NULL;
> +	}
> +}
> +
> +/**
> + * panthor_fw_mem_vmap() - Map a FW memory object in kernel space
> + * @mem: FW memory object.
> + *
> + * Return: a non-NULL pointer on success, NULL otherwise.
> + */
> +void *panthor_fw_mem_vmap(struct panthor_fw_mem *mem)
> +{
> +	if (!mem->kmap) {
> +		struct iosys_map map;
> +		int ret;
> +
> +		ret = drm_gem_vmap_unlocked(&mem->bo->base.base, &map);
> +		if (ret)
> +			return NULL;
> +
> +		mem->kmap = map.vaddr;
> +	}
> +
> +	return mem->kmap;
> +}
> +
> +/**
> + * panthor_fw_mem_free() - Free a FW memory object.
> + * @ptdev: Device.
> + * @mem: FW memory object to free.
> + */
> +void panthor_fw_mem_free(struct panthor_device *ptdev, struct panthor_fw_mem *mem)
> +{
> +	if (IS_ERR_OR_NULL(mem))
> +		return;
> +
> +	if (mem->bo)
> +		panthor_gem_unmap_and_put(ptdev->fw->vm, mem->bo, mem->va, mem->kmap);
> +
> +	kfree(mem);
> +}
> +
> +/**
> + * panthor_fw_mem_alloc() - Allocate a FW memory object and map it to the MCU VM.
> + * @ptdev: Device.
> + * @size: Size of the memory block.
> + * @bo_flags: BO flags.
> + * @vm_map_flags: VM_MAP flags.
> + * @va: Virtual address of the MCU mapping.
> + * Set to PANTHOR_GEM_ALLOC_VA for automatic VA-assignment. In that case, the
> + * VA will be allocated in the shared VA space.
> + *
> + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
> + */
> +static struct panthor_fw_mem *
> +panthor_fw_mem_alloc(struct panthor_device *ptdev, size_t size,
> +		     u32 bo_flags, u32 vm_map_flags, u64 va)
> +{
> +	struct panthor_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> +	int ret;
> +
> +	if (!mem)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mem->bo = panthor_gem_create_and_map(ptdev, ptdev->fw->vm,
> +					     size, bo_flags, vm_map_flags,
> +					     &va, NULL);
> +	if (IS_ERR(mem->bo)) {
> +		ret = PTR_ERR(mem->bo);
> +		mem->bo = NULL;
> +		goto err_free_mem;
> +	}
> +
> +	mem->va = va;
> +	return mem;
> +
> +err_free_mem:
> +	panthor_fw_mem_free(ptdev, mem);
> +	return ERR_PTR(ret);

The error handling seems more complex than needed. How about:

	struct panthor_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
	struct panthor_gem_object *bo;
	int ret;

	if (!mem)
		return ERR_PTR(-ENOMEM);

	bo = panthor_gem_create_and_map(ptdev, ptdev->fw->vm,
					size, bo_flags, vm_map_flags,
					&va, NULL);

	if (IS_ERR(bo)) {
		kfree(mem);
		return ERR_CAST(bo);
	}

	mem->bo = bo;
	mem->va = va;
	return mem;
	
Which I think also means we don't need the "if (mem->bo)" case in
panthor_fw_mem_free().

> +}
> +
> +/**
> + * panthor_fw_alloc_queue_iface_mem() - Allocate the ring-buffer interfaces.
> + * @ptdev: Device.
> + * @input: Pointer holding the input interface on success.
> + * Should be ignored on failure.
> + * @output: Pointer holding the output interface on success.
> + * Should be ignored on failure.
> + *
> + * Allocates panthor_fw_ringbuf_{input,output}_iface interfaces. The input
> + * interface is at offset 0, and the output interface at offset 4096.
> + *
> + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
> + */
> +struct panthor_fw_mem *
> +panthor_fw_alloc_queue_iface_mem(struct panthor_device *ptdev,
> +				 struct panthor_fw_ringbuf_input_iface **input,
> +				 const struct panthor_fw_ringbuf_output_iface **output)
> +{
> +	struct panthor_fw_mem *mem;
> +	void *kmap;
> +
> +	mem = panthor_fw_mem_alloc(ptdev, 8192,
> +				   DRM_PANTHOR_BO_NO_MMAP,
> +				   DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> +				   DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> +				   PANTHOR_GEM_ALLOC_VA);
> +	if (IS_ERR(mem))
> +		return mem;
> +
> +	kmap = panthor_fw_mem_vmap(mem);
> +	if (!kmap) {
> +		panthor_fw_mem_free(ptdev, mem);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	memset(kmap, 0, mem->bo->base.base.size);
> +	*input = kmap;
> +	*output = kmap + 4096;
> +	return mem;
> +}
> +
> +/**
> + * panthor_fw_alloc_suspend_buf_mem() - Allocate a suspend buffer for a command stream group.
> + * @ptdev: Device.
> + * @size: Size of the suspend buffer.
> + *
> + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
> + */
> +struct panthor_fw_mem *
> +panthor_fw_alloc_suspend_buf_mem(struct panthor_device *ptdev, size_t size)
> +{
> +	return panthor_fw_mem_alloc(ptdev, size,
> +				    DRM_PANTHOR_BO_NO_MMAP,
> +				    DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
> +				    PANTHOR_GEM_ALLOC_VA);
> +}
> +
> +static int panthor_fw_load_section_entry(struct panthor_device *ptdev,
> +					 const struct firmware *fw,
> +					 struct panthor_fw_binary_iter *iter,
> +					 u32 ehdr)
> +{
> +	struct panthor_fw_binary_section_entry_hdr hdr;
> +	struct panthor_fw_section *section;
> +	u32 section_size;
> +	u32 name_len;
> +	int ret;
> +
> +	ret = panthor_fw_binary_iter_read(ptdev, iter, &hdr, sizeof(hdr));
> +	if (ret)
> +		return ret;
> +
> +	if (hdr.data.end < hdr.data.start) {
> +		drm_err(&ptdev->base, "Firmware corrupted, data.end < data.start (0x%x < 0x%x)\n",
> +			hdr.data.end, hdr.data.start);
> +		return -EINVAL;
> +	}
> +
> +	if (hdr.va.end < hdr.va.start) {
> +		drm_err(&ptdev->base, "Firmware corrupted, hdr.va.end < hdr.va.start (0x%x < 0x%x)\n",
> +			hdr.va.end, hdr.va.start);
> +		return -EINVAL;
> +	}
> +
> +	if (hdr.data.end > fw->size) {
> +		drm_err(&ptdev->base, "Firmware corrupted, file truncated? data_end=0x%x > fw size=0x%zx\n",
> +			hdr.data.end, fw->size);
> +		return -EINVAL;
> +	}
> +
> +	if ((hdr.va.start & ~PAGE_MASK) != 0 ||
> +	    (hdr.va.end & ~PAGE_MASK) != 0) {
> +		drm_err(&ptdev->base, "Firmware corrupted, virtual addresses not page aligned: 0x%x-0x%x\n",
> +			hdr.va.start, hdr.va.end);
> +		return -EINVAL;
> +	}
> +
> +	if (hdr.flags & ~CSF_FW_BINARY_IFACE_ENTRY_RD_SUPPORTED_FLAGS) {
> +		drm_err(&ptdev->base, "Firmware contains interface with unsupported flags (0x%x)\n",
> +			hdr.flags);
> +		return -EINVAL;
> +	}
> +
> +	if (hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_PROT) {
> +		drm_warn(&ptdev->base,
> +			 "Firmware protected mode entry not supported, ignoring");
> +		return 0;
> +	}
> +
> +	if (hdr.va.start == CSF_MCU_SHARED_REGION_START &&
> +	    !(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED)) {
> +		drm_err(&ptdev->base,
> +			"Interface at 0x%llx must be shared", CSF_MCU_SHARED_REGION_START);
> +		return -EINVAL;
> +	}
> +
> +	name_len = iter->size - iter->offset;
> +
> +	section = drmm_kzalloc(&ptdev->base, sizeof(*section), GFP_KERNEL);
> +	if (!section)
> +		return -ENOMEM;
> +
> +	list_add_tail(&section->node, &ptdev->fw->sections);
> +	section->flags = hdr.flags;
> +	section->data.size = hdr.data.end - hdr.data.start;
> +
> +	if (section->data.size > 0) {
> +		void *data = drmm_kmalloc(&ptdev->base, section->data.size, GFP_KERNEL);
> +
> +		if (!data)
> +			return -ENOMEM;
> +
> +		memcpy(data, fw->data + hdr.data.start, section->data.size);
> +		section->data.buf = data;
> +	}
> +
> +	if (name_len > 0) {
> +		char *name = drmm_kmalloc(&ptdev->base, name_len + 1, GFP_KERNEL);
> +
> +		if (!name)
> +			return -ENOMEM;
> +
> +		memcpy(name, iter->data + iter->offset, name_len);
> +		name[name_len] = '\0';
> +		section->name = name;
> +	}
> +
> +	section_size = hdr.va.end - hdr.va.start;
> +	if (section_size) {
> +		u32 cache_mode = hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK;
> +		u32 vm_map_flags = 0;
> +		struct sg_table *sgt;
> +		u64 va = hdr.va.start;
> +
> +		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
> +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_READONLY;
> +
> +		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_EX))
> +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC;
> +
> +		/* TODO: CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_*_COHERENT are mapped to
> +		 * non-cacheable for now. We might want to introduce a new
> +		 * IOMMU_xxx flag (or abuse IOMMU_MMIO, which maps to device
> +		 * memory and is currently not used by our driver) for
> +		 * AS_MEMATTR_AARCH64_SHARED memory, so we can take benefit
> +		 * of IO-coherent systems.
> +		 */
> +		if (cache_mode != CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED)
> +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED;
> +
> +		/* Shared section is in the auto-VA range. We need to
> +		 * reserve the VA range so it's not allocated to someone else.
> +		 */
> +		if (va >= CSF_MCU_SHARED_REGION_START &&
> +		    va < CSF_MCU_SHARED_REGION_START + CSF_MCU_SHARED_REGION_SIZE)
> +			va = PANTHOR_GEM_ALLOC_VA;
> +
> +		section->mem = panthor_fw_mem_alloc(ptdev, section_size,
> +						    DRM_PANTHOR_BO_NO_MMAP,
> +						    vm_map_flags, va);
> +		if (IS_ERR(section->mem))
> +			return PTR_ERR(section->mem);
> +
> +		if (drm_WARN_ON(&ptdev->base, section->mem->va != hdr.va.start))
> +			return -EINVAL;
> +
> +		panthor_fw_init_section_mem(ptdev, section);
> +
> +		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
> +		if (IS_ERR(sgt))
> +			return PTR_ERR(sgt);
> +
> +		dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
> +
> +		if (section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED) {
> +			if (!panthor_fw_mem_vmap(section->mem))

Moving this before panthor_fw_init_section_mem() would avoid an
unnecessary unmap/remap - although this isn't exactly a performance path...

> +				return -ENOMEM;
> +		}
> +	}
> +
> +	if (hdr.va.start == CSF_MCU_SHARED_REGION_START)
> +		ptdev->fw->shared_section = section;
> +
> +	return 0;
> +}
> +
> +static void
> +panthor_reload_fw_sections(struct panthor_device *ptdev, bool full_reload)
> +{
> +	struct panthor_fw_section *section;
> +
> +	list_for_each_entry(section, &ptdev->fw->sections, node) {
> +		struct sg_table *sgt;
> +
> +		if (!full_reload && !(section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
> +			continue;
> +
> +		panthor_fw_init_section_mem(ptdev, section);
> +		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
> +		if (!drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(sgt)))
> +			dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
> +	}
> +}
> +
> +static int panthor_fw_load_entry(struct panthor_device *ptdev,
> +				 const struct firmware *fw,
> +				 struct panthor_fw_binary_iter *iter)
> +{
> +	struct panthor_fw_binary_iter eiter;
> +	u32 ehdr;
> +	int ret;
> +
> +	ret = panthor_fw_binary_iter_read(ptdev, iter, &ehdr, sizeof(ehdr));
> +	if (ret)
> +		return ret;
> +
> +	if ((iter->offset % sizeof(u32)) ||
> +	    (CSF_FW_BINARY_ENTRY_SIZE(ehdr) % sizeof(u32))) {
> +		drm_err(&ptdev->base, "Firmware entry isn't 32 bit aligned, offset=0x%x size=0x%x\n",
> +			(u32)(iter->offset - sizeof(u32)), CSF_FW_BINARY_ENTRY_SIZE(ehdr));
> +		return -EINVAL;
> +	}
> +
> +	eiter.offset = 0;
> +	eiter.data = iter->data + iter->offset;
> +	eiter.size = CSF_FW_BINARY_ENTRY_SIZE(ehdr) - sizeof(ehdr);
> +	iter->offset += eiter.size;

There should really be a check like:

	if (iter->offset < eiter.size)
		return -EINVAL;

otherwise I think it's possible for a corrupt firmware to cause us to
run off the end of the buffer. Ideally the check would look something
more like the one in panthor_fw_binary_iter_read() (dealing with
potential overflow). I'm wondering if it makes sense to allow
panthor_fw_binary_iter_read() with a NULL 'out' and check the return
value. That way we can replace "iter->offset += eiter.size" with:

	ret = panthor_fw_binary_iter_read(ptdev, iter, NULL,
					  eiter.size);
	if (ret)
		return ret;

(or have a new _skip() function)
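
For reference, a minimal _skip() could look something like this (untested
sketch; the error message is just a placeholder):

	static int panthor_fw_binary_iter_skip(struct panthor_device *ptdev,
					       struct panthor_fw_binary_iter *iter,
					       u32 size)
	{
		size_t new_offset = iter->offset + size;

		/* Reject truncated/corrupt entries and guard against overflow. */
		if (new_offset > iter->size || new_offset < iter->offset) {
			drm_err(&ptdev->base, "Firmware truncated\n");
			return -EINVAL;
		}

		iter->offset = new_offset;
		return 0;
	}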

> +
> +	switch (CSF_FW_BINARY_ENTRY_TYPE(ehdr)) {
> +	case CSF_FW_BINARY_ENTRY_TYPE_IFACE:
> +		return panthor_fw_load_section_entry(ptdev, fw, &eiter, ehdr);
> +
> +	/* FIXME: handle those entry types? */
> +	case CSF_FW_BINARY_ENTRY_TYPE_CONFIG:
> +	case CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST:
> +	case CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER:
> +	case CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA:
> +		return 0;
> +	default:
> +		break;
> +	}
> +
> +	if (ehdr & CSF_FW_BINARY_ENTRY_OPTIONAL)
> +		return 0;
> +
> +	drm_err(&ptdev->base,
> +		"Unsupported non-optional entry type %u in firmware\n",
> +		CSF_FW_BINARY_ENTRY_TYPE(ehdr));
> +	return -EINVAL;
> +}
> +
> +static int panthor_fw_load(struct panthor_device *ptdev)
> +{
> +	const struct firmware *fw = NULL;
> +	struct panthor_fw_binary_iter iter = {};
> +	struct panthor_fw_binary_hdr hdr;
> +	int ret;
> +
> +	ret = request_firmware(&fw, CSF_FW_NAME, ptdev->base.dev);
> +	if (ret) {
> +		drm_err(&ptdev->base, "Failed to load firmware image '%s'\n",
> +			CSF_FW_NAME);
> +		return ret;
> +	}
> +
> +	iter.data = fw->data;
> +	iter.size = fw->size;
> +	ret = panthor_fw_binary_iter_read(ptdev, &iter, &hdr, sizeof(hdr));
> +	if (ret)
> +		goto out;
> +
> +	if (hdr.magic != CSF_FW_BINARY_HEADER_MAGIC) {
> +		ret = -EINVAL;
> +		drm_err(&ptdev->base, "Invalid firmware magic\n");
> +		goto out;
> +	}
> +
> +	if (hdr.major != CSF_FW_BINARY_HEADER_MAJOR_MAX) {
> +		ret = -EINVAL;
> +		drm_err(&ptdev->base, "Unsupported firmware binary header version %d.%d (expected %d.x)\n",
> +			hdr.major, hdr.minor, CSF_FW_BINARY_HEADER_MAJOR_MAX);
> +		goto out;
> +	}
> +
> +	if (hdr.size > iter.size) {
> +		drm_err(&ptdev->base, "Firmware image is truncated\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	iter.size = hdr.size;
> +
> +	while (iter.offset < hdr.size) {
> +		ret = panthor_fw_load_entry(ptdev, fw, &iter);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	if (!ptdev->fw->shared_section) {
> +		drm_err(&ptdev->base, "Shared interface region not found\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +out:
> +	release_firmware(fw);
> +	return ret;
> +}
> +
> +/**
> + * iface_fw_to_cpu_addr() - Turn an MCU address into a CPU address
> + * @ptdev: Device.
> + * @mcu_va: MCU address.
> + *
> + * Return: NULL if the address is not part of the shared section, non-NULL otherwise.
> + */
> +static void *iface_fw_to_cpu_addr(struct panthor_device *ptdev, u32 mcu_va)
> +{
> +	u64 shared_mem_start = ptdev->fw->shared_section->mem->va;
> +	u64 shared_mem_end = ptdev->fw->shared_section->mem->va +
> +			     ptdev->fw->shared_section->mem->bo->base.base.size;
> +	if (mcu_va < shared_mem_start || mcu_va >= shared_mem_end)
> +		return NULL;
> +
> +	return ptdev->fw->shared_section->mem->kmap + (mcu_va - shared_mem_start);
> +}
> +
> +static int panthor_init_cs_iface(struct panthor_device *ptdev,
> +				 unsigned int csg_idx, unsigned int cs_idx)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, csg_idx);
> +	struct panthor_fw_cs_iface *cs_iface = &ptdev->fw->iface.streams[csg_idx][cs_idx];
> +	u64 shared_section_sz = ptdev->fw->shared_section->mem->bo->base.base.size;
> +	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET +
> +			   (csg_idx * glb_iface->control->group_stride) +
> +			   CSF_STREAM_CONTROL_OFFSET +
> +			   (cs_idx * csg_iface->control->stream_stride);
> +
> +	if (iface_offset + sizeof(*cs_iface) >= shared_section_sz)
> +		return -EINVAL;
> +
> +	spin_lock_init(&cs_iface->lock);
> +	cs_iface->control = ptdev->fw->shared_section->mem->kmap + iface_offset;
> +	cs_iface->input = iface_fw_to_cpu_addr(ptdev, cs_iface->control->input_va);
> +	cs_iface->output = iface_fw_to_cpu_addr(ptdev, cs_iface->control->output_va);
> +
> +	if (!cs_iface->input || !cs_iface->output) {
> +		drm_err(&ptdev->base, "Invalid stream control interface input/output VA");
> +		return -EINVAL;
> +	}
> +
> +	if (csg_idx > 0 || cs_idx > 0) {
> +		struct panthor_fw_cs_iface *first_cs_iface =
> +			panthor_fw_get_cs_iface(ptdev, 0, 0);
> +
> +		if (cs_iface->control->features != first_cs_iface->control->features) {
> +			drm_err(&ptdev->base, "Expecting identical CS slots");
> +			return -EINVAL;
> +		}
> +	} else {
> +		u32 reg_count = CS_FEATURES_WORK_REGS(cs_iface->control->features);
> +
> +		ptdev->csif_info.cs_reg_count = reg_count;
> +		ptdev->csif_info.unpreserved_cs_reg_count = CSF_UNPRESERVED_REG_COUNT;
> +	}

Minor NIT: Both of these could be made unconditional. I feel the neatest
thing could be to move the 'else' part to panthor_fw_init_ifaces()
rather than including it as a special case here.
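
e.g. something like this at the end of panthor_fw_init_ifaces(), once all
the CS interfaces have been set up (untested sketch):

	struct panthor_fw_cs_iface *first_cs_iface =
		panthor_fw_get_cs_iface(ptdev, 0, 0);

	ptdev->csif_info.cs_reg_count =
		CS_FEATURES_WORK_REGS(first_cs_iface->control->features);
	ptdev->csif_info.unpreserved_cs_reg_count = CSF_UNPRESERVED_REG_COUNT;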

The conditional could be left as is, removed, or maybe the below is clearer?

	struct panthor_fw_cs_iface *first_cs_iface =
			panthor_fw_get_cs_iface(ptdev, 0, 0);

	if (cs_iface != first_cs_iface) {
		if (cs_iface->control->features !=
		    first_cs_iface->control->features) {

I've no strong views, it's just this bit of code looks very clunky to me.

> +
> +	return 0;
> +}
> +
> +static int panthor_init_csg_iface(struct panthor_device *ptdev,
> +				  unsigned int csg_idx)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	struct panthor_fw_csg_iface *csg_iface = &ptdev->fw->iface.groups[csg_idx];
> +	u64 shared_section_sz = ptdev->fw->shared_section->mem->bo->base.base.size;
> +	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET + (csg_idx * glb_iface->control->group_stride);
> +	unsigned int i;
> +
> +	if (iface_offset + sizeof(*csg_iface) >= shared_section_sz)
> +		return -EINVAL;
> +
> +	spin_lock_init(&csg_iface->lock);
> +	csg_iface->control = ptdev->fw->shared_section->mem->kmap + iface_offset;
> +	csg_iface->input = iface_fw_to_cpu_addr(ptdev, csg_iface->control->input_va);
> +	csg_iface->output = iface_fw_to_cpu_addr(ptdev, csg_iface->control->output_va);
> +
> +	if (csg_iface->control->stream_num < MIN_CS_PER_CSG ||
> +	    csg_iface->control->stream_num > MAX_CS_PER_CSG)
> +		return -EINVAL;
> +
> +	if (!csg_iface->input || !csg_iface->output) {
> +		drm_err(&ptdev->base, "Invalid group control interface input/output VA");
> +		return -EINVAL;
> +	}
> +
> +	if (csg_idx > 0) {
> +		struct panthor_fw_csg_iface *first_csg_iface =
> +			panthor_fw_get_csg_iface(ptdev, 0);
> +		u32 first_protm_suspend_size = first_csg_iface->control->protm_suspend_size;
> +
> +		if (first_csg_iface->control->features != csg_iface->control->features ||
> +		    first_csg_iface->control->suspend_size != csg_iface->control->suspend_size ||
> +		    first_protm_suspend_size != csg_iface->control->protm_suspend_size ||
> +		    first_csg_iface->control->stream_num != csg_iface->control->stream_num) {
> +			drm_err(&ptdev->base, "Expecting identical CSG slots");
> +			return -EINVAL;
> +		}

As above, I also wonder whether factoring out a "compare_csg()" function
could make this more readable - it could take the "->control" members to
keep the line length in check. The special case for
"first_protm_suspend_size" is somewhat ugly.

> +	}
> +
> +	for (i = 0; i < csg_iface->control->stream_num; i++) {
> +		int ret = panthor_init_cs_iface(ptdev, csg_idx, i);
> +
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static u32 panthor_get_instr_features(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	if (glb_iface->control->version < CSF_IFACE_VERSION(1, 1, 0))
> +		return 0;
> +
> +	return glb_iface->control->instr_features;
> +}
> +
> +static int panthor_fw_init_ifaces(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = &ptdev->fw->iface.global;
> +	unsigned int i;
> +
> +	if (!ptdev->fw->shared_section->mem->kmap)
> +		return -EINVAL;
> +
> +	spin_lock_init(&glb_iface->lock);
> +	glb_iface->control = ptdev->fw->shared_section->mem->kmap;
> +
> +	if (!glb_iface->control->version) {
> +		drm_err(&ptdev->base, "Invalid CSF interface version %d.%d.%d (%x)",
> +			CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
> +			CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
> +			CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
> +			glb_iface->control->version);

This looks wrong - we print this message only with version == 0, so the
version number isn't very interesting ;)

I see kbase has this message: "Version check failed. Firmware may have
failed to boot." Which seems much more informative.

> +		return -EINVAL;
> +	}
> +
> +	glb_iface->input = iface_fw_to_cpu_addr(ptdev, glb_iface->control->input_va);
> +	glb_iface->output = iface_fw_to_cpu_addr(ptdev, glb_iface->control->output_va);
> +	if (!glb_iface->input || !glb_iface->output) {
> +		drm_err(&ptdev->base, "Invalid global control interface input/output VA");
> +		return -EINVAL;
> +	}
> +
> +	if (glb_iface->control->group_num > MAX_CSGS ||
> +	    glb_iface->control->group_num < MIN_CSGS) {
> +		drm_err(&ptdev->base, "Invalid number of control groups");
> +		return -EINVAL;
> +	}
> +
> +	for (i = 0; i < glb_iface->control->group_num; i++) {
> +		int ret = panthor_init_csg_iface(ptdev, i);
> +
> +		if (ret)
> +			return ret;
> +	}
> +
> +	drm_info(&ptdev->base, "CSF FW v%d.%d.%d, Features %x Instrumentation features %x",

NIT: Prefix %x with 0x (or use %#x).

> +		 CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
> +		 CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
> +		 CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
> +		 glb_iface->control->features,
> +		 panthor_get_instr_features(ptdev));
> +	return 0;
> +}
> +
> +static void panthor_fw_init_global_iface(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	/* Enable all cores. */
> +	glb_iface->input->core_en_mask = ptdev->gpu_info.shader_present;
> +
> +	/* Setup timers. */
> +	glb_iface->input->poweroff_timer = panthor_fw_conv_timeout(ptdev, PWROFF_HYSTERESIS_US);
> +	glb_iface->input->progress_timer = PROGRESS_TIMEOUT_CYCLES >> PROGRESS_TIMEOUT_SCALE_SHIFT;
> +	glb_iface->input->idle_timer = panthor_fw_conv_timeout(ptdev, IDLE_HYSTERESIS_US);
> +
> +	/* Enable interrupts we care about. */
> +	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
> +					 GLB_PING |
> +					 GLB_CFG_PROGRESS_TIMER |
> +					 GLB_CFG_POWEROFF_TIMER |
> +					 GLB_IDLE_EN |
> +					 GLB_IDLE;
> +
> +	panthor_fw_update_reqs(glb_iface, req, GLB_IDLE_EN, GLB_IDLE_EN);
> +	panthor_fw_toggle_reqs(glb_iface, req, ack,
> +			       GLB_CFG_ALLOC_EN |
> +			       GLB_CFG_POWEROFF_TIMER |
> +			       GLB_CFG_PROGRESS_TIMER);
> +
> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +
> +	/* Kick the watchdog. */
> +	mod_delayed_work(ptdev->reset.wq, &ptdev->fw->watchdog.ping_work,
> +			 msecs_to_jiffies(PING_INTERVAL_MS));
> +}
> +
> +static void panthor_fw_process_global_irq(struct panthor_device *ptdev)
> +{
> +	/* If the FW is not booted, don't process IRQs, just flag the FW as booted. */
> +	if (!ptdev->fw->booted)
> +		ptdev->fw->booted = true;
> +	else
> +		panthor_sched_process_global_irq(ptdev);
> +
> +	wake_up_all(&ptdev->fw->waitqueues[31]);
> +}
> +
> +static void panthor_fw_process_csg_irq(struct panthor_device *ptdev, u32 csg_slot)
> +{
> +	panthor_sched_process_csg_irq(ptdev, csg_slot);
> +	wake_up_all(&ptdev->fw->waitqueues[csg_slot]);
> +}
> +
> +static void panthor_job_irq_handler(struct panthor_device *ptdev, u32 status)
> +{
> +	if (status & JOB_INT_GLOBAL_IF) {
> +		panthor_fw_process_global_irq(ptdev);
> +		status &= ~JOB_INT_GLOBAL_IF;
> +	}
> +
> +	while (status) {
> +		u32 csg_id = ffs(status) - 1;
> +
> +		panthor_fw_process_csg_irq(ptdev, csg_id);
> +		status &= ~BIT(csg_id);

NIT: s/BIT/JOB_INT_CSG_IF/ (since it exists...)

> +	}
> +}
> +PANTHOR_IRQ_HANDLER(job, JOB, panthor_job_irq_handler);
> +
> +static int panthor_fw_start(struct panthor_device *ptdev)
> +{
> +	bool timedout = false;
> +
> +	ptdev->fw->booted = false;
> +	panthor_job_irq_resume(&ptdev->fw->irq, ~0);
> +	gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_AUTO);
> +
> +	if (!wait_event_timeout(ptdev->fw->waitqueues[31],
> +				ptdev->fw->booted,
> +				msecs_to_jiffies(1000))) {
> +		if (!ptdev->fw->booted &&
> +		    !(gpu_read(ptdev, JOB_INT_STAT) & JOB_INT_GLOBAL_IF))
> +			timedout = true;
> +	}
> +
> +	if (timedout) {
> +		drm_err(&ptdev->base, "Failed to boot MCU");
> +		return -ETIMEDOUT;
> +	}
> +
> +	return 0;
> +}
> +
> +static void panthor_fw_stop(struct panthor_device *ptdev)
> +{
> +	u32 status;
> +
> +	gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_DISABLE);
> +	if (readl_poll_timeout(ptdev->iomem + MCU_CONTROL, status,
> +			       status == MCU_CONTROL_DISABLE, 10, 100000))

I suspect this should be checking MCU_STATUS not MCU_CONTROL
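
i.e. presumably something along these lines (sketch only - I'm guessing at
the MCU_STATUS_DISABLED name, it may be spelled differently):

	if (readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
			       status == MCU_STATUS_DISABLED, 10, 100000))
		drm_err(&ptdev->base, "Failed to stop MCU");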

> +		drm_err(&ptdev->base, "Failed to stop MCU");
> +}
> +
> +/**
> + * panthor_fw_pre_reset() - Call before a reset.
> + * @ptdev: Device.
> + * @on_hang: true if the reset was triggered on a GPU hang.
> + *
> + * If the reset is not triggered on a hang, we try to gracefully halt the
> + * MCU, so we can do a fast-reset when panthor_fw_post_reset() is called.
> + */
> +void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
> +{
> +	/* Make sure we won't be woken up by a ping. */
> +	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
> +
> +	ptdev->fw->fast_reset = false;
> +
> +	if (!on_hang) {
> +		struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +		u32 status;
> +
> +		panthor_fw_update_reqs(glb_iface, req, GLB_HALT, GLB_HALT);
> +		gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +		if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
> +					status == MCU_STATUS_HALT, 10, 100000) &&
> +		    glb_iface->output->halt_status == PANTHOR_FW_HALT_OK) {
> +			ptdev->fw->fast_reset = true;
> +		} else {
> +			drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
> +		}
> +
> +		/* The FW detects 0 -> 1 transitions. Make sure we reset
> +		 * the HALT bit before the FW is rebooted.
> +		 */
> +		panthor_fw_update_reqs(glb_iface, req, 0, GLB_HALT);
> +	}
> +
> +	panthor_job_irq_suspend(&ptdev->fw->irq);
> +}
> +
> +/**
> + * panthor_fw_post_reset() - Call after a reset.
> + * @ptdev: Device.
> + *
> + * Start the FW. If this is not a fast reset, all FW sections are reloaded to
> + * make sure we can recover from a memory corruption.
> + */
> +int panthor_fw_post_reset(struct panthor_device *ptdev)
> +{
> +	int ret;
> +
> +	/* Make the MCU VM active. */
> +	ret = panthor_vm_active(ptdev->fw->vm);
> +	if (ret)
> +		return ret;
> +
> +	/* Reload all sections, including RO ones. We're not supposed
> +	 * to end up here anyway, let's just assume the overhead of
> +	 * reloading everything is acceptable.
> +	 */
> +	if (!ptdev->fw->fast_reset)
> +		panthor_reload_fw_sections(ptdev, true);
> +
> +	ret = panthor_fw_start(ptdev);
> +	if (ret)
> +		return ret;
> +
> +	/* We must re-initialize the global interface even on fast-reset. */
> +	panthor_fw_init_global_iface(ptdev);
> +	return 0;
> +}
> +
> +/**
> + * panthor_fw_unplug() - Called when the device is unplugged.
> + * @ptdev: Device.
> + *
> + * This function must make sure all pending operations are flushed before
> + * it releases device resources, thus preventing any further interaction
> + * with the HW.
> + *
> + * If there are still FW-relates works running after this function returns,

s/relates/related/ or maybe even "If there is still FW-related work"

> + * they must use drm_dev_{enter,exit}() and skip any HW access when
> + * drm_dev_enter() returns false.
> + */
> +void panthor_fw_unplug(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_section *section;
> +
> +	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
> +
> +	/* Make sure the IRQ handler can't be called after that point. */
> +	if (ptdev->fw->irq.irq)
> +		panthor_job_irq_suspend(&ptdev->fw->irq);
> +
> +	panthor_fw_stop(ptdev);
> +
> +	if (ptdev->fw->vm)
> +		panthor_vm_idle(ptdev->fw->vm);
> +
> +	list_for_each_entry(section, &ptdev->fw->sections, node) {
> +		panthor_fw_mem_free(ptdev, section->mem);
> +	}
> +
> +	panthor_vm_put(ptdev->fw->vm);
> +
> +	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
> +}
> +
> +/**
> + * panthor_fw_wait_acks() - Wait for requests to be acknowledged by the FW.
> + * @req_ptr: Pointer to the req register.
> + * @ack_ptr: Pointer to the ack register.
> + * @wq: Wait queue to use for the sleeping wait.
> + * @req_mask: Mask of requests to wait for.
> + * @acked: Pointer to field that's updated with the acked requests.
> + * If the function returns 0, *acked == req_mask.
> + * @timeout_ms: Timeout expressed in milliseconds.
> + *
> + * Return: 0 on success, -ETIMEDOUT otherwise.
> + */
> +static int panthor_fw_wait_acks(const u32 *req_ptr, const u32 *ack_ptr,
> +				wait_queue_head_t *wq,
> +				u32 req_mask, u32 *acked,
> +				u32 timeout_ms)
> +{
> +	u32 ack, req = READ_ONCE(*req_ptr) & req_mask;
> +	int ret;
> +
> +	/* Busy wait for a few µsecs before falling back to a sleeping wait. */
> +	*acked = req_mask;
> +	ret = read_poll_timeout_atomic(READ_ONCE, ack,
> +				       (ack & req_mask) == req,
> +				       0, 10, 0,
> +				       *ack_ptr);
> +	if (!ret)
> +		return 0;
> +
> +	if (wait_event_timeout(*wq, (READ_ONCE(*ack_ptr) & req_mask) == req,
> +			       msecs_to_jiffies(timeout_ms)))
> +		return 0;
> +
> +	/* Check one last time, in case we were not woken up for some reason. */
> +	ack = READ_ONCE(*ack_ptr);
> +	if ((ack & req_mask) == req)
> +		return 0;
> +
> +	*acked = ~(req ^ ack) & req_mask;
> +	return -ETIMEDOUT;
> +}
> +
> +/**
> + * panthor_fw_glb_wait_acks() - Wait for global requests to be acknowledged.
> + * @ptdev: Device.
> + * @req_mask: Mask of requests to wait for.
> + * @acked: Pointer to field that's updated with the acked requests.
> + * If the function returns 0, *acked == req_mask.
> + * @timeout_ms: Timeout expressed in milliseconds.
> + *
> + * Return: 0 on success, -ETIMEDOUT otherwise.
> + */
> +int panthor_fw_glb_wait_acks(struct panthor_device *ptdev,
> +			     u32 req_mask, u32 *acked,
> +			     u32 timeout_ms)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	/* GLB_HALT doesn't get acked through the FW interface. */
> +	if (drm_WARN_ON(&ptdev->base, req_mask & (~GLB_REQ_MASK | GLB_HALT)))
> +		return -EINVAL;
> +
> +	return panthor_fw_wait_acks(&glb_iface->input->req,
> +				    &glb_iface->output->ack,
> +				    &ptdev->fw->waitqueues[31],
> +				    req_mask, acked, timeout_ms);
> +}
> +
> +/**
> + * panthor_fw_csg_wait_acks() - Wait for command stream group requests to be acknowledged.
> + * @ptdev: Device.
> + * @csg_slot: CSG slot.
> + * @req_mask: Mask of requests to wait for.
> + * @acked: Pointer to field that's updated with the acked requests.
> + * If the function returns 0, *acked == req_mask.
> + * @timeout_ms: Timeout expressed in milliseconds.
> + *
> + * Return: 0 on success, -ETIMEDOUT otherwise.
> + */
> +int panthor_fw_csg_wait_acks(struct panthor_device *ptdev, u32 csg_slot,
> +			     u32 req_mask, u32 *acked, u32 timeout_ms)
> +{
> +	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, csg_slot);
> +	int ret;
> +
> +	if (drm_WARN_ON(&ptdev->base, req_mask & ~CSG_REQ_MASK))
> +		return -EINVAL;
> +
> +	ret = panthor_fw_wait_acks(&csg_iface->input->req,
> +				   &csg_iface->output->ack,
> +				   &ptdev->fw->waitqueues[csg_slot],
> +				   req_mask, acked, timeout_ms);
> +
> +	if (ret && (*acked & CSG_STATE_MASK) != CSG_STATE_MASK)
> +		*acked &= ~CSG_STATE_MASK;

I think this could do with a comment, it took me a while to work out
what this was about. If I understand correctly this is attempting to
check that all the bits in the STATE field were updated, and if any
mismatch then clearing all those bits in the 'acked' mask. This enables
code to do a "acked & CSG_STATE_MASK" check and get the right value
(rather than having to do "(acked & CSG_STATE_MASK) == CSG_STATE_MASK").

AFAICT the "ret &&" part is also redundant.

> +
> +	return ret;
> +}
> +
> +/**
> + * panthor_fw_ring_csg_doorbells() - Ring command stream group doorbells.
> + * @ptdev: Device.
> + * @csg_mask: Bitmask encoding the command stream group doorbells to ring.
> + *
> + * This function is toggling bits in the doorbell_req and ringing the
> + * global doorbell. It doesn't require a user doorbell to be attached to
> + * the group.
> + */
> +void panthor_fw_ring_csg_doorbells(struct panthor_device *ptdev, u32 csg_mask)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	panthor_fw_toggle_reqs(glb_iface, doorbell_req, doorbell_ack, csg_mask);
> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +}
> +
> +static void panthor_fw_ping_work(struct work_struct *work)
> +{
> +	struct panthor_fw *fw = container_of(work, struct panthor_fw, watchdog.ping_work.work);
> +	struct panthor_device *ptdev = fw->irq.ptdev;
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	u32 acked;
> +	int ret;
> +
> +	if (panthor_device_reset_is_pending(ptdev))
> +		return;
> +
> +	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PING);
> +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> +
> +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PING, &acked, 100);
> +	if (ret) {
> +		panthor_device_schedule_reset(ptdev);
> +		drm_err(&ptdev->base, "FW ping timeout, scheduling a reset");
> +	} else {
> +		mod_delayed_work(ptdev->reset.wq, &fw->watchdog.ping_work,
> +				 msecs_to_jiffies(PING_INTERVAL_MS));
> +	}
> +}
> +
> +/**
> + * panthor_fw_init() - Initialize FW related data.
> + * @ptdev: Device.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +int panthor_fw_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw *fw;
> +	int ret, irq;
> +
> +	fw = drmm_kzalloc(&ptdev->base, sizeof(*fw), GFP_KERNEL);
> +	if (!fw)
> +		return -ENOMEM;
> +
> +	ptdev->fw = fw;
> +	for (u32 i = 0; i < ARRAY_SIZE(fw->waitqueues); i++)
> +		init_waitqueue_head(&fw->waitqueues[i]);
> +
> +	INIT_LIST_HEAD(&fw->sections);
> +	INIT_DELAYED_WORK(&fw->watchdog.ping_work, panthor_fw_ping_work);
> +
> +	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "job");
> +	if (irq <= 0)
> +		return -ENODEV;
> +
> +	ret = panthor_request_job_irq(ptdev, &fw->irq, irq, 0);
> +	if (ret) {
> +		drm_err(&ptdev->base, "failed to request job irq");
> +		return ret;
> +	}
> +
> +	ret = panthor_gpu_l2_power_on(ptdev);
> +	if (ret)
> +		return ret;
> +
> +	fw->vm = panthor_vm_create(ptdev, true,
> +				   CSF_MCU_SHARED_REGION_START,
> +				   CSF_MCU_SHARED_REGION_SIZE);
> +	if (IS_ERR(fw->vm)) {
> +		ret = PTR_ERR(fw->vm);
> +		fw->vm = NULL;
> +		goto err_unplug_fw;
> +	}
> +
> +	ret = panthor_fw_load(ptdev);
> +	if (ret)
> +		goto err_unplug_fw;
> +
> +	ret = panthor_vm_active(fw->vm);
> +	if (ret)
> +		goto err_unplug_fw;
> +
> +	ret = panthor_fw_start(ptdev);
> +	if (ret)
> +		goto err_unplug_fw;
> +
> +	ret = panthor_fw_init_ifaces(ptdev);
> +	if (ret)
> +		goto err_unplug_fw;
> +
> +	panthor_fw_init_global_iface(ptdev);
> +	return 0;
> +
> +err_unplug_fw:
> +	panthor_fw_unplug(ptdev);
> +	return ret;
> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
> new file mode 100644
> index 000000000000..929760c2a46b
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_fw.h
> @@ -0,0 +1,505 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#ifndef __PANTHOR_MCU_H__
> +#define __PANTHOR_MCU_H__
> +
> +#include <linux/types.h>
> +
> +#include "panthor_device.h"
> +
> +struct panthor_fw_mem;
> +
> +#define MAX_CSGS				31
> +#define MAX_CS_PER_CSG				32
> +
> +struct panthor_fw_ringbuf_input_iface {
> +	u64 insert;
> +	u64 extract;
> +} __packed;
> +
> +struct panthor_fw_ringbuf_output_iface {
> +	u64 extract;
> +	u32 active;
> +} __packed;

Is there a good reason for these to be marked '__packed'? They are
naturally aligned so there's no padding, and we guarantee they are page
aligned. The compiler might have more freedom if they are not marked
__packed.

> +
> +struct panthor_fw_cs_control_iface {
> +#define CS_FEATURES_WORK_REGS(x)		(((x) & GENMASK(7, 0)) + 1)
> +#define CS_FEATURES_SCOREBOARDS(x)		(((x) & GENMASK(15, 8)) >> 8)
> +#define CS_FEATURES_COMPUTE			BIT(16)
> +#define CS_FEATURES_FRAGMENT			BIT(17)
> +#define CS_FEATURES_TILER			BIT(18)
> +	u32 features;
> +	u32 input_va;
> +	u32 output_va;
> +} __packed;

Here I have to admit I can't find a statement in the spec saying that
the stride must be a multiple of 4 bytes... but kbase makes that assumption.

> +
> +struct panthor_fw_cs_input_iface {
> +#define CS_STATE_MASK				GENMASK(2, 0)
> +#define CS_STATE_STOP				0
> +#define CS_STATE_START				1
> +#define CS_EXTRACT_EVENT			BIT(4)
> +#define CS_IDLE_SYNC_WAIT			BIT(8)
> +#define CS_IDLE_PROTM_PENDING			BIT(9)
> +#define CS_IDLE_EMPTY				BIT(10)
> +#define CS_IDLE_RESOURCE_REQ			BIT(11)
> +#define CS_TILER_OOM				BIT(26)
> +#define CS_PROTM_PENDING			BIT(27)
> +#define CS_FATAL				BIT(30)
> +#define CS_FAULT				BIT(31)
> +#define CS_REQ_MASK				(CS_STATE_MASK | \
> +						 CS_EXTRACT_EVENT | \
> +						 CS_IDLE_SYNC_WAIT | \
> +						 CS_IDLE_PROTM_PENDING | \
> +						 CS_IDLE_EMPTY | \
> +						 CS_IDLE_RESOURCE_REQ)
> +#define CS_EVT_MASK				(CS_TILER_OOM | \
> +						 CS_PROTM_PENDING | \
> +						 CS_FATAL | \
> +						 CS_FAULT)
> +	u32 req;
> +
> +#define CS_CONFIG_PRIORITY(x)			((x) & GENMASK(3, 0))
> +#define CS_CONFIG_DOORBELL(x)			(((x) << 8) & GENMASK(15, 8))
> +	u32 config;
> +	u32 reserved1;
> +	u32 ack_irq_mask;
> +	u64 ringbuf_base;
> +	u32 ringbuf_size;
> +	u32 reserved2;
> +	u64 heap_start;
> +	u64 heap_end;
> +	u64 ringbuf_input;
> +	u64 ringbuf_output;
> +	u32 instr_config;
> +	u32 instrbuf_size;
> +	u64 instrbuf_base;
> +	u64 instrbuf_offset_ptr;
> +} __packed;

The spec says this has a minimal alignment of 64 bytes. Although I guess
the code should check this if we remove __packed and rely on it.
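
e.g. something like this in panthor_init_cs_iface() (sketch, assuming 64
bytes is indeed the alignment guaranteed by the spec):

	if (!IS_ALIGNED(cs_iface->control->input_va, 64) ||
	    !IS_ALIGNED(cs_iface->control->output_va, 64))
		return -EINVAL;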

> +
> +struct panthor_fw_cs_output_iface {
> +	u32 ack;
> +	u32 reserved1[15];
> +	u64 status_cmd_ptr;
> +
> +#define CS_STATUS_WAIT_SB_MASK			GENMASK(15, 0)
> +#define CS_STATUS_WAIT_SB_SRC_MASK		GENMASK(19, 16)
> +#define CS_STATUS_WAIT_SB_SRC_NONE		(0 << 16)
> +#define CS_STATUS_WAIT_SB_SRC_WAIT		(8 << 16)
> +#define CS_STATUS_WAIT_SYNC_COND_LE		(0 << 24)
> +#define CS_STATUS_WAIT_SYNC_COND_GT		(1 << 24)
> +#define CS_STATUS_WAIT_SYNC_COND_MASK		GENMASK(27, 24)
> +#define CS_STATUS_WAIT_PROGRESS			BIT(28)
> +#define CS_STATUS_WAIT_PROTM			BIT(29)
> +#define CS_STATUS_WAIT_SYNC_64B			BIT(30)
> +#define CS_STATUS_WAIT_SYNC			BIT(31)
> +	u32 status_wait;
> +	u32 status_req_resource;
> +	u64 status_wait_sync_ptr;
> +	u32 status_wait_sync_value;
> +	u32 status_scoreboards;
> +
> +#define CS_STATUS_BLOCKED_REASON_UNBLOCKED	0
> +#define CS_STATUS_BLOCKED_REASON_SB_WAIT	1
> +#define CS_STATUS_BLOCKED_REASON_PROGRESS_WAIT	2
> +#define CS_STATUS_BLOCKED_REASON_SYNC_WAIT	3
> +#define CS_STATUS_BLOCKED_REASON_DEFERRED	5
> +#define CS_STATUS_BLOCKED_REASON_RES		6
> +#define CS_STATUS_BLOCKED_REASON_FLUSH		7
> +#define CS_STATUS_BLOCKED_REASON_MASK		GENMASK(3, 0)
> +	u32 status_blocked_reason;
> +	u32 status_wait_sync_value_hi;
> +	u32 reserved2[6];
> +
> +#define CS_EXCEPTION_TYPE(x)			((x) & GENMASK(7, 0))
> +#define CS_EXCEPTION_DATA(x)			(((x) >> 8) & GENMASK(23, 0))
> +	u32 fault;
> +	u32 fatal;
> +	u64 fault_info;
> +	u64 fatal_info;
> +	u32 reserved3[10];
> +	u32 heap_vt_start;
> +	u32 heap_vt_end;
> +	u32 reserved4;
> +	u32 heap_frag_end;
> +	u64 heap_address;
> +} __packed;

output is the same as input.

> +
> +struct panthor_fw_csg_control_iface {
> +	u32 features;
> +	u32 input_va;
> +	u32 output_va;
> +	u32 suspend_size;
> +	u32 protm_suspend_size;
> +	u32 stream_num;
> +	u32 stream_stride;
> +} __packed;

The spec is ambiguous here. In one place it states the stride is 256
bytes, but in another that you need to look at the GLB_GROUP_STRIDE
value. In practice we can rely on 4 byte alignment.

I'm beginning to wonder if it's worth worrying about, I think I'll stop
here ;)

Steve

> +
> +struct panthor_fw_csg_input_iface {
> +#define CSG_STATE_MASK				GENMASK(2, 0)
> +#define CSG_STATE_TERMINATE			0
> +#define CSG_STATE_START				1
> +#define CSG_STATE_SUSPEND			2
> +#define CSG_STATE_RESUME			3
> +#define CSG_ENDPOINT_CONFIG			BIT(4)
> +#define CSG_STATUS_UPDATE			BIT(5)
> +#define CSG_SYNC_UPDATE				BIT(28)
> +#define CSG_IDLE				BIT(29)
> +#define CSG_DOORBELL				BIT(30)
> +#define CSG_PROGRESS_TIMER_EVENT		BIT(31)
> +#define CSG_REQ_MASK				(CSG_STATE_MASK | \
> +						 CSG_ENDPOINT_CONFIG | \
> +						 CSG_STATUS_UPDATE)
> +#define CSG_EVT_MASK				(CSG_SYNC_UPDATE | \
> +						 CSG_IDLE | \
> +						 CSG_PROGRESS_TIMER_EVENT)
> +	u32 req;
> +	u32 ack_irq_mask;
> +
> +	u32 doorbell_req;
> +	u32 cs_irq_ack;
> +	u32 reserved1[4];
> +	u64 allow_compute;
> +	u64 allow_fragment;
> +	u32 allow_other;
> +
> +#define CSG_EP_REQ_COMPUTE(x)			((x) & GENMASK(7, 0))
> +#define CSG_EP_REQ_FRAGMENT(x)			(((x) << 8) & GENMASK(15, 8))
> +#define CSG_EP_REQ_TILER(x)			(((x) << 16) & GENMASK(19, 16))
> +#define CSG_EP_REQ_EXCL_COMPUTE			BIT(20)
> +#define CSG_EP_REQ_EXCL_FRAGMENT		BIT(21)
> +#define CSG_EP_REQ_PRIORITY(x)			(((x) << 28) & GENMASK(31, 28))
> +#define CSG_EP_REQ_PRIORITY_MASK		GENMASK(31, 28)
> +	u32 endpoint_req;
> +	u32 reserved2[2];
> +	u64 suspend_buf;
> +	u64 protm_suspend_buf;
> +	u32 config;
> +	u32 iter_trace_config;
> +} __packed;
> +
> +struct panthor_fw_csg_output_iface {
> +	u32 ack;
> +	u32 reserved1;
> +	u32 doorbell_ack;
> +	u32 cs_irq_req;
> +	u32 status_endpoint_current;
> +	u32 status_endpoint_req;
> +
> +#define CSG_STATUS_STATE_IS_IDLE		BIT(0)
> +	u32 status_state;
> +	u32 resource_dep;
> +} __packed;
> +
> +struct panthor_fw_global_control_iface {
> +	u32 version;
> +	u32 features;
> +	u32 input_va;
> +	u32 output_va;
> +	u32 group_num;
> +	u32 group_stride;
> +	u32 perfcnt_size;
> +	u32 instr_features;
> +} __packed;
> +
> +struct panthor_fw_global_input_iface {
> +#define GLB_HALT				BIT(0)
> +#define GLB_CFG_PROGRESS_TIMER			BIT(1)
> +#define GLB_CFG_ALLOC_EN			BIT(2)
> +#define GLB_CFG_POWEROFF_TIMER			BIT(3)
> +#define GLB_PROTM_ENTER				BIT(4)
> +#define GLB_PERFCNT_EN				BIT(5)
> +#define GLB_PERFCNT_SAMPLER			BIT(6)
> +#define GLB_COUNTER_EN				BIT(7)
> +#define GLB_PING				BIT(8)
> +#define GLB_FWCFG_UPDATE			BIT(9)
> +#define GLB_IDLE_EN				BIT(10)
> +#define GLB_SLEEP				BIT(12)
> +#define GLB_INACTIVE_COMPUTE			BIT(20)
> +#define GLB_INACTIVE_FRAGMENT			BIT(21)
> +#define GLB_INACTIVE_TILER			BIT(22)
> +#define GLB_PROTM_EXIT				BIT(23)
> +#define GLB_PERFCNT_THRESHOLD			BIT(24)
> +#define GLB_PERFCNT_OVERFLOW			BIT(25)
> +#define GLB_IDLE				BIT(26)
> +#define GLB_DBG_CSF				BIT(30)
> +#define GLB_DBG_HOST				BIT(31)
> +#define GLB_REQ_MASK				GENMASK(10, 0)
> +#define GLB_EVT_MASK				GENMASK(26, 20)
> +	u32 req;
> +	u32 ack_irq_mask;
> +	u32 doorbell_req;
> +	u32 reserved1;
> +	u32 progress_timer;
> +
> +#define GLB_TIMER_VAL(x)			((x) & GENMASK(30, 0))
> +#define GLB_TIMER_SOURCE_GPU_COUNTER		BIT(31)
> +	u32 poweroff_timer;
> +	u64 core_en_mask;
> +	u32 reserved2;
> +	u32 perfcnt_as;
> +	u64 perfcnt_base;
> +	u32 perfcnt_extract;
> +	u32 reserved3[3];
> +	u32 perfcnt_config;
> +	u32 perfcnt_csg_select;
> +	u32 perfcnt_fw_enable;
> +	u32 perfcnt_csg_enable;
> +	u32 perfcnt_csf_enable;
> +	u32 perfcnt_shader_enable;
> +	u32 perfcnt_tiler_enable;
> +	u32 perfcnt_mmu_l2_enable;
> +	u32 reserved4[8];
> +	u32 idle_timer;
> +} __packed;
> +
> +enum panthor_fw_halt_status {
> +	PANTHOR_FW_HALT_OK = 0,
> +	PANTHOR_FW_HALT_ON_PANIC = 0x4e,
> +	PANTHOR_FW_HALT_ON_WATCHDOG_EXPIRATION = 0x4f,
> +};
> +
> +struct panthor_fw_global_output_iface {
> +	u32 ack;
> +	u32 reserved1;
> +	u32 doorbell_ack;
> +	u32 reserved2;
> +	u32 halt_status;
> +	u32 perfcnt_status;
> +	u32 perfcnt_insert;
> +} __packed;
> +
> +/**
> + * struct panthor_fw_cs_iface - Firmware command stream slot interface
> + */
> +struct panthor_fw_cs_iface {
> +	/**
> +	 * @lock: Lock protecting access to the panthor_fw_cs_input_iface::req
> +	 * field.
> +	 *
> +	 * Needed so we can update the req field concurrently from the interrupt
> +	 * handler and the scheduler logic.
> +	 *
> +	 * TODO: Ideally we'd want to use a cmpxchg() to update the req, but FW
> +	 * interface sections are mapped uncached/write-combined right now, and
> +	 * using cmpxchg() on such mappings leads to SError faults. Revisit when
> +	 * we have 'SHARED' GPU mappings hooked up.
> +	 */
> +	spinlock_t lock;
> +
> +	/**
> +	 * @control: Command stream slot control interface.
> +	 *
> +	 * Used to expose command stream slot properties.
> +	 *
> +	 * This interface is read-only.
> +	 */
> +	struct panthor_fw_cs_control_iface *control;
> +
> +	/**
> +	 * @input: Command stream slot input interface.
> +	 *
> +	 * Used for host updates/events.
> +	 */
> +	struct panthor_fw_cs_input_iface *input;
> +
> +	/**
> +	 * @output: Command stream slot output interface.
> +	 *
> +	 * Used for FW updates/events.
> +	 *
> +	 * This interface is read-only.
> +	 */
> +	const struct panthor_fw_cs_output_iface *output;
> +};
> +
> +/**
> + * struct panthor_fw_csg_iface - Firmware command stream group slot interface
> + */
> +struct panthor_fw_csg_iface {
> +	/**
> +	 * @lock: Lock protecting access to the panthor_fw_csg_input_iface::req
> +	 * field.
> +	 *
> +	 * Needed so we can update the req field concurrently from the interrupt
> +	 * handler and the scheduler logic.
> +	 *
> +	 * TODO: Ideally we'd want to use a cmpxchg() to update the req, but FW
> +	 * interface sections are mapped uncached/write-combined right now, and
> +	 * using cmpxchg() on such mappings leads to SError faults. Revisit when
> +	 * we have 'SHARED' GPU mappings hooked up.
> +	 */
> +	spinlock_t lock;
> +
> +	/**
> +	 * @control: Command stream group slot control interface.
> +	 *
> +	 * Used to expose command stream group slot properties.
> +	 *
> +	 * This interface is read-only.
> +	 */
> +	const struct panthor_fw_csg_control_iface *control;
> +
> +	/**
> +	 * @input: Command stream slot input interface.
> +	 *
> +	 * Used for host updates/events.
> +	 */
> +	struct panthor_fw_csg_input_iface *input;
> +
> +	/**
> +	 * @output: Command stream group slot output interface.
> +	 *
> +	 * Used for FW updates/events.
> +	 *
> +	 * This interface is read-only.
> +	 */
> +	const struct panthor_fw_csg_output_iface *output;
> +};
> +
> +/**
> + * struct panthor_fw_global_iface - Firmware global interface
> + */
> +struct panthor_fw_global_iface {
> +	/**
> +	 * @lock: Lock protecting access to the panthor_fw_global_input_iface::req
> +	 * field.
> +	 *
> +	 * Needed so we can update the req field concurrently from the interrupt
> +	 * handler and the scheduler/FW management logic.
> +	 *
> +	 * TODO: Ideally we'd want to use a cmpxchg() to update the req, but FW
> +	 * interface sections are mapped uncached/write-combined right now, and
> +	 * using cmpxchg() on such mappings leads to SError faults. Revisit when
> +	 * we have 'SHARED' GPU mappings hooked up.
> +	 */
> +	spinlock_t lock;
> +
> +	/**
> +	 * @control: Command stream group slot control interface.
> +	 *
> +	 * Used to expose global FW properties.
> +	 *
> +	 * This interface is read-only.
> +	 */
> +	const struct panthor_fw_global_control_iface *control;
> +
> +	/**
> +	 * @input: Global input interface.
> +	 *
> +	 * Used for host updates/events.
> +	 */
> +	struct panthor_fw_global_input_iface *input;
> +
> +	/**
> +	 * @output: Global output interface.
> +	 *
> +	 * Used for FW updates/events.
> +	 *
> +	 * This interface is read-only.
> +	 */
> +	const struct panthor_fw_global_output_iface *output;
> +};
> +
> +/**
> + * panthor_fw_toggle_reqs() - Toggle acknowledge bits to send an event to the FW
> + * @__iface: The interface to operate on.
> + * @__in_reg: Name of the register to update in the input section of the interface.
> + * @__out_reg: Name of the register to take as a reference in the output section of the
> + * interface.
> + * @__mask: Mask to apply to the update.
> + *
> + * The Host -> FW event/message passing was designed to be lockless, with each side of
> + * the channel having its writeable section. Events are signaled as a difference between
> + * the host and FW side in the req/ack registers (when a bit differs, there's an event
> + * pending, when they are the same, nothing needs attention).
> + *
> + * This helper allows one to update the req register based on the current value of the
> + * ack register managed by the FW. Toggling a specific bit will flag an event. In order
> + * for events to be re-evaluated, the interface doorbell needs to be rung.
> + *
> + * Concurrent accesses to the same req register are covered.
> + *
> + * Anything requiring atomic updates to multiple registers requires a dedicated lock.
> + */
> +#define panthor_fw_toggle_reqs(__iface, __in_reg, __out_reg, __mask) \
> +	do { \
> +		u32 __cur_val, __new_val, __out_val; \
> +		spin_lock(&(__iface)->lock); \
> +		__cur_val = READ_ONCE((__iface)->input->__in_reg); \
> +		__out_val = READ_ONCE((__iface)->output->__out_reg); \
> +		__new_val = ((__out_val ^ (__mask)) & (__mask)) | (__cur_val & ~(__mask)); \
> +		WRITE_ONCE((__iface)->input->__in_reg, __new_val); \
> +		spin_unlock(&(__iface)->lock); \
> +	} while (0)
> +
> +/**
> + * panthor_fw_update_reqs() - Update bits to reflect a configuration change
> + * @__iface: The interface to operate on.
> + * @__in_reg: Name of the register to update in the input section of the interface.
> + * @__val: Value to set.
> + * @__mask: Mask to apply to the update.
> + *
> + * Some configuration values get passed through req registers that are also used to
> + * send events to the FW. Since those req registers are updated from the interrupt
> + * handler too, special helpers are required to safely update the configuration part.
> + *
> + * Concurrent accesses to the same req register are covered.
> + *
> + * Anything requiring atomic updates to multiple registers requires a dedicated lock.
> + */
> +#define panthor_fw_update_reqs(__iface, __in_reg, __val, __mask) \
> +	do { \
> +		u32 __cur_val, __new_val; \
> +		spin_lock(&(__iface)->lock); \
> +		__cur_val = READ_ONCE((__iface)->input->__in_reg); \
> +		__new_val = (__cur_val & ~(__mask)) | ((__val) & (__mask)); \
> +		WRITE_ONCE((__iface)->input->__in_reg, __new_val); \
> +		spin_unlock(&(__iface)->lock); \
> +	} while (0)
> +
> +struct panthor_fw_global_iface *
> +panthor_fw_get_glb_iface(struct panthor_device *ptdev);
> +
> +struct panthor_fw_csg_iface *
> +panthor_fw_get_csg_iface(struct panthor_device *ptdev, u32 csg_slot);
> +
> +struct panthor_fw_cs_iface *
> +panthor_fw_get_cs_iface(struct panthor_device *ptdev, u32 csg_slot, u32 cs_slot);
> +
> +int panthor_fw_csg_wait_acks(struct panthor_device *ptdev, u32 csg_id, u32 req_mask,
> +			     u32 *acked, u32 timeout_ms);
> +
> +int panthor_fw_glb_wait_acks(struct panthor_device *ptdev, u32 req_mask, u32 *acked,
> +			     u32 timeout_ms);
> +
> +void panthor_fw_ring_csg_doorbells(struct panthor_device *ptdev, u32 csg_slot);
> +
> +void panthor_fw_mem_vunmap(struct panthor_fw_mem *mem);
> +void *panthor_fw_mem_vmap(struct panthor_fw_mem *mem);
> +u64 panthor_fw_mem_va(struct panthor_fw_mem *mem);
> +void panthor_fw_mem_free(struct panthor_device *ptdev, struct panthor_fw_mem *mem);
> +struct panthor_fw_mem *
> +panthor_fw_alloc_queue_iface_mem(struct panthor_device *ptdev,
> +				 struct panthor_fw_ringbuf_input_iface **input,
> +				 const struct panthor_fw_ringbuf_output_iface **output);
> +struct panthor_fw_mem *
> +panthor_fw_alloc_suspend_buf_mem(struct panthor_device *ptdev, size_t size);
> +
> +void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang);
> +int panthor_fw_post_reset(struct panthor_device *ptdev);
> +
> +static inline void panthor_fw_suspend(struct panthor_device *ptdev)
> +{
> +	panthor_fw_pre_reset(ptdev, false);
> +}
> +
> +static inline int panthor_fw_resume(struct panthor_device *ptdev)
> +{
> +	return panthor_fw_post_reset(ptdev);
> +}
> +
> +int panthor_fw_init(struct panthor_device *ptdev);
> +void panthor_fw_unplug(struct panthor_device *ptdev);
> +
> +#endif


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/15] drm/panthor: Add the heap logical block
  2023-08-09 16:53 ` [PATCH v2 10/15] drm/panthor: Add the heap " Boris Brezillon
@ 2023-08-18 14:39   ` Steven Price
  2023-08-29 16:21     ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-18 14:39 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Tiler heap growing requires some kernel driver involvement: when the
> tiler runs out of heap memory, it will raise an exception which is
> either directly handled by the firmware if some free heap chunks are
> available in the heap context, or passed back to the kernel otherwise.
> The heap helpers will be used by the scheduler logic to allocate more
> heap chunks to a heap context, when such a situation happens.
> 
> Heap context creation is explicitly requested by userspace (using
> the TILER_HEAP_CREATE ioctl), and the returned context is attached to a
> queue through some command stream instruction.
> 
> All the kernel does is keep the list of heap chunks allocated to a
> context, so they can be freed when TILER_HEAP_DESTROY is called, or
> extended when the FW requests a new chunk.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Split the driver addition commit
> - Document the code
> - Fix various bugs
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Mostly looks good, but I think we might have issues with struct
panthor_heap_gpu_ctx potentially being smaller than a (GPU) cache line
(see below).

> ---
>  drivers/gpu/drm/panthor/panthor_heap.c | 550 +++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_heap.h |  36 ++
>  2 files changed, 586 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
> new file mode 100644
> index 000000000000..39244efc2eaa
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_heap.c
> @@ -0,0 +1,550 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2023 Collabora ltd. */
> +
> +#include <linux/iosys-map.h>
> +#include <linux/rwsem.h>
> +
> +#include <drm/panthor_drm.h>
> +
> +#include "panthor_device.h"
> +#include "panthor_gem.h"
> +#include "panthor_heap.h"
> +#include "panthor_mmu.h"
> +
> +/**
> + * struct panthor_heap_gpu_ctx - Heap context used by the GPU/FW.
> + */
> +struct panthor_heap_gpu_ctx {
> +	/**
> +	 * @first_heap_chunk: GPU VA of the first free heap chunk.
> +	 *
> +	 * This forms a single-link list, where each chunk points to the
> +	 * next free chunk, and the last element points to NULL.
> +	 *
> +	 * Heap chunks get freed and returned to the heap context when fragment
> +	 * jobs picking data from those heap chunks complete. When this happens
> +	 * this field is updated to insert the heap chunks that were freed.
> +	 *
> +	 * When the tiler runs out of memory, it will first check if there
> +	 * are free heap chunks in the heap context, and pick those if there are.
> +	 *
> +	 * When there is no free heap chunks left, the FW will raise a TILER_OOM
> +	 * interrupt, letting the kernel driver allocate more heap chunks.
> +	 *
> +	 * If the heap context reached its heap chunk limit, the FW will wait
> +	 * for fragment jobs to consume some data and return chunks to the
> +	 * context.
> +	 *
> +	 * As a last resort, if there is no in-flight fragment jobs, the FW
> +	 * will try to execute the exception handler set on the command stream.
> +	 * This exception handler is expected to issue a fragment job to store
> +	 * the partial rendering results and free up some heap chunks.
> +	 */
> +	u64 first_heap_chunk;
> +
> +	/** @unused1: MBZ. */
> +	u32 unused1[2];
> +
> +	/**
> +	 * @vt_started_count: Number of vertex/tiling operations started.
> +	 *
> +	 * This is marking the beginning of a render pass, and is explicity

s/explicity/explicitly/

> +	 * flagged with a HEAP_OPERATION.vt_start instruction. If the render pass
> +	 * contains multiple vertex/tiler/IDVS jobs, this HEAP_OPERATION.vt_start
> +	 * is only called once.
> +	 */
> +	u32 vt_started_count;
> +
> +	/**
> +	 * @vt_completed_count: Number of completed vertex/tiler jobs.
> +	 *
> +	 * This is marking the end of the geometry processing part of a render
> +	 * pass, and is explicity flagged by the user command stream with

s/explicity/explicitly/

> +	 * a HEAP_OPERATION.vt_completed instruction. If the render pass contains
> +	 * multiple vertex/tiler/IDVS jobs, this HEAP_OPERATION.vt_end
> +	 * instruction is only issued once.
> +	 */
> +	u32 vt_completed_count;
> +
> +	/** @unused2: MBZ. */
> +	u32 unused2;
> +
> +	/**
> +	 * @frag_completed_count: Number of completed fragment jobs.
> +	 *
> +	 * @vt_started_count - @frag_completed_count is the number of in-flight
> +	 * render targets that's used by the driver to determine if it's worth
> +	 * allocating new chunk or if we should instead wait for fragment jobs
> +	 * to complete.
> +	 *
> +	 * Fragment completion is explicitly flagged by the user command stream
> +	 * with a HEAP_OPERATION.frag_end or FINISH_FRAGMENT.frag_end instruction.
> +	 */
> +	u32 frag_completed_count;
> +};

I'm not sure whether we should really be describing this structure in
the kernel. Beyond the size the kernel has no reason to be looking at
the internals and the spec does have a warning that the layout may change.

Interestingly kbase also rounds this size up to ensure that it is at
least a cache line. Which I guess might be required if the CPU and GPU
are not coherent as we zero the context (from the CPU) before use.
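
i.e. something along the lines of (sketch - 64 is just a placeholder here,
kbase derives the GPU cache line size from the GPU properties):

	/* Per-context stride in the heap context BO, padded to a cache line. */
	#define HEAP_CONTEXT_SIZE	ALIGN(sizeof(struct panthor_heap_gpu_ctx), 64)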

> +
> +/**
> + * struct panthor_heap_chunk_header - Heap chunk header
> + */
> +struct panthor_heap_chunk_header {
> +	/**
> +	 * @next: Next heap chunk in the list.
> +	 *
> +	 * This is a GPU VA.
> +	 */
> +	u64 next;
> +
> +	/** @unknown: MBZ. */
> +	u32 unknown[14];
> +};
> +
> +/**
> + * struct panthor_heap_chunk - Structure used to keep track of allocated heap chunks.
> + */
> +struct panthor_heap_chunk {
> +	/** @node: Used to insert the heap chunk in panthor_heap::chunks. */
> +	struct list_head node;
> +
> +	/** @bo: Buffer object backing the heap chunk. */
> +	struct panthor_gem_object *bo;
> +
> +	/** @gpu_va: GPU address of this heap chunk. */
> +	u64 gpu_va;
> +};
> +
> +/**
> + * struct panthor_heap - Structure used to manage tiler heap contexts.
> + */
> +struct panthor_heap {
> +	/** @chunks: List containing all heap chunks allocated so far. */
> +	struct list_head chunks;
> +
> +	/** @chunk_size: Size of each chunk. */
> +	u32 chunk_size;
> +
> +	/** @max_chunks: Maximum number of chunks. */
> +	u32 max_chunks;
> +
> +	/**
> +	 * @target_in_flight: Number of in-flight render passes after which
> +	 * we'd let the FW wait for fragment job to finish instead of allocating new chunks.
> +	 */
> +	u32 target_in_flight;
> +
> +	/** @chunk_count: Number of heap chunks currently allocated. */
> +	u32 chunk_count;
> +};
> +
> +#define MAX_HEAPS_PER_POOL    128
> +
> +/**
> + * struct panthor_heap_pool - Pool of heap contexts
> + *
> + * The pool is attached to a panthor_file and can't be shared across processes.
> + */
> +struct panthor_heap_pool {
> +	/** @refcount: Reference count. */
> +	struct kref refcount;
> +
> +	/** @ptdev: Device. */
> +	struct panthor_device *ptdev;
> +
> +	/** @vm: VM this pool is bound to. */
> +	struct panthor_vm *vm;
> +
> +	/** @lock: Lock protecting access to @xa. */
> +	struct rw_semaphore lock;
> +
> +	/** @xa: Array storing panthor_heap objects. */
> +	struct xarray xa;
> +
> +	/** @bo: Buffer object containing the GPU heap contexts. */
> +	struct panthor_gem_object *bo;
> +
> +	/** @gpu_contexts: Array of GPU heap contexts. */
> +	struct panthor_heap_gpu_ctx *gpu_contexts;
> +
> +	/** @gpu_va: GPU address of the heap contexts. */
> +	u64 gpu_va;
> +};
> +
> +static void panthor_free_heap_chunk(struct panthor_vm *vm,
> +				    struct panthor_heap_chunk *chunk)
> +{
> +	if (!chunk)
> +		return;
> +
> +	list_del(&chunk->node);
> +	panthor_gem_unmap_and_put(vm, chunk->bo, chunk->gpu_va, NULL);
> +	kfree(chunk);
> +}
> +
> +static int panthor_alloc_heap_chunk(struct panthor_device *ptdev,
> +				    struct panthor_vm *vm,
> +				    struct panthor_heap *heap,
> +				    bool initial_chunk)
> +{
> +	struct iosys_map map = IOSYS_MAP_INIT_VADDR(NULL);
> +	struct panthor_heap_chunk *chunk;
> +	struct panthor_heap_chunk_header *hdr;
> +	int ret;
> +
> +	chunk = kmalloc(sizeof(*chunk), GFP_KERNEL);
> +	if (!chunk)
> +		return -ENOMEM;
> +
> +	chunk->gpu_va = PANTHOR_GEM_ALLOC_VA;
> +	chunk->bo = panthor_gem_create_and_map(ptdev, vm, heap->chunk_size,
> +					       DRM_PANTHOR_BO_NO_MMAP,
> +					       DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
> +					       &chunk->gpu_va,
> +					       (void **)&hdr);
> +	if (IS_ERR(chunk->bo)) {
> +		ret = PTR_ERR(chunk->bo);
> +		goto err_free_chunk;
> +	}
> +
> +	memset(hdr, 0, sizeof(*hdr));
> +
> +	if (initial_chunk && !list_empty(&heap->chunks)) {
> +		struct panthor_heap_chunk *prev_chunk;
> +
> +		prev_chunk = list_first_entry(&heap->chunks,
> +					      struct panthor_heap_chunk,
> +					      node);
> +
> +		hdr->next = (prev_chunk->gpu_va & GENMASK_ULL(63, 12)) |
> +			    (heap->chunk_size >> 12);
> +	}
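
As an aside for readers: the 'next' field appears to pack the 4KiB-aligned
GPU VA of the next chunk in bits [63:12] and the chunk size, in 4KiB units,
in the low bits. A hypothetical helper making that encoding explicit could
look like:

  /* Not part of the patch, just mirrors the encoding used above. */
  static u64 encode_next_chunk(u64 next_chunk_gpu_va, u32 chunk_size)
  {
	return (next_chunk_gpu_va & GENMASK_ULL(63, 12)) |
	       (chunk_size >> 12);
  }
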
> +
> +	map.vaddr = hdr;
> +	drm_gem_vunmap_unlocked(&chunk->bo->base.base, &map);
> +
> +	if (initial_chunk)
> +		list_add(&chunk->node, &heap->chunks);
> +	else
> +		list_add_tail(&chunk->node, &heap->chunks);

I'm not sure I see the reason to do list_add_tail() here; changing it to
always use list_add() and updating the list_last_entry() to
list_first_entry() in panthor_heap_grow() would seem to work (unless
I've missed something).
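i.e. something along these lines (untested sketch of the suggested
simplification, not a tested change):

	/* in panthor_alloc_heap_chunk(): newest chunk always at the head */
	list_add(&chunk->node, &heap->chunks);

	/* in panthor_heap_grow(): pick the chunk we just allocated */
	chunk = list_first_entry(&heap->chunks,
				 struct panthor_heap_chunk,
				 node);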

> +	heap->chunk_count++;
> +
> +	return 0;
> +
> +err_free_chunk:
> +	kfree(chunk);
> +
> +	return ret;
> +}
> +
> +static void panthor_free_heap_chunks(struct panthor_vm *vm,
> +				     struct panthor_heap *heap)
> +{
> +	struct panthor_heap_chunk *chunk, *tmp;
> +
> +	list_for_each_entry_safe(chunk, tmp, &heap->chunks, node) {
> +		panthor_free_heap_chunk(vm, chunk);
> +	}
> +
> +	heap->chunk_count = 0;
> +}
> +
> +static int panthor_alloc_heap_chunks(struct panthor_device *ptdev,
> +				     struct panthor_vm *vm,
> +				     struct panthor_heap *heap,
> +				     u32 chunk_count)
> +{
> +	int ret;
> +	u32 i;
> +
> +	for (i = 0; i < chunk_count; i++) {
> +		ret = panthor_alloc_heap_chunk(ptdev, vm, heap, true);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +panthor_heap_destroy_locked(struct panthor_heap_pool *pool, u32 handle)
> +{
> +	struct panthor_heap *heap = NULL;

No need to initialize heap to NULL.

Steve

> +
> +	heap = xa_erase(&pool->xa, handle);
> +	if (!heap)
> +		return -EINVAL;
> +
> +	panthor_free_heap_chunks(pool->vm, heap);
> +	kfree(heap);
> +	return 0;
> +}
> +
> +/**
> + * panthor_heap_destroy() - Destroy a heap context
> + * @pool: Pool this context belongs to.
> + * @handle: Handle returned by panthor_heap_create().
> + */
> +int panthor_heap_destroy(struct panthor_heap_pool *pool, u32 handle)
> +{
> +	int ret;
> +
> +	down_write(&pool->lock);
> +	ret = panthor_heap_destroy_locked(pool, handle);
> +	up_write(&pool->lock);
> +
> +	return ret;
> +}
> +
> +/**
> + * panthor_heap_create() - Create a heap context
> + * @pool: Pool to instantiate the heap context from.
> + * @initial_chunk_count: Number of chunk allocated at initialization time.
> + * Must be at least 1.
> + * @chunk_size: The size of each chunk. Must be a power of two between 256k
> + * and 2M.
> + * @max_chunks: Maximum number of chunks that can be allocated.
> + * @target_in_flight: Maximum number of in-flight render passes.
> + * @heap_ctx_gpu_va: Pointer holding the GPU address of the allocated heap
> + * context.
> + * @first_chunk_gpu_va: Pointer holding the GPU address of the first chunk
> + * assigned to the heap context.
> + *
> + * Return: a positive handle on success, a negative error otherwise.
> + */
> +int panthor_heap_create(struct panthor_heap_pool *pool,
> +			u32 initial_chunk_count,
> +			u32 chunk_size,
> +			u32 max_chunks,
> +			u32 target_in_flight,
> +			u64 *heap_ctx_gpu_va,
> +			u64 *first_chunk_gpu_va)
> +{
> +	struct panthor_heap *heap;
> +	struct panthor_heap_gpu_ctx *gpu_ctx;
> +	struct panthor_heap_chunk *first_chunk;
> +	int ret = 0;
> +	u32 id;
> +
> +	if (initial_chunk_count == 0)
> +		return -EINVAL;
> +
> +	if (hweight32(chunk_size) != 1 ||
> +	    chunk_size < SZ_256K || chunk_size > SZ_2M)
> +		return -EINVAL;
> +
> +	heap = kzalloc(sizeof(*heap), GFP_KERNEL);
> +	if (!heap)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&heap->chunks);
> +	heap->chunk_size = chunk_size;
> +	heap->max_chunks = max_chunks;
> +	heap->target_in_flight = target_in_flight;
> +
> +	down_write(&pool->lock);
> +
> +	/* The pool has been destroyed, we can't create a new heap. */
> +	if (!pool->vm) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = xa_alloc(&pool->xa, &id, heap, XA_LIMIT(1, MAX_HEAPS_PER_POOL), GFP_KERNEL);
> +	if (ret) {
> +		kfree(heap);
> +		goto out_unlock;
> +	}
> +
> +	gpu_ctx = &pool->gpu_contexts[id];
> +	memset(gpu_ctx, 0, sizeof(*gpu_ctx));
> +
> +	ret = panthor_alloc_heap_chunks(pool->ptdev, pool->vm, heap,
> +					initial_chunk_count);
> +	if (ret) {
> +		panthor_heap_destroy_locked(pool, id);
> +		goto out_unlock;
> +	}
> +
> +	*heap_ctx_gpu_va = pool->gpu_va + (sizeof(*pool->gpu_contexts) * id);
> +
> +	first_chunk = list_first_entry(&heap->chunks,
> +				       struct panthor_heap_chunk,
> +				       node);
> +	*first_chunk_gpu_va = first_chunk->gpu_va;
> +	ret = id;
> +
> +out_unlock:
> +	up_write(&pool->lock);
> +	return ret;
> +}
> +
> +/**
> + * panthor_heap_grow() - Make a heap context grow.
> + * @pool: The pool this heap belongs to.
> + * @heap_gpu_va: The GPU address of the heap context.
> + * @renderpasses_in_flight: Number of render passes currently in-flight.
> + * @pending_frag_count: Number of fragment jobs waiting for execution/completion.
> + */
> +int panthor_heap_grow(struct panthor_heap_pool *pool,
> +		      u64 heap_gpu_va,
> +		      u32 renderpasses_in_flight,
> +		      u32 pending_frag_count,
> +		      u64 *new_chunk_gpu_va)
> +{
> +	u64 heap_id = (heap_gpu_va - pool->gpu_va) /
> +		      sizeof(struct panthor_heap_gpu_ctx);
> +	struct panthor_heap_chunk *chunk;
> +	struct panthor_heap *heap;
> +	int ret;
> +
> +	down_read(&pool->lock);
> +	heap = xa_load(&pool->xa, heap_id);
> +	if (!heap) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/* If we reached the target in-flight render passes, or if we
> +	 * reached the maximum number of chunks, let the FW figure another way to
> +	 * find some memory (wait for render passes to finish, or call the exception
> +	 * handler provided by the userspace driver, if any).
> +	 */
> +	if (renderpasses_in_flight > heap->target_in_flight ||
> +	    (pending_frag_count > 0 && heap->chunk_count >= heap->max_chunks)) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	} else if (heap->chunk_count >= heap->max_chunks) {
> +		ret = -ENOMEM;
> +		goto out_unlock;
> +	}
> +
> +	ret = panthor_alloc_heap_chunk(pool->ptdev, pool->vm, heap, false);
> +	if (ret)
> +		goto out_unlock;
> +
> +	chunk = list_last_entry(&heap->chunks,
> +				struct panthor_heap_chunk,
> +				node);
> +	*new_chunk_gpu_va = (chunk->gpu_va & GENMASK_ULL(63, 12)) |
> +			    (heap->chunk_size >> 12);
> +	ret = 0;
> +
> +out_unlock:
> +	up_read(&pool->lock);
> +	return ret;
> +}
> +
> +static void panthor_heap_pool_release(struct kref *refcount)
> +{
> +	struct panthor_heap_pool *pool =
> +		container_of(refcount, struct panthor_heap_pool, refcount);
> +
> +	xa_destroy(&pool->xa);
> +	kfree(pool);
> +}
> +
> +/**
> + * panthor_heap_pool_put() - Release a heap pool reference
> + * @pool: Pool to release the reference on. Can be NULL.
> + */
> +void panthor_heap_pool_put(struct panthor_heap_pool *pool)
> +{
> +	if (pool)
> +		kref_put(&pool->refcount, panthor_heap_pool_release);
> +}
> +
> +/**
> + * panthor_heap_pool_get() - Get a heap pool reference
> + * @pool: Pool to get the reference on. Can be NULL.
> + *
> + * Return: @pool.
> + */
> +struct panthor_heap_pool *
> +panthor_heap_pool_get(struct panthor_heap_pool *pool)
> +{
> +	if (pool)
> +		kref_get(&pool->refcount);
> +
> +	return pool;
> +}
> +
> +/**
> + * panthor_heap_pool_create() - Create a heap pool
> + * @ptdev: Device.
> + * @vm: The VM this heap pool will be attached to.
> + *
> + * Heap pools might contain up to 128 heap contexts, and are per-VM.
> + *
> + * Return: A valid pointer on success, a negative error code otherwise.
> + */
> +struct panthor_heap_pool *
> +panthor_heap_pool_create(struct panthor_device *ptdev, struct panthor_vm *vm)
> +{
> +	size_t bosize = ALIGN(MAX_HEAPS_PER_POOL *
> +			      sizeof(struct panthor_heap_gpu_ctx),
> +			      4096);
> +	struct panthor_heap_pool *pool;
> +	int ret = 0;
> +
> +	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> +	if (!pool)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* We want a weak ref here: the heap pool belongs to the VM, so we're
> +	 * sure that, as long as the heap pool exists, the VM exists too.
> +	 */
> +	pool->vm = vm;
> +	pool->ptdev = ptdev;
> +	init_rwsem(&pool->lock);
> +	xa_init_flags(&pool->xa, XA_FLAGS_ALLOC1);
> +	kref_init(&pool->refcount);
> +
> +	pool->gpu_va = PANTHOR_GEM_ALLOC_VA;
> +	pool->bo = panthor_gem_create_and_map(ptdev, vm, bosize,
> +					      DRM_PANTHOR_BO_NO_MMAP,
> +					      DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
> +					      &pool->gpu_va,
> +					      (void *)&pool->gpu_contexts);
> +	if (IS_ERR(pool->bo)) {
> +		ret = PTR_ERR(pool->bo);
> +		goto err_destroy_pool;
> +	}
> +
> +	return pool;
> +
> +err_destroy_pool:
> +	panthor_heap_pool_destroy(pool);
> +	return ERR_PTR(ret);
> +}
> +
> +/**
> + * panthor_heap_pool_destroy() - Destroy a heap pool.
> + * @pool: Pool to destroy.
> + *
> + * This function destroys all heap contexts and their resources, thus
> + * preventing any use of the heap contexts or the chunks attached to them
> + * after that point.
> + *
> + * If the GPU still has access to some heap contexts, a fault should be
> + * triggered, which should flag the command stream groups using these
> + * contexts as faulty.
> + *
> + * The heap pool object is only released when all references to this pool
> + * are released.
> + */
> +void panthor_heap_pool_destroy(struct panthor_heap_pool *pool)
> +{
> +	struct panthor_heap *heap;
> +	unsigned long i;
> +
> +	down_write(&pool->lock);
> +	xa_for_each(&pool->xa, i, heap)
> +		drm_WARN_ON(&pool->ptdev->base, panthor_heap_destroy_locked(pool, i));
> +
> +	if (!IS_ERR_OR_NULL(pool->bo))
> +		panthor_gem_unmap_and_put(pool->vm, pool->bo, pool->gpu_va, pool->gpu_contexts);
> +
> +	/* Reflects the fact the pool has been destroyed. */
> +	pool->vm = NULL;
> +	up_write(&pool->lock);
> +
> +	panthor_heap_pool_put(pool);
> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_heap.h b/drivers/gpu/drm/panthor/panthor_heap.h
> new file mode 100644
> index 000000000000..ff6ebdcd412e
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_heap.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#ifndef __PANTHOR_HEAP_H__
> +#define __PANTHOR_HEAP_H__
> +
> +#include <linux/types.h>
> +
> +struct panthor_device;
> +struct panthor_heap_pool;
> +struct panthor_vm;
> +
> +int panthor_heap_create(struct panthor_heap_pool *pool,
> +			u32 initial_chunk_count,
> +			u32 chunk_size,
> +			u32 max_chunks,
> +			u32 target_in_flight,
> +			u64 *heap_ctx_gpu_va,
> +			u64 *first_chunk_gpu_va);
> +int panthor_heap_destroy(struct panthor_heap_pool *pool, u32 handle);
> +
> +struct panthor_heap_pool *
> +panthor_heap_pool_create(struct panthor_device *ptdev, struct panthor_vm *vm);
> +void panthor_heap_pool_destroy(struct panthor_heap_pool *pool);
> +
> +struct panthor_heap_pool *
> +panthor_heap_pool_get(struct panthor_heap_pool *pool);
> +void panthor_heap_pool_put(struct panthor_heap_pool *pool);
> +
> +int panthor_heap_grow(struct panthor_heap_pool *pool,
> +		      u64 heap_gpu_va,
> +		      u32 renderpasses_in_flight,
> +		      u32 pending_frag_count,
> +		      u64 *new_chunk_gpu_va);
> +
> +#endif
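
For completeness, here is a rough usage sketch (not from the patch) of the
heap API declared above, assuming a device and VM have already been set up
elsewhere and made-up sizing parameters:

  #include <linux/err.h>
  #include <linux/sizes.h>

  #include "panthor_heap.h"

  static int example_heap_usage(struct panthor_device *ptdev,
				struct panthor_vm *vm)
  {
	struct panthor_heap_pool *pool;
	u64 heap_ctx_gpu_va, first_chunk_gpu_va, new_chunk_gpu_va;
	int handle, ret;

	pool = panthor_heap_pool_create(ptdev, vm);
	if (IS_ERR(pool))
		return PTR_ERR(pool);

	/* One initial 512K chunk, at most 16 chunks, 8 in-flight passes. */
	handle = panthor_heap_create(pool, 1, SZ_512K, 16, 8,
				     &heap_ctx_gpu_va, &first_chunk_gpu_va);
	if (handle < 0) {
		ret = handle;
		goto out_destroy_pool;
	}

	/* Later, on a tiler OOM event: -EBUSY means "let the FW wait". */
	ret = panthor_heap_grow(pool, heap_ctx_gpu_va, 2, 1,
				&new_chunk_gpu_va);

	panthor_heap_destroy(pool, handle);

  out_destroy_pool:
	panthor_heap_pool_destroy(pool);
	return ret;
  }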



* Re: [PATCH v2 11/15] drm/panthor: Add the scheduler logical block
  2023-08-09 16:53 ` [PATCH v2 11/15] drm/panthor: Add the scheduler " Boris Brezillon
@ 2023-08-18 15:38   ` Steven Price
  2023-08-29 16:36     ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-18 15:38 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> This is the piece of software interacting with the FW scheduler, and
> taking care of some scheduling aspects when the FW comes short of
> scheduling slots. Indeed, the FW only exposes a few slots, and the kernel
> has to give all submission contexts a chance to execute their jobs.
> 
> The kernel-side scheduler is timeslice-based, with a round-robin queue
> per priority level.
> 
> Job submission is handled with a 1:1 drm_sched_entity:drm_gpu_scheduler,
> allowing us to delegate the dependency tracking to the core.
> 
> All the gory details should be documented inline.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Rename the file (_mcu -> _fw)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Document the code
> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> - Move the ping logic to panthor_fw.c
> - Fix various bugs
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Mostly typos below, but there is possibly inverted logic in
sched_queue_work() (and sched_queue_delayed_work()).

> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 3272 +++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_sched.h |   50 +
>  2 files changed, 3322 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
>  create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> new file mode 100644
> index 000000000000..c1a516454e5d
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -0,0 +1,3272 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2023 Collabora ltd. */
> +
> +#ifdef CONFIG_ARM_ARCH_TIMER
> +#include <asm/arch_timer.h>
> +#endif
> +
> +#include <drm/panthor_drm.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_gem_shmem_helper.h>
> +#include <drm/drm_managed.h>
> +#include <drm/gpu_scheduler.h>
> +
> +#include <linux/build_bug.h>
> +#include <linux/clk.h>
> +#include <linux/delay.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/firmware.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/iopoll.h>
> +#include <linux/iosys-map.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/dma-resv.h>
> +
> +#include "panthor_sched.h"
> +#include "panthor_devfreq.h"
> +#include "panthor_device.h"
> +#include "panthor_gem.h"
> +#include "panthor_heap.h"
> +#include "panthor_regs.h"
> +#include "panthor_gpu.h"
> +#include "panthor_fw.h"
> +#include "panthor_mmu.h"
> +
> +/**
> + * DOC: Scheduler
> + *
> + * Mali CSF hardware adopts a firmware-assited scheduling model, where

s/assited/assisted/

> + * the firmware takes care of scheduling aspects, to some extent.
> + *
> + * The scheduling happens at the scheduling group level, each group
> + * contains 1 to N queues (N is FW/hardware dependent, and exposed
> + * through the firmware interface). Each queue is assigned a command
> + * stream ring buffer, which serves as a way to get jobs submitted to
> + * the GPU, among other things.
> + *
> + * The firmware can schedule a maximum of M groups (M is FW/hardware
> + * dependent, and exposed through the firmware interface). Past
> + * this maximum number of groups, the kernel must take care of
> + * rotating the groups passed to the firmware so every group gets
> + * a chance to have its queues scheduled for execution.
> + *
> + * The current implementation only supports kernel-mode queues.
> + * In other terms, userspace doesn't have access to the ring-buffer.
> + * Instead, userspace passes indirect command stream buffers that are
> + * called from the queue ring-buffer by the kernel using a pre-defined
> + * sequence of command stream instructions to ensure the userspace driver
> + * always gets consistent results (cache maintenance,
> + * synchronization, ...).
> + *
> + * We rely on the drm_gpu_scheduler framework to deal with job
> + * dependencies and submission. As any other driver dealing with a
> + * FW-scheduler, we use the 1:1 entity:scheduler mode, such that each
> + * entity has its own job scheduler. When a job is ready to be executed
> + * (all its dependencies are met), it is pushed to the appropriate
> + * queue ring-buffer, and the group is scheduled for execution if it
> + * wasn't already active.
> + *
> + * Kernel-side group scheduling is timeslice-based. When we have less
> + * groups than there are slots, the periodic tick is disabled and we
> + * just let the FW schedule the active groups. When there are more
> + * groups than slots, we give each group a chance to execute stuff for
> + * a given amount of time, and then re-evaluate and pick new groups
> + * to schedule. The group selection algorithm is based on
> + * priority+round-robin.
> + *
> + * Even though user-mode queues are out of scope right now, the
> + * current design takes them into account by avoiding any guess on the
> + * group/queue state that would be based on information we wouldn't have
> + * if userspace was in charge of the ring-buffer. That's also one of the
> + * reasons we don't do 'cooperative' scheduling (encoding FW group slot
> + * reservation as dma_fence that would be returned from the
> + * drm_gpu_scheduler::prepare_job() hook, and treating group rotation as
> + * a queue of waiters, ordered by job submission order). This approach
> + * would work for kernel-mode queues, but would make user-mode queues a
> + * lot more complicated to retrofit.
> + */
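
To illustrate the 1:1 entity:scheduler model described above, here is a
hypothetical skeleton of per-queue drm_sched backend ops (not the patch's
actual callbacks, just the shape of the integration):

  #include <drm/gpu_scheduler.h>

  static struct dma_fence *example_run_job(struct drm_sched_job *sched_job)
  {
	/* Copy the call into the queue ring buffer, ring the doorbell,
	 * and return the fence that will be signaled on completion.
	 */
	return NULL; /* placeholder */
  }

  static enum drm_gpu_sched_stat
  example_timedout_job(struct drm_sched_job *sched_job)
  {
	/* Mark the owning group as timed out and kick a scheduler tick. */
	return DRM_GPU_SCHED_STAT_NOMINAL;
  }

  static void example_free_job(struct drm_sched_job *sched_job)
  {
	/* Drop the reference taken at submission time. */
  }

  static const struct drm_sched_backend_ops example_queue_sched_ops = {
	.run_job	= example_run_job,
	.timedout_job	= example_timedout_job,
	.free_job	= example_free_job,
  };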
> +
> +#define JOB_TIMEOUT_MS				5000
> +
> +#define MIN_CS_PER_CSG				8
> +
> +#define MIN_CSGS				3
> +#define MAX_CSG_PRIO				0xf
> +
> +struct panthor_group;
> +
> +/**
> + * struct panthor_csg_slot - Command stream group slot
> + *
> + * This represents a FW slot for a scheduling group.
> + */
> +struct panthor_csg_slot {
> +	/** @group: Scheduling group bound to this slot. */
> +	struct panthor_group *group;
> +
> +	/** @priority: Group priority. */
> +	u8 priority;
> +
> +	/**
> +	 * @idle: True if the group bound to this slot is idle.
> +	 *
> +	 * A group is idle when it has nothing waiting for execution on
> +	 * all its queues, or when queues are blocked waiting for something
> +	 * to happen (synchronization object).
> +	 */
> +	bool idle;
> +};
> +
> +/**
> + * enum panthor_csg_priority - Group priority
> + */
> +enum panthor_csg_priority {
> +	/** @PANTHOR_CSG_PRIORITY_LOW: Low priority group. */
> +	PANTHOR_CSG_PRIORITY_LOW = 0,
> +
> +	/** @PANTHOR_CSG_PRIORITY_MEDIUM: Medium priority group. */
> +	PANTHOR_CSG_PRIORITY_MEDIUM,
> +
> +	/** @PANTHOR_CSG_PRIORITY_HIGH: High priority group. */
> +	PANTHOR_CSG_PRIORITY_HIGH,
> +
> +	/**
> +	 * @PANTHOR_CSG_PRIORITY_RT: Real-time priority group.
> +	 *
> +	 * Real-time prioty allows one to preempt scheduling of other

priority

> +	 * non-real-time groups. When such a group becomes executable,
> +	 * it will evict the group with the lowest non-rt priority if
> +	 * there's no free group slot available.
> +	 *
> +	 * Currently not exposed to userspace.
> +	 */
> +	PANTHOR_CSG_PRIORITY_RT,
> +
> +	/** @PANTHOR_CSG_PRIORITY_COUNT: Number of priority levels. */
> +	PANTHOR_CSG_PRIORITY_COUNT,
> +};
> +
> +/**
> + * struct panthor_scheduler - Object used to manage the scheduler
> + */
> +struct panthor_scheduler {
> +	/** @ptdev: Device. */
> +	struct panthor_device *ptdev;
> +	/**
> +	 * @wq: Worqueue passed to the drm_gpu_scheduler.

s/Worqueue/Workqueue

> +	 *
> +	 * Used to submit/cleanup jobs.
> +	 */
> +	struct workqueue_struct *wq;
> +
> +	/** @tick_work: Work executed on a scheduling tick. */
> +	struct delayed_work tick_work;
> +
> +	/**
> +	 * @sync_upd_work: Work used to process synchronization object updates.
> +	 *
> +	 * We use this work to unblock queues/groups that were waiting on a
> +	 * synchronization object.
> +	 */
> +	struct work_struct sync_upd_work;
> +
> +	/**
> +	 * @resched_target: When the next tick should occur.
> +	 *
> +	 * Expressed in jiffies.
> +	 */
> +	u64 resched_target;
> +
> +	/**
> +	 * @last_tick: When the last tick occurred.
> +	 *
> +	 * Expressed in jiffies.
> +	 */
> +	u64 last_tick;
> +
> +	/** @tick_period: Tick period in jiffies. */
> +	u64 tick_period;
> +
> +	/**
> +	 * @lock: Lock protecting access to all the scheduler fields.
> +	 *
> +	 * Should be taken in the tick work, the irq handler, and anywhere the @groups
> +	 * fields are touched.
> +	 */
> +	struct mutex lock;
> +
> +	/** @groups: Various lists used to classify groups. */
> +	struct {
> +		/**
> +		 * @runnable: Runnable group lists.
> +		 *
> +		 * When a group has queues that want to execute something,
> +		 * its panthor_group::run_node should be inserted here.
> +		 *
> +		 * One list per-priority.
> +		 */
> +		struct list_head runnable[PANTHOR_CSG_PRIORITY_COUNT];
> +
> +		/**
> +		 * @idle: Idle group lists.
> +		 *
> +		 * When all queues of a group are idle (either because they
> +		 * have nothing to execute, or because they are blocked), the
> +		 * panthor_group::run_node field should be inserted here.
> +		 *
> +		 * One list per-priority.
> +		 */
> +		struct list_head idle[PANTHOR_CSG_PRIORITY_COUNT];
> +
> +		/**
> +		 * @waiting: List of groups whose queues are blocked on a
> +		 * synchronization object.
> +		 *
> +		 * Insert panthor_group::wait_node here when a group is waiting
> +		 * for synchronization objects to be signaled.

s/signaled/signalled/

> +		 *
> +		 * This list is evaluated in the @sync_upd_work work.
> +		 */
> +		struct list_head waiting;
> +	} groups;
> +
> +	/**
> +	 * @csg_slots: FW command stream group slots.
> +	 */
> +	struct panthor_csg_slot csg_slots[MAX_CSGS];
> +
> +	/** @csg_slot_count: Number of command stream group slots exposed by the FW. */
> +	u32 csg_slot_count;
> +
> +	/** @cs_slot_count: Number of command stream slot per group slot exposed by the FW. */
> +	u32 cs_slot_count;
> +
> +	/** @as_slot_count: Number of address space slots supported by the MMU. */
> +	u32 as_slot_count;
> +
> +	/** @used_csg_slot_count: Number of command stream group slot currently used. */
> +	u32 used_csg_slot_count;
> +
> +	/** @sb_slot_count: Number of scoreboard slots. */
> +	u32 sb_slot_count;
> +
> +	/**
> +	 * @might_have_idle_groups: True if an active group might have become idle.
> +	 *
> +	 * This will force a tick, so other runnable groups can be scheduler if one

s/scheduler/scheduled/

> +	 * or more active groups became idle.
> +	 */
> +	bool might_have_idle_groups;
> +
> +	/** @pm: Power management related fields. */
> +	struct {
> +		/** @has_ref: True if the scheduler owns a runtime PM reference. */
> +		bool has_ref;
> +	} pm;
> +
> +	/** @reset: Reset related fields. */
> +	struct {
> +		/** @lock: Lock protecting the other reset fields. */
> +		struct mutex lock;
> +
> +		/**
> +		 * @in_progress: True if a reset is in progress.
> +		 *
> +		 * Set to true in panthor_sched_pre_reset() and back to false in
> +		 * panthor_sched_post_reset().
> +		 */
> +		bool in_progress;
> +
> +		/**
> +		 * @stopped_groups: List containing all groups that were stopped
> +		 * before a reset.
> +		 *
> +		 * Insert panthor_group::run_node in the pre_reset path.
> +		 */
> +		struct list_head stopped_groups;
> +	} reset;
> +};
> +
> +/**
> + * struct panthor_syncobj_32b - 32-bit FW synchronization object
> + */
> +struct panthor_syncobj_32b {
> +	/** @seqno: Sequence number. */
> +	u32 seqno;
> +
> +	/**
> +	 * @status: Status.
> +	 *
> +	 * Not zero on failure.
> +	 */
> +	u32 status;
> +};
> +
> +/**
> + * struct panthor_syncobj_64b - 64-bit FW synchronization object
> + */
> +struct panthor_syncobj_64b {
> +	/** @seqno: Sequence number. */
> +	u64 seqno;
> +
> +	/**
> +	 * @status: Status.
> +	 *
> +	 * Not zero on failure.
> +	 */
> +	u32 status;
> +
> +	/** @pad: MBZ. */
> +	u32 pad;
> +};
> +
> +/**
> + * struct panthor_queue - Execution queue
> + */
> +struct panthor_queue {
> +	/** @scheduler: DRM scheduler used for this queue. */
> +	struct drm_gpu_scheduler scheduler;
> +
> +	/** @entity: DRM scheduling entity used for this queue. */
> +	struct drm_sched_entity entity;
> +
> +	/**
> +	 * @remaining_time: Time remaining before the job timeout expires.
> +	 *
> +	 * The job timeout is suspended when the is not scheduled by the
                                             ^^^^^^
"the queue is"?

> +	 * FW. Every time we suspend the timer, we need to save the remaining
> +	 * time so we can restore it later on.
> +	 */
> +	unsigned long remaining_time;
> +
> +	/** @timeout_suspended: True if the job timeout was suspended. */
> +	bool timeout_suspended;
> +
> +	/**
> +	 * @doorbell_id: Doorbell assigned to this queue.
> +	 *
> +	 * Right now, all groups share the same doorbell, and the doorbell ID
> +	 * is assigned to group_slot + 1 when the group is assigned a slot. But
> +	 * we might decide to provide fine grained doorbell assignment at some
> +	 * point, so we don't have to wake up all queues in a group every time one
> +	 * of them is updated.
> +	 */
> +	u8 doorbell_id;
> +
> +	/**
> +	 * @priority: Priority of the queue inside the group.
> +	 *
> +	 * Must be less than 16 (Only 4 bits available).
> +	 */
> +	u8 priority;
> +#define CSF_MAX_QUEUE_PRIO	GENMASK(3, 0)
> +
> +	/** @ringbuf: Command stream ring-buffer fields. */
> +	struct {
> +		/** @bo: Buffer object for the ring-buffer. */
> +		struct panthor_gem_object *bo;
> +
> +		/** @gpu_va: GPU virtual address. */
> +		u64 gpu_va;
> +
> +		/** @kmap: Kernel mapping of the ring buffer. */
> +		u64 *kmap;
> +	} ringbuf;
> +
> +	/** @iface: Firmware interface. */
> +	struct {
> +		/** @mem: FW memory allocated for this interface. */
> +		struct panthor_fw_mem *mem;
> +
> +		/** @input: Input interface. */
> +		struct panthor_fw_ringbuf_input_iface *input;
> +
> +		/** @output: Output interface. */
> +		const struct panthor_fw_ringbuf_output_iface *output;
> +	} iface;
> +
> +	/**
> +	 * @syncwait: Stores information about the synchronization object this
> +	 * queue is waiting on.
> +	 */
> +	struct {
> +		/** @gpu_va: GPU address of the synchronization object. */
> +		u64 gpu_va;
> +
> +		/** @ref: Reference value to compare against. */
> +		u64 ref;
> +
> +		/** @gt: True is this is a greater-than test. */

s/True is/True if/

> +		bool gt;
> +
> +		/** @sync64: True if this is a 64-bit sync object. */
> +		bool sync64;
> +
> +		/** @bo: Buffer object holding the synchronization object. */
> +		struct panthor_gem_object *bo;
> +
> +		/** @offset: Offset of the synchronization object inside @bo. */
> +		u64 offset;
> +
> +		/**
> +		 * @kmap: Kernel mapping of the buffer object holding the
> +		 * synchronization object.
> +		 */
> +		void *kmap;
> +	} syncwait;
> +
> +	/** @fence_ctx: Fence context fields. */
> +	struct {
> +		/** @lock: Used to protect access to all fences allocated by this context. */
> +		spinlock_t lock;
> +
> +		/**
> +		 * @id: Fence context ID.
> +		 *
> +		 * Allocated with dma_fence_context_alloc().
> +		 */
> +		u64 id;
> +
> +		/** @seqno: Sequence number of the last initialized fence. */
> +		atomic64_t seqno;
> +
> +		/**
> +		 * @in_flight_jobs: List containing all in-flight jobs.
> +		 *
> +		 * Used to keep track and signal panthor_job::done_fence when the
> +		 * synchronization object attached to the queue is signaled.

s/signaled/signalled/

> +		 */
> +		struct list_head in_flight_jobs;
> +	} fence_ctx;
> +};
> +
> +/**
> + * enum panthor_group_state - Scheduling group state.
> + */
> +enum panthor_group_state {
> +	/** @PANTHOR_CS_GROUP_CREATED: Group was created, but not scheduled yet. */
> +	PANTHOR_CS_GROUP_CREATED,
> +
> +	/** @PANTHOR_CS_GROUP_ACTIVE: Group is currently scheduled. */
> +	PANTHOR_CS_GROUP_ACTIVE,
> +
> +	/**
> +	 * @PANTHOR_CS_GROUP_SUSPENDED: Group was scheduled at least once, but is
> +	 * inactive/suspended right now.
> +	 */
> +	PANTHOR_CS_GROUP_SUSPENDED,
> +
> +	/**
> +	 * @PANTHOR_CS_GROUP_TERMINATED: Group was terminated.
> +	 *
> +	 * Can no longer be scheduled. The only allowed action is a destruction.
> +	 */
> +	PANTHOR_CS_GROUP_TERMINATED,
> +};
> +
> +/**
> + * struct panthor_group - Scheduling group object
> + */
> +struct panthor_group {
> +	/** @refcount: Reference count */
> +	struct kref refcount;
> +
> +	/** @ptdev: Device. */
> +	struct panthor_device *ptdev;
> +
> +	/** @vm: VM bound to the group. */
> +	struct panthor_vm *vm;
> +
> +	/** @compute_core_mask: Mask of shader cores that can be used for compute jobs. */
> +	u64 compute_core_mask;
> +
> +	/** @fragment_core_mask: Mask of shader cores that can be used for fragment jobs. */
> +	u64 fragment_core_mask;
> +
> +	/** @tiler_core_mask: Mask of tiler cores that can be used for tiler jobs. */
> +	u64 tiler_core_mask;
> +
> +	/** @max_compute_cores: Maximum number of shader cores used for compute jobs. */
> +	u8 max_compute_cores;
> +
> +	/** @max_fragment_cores: Maximum number of shader cores used for fragment jobs. */
> +	u8 max_fragment_cores;
> +
> +	/** @max_tiler_cores: Maximum number of tiler cores used for tiler jobs. */
> +	u8 max_tiler_cores;
> +
> +	/** @priority: Group priority (check panthor_csg_priority). */
> +	u8 priority;
> +
> +	/** @blocked_queues: Bitmask reflecting the blocked queues. */
> +	u32 blocked_queues;
> +
> +	/** @idle_queues: Bitmask reflecting the idle queues. */
> +	u32 idle_queues;
> +
> +	/** @fatal_lock: Lock used to protect access to fatal fields. */
> +	spinlock_t fatal_lock;
> +
> +	/** @fatal_queues: Bitmask reflecting the queues that hit a fatal exception. */
> +	u32 fatal_queues;
> +
> +	/** @queue_count: Number of queues in this group. */
> +	u32 queue_count;
> +
> +	/** @queues: Queues owned by this group. */
> +	struct panthor_queue *queues[MAX_CS_PER_CSG];
> +
> +	/**
> +	 * @csg_id: ID of the FW group slot.
> +	 *
> +	 * -1 when the group is not scheduled/active.
> +	 */
> +	int csg_id;
> +
> +	/**
> +	 * @destroyed: True when the group has been destroyed.
> +	 *
> +	 * If a group is destroyed it becomes useless: no further jobs can be submitted
> +	 * to its queues. We simply wait for all references to be dropped so we can
> +	 * release the group object.
> +	 */
> +	bool destroyed;
> +
> +	/**
> +	 * @timedout: True when a timeout occurred on any of the queues owned by
> +	 * this group.
> +	 *
> +	 * Timeouts can be reported by drm_sched or by the FW. In any case, any
> +	 * timeout situation in unrecoverable, and the group becomes useless.

s/in/is/

> +	 * We simply wait for all references to be dropped so we can release the
> +	 * group object.
> +	 */
> +	bool timedout;
> +
> +	/**
> +	 * @syncobjs: Pool of per-queue synchronization objects.
> +	 *
> +	 * One sync object per queue. The position of the sync object is
> +	 * determined by the queue index.
> +	 */
> +	struct {
> +		/** @bo: Buffer object containing these synchronization objects. */
> +		struct panthor_gem_object *bo;
> +
> +		/** @gpu_va: GPU address of the sync object pool */
> +		u64 gpu_va;
> +
> +		/** @kmap: The kernel mapping of the sync object pool. */
> +		void *kmap;
> +	} syncobjs;
> +
> +	/** @state: Group state. */
> +	enum panthor_group_state state;
> +
> +	/**
> +	 * @suspend_buf: Suspend buffer.
> +	 *
> +	 * Stores the state of the group and its queues when a group is suspended.
> +	 * Used at resume time to restore the group in its previous state.
> +	 *
> +	 * The size of the suspend buffer is exposed through the FW interface.
> +	 */
> +	struct panthor_fw_mem *suspend_buf;
> +
> +	/**
> +	 * @protm_suspend_buf: Protection mode suspend buffer.
> +	 *
> +	 * Stores the state of the group and its queues when a group that's in
> +	 * protection mode is suspended.
> +	 *
> +	 * Used at resume time to restore the group in its previous state.
> +	 *
> +	 * The size of the protection mode suspend buffer is exposed through the
> +	 * FW interface.
> +	 */
> +	struct panthor_fw_mem *protm_suspend_buf;
> +
> +	/** @sync_upd_work: Work used to check/signal job fences. */
> +	struct work_struct sync_upd_work;
> +
> +	/** @term_work: Work used to finish the group termination procedure. */
> +	struct work_struct term_work;
> +
> +	/**
> +	 * @release_work: Work used to release group resources.
> +	 *
> +	 * We need to postpone the group release to avoid a deadlock when
> +	 * the last ref is released in the tick work.
> +	 */
> +	struct work_struct release_work;
> +
> +	/**
> +	 * @run_node: Node used to insert the group in the
> +	 * panthor_group::groups::{runnable,idle} and
> +	 * panthor_group::reset.stopped_groups lists.
> +	 */
> +	struct list_head run_node;
> +
> +	/**
> +	 * @wait_node: Node used to insert the group in the
> +	 * panthor_group::groups::waiting list.
> +	 */
> +	struct list_head wait_node;
> +};
> +
> +/**
> + * group_queue_work() - Queue a group work
> + * @group: Group to queue the work for.
> + * @wname: Work name.
> + *
> + * Grabs a ref and queues a work item to the scheduler workqueue. If
> + * the work was already queued, we release the reference we grabbed.
> + *
> + * Work callbacks must release the reference we grabbed here.
> + */
> +#define group_queue_work(group, wname) \
> +	do { \
> +		group_get(group); \
> +		if (!queue_work((group)->ptdev->scheduler->wq, &(group)->wname ## _work)) \
> +			group_put(group); \
> +	} while (0)
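
A hypothetical work handler honouring that contract (not taken from the
patch, using the sync_upd work as an example) would end with a group_put():

  static void example_group_work(struct work_struct *work)
  {
	struct panthor_group *group =
		container_of(work, struct panthor_group, sync_upd_work);

	/* ... process the event ... */

	/* Drop the reference taken by group_queue_work(). */
	group_put(group);
  }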
> +
> +/**
> + * sched_queue_work() - Queue a scheduler work.
> + * @sched: Scheduler object.
> + * @wname: Work name.
> + *
> + * Conditionally queues a scheduler work if no reset is pending/in-progress.
> + */
> +#define sched_queue_work(sched, wname) \
> +	do { \
> +		if (sched->reset.in_progress || \

Is this missing a '!'? This executes if a reset is in progress.

> +		    !panthor_device_reset_is_pending((sched)->ptdev)) \
> +			queue_work((sched)->wq, &(sched)->wname ## _work); \
> +	} while (0)
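
If the intent matches the kerneldoc above (only queue when no reset is
pending or in progress), the condition would presumably need to read
something like:

	if (!sched->reset.in_progress && \
	    !panthor_device_reset_is_pending((sched)->ptdev)) \
		queue_work((sched)->wq, &(sched)->wname ## _work); \

but that's only a guess at the intended behaviour.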
> +
> +/**
> + * sched_queue_work() - Queue a scheduler delayed work.

s/sched_queue_work/sched_queue_delayed_work/

> + * @sched: Scheduler object.
> + * @wname: Work name.
> + * @delay: Work delay in jiffies.
> + *
> + * Conditionally queues a scheduler delayed work if no reset is
> + * pending/in-progress.
> + */
> +#define sched_queue_delayed_work(sched, wname, delay) \
> +	do { \
> +		if (sched->reset.in_progress || \

Ditto

> +		    !panthor_device_reset_is_pending((sched)->ptdev)) \
> +			mod_delayed_work((sched)->wq, &(sched)->wname ## _work, delay); \
> +	} while (0)
> +
> +/*
> + * We currently set the maximum number of groups per file to an arbitrarily
> + * low value. But this can be updated if we need more.
> + */
> +#define MAX_GROUPS_PER_POOL 128
> +
> +/**
> + * struct panthor_group_pool - Group pool
> + *
> + * Each file gets assigned a group pool.
> + */
> +struct panthor_group_pool {
> +	/** @xa: Xarray used to manage group handles. */
> +	struct xarray xa;
> +};
> +
> +/**
> + * struct panthor_job - Used to manage GPU job
> + */
> +struct panthor_job {
> +	/** @base: Inherit from drm_sched_job. */
> +	struct drm_sched_job base;
> +
> +	/** @refcount: Reference count. */
> +	struct kref refcount;
> +
> +	/** @group: Group of the queue this job will be pushed to. */
> +	struct panthor_group *group;
> +
> +	/** @queue_idx: Index of the queue inside @group. */
> +	u32 queue_idx;
> +
> +	/** @call_info: Information about the userspace command stream call. */
> +	struct {
> +		/** @start: GPU address of the userspace command stream. */
> +		u64 start;
> +
> +		/** @size: Size of the userspace command stream. */
> +		u32 size;
> +
> +		/**
> +		 * @latest_flush: Flush ID at the time the userspace command
> +		 * stream was built.
> +		 *
> +		 * Needed for the flush reduction mechanism.
> +		 */
> +		u32 latest_flush;
> +	} call_info;
> +
> +	/** @ringbuf: Position of this job in the ring buffer. */
> +	struct {
> +		/** @start: Start offset. */
> +		u64 start;
> +
> +		/** @end: End offset. */
> +		u64 end;
> +	} ringbuf;
> +
> +	/**
> +	 * @node: Used to insert the job in the panthor_queue::fence_ctx::in_flight_jobs
> +	 * list.
> +	 */
> +	struct list_head node;
> +
> +	/** @done_fence: Fence signaled when the job is finished or cancelled. */

s/signaled/signalled/ (worth a global search ;) )

> +	struct dma_fence *done_fence;
> +};
> +
> +static void group_free_queue(struct panthor_group *group, u32 idx)
> +{
> +	struct panthor_queue *queue = group->queues[idx];
> +
> +	if (IS_ERR_OR_NULL(queue))
> +		return;
> +
> +	if (queue->entity.fence_context)
> +		drm_sched_entity_destroy(&queue->entity);
> +
> +	if (queue->scheduler.ops)
> +		drm_sched_fini(&queue->scheduler);
> +
> +	if (queue->syncwait.bo) {
> +		panthor_gem_unmap_and_put(group->vm, queue->syncwait.bo,
> +					  queue->syncwait.gpu_va,
> +					  queue->syncwait.kmap);
> +	}
> +
> +	if (!IS_ERR_OR_NULL(queue->ringbuf.bo)) {
> +		panthor_gem_unmap_and_put(group->vm, queue->ringbuf.bo,
> +					  queue->ringbuf.gpu_va,
> +					  queue->ringbuf.kmap);
> +	}
> +
> +	panthor_fw_mem_free(group->ptdev, queue->iface.mem);
> +	kfree(queue);
> +}
> +
> +static void group_release_work(struct work_struct *work)
> +{
> +	struct panthor_group *group = container_of(work,
> +						   struct panthor_group,
> +						   release_work);
> +	struct panthor_device *ptdev = group->ptdev;
> +	u32 i;
> +
> +	for (i = 0; i < group->queue_count; i++)
> +		group_free_queue(group, i);
> +
> +	if (group->suspend_buf)
> +		panthor_fw_mem_free(ptdev, group->suspend_buf);
> +
> +	if (group->protm_suspend_buf)
> +		panthor_fw_mem_free(ptdev, group->protm_suspend_buf);
> +
> +	if (!IS_ERR_OR_NULL(group->syncobjs.bo)) {
> +		panthor_gem_unmap_and_put(group->vm, group->syncobjs.bo,
> +					  group->syncobjs.gpu_va, group->syncobjs.kmap);
> +	}
> +
> +	panthor_vm_put(group->vm);
> +	kfree(group);
> +}
> +
> +static void group_release(struct kref *kref)
> +{
> +	struct panthor_group *group = container_of(kref,
> +						   struct panthor_group,
> +						   refcount);
> +	struct panthor_device *ptdev = group->ptdev;
> +
> +	drm_WARN_ON(&ptdev->base, group->csg_id >= 0);
> +	drm_WARN_ON(&ptdev->base, !list_empty(&group->run_node));
> +	drm_WARN_ON(&ptdev->base, !list_empty(&group->wait_node));
> +
> +	queue_work(panthor_cleanup_wq, &group->release_work);
> +}
> +
> +static void group_put(struct panthor_group *group)
> +{
> +	if (group)
> +		kref_put(&group->refcount, group_release);
> +}
> +
> +static struct panthor_group *
> +group_get(struct panthor_group *group)
> +{
> +	if (group)
> +		kref_get(&group->refcount);
> +
> +	return group;
> +}
> +
> +/**
> + * group_bind_locked() - Bind a group to a group slot
> + * @group: Group.
> + * @csg_id: Slot.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +group_bind_locked(struct panthor_group *group, u32 csg_id)
> +{
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct panthor_csg_slot *csg_slot;
> +	int ret;
> +
> +	if (drm_WARN_ON(&ptdev->base, group->csg_id != -1 || csg_id >= MAX_CSGS ||
> +			ptdev->scheduler->csg_slots[csg_id].group))
> +		return -EINVAL;
> +
> +	ret = panthor_vm_active(group->vm);
> +	if (ret)
> +		return ret;
> +
> +	csg_slot = &ptdev->scheduler->csg_slots[csg_id];
> +	group_get(group);
> +	group->csg_id = csg_id;
> +
> +	/* Dummy doorbell allocation: doorbell is assigned to the group and
> +	 * all queues use the same doorbell.
> +	 *
> +	 * TODO: Implement LRU-based doorbell assignment, so the most often
> +	 * updated queues get their own doorbell, thus avoiding useless checks
> +	 * on queues belonging to the same group that are rarely updated.
> +	 */
> +	for (u32 i = 0; i < group->queue_count; i++)
> +		group->queues[i]->doorbell_id = csg_id + 1;
> +
> +	csg_slot->group = group;
> +
> +	return 0;
> +}
> +
> +/**
> + * group_unbind_locked() - Unbind a group from a slot.
> + * @group: Group to unbind.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +group_unbind_locked(struct panthor_group *group)
> +{
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct panthor_csg_slot *slot;
> +
> +	if (drm_WARN_ON(&ptdev->base, group->csg_id < 0 || group->csg_id >= MAX_CSGS))
> +		return -EINVAL;
> +
> +	if (drm_WARN_ON(&ptdev->base, group->state == PANTHOR_CS_GROUP_ACTIVE))
> +		return -EINVAL;
> +
> +	slot = &ptdev->scheduler->csg_slots[group->csg_id];
> +	panthor_vm_idle(group->vm);
> +	group->csg_id = -1;
> +
> +	for (u32 i = 0; i < group->queue_count; i++)
> +		group->queues[i]->doorbell_id = -1;
> +
> +	slot->group = NULL;
> +
> +	group_put(group);
> +	return 0;
> +}
> +
> +/**
> + * cs_slot_prog_locked() - Program a queue slot
> + * @ptdev: Device.
> + * @csg_id: Group slot ID.
> + * @cs_id: Queue slot ID.
> + *
> + * Program a queue slot with the queue information so things can start being
> + * executed on this queue.
> + *
> + * The group slot must have a group bound to it already (group_bind_locked()).
> + */
> +static void
> +cs_slot_prog_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
> +{
> +	struct panthor_queue *queue = ptdev->scheduler->csg_slots[csg_id].group->queues[cs_id];
> +	struct panthor_fw_cs_iface *cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
> +
> +	queue->iface.input->extract = queue->iface.output->extract;
> +	drm_WARN_ON(&ptdev->base, queue->iface.input->insert < queue->iface.input->extract);
> +
> +	cs_iface->input->ringbuf_base = queue->ringbuf.gpu_va;
> +	cs_iface->input->ringbuf_size = queue->ringbuf.bo->base.base.size;
> +	cs_iface->input->ringbuf_input = panthor_fw_mem_va(queue->iface.mem);
> +	cs_iface->input->ringbuf_output = panthor_fw_mem_va(queue->iface.mem) + PAGE_SIZE;
> +	cs_iface->input->config = CS_CONFIG_PRIORITY(queue->priority) |
> +				  CS_CONFIG_DOORBELL(queue->doorbell_id);
> +	cs_iface->input->ack_irq_mask = ~0;
> +	panthor_fw_update_reqs(cs_iface, req,
> +			       CS_IDLE_SYNC_WAIT |
> +			       CS_IDLE_EMPTY |
> +			       CS_STATE_START |
> +			       CS_EXTRACT_EVENT,
> +			       CS_IDLE_SYNC_WAIT |
> +			       CS_IDLE_EMPTY |
> +			       CS_STATE_MASK |
> +			       CS_EXTRACT_EVENT);
> +	drm_sched_resume_timeout(&queue->scheduler, queue->remaining_time);
> +	if (queue->iface.input->insert != queue->iface.input->extract && queue->timeout_suspended) {
> +		drm_sched_resume_timeout(&queue->scheduler, queue->remaining_time);
> +		queue->timeout_suspended = false;
> +	}
> +}
> +
> +/**
> + * cs_slot_reset_locked() - Reset a queue slot
> + * @ptdev: Device.
> + * @csg_id: Group slot.
> + * @cs_id: Queue slot.
> + *
> + * Change the queue slot state to STOP and suspend the queue timeout if
> + * the queue is not blocked.
> + *
> + * The group slot must have a group bound to it (group_bind_locked()).
> + */
> +static int
> +cs_slot_reset_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
> +{
> +	struct panthor_fw_cs_iface *cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
> +	struct panthor_group *group = ptdev->scheduler->csg_slots[csg_id].group;
> +	struct panthor_queue *queue = group->queues[cs_id];
> +
> +	panthor_fw_update_reqs(cs_iface, req,
> +			       CS_STATE_STOP,
> +			       CS_STATE_MASK);
> +
> +	/* If the queue is blocked, we want to keep the timeout running, so
> +	 * we can detect unbounded waits and kill the group when that happens.
> +	 */
> +	if (!(group->blocked_queues & BIT(cs_id)) && !queue->timeout_suspended) {
> +		queue->remaining_time = drm_sched_suspend_timeout(&queue->scheduler);
> +		queue->timeout_suspended = true;
> +		WARN_ON(queue->remaining_time > msecs_to_jiffies(JOB_TIMEOUT_MS));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * csg_slot_sync_priority_locked() - Synchronize the group slot priority
> + * @ptdev: Device.
> + * @csg_id: Group slot ID.
> + *
> + * Group slot priority update happens asynchronously. When we receive a
> + * %CSG_ENDPOINT_CONFIG, we know the update is effective, and can
> + * reflect it to our panthor_csg_slot object.
> + */
> +static void
> +csg_slot_sync_priority_locked(struct panthor_device *ptdev, u32 csg_id)
> +{
> +	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
> +	struct panthor_fw_csg_iface *csg_iface;
> +
> +	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +	csg_slot->priority = (csg_iface->input->endpoint_req & CSG_EP_REQ_PRIORITY_MASK) >> 28;
> +}
> +
> +/**
> + * cs_slot_sync_queue_state_locked() - Synchronize the queue slot priority
> + * @ptdev: Device.
> + * @csg_id: Group slot.
> + * @cs_id: Queue slot.
> + *
> + * Queue state is updated on group suspend or STATUS_UPDATE event.
> + */
> +static void
> +cs_slot_sync_queue_state_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
> +{
> +	struct panthor_group *group = ptdev->scheduler->csg_slots[csg_id].group;
> +	struct panthor_queue *queue = group->queues[cs_id];
> +	struct panthor_fw_cs_iface *cs_iface =
> +		panthor_fw_get_cs_iface(group->ptdev, csg_id, cs_id);
> +
> +	u32 status_wait_cond;
> +
> +	switch (cs_iface->output->status_blocked_reason) {
> +	case CS_STATUS_BLOCKED_REASON_UNBLOCKED:
> +		if (queue->iface.input->insert == queue->iface.output->extract &&
> +		    cs_iface->output->status_scoreboards == 0)
> +			group->idle_queues |= BIT(cs_id);
> +		break;
> +
> +	case CS_STATUS_BLOCKED_REASON_SYNC_WAIT:
> +		drm_WARN_ON(&ptdev->base, !list_empty(&group->wait_node));
> +		list_move_tail(&group->wait_node, &group->ptdev->scheduler->groups.waiting);
> +		group->blocked_queues |= BIT(cs_id);
> +		queue->syncwait.gpu_va = cs_iface->output->status_wait_sync_ptr;
> +		queue->syncwait.ref = cs_iface->output->status_wait_sync_value;
> +		status_wait_cond = cs_iface->output->status_wait & CS_STATUS_WAIT_SYNC_COND_MASK;
> +		queue->syncwait.gt = status_wait_cond == CS_STATUS_WAIT_SYNC_COND_GT;
> +		if (cs_iface->output->status_wait & CS_STATUS_WAIT_SYNC_64B) {
> +			u64 sync_val_hi = cs_iface->output->status_wait_sync_value_hi;
> +
> +			queue->syncwait.sync64 = true;
> +			queue->syncwait.ref |= sync_val_hi << 32;
> +		} else {
> +			queue->syncwait.sync64 = false;
> +		}
> +		break;
> +
> +	default:
> +		/* Other reasons are not blocking. Consider the queue as runnable
> +		 * in those cases.
> +		 */
> +		break;
> +	}
> +}
> +
> +static void
> +csg_slot_sync_queues_state_locked(struct panthor_device *ptdev, u32 csg_id)
> +{
> +	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
> +	struct panthor_group *group = csg_slot->group;
> +	u32 i;
> +
> +	group->idle_queues = 0;
> +	group->blocked_queues = 0;
> +
> +	for (i = 0; i < group->queue_count; i++) {
> +		if (group->queues[i])
> +			cs_slot_sync_queue_state_locked(ptdev, csg_id, i);
> +	}
> +}
> +
> +static void
> +csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
> +{
> +	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
> +	struct panthor_fw_csg_iface *csg_iface;
> +	struct panthor_group *group;
> +	enum panthor_group_state new_state, old_state;
> +
> +	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +	group = csg_slot->group;
> +
> +	if (!group)
> +		return;
> +
> +	old_state = group->state;
> +	switch (csg_iface->output->ack & CSG_STATE_MASK) {
> +	case CSG_STATE_START:
> +	case CSG_STATE_RESUME:
> +		new_state = PANTHOR_CS_GROUP_ACTIVE;
> +		break;
> +	case CSG_STATE_TERMINATE:
> +		new_state = PANTHOR_CS_GROUP_TERMINATED;
> +		break;
> +	case CSG_STATE_SUSPEND:
> +		new_state = PANTHOR_CS_GROUP_SUSPENDED;
> +		break;
> +	}
> +
> +	if (old_state == new_state)
> +		return;
> +
> +	if (new_state == PANTHOR_CS_GROUP_SUSPENDED)
> +		csg_slot_sync_queues_state_locked(ptdev, csg_id);
> +
> +	if (old_state == PANTHOR_CS_GROUP_ACTIVE) {
> +		u32 i;
> +
> +		/* Reset the queue slots so we start from a clean
> +		 * state when starting/resuming a new group on this
> +		 * CSG slot. No wait needed here, and no ringbell
> +		 * either, since the CS slot will only be re-used
> +		 * on the next CSG start operation.
> +		 */
> +		for (i = 0; i < group->queue_count; i++) {
> +			if (group->queues[i])
> +				cs_slot_reset_locked(ptdev, csg_id, i);
> +		}
> +	}
> +
> +	group->state = new_state;
> +}
> +
> +static int
> +csg_slot_prog_locked(struct panthor_device *ptdev, u32 csg_id, u32 priority)
> +{
> +	struct panthor_fw_csg_iface *csg_iface;
> +	struct panthor_csg_slot *csg_slot;
> +	struct panthor_group *group;
> +	u32 queue_mask = 0, i;
> +
> +	if (priority > MAX_CSG_PRIO)
> +		return -EINVAL;
> +
> +	if (drm_WARN_ON(&ptdev->base, csg_id >= MAX_CSGS))
> +		return -EINVAL;
> +
> +	csg_slot = &ptdev->scheduler->csg_slots[csg_id];
> +	group = csg_slot->group;
> +	if (!group || group->state == PANTHOR_CS_GROUP_ACTIVE)
> +		return 0;
> +
> +	csg_iface = panthor_fw_get_csg_iface(group->ptdev, csg_id);
> +
> +	for (i = 0; i < group->queue_count; i++) {
> +		if (group->queues[i]) {
> +			cs_slot_prog_locked(ptdev, csg_id, i);
> +			queue_mask |= BIT(i);
> +		}
> +	}
> +
> +	csg_iface->input->allow_compute = group->compute_core_mask;
> +	csg_iface->input->allow_fragment = group->fragment_core_mask;
> +	csg_iface->input->allow_other = group->tiler_core_mask;
> +	csg_iface->input->endpoint_req = CSG_EP_REQ_COMPUTE(group->max_compute_cores) |
> +					 CSG_EP_REQ_FRAGMENT(group->max_fragment_cores) |
> +					 CSG_EP_REQ_TILER(group->max_tiler_cores) |
> +					 CSG_EP_REQ_PRIORITY(priority);
> +	csg_iface->input->config = panthor_vm_as(group->vm);
> +
> +	if (group->suspend_buf)
> +		csg_iface->input->suspend_buf = panthor_fw_mem_va(group->suspend_buf);
> +	else
> +		csg_iface->input->suspend_buf = 0;
> +
> +	if (group->protm_suspend_buf)
> +		csg_iface->input->protm_suspend_buf = panthor_fw_mem_va(group->protm_suspend_buf);
> +	else
> +		csg_iface->input->protm_suspend_buf = 0;
> +
> +	csg_iface->input->ack_irq_mask = ~0;
> +	panthor_fw_toggle_reqs(csg_iface, doorbell_req, doorbell_ack, queue_mask);
> +	return 0;
> +}
> +
> +static void
> +cs_slot_process_fatal_event(struct panthor_device *ptdev,
> +			    u32 csg_id, u32 cs_id)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> +	struct panthor_group *group = csg_slot->group;
> +	struct panthor_fw_cs_iface *csg_iface;
> +	struct panthor_fw_cs_iface *cs_iface;
> +	u32 fatal;
> +	u64 info;
> +
> +	csg_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
> +	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
> +	fatal = cs_iface->output->fatal;
> +	info = cs_iface->output->fatal_info;
> +	group->fatal_queues |= BIT(cs_id);
> +	sched_queue_delayed_work(sched, tick, 0);
> +	drm_warn(&ptdev->base,
> +		 "CSG slot %d CS slot: %d\n"
> +		 "CS_FATAL.EXCEPTION_TYPE: 0x%x (%s)\n"
> +		 "CS_FATAL.EXCEPTION_DATA: 0x%x\n"
> +		 "CS_FATAL_INFO.EXCEPTION_DATA: 0x%llx\n",
> +		 csg_id, cs_id,
> +		 (unsigned int)CS_EXCEPTION_TYPE(fatal),
> +		 panthor_exception_name(ptdev, CS_EXCEPTION_TYPE(fatal)),
> +		 (unsigned int)CS_EXCEPTION_DATA(fatal),
> +		 info);
> +}
> +
> +static void
> +cs_slot_process_fault_event(struct panthor_device *ptdev,
> +			    u32 csg_id, u32 cs_id)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> +	struct panthor_group *group = csg_slot->group;
> +	struct panthor_queue *queue = cs_id < group->queue_count ? group->queues[cs_id] : NULL;
> +	struct panthor_fw_cs_iface *cs_iface;
> +	u32 fault;
> +	u64 info;
> +
> +	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
> +	fault = cs_iface->output->fault;
> +	info = cs_iface->output->fault_info;
> +
> +	if (queue && CS_EXCEPTION_TYPE(fault) == DRM_PANTHOR_EXCEPTION_CS_INHERIT_FAULT) {
> +		u64 cs_extract = queue->iface.output->extract;
> +		struct panthor_job *job;
> +
> +		spin_lock(&queue->fence_ctx.lock);
> +		list_for_each_entry(job, &queue->fence_ctx.in_flight_jobs, node) {
> +			if (cs_extract >= job->ringbuf.end)
> +				continue;
> +
> +			if (cs_extract < job->ringbuf.start)
> +				break;
> +
> +			dma_fence_set_error(job->done_fence, -EINVAL);
> +		}
> +		spin_unlock(&queue->fence_ctx.lock);
> +	}
> +
> +	drm_warn(&ptdev->base,
> +		 "CSG slot %d CS slot: %d\n"
> +		 "CS_FAULT.EXCEPTION_TYPE: 0x%x (%s)\n"
> +		 "CS_FAULT.EXCEPTION_DATA: 0x%x\n"
> +		 "CS_FAULT_INFO.EXCEPTION_DATA: 0x%llx\n",
> +		 csg_id, cs_id,
> +		 (unsigned int)CS_EXCEPTION_TYPE(fault),
> +		 panthor_exception_name(ptdev, CS_EXCEPTION_TYPE(fault)),
> +		 (unsigned int)CS_EXCEPTION_DATA(fault),
> +		 info);
> +}
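
(As an aside, the loop above walks the in-flight jobs in ring-buffer order
and only flags the job whose [ringbuf.start, ringbuf.end) range contains the
current extract pointer: with made-up offsets, a job spanning [0x400, 0x800)
gets its done_fence marked -EINVAL when cs_extract is 0x500, jobs entirely
below the extract pointer are treated as already retired and skipped, and
the walk stops at the first job that hasn't been reached yet.)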
> +
> +static void
> +cs_slot_process_tiler_oom_event(struct panthor_device *ptdev,
> +				u32 csg_id, u32 cs_id)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> +	struct panthor_group *group = csg_slot->group;
> +	struct panthor_fw_cs_iface *cs_iface;
> +	struct panthor_heap_pool *heaps;
> +	struct panthor_queue *queue;
> +	u32 fault, vt_start, vt_end, frag_end;
> +	u32 renderpasses_in_flight, pending_frag_count;
> +	u64 info, heap_address, new_chunk_va;
> +	int ret;
> +
> +	if (drm_WARN_ON(&ptdev->base, !group))
> +		return;
> +
> +	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
> +	queue = group->queues[cs_id];
> +	heaps = panthor_vm_get_heap_pool(group->vm, false);
> +	fault = cs_iface->output->fault;
> +	info = cs_iface->output->fault_info;
> +	heap_address = cs_iface->output->heap_address;
> +	vt_start = cs_iface->output->heap_vt_start;
> +	vt_end = cs_iface->output->heap_vt_end;
> +	frag_end = cs_iface->output->heap_frag_end;
> +	renderpasses_in_flight = vt_start - frag_end;
> +	pending_frag_count = vt_end - frag_end;
> +
> +	if (!heaps || frag_end > vt_end || vt_end >= vt_start) {
> +		ret = -EINVAL;
> +	} else {
> +		ret = panthor_heap_grow(heaps, heap_address,
> +					renderpasses_in_flight,
> +					pending_frag_count, &new_chunk_va);
> +	}
> +
> +	if (!ret) {
> +		cs_iface->input->heap_start = new_chunk_va;
> +		cs_iface->input->heap_end = new_chunk_va;
> +	} else if (ret == -EBUSY) {
> +		cs_iface->input->heap_start = 0;
> +		cs_iface->input->heap_end = 0;
> +	} else {
> +		group->fatal_queues |= BIT(csg_id);
> +		sched_queue_delayed_work(sched, tick, 0);
> +	}
> +
> +	panthor_heap_pool_put(heaps);
> +}
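
(To make the arithmetic above concrete with made-up numbers: if
heap_vt_start = 12, heap_vt_end = 10 and heap_frag_end = 7, then there are
12 - 7 = 5 render passes in flight and 10 - 7 = 3 fragment jobs pending.
panthor_heap_grow() then refuses to allocate and returns -EBUSY once the
in-flight count exceeds target_in_flight, or once the chunk limit is hit
while fragment work is still pending, which tells the FW to wait for
fragment jobs to release memory instead.)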
> +
> +static bool cs_slot_process_irq(struct panthor_device *ptdev,
> +				u32 csg_id, u32 cs_id)
> +{
> +	struct panthor_fw_cs_iface *cs_iface;
> +	u32 req, ack, events;
> +
> +	cs_iface = panthor_fw_get_cs_iface(ptdev, csg_id, cs_id);
> +	req = cs_iface->input->req;
> +	ack = cs_iface->output->ack;
> +	events = (req ^ ack) & CS_EVT_MASK;
> +
> +	if (events & CS_FATAL)
> +		cs_slot_process_fatal_event(ptdev, csg_id, cs_id);
> +
> +	if (events & CS_FAULT)
> +		cs_slot_process_fault_event(ptdev, csg_id, cs_id);
> +
> +	if (events & CS_TILER_OOM)
> +		cs_slot_process_tiler_oom_event(ptdev, csg_id, cs_id);
> +
> +	panthor_fw_update_reqs(cs_iface, req, ack,
> +			       CS_FATAL | CS_FAULT | CS_TILER_OOM);
> +
> +	return (events & (CS_FAULT | CS_TILER_OOM)) != 0;
> +}
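
As an aside, the REQ/ACK handling above follows the usual toggle handshake:
an event is pending while its bit differs between the REQ and ACK words, and
acknowledging it means making the two copies equal again for the bits that
were handled, conceptually:

	u32 pending = (req ^ ack) & CS_EVT_MASK;	/* toggled bits */
	/* acknowledge only the events we actually processed */
	req = (req & ~handled_mask) | (ack & handled_mask);

(handled_mask here is just a stand-in for the CS_FATAL | CS_FAULT |
CS_TILER_OOM mask passed to panthor_fw_update_reqs() above.)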
> +
> +static void csg_slot_sync_idle_state_locked(struct panthor_device *ptdev, u32 csg_id)
> +{
> +	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
> +	struct panthor_fw_csg_iface *csg_iface;
> +
> +	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +	csg_slot->idle = csg_iface->output->status_state & CSG_STATUS_STATE_IS_IDLE;
> +}
> +
> +static void csg_slot_process_idle_event(struct panthor_device *ptdev, u32 csg_id)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +
> +	mutex_lock(&sched->lock);
> +	sched->might_have_idle_groups = true;
> +	mutex_unlock(&sched->lock);
> +
> +	/* Schedule a tick so we can evict idle groups and schedule non-idle
> +	 * ones. This will also update runtime PM and devfreq busy/idle states,
> +	 * so the device can lower its frequency or get suspended.
> +	 */
> +	sched_queue_delayed_work(sched, tick, 0);
> +}
> +
> +static void csg_slot_sync_update_locked(struct panthor_device *ptdev,
> +					u32 csg_id)
> +{
> +	struct panthor_csg_slot *csg_slot = &ptdev->scheduler->csg_slots[csg_id];
> +	struct panthor_group *group = csg_slot->group;
> +
> +	if (group)
> +		group_queue_work(group, sync_upd);
> +
> +	sched_queue_work(ptdev->scheduler, sync_upd);
> +}
> +
> +static void csg_slot_process_sync_update_event(struct panthor_device *ptdev,
> +					       u32 csg_id)
> +{
> +	mutex_lock(&ptdev->scheduler->lock);
> +	csg_slot_sync_update_locked(ptdev, csg_id);
> +	mutex_unlock(&ptdev->scheduler->lock);
> +}
> +
> +static void
> +csg_slot_process_progress_timer_event(struct panthor_device *ptdev, u32 csg_id)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> +	struct panthor_group *group = csg_slot->group;
> +
> +	drm_warn(&ptdev->base, "CSG slot %d progress timeout\n", csg_id);
> +
> +	mutex_lock(&sched->lock);
> +	group = csg_slot->group;
> +	if (!drm_WARN_ON(&ptdev->base, !group))
> +		group->timedout = true;
> +	mutex_unlock(&sched->lock);
> +
> +	sched_queue_delayed_work(sched, tick, 0);
> +}
> +
> +void panthor_sched_process_csg_irq(struct panthor_device *ptdev, u32 csg_id)
> +{
> +	u32 req, ack, cs_irq_req, cs_irq_ack, cs_irqs, csg_events;
> +	struct panthor_fw_csg_iface *csg_iface;
> +	u32 ring_cs_db_mask = 0;
> +
> +	if (drm_WARN_ON(&ptdev->base, csg_id >= ptdev->scheduler->csg_slot_count))
> +		return;
> +
> +	csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +	req = READ_ONCE(csg_iface->input->req);
> +	ack = READ_ONCE(csg_iface->output->ack);
> +	cs_irq_req = READ_ONCE(csg_iface->output->cs_irq_req);
> +	cs_irq_ack = READ_ONCE(csg_iface->input->cs_irq_ack);
> +	csg_events = (req ^ ack) & CSG_EVT_MASK;
> +
> +	/* There may not be any pending CSG/CS interrupts to process */
> +	if (req == ack && cs_irq_req == cs_irq_ack)
> +		return;
> +
> +	/* Immediately set IRQ_ACK bits to be same as the IRQ_REQ bits before
> +	 * examining the CS_ACK & CS_REQ bits. This would ensure that Host
> +	 * doesn't misses an interrupt for the CS in the race scenario where

s/misses/miss/

> +	 * whilst Host is servicing an interrupt for the CS, firmware sends
> +	 * another interrupt for that CS.
> +	 */
> +	csg_iface->input->cs_irq_ack = cs_irq_req;
> +
> +	panthor_fw_update_reqs(csg_iface, req, ack,
> +			       CSG_SYNC_UPDATE |
> +			       CSG_IDLE |
> +			       CSG_PROGRESS_TIMER_EVENT);
> +
> +	if (csg_events & CSG_IDLE)
> +		csg_slot_process_idle_event(ptdev, csg_id);
> +
> +	if (csg_events & CSG_PROGRESS_TIMER_EVENT)
> +		csg_slot_process_progress_timer_event(ptdev, csg_id);
> +
> +	cs_irqs = cs_irq_req ^ cs_irq_ack;
> +	while (cs_irqs) {
> +		u32 cs_id = ffs(cs_irqs) - 1;
> +
> +		if (cs_slot_process_irq(ptdev, csg_id, cs_id))
> +			ring_cs_db_mask |= BIT(cs_id);
> +
> +		cs_irqs &= ~BIT(cs_id);
> +	}
> +
> +	if (csg_events & CSG_SYNC_UPDATE)
> +		csg_slot_process_sync_update_event(ptdev, csg_id);
> +
> +	if (ring_cs_db_mask)
> +		panthor_fw_toggle_reqs(csg_iface, doorbell_req, doorbell_ack, ring_cs_db_mask);
> +
> +	panthor_fw_ring_csg_doorbells(ptdev, BIT(csg_id));
> +}
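
Not a change request, just a note for other reviewers on the req/ack
handshake used throughout this file: pending events are derived by XORing
the two mirrored registers and masking, and (if I'm reading
panthor_fw_update_reqs() right) acknowledging means copying the masked bits
from one side to the other so they agree again. A standalone sketch of the
idea (plain userspace C, made-up event names, not the actual register
layout):

	#include <stdint.h>
	#include <stdio.h>

	#define EVT_A    (1u << 0)
	#define EVT_B    (1u << 1)
	#define EVT_MASK (EVT_A | EVT_B)

	int main(void)
	{
		uint32_t host_side = 0x1, fw_side = 0x3; /* FW toggled EVT_B */
		uint32_t events = (host_side ^ fw_side) & EVT_MASK;

		printf("pending: %#x\n", events); /* 0x2 -> EVT_B */

		/* Ack: copy the handled bits so both sides agree again. */
		host_side = (host_side & ~EVT_MASK) | (fw_side & EVT_MASK);
		printf("pending after ack: %#x\n",
		       (host_side ^ fw_side) & EVT_MASK); /* 0 */
		return 0;
	}
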
> +
> +static void sched_process_idle_event(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +
> +	/* Acknowledge the idle event and schedule a tick. */
> +	panthor_fw_update_reqs(glb_iface, req, glb_iface->output->ack, GLB_IDLE);
> +	sched_queue_delayed_work(ptdev->scheduler, tick, 0);
> +}
> +
> +/**
> + * panthor_sched_process_global_irq() - Process the scheduling part of a global IRQ
> + * @ptdev: Device.
> + */
> +void panthor_sched_process_global_irq(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	u32 req, ack, evts;
> +
> +	req = READ_ONCE(glb_iface->input->req);
> +	ack = READ_ONCE(glb_iface->output->ack);
> +	evts = (req ^ ack) & GLB_EVT_MASK;
> +
> +	if (evts & GLB_IDLE)
> +		sched_process_idle_event(ptdev);
> +}
> +
> +static const char *fence_get_driver_name(struct dma_fence *fence)
> +{
> +	return "panthor";
> +}
> +
> +static const char *queue_fence_get_timeline_name(struct dma_fence *fence)
> +{
> +	return "queue-fence";
> +}
> +
> +static const struct dma_fence_ops panthor_queue_fence_ops = {
> +	.get_driver_name = fence_get_driver_name,
> +	.get_timeline_name = queue_fence_get_timeline_name,
> +};
> +
> +/**
> + * struct panthor_csg_slots_upd_ctx - Context for batching CSG slot update requests
> + */
> +struct panthor_csg_slots_upd_ctx {
> +	u32 update_mask;
> +	u32 timedout_mask;
> +	struct {
> +		u32 value;
> +		u32 mask;
> +	} requests[MAX_CSGS];
> +};
> +
> +static void csgs_upd_ctx_init(struct panthor_csg_slots_upd_ctx *ctx)
> +{
> +	memset(ctx, 0, sizeof(*ctx));
> +}
> +
> +static void csgs_upd_ctx_queue_reqs(struct panthor_device *ptdev,
> +				    struct panthor_csg_slots_upd_ctx *ctx,
> +				    u32 csg_id, u32 value, u32 mask)
> +{
> +	if (drm_WARN_ON(&ptdev->base, !mask) ||
> +	    drm_WARN_ON(&ptdev->base, csg_id >= ptdev->scheduler->csg_slot_count))
> +		return;
> +
> +	ctx->requests[csg_id].value = (ctx->requests[csg_id].value & ~mask) | (value & mask);
> +	ctx->requests[csg_id].mask |= mask;
> +	ctx->update_mask |= BIT(csg_id);
> +}
> +
> +static int csgs_upd_ctx_apply_locked(struct panthor_device *ptdev,
> +				     struct panthor_csg_slots_upd_ctx *ctx)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	u32 update_slots = ctx->update_mask;
> +
> +	lockdep_assert_held(&sched->lock);
> +
> +	if (!ctx->update_mask)
> +		return 0;
> +
> +	while (update_slots) {
> +		struct panthor_fw_csg_iface *csg_iface;
> +		u32 csg_id = ffs(update_slots) - 1;
> +
> +		update_slots &= ~BIT(csg_id);
> +		csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +		panthor_fw_update_reqs(csg_iface, req,
> +				       ctx->requests[csg_id].value,
> +				       ctx->requests[csg_id].mask);
> +	}
> +
> +	panthor_fw_ring_csg_doorbells(ptdev, ctx->update_mask);
> +
> +	update_slots = ctx->update_mask;
> +	while (update_slots) {
> +		struct panthor_fw_csg_iface *csg_iface;
> +		u32 csg_id = ffs(update_slots) - 1;
> +		u32 req_mask = ctx->requests[csg_id].mask, acked;
> +		int ret;
> +
> +		update_slots &= ~BIT(csg_id);
> +		csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +
> +		ret = panthor_fw_csg_wait_acks(ptdev, csg_id, req_mask, &acked, 100);
> +
> +		if (acked & CSG_ENDPOINT_CONFIG)
> +			csg_slot_sync_priority_locked(ptdev, csg_id);
> +
> +		if (acked & CSG_STATE_MASK)
> +			csg_slot_sync_state_locked(ptdev, csg_id);
> +
> +		if (acked & CSG_STATUS_UPDATE) {
> +			csg_slot_sync_queues_state_locked(ptdev, csg_id);
> +			csg_slot_sync_idle_state_locked(ptdev, csg_id);
> +		}
> +
> +		if (ret && acked != req_mask &&
> +		    ((csg_iface->input->req ^ csg_iface->output->ack) & req_mask) != 0) {
> +			drm_err(&ptdev->base, "CSG %d update request timed out", csg_id);
> +			ctx->timedout_mask |= BIT(csg_id);
> +		}
> +	}
> +
> +	if (ctx->timedout_mask)
> +		return -ETIMEDOUT;
> +
> +	return 0;
> +}
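
Side note (no change needed): the "while (mask) { id = ffs(mask) - 1; ... }"
pattern above comes back a lot in this file; for anyone less used to the
idiom, it just walks the set bits from lowest to highest. Userspace
equivalent with POSIX ffs():

	#include <stdint.h>
	#include <stdio.h>
	#include <strings.h>	/* ffs() */

	int main(void)
	{
		uint32_t mask = 0x15;	/* bits 0, 2 and 4 set */

		while (mask) {
			int id = ffs(mask) - 1;	/* index of lowest set bit */

			printf("slot %d\n", id);
			mask &= ~(1u << id);	/* clear it, move on */
		}
		return 0;
	}
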
> +
> +struct panthor_sched_tick_ctx {
> +	struct list_head old_groups[PANTHOR_CSG_PRIORITY_COUNT];
> +	struct list_head groups[PANTHOR_CSG_PRIORITY_COUNT];
> +	u32 idle_group_count;
> +	u32 group_count;
> +	enum panthor_csg_priority min_priority;
> +	struct panthor_vm *vms[MAX_CS_PER_CSG];
> +	u32 as_count;
> +	bool immediate_tick;
> +	u32 csg_upd_failed_mask;
> +};
> +
> +static bool
> +tick_ctx_is_full(const struct panthor_scheduler *sched,
> +		 const struct panthor_sched_tick_ctx *ctx)
> +{
> +	return ctx->group_count == sched->csg_slot_count;
> +}
> +
> +static bool
> +group_is_idle(struct panthor_group *group)
> +{
> +	struct panthor_device *ptdev = group->ptdev;
> +	u32 inactive_queues;
> +
> +	if (group->csg_id >= 0)
> +		return ptdev->scheduler->csg_slots[group->csg_id].idle;
> +
> +	inactive_queues = group->idle_queues | group->blocked_queues;
> +	return hweight32(inactive_queues) == group->queue_count;
> +}
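
Just to spell out the idle test above for other reviewers: a group that's
off a CSG slot is considered idle when every one of its queues is either
idle or blocked. Toy illustration of the popcount check (userspace C,
made-up masks):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t idle = 0x5, blocked = 0x2;	/* queues 0,2 idle; 1 blocked */
		uint32_t inactive = idle | blocked;
		int queue_count = 3;

		/* hweight32() in the kernel == population count. */
		printf("group idle: %d\n",
		       __builtin_popcount(inactive) == queue_count);
		return 0;
	}
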
> +
> +static bool
> +group_can_run(struct panthor_group *group)
> +{
> +	return group->state != PANTHOR_CS_GROUP_TERMINATED &&
> +	       !group->destroyed && group->fatal_queues == 0 &&
> +	       !group->timedout;
> +}
> +
> +static void
> +tick_ctx_pick_groups_from_list(const struct panthor_scheduler *sched,
> +			       struct panthor_sched_tick_ctx *ctx,
> +			       struct list_head *queue,
> +			       bool skip_idle_groups,
> +			       bool owned_by_tick_ctx)
> +{
> +	struct panthor_group *group, *tmp;
> +
> +	if (tick_ctx_is_full(sched, ctx))
> +		return;
> +		/* SYNC_ADD64.system_scope.propagate_err.nowait rX:rX+1, rX+2 */
> +	list_for_each_entry_safe(group, tmp, queue, run_node) {
> +		u32 i;
> +
> +		if (!group_can_run(group))
> +			continue;
> +
> +		if (skip_idle_groups && group_is_idle(group))
> +			continue;
> +
> +		for (i = 0; i < ctx->as_count; i++) {
> +			if (ctx->vms[i] == group->vm)
> +				break;
> +		}
> +
> +		if (i == ctx->as_count && ctx->as_count == sched->as_slot_count)
> +			continue;
> +
> +		if (!owned_by_tick_ctx)
> +			group_get(group);
> +
> +		list_move_tail(&group->run_node, &ctx->groups[group->priority]);
> +		ctx->group_count++;
> +		if (group_is_idle(group))
> +			ctx->idle_group_count++;
> +
> +		if (i == ctx->as_count)
> +			ctx->vms[ctx->as_count++] = group->vm;
> +
> +		if (ctx->min_priority > group->priority)
> +			ctx->min_priority = group->priority;
> +
> +		if (tick_ctx_is_full(sched, ctx))
> +			return;
> +	}
> +}
> +
> +static void
> +tick_ctx_insert_old_group(struct panthor_scheduler *sched,
> +			  struct panthor_sched_tick_ctx *ctx,
> +			  struct panthor_group *group,
> +			  bool full_tick)
> +{
> +	struct panthor_csg_slot *csg_slot = &sched->csg_slots[group->csg_id];
> +	struct panthor_group *other_group;
> +
> +	if (!full_tick) {
> +		list_add_tail(&group->run_node, &ctx->old_groups[group->priority]);
> +		return;
> +	}
> +
> +	/* Rotate to make sure groups with lower CSG slot
> +	 * priorities have a chance to get a higher CSG slot
> +	 * priority next time they get picked. This priority
> +	 * has an impact on resource request ordering, so it's
> +	 * important to make sure we don't let one group starve
> +	 * all other groups with the same group priority.
> +	 */
> +	list_for_each_entry(other_group,
> +			    &ctx->old_groups[csg_slot->group->priority],
> +			    run_node) {
> +		struct panthor_csg_slot *other_csg_slot = &sched->csg_slots[other_group->csg_id];
> +
> +		if (other_csg_slot->priority > csg_slot->priority) {
> +			list_add_tail(&csg_slot->group->run_node, &other_group->run_node);
> +			return;
> +		}
> +	}
> +
> +	list_add_tail(&group->run_node, &ctx->old_groups[group->priority]);
> +}
> +
> +static void
> +tick_ctx_init(struct panthor_scheduler *sched,
> +	      struct panthor_sched_tick_ctx *ctx,
> +	      bool full_tick)
> +{
> +	struct panthor_device *ptdev = sched->ptdev;
> +	struct panthor_csg_slots_upd_ctx upd_ctx;
> +	int ret;
> +	u32 i;
> +
> +	memset(ctx, 0, sizeof(*ctx));
> +	csgs_upd_ctx_init(&upd_ctx);
> +
> +	ctx->min_priority = PANTHOR_CSG_PRIORITY_COUNT;
> +	for (i = 0; i < ARRAY_SIZE(ctx->groups); i++) {
> +		INIT_LIST_HEAD(&ctx->groups[i]);
> +		INIT_LIST_HEAD(&ctx->old_groups[i]);
> +	}
> +
> +	for (i = 0; i < sched->csg_slot_count; i++) {
> +		struct panthor_csg_slot *csg_slot = &sched->csg_slots[i];
> +		struct panthor_fw_csg_iface *csg_iface;
> +
> +		csg_iface = panthor_fw_get_csg_iface(ptdev, i);
> +		if (csg_slot->group) {
> +			group_get(csg_slot->group);
> +			tick_ctx_insert_old_group(sched, ctx, csg_slot->group, full_tick);
> +			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, i,
> +						csg_iface->output->ack ^ CSG_STATUS_UPDATE,
> +						CSG_STATUS_UPDATE);
> +		}
> +	}
> +
> +	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
> +	if (ret) {
> +		panthor_device_schedule_reset(ptdev);
> +		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
> +	}
> +}
> +
> +#define NUM_INSTRS_PER_SLOT		16
> +
> +static void
> +group_term_post_processing(struct panthor_group *group)
> +{
> +	struct panthor_job *job, *tmp;
> +	LIST_HEAD(faulty_jobs);
> +	bool cookie;
> +	u32 i = 0;
> +
> +	if (drm_WARN_ON(&group->ptdev->base, group_can_run(group)))
> +		return;
> +
> +	cookie = dma_fence_begin_signalling();
> +	for (i = 0; i < group->queue_count; i++) {
> +		struct panthor_queue *queue = group->queues[i];
> +		struct panthor_syncobj_64b *syncobj;
> +		int err;
> +
> +		if (group->fatal_queues & BIT(i))
> +			err = -EINVAL;
> +		else if (group->timedout)
> +			err = -ETIMEDOUT;
> +		else
> +			err = -ECANCELED;
> +
> +		if (!queue)
> +			continue;
> +
> +		spin_lock(&queue->fence_ctx.lock);
> +		list_for_each_entry_safe(job, tmp, &queue->fence_ctx.in_flight_jobs, node) {
> +			list_move_tail(&job->node, &faulty_jobs);
> +			dma_fence_set_error(job->done_fence, err);
> +			dma_fence_signal_locked(job->done_fence);
> +		}
> +		spin_unlock(&queue->fence_ctx.lock);
> +
> +		/* Manually update the syncobj seqno to unblock waiters. */
> +		syncobj = group->syncobjs.kmap + (i * sizeof(*syncobj));
> +		syncobj->status = ~0;
> +		syncobj->seqno = atomic64_read(&queue->fence_ctx.seqno);
> +		sched_queue_work(group->ptdev->scheduler, sync_upd);
> +	}
> +	dma_fence_end_signalling(cookie);
> +
> +	list_for_each_entry_safe(job, tmp, &faulty_jobs, node) {
> +		list_del_init(&job->node);
> +		panthor_job_put(&job->base);
> +	}
> +}
> +
> +static void group_term_work(struct work_struct *work)
> +{
> +	struct panthor_group *group =
> +		container_of(work, struct panthor_group, term_work);
> +
> +	group_term_post_processing(group);
> +	group_put(group);
> +}
> +
> +static void
> +tick_ctx_cleanup(struct panthor_scheduler *sched,
> +		 struct panthor_sched_tick_ctx *ctx)
> +{
> +	struct panthor_group *group, *tmp;
> +	u32 i;
> +
> +	for (i = 0; i < ARRAY_SIZE(ctx->old_groups); i++) {
> +		list_for_each_entry_safe(group, tmp, &ctx->old_groups[i], run_node) {
> +			/* If everything went fine, we should only have groups
> +			 * to be terminated in the old_groups lists.
> +			 */
> +			drm_WARN_ON(&group->ptdev->base, !ctx->csg_upd_failed_mask &&
> +				    group_can_run(group));
> +
> +			if (!group_can_run(group)) {
> +				list_del_init(&group->run_node);
> +				list_del_init(&group->wait_node);
> +				group_queue_work(group, term);
> +			} else if (group->csg_id >= 0) {
> +				list_del_init(&group->run_node);
> +			} else {
> +				list_move(&group->run_node,
> +					  group_is_idle(group) ?
> +					  &sched->groups.idle[group->priority] :
> +					  &sched->groups.runnable[group->priority]);
> +			}
> +			group_put(group);
> +		}
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(ctx->groups); i++) {
> +		/* If everything went fine, the groups to schedule lists should
> +		 * be empty.
> +		 */
> +		drm_WARN_ON(&sched->ptdev->base,
> +			    !ctx->csg_upd_failed_mask && !list_empty(&ctx->groups[i]));
> +
> +		list_for_each_entry_safe(group, tmp, &ctx->groups[i], run_node) {
> +			if (group->csg_id >= 0) {
> +				list_del_init(&group->run_node);
> +			} else {
> +				list_move(&group->run_node,
> +					  group_is_idle(group) ?
> +					  &sched->groups.idle[group->priority] :
> +					  &sched->groups.runnable[group->priority]);
> +			}
> +			group_put(group);
> +		}
> +	}
> +}
> +
> +static void
> +tick_ctx_apply(struct panthor_scheduler *sched, struct panthor_sched_tick_ctx *ctx)
> +{
> +	struct panthor_group *group, *tmp;
> +	struct panthor_device *ptdev = sched->ptdev;
> +	struct panthor_csg_slot *csg_slot;
> +	int prio, new_csg_prio = MAX_CSG_PRIO, i;
> +	u32 csg_mod_mask = 0, free_csg_slots = 0;
> +	struct panthor_csg_slots_upd_ctx upd_ctx;
> +	int ret;
> +
> +	csgs_upd_ctx_init(&upd_ctx);
> +
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
> +		/* Suspend or terminate evicted groups. */
> +		list_for_each_entry(group, &ctx->old_groups[prio], run_node) {
> +			struct panthor_fw_csg_iface *csg_iface;
> +			bool term = !group_can_run(group);
> +			int csg_id = group->csg_id;
> +
> +			if (drm_WARN_ON(&ptdev->base, csg_id < 0))
> +				continue;
> +
> +			csg_slot = &sched->csg_slots[csg_id];
> +			csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
> +						term ? CSG_STATE_TERMINATE : CSG_STATE_SUSPEND,
> +						CSG_STATE_MASK);
> +		}
> +
> +		/* Update priorities on already running groups. */
> +		list_for_each_entry(group, &ctx->groups[prio], run_node) {
> +			struct panthor_fw_csg_iface *csg_iface;
> +			int csg_id = group->csg_id;
> +
> +			if (csg_id < 0) {
> +				new_csg_prio--;
> +				continue;
> +			}
> +
> +			csg_slot = &sched->csg_slots[csg_id];
> +			csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +			if (csg_slot->priority == new_csg_prio) {
> +				new_csg_prio--;
> +				continue;
> +			}
> +
> +			panthor_fw_update_reqs(csg_iface, endpoint_req,
> +					       CSG_EP_REQ_PRIORITY(new_csg_prio),
> +					       CSG_EP_REQ_PRIORITY_MASK);
> +			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
> +						csg_iface->output->ack ^ CSG_ENDPOINT_CONFIG,
> +						CSG_ENDPOINT_CONFIG);
> +			new_csg_prio--;
> +		}
> +	}
> +
> +	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
> +	if (ret) {
> +		panthor_device_schedule_reset(ptdev);
> +		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
> +		return;
> +	}
> +
> +	/* Unbind evicted groups. */
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
> +		list_for_each_entry(group, &ctx->old_groups[prio], run_node) {
> +			group_unbind_locked(group);
> +		}
> +	}
> +
> +	for (i = 0; i < sched->csg_slot_count; i++) {
> +		if (!sched->csg_slots[i].group)
> +			free_csg_slots |= BIT(i);
> +	}
> +
> +	csgs_upd_ctx_init(&upd_ctx);
> +	new_csg_prio = MAX_CSG_PRIO;
> +
> +	/* Start new groups. */
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
> +		list_for_each_entry(group, &ctx->groups[prio], run_node) {
> +			int csg_id = group->csg_id;
> +			struct panthor_fw_csg_iface *csg_iface;
> +
> +			if (csg_id >= 0) {
> +				new_csg_prio--;
> +				continue;
> +			}
> +
> +			csg_id = ffs(free_csg_slots) - 1;
> +			if (drm_WARN_ON(&ptdev->base, csg_id < 0))
> +				break;
> +
> +			csg_iface = panthor_fw_get_csg_iface(ptdev, csg_id);
> +			csg_slot = &sched->csg_slots[csg_id];
> +			csg_mod_mask |= BIT(csg_id);
> +			group_bind_locked(group, csg_id);
> +			csg_slot_prog_locked(ptdev, csg_id, new_csg_prio--);
> +			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
> +						group->state == PANTHOR_CS_GROUP_SUSPENDED ?
> +						CSG_STATE_RESUME : CSG_STATE_START,
> +						CSG_STATE_MASK);
> +			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
> +						csg_iface->output->ack ^ CSG_ENDPOINT_CONFIG,
> +						CSG_ENDPOINT_CONFIG);
> +			free_csg_slots &= ~BIT(csg_id);
> +		}
> +	}
> +
> +	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
> +	if (ret) {
> +		panthor_device_schedule_reset(ptdev);
> +		ctx->csg_upd_failed_mask |= upd_ctx.timedout_mask;
> +		return;
> +	}
> +
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
> +		list_for_each_entry_safe(group, tmp, &ctx->groups[prio], run_node) {
> +			list_del_init(&group->run_node);
> +
> +			/* If the group has been destroyed while we were
> +			 * scheduling, ask for an immediate tick to
> +			 * re-evaluate as soon as possible and get rid of
> +			 * this dangling group.
> +			 */
> +			if (group->destroyed)
> +				ctx->immediate_tick = true;
> +			group_put(group);
> +		}
> +
> +		/* Return evicted groups to the idle or run queues. Groups
> +		 * that can no longer be run (because they've been destroyed
> +		 * or experienced an unrecoverable error) will be scheduled
> +		 * for destruction in tick_ctx_cleanup().
> +		 */
> +		list_for_each_entry_safe(group, tmp, &ctx->old_groups[prio], run_node) {
> +			if (!group_can_run(group))
> +				continue;
> +
> +			if (group_is_idle(group))
> +				list_move_tail(&group->run_node, &sched->groups.idle[prio]);
> +			else
> +				list_move_tail(&group->run_node, &sched->groups.runnable[prio]);
> +			group_put(group);
> +		}
> +	}
> +
> +	sched->used_csg_slot_count = ctx->group_count;
> +	sched->might_have_idle_groups = ctx->idle_group_count > 0;
> +}
> +
> +static u64
> +tick_ctx_update_resched_target(struct panthor_scheduler *sched,
> +			       const struct panthor_sched_tick_ctx *ctx)
> +{
> +	/* We had space left, no need to reschedule until some external event happens. */
> +	if (!tick_ctx_is_full(sched, ctx))
> +		goto no_tick;
> +
> +	/* If idle groups were scheduled, no need to wake up until some external
> +	 * event happens (group unblocked, new job submitted, ...).
> +	 */
> +	if (ctx->idle_group_count)
> +		goto no_tick;
> +
> +	if (drm_WARN_ON(&sched->ptdev->base, ctx->min_priority >= PANTHOR_CSG_PRIORITY_COUNT))
> +		goto no_tick;
> +
> +	/* If there are groups of the same priority waiting, we need to
> +	 * keep the scheduler ticking; otherwise, we'll just wait for
> +	 * new groups with higher priority to be queued.
> +	 */
> +	if (!list_empty(&sched->groups.runnable[ctx->min_priority])) {
> +		u64 resched_target = sched->last_tick + sched->tick_period;
> +
> +		if (time_before64(sched->resched_target, sched->last_tick) ||
> +		    time_before64(resched_target, sched->resched_target))
> +			sched->resched_target = resched_target;
> +
> +		return sched->resched_target - sched->last_tick;
> +	}
> +
> +no_tick:
> +	sched->resched_target = U64_MAX;
> +	return U64_MAX;
> +}
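
For reviewers who haven't used the 64-bit jiffies helpers much: the
time_before64() comparisons above are wrap-safe because they boil down to a
signed subtraction. Minimal userspace model (my own simplified helper, not
the kernel macro itself):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Roughly what time_before64(a, b) does: wrap-safe "a < b". */
	static bool before64(uint64_t a, uint64_t b)
	{
		return (int64_t)(a - b) < 0;
	}

	int main(void)
	{
		uint64_t last_tick = 1000, tick_period = 100;
		uint64_t resched_target = last_tick + tick_period;
		uint64_t now = 1050;

		if (before64(now, resched_target))
			printf("tick again in %llu jiffies\n",
			       (unsigned long long)(resched_target - now));
		return 0;
	}
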
> +
> +static void tick_work(struct work_struct *work)
> +{
> +	struct panthor_scheduler *sched = container_of(work, struct panthor_scheduler,
> +						      tick_work.work);
> +	struct panthor_device *ptdev = sched->ptdev;
> +	struct panthor_sched_tick_ctx ctx;
> +	u64 remaining_jiffies = 0, resched_delay;
> +	u64 now = get_jiffies_64();
> +	int prio, ret, cookie;
> +
> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> +		return;
> +
> +	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	if (drm_WARN_ON(&ptdev->base, ret))
> +		goto out_dev_exit;
> +
> +	if (time_before64(now, sched->resched_target))
> +		remaining_jiffies = sched->resched_target - now;
> +
> +	mutex_lock(&sched->lock);
> +	if (panthor_device_reset_is_pending(sched->ptdev))
> +		goto out_unlock;
> +
> +	tick_ctx_init(sched, &ctx, remaining_jiffies != 0);
> +	if (ctx.csg_upd_failed_mask)
> +		goto out_cleanup_ctx;
> +
> +	if (remaining_jiffies) {
> +		/* Scheduling forced in the middle of a tick. Only RT groups
> +		 * can preempt non-RT ones. Currently running RT groups can't be
> +		 * preempted.
> +		 */
> +		for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1;
> +		     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
> +		     prio--) {
> +			tick_ctx_pick_groups_from_list(sched, &ctx, &ctx.old_groups[prio],
> +						       true, true);
> +			if (prio == PANTHOR_CSG_PRIORITY_RT) {
> +				tick_ctx_pick_groups_from_list(sched, &ctx,
> +							       &sched->groups.runnable[prio],
> +							       true, false);
> +			}
> +		}
> +	}
> +
> +	/* First pick non-idle groups */
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1;
> +	     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
> +	     prio--) {
> +		tick_ctx_pick_groups_from_list(sched, &ctx, &sched->groups.runnable[prio],
> +					       true, false);
> +		tick_ctx_pick_groups_from_list(sched, &ctx, &ctx.old_groups[prio], true, true);
> +	}
> +
> +	/* If we have free CSG slots left, pick idle groups */
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1;
> +	     prio >= 0 && !tick_ctx_is_full(sched, &ctx);
> +	     prio--) {
> +		/* Check the old_group queue first to avoid reprogramming the slots */
> +		tick_ctx_pick_groups_from_list(sched, &ctx, &ctx.old_groups[prio], false, true);
> +		tick_ctx_pick_groups_from_list(sched, &ctx, &sched->groups.idle[prio],
> +					       false, false);
> +	}
> +
> +	tick_ctx_apply(sched, &ctx);
> +	if (ctx.csg_upd_failed_mask)
> +		goto out_cleanup_ctx;
> +
> +	if (ctx.idle_group_count == ctx.group_count) {
> +		panthor_devfreq_record_idle(sched->ptdev);
> +		if (sched->pm.has_ref) {
> +			pm_runtime_put_autosuspend(ptdev->base.dev);
> +			sched->pm.has_ref = false;
> +		}
> +	} else {
> +		panthor_devfreq_record_busy(sched->ptdev);
> +		if (!sched->pm.has_ref) {
> +			pm_runtime_get(ptdev->base.dev);
> +			sched->pm.has_ref = true;
> +		}
> +	}
> +
> +	sched->last_tick = now;
> +	resched_delay = tick_ctx_update_resched_target(sched, &ctx);
> +	if (ctx.immediate_tick)
> +		resched_delay = 0;
> +
> +	if (resched_delay != U64_MAX)
> +		sched_queue_delayed_work(sched, tick, resched_delay);
> +
> +out_cleanup_ctx:
> +	tick_ctx_cleanup(sched, &ctx);
> +
> +out_unlock:
> +	mutex_unlock(&sched->lock);
> +	pm_runtime_mark_last_busy(ptdev->base.dev);
> +	pm_runtime_put_autosuspend(ptdev->base.dev);
> +
> +out_dev_exit:
> +	drm_dev_exit(cookie);
> +}
> +
> +static void *
> +panthor_queue_get_syncwait_obj(struct panthor_group *group, struct panthor_queue *queue)
> +{
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct iosys_map map;
> +	int ret;
> +
> +	if (queue->syncwait.kmap)
> +		return queue->syncwait.kmap + queue->syncwait.offset;
> +
> +	if (!queue->syncwait.bo) {
> +		queue->syncwait.bo = panthor_vm_get_bo_for_va(group->vm,
> +							      queue->syncwait.gpu_va,
> +							      &queue->syncwait.offset);
> +		if (drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(queue->syncwait.bo)))
> +			return NULL;
> +	}
> +
> +	ret = drm_gem_vmap_unlocked(&queue->syncwait.bo->base.base, &map);
> +	if (drm_WARN_ON(&ptdev->base, ret))
> +		return NULL;
> +
> +	queue->syncwait.kmap = map.vaddr;
> +	if (drm_WARN_ON(&ptdev->base, !queue->syncwait.kmap))
> +		return NULL;
> +
> +	return queue->syncwait.kmap + queue->syncwait.offset;
> +}
> +
> +static int panthor_queue_eval_syncwait(struct panthor_group *group, u8 queue_idx)
> +{
> +	struct panthor_queue *queue = group->queues[queue_idx];
> +	union {
> +		struct panthor_syncobj_64b sync64;
> +		struct panthor_syncobj_32b sync32;
> +	} *syncobj;
> +	bool result;
> +	u64 value;
> +
> +	syncobj = panthor_queue_get_syncwait_obj(group, queue);
> +	if (!syncobj)
> +		return -EINVAL;
> +
> +	value = queue->syncwait.sync64 ?
> +		syncobj->sync64.seqno :
> +		syncobj->sync32.seqno;
> +
> +	if (queue->syncwait.gt)
> +		result = value > queue->syncwait.ref;
> +	else
> +		result = value <= queue->syncwait.ref;
> +
> +	if (result) {
> +		panthor_gem_unmap_and_put(group->vm, queue->syncwait.bo,
> +					  queue->syncwait.gpu_va,
> +					  queue->syncwait.kmap);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
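
Again just a note: the SYNC_WAIT evaluation above reduces to a single
compare against the reference value, with the direction picked by the gt
flag. Standalone sketch of that condition (userspace C):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	static bool syncwait_met(uint64_t value, uint64_t ref, bool gt)
	{
		return gt ? value > ref : value <= ref;
	}

	int main(void)
	{
		printf("%d %d\n",
		       syncwait_met(5, 4, true),	/* 1: 5 > 4 */
		       syncwait_met(5, 4, false));	/* 0: 5 <= 4 is false */
		return 0;
	}
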
> +
> +static void sync_upd_work(struct work_struct *work)
> +{
> +	struct panthor_scheduler *sched = container_of(work,
> +						      struct panthor_scheduler,
> +						      sync_upd_work);
> +	struct panthor_group *group, *tmp;
> +	bool immediate_tick = false;
> +
> +	mutex_lock(&sched->lock);
> +	list_for_each_entry_safe(group, tmp, &sched->groups.waiting, wait_node) {
> +		u32 tested_queues = group->blocked_queues;
> +		u32 unblocked_queues = 0;
> +
> +		while (tested_queues) {
> +			u32 cs_id = ffs(tested_queues) - 1;
> +			int ret;
> +
> +			ret = panthor_queue_eval_syncwait(group, cs_id);
> +			drm_WARN_ON(&group->ptdev->base, ret < 0);
> +			if (ret)
> +				unblocked_queues |= BIT(cs_id);
> +
> +			tested_queues &= ~BIT(cs_id);
> +		}
> +
> +		if (unblocked_queues) {
> +			group->blocked_queues &= ~unblocked_queues;
> +
> +			if (group->csg_id < 0) {
> +				list_move(&group->run_node,
> +					  &sched->groups.runnable[group->priority]);
> +				if (group->priority == PANTHOR_CSG_PRIORITY_RT)
> +					immediate_tick = true;
> +			}
> +		}
> +
> +		if (!group->blocked_queues)
> +			list_del_init(&group->wait_node);
> +	}
> +	mutex_unlock(&sched->lock);
> +
> +	if (immediate_tick)
> +		sched_queue_delayed_work(sched, tick, 0);
> +}
> +
> +static void group_schedule_locked(struct panthor_group *group, u32 queue_mask)
> +{
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct list_head *queue = &sched->groups.runnable[group->priority];
> +	u64 delay_jiffies = 0;
> +	bool was_idle;
> +	u64 now;
> +
> +	if (!group_can_run(group))
> +		return;
> +
> +	/* All updated queues are blocked, no need to wake up the scheduler. */
> +	if ((queue_mask & group->blocked_queues) == queue_mask)
> +		return;
> +
> +	was_idle = group_is_idle(group);
> +	group->idle_queues &= ~queue_mask;
> +	if (was_idle && !group_is_idle(group))
> +		list_move_tail(&group->run_node, queue);
> +
> +	/* RT groups are preemptive. */
> +	if (group->priority == PANTHOR_CSG_PRIORITY_RT) {
> +		sched_queue_delayed_work(sched, tick, 0);
> +		return;
> +	}
> +
> +	/* Some groups might be idle, force an immediate tick to
> +	 * re-evaluate.
> +	 */
> +	if (sched->might_have_idle_groups) {
> +		sched_queue_delayed_work(sched, tick, 0);
> +		return;
> +	}
> +
> +	/* Scheduler is ticking, nothing to do. */
> +	if (sched->resched_target != U64_MAX) {
> +		/* If there are free slots, force an immediate tick. */
> +		if (sched->used_csg_slot_count < sched->csg_slot_count)
> +			sched_queue_delayed_work(sched, tick, 0);
> +
> +		return;
> +	}
> +
> +	/* Scheduler tick was off, recalculate the resched_target based on the
> +	 * last tick event, and queue the scheduler work.
> +	 */
> +	now = get_jiffies_64();
> +	sched->resched_target = sched->last_tick + sched->tick_period;
> +	if (sched->used_csg_slot_count == sched->csg_slot_count &&
> +	    time_before64(now, sched->resched_target))
> +		delay_jiffies = min_t(unsigned long, sched->resched_target - now, ULONG_MAX);
> +
> +	sched_queue_delayed_work(sched, tick, delay_jiffies);
> +}
> +
> +static void queue_stop(struct panthor_queue *queue,
> +		       struct panthor_job *bad_job)
> +{
> +	drm_sched_stop(&queue->scheduler, bad_job ? &bad_job->base : NULL);
> +}
> +
> +static void queue_start(struct panthor_queue *queue)
> +{
> +	struct panthor_job *job;
> +
> +	/* Re-assign the parent fences. */
> +	list_for_each_entry(job, &queue->scheduler.pending_list, base.list)
> +		job->base.s_fence->parent = dma_fence_get(job->done_fence);
> +
> +	drm_sched_start(&queue->scheduler, true);
> +}
> +
> +static void panthor_group_stop(struct panthor_group *group)
> +{
> +	struct panthor_scheduler *sched = group->ptdev->scheduler;
> +
> +	lockdep_assert_held(&sched->reset.lock);
> +
> +	for (u32 i = 0; i < group->queue_count; i++)
> +		queue_stop(group->queues[i], NULL);
> +
> +	group_get(group);
> +	list_move_tail(&group->run_node, &sched->reset.stopped_groups);
> +}
> +
> +static void panthor_group_start(struct panthor_group *group)
> +{
> +	struct panthor_scheduler *sched = group->ptdev->scheduler;
> +
> +	lockdep_assert_held(&group->ptdev->scheduler->reset.lock);
> +
> +	for (u32 i = 0; i < group->queue_count; i++)
> +		queue_start(group->queues[i]);
> +
> +	if (group_can_run(group)) {
> +		list_move_tail(&group->run_node,
> +			       group_is_idle(group) ?
> +			       &sched->groups.idle[group->priority] :
> +			       &sched->groups.runnable[group->priority]);
> +	} else {
> +		list_del_init(&group->run_node);
> +		list_del_init(&group->wait_node);
> +		group_queue_work(group, term);
> +	}
> +
> +	group_put(group);
> +}
> +
> +void panthor_sched_resume(struct panthor_device *ptdev)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +
> +	/* Force a tick to re-evaluate after a resume. */
> +	sched_queue_delayed_work(sched, tick, 0);
> +}
> +
> +void panthor_sched_suspend(struct panthor_device *ptdev)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_csg_slots_upd_ctx upd_ctx;
> +	u64 suspended_slots, faulty_slots;
> +	struct panthor_group *group;
> +	int ret;
> +	u32 i;
> +
> +	mutex_lock(&sched->lock);
> +	csgs_upd_ctx_init(&upd_ctx);
> +	for (i = 0; i < sched->csg_slot_count; i++) {
> +		struct panthor_csg_slot *csg_slot = &sched->csg_slots[i];
> +
> +		if (csg_slot->group) {
> +			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, i,
> +						CSG_STATE_SUSPEND,
> +						CSG_STATE_MASK);
> +		}
> +	}
> +
> +	suspended_slots = upd_ctx.update_mask;
> +
> +	ret = csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
> +	suspended_slots &= ~upd_ctx.timedout_mask;
> +	faulty_slots = upd_ctx.timedout_mask;
> +
> +	if (faulty_slots) {
> +		u32 slot_mask = faulty_slots;
> +
> +		drm_err(&ptdev->base, "CSG suspend failed, escalating to termination");
> +		csgs_upd_ctx_init(&upd_ctx);
> +		while (slot_mask) {
> +			u32 csg_id = ffs(slot_mask) - 1;
> +
> +			csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id,
> +						CSG_STATE_TERMINATE,
> +						CSG_STATE_MASK);
> +			slot_mask &= ~BIT(csg_id);
> +		}
> +
> +		csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
> +
> +		slot_mask = upd_ctx.timedout_mask;
> +		while (slot_mask) {
> +			u32 csg_id = ffs(slot_mask) - 1;
> +			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> +
> +			/* The terminate command timed out, but the soft-reset
> +			 * will automatically terminate all active groups, so
> +			 * let's force the state to terminated here.
> +			 */
> +			if (csg_slot->group->state != PANTHOR_CS_GROUP_TERMINATED)
> +				csg_slot->group->state = PANTHOR_CS_GROUP_TERMINATED;
> +			slot_mask &= ~BIT(csg_id);
> +		}
> +	}
> +
> +	/* Flush L2 and LSC caches to make sure suspend state is up-to-date.
> +	 * If the flush fails, flag all queues for termination.
> +	 */
> +	if (suspended_slots) {
> +		bool flush_caches_failed = false;
> +		u32 slot_mask = suspended_slots;
> +
> +		if (panthor_gpu_flush_caches(ptdev, CACHE_CLEAN, CACHE_CLEAN, 0))
> +			flush_caches_failed = true;
> +
> +		while (slot_mask) {
> +			u32 csg_id = ffs(slot_mask) - 1;
> +			struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id];
> +
> +			if (flush_caches_failed)
> +				csg_slot->group->state = PANTHOR_CS_GROUP_TERMINATED;
> +			else
> +				csg_slot_sync_update_locked(ptdev, csg_id);
> +
> +			slot_mask &= ~BIT(csg_id);
> +		}
> +
> +		if (flush_caches_failed)
> +			faulty_slots |= suspended_slots;
> +	}
> +
> +	for (i = 0; i < sched->csg_slot_count; i++) {
> +		struct panthor_csg_slot *csg_slot = &sched->csg_slots[i];
> +
> +		group = csg_slot->group;
> +		if (!group)
> +			continue;
> +
> +		group_get(group);
> +		group_unbind_locked(group);
> +
> +		drm_WARN_ON(&group->ptdev->base, !list_empty(&group->run_node));
> +
> +		if (group_can_run(group)) {
> +			list_add(&group->run_node,
> +				 group_is_idle(group) ?
> +				 &sched->groups.idle[group->priority] :
> +				 &sched->groups.runnable[group->priority]);
> +		} else {
> +			/* We don't bother stopping the scheduler if the group is
> +			 * faulty, the group termination work will finish the job.
> +			 */
> +			list_del_init(&group->wait_node);
> +			group_queue_work(group, term);
> +		}
> +		group_put(group);
> +	}
> +	mutex_unlock(&sched->lock);
> +}
> +
> +void panthor_sched_pre_reset(struct panthor_device *ptdev)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_group *group, *group_tmp;
> +	u32 i;
> +
> +	mutex_lock(&sched->reset.lock);
> +
> +	/* Cancel all scheduler works. Once this is done, these works can't be
> +	 * scheduled again until the reset operation is complete.
> +	 */
> +	sched->reset.in_progress = true;
> +	cancel_work_sync(&sched->sync_upd_work);
> +	cancel_delayed_work_sync(&sched->tick_work);
> +
> +	panthor_sched_suspend(ptdev);
> +
> +	/* Stop all groups that might still accept jobs, so we don't get passed
> +	 * new jobs while we're resetting.
> +	 */
> +	for (i = 0; i < ARRAY_SIZE(sched->groups.runnable); i++) {
> +		list_for_each_entry_safe(group, group_tmp, &sched->groups.runnable[i], run_node)
> +			panthor_group_stop(group);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(sched->groups.idle); i++) {
> +		list_for_each_entry_safe(group, group_tmp, &sched->groups.idle[i], run_node)
> +			panthor_group_stop(group);
> +	}
> +
> +	mutex_unlock(&sched->reset.lock);
> +}
> +
> +void panthor_sched_post_reset(struct panthor_device *ptdev)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_group *group, *group_tmp;
> +
> +	mutex_lock(&sched->reset.lock);
> +
> +	list_for_each_entry_safe(group, group_tmp, &sched->reset.stopped_groups, run_node)
> +		panthor_group_start(group);
> +
> +	/* We're done resetting the GPU, clear the reset.in_progress bit so we can
> +	 * kick the scheduler.
> +	 */
> +	sched->reset.in_progress = false;
> +	mutex_unlock(&sched->reset.lock);
> +
> +	sched_queue_delayed_work(sched, tick, 0);
> +
> +	sched_queue_work(sched, sync_upd);
> +}
> +
> +static void group_sync_upd_work(struct work_struct *work)
> +{
> +	struct panthor_group *group =
> +		container_of(work, struct panthor_group, sync_upd_work);
> +	struct panthor_job *job, *job_tmp;
> +	LIST_HEAD(done_jobs);
> +	u32 queue_idx;
> +	bool cookie;
> +
> +	cookie = dma_fence_begin_signalling();
> +	for (queue_idx = 0; queue_idx < group->queue_count; queue_idx++) {
> +		struct panthor_queue *queue = group->queues[queue_idx];
> +		struct panthor_syncobj_64b *syncobj;
> +
> +		if (!queue)
> +			continue;
> +
> +		syncobj = group->syncobjs.kmap + (queue_idx * sizeof(*syncobj));
> +
> +		spin_lock(&queue->fence_ctx.lock);
> +		list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
> +			if (!job->call_info.size)
> +				continue;
> +
> +			if (syncobj->seqno < job->done_fence->seqno)
> +				break;
> +
> +			list_move_tail(&job->node, &done_jobs);
> +			dma_fence_signal_locked(job->done_fence);
> +		}
> +		spin_unlock(&queue->fence_ctx.lock);
> +	}
> +	dma_fence_end_signalling(cookie);
> +
> +	list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
> +		list_del_init(&job->node);
> +		panthor_job_put(&job->base);
> +	}
> +
> +	group_put(group);
> +}
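
One more explanatory note: the scan above relies on jobs on a queue
completing in order, so everything whose seqno is <= the value the CS wrote
to the syncobj can be signalled, and the walk stops at the first job that
isn't done yet. Toy model (userspace C, made-up numbers):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t job_seqnos[] = { 1, 2, 3, 4 };	/* in submission order */
		uint64_t hw_seqno = 3;	/* last seqno the GPU wrote back */

		for (unsigned int i = 0; i < 4; i++) {
			if (hw_seqno < job_seqnos[i])
				break;	/* not done; later jobs can't be either */
			printf("job %llu done\n",
			       (unsigned long long)job_seqnos[i]);
		}
		return 0;
	}
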
> +
> +static struct dma_fence *
> +queue_run_job(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +	struct panthor_group *group = job->group;
> +	struct panthor_queue *queue = group->queues[job->queue_idx];
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	u32 ringbuf_size = queue->ringbuf.bo->base.base.size;
> +	u32 ringbuf_insert = queue->iface.input->insert % ringbuf_size;
> +	u64 addr_reg = ptdev->csif_info.cs_reg_count -
> +		       ptdev->csif_info.unpreserved_cs_reg_count;
> +	u64 val_reg = addr_reg + 2;
> +	u64 sync_addr = group->syncobjs.gpu_va +
> +			job->queue_idx * sizeof(struct panthor_syncobj_64b);
> +	u32 waitall_mask = GENMASK(sched->sb_slot_count - 1, 0);
> +	struct dma_fence *done_fence;
> +	int ret;
> +
> +	u64 call_instrs[NUM_INSTRS_PER_SLOT] = {
> +		/* MOV32 rX+2, cs.latest_flush */
> +		(2ull << 56) | (val_reg << 48) | job->call_info.latest_flush,
> +
> +		/* FLUSH_CACHE2.clean_inv_all.no_wait.signal(0) rX+2 */
> +		(36ull << 56) | (0ull << 48) | (val_reg << 40) | (0 << 16) | 0x233,
> +
> +		/* MOV48 rX:rX+1, cs.start */
> +		(1ull << 56) | (addr_reg << 48) | job->call_info.start,
> +
> +		/* MOV32 rX+2, cs.size */
> +		(2ull << 56) | (val_reg << 48) | job->call_info.size,
> +
> +		/* WAIT(0) => waits for FLUSH_CACHE2 instruction */
> +		(3ull << 56) | (1 << 16),
> +
> +		/* CALL rX:rX+1, rX+2 */
> +		(32ull << 56) | (addr_reg << 40) | (val_reg << 32),
> +
> +		/* MOV48 rX:rX+1, sync_addr */
> +		(1ull << 56) | (addr_reg << 48) | sync_addr,
> +
> +		/* MOV32 rX+2, #1 */

s/MOV32/MOV48/

Steve

> +		(1ull << 56) | (val_reg << 48) | 1,
> +
> +		/* WAIT(all) */
> +		(3ull << 56) | (waitall_mask << 16),
> +
> +		/* SYNC_ADD64.system_scope.propage_err.nowait rX:rX+1, rX+2*/
> +		(51ull << 56) | (0ull << 48) | (addr_reg << 40) | (val_reg << 32) | (0 << 16) | 1,
> +
> +		/* ERROR_BARRIER, so we can recover from faults at job
> +		 * boundaries.
> +		 */
> +		(47ull << 56),
> +	};
> +
> +	/* Need to be cacheline aligned to please the prefetcher. */
> +	static_assert(sizeof(call_instrs) % 64 == 0,
> +		      "call_instrs is not aligned on a cacheline");
> +
> +	/* Stream size is zero, nothing to do => return a NULL fence and let
> +	 * drm_sched signal the parent.
> +	 */
> +	if (!job->call_info.size)
> +		return NULL;
> +
> +	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	if (drm_WARN_ON(&ptdev->base, ret))
> +		return ERR_PTR(ret);
> +
> +	mutex_lock(&sched->lock);
> +	if (!group_can_run(group)) {
> +		done_fence = ERR_PTR(-ECANCELED);
> +		goto out_unlock;
> +	}
> +
> +	dma_fence_init(job->done_fence,
> +		       &panthor_queue_fence_ops,
> +		       &queue->fence_ctx.lock,
> +		       queue->fence_ctx.id,
> +		       atomic64_inc_return(&queue->fence_ctx.seqno));
> +
> +	memcpy((u8 *)queue->ringbuf.kmap + ringbuf_insert,
> +	       call_instrs, sizeof(call_instrs));
> +
> +	panthor_job_get(&job->base);
> +	spin_lock(&queue->fence_ctx.lock);
> +	list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
> +	spin_unlock(&queue->fence_ctx.lock);
> +
> +	job->ringbuf.start = queue->iface.input->insert;
> +	job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> +
> +	/* Make sure the ring buffer is updated before the INSERT
> +	 * register.
> +	 */
> +	wmb();
> +
> +	queue->iface.input->extract = queue->iface.output->extract;
> +	queue->iface.input->insert = job->ringbuf.end;
> +
> +	if (group->csg_id < 0) {
> +		/* If the queue is blocked, we want to keep the timeout running, so we
> +		 * can detect unbounded waits and kill the group when that happens.
> +		 * Otherwise, we suspend the timeout so the time we spend waiting for
> +		 * a CSG slot is not counted.
> +		 */
> +		if (!(group->blocked_queues & BIT(job->queue_idx)) &&
> +		    !queue->timeout_suspended) {
> +			queue->remaining_time = drm_sched_suspend_timeout(&queue->scheduler);
> +			queue->timeout_suspended = true;
> +		}
> +
> +		group_schedule_locked(group, BIT(job->queue_idx));
> +	} else {
> +		gpu_write(ptdev, CSF_DOORBELL(queue->doorbell_id), 1);
> +		if (!sched->pm.has_ref &&
> +		    !(group->blocked_queues & BIT(job->queue_idx))) {
> +			pm_runtime_get(ptdev->base.dev);
> +			sched->pm.has_ref = true;
> +		}
> +	}
> +
> +	done_fence = dma_fence_get(job->done_fence);
> +
> +out_unlock:
> +	mutex_unlock(&sched->lock);
> +	pm_runtime_mark_last_busy(ptdev->base.dev);
> +	pm_runtime_put_autosuspend(ptdev->base.dev);
> +
> +	return done_fence;
> +}
> +
> +static enum drm_gpu_sched_stat
> +queue_timedout_job(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +	struct panthor_group *group = job->group;
> +	struct panthor_device *ptdev = group->ptdev;
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_queue *queue = group->queues[job->queue_idx];
> +
> +	drm_warn(&ptdev->base, "job timeout\n");
> +
> +	WARN_ON(sched->reset.in_progress);
> +
> +	queue_stop(queue, job);
> +
> +	mutex_lock(&sched->lock);
> +	group->timedout = true;
> +	if (group->csg_id >= 0) {
> +		sched_queue_delayed_work(ptdev->scheduler, tick, 0);
> +	} else {
> +		/* Remove from the run queues, so the scheduler can't
> +		 * pick the group on the next tick.
> +		 */
> +		WARN_ON(list_empty(&group->run_node));
> +		list_del_init(&group->run_node);
> +		list_del_init(&group->wait_node);
> +
> +		group_queue_work(group, term);
> +	}
> +	mutex_unlock(&sched->lock);
> +
> +	queue_start(queue);
> +
> +	return DRM_GPU_SCHED_STAT_NOMINAL;
> +}
> +
> +static void queue_free_job(struct drm_sched_job *sched_job)
> +{
> +	drm_sched_job_cleanup(sched_job);
> +	panthor_job_put(sched_job);
> +}
> +
> +static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
> +	.run_job = queue_run_job,
> +	.timedout_job = queue_timedout_job,
> +	.free_job = queue_free_job,
> +};
> +
> +static struct panthor_queue *
> +group_create_queue(struct panthor_group *group,
> +		   const struct drm_panthor_queue_create *args)
> +{
> +	struct panthor_scheduler *scheduler = group->ptdev->scheduler;
> +	struct drm_gpu_scheduler *drm_sched;
> +	struct panthor_queue *queue;
> +	int ret;
> +
> +	if (args->pad[0] || args->pad[1] || args->pad[2])
> +		return ERR_PTR(-EINVAL);
> +
> +	if (!IS_ALIGNED(args->ringbuf_size, PAGE_SIZE) || args->ringbuf_size > SZ_64K)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (args->priority > CSF_MAX_QUEUE_PRIO)
> +		return ERR_PTR(-EINVAL);
> +
> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> +	if (!queue)
> +		return ERR_PTR(-ENOMEM);
> +
> +	queue->fence_ctx.id = dma_fence_context_alloc(1);
> +	spin_lock_init(&queue->fence_ctx.lock);
> +	INIT_LIST_HEAD(&queue->fence_ctx.in_flight_jobs);
> +
> +	queue->priority = args->priority;
> +
> +	queue->ringbuf.gpu_va = PANTHOR_GEM_ALLOC_VA;
> +	queue->ringbuf.bo = panthor_gem_create_and_map(group->ptdev, group->vm,
> +						       args->ringbuf_size,
> +						       DRM_PANTHOR_BO_NO_MMAP,
> +						       DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> +						       DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> +						       &queue->ringbuf.gpu_va,
> +						       (void **)&queue->ringbuf.kmap);
> +	if (IS_ERR(queue->ringbuf.bo)) {
> +		ret = PTR_ERR(queue->ringbuf.bo);
> +		goto out;
> +	}
> +
> +	queue->iface.mem = panthor_fw_alloc_queue_iface_mem(group->ptdev,
> +							    &queue->iface.input,
> +							    &queue->iface.output);
> +	if (IS_ERR(queue->iface.mem)) {
> +		ret = PTR_ERR(queue->iface.mem);
> +		goto out;
> +	}
> +
> +	ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
> +			     scheduler->wq,
> +			     args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
> +			     0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> +			     group->ptdev->reset.wq,
> +			     NULL, "panthor-queue", DRM_SCHED_POLICY_SINGLE_ENTITY,
> +			     group->ptdev->base.dev);
> +	if (ret)
> +		goto out;
> +
> +	drm_sched = &queue->scheduler;
> +	ret = drm_sched_entity_init(&queue->entity, DRM_SCHED_PRIORITY_NORMAL,
> +				    &drm_sched, 1, NULL);
> +
> +out:
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return queue;
> +}
> +
> +int panthor_group_create(struct panthor_file *pfile,
> +			 const struct drm_panthor_group_create *group_args,
> +			 const struct drm_panthor_queue_create *queue_args)
> +{
> +	struct panthor_device *ptdev = pfile->ptdev;
> +	struct panthor_group_pool *gpool = pfile->groups;
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
> +	struct panthor_group *group = NULL;
> +	u32 gid, i, suspend_size;
> +	int ret;
> +
> +	if (group_args->pad)
> +		return -EINVAL;
> +
> +	if (group_args->priority > PANTHOR_CSG_PRIORITY_HIGH)
> +		return -EINVAL;
> +
> +	if ((group_args->compute_core_mask & ~ptdev->gpu_info.shader_present) ||
> +	    (group_args->fragment_core_mask & ~ptdev->gpu_info.shader_present) ||
> +	    (group_args->tiler_core_mask & ~ptdev->gpu_info.tiler_present))
> +		return -EINVAL;
> +
> +	if (hweight64(group_args->compute_core_mask) < group_args->max_compute_cores ||
> +	    hweight64(group_args->fragment_core_mask) < group_args->max_fragment_cores ||
> +	    hweight64(group_args->tiler_core_mask) < group_args->max_tiler_cores)
> +		return -EINVAL;
> +
> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> +	if (!group)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&group->fatal_lock);
> +	kref_init(&group->refcount);
> +	group->state = PANTHOR_CS_GROUP_CREATED;
> +	group->csg_id = -1;
> +
> +	group->ptdev = ptdev;
> +	group->max_compute_cores = group_args->max_compute_cores;
> +	group->compute_core_mask = group_args->compute_core_mask;
> +	group->max_fragment_cores = group_args->max_fragment_cores;
> +	group->fragment_core_mask = group_args->fragment_core_mask;
> +	group->max_tiler_cores = group_args->max_tiler_cores;
> +	group->tiler_core_mask = group_args->tiler_core_mask;
> +	group->priority = group_args->priority;
> +
> +	INIT_LIST_HEAD(&group->wait_node);
> +	INIT_LIST_HEAD(&group->run_node);
> +	INIT_WORK(&group->term_work, group_term_work);
> +	INIT_WORK(&group->sync_upd_work, group_sync_upd_work);
> +	INIT_WORK(&group->release_work, group_release_work);
> +
> +	group->vm = panthor_vm_pool_get_vm(pfile->vms, group_args->vm_id);
> +	if (!group->vm) {
> +		ret = -EINVAL;
> +		goto err_put_group;
> +	}
> +
> +	suspend_size = csg_iface->control->suspend_size;
> +	group->suspend_buf = panthor_fw_alloc_suspend_buf_mem(ptdev, suspend_size);
> +	if (IS_ERR(group->suspend_buf)) {
> +		ret = PTR_ERR(group->suspend_buf);
> +		group->suspend_buf = NULL;
> +		goto err_put_group;
> +	}
> +
> +	suspend_size = csg_iface->control->protm_suspend_size;
> +	group->protm_suspend_buf = panthor_fw_alloc_suspend_buf_mem(ptdev, suspend_size);
> +	if (IS_ERR(group->protm_suspend_buf)) {
> +		ret = PTR_ERR(group->protm_suspend_buf);
> +		group->protm_suspend_buf = NULL;
> +		goto err_put_group;
> +	}
> +
> +	group->syncobjs.gpu_va = PANTHOR_GEM_ALLOC_VA;
> +	group->syncobjs.bo = panthor_gem_create_and_map(ptdev, group->vm,
> +							group_args->queues.count *
> +							sizeof(struct panthor_syncobj_64b),
> +							DRM_PANTHOR_BO_NO_MMAP,
> +							DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> +							DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> +							&group->syncobjs.gpu_va,
> +							(void **)&group->syncobjs.kmap);
> +	if (IS_ERR(group->syncobjs.bo)) {
> +		ret = PTR_ERR(group->syncobjs.bo);
> +		goto err_put_group;
> +	}
> +
> +	memset(group->syncobjs.kmap, 0,
> +	       group_args->queues.count * sizeof(struct panthor_syncobj_64b));
> +
> +	for (i = 0; i < group_args->queues.count; i++) {
> +		group->queues[i] = group_create_queue(group, &queue_args[i]);
> +		if (IS_ERR(group->queues[i])) {
> +			ret = PTR_ERR(group->queues[i]);
> +			group->queues[i] = NULL;
> +			goto err_put_group;
> +		}
> +
> +		group->queue_count++;
> +	}
> +
> +	group->idle_queues = GENMASK(group->queue_count - 1, 0);
> +
> +	ret = xa_alloc(&gpool->xa, &gid, group, XA_LIMIT(1, sched->csg_slot_count), GFP_KERNEL);
> +	if (ret)
> +		goto err_put_group;
> +
> +	mutex_lock(&sched->reset.lock);
> +	if (sched->reset.in_progress) {
> +		panthor_group_stop(group);
> +	} else {
> +		mutex_lock(&sched->lock);
> +		list_add_tail(&group->run_node,
> +			      &sched->groups.idle[group->priority]);
> +		mutex_unlock(&sched->lock);
> +	}
> +	mutex_unlock(&sched->reset.lock);
> +
> +	return gid;
> +
> +err_put_group:
> +	group_put(group);
> +	return ret;
> +}
> +
> +int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle)
> +{
> +	struct panthor_group_pool *gpool = pfile->groups;
> +	struct panthor_device *ptdev = pfile->ptdev;
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_group *group;
> +
> +	group = xa_erase(&gpool->xa, group_handle);
> +	if (!group)
> +		return -EINVAL;
> +
> +	for (u32 i = 0; i < group->queue_count; i++) {
> +		if (group->queues[i])
> +			drm_sched_entity_destroy(&group->queues[i]->entity);
> +	}
> +
> +	mutex_lock(&sched->reset.lock);
> +	mutex_lock(&sched->lock);
> +	group->destroyed = true;
> +	if (group->csg_id >= 0) {
> +		sched_queue_delayed_work(sched, tick, 0);
> +	} else if (!sched->reset.in_progress) {
> +		/* Remove from the run queues, so the scheduler can't
> +		 * pick the group on the next tick.
> +		 */
> +		list_del_init(&group->run_node);
> +		list_del_init(&group->wait_node);
> +		group_queue_work(group, term);
> +	}
> +	mutex_unlock(&sched->lock);
> +	mutex_unlock(&sched->reset.lock);
> +
> +	group_put(group);
> +	return 0;
> +}
> +
> +int panthor_group_get_state(struct panthor_file *pfile,
> +			    struct drm_panthor_group_get_state *get_state)
> +{
> +	struct panthor_group_pool *gpool = pfile->groups;
> +	struct panthor_device *ptdev = pfile->ptdev;
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	struct panthor_group *group;
> +
> +	if (get_state->pad)
> +		return -EINVAL;
> +
> +	group = group_get(xa_load(&gpool->xa, get_state->group_handle));
> +	if (!group)
> +		return -EINVAL;
> +
> +	memset(get_state, 0, sizeof(*get_state));
> +
> +	mutex_lock(&sched->lock);
> +	if (group->timedout)
> +		get_state->state |= DRM_PANTHOR_GROUP_STATE_TIMEDOUT;
> +	if (group->fatal_queues) {
> +		get_state->state |= DRM_PANTHOR_GROUP_STATE_FATAL_FAULT;
> +		get_state->fatal_queues = group->fatal_queues;
> +	}
> +	mutex_unlock(&sched->lock);
> +
> +	group_put(group);
> +	return 0;
> +}
> +
> +int panthor_group_pool_create(struct panthor_file *pfile)
> +{
> +	struct panthor_group_pool *gpool;
> +
> +	gpool = kzalloc(sizeof(*gpool), GFP_KERNEL);
> +	if (!gpool)
> +		return -ENOMEM;
> +
> +	xa_init_flags(&gpool->xa, XA_FLAGS_ALLOC1);
> +	pfile->groups = gpool;
> +	return 0;
> +}
> +
> +void panthor_group_pool_destroy(struct panthor_file *pfile)
> +{
> +	struct panthor_group_pool *gpool = pfile->groups;
> +	struct panthor_group *group;
> +	unsigned long i;
> +
> +	if (IS_ERR_OR_NULL(gpool))
> +		return;
> +
> +	xa_for_each(&gpool->xa, i, group)
> +		panthor_group_destroy(pfile, i);
> +
> +	xa_destroy(&gpool->xa);
> +	kfree(gpool);
> +	pfile->groups = NULL;
> +}
> +
> +static void job_release(struct kref *ref)
> +{
> +	struct panthor_job *job = container_of(ref, struct panthor_job, refcount);
> +
> +	drm_WARN_ON(&job->group->ptdev->base, !list_empty(&job->node));
> +
> +	if (job->base.s_fence)
> +		drm_sched_job_cleanup(&job->base);
> +
> +	if (job->done_fence && job->done_fence->ops)
> +		dma_fence_put(job->done_fence);
> +	else
> +		dma_fence_free(job->done_fence);
> +
> +	group_put(job->group);
> +
> +	kfree(job);
> +}
> +
> +struct drm_sched_job *panthor_job_get(struct drm_sched_job *sched_job)
> +{
> +	if (sched_job) {
> +		struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +
> +		kref_get(&job->refcount);
> +	}
> +
> +	return sched_job;
> +}
> +
> +void panthor_job_put(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +
> +	if (sched_job)
> +		kref_put(&job->refcount, job_release);
> +}
> +
> +struct drm_sched_job *
> +panthor_job_create(struct panthor_file *pfile,
> +		   u16 group_handle,
> +		   const struct drm_panthor_queue_submit *qsubmit)
> +{
> +	struct panthor_group_pool *gpool = pfile->groups;
> +	struct panthor_job *job;
> +	int ret;
> +
> +	if (qsubmit->pad)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* If stream_addr is zero, stream_size should be zero too. */
> +	if ((qsubmit->stream_size == 0) != (qsubmit->stream_addr == 0))
> +		return ERR_PTR(-EINVAL);
> +
> +	/* Make sure the address is aligned on 64-byte (cacheline) and the size is
> +	 * aligned on 8-byte (instruction size).
> +	 */
> +	if ((qsubmit->stream_addr & 63) || (qsubmit->stream_size & 7))
> +		return ERR_PTR(-EINVAL);
> +
> +	/* bits 24:30 must be zero. */
> +	if (qsubmit->latest_flush & GENMASK(30, 24))
> +		return ERR_PTR(-EINVAL);
> +
> +	job = kzalloc(sizeof(*job), GFP_KERNEL);
> +	if (!job)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&job->refcount);
> +	job->queue_idx = qsubmit->queue_index;
> +	job->call_info.size = qsubmit->stream_size;
> +	job->call_info.start = qsubmit->stream_addr;
> +	job->call_info.latest_flush = qsubmit->latest_flush;
> +	INIT_LIST_HEAD(&job->node);
> +
> +	job->group = group_get(xa_load(&gpool->xa, group_handle));
> +	if (!job->group) {
> +		ret = -EINVAL;
> +		goto err_put_job;
> +	}
> +
> +	if (job->queue_idx >= job->group->queue_count ||
> +	    !job->group->queues[job->queue_idx]) {
> +		ret = -EINVAL;
> +		goto err_put_job;
> +	}
> +
> +	job->done_fence = kzalloc(sizeof(*job->done_fence), GFP_KERNEL);
> +	if (!job->done_fence) {
> +		ret = -ENOMEM;
> +		goto err_put_job;
> +	}
> +
> +	ret = drm_sched_job_init(&job->base,
> +				 &job->group->queues[job->queue_idx]->entity,
> +				 job->group);
> +	if (ret)
> +		goto err_put_job;
> +
> +	return &job->base;
> +
> +err_put_job:
> +	panthor_job_put(&job->base);
> +	return ERR_PTR(ret);
> +}
> +
> +int panthor_job_prepare_resvs(struct drm_exec *exec,
> +			      struct drm_sched_job *sched_job)
> +{
> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +
> +	return panthor_vm_prepare_mapped_bos_resvs(exec, job->group->vm);
> +}
> +
> +int panthor_job_add_resvs_deps(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +
> +	return panthor_vm_add_bos_resvs_deps_to_job(job->group->vm, sched_job);
> +}
> +
> +void panthor_job_update_resvs(struct drm_sched_job *sched_job)
> +{
> +	struct panthor_job *job = container_of(sched_job, struct panthor_job, base);
> +
> +	panthor_vm_add_job_fence_to_bos_resvs(job->group->vm, sched_job);
> +}
> +
> +void panthor_sched_unplug(struct panthor_device *ptdev)
> +{
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +
> +	cancel_delayed_work_sync(&sched->tick_work);
> +
> +	mutex_lock(&sched->lock);
> +	if (sched->pm.has_ref) {
> +		pm_runtime_put(ptdev->base.dev);
> +		sched->pm.has_ref = false;
> +	}
> +	mutex_unlock(&sched->lock);
> +}
> +
> +static void panthor_sched_fini(struct drm_device *ddev, void *res)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	struct panthor_scheduler *sched = ptdev->scheduler;
> +	int prio;
> +
> +	if (!sched || !sched->csg_slot_count)
> +		return;
> +
> +	cancel_delayed_work_sync(&sched->tick_work);
> +
> +	if (sched->wq) {
> +		drain_workqueue(sched->wq);
> +		destroy_workqueue(sched->wq);
> +	}
> +
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
> +		drm_WARN_ON(ddev, !list_empty(&sched->groups.runnable[prio]));
> +		drm_WARN_ON(ddev, !list_empty(&sched->groups.idle[prio]));
> +	}
> +
> +	drm_WARN_ON(ddev, !list_empty(&sched->groups.waiting));
> +}
> +
> +int panthor_sched_init(struct panthor_device *ptdev)
> +{
> +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> +	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, 0);
> +	struct panthor_fw_cs_iface *cs_iface = panthor_fw_get_cs_iface(ptdev, 0, 0);
> +	struct panthor_scheduler *sched;
> +	u32 gpu_as_count, num_groups;
> +	int prio;
> +
> +	sched = drmm_kzalloc(&ptdev->base, sizeof(*sched), GFP_KERNEL);
> +	if (!sched)
> +		return -ENOMEM;
> +
> +	/* The highest bit in JOB_INT_* is reserved for global IRQs. That
> +	 * leaves 31 bits for CSG IRQs, hence the MAX_CSGS clamp here.
> +	 */
> +	num_groups = min_t(u32, MAX_CSGS, glb_iface->control->group_num);
> +
> +	/* The FW-side scheduler might deadlock if two groups with the same
> +	 * priority try to access a set of resources that overlaps, with part
> +	 * of the resources being allocated to one group and the other part to
> +	 * the other group, both groups waiting for the remaining resources to
> +	 * be allocated. To avoid that, it is recommended to assign each CSG a
> +	 * different priority. In theory we could allow several groups to have
> +	 * the same CSG priority if they don't request the same resources, but
> +	 * that makes the scheduling logic more complicated, so let's clamp
> +	 * the number of CSG slots to MAX_CSG_PRIO + 1 for now.
> +	 */
> +	num_groups = min_t(u32, MAX_CSG_PRIO + 1, num_groups);
> +
> +	/* We need at least one AS for the MCU and one for the GPU contexts. */
> +	gpu_as_count = hweight32(ptdev->gpu_info.as_present & GENMASK(31, 1));
> +	if (!gpu_as_count) {
> +		drm_err(&ptdev->base, "Not enough AS (%d, expected at least 2)",
> +			gpu_as_count + 1);
> +		return -EINVAL;
> +	}
> +
> +	sched->ptdev = ptdev;
> +	sched->sb_slot_count = CS_FEATURES_SCOREBOARDS(cs_iface->control->features);
> +	sched->csg_slot_count = num_groups;
> +	sched->cs_slot_count = csg_iface->control->stream_num;
> +	sched->as_slot_count = gpu_as_count;
> +	ptdev->csif_info.csg_slot_count = sched->csg_slot_count;
> +	ptdev->csif_info.cs_slot_count = sched->cs_slot_count;
> +	ptdev->csif_info.scoreboard_slot_count = sched->sb_slot_count;
> +
> +	sched->last_tick = 0;
> +	sched->resched_target = U64_MAX;
> +	sched->tick_period = msecs_to_jiffies(10);
> +	INIT_DELAYED_WORK(&sched->tick_work, tick_work);
> +	INIT_WORK(&sched->sync_upd_work, sync_upd_work);
> +
> +	drmm_mutex_init(&ptdev->base, &sched->lock);
> +	for (prio = PANTHOR_CSG_PRIORITY_COUNT - 1; prio >= 0; prio--) {
> +		INIT_LIST_HEAD(&sched->groups.runnable[prio]);
> +		INIT_LIST_HEAD(&sched->groups.idle[prio]);
> +	}
> +	INIT_LIST_HEAD(&sched->groups.waiting);
> +
> +	drmm_mutex_init(&ptdev->base, &sched->reset.lock);
> +	INIT_LIST_HEAD(&sched->reset.stopped_groups);
> +
> +	ptdev->scheduler = sched;
> +
> +	sched->wq = alloc_workqueue("panthor-csf-sched", WQ_UNBOUND, 0);
> +	if (!sched->wq) {
> +		panthor_sched_fini(&ptdev->base, NULL);
> +		drm_err(&ptdev->base, "Failed to allocate the workqueues");
> +		return -ENOMEM;
> +	}
> +
> +	return drmm_add_action_or_reset(&ptdev->base, panthor_sched_fini, NULL);
> +}
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
> new file mode 100644
> index 000000000000..ecdd9dd41ad9
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_sched.h
> @@ -0,0 +1,50 @@
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> +/* Copyright 2023 Collabora ltd. */
> +
> +#ifndef __PANTHOR_SCHED_H__
> +#define __PANTHOR_SCHED_H__
> +
> +#include <drm/panthor_drm.h>
> +
> +struct drm_exec;
> +struct dma_fence;
> +struct drm_file;
> +struct drm_gem_object;
> +struct drm_sched_job;
> +struct panthor_device;
> +struct panthor_file;
> +struct panthor_group_pool;
> +struct panthor_job;
> +
> +int panthor_group_create(struct panthor_file *pfile,
> +			 const struct drm_panthor_group_create *group_args,
> +			 const struct drm_panthor_queue_create *queue_args);
> +int panthor_group_destroy(struct panthor_file *pfile, u32 group_handle);
> +int panthor_group_get_state(struct panthor_file *pfile,
> +			    struct drm_panthor_group_get_state *get_state);
> +
> +struct drm_sched_job *
> +panthor_job_create(struct panthor_file *pfile,
> +		   u16 group_handle,
> +		   const struct drm_panthor_queue_submit *qsubmit);
> +struct drm_sched_job *panthor_job_get(struct drm_sched_job *job);
> +void panthor_job_put(struct drm_sched_job *job);
> +int panthor_job_prepare_resvs(struct drm_exec *exec,
> +			      struct drm_sched_job *job);
> +int panthor_job_add_resvs_deps(struct drm_sched_job *job);
> +void panthor_job_update_resvs(struct drm_sched_job *job);
> +
> +int panthor_group_pool_create(struct panthor_file *pfile);
> +void panthor_group_pool_destroy(struct panthor_file *pfile);
> +
> +void panthor_sched_process_csg_irq(struct panthor_device *ptdev, u32 csg_slot);
> +void panthor_sched_process_global_irq(struct panthor_device *ptdev);
> +
> +int panthor_sched_init(struct panthor_device *ptdev);
> +void panthor_sched_unplug(struct panthor_device *ptdev);
> +void panthor_sched_pre_reset(struct panthor_device *ptdev);
> +void panthor_sched_post_reset(struct panthor_device *ptdev);
> +void panthor_sched_suspend(struct panthor_device *ptdev);
> +void panthor_sched_resume(struct panthor_device *ptdev);
> +
> +#endif


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t
  2023-08-11 13:08   ` Steven Price
@ 2023-08-19  2:13     ` Dmitry Osipenko
  2023-08-28  9:03       ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Dmitry Osipenko @ 2023-08-19  2:13 UTC (permalink / raw)
  To: Steven Price, Boris Brezillon, dri-devel
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 8/11/23 16:08, Steven Price wrote:
> On 09/08/2023 17:53, Boris Brezillon wrote:
>> This way we can grab a pages ref without acquiring the resv lock when
>> pages_use_count > 0. Need to implement asynchronous map using the
> 
> NIT: s/Need/This is needed/
> 
>> drm_gpuva_mgr when the map/unmap operation triggers a mapping split,
>> requiring the new left/right regions to grab an additional page ref
>> to guarantee that the pages stay pinned when the middle section is
>> unmapped.
>>
>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
>> ---
>>  drivers/gpu/drm/drm_gem_shmem_helper.c  | 28 +++++++++++++------------
>>  drivers/gpu/drm/lima/lima_gem.c         |  2 +-
>>  drivers/gpu/drm/panfrost/panfrost_mmu.c |  2 +-
>>  include/drm/drm_gem_shmem_helper.h      |  2 +-
>>  4 files changed, 18 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
>> index a783d2245599..ca6938ea1b82 100644
>> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
>> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
>> @@ -155,7 +155,7 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
>>  		if (shmem->pages)
>>  			drm_gem_shmem_put_pages(shmem);
>>  
>> -		drm_WARN_ON(obj->dev, shmem->pages_use_count);
>> +		drm_WARN_ON(obj->dev, atomic_read(&shmem->pages_use_count));
>>  
>>  		dma_resv_unlock(shmem->base.resv);
>>  	}
>> @@ -172,14 +172,14 @@ static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
>>  
>>  	dma_resv_assert_held(shmem->base.resv);
>>  
>> -	if (shmem->pages_use_count++ > 0)
>> +	if (atomic_inc_return(&shmem->pages_use_count) > 1)
>>  		return 0;
>>  
>>  	pages = drm_gem_get_pages(obj);
>>  	if (IS_ERR(pages)) {
>>  		drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
>>  			    PTR_ERR(pages));
>> -		shmem->pages_use_count = 0;
>> +		atomic_set(&shmem->pages_use_count, 0);
>>  		return PTR_ERR(pages);
>>  	}
>>  
>> @@ -210,10 +210,10 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
>>  
>>  	dma_resv_assert_held(shmem->base.resv);
>>  
>> -	if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
>> +	if (drm_WARN_ON_ONCE(obj->dev, !atomic_read(&shmem->pages_use_count)))
>>  		return;
>>  
>> -	if (--shmem->pages_use_count > 0)
>> +	if (atomic_dec_return(&shmem->pages_use_count) > 0)
>>  		return;
>>  
>>  #ifdef CONFIG_X86
>> @@ -263,6 +263,10 @@ int drm_gem_shmem_pin(struct drm_gem_shmem_object *shmem)
>>  
>>  	drm_WARN_ON(obj->dev, obj->import_attach);
>>  
>> +	/* If we are the first owner, we need to grab the lock. */
>> +	if (atomic_inc_not_zero(&shmem->pages_use_count))
>> +		return 0;
>> +
> 
> Unless I'm misunderstanding I think this introduces a race where two
> threads call drm_gem_shmem_pin() at the same time:
> 
> Thread1				| Thread 2
> --------------------------------+------------------------------
> drm_gem_shmem_pin()		|
>  - pages_use_count == 0 so not  |
>    incremented                  |
>  - lock taken			|
> drm_gem_shmem_pin_locked()	|
> drm_gem_shmem_get_pages()	|
>  - pages_use_count incremented	|
> <thread descheduled>            | drm_gem_shmem_pin()
>                                 |  - pages_use_count == 1 so it is
> 				|    incremented and returns early
> 				|    without taking the lock
> 				| Code tries to use shmem->pages
> <thread rescheduled>		| and blows up
> drm_gem_get_pages()		|
> shmem->pages populated		|
> lock released			|
> 
> I think you need to modify drm_gem_shmem_get_pages() to only increment
> pages_use_count when shmem->pages has been populated. That also gets rid
> of the atomic_set() in that function which scares me.

This is correct, both pin() and get_pages() should use
atomic_inc_not_zero().

Note that we shouldn't open-code atomic refcounting; there is the kref
helper for that, which uses refcount_t underneath and has additional
checks/warnings for count underflow/overflow. I'm going to post patches
converting drm-shmem to kref around next week, Boris is aware of it,
and we should then sync the shrinker/panthor patchsets to the common
drm-shmem base.
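
To illustrate, with the refcount_t conversion the get_pages() fast path
could look roughly like this (untested sketch; with the current atomic_t
it would be atomic_inc_not_zero() instead, and the barriers plus the x86
WC handling are left out):

static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
{
	struct drm_gem_object *obj = &shmem->base;
	struct page **pages;

	dma_resv_assert_held(shmem->base.resv);

	/* Fast path: pages are already populated, just take a reference. */
	if (refcount_inc_not_zero(&shmem->pages_use_count))
		return 0;

	pages = drm_gem_get_pages(obj);
	if (IS_ERR(pages))
		return PTR_ERR(pages);

	shmem->pages = pages;

	/* Only set the refcount once shmem->pages is assigned, so a
	 * concurrent lockless pin() fast path can't see a non-zero count
	 * with NULL pages (ordering details omitted here).
	 */
	refcount_set(&shmem->pages_use_count, 1);
	return 0;
}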

-- 
Best regards,
Dmitry


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-08-09 16:53   ` Boris Brezillon
@ 2023-08-20  8:01     ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 93+ messages in thread
From: Krzysztof Kozlowski @ 2023-08-20  8:01 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Conor Dooley, Nicolas Boichat, Daniel Stone, Krzysztof Kozlowski,
	Neil Armstrong, Liviu Dudau, Steven Price, devicetree,
	Rob Herring, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 09/08/2023 18:53, Boris Brezillon wrote:
> From: Liviu Dudau <liviu.dudau@arm.com>
> 
> Arm has introduced a new v10 GPU architecture that replaces the Job Manager
> interface with a new Command Stream Frontend. It adds firmware driven
> command stream queues that can be used by kernel and user space to submit
> jobs to the GPU.
> 
> Add the initial schema for the device tree that is based on support for
> RK3588 SoC. The minimum number of clocks is one for the IP, but on Rockchip
> platforms they will tend to expose the semi-independent clocks for better
> power management.

A nit on the subject: drop the second, redundant "bindings for". The
"dt-bindings" prefix already states that these are bindings.

Also drop "driver" from the subject. Bindings are for hardware, not drivers.

> 
> v2:
> - New commit
> 
> Signed-off-by: Liviu Dudau <liviu.dudau@arm.com>

SoB chain is incomplete.

> Cc: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Conor Dooley <conor+dt@kernel.org>
> Cc: devicetree@vger.kernel.org
> ---
>  .../bindings/gpu/arm,mali-valhall-csf.yaml    | 148 ++++++++++++++++++
>  1 file changed, 148 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
> 
> diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
> new file mode 100644
> index 000000000000..2b9f77aa0b7a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
> @@ -0,0 +1,148 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/gpu/arm,mali-valhall-csf.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: ARM Mali Valhall GPU
> +
> +maintainers:
> +  - Liviu Dudau <liviu.dudau@arm.com>
> +  - Boris Brezillon <boris.brezillon@collabora.com>
> +
> +properties:
> +  $nodename:
> +    pattern: '^gpu@[a-f0-9]+$'
> +
> +  compatible:
> +    oneOf:

Drop oneOf.

> +      - items:
> +          - enum:
> +              - rockchip,rk3588-mali
> +          - const: arm,mali-valhall-csf   # Mali Valhall GPU model/revision is fully discoverable
> +
> +  reg:
> +    maxItems: 1
> +
> +  interrupts:
> +    items:
> +      - description: Job interrupt
> +      - description: MMU interrupt
> +      - description: GPU interrupt
> +
> +  interrupt-names:
> +    items:
> +      - const: job
> +      - const: mmu
> +      - const: gpu
> +
> +  clocks:
> +    minItems: 1
> +    maxItems: 3
> +
> +  clock-names:
> +    minItems: 1
> +    items:
> +      - const: core
> +      - const: coregroup
> +      - const: stacks
> +
> +  mali-supply: true
> +
> +  sram-supply: true
> +
> +  operating-points-v2: true

Missing opp-table.

> +
> +  power-domains:
> +    minItems: 1
> +    maxItems: 5
> +
> +  power-domain-names:
> +    minItems: 1
> +    maxItems: 5
> +
> +  "#cooling-cells":
> +    const: 2
> +
> +  dynamic-power-coefficient:
> +    $ref: /schemas/types.yaml#/definitions/uint32
> +    description:
> +      A u32 value that represents the running time dynamic
> +      power coefficient in units of uW/MHz/V^2. The
> +      coefficient can either be calculated from power
> +      measurements or derived by analysis.
> +
> +      The dynamic power consumption of the GPU is
> +      proportional to the square of the Voltage (V) and
> +      the clock frequency (f). The coefficient is used to
> +      calculate the dynamic power as below -
> +
> +      Pdyn = dynamic-power-coefficient * V^2 * f
> +
> +      where voltage is in V, frequency is in MHz.
> +
> +  dma-coherent: true
> +
> +required:
> +  - compatible
> +  - reg
> +  - interrupts
> +  - interrupt-names
> +  - clocks
> +  - mali-supply
> +
> +additionalProperties: false
> +
> +allOf:
> +  - if:
> +      properties:
> +        compatible:
> +          contains:
> +            const: rockchip,rk3588-mali
> +    then:
> +      properties:
> +        clocks:
> +          minItems: 3
> +        clock-names:
> +          items:
> +            - const: core
> +            - const: coregroup
> +            - const: stacks

This duplicates top-level. Just minItems: 3.

Please also describe the power domains - constraints and names.

> +
> +examples:
> +  - |
> +    #include <dt-bindings/clock/rockchip,rk3588-cru.h>
> +    #include <dt-bindings/interrupt-controller/irq.h>
> +    #include <dt-bindings/interrupt-controller/arm-gic.h>
> +    #include <dt-bindings/power/rk3588-power.h>
> +
> +    gpu: gpu@fb000000 {
> +        compatible = "rockchip,rk3588-mali", "arm,mali-valhall-csf";
> +        reg = <0xfb000000 0x200000>;
> +        interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH 0>,
> +                     <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH 0>,
> +                     <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH 0>;
> +        interrupt-names = "job", "mmu", "gpu";
> +        clock-names = "core", "coregroup", "stacks";
> +        clocks = <&cru CLK_GPU>, <&cru CLK_GPU_COREGROUP>,
> +                 <&cru CLK_GPU_STACKS>;
> +        power-domains = <&power RK3588_PD_GPU>;
> +        operating-points-v2 = <&gpu_opp_table>;
> +        mali-supply = <&vdd_gpu_s0>;
> +        sram-supply = <&vdd_gpu_mem_s0>;
> +        status = "disabled";

Drop status.

> +    };
> +
> +    gpu_opp_table: opp-table {

Opp table should be inside the device node.

> +        compatible = "operating-points-v2";
> +        opp-300000000 {
> +            opp-hz = /bits/ 64 <300000000>;
> +            opp-microvolt = <675000 675000 850000>;
> +        };

Best regards,
Krzysztof


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 12/15] drm/panthor: Add the driver frontend block
  2023-08-09 16:53 ` [PATCH v2 12/15] drm/panthor: Add the driver frontend block Boris Brezillon
@ 2023-08-21 11:31   ` Steven Price
  2023-08-29 17:46     ` Boris Brezillon
  2023-09-06 12:38   ` Ketil Johnsen
  1 sibling, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-21 11:31 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> This is the last piece missing to expose the driver to the outside
> world.
> 
> This is basically a wrapper between the ioctls and the other logical
> blocks.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Document the code
> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> - Fix various bugs
> - Refactored the code to make job submission re-usable for VM_BIND
>   jobs
> - Add user object copy helpers
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_drv.c | 1540 +++++++++++++++++++++++++
>  1 file changed, 1540 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> new file mode 100644
> index 000000000000..377ebea4c0e8
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -0,0 +1,1540 @@
> +// SPDX-License-Identifier: GPL-2.0 or MIT
> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> +/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
> +/* Copyright 2019 Collabora ltd. */
> +
> +#include <linux/module.h>
> +#include <linux/of_platform.h>
> +#include <linux/pagemap.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/xarray.h>
> +
> +#include <drm/drm_drv.h>
> +#include <drm/drm_exec.h>
> +#include <drm/drm_ioctl.h>
> +#include <drm/drm_syncobj.h>
> +#include <drm/drm_utils.h>
> +#include <drm/drm_debugfs.h>
> +#include <drm/gpu_scheduler.h>
> +#include <drm/panthor_drm.h>
> +
> +#include "panthor_sched.h"
> +#include "panthor_device.h"
> +#include "panthor_gem.h"
> +#include "panthor_heap.h"
> +#include "panthor_fw.h"
> +#include "panthor_mmu.h"
> +#include "panthor_gpu.h"
> +#include "panthor_regs.h"
> +
> +/**
> + * DOC: user <-> kernel object copy helpers.
> + */
> +
> +/**
> + * panthor_set_uobj() - Copy kernel object to user object.
> + * @usr_ptr: Users pointer.
> + * @usr_size: Size of the user object.
> + * @min_size: Minimum size for this object.
> + * @kern_size: Size of the kernel object.
> + * @in: Address of the kernel object to copy.
> + *
> + * Helper automating kernel -> user object copies.
> + *
> + * Don't use this function directly, use PANTHOR_UOBJ_SET() instead.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const void *in)
> +{
> +	/* User size shouldn't be smaller than the minimal object size. */
> +	if (usr_size < min_size)
> +		return -EINVAL;
> +
> +	if (copy_to_user(u64_to_user_ptr(usr_ptr), in, min_t(u32, usr_size, kern_size)))
> +		return -EFAULT;
> +
> +	/* When the kernel object is smaller than the user object, we fill the gap with
> +	 * zeros.
> +	 */
> +	if (usr_size > kern_size &&
> +	    clear_user(u64_to_user_ptr(usr_ptr + kern_size), usr_size - kern_size)) {
> +		return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_get_uobj_array() - Copy a user object array into a kernel accessible object array.
> + * @in: The object array to copy.
> + * @min_stride: Minimum array stride.
> + * @obj_size: Kernel object size.
> + * @out: Pointer to a variable that will hold the newly allocated object array.
> + *
> + * Helper automating user -> kernel object copies.
> + *
> + * Don't use this function directly, use PANTHOR_UOBJ_ARRAY_GET() instead.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
> +		       u32 obj_size, void **out)

Instead of having 'out' as a return parameter you could use ERR_PTR()s 
for the error cases. I know why you haven't, but see below.

> +{
> +	int ret = 0;
> +	void *out_alloc;
> +
> +	/* User stride must be at least the minimum object size, otherwise it might
> +	 * lack useful information.
> +	 */
> +	if (in->stride < min_stride)
> +		return -EINVAL;
> +
> +	if (!in->count)
> +		return 0;
> +
> +	out_alloc = kvmalloc_array(in->count, obj_size, GFP_KERNEL);
> +	if (!out_alloc)
> +		return -ENOMEM;
> +
> +	if (obj_size == in->stride) {
> +		/* Fast path when user/kernel have the same uAPI header version. */
> +		if (copy_from_user(out_alloc, u64_to_user_ptr(in->array),
> +				   (unsigned long)obj_size * in->count))
> +			ret = -EFAULT;
> +	} else {
> +		void __user *in_ptr = u64_to_user_ptr(in->array);
> +		void *out_ptr = out_alloc;
> +
> +		/* If the sizes differ, we need to copy elements one by one. */
> +		for (u32 i = 0; i < in->count; i++) {
> +			ret = copy_struct_from_user(out_ptr, obj_size, in_ptr, in->stride);
> +			if (ret)
> +				break;
> +
> +			out_ptr += obj_size;
> +			in_ptr += in->stride;
> +		}
> +	}
> +
> +	if (ret) {
> +		kvfree(out_alloc);
> +		return ret;
> +	}
> +
> +	*out = out_alloc;
> +	return 0;
> +}
> +
> +/**
> + * PANTHOR_UOBJ_MIN_SIZE_INTERNAL() - Get the minimum user object size
> + * @_typename: Object type.
> + * @_last_mandatory_field: Last mandatory field.
> + *
> + * Get the minimum user object size based on the last mandatory field name,
> + * A.K.A, the name of the last field of the structure at the time this
> + * structure was added to the uAPI.
> + *
> + * Don't use directly, use PANTHOR_UOBJ_DECL() instead.
> + */
> +#define PANTHOR_UOBJ_MIN_SIZE_INTERNAL(_typename, _last_mandatory_field) \
> +	(offsetof(_typename, _last_mandatory_field) + \
> +	 sizeof(((_typename *)NULL)->_last_mandatory_field))
> +
> +/**
> + * PANTHOR_UOBJ_DECL() - Declare a new uAPI object that is subject to
> + * evolutions.
> + * @_typename: Object type.
> + * @_last_mandatory_field: Last mandatory field.
> + *
> + * Should be used to extend the PANTHOR_UOBJ_MIN_SIZE() list.
> + */
> +#define PANTHOR_UOBJ_DECL(_typename, _last_mandatory_field) \
> +	_typename : PANTHOR_UOBJ_MIN_SIZE_INTERNAL(_typename, _last_mandatory_field)
> +
> +/**
> + * PANTHOR_UOBJ_MIN_SIZE() - Get the minimum size of a given uAPI object
> + * @_obj_name: Object to get the minimum size of.
> + *
> + * Don't use this macro directly, it's automatically called by
> + * PANTHOR_UOBJ_{SET,GET_ARRAY}().
> + */
> +#define PANTHOR_UOBJ_MIN_SIZE(_obj_name) \
> +	_Generic(_obj_name, \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_gpu_info, tiler_present), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_csif_info, pad), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
> +		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs))
> +
> +/**
> + * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
> + * @_dest_usr_ptr: User pointer to copy to.
> + * @_usr_size: Size of the user object.
> + * @_src_obj: Kernel object to copy (not a pointer).
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +#define PANTHOR_UOBJ_SET(_dest_usr_ptr, _usr_size, _src_obj) \
> +	panthor_set_uobj(_dest_usr_ptr, _usr_size, \
> +			 PANTHOR_UOBJ_MIN_SIZE(_src_obj), \
> +			 sizeof(_src_obj), &(_src_obj))
> +
> +/**
> + * PANTHOR_UOBJ_GET_ARRAY() - Copy a user object array to a kernel accessible
> + * object array.
> + * @_dest_array: Local variable that will hold the newly allocated kernel
> + * object array.
> + * @_uobj_array: The drm_panthor_obj_array object describing the user object
> + * array.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) \
> +	panthor_get_uobj_array(_uobj_array, \
> +			       PANTHOR_UOBJ_MIN_SIZE((_dest_array)[0]), \
> +			       sizeof((_dest_array)[0]), (void **)&(_dest_array))

Here you have an ugly cast to make the output pointer work. The below 
patch avoids this by changing panthor_get_uobj_array() to return an 
ERR_PTR:

----8<----
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 377ebea4c0e8..ff749832f344 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -79,9 +79,9 @@ panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const v
  *
  * Return: 0 on success, a negative error code otherwise.
  */
-static int
+static void *
 panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
-		       u32 obj_size, void **out)
+		       u32 obj_size)
 {
 	int ret = 0;
 	void *out_alloc;
@@ -90,14 +90,14 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
 	 * lack useful information.
 	 */
 	if (in->stride < min_stride)
-		return -EINVAL;
+		return ERR_PTR(-EINVAL);
 
 	if (!in->count)
-		return 0;
+		return NULL;
 
 	out_alloc = kvmalloc_array(in->count, obj_size, GFP_KERNEL);
 	if (!out_alloc)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	if (obj_size == in->stride) {
 		/* Fast path when user/kernel have the same uAPI header version. */
@@ -121,11 +121,10 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
 
 	if (ret) {
 		kvfree(out_alloc);
-		return ret;
+		return ERR_PTR(ret);
 	}
 
-	*out = out_alloc;
-	return 0;
+	return out_alloc;
 }
 
 /**
@@ -193,10 +192,12 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
  *
  * Return: 0 on success, a negative error code otherwise.
  */
-#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) \
-	panthor_get_uobj_array(_uobj_array, \
+#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) ({\
+	_dest_array = panthor_get_uobj_array(_uobj_array, \
 			       PANTHOR_UOBJ_MIN_SIZE((_dest_array)[0]), \
-			       sizeof((_dest_array)[0]), (void **)&(_dest_array))
+			       sizeof((_dest_array)[0])); \
+	IS_ERR(_dest_array) ? PTR_ERR(_dest_array) : 0; \
+	})
 
 /**
  * DOC: Job submission helpers.
---8<----

TBH, I'd also be tempted to make PANTHOR_UOBJ_GET_ARRAY simply return 
the ERR_PTR and change the call sites appropriately. That way you avoid 
the 'magic' of passing an lvalue.
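
Something along these lines maybe (untested sketch; the destination is
still passed in, but only for the _Generic/sizeof type inference, so it
is never evaluated or assigned by the macro itself):

#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) \
	panthor_get_uobj_array(_uobj_array, \
			       PANTHOR_UOBJ_MIN_SIZE((_dest_array)[0]), \
			       sizeof((_dest_array)[0]))

and the call sites become:

	ctx->jobs[idx].syncops = PANTHOR_UOBJ_GET_ARRAY(ctx->jobs[idx].syncops, syncs);
	if (IS_ERR(ctx->jobs[idx].syncops)) {
		ret = PTR_ERR(ctx->jobs[idx].syncops);
		ctx->jobs[idx].syncops = NULL;
		return ret;
	}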

> +
> +/**
> + * DOC: Job submission helpers.
> + *
> + * Here is the workflow for atomic submission of multiple jobs. By atomic,
> + * we mean that we either submit the whole batch, or nothing. This requires
> + * doing things in multiple steps, each step operating on all jobs belonging
> + * to a batch.
> + *
> + * int xxx_submit_ioctl(...)
> + * {
> + *	...
> + *
> + *	// Initialize the submission context.
> + *	ret = panthor_submit_ctx_init(&ctx, file, job_count);
> + *	if (ret)
> + *		return ret;
> + *
> + *	// Create jobs and attach sync operations.
> + *	for (u32 i = 0; i < job_count; i++) {
> + *		...
> + *
> + *		// Create job
> + *		job = job_create(pfile, ...);
> + *		if (IS_ERR(job)) {
> + *			ret = PTR_ERR(job);
> + *			goto out_cleanup_submit_ctx;
> + *		}
> + *
> + *		// Add job to the submit context
> + *		ret = panthor_submit_ctx_add_job(&ctx, i, job, sync_ops);
> + *		if (ret)
> + *			goto out_cleanup_submit_ctx;
> + *	}
> + *
> + *	// Collect signal operations on all jobs, such that each job can pick
> + *	// from it for its dependencies and update the fence to signal when
> + *	// the job is submitted.

I can't figure out here how we avoid depedency loops within a batch. 
What stops two jobs from each depending on each other?

Or do we "allow" this but rely on the loop in panthor_submit_ctx_add_deps_and_arm_jobs()
to effectively enforce that a job cannot actually depend on a job
which is later in the batch. In which case why bother with this
complexity rather than just performing all the steps on each job
in order?

Being able to submit a forward dependency, but then having it
ignored seems like an odd design. So I feel like I must be
missing something.

> + *	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
> + *	if (ret)
> + *		goto out_cleanup_submit_ctx;
> + *
> + *	// We acquire/prepare resvs on all jobs before proceeding with the
> + *	// dependency registration.
> + *	//
> + *	// This is solving two problems:
> + *	// 1. drm_sched_job_arm() and drm_sched_entity_push_job() must be protected
> + *	//    by a lock to make sure no concurrent access to the same entity get
> + *	//    interleaved, which would mess up with the fence seqno ordering.
> + *	//    Luckily, one of the resv being acquired is the VM resv, and a scheduling
> + *	//    entity is only bound to a single VM. As soon as we acquire the VM resv,
> + *	//    we should be safe.
> + *	// 2. Jobs might depend on fences that were issued by previous jobs in the
> + *	//    same batch, so we can't add dependencies on all jobs before arming
> + *	//    previous jobs and registering the fence to the signal array, otherwise
> + *	//    we might miss dependencies, or point to an outdated fence.
> + *	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_job_prepare_resvs);
> + *	if (ret)
> + *		goto out_cleanup_submit_ctx;
> + *
> + *	// Now that resvs are locked/prepared, we can iterate over each job to add
> + *	// the dependencies, arm the job fence, register the job fence to the signal
> + *	// array.
> + *	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_job_add_resvs_deps);
> + *	if (ret)
> + *		goto out_cleanup_submit_ctx;
> + *
> + *	// Nothing can fail after that point, so we can make our job fences visible to the
> + *	// outside world. Push jobs and set the job fences to the resv slots we reserved.
> + *	// This also pushes the fences to the syncobjs that are part of the signal array.
> + *	panthor_submit_ctx_push_jobs(&ctx, panthor_job_update_resvs);
> + *
> + * out_cleanup_submit_ctx:
> + *	// Cleanup the context.
> + *	panthor_submit_ctx_cleanup(&ctx, panthor_job_put);
> + *	...
> + *	return ret;
> + *}

I'm not sure it's beneficial to have this 'pseudo-code' version of the 
submit function here. Can we not have the relevant comments in the 
panthor_ioctl_group_submit() function instead? My main concern is that
this is going to get out of sync with the code over time - the function 
names are already not a complete match.

> + */
> +
> +/**
> + * struct panthor_sync_signal - Represent a synchronization object point to attach
> + * our job fence to.
> + *
> + * This structure is here to keep track of fences that are currently bound to
> + * a specific syncobj point.
> + *
> + * At the beginning of a job submission, the fence
> + * is retrieved from the syncobj itself, and can be NULL if no fence was attached
> + * to this point.
> + *
> + * At the end, it points to the fence of the last job that had a
> + * %DRM_PANTHOR_SYNC_OP_SIGNAL on this syncobj.
> + *
> + * With jobs being submitted in batches, the fence might change several times during
> + * the process, allowing one job to wait on a job that's part of the same submission
> + * be appears earlier in the drm_panthor_group_submit::queue_submits array.

s/be/but/

> + */
> +struct panthor_sync_signal {
> +	/** @handle: The syncobj handle. */
> +	u32 handle;
> +
> +	/**
> +	 * @point: The syncobj point.
> +	 *
> +	 * Zero for regular syncobjs, and non-zero for timeline syncobjs.
> +	 */
> +	u64 point;
> +
> +	/**
> +	 * @syncobj: The sync object pointed by @handle.
> +	 */
> +	struct drm_syncobj *syncobj;
> +
> +	/**
> +	 * @chain: Chain object used to link the new fence to an existing
> +	 * timeline syncobj.
> +	 *
> +	 * NULL for regular syncobj, non-NULL for timeline syncobjs.
> +	 */
> +	struct dma_fence_chain *chain;
> +
> +	/**
> +	 * @fence: The fence to assign to the syncobj or syncobj-point.
> +	 */
> +	struct dma_fence *fence;
> +};
> +
> +/**
> + * struct panthor_job_ctx - Job context
> + */
> +struct panthor_job_ctx {
> +	/** @job: The job that is about to be submitted to drm_sched. */
> +	struct drm_sched_job *job;
> +
> +	/** @syncobjs: Array of sync operations. */
> +	struct drm_panthor_sync_op *syncops;
> +
> +	/** @syncop_count: Number of sync operations. */
> +	u32 syncop_count;
> +};
> +
> +/**
> + * struct panthor_submit_ctx - Submission context
> + *
> + * Anything that's related to a submission (%DRM_IOCTL_PANTHOR_VM_BIND or
> + * %DRM_IOCTL_PANTHOR_GROUP_SUBMIT) is kept here, so we can automate the
> + * initialization and cleanup steps.
> + */
> +struct panthor_submit_ctx {
> +	/** @file: DRM file this submission happens on. */
> +	struct drm_file *file;
> +
> +	/**
> +	 * @signal: Array of panthor_sync_signal objects.
> +	 *
> +	 * %DRM_PANTHOR_SYNC_OP_SIGNAL operations will be recorded here,
> +	 * and %DRM_PANTHOR_SYNC_OP_WAIT will first check if an entry
> +	 * matching the syncobj+point exists before calling
> +	 * drm_syncobj_find_fence(). This allows us to describe dependencies
> +	 * existing between jobs that are part of the same batch.
> +	 */
> +	struct xarray signal;

This feels like the wrong data structure - it's simply used as a list. I 
suspect it would be better to simply add a list_head to struct 
panthor_sync_signal.
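
Something like this, perhaps (rough, untested sketch, field names made
up):

	struct panthor_sync_signal {
		/** @node: Used to insert the object in the submit context signal list. */
		struct list_head node;

		/* ... existing fields ... */
	};

	struct panthor_submit_ctx {
		/* ... */

		/** @signals: List of panthor_sync_signal objects. */
		struct list_head signals;
	};

and the lookup in panthor_submit_ctx_search_sync_signal() becomes a
plain list walk:

	list_for_each_entry(sig_sync, &ctx->signals, node) {
		if (handle == sig_sync->handle && point == sig_sync->point)
			return sig_sync;
	}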

> +
> +	/** @jobs: Array of jobs. */
> +	struct panthor_job_ctx *jobs;
> +
> +	/** @job_count: Number of entries in the @jobs array. */
> +	u32 job_count;
> +
> +	/** @exec: drm_exec context used to acquire and prepare resv objects. */
> +	struct drm_exec exec;
> +};
> +
> +#define PANTHOR_SYNC_OP_FLAGS_MASK \
> +	(DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK | DRM_PANTHOR_SYNC_OP_SIGNAL)
> +
> +/**
> + * panthor_check_sync_op() - Check drm_panthor_sync_op fields
> + * @sync_op: The sync operation to check.
> + *
> + * Return: 0 on success, -EINVAL otherwise.
> + */
> +static int
> +panthor_check_sync_op(const struct drm_panthor_sync_op *sync_op)
> +{
> +	u8 handle_type;
> +
> +	if (sync_op->flags & ~PANTHOR_SYNC_OP_FLAGS_MASK)
> +		return -EINVAL;
> +
> +	handle_type = sync_op->flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK;
> +	if (handle_type != DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ &&
> +	    handle_type != DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ)
> +		return -EINVAL;
> +
> +	if (handle_type == DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ &&
> +	    sync_op->timeline_value != 0)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_sync_signal_free() - Release resources and free a panthor_sync_signal object
> + * @sig_sync: Signal object to free.
> + */
> +static void
> +panthor_sync_signal_free(struct panthor_sync_signal *sig_sync)
> +{
> +	if (!sig_sync)
> +		return;
> +
> +	drm_syncobj_put(sig_sync->syncobj);
> +	dma_fence_chain_free(sig_sync->chain);
> +	dma_fence_put(sig_sync->fence);
> +	kfree(sig_sync);
> +}
> +
> +/**
> + * panthor_submit_ctx_add_sync_signal() - Add a signal operation to a submit context
> + * @ctx: Context to add the signal operation to.
> + * @handle: Syncobj handle.
> + * @point: Syncobj point.
> + *
> + * Return: A valid panthor_sync_signal object on success, an ERR_PTR() otherwise.

The only part of the return used is the ERR_PTR() part, so make this a simple int.

> + */
> +static struct panthor_sync_signal *
> +panthor_submit_ctx_add_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
> +{
> +	struct panthor_sync_signal *sig_sync;
> +	struct dma_fence *cur_fence;
> +	int ret;
> +	u32 id;
> +
> +	sig_sync = kzalloc(sizeof(*sig_sync), GFP_KERNEL);
> +	if (!sig_sync)
> +		return ERR_PTR(-ENOMEM);
> +
> +	sig_sync->handle = handle;
> +	sig_sync->point = point;
> +
> +	if (point > 0) {
> +		sig_sync->chain = dma_fence_chain_alloc();
> +		if (!sig_sync->chain) {
> +			ret = -ENOMEM;
> +			goto err_free_sig_sync;
> +		}
> +	}
> +
> +	sig_sync->syncobj = drm_syncobj_find(ctx->file, handle);
> +	if (!sig_sync->syncobj) {
> +		ret = -EINVAL;
> +		goto err_free_sig_sync;
> +	}
> +
> +	/* Retrieve the current fence attached to that point. It's
> +	 * perfectly fine to get a NULL fence here, it just means there's
> +	 * no fence attached to that point yet.
> +	 */
> +	if (!drm_syncobj_find_fence(ctx->file, handle, point, 0, &cur_fence))
> +		sig_sync->fence = cur_fence;
> +
> +	ret = xa_alloc(&ctx->signal, &id, sig_sync, xa_limit_32b, GFP_KERNEL);
> +	if (ret)
> +		goto err_free_sig_sync;
> +
> +	return sig_sync;
> +
> +err_free_sig_sync:
> +	panthor_sync_signal_free(sig_sync);
> +	return ERR_PTR(ret);
> +}
> +
> +/**
> + * panthor_submit_ctx_search_sync_signal() - Search an existing signal operation in a
> + * submit context.
> + * @ctx: Context to search the signal operation in.
> + * @handle: Syncobj handle.
> + * @point: Syncobj point.
> + *
> + * Return: A valid panthor_sync_signal object if found, NULL otherwise.
> + */
> +static struct panthor_sync_signal *
> +panthor_submit_ctx_search_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
> +{
> +	struct panthor_sync_signal *sig_sync;
> +	unsigned long i;
> +
> +	xa_for_each(&ctx->signal, i, sig_sync) {
> +		if (handle == sig_sync->handle && point == sig_sync->point)
> +			return sig_sync;
> +	}
> +
> +	return NULL;
> +}
> +
> +/**
> + * panthor_submit_ctx_add_job() - Add a job to a submit context
> + * @ctx: Context to search the signal operation in.
> + * @idx: Index of the job in the context.
> + * @job: Job to add.
> + * @syncs: Sync operations provided by userspace.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_submit_ctx_add_job(struct panthor_submit_ctx *ctx, u32 idx,
> +			   struct drm_sched_job *job,
> +			   const struct drm_panthor_obj_array *syncs)
> +{
> +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
> +						    struct panthor_device,
> +						    base);
> +	int ret;
> +
> +	if (drm_WARN_ON(&ptdev->base,
> +			idx >= ctx->job_count ||
> +			ctx->jobs[idx].job ||
> +			ctx->jobs[idx].syncops ||
> +			ctx->jobs[idx].syncop_count))
> +		return -EINVAL;
> +
> +	ctx->jobs[idx].job = job;

While the WARN_ON obviously shouldn't happen, this positioning of the 
ctx->jobs[].job assignment means the caller has no idea if the 
assignment has happened. AFAICT in the case of the WARN_ON the job isn't 
cleaned up properly.

The options I can see are to move this line further down (and make the 
caller clean up that one job if this function fails), or to clean up the 
job in the case where the WARN_ON fails.
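
The first option would look roughly like this (untested sketch):

	if (drm_WARN_ON(&ptdev->base,
			idx >= ctx->job_count ||
			ctx->jobs[idx].job ||
			ctx->jobs[idx].syncops ||
			ctx->jobs[idx].syncop_count))
		return -EINVAL;

	ret = PANTHOR_UOBJ_GET_ARRAY(ctx->jobs[idx].syncops, syncs);
	if (ret)
		return ret;

	/* Only take ownership of the job once nothing can fail anymore. */
	ctx->jobs[idx].job = job;
	ctx->jobs[idx].syncop_count = syncs->count;
	return 0;

with the caller dropping its reference on the job whenever this
function returns an error.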

> +
> +	ret = PANTHOR_UOBJ_GET_ARRAY(ctx->jobs[idx].syncops, syncs);
> +	if (ret)
> +		return ret;
> +
> +	ctx->jobs[idx].syncop_count = syncs->count;
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_get_sync_signal() - Search signal operation and add one if none was found.
> + * @ctx: Context to search the signal operation in.
> + * @handle: Syncobj handle.
> + * @point: Syncobj point.
> + *
> + * Return: A valid panthor_sync_signal object on success, an ERR_PTR() otherwise.

As above, no need to return the object just an int error code.

> + */
> +static struct panthor_sync_signal *
> +panthor_submit_ctx_get_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
> +{
> +	struct panthor_sync_signal *sig_sync;
> +
> +	sig_sync = panthor_submit_ctx_search_sync_signal(ctx, handle, point);
> +	if (sig_sync)
> +		return sig_sync;
> +
> +	return panthor_submit_ctx_add_sync_signal(ctx, handle, point);
> +}
> +
> +/**
> + * panthor_submit_ctx_update_job_sync_signal_fences() - Update fences
> + * on the signal operations specified by a job.
> + * @ctx: Context to search the signal operation in.
> + * @job_idx: Index of the job to operate on.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_submit_ctx_update_job_sync_signal_fences(struct panthor_submit_ctx *ctx,
> +						 u32 job_idx)
> +{
> +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
> +						    struct panthor_device,
> +						    base);
> +	struct dma_fence *done_fence = &ctx->jobs[job_idx].job->s_fence->finished;
> +	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
> +	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
> +
> +	for (u32 i = 0; i < sync_op_count; i++) {
> +		struct dma_fence *old_fence;
> +		struct panthor_sync_signal *sig_sync;
> +
> +		if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))
> +			continue;
> +
> +		sig_sync = panthor_submit_ctx_search_sync_signal(ctx, sync_ops[i].handle,
> +								 sync_ops[i].timeline_value);
> +		if (drm_WARN_ON(&ptdev->base, !sig_sync))
> +			return -EINVAL;
> +
> +		old_fence = sig_sync->fence;
> +		sig_sync->fence = dma_fence_get(done_fence);
> +		dma_fence_put(old_fence);
> +
> +		if (drm_WARN_ON(&ptdev->base, !sig_sync->fence))
> +			return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_collect_job_signal_ops() - Iterate over all job signal operations
> + * and add them to the context.
> + * @ctx: Context to search the signal operation in.
> + * @job_idx: Index of the job to operate on.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_submit_ctx_collect_job_signal_ops(struct panthor_submit_ctx *ctx,
> +					  u32 job_idx)
> +{
> +	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
> +	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
> +
> +	for (u32 i = 0; i < sync_op_count; i++) {
> +		struct panthor_sync_signal *sig_sync;
> +		int ret;
> +
> +		if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))
> +			continue;
> +
> +		ret = panthor_check_sync_op(&sync_ops[i]);
> +		if (ret)
> +			return ret;
> +
> +		sig_sync = panthor_submit_ctx_get_sync_signal(ctx,
> +							      sync_ops[i].handle,
> +							      sync_ops[i].timeline_value);
> +		if (IS_ERR(sig_sync))
> +			return PTR_ERR(sig_sync);
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_push_fences() - Iterate over the signal array, and for each entry, push
> + * the currently assigned fence to the associated syncobj.
> + * @ctx: Context to push fences on.
> + *
> + * This is the last step of a submission procedure, and is done once we know the submission
> + * is effective and job fences are guaranteed to be signaled in finite time.
> + */
> +static void
> +panthor_submit_ctx_push_fences(struct panthor_submit_ctx *ctx)
> +{
> +	struct panthor_sync_signal *sig_sync;
> +	unsigned long i;
> +
> +	xa_for_each(&ctx->signal, i, sig_sync) {
> +		if (sig_sync->chain) {
> +			drm_syncobj_add_point(sig_sync->syncobj, sig_sync->chain,
> +					      sig_sync->fence, sig_sync->point);
> +			sig_sync->chain = NULL;
> +		} else {
> +			drm_syncobj_replace_fence(sig_sync->syncobj, sig_sync->fence);
> +		}
> +	}
> +}
> +
> +/**
> + * panthor_submit_ctx_add_sync_deps_to_job() - Add sync wait operations as
> + * job dependencies.
> + * @ctx: Submit context.
> + * @job_idx: Index of the job to operate on.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_submit_ctx_add_sync_deps_to_job(struct panthor_submit_ctx *ctx,
> +					u32 job_idx)
> +{
> +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
> +						    struct panthor_device,
> +						    base);
> +	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
> +	struct drm_sched_job *job = ctx->jobs[job_idx].job;
> +	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
> +	int ret = 0;
> +
> +	if (!sync_op_count)
> +		return 0;

Not needed - the for loop will be skipped in this case anyway.

> +
> +	for (u32 i = 0; i < sync_op_count; i++) {
> +		struct panthor_sync_signal *sig_sync;
> +		struct dma_fence *fence;
> +
> +		if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL)
> +			continue;

NIT: It might be worth having a helper for the operation type. It's a 
little confusing that we have !(flags & SIGNAL) and (flags & SIGNAL) but 
not (flags & WAIT) - obviously looking at the definition shows why. Also 
there'll be a lot of careful refactoring needed if a third operation is 
ever added.
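
E.g. something like this (sketch; in the current uAPI a wait is simply
the absence of the SIGNAL flag, so only one bit is checked):

	static bool panthor_sync_op_is_signal(const struct drm_panthor_sync_op *sync_op)
	{
		return sync_op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL;
	}

	static bool panthor_sync_op_is_wait(const struct drm_panthor_sync_op *sync_op)
	{
		return !(sync_op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL);
	}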

> +
> +		ret = panthor_check_sync_op(&sync_ops[i]);
> +		if (ret)
> +			return ret;
> +
> +		sig_sync = panthor_submit_ctx_search_sync_signal(ctx, sync_ops[i].handle,
> +								 sync_ops[i].timeline_value);
> +		if (sig_sync) {
> +			if (drm_WARN_ON(&ptdev->base, !sig_sync->fence))
> +				return -EINVAL;
> +
> +			fence = dma_fence_get(sig_sync->fence);
> +		} else {
> +			ret = drm_syncobj_find_fence(ctx->file, sync_ops[i].handle,
> +						     sync_ops[i].timeline_value,
> +						     0, &fence);
> +			if (ret)
> +				return ret;
> +		}
> +
> +		ret = drm_sched_job_add_dependency(job, fence);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_collect_jobs_signal_ops() - Collect all signal operations
> + * and add them to the submit context.
> + * @ctx: Submit context.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_submit_ctx_collect_jobs_signal_ops(struct panthor_submit_ctx *ctx)
> +{
> +	for (u32 i = 0; i < ctx->job_count; i++) {
> +		int ret;
> +
> +		ret = panthor_submit_ctx_collect_job_signal_ops(ctx, i);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_add_deps_and_arm_jobs() - Add jobs dependencies and arm jobs
> + * @ctx: Submit context.
> + * @add_resvs_deps: Callback used to add implicit job dependencies.
> + *
> + * Must be called after panthor_submit_ctx_prepare_resvs().
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_submit_ctx_add_deps_and_arm_jobs(struct panthor_submit_ctx *ctx,
> +					 int (*add_resvs_deps)(struct drm_sched_job *))
> +{
> +	for (u32 i = 0; i < ctx->job_count; i++) {
> +		int ret;
> +
> +		ret = add_resvs_deps(ctx->jobs[i].job);
> +		if (ret)
> +			return ret;
> +
> +		ret = panthor_submit_ctx_add_sync_deps_to_job(ctx, i);
> +		if (ret)
> +			return ret;
> +
> +		drm_sched_job_arm(ctx->jobs[i].job);
> +
> +		ret = panthor_submit_ctx_update_job_sync_signal_fences(ctx, i);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_prepare_resvs() - Lock/prepare reservation objects for all jobs.
> + * @ctx: Submit context.
> + * @prep_resvs: Callback used to prepare reservation objects associated to a job.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int
> +panthor_submit_ctx_prepare_resvs(struct panthor_submit_ctx *ctx,
> +				 int (*prep_resvs)(struct drm_exec *, struct drm_sched_job *))
> +{
> +	drm_exec_until_all_locked(&ctx->exec) {
> +		for (u32 i = 0; i < ctx->job_count; i++) {
> +			int ret = prep_resvs(&ctx->exec, ctx->jobs[i].job);
> +
> +			drm_exec_retry_on_contention(&ctx->exec);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_push_jobs() - Push jobs to their scheduling entities.
> + * @ctx: Submit context.
> + * @upd_resvs: Callback used to update reservation objects that were prepared in
> + * panthor_submit_ctx_prepare_resvs().
> + */
> +static void
> +panthor_submit_ctx_push_jobs(struct panthor_submit_ctx *ctx,
> +			     void (*upd_resvs)(struct drm_sched_job *))
> +{
> +	for (u32 i = 0; i < ctx->job_count; i++) {
> +		upd_resvs(ctx->jobs[i].job);
> +		drm_sched_entity_push_job(ctx->jobs[i].job);
> +
> +		/* Job is owned by the scheduler now. */
> +		ctx->jobs[i].job = NULL;
> +	}
> +
> +	panthor_submit_ctx_push_fences(ctx);
> +}
> +
> +/**
> + * panthor_submit_ctx_init() - Initializes a submission context
> + * @ctx: Submit context to initialize.
> + * @file: drm_file this submission happens on.
> + * @job_count: Number of jobs that will be submitted.
> + *
> + * Return: 0 on success, a negative error code otherwise.
> + */
> +static int panthor_submit_ctx_init(struct panthor_submit_ctx *ctx,
> +				   struct drm_file *file, u32 job_count)
> +{
> +	ctx->jobs = kvmalloc_array(job_count, sizeof(*ctx->jobs),
> +				   GFP_KERNEL | __GFP_ZERO);
> +	if (!ctx->jobs)
> +		return -ENOMEM;
> +
> +	ctx->file = file;
> +	ctx->job_count = job_count;
> +	xa_init_flags(&ctx->signal, XA_FLAGS_ALLOC);
> +	drm_exec_init(&ctx->exec, DRM_EXEC_INTERRUPTIBLE_WAIT | DRM_EXEC_IGNORE_DUPLICATES);
> +	return 0;
> +}
> +
> +/**
> + * panthor_submit_ctx_cleanup() - Cleanup a submission context
> + * @ctx: Submit context to cleanup.
> + */
> +static void panthor_submit_ctx_cleanup(struct panthor_submit_ctx *ctx,
> +				       void (*job_put)(struct drm_sched_job *))
> +{
> +	struct panthor_sync_signal *sig_sync;
> +	unsigned long i;
> +
> +	drm_exec_fini(&ctx->exec);
> +
> +	xa_for_each(&ctx->signal, i, sig_sync)
> +		panthor_sync_signal_free(sig_sync);
> +
> +	xa_destroy(&ctx->signal);
> +
> +	for (i = 0; i < ctx->job_count; i++) {
> +		job_put(ctx->jobs[i].job);
> +		kvfree(ctx->jobs[i].syncops);
> +	}
> +
> +	kvfree(ctx->jobs);
> +}
> +
> +static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct drm_file *file)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	struct drm_panthor_dev_query *args = data;
> +
> +	if (!args->pointer) {
> +		switch (args->type) {
> +		case DRM_PANTHOR_DEV_QUERY_GPU_INFO:
> +			args->size = sizeof(ptdev->gpu_info);
> +			return 0;
> +
> +		case DRM_PANTHOR_DEV_QUERY_CSIF_INFO:
> +			args->size = sizeof(ptdev->csif_info);
> +			return 0;
> +
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +	switch (args->type) {
> +	case DRM_PANTHOR_DEV_QUERY_GPU_INFO:
> +		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->gpu_info);
> +
> +	case DRM_PANTHOR_DEV_QUERY_CSIF_INFO:
> +		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->csif_info);
> +
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +#define PANTHOR_VM_CREATE_FLAGS			0
> +
> +static int panthor_ioctl_vm_create(struct drm_device *ddev, void *data,
> +				   struct drm_file *file)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_vm_create *args = data;
> +	u64 kernel_va_start = 0;
> +	int cookie, ret;
> +
> +	if (!drm_dev_enter(ddev, &cookie))
> +		return -ENODEV;
> +
> +	if (args->flags & ~PANTHOR_VM_CREATE_FLAGS) {
> +		ret = -EINVAL;
> +		goto out_dev_exit;
> +	}
> +
> +	if (drm_WARN_ON(ddev, !va_bits) || args->kernel_va_range > (1ull << (va_bits - 1))) {

The check for !va_bits would be better done at probe time. I'd also be 
tempted to move the check on kernel_va_range down to 
panthor_vm_create() as that has to repeat the va_bits calculation.

> +		ret = -EINVAL;
> +		goto out_dev_exit;
> +	}
> +
> +	if (args->kernel_va_range)
> +		kernel_va_start = (1 << (va_bits - 1)) - args->kernel_va_range;

And push the calculation of kernel_va_start down to
panthor_vm_create() as well.
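
Something along these lines, maybe (untested, and assuming
panthor_vm_create() takes the kernel VA range as an argument):

	/* in panthor_vm_create(), roughly */
	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
	u64 kernel_va_start = 0;

	if (kernel_va_range > (1ull << (va_bits - 1)))
		return ERR_PTR(-EINVAL); /* or whatever the error path looks like there */

	if (kernel_va_range)
		kernel_va_start = (1ull << (va_bits - 1)) - kernel_va_range;

with the drm_WARN_ON(!va_bits) check moved to wherever gpu_info gets
filled in at probe time.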

> +
> +	ret = panthor_vm_pool_create_vm(ptdev, pfile->vms,
> +					kernel_va_start, args->kernel_va_range);
> +	if (ret >= 0) {
> +		args->id = ret;
> +		ret = 0;
> +	}
> +
> +out_dev_exit:
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_vm_destroy(struct drm_device *ddev, void *data,
> +				    struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_vm_destroy *args = data;
> +
> +	if (args->pad)
> +		return -EINVAL;
> +
> +	return panthor_vm_pool_destroy_vm(pfile->vms, args->id);
> +}
> +
> +#define PANTHOR_BO_FLAGS		DRM_PANTHOR_BO_NO_MMAP
> +
> +static int panthor_ioctl_bo_create(struct drm_device *ddev, void *data,
> +				   struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct panthor_gem_object *bo;
> +	struct drm_panthor_bo_create *args = data;
> +	struct panthor_vm *vm = NULL;
> +	int cookie, ret;
> +
> +	if (!drm_dev_enter(ddev, &cookie))
> +		return -ENODEV;
> +
> +	if (!args->size || args->pad ||
> +	    (args->flags & ~PANTHOR_BO_FLAGS)) {
> +		ret = -EINVAL;
> +		goto out_dev_exit;
> +	}
> +
> +	if (args->exclusive_vm_id) {
> +		vm = panthor_vm_pool_get_vm(pfile->vms, args->exclusive_vm_id);
> +		if (!vm) {
> +			ret = -EINVAL;
> +			goto out_dev_exit;
> +		}
> +	}
> +
> +	bo = panthor_gem_create_with_handle(file, ddev, vm, args->size, args->flags,
> +					    &args->handle);

As mentioned before, we should have a function which just returns the 
handle; we don't need/want the BO here.
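
i.e. something with a prototype along the lines of (name/arguments are
just a sketch):

	int panthor_gem_create_with_handle(struct drm_file *file,
					   struct drm_device *ddev,
					   struct panthor_vm *exclusive_vm,
					   u64 size, u32 flags,
					   u32 *handle);

so the ioctl only has to deal with the handle and an error code.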

> +
> +	panthor_vm_put(vm);
> +
> +	if (IS_ERR(bo))
> +		ret = PTR_ERR(bo);
> +	else
> +		ret = 0;
> +
> +out_dev_exit:
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_bo_mmap_offset(struct drm_device *ddev, void *data,
> +					struct drm_file *file)
> +{
> +	struct drm_panthor_bo_mmap_offset *args = data;
> +	struct drm_gem_object *obj;
> +	int ret;
> +
> +	if (args->pad)
> +		return -EINVAL;
> +
> +	obj = drm_gem_object_lookup(file, args->handle);
> +	if (!obj)
> +		return -ENOENT;
> +
> +	ret = drm_gem_create_mmap_offset(obj);
> +	if (ret)
> +		goto out;
> +
> +	args->offset = drm_vma_node_offset_addr(&obj->vma_node);
> +
> +out:
> +	drm_gem_object_put(obj);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_group_submit(struct drm_device *ddev, void *data,
> +				      struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_group_submit *args = data;
> +	struct drm_panthor_queue_submit *jobs_args;
> +	struct panthor_submit_ctx ctx;
> +	int ret = 0, cookie;
> +
> +	if (args->pad)
> +		return -EINVAL;
> +
> +	if (!drm_dev_enter(ddev, &cookie))
> +		return -ENODEV;
> +
> +	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->queue_submits);
> +	if (ret)
> +		goto out_dev_exit;
> +
> +	ret = panthor_submit_ctx_init(&ctx, file, args->queue_submits.count);
> +	if (ret)
> +		goto out_free_jobs_args;
> +
> +	for (u32 i = 0; i < args->queue_submits.count; i++) {
> +		const struct drm_panthor_queue_submit *qsubmit = &jobs_args[i];
> +		struct drm_sched_job *job;
> +
> +		job = panthor_job_create(pfile, args->group_handle, qsubmit);
> +		if (IS_ERR(job)) {
> +			ret = PTR_ERR(job);
> +			goto out_cleanup_submit_ctx;
> +		}
> +
> +		ret = panthor_submit_ctx_add_job(&ctx, i, job, &qsubmit->syncs);
> +		if (ret)
> +			goto out_cleanup_submit_ctx;
> +	}
> +
> +	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
> +	if (ret)
> +		goto out_cleanup_submit_ctx;
> +
> +	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_job_prepare_resvs);
> +	if (ret)
> +		goto out_cleanup_submit_ctx;
> +
> +	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_job_add_resvs_deps);
> +	if (ret)
> +		goto out_cleanup_submit_ctx;
> +
> +	/* Nothing can fail after that point. */
> +	panthor_submit_ctx_push_jobs(&ctx, panthor_job_update_resvs);
> +
> +out_cleanup_submit_ctx:
> +	panthor_submit_ctx_cleanup(&ctx, panthor_job_put);
> +
> +out_free_jobs_args:
> +	kvfree(jobs_args);
> +
> +out_dev_exit:
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_group_destroy(struct drm_device *ddev, void *data,
> +				       struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_group_destroy *args = data;
> +
> +	if (args->pad)
> +		return -EINVAL;
> +
> +	return panthor_group_destroy(pfile, args->group_handle);
> +}
> +
> +static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
> +				      struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_group_create *args = data;
> +	struct drm_panthor_queue_create *queue_args;
> +	int ret;
> +
> +	if (!args->queues.count)
> +		return -EINVAL;
> +
> +	ret = PANTHOR_UOBJ_GET_ARRAY(queue_args, &args->queues);
> +	if (ret)
> +		return ret;
> +
> +	ret = panthor_group_create(pfile, args, queue_args);
> +	if (ret >= 0) {
> +		args->group_handle = ret;
> +		ret = 0;
> +	}
> +
> +	kvfree(queue_args);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_group_get_state(struct drm_device *ddev, void *data,
> +					 struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_group_get_state *args = data;
> +
> +	return panthor_group_get_state(pfile, args);
> +}
> +
> +static int panthor_ioctl_tiler_heap_create(struct drm_device *ddev, void *data,
> +					   struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_tiler_heap_create *args = data;
> +	struct panthor_heap_pool *pool;
> +	struct panthor_vm *vm;
> +	int ret;
> +
> +	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
> +	if (!vm)
> +		return -EINVAL;
> +
> +	pool = panthor_vm_get_heap_pool(vm, true);
> +	if (IS_ERR(pool)) {
> +		ret = PTR_ERR(pool);
> +		goto out_put_vm;
> +	}
> +
> +	ret = panthor_heap_create(pool,
> +				  args->initial_chunk_count,
> +				  args->chunk_size,
> +				  args->max_chunks,
> +				  args->target_in_flight,
> +				  &args->tiler_heap_ctx_gpu_va,
> +				  &args->first_heap_chunk_gpu_va);
> +	if (ret < 0)
> +		goto out_put_heap_pool;
> +
> +	/* Heap pools are per-VM. We combine the VM and HEAP id to make
> +	 * a unique heap handle.
> +	 */
> +	args->handle = (args->vm_id << 16) | ret;
> +	ret = 0;
> +
> +out_put_heap_pool:
> +	panthor_heap_pool_put(pool);
> +
> +out_put_vm:
> +	panthor_vm_put(vm);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_tiler_heap_destroy(struct drm_device *ddev, void *data,
> +					    struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_tiler_heap_destroy *args = data;
> +	struct panthor_heap_pool *pool;
> +	struct panthor_vm *vm;
> +	int ret;
> +
> +	if (args->pad)
> +		return -EINVAL;
> +
> +	vm = panthor_vm_pool_get_vm(pfile->vms, args->handle >> 16);
> +	if (!vm)
> +		return -EINVAL;
> +
> +	pool = panthor_vm_get_heap_pool(vm, false);
> +	if (!pool) {
> +		ret = -EINVAL;
> +		goto out_put_vm;
> +	}
> +
> +	ret = panthor_heap_destroy(pool, args->handle & GENMASK(15, 0));
> +	panthor_heap_pool_put(pool);
> +
> +out_put_vm:
> +	panthor_vm_put(vm);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_vm_bind_async(struct drm_device *ddev,
> +				       struct drm_panthor_vm_bind *args,
> +				       struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_vm_bind_op *jobs_args;
> +	struct panthor_submit_ctx ctx;
> +	struct panthor_vm *vm;
> +	int ret = 0;
> +
> +	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
> +	if (!vm)
> +		return -EINVAL;
> +
> +	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->ops);
> +	if (ret)
> +		goto out_put_vm;
> +
> +	ret = panthor_submit_ctx_init(&ctx, file, args->ops.count);
> +	if (ret)
> +		goto out_free_jobs_args;
> +
> +	for (u32 i = 0; i < args->ops.count; i++) {
> +		struct drm_panthor_vm_bind_op *op = &jobs_args[i];
> +		struct drm_sched_job *job;
> +
> +		job = panthor_vm_bind_job_create(file, vm, op);
> +		if (IS_ERR(job)) {
> +			ret = PTR_ERR(job);
> +			goto out_cleanup_submit_ctx;
> +		}
> +
> +		ret = panthor_submit_ctx_add_job(&ctx, i, job, &op->syncs);
> +		if (ret)
> +			goto out_cleanup_submit_ctx;
> +	}
> +
> +	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
> +	if (ret)
> +		goto out_cleanup_submit_ctx;
> +
> +	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_vm_bind_job_prepare_resvs);
> +	if (ret)
> +		goto out_cleanup_submit_ctx;
> +
> +	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_vm_bind_job_add_resvs_deps);
> +	if (ret)
> +		goto out_cleanup_submit_ctx;
> +
> +	/* Nothing can fail after that point. */
> +	panthor_submit_ctx_push_jobs(&ctx, panthor_vm_bind_job_update_resvs);
> +
> +out_cleanup_submit_ctx:
> +	panthor_submit_ctx_cleanup(&ctx, panthor_vm_bind_job_put);
> +
> +out_free_jobs_args:
> +	kvfree(jobs_args);
> +
> +out_put_vm:
> +	panthor_vm_put(vm);
> +	return ret;
> +}
> +
> +static int panthor_ioctl_vm_bind_sync(struct drm_device *ddev,
> +				      struct drm_panthor_vm_bind *args,
> +				      struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_vm_bind_op *jobs_args;
> +	struct panthor_vm *vm;
> +	int ret;
> +
> +	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
> +	if (!vm)
> +		return -EINVAL;
> +
> +	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->ops);
> +	if (ret)
> +		goto out_put_vm;
> +
> +	for (u32 i = 0; i < args->ops.count; i++) {
> +		ret = panthor_vm_bind_exec_sync_op(file, vm, &jobs_args[i]);
> +		if (ret) {
> +			/* Update ops.count so the user knows where things failed. */

It might be worth mentioning this in the UAPI header as the array count
wouldn't usually be modified.
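
Maybe something like this in the drm_panthor_vm_bind kerneldoc (wording
is just a sketch):

	/**
	 * @ops: Array of struct drm_panthor_vm_bind_op.
	 *
	 * For synchronous binds, ops.count is updated on failure to the
	 * index of the operation that failed, so userspace knows how far
	 * the kernel got.
	 */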

> +			args->ops.count = i;
> +			break;
> +		}
> +	}
> +
> +	kvfree(jobs_args);
> +
> +out_put_vm:
> +	panthor_vm_put(vm);
> +	return ret;
> +}
> +
> +#define PANTHOR_VM_BIND_FLAGS DRM_PANTHOR_VM_BIND_ASYNC
> +
> +static int panthor_ioctl_vm_bind(struct drm_device *ddev, void *data,
> +				 struct drm_file *file)
> +{
> +	struct drm_panthor_vm_bind *args = data;
> +	int cookie, ret;
> +
> +	if (!drm_dev_enter(ddev, &cookie))
> +		return -ENODEV;
> +
> +	if (args->flags & DRM_PANTHOR_VM_BIND_ASYNC)
> +		ret = panthor_ioctl_vm_bind_async(ddev, args, file);
> +	else
> +		ret = panthor_ioctl_vm_bind_sync(ddev, args, file);
> +
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +static int
> +panthor_open(struct drm_device *ddev, struct drm_file *file)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	struct panthor_file *pfile;
> +	int ret;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -EINVAL;
> +
> +	pfile = kzalloc(sizeof(*pfile), GFP_KERNEL);
> +	if (!pfile) {
> +		ret = -ENOMEM;
> +		goto err_put_mod;
> +	}
> +
> +	pfile->ptdev = ptdev;
> +
> +	ret = panthor_vm_pool_create(pfile);
> +	if (ret)
> +		goto err_free_file;
> +
> +	ret = panthor_group_pool_create(pfile);
> +	if (ret)
> +		goto err_destroy_vm_pool;
> +
> +	file->driver_priv = pfile;
> +	return 0;
> +
> +err_destroy_vm_pool:
> +	panthor_vm_pool_destroy(pfile);
> +
> +err_free_file:
> +	kfree(pfile);
> +
> +err_put_mod:
> +	module_put(THIS_MODULE);
> +	return ret;
> +}
> +
> +static void
> +panthor_postclose(struct drm_device *ddev, struct drm_file *file)
> +{
> +	struct panthor_file *pfile = file->driver_priv;
> +
> +	panthor_group_pool_destroy(pfile);
> +	panthor_vm_pool_destroy(pfile);
> +
> +	kfree(pfile);
> +	module_put(THIS_MODULE);
> +}
> +
> +static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
> +#define PANTHOR_IOCTL(n, func, flags) \
> +	DRM_IOCTL_DEF_DRV(PANTHOR_##n, panthor_ioctl_##func, flags)
> +
> +	PANTHOR_IOCTL(DEV_QUERY, dev_query, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(VM_CREATE, vm_create, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(VM_DESTROY, vm_destroy, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(VM_BIND, vm_bind, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(BO_CREATE, bo_create, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(BO_MMAP_OFFSET, bo_mmap_offset, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(GROUP_CREATE, group_create, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(GROUP_DESTROY, group_destroy, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(GROUP_GET_STATE, group_get_state, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
> +	PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
> +};
> +
> +static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +	struct drm_file *file = filp->private_data;
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct panthor_device *ptdev = pfile->ptdev;
> +	int ret, cookie;
> +
> +	if (!drm_dev_enter(file->minor->dev, &cookie))
> +		return -ENODEV;
> +
> +	if (vma->vm_pgoff >= (DRM_PANTHOR_USER_MMIO_OFFSET >> PAGE_SHIFT))
> +		ret = panthor_device_mmap_io(ptdev, vma);
> +	else
> +		ret = drm_gem_mmap(filp, vma);
> +
> +	drm_dev_exit(cookie);
> +	return ret;
> +}
> +
> +static const struct file_operations panthor_drm_driver_fops = {
> +	.open = drm_open,
> +	.release = drm_release,
> +	.unlocked_ioctl = drm_ioctl,
> +	.compat_ioctl = drm_compat_ioctl,
> +	.poll = drm_poll,
> +	.read = drm_read,
> +	.llseek = noop_llseek,
> +	.mmap = panthor_mmap,
> +};
> +
> +#ifdef CONFIG_DEBUG_FS
> +void panthor_debugfs_init(struct drm_minor *minor)
> +{
> +	panthor_mmu_debugfs_init(minor);
> +}
> +#endif
> +
> +/*
> + * PanCSF driver version:
> + * - 1.0 - initial interface
> + */
> +static const struct drm_driver panthor_drm_driver = {
> +	.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
> +			   DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
> +	.open = panthor_open,
> +	.postclose = panthor_postclose,
> +	.ioctls = panthor_drm_driver_ioctls,
> +	.num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
> +	.fops = &panthor_drm_driver_fops,
> +	.name = "panthor",
> +	.desc = "Panthor DRM driver",
> +	.date = "20230801",
> +	.major = 1,
> +	.minor = 0,
> +
> +	.gem_create_object = panthor_gem_create_object,
> +	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> +#ifdef CONFIG_DEBUG_FS
> +	.debugfs_init = panthor_debugfs_init,
> +#endif
> +};
> +
> +static int panthor_probe(struct platform_device *pdev)
> +{
> +	struct panthor_device *ptdev;
> +	int ret;
> +
> +	ptdev = devm_drm_dev_alloc(&pdev->dev, &panthor_drm_driver,
> +				   struct panthor_device, base);
> +	if (!ptdev)
> +		return -ENOMEM;
> +
> +	platform_set_drvdata(pdev, ptdev);
> +
> +	ret = panthor_device_init(ptdev);
> +	if (ret)
> +		return ret;
> +
> +	return drm_dev_register(&ptdev->base, 0);
> +}
> +
> +static void panthor_remove(struct platform_device *pdev)
> +{
> +	struct panthor_device *ptdev = platform_get_drvdata(pdev);
> +
> +	panthor_device_unplug(ptdev);
> +}
> +
> +static const struct of_device_id dt_match[] = {
> +	{ .compatible = "rockchip,rk3588-mali" },
> +	{ .compatible = "arm,mali-valhall-csf" },
> +	{}
> +};
> +MODULE_DEVICE_TABLE(of, dt_match);
> +
> +static DEFINE_RUNTIME_DEV_PM_OPS(panthor_pm_ops,
> +				 panthor_device_suspend,
> +				 panthor_device_resume,
> +				 NULL);
> +
> +static struct platform_driver panthor_driver = {
> +	.probe = panthor_probe,
> +	.remove_new = panthor_remove,
> +	.driver = {
> +		.name = "panthor",
> +		.pm = &panthor_pm_ops,
> +		.of_match_table = dt_match,
> +	},
> +};
> +
> +/**
> + * @cleanup_wq: Workqueue used to cleanup stuff.
> + *
> + * We create a dedicated workqueue so we can drain on unplug and
> + * make sure all resources are freed before the module is unloaded.
> + */
> +struct workqueue_struct *panthor_cleanup_wq;
> +
> +static int __init panthor_init(void)
> +{
> +	int ret;
> +
> +	ret = panthor_mmu_pt_cache_init();
> +	if (ret)
> +		return ret;
> +
> +	panthor_cleanup_wq = alloc_workqueue("panthor-cleanup", WQ_UNBOUND, 0);
> +	if (!panthor_cleanup_wq) {
> +		pr_err("panthor: Failed to allocate the workqueues");
> +		ret = -ENOMEM;
> +		goto err_mmu_pt_cache_fini;
> +	}
> +
> +	ret = platform_driver_register(&panthor_driver);
> +	if (ret)
> +		goto err_destroy_cleanup_wq;
> +
> +	return ret;
> +
> +err_mmu_pt_cache_fini:
> +	panthor_mmu_pt_cache_fini();
> +
> +err_destroy_cleanup_wq:
> +	destroy_workqueue(panthor_cleanup_wq);

This cleanup looks backwards.
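
The gotos themselves look fine; it's just the label blocks that need
reordering, e.g. (untested):

	ret = platform_driver_register(&panthor_driver);
	if (ret)
		goto err_destroy_cleanup_wq;

	return 0;

err_destroy_cleanup_wq:
	destroy_workqueue(panthor_cleanup_wq);

err_mmu_pt_cache_fini:
	panthor_mmu_pt_cache_fini();
	return ret;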

Steve

> +	return ret;
> +}
> +module_init(panthor_init);
> +
> +static void __exit panthor_exit(void)
> +{
> +	platform_driver_unregister(&panthor_driver);
> +	destroy_workqueue(panthor_cleanup_wq);
> +	panthor_mmu_pt_cache_fini();
> +}
> +module_exit(panthor_exit);
> +
> +MODULE_AUTHOR("Panthor Project Developers");
> +MODULE_DESCRIPTION("Panthor DRM Driver");
> +MODULE_LICENSE("Dual MIT/GPL");


^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-09 16:53 ` [PATCH v2 13/15] drm/panthor: Allow driver compilation Boris Brezillon
  2023-08-11 16:35   ` Robin Murphy
@ 2023-08-21 12:47   ` Steven Price
  1 sibling, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-21 12:47 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Now that all blocks are available, we can add/update Kconfig/Makefile
> files to allow compilation.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Add new dependencies on GPUVA and DRM_SCHED
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/Kconfig          |  2 ++
>  drivers/gpu/drm/Makefile         |  1 +
>  drivers/gpu/drm/panthor/Kconfig  | 16 ++++++++++++++++
>  drivers/gpu/drm/panthor/Makefile | 15 +++++++++++++++
>  4 files changed, 34 insertions(+)
>  create mode 100644 drivers/gpu/drm/panthor/Kconfig
>  create mode 100644 drivers/gpu/drm/panthor/Makefile
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 2a44b9419d4d..bddfbdb2ffee 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -358,6 +358,8 @@ source "drivers/gpu/drm/lima/Kconfig"
>  
>  source "drivers/gpu/drm/panfrost/Kconfig"
>  
> +source "drivers/gpu/drm/panthor/Kconfig"
> +
>  source "drivers/gpu/drm/aspeed/Kconfig"
>  
>  source "drivers/gpu/drm/mcde/Kconfig"
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index 215e78e79125..0a260727505f 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -188,6 +188,7 @@ obj-$(CONFIG_DRM_TVE200) += tve200/
>  obj-$(CONFIG_DRM_XEN) += xen/
>  obj-$(CONFIG_DRM_VBOXVIDEO) += vboxvideo/
>  obj-$(CONFIG_DRM_LIMA)  += lima/
> +obj-$(CONFIG_DRM_PANTHOR) += panthor/
>  obj-$(CONFIG_DRM_PANFROST) += panfrost/

NIT: Here panthor is before panfrost; above (in the Kconfig 'source'
lines) they are the other way around. Although both lists seem to be in
an arbitrary order.

>  obj-$(CONFIG_DRM_ASPEED_GFX) += aspeed/
>  obj-$(CONFIG_DRM_MCDE) += mcde/
> diff --git a/drivers/gpu/drm/panthor/Kconfig b/drivers/gpu/drm/panthor/Kconfig
> new file mode 100644
> index 000000000000..a9d17b1bbb75
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/Kconfig
> @@ -0,0 +1,16 @@
> +# SPDX-License-Identifier: GPL-2.0 or MIT
> +
> +config DRM_PANTHOR
> +	tristate "Panthor (DRM support for ARM Mali CSF-based GPUs)"
> +	depends on DRM
> +	depends on ARM || ARM64 || (COMPILE_TEST && !GENERIC_ATOMIC64)

This is technically wrong. There are ARM configurations that do select
GENERIC_ATOMIC64 and will cause the "select IOMMU_IO_PGTABLE_LPAE" to
conflict with the depends of that option.

Splitting it onto two lines, like panfrost does, matches the iommu
config and I think is easier to read:

        depends on ARM || ARM64 || COMPILE_TEST
        depends on !GENERIC_ATOMIC64    # for IOMMU_IO_PGTABLE_LPAE

Steve

> +	depends on MMU
> +	select DRM_EXEC
> +	select DRM_SCHED
> +	select IOMMU_SUPPORT
> +	select IOMMU_IO_PGTABLE_LPAE
> +	select DRM_GEM_SHMEM_HELPER
> +	select PM_DEVFREQ
> +	select DEVFREQ_GOV_SIMPLE_ONDEMAND
> +	help
> +	  DRM driver for ARM Mali CSF-based GPUs.

It might be worth expanding this to mention Valhall and/or Mali-Gxxx.
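
Maybe something along the lines of (wording very much up for debate):

	  help
	    DRM driver for ARM Mali CSF-based GPUs, i.e. Valhall GPUs with a
	    Command Stream Frontend (Mali-Gxxx three-digit models and
	    Immortalis).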

Steve

> diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
> new file mode 100644
> index 000000000000..64193a484879
> --- /dev/null
> +++ b/drivers/gpu/drm/panthor/Makefile
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0 or MIT
> +
> +panthor-y := \
> +	panthor_devfreq.o \
> +	panthor_device.o \
> +	panthor_drv.o \
> +	panthor_gem.o \
> +	panthor_gpu.o \
> +	panthor_heap.o \
> +	panthor_heap.o \
> +	panthor_fw.o \
> +	panthor_mmu.o \
> +	panthor_sched.o
> +
> +obj-$(CONFIG_DRM_PANTHOR) += panthor.o


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs
  2023-08-10 15:44   ` Boris Brezillon
@ 2023-08-21 14:01     ` Rob Herring
  0 siblings, 0 replies; 93+ messages in thread
From: Rob Herring @ 2023-08-21 14:01 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Steven Price, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On Thu, Aug 10, 2023 at 10:44 AM Boris Brezillon
<boris.brezillon@collabora.com> wrote:
>
> Hello Rob,
>
> On Wed, 9 Aug 2023 14:22:59 -0600
> Rob Herring <robh@kernel.org> wrote:
>
> > On Wed, Aug 9, 2023 at 10:53 AM Boris Brezillon
> > <boris.brezillon@collabora.com> wrote:
> > >
> > > I tried to Cc anyone that was involved in any development of the code
> > > I picked from panfrost, so they can acknowledge the GPL2 -> MIT+GPL2
> > > change. If I missed someone, please let me know.
> >
> > Panfrost was largely based on etnaviv, vc4, v3d, and msm. Those are
> > all GPL2 (or 2+) only.
>
> Uh, I must have missed some copyright headers then. Note that not all
> panfrost files were taken as a base for panthor:
>
> - Makefile/Kconfig. I honestly hope there's nothing copyright-able in
>   there, given there's no other way to define your driver and
>   compilation rules.
> - panthor_device.{c,h} copied from panfrost_device.{c,h} with quite a
>   few modifications in the process. This one has your copyright, and
>   Marty's one.
> - a tiny part of panthor_drv.c was copied from panfrost_drv.c, but let's
>   be honest, the part that was copied (ioctl wrappers, mostly), can't
>   really be done differently. This one has your copyright, Marty's one,
>   and Collabora's one.
> - panthor_regs.h copied from panfrost_regs.h. This one has your
>   copyright, Marty's one and Arm's one (definitions extracted from
>   kbase). But again, I'm not even sure register definitions are
>   copyright-able, given there's no other way to define them. If that
>   makes a difference, I changed the prefix, and dropped definition that
>   do not exist on CSF HW.
> - panthor_gpu.{c,h} copied from panfrost_gpu.{c,h}. These files have
>   your copyright, Marty's one, and Collabora's one.
> - panthor_{gem,mmu}.{c,h} copied from panfrost_{gem,mmu}.{c,h}. Those
>   ones have your copyright only.
> - panthor_devfreq.{c,h} copied from panfrost_devfreq.{c,h}. Collabora's
>   copyright only.
> - panthor_{heap,fw,sched}.{c,h}. Those are brand new files, that were
>   written from scratch.
>
> I also git-blamed the lines I copied to Cc any contributors to the
> above files. I might have omitted someone, but I did my best to
> try and spot people that have a say in this decision.
>
> > How is relicensing that code okay?
>
> Sorry, the copyright headers of the files I copied didn't mention that
> :-/. If that's an omission, it would be good to have the headers updated
> to reflect the actual chain of copyrights.

Yes, we probably should make it more explicit, though at this point it
would be fairly vague in terms of the exact sources. IMO, it should be
assumed by default that any driver is a derived work. No one writes a new
driver from scratch (unless they are really actively trying to avoid
being a derivative work). Then the question is which driver(s) were the
source. I think it is safe to say no one copies the big 3 (Intel, AMD,
NVIDIA) or the para-virt drivers, as those are the MIT-licensed ones. The
ones left are pretty much *all* GPL.

> > Also,
> > panfrost depends on drm_gem_shmem_helper.c (at least) which is GPL2.
> > Does that get re-implemented in a MIT licensed environment?
>
> Not only drm_gem_shmem, but drm_gpuva_mgr and drm_sched too. And yes,
> any helper function/lib that's not GPL+MIT will have to be
> re-implemented or replaced by something else.
>
> >
> > Maybe some drivers are enough of a silo to get away with MIT
> > licensing, but I wouldn't be comfortable claiming it.
>
> Well, yes, re-using the code as-is is almost impossible, unless
> someone rewrites the various GPL components we depend on. But if
> someone wants to pick, say, the scheduling logic, and replace drm_sched
> by something else, they can. Not saying it's worth it, just saying it's
> possible.

Sure, it is possible. Seems like reimplementing all that would be more
work than the driver. Maybe the BSDs already have?

Rob

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 05/15] drm/panthor: Add the GPU logical block
  2023-08-14 10:54   ` Steven Price
@ 2023-08-21 16:09     ` Robin Murphy
  2023-08-23  8:48       ` Steven Price
  2023-08-29 14:42       ` Boris Brezillon
  2023-08-29 14:40     ` Boris Brezillon
  1 sibling, 2 replies; 93+ messages in thread
From: Robin Murphy @ 2023-08-21 16:09 UTC (permalink / raw)
  To: Steven Price, Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Faith Ekstrand

On 2023-08-14 11:54, Steven Price wrote:
[...]
>> +/**
>> + * panthor_gpu_l2_power_on() - Power-on the L2-cache
>> + * @ptdev: Device.
>> + *
>> + * Return: 0 on success, a negative error code otherwise.
>> + */
>> +int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
>> +{
>> +	u64 core_mask = U64_MAX;
>> +
>> +	if (ptdev->gpu_info.l2_present != 1) {
>> +		/*
>> +		 * Only support one core group now.
>> +		 * ~(l2_present - 1) unsets all bits in l2_present except
>> +		 * the bottom bit. (l2_present - 2) has all the bits in
>> +		 * the first core group set. AND them together to generate
>> +		 * a mask of cores in the first core group.
>> +		 */
>> +		core_mask = ~(ptdev->gpu_info.l2_present - 1) &
>> +			     (ptdev->gpu_info.l2_present - 2);
>> +		drm_info_once(&ptdev->base, "using only 1st core group (%lu cores from %lu)\n",
>> +			      hweight64(core_mask),
>> +			      hweight64(ptdev->gpu_info.shader_present));
> 
> I'm not sure what the point of this complexity is. This boils down to
> the equivalent of:
> 
> 	if (ptdev->gpu_info.l2_present != 1)
> 		core_mask = 1;

Hmm, that doesn't look right - the idiom here should be to set all bits 
of the output below the *second* set bit of the input, i.e. 0x11 -> 
0x0f. However since panthor is (somewhat ironically) unlikely to ever 
run on T628, and everything newer should pretend to have a single L2 
because software-managed coherency is a terrible idea, I would agree 
that ultimately it does all seem a bit pointless.
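
(Concretely, with l2_present = 0x11 the intended result is
~(0x11 - 1) & (0x11 - 2) = ~0x10 & 0x0f = 0x0f, i.e. a mask of all the
cores behind the first L2.)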

> If we were doing shader-core power management manually (like on pre-CSF
> GPUs, rather than letting the firmware control it) then the computed
> core_mask would be useful. So I guess it comes down to the
> drm_info_once() output and counting the cores - which is nice to have
> but it took me some time figuring out what was going on here.

As for the complexity, I'd suggest you have some choice words with 
the guy who originally suggested that code[1] ;)

Cheers,
Robin.

[1] 
https://lore.kernel.org/dri-devel/b009b4c4-0396-58c2-7779-30c844f36f04@arm.com/

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-14 11:18         ` Steven Price
@ 2023-08-21 17:56           ` Robin Murphy
  2023-08-23  9:17             ` Steven Price
  2023-08-29 12:51             ` Boris Brezillon
  0 siblings, 2 replies; 93+ messages in thread
From: Robin Murphy @ 2023-08-21 17:56 UTC (permalink / raw)
  To: Steven Price, Daniel Stone, Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Faith Ekstrand

On 2023-08-14 12:18, Steven Price wrote:
> On 11/08/2023 20:26, Robin Murphy wrote:
>> On 2023-08-11 17:56, Daniel Stone wrote:
>>> Hi,
>>>
>>> On 11/08/2023 17:35, Robin Murphy wrote:
>>>> On 2023-08-09 17:53, Boris Brezillon wrote:
>>>>> +obj-$(CONFIG_DRM_PANTHOR) += panthor.o
>>>>
>>>> FWIW I still think it would be nice to have a minor
>>>> directory/Kconfig/Makefile reshuffle and a trivial bit of extra
>>>> registration glue to build both drivers into a single module. It
>>>> seems like it could be a perpetual source of confusion to end users
>>>> where Mesa "panfrost" is the right option but kernel "panfrost" is
>>>> the wrong one. Especially when pretty much every other GPU driver is
>>>> also just one big top-level module to load for many different
>>>> generations of hardware. Plus it would mean that if someone did want
>>>> to have a go at deduplicating the resource-wrangling boilerplate for
>>>> OPPs etc. in future, there's more chance of being able to do so
>>>> meaningfully.
>>>
>>> It might be nice to point it out, but to be fair Intel and AMD both
>>> have two (or more) drivers, as does Broadcom/RPi. As does, err ... Mali.
>>
>> Indeed, I didn't mean to imply that I'm not aware that e.g. gma500 is to
>> i915 what lima is to panfrost. It was more that unlike the others where
>> there's a pretty clear line in the sand between "driver for old
>> hardware" and "driver for the majority of recent hardware", this one
>> happens to fall splat in the middle of the current major generation such
>> that panfrost is the correct module for Mali Bifrost but also the wrong
>> one for Mali Bifrost... :/
> 
> Well panfrost.ko is the correct module for all Bifrost ;) It's Valhall
> that's the confusing one.

Bah, you see? If even developers sufficiently involved to be CCed on the 
patches can't remember what's what, what hope does Joe User have? :D

> I would hope that for most users they can just build both panfrost and
> panthor and everything will "Just Work (tm)". I'm not sure how much
> users are actually aware of the architecture family of their GPU.
> 
> I think at the moment (until marketing mess it up) there's also the
> 'simple' rule:
> 
> * Mali T* is Midgard and supported by panfrost.ko
> * Mali Gxx (two digits) is Bifrost or first-generation Valhall and
> supported by panfrost.ko
> * Mali Gxxx (three digits) is Valhall CSF and supported by panthor.
> 
> (and Immortalis is always three digits and Valhall CSF).

With brain now engaged, indeed that sounds right. However if the 
expectation is that most people would steer clear even of marketing's 
alphabet soup and just enable everything, that could also be seen as 
somewhat of an argument for just putting it all together and not 
bothering with a separate option.

>>> I can see the point, but otoh if someone's managed to build all the
>>> right regulator/clock/etc modules to get a working system, they'll
>>> probably manage to figure the GPU side out?
>>
>> Maybe; either way I guess it's not really my concern, since I'm the only
>> user that *I* have to support, and I do already understand it. From the
>> upstream perspective I mostly just want to hold on to the hope of not
>> having to write my io-pgtable bugs twice over if at all possible :)
> 
> I agree it would be nice to merge some of the common code, I'm hoping
> this is something that might be possible in the future. But at the
> moment the focus is on trying to get basic support for the new GPUs
> without the danger of regressing the old GPUs.

Yup, I get that, it's just that the niggling concern I have is whether what 
we do at the moment might paint us into a corner with respect to what 
we're then able to change later; I know KConfig symbols are explicitly 
not ABI, but module names and driver names might be more of a grey area.

> And, to be honest, for a fair bit of the common code in
> panfrost/panthor it's common to a few other drivers too. So the correct
> answer might well be to try to add more generic helpers (devfreq,
> clocks, power domains all spring to mind - there's a lot of boiler plate
> and nothing very special about Mali).

That much is true, however I guess there's also stuff like perf counter 
support which is less likely to be DRM-level generic but perhaps still 
sufficiently similar between JM and CSF. The main thing I don't know, 
and thus feel compelled to poke at, is whether there's any possibility 
that once the new UAPI is mature, it might eventually become preferable 
to move Job Manager support over to some subset of that rather than 
maintain two whole UAPIs in parallel (particularly at the Mesa end). My 
(limited) understanding is that all the BO-wrangling and MMU code is 
primarily different here for the sake of supporting new shiny UAPI 
features, not because of anything inherent to CSF itself (other than CSF 
being the thing which makes supporting said features feasible). If 
that's a preposterous idea and absolutely never ever going to be 
realistic, then fine, but if not, then it feels like the kind of thing 
that my all-too-great experience of technical debt and bad short-term 
decisions tells me is worth planning around from the very start.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 05/15] drm/panthor: Add the GPU logical block
  2023-08-21 16:09     ` Robin Murphy
@ 2023-08-23  8:48       ` Steven Price
  2023-08-29 14:42       ` Boris Brezillon
  1 sibling, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-23  8:48 UTC (permalink / raw)
  To: Robin Murphy, Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Faith Ekstrand

On 21/08/2023 17:09, Robin Murphy wrote:
> On 2023-08-14 11:54, Steven Price wrote:
> [...]
>>> +/**
>>> + * panthor_gpu_l2_power_on() - Power-on the L2-cache
>>> + * @ptdev: Device.
>>> + *
>>> + * Return: 0 on success, a negative error code otherwise.
>>> + */
>>> +int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
>>> +{
>>> +    u64 core_mask = U64_MAX;
>>> +
>>> +    if (ptdev->gpu_info.l2_present != 1) {
>>> +        /*
>>> +         * Only support one core group now.
>>> +         * ~(l2_present - 1) unsets all bits in l2_present except
>>> +         * the bottom bit. (l2_present - 2) has all the bits in
>>> +         * the first core group set. AND them together to generate
>>> +         * a mask of cores in the first core group.
>>> +         */
>>> +        core_mask = ~(ptdev->gpu_info.l2_present - 1) &
>>> +                 (ptdev->gpu_info.l2_present - 2);
>>> +        drm_info_once(&ptdev->base, "using only 1st core group (%lu
>>> cores from %lu)\n",
>>> +                  hweight64(core_mask),
>>> +                  hweight64(ptdev->gpu_info.shader_present));
>>
>> I'm not sure what the point of this complexity is. This boils down to
>> the equivalent of:
>>
>>     if (ptdev->gpu_info.l2_present != 1)
>>         core_mask = 1;
> 
> Hmm, that doesn't look right - the idiom here should be to set all bits
> of the output below the *second* set bit of the input, i.e. 0x11 ->
> 0x0f. However since panthor is (somewhat ironically) unlikely to ever
> run on T628, and everything newer should pretend to have a single L2
> because software-managed coherency is a terrible idea, I would agree
> that ultimately it does all seem a bit pointless.

Sorry I should have been clearer here. Other than the message printed 
(using drm_info_once) the only use of core_mask in this function is in 
the call to panthor_gpu_power_on:

+	return panthor_gpu_power_on(ptdev, L2,
+				    ptdev->gpu_info.l2_present & core_mask,
+				    20000);

Here the core_mask variable is ANDed with l2_present. So using the value 
1 is equivalent to the actual core mask which is being calculated. 
Obviously '1' isn't likely to be the real core mask (it's an "L2 mask").

Mostly it just seemed odd to calculate the core_mask and then 
effectively throw the value away.

>> If we were doing shader-core power management manually (like on pre-CSF
>> GPUs, rather than letting the firmware control it) then the computed
>> core_mask would be useful. So I guess it comes down to the
>> drm_info_once() output and counting the cores - which is nice to have
>> but it took me some time figuring out what was going on here.
> As for the complexity, I'd suggest you can have some choice words with
> the guy who originally suggested that code[1] ;)

I do often have problems with the code that guy wrote ;)

Steve

> 
> Cheers,
> Robin.
> 
> [1]
> https://lore.kernel.org/dri-devel/b009b4c4-0396-58c2-7779-30c844f36f04@arm.com/


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-21 17:56           ` Robin Murphy
@ 2023-08-23  9:17             ` Steven Price
  2023-08-29 12:51             ` Boris Brezillon
  1 sibling, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-23  9:17 UTC (permalink / raw)
  To: Robin Murphy, Daniel Stone, Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Faith Ekstrand

On 21/08/2023 18:56, Robin Murphy wrote:
> On 2023-08-14 12:18, Steven Price wrote:
>> On 11/08/2023 20:26, Robin Murphy wrote:
>>> On 2023-08-11 17:56, Daniel Stone wrote:
>>>> Hi,
>>>>
>>>> On 11/08/2023 17:35, Robin Murphy wrote:
>>>>> On 2023-08-09 17:53, Boris Brezillon wrote:
>>>>>> +obj-$(CONFIG_DRM_PANTHOR) += panthor.o
>>>>>
>>>>> FWIW I still think it would be nice to have a minor
>>>>> directory/Kconfig/Makefile reshuffle and a trivial bit of extra
>>>>> registration glue to build both drivers into a single module. It
>>>>> seems like it could be a perpetual source of confusion to end users
>>>>> where Mesa "panfrost" is the right option but kernel "panfrost" is
>>>>> the wrong one. Especially when pretty much every other GPU driver is
>>>>> also just one big top-level module to load for many different
>>>>> generations of hardware. Plus it would mean that if someone did want
>>>>> to have a go at deduplicating the resource-wrangling boilerplate for
>>>>> OPPs etc. in future, there's more chance of being able to do so
>>>>> meaningfully.
>>>>
>>>> It might be nice to point it out, but to be fair Intel and AMD both
>>>> have two (or more) drivers, as does Broadcom/RPi. As does, err ...
>>>> Mali.
>>>
>>> Indeed, I didn't mean to imply that I'm not aware that e.g. gma500 is to
>>> i915 what lima is to panfrost. It was more that unlike the others where
>>> there's a pretty clear line in the sand between "driver for old
>>> hardware" and "driver for the majority of recent hardware", this one
>>> happens to fall splat in the middle of the current major generation such
>>> that panfrost is the correct module for Mali Bifrost but also the wrong
>>> one for Mali Bifrost... :/
>>
>> Well panfrost.ko is the correct module for all Bifrost ;) It's Valhall
>> that's the confusing one.
> 
> Bah, you see? If even developers sufficiently involved to be CCed on the
> patches can't remember what's what, what hope does Joe User have? :D
> 
>> I would hope that for most users they can just build both panfrost and
>> panthor and everything will "Just Work (tm)". I'm not sure how much
>> users are actually aware of the architecture family of their GPU.
>>
>> I think at the moment (until marketing mess it up) there's also the
>> 'simple' rule:
>>
>> * Mali T* is Midgard and supported by panfrost.ko
>> * Mali Gxx (two digits) is Bifrost or first-generation Valhall and
>> supported by panfrost.ko
>> * Mali Gxxx (three digits) is Valhall CSF and supported by panthor.
>>
>> (and Immortalis is always three digits and Valhall CSF).
> 
> With brain now engaged, indeed that sounds right. However if the
> expectation is that most people would steer clear even of marketing's
> alphabet soup and just enable everything, that could also be seen as
> somewhat of an argument for just putting it all together and not
> bothering with a separate option.
> 
>>>> I can see the point, but otoh if someone's managed to build all the
>>>> right regulator/clock/etc modules to get a working system, they'll
>>>> probably manage to figure the GPU side out?
>>>
>>> Maybe; either way I guess it's not really my concern, since I'm the only
>>> user that *I* have to support, and I do already understand it. From the
>>> upstream perspective I mostly just want to hold on to the hope of not
>>> having to write my io-pgtable bugs twice over if at all possible :)
>>
>> I agree it would be nice to merge some of the common code, I'm hoping
>> this is something that might be possible in the future. But at the
>> moment the focus is on trying to get basic support for the new GPUs
>> without the danger of regressing the old GPUs.
> 
> Yup, I get that, it's just the niggling concern I have is whether what
> we do at the moment might paint us into a corner with respect to what
> we're then able to change later; I know KConfig symbols are explicitly
> not ABI, but module names and driver names might be more of a grey area.
> 
>> And, to be honest, for a fair bit of the common code in
>> panfrost/panthor it's common to a few other drivers too. So the correct
>> answer might well be to try to add more generic helpers (devfreq,
>> clocks, power domains all spring to mind - there's a lot of boiler plate
>> and nothing very special about Mali).
> 
> That much is true, however I guess there's also stuff like perf counter
> support which is less likely to be DRM-level generic but perhaps still
> sufficiently similar between JM and CSF. The main thing I don't know,
> and thus feel compelled to poke at, is whether there's any possibility
> that once the new UAPI is mature, it might eventually become preferable
> to move Job Manager support over to some subset of that rather than
> maintain two whole UAPIs in parallel (particularly at the Mesa end). My
> (limited) understanding is that all the BO-wrangling and MMU code is
> primarily different here for the sake of supporting new shiny UAPI
> features, not because of anything inherent to CSF itself (other than CSF
> being the thing which makes supporting said features feasible). If
> that's a preposterous idea and absolutely never ever going to be
> realistic, then fine, but if not, then it feels like the kind of thing
> that my all-too-great experience of technical debt and bad short-term
> decisions tells me is worth planning around from the very start.

I agree this seems to be more of a "political" decision than a
technical one. There is an attempt to start supporting Mali CSF GPUs
better, and hopefully to have more engagement from within Arm, as well
as Arm backing Collabora[1]. This means there's some desire to be able to work
on panthor without having to worry about the potential of regressing
panfrost.

But CSF also brings some fairly radical changes to the way the GPU is
driven: firmware scheduling being the obvious one, and user-mode
submission being something that is hopefully coming soon. So to some
extent there are going to be two UAPIs because the GPU interface has changed.

However, there are definitely aspects of panthor that could apply to
panfrost - VM_BIND *could* be implemented for panfrost and potentially
could be useful. And the control of the GPU's VA space that panthor
provides is something that's lacking in panfrost. The question that I
see is, if panfrost was extended to include these APIs, would anyone use
them? If no-one is going to work on the Mesa side to make use of these
features in panfrost then it's likely to be untested (buggy) code; we'd
be relying on it "being the same as CSF" while not quite being.

In terms of the question of one kernel module or two: it's a good
question. There's a patch that moves panfrost over to using drm_exec[2]
which requires loading a new kernel module - it broke my test setup, but
I don't think we generally consider this ABI that we mustn't break. So I
think there is scope for changing our minds in the future if necessary.

Given that the two drivers are currently not at all combined it seems
sensible to me to build separate kernel modules, but I've no strong
views on that. And it might make sharing code in the future harder.

Steve

[1]
https://www.arm.com/company/news/2023/07/arm-expands-open-source-partnerships

[2]
https://lore.kernel.org/r/20230712124704.333004-6-christian.koenig%40amd.com

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t
  2023-08-19  2:13     ` Dmitry Osipenko
@ 2023-08-28  9:03       ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-28  9:03 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Steven Price, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On Sat, 19 Aug 2023 05:13:06 +0300
Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:

> On 8/11/23 16:08, Steven Price wrote:
> > On 09/08/2023 17:53, Boris Brezillon wrote:  
> >> This way we can grab a pages ref without acquiring the resv lock when
> >> pages_use_count > 0. Need to implement asynchronous map using the  
> > 
> > NIT: s/Need/This is needed/
> >   
> >> drm_gpuva_mgr when the map/unmap operation triggers a mapping split,
> >> requiring the new left/right regions to grab an additional page ref
> >> to guarantee that the pages stay pinned when the middle section is
> >> unmapped.
> >>
> >> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> >> ---
> >>  drivers/gpu/drm/drm_gem_shmem_helper.c  | 28 +++++++++++++------------
> >>  drivers/gpu/drm/lima/lima_gem.c         |  2 +-
> >>  drivers/gpu/drm/panfrost/panfrost_mmu.c |  2 +-
> >>  include/drm/drm_gem_shmem_helper.h      |  2 +-
> >>  4 files changed, 18 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> >> index a783d2245599..ca6938ea1b82 100644
> >> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> >> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> >> @@ -155,7 +155,7 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
> >>  		if (shmem->pages)
> >>  			drm_gem_shmem_put_pages(shmem);
> >>  
> >> -		drm_WARN_ON(obj->dev, shmem->pages_use_count);
> >> +		drm_WARN_ON(obj->dev, atomic_read(&shmem->pages_use_count));
> >>  
> >>  		dma_resv_unlock(shmem->base.resv);
> >>  	}
> >> @@ -172,14 +172,14 @@ static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
> >>  
> >>  	dma_resv_assert_held(shmem->base.resv);
> >>  
> >> -	if (shmem->pages_use_count++ > 0)
> >> +	if (atomic_inc_return(&shmem->pages_use_count) > 1)
> >>  		return 0;
> >>  
> >>  	pages = drm_gem_get_pages(obj);
> >>  	if (IS_ERR(pages)) {
> >>  		drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
> >>  			    PTR_ERR(pages));
> >> -		shmem->pages_use_count = 0;
> >> +		atomic_set(&shmem->pages_use_count, 0);
> >>  		return PTR_ERR(pages);
> >>  	}
> >>  
> >> @@ -210,10 +210,10 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
> >>  
> >>  	dma_resv_assert_held(shmem->base.resv);
> >>  
> >> -	if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
> >> +	if (drm_WARN_ON_ONCE(obj->dev, !atomic_read(&shmem->pages_use_count)))
> >>  		return;
> >>  
> >> -	if (--shmem->pages_use_count > 0)
> >> +	if (atomic_dec_return(&shmem->pages_use_count) > 0)
> >>  		return;
> >>  
> >>  #ifdef CONFIG_X86
> >> @@ -263,6 +263,10 @@ int drm_gem_shmem_pin(struct drm_gem_shmem_object *shmem)
> >>  
> >>  	drm_WARN_ON(obj->dev, obj->import_attach);
> >>  
> >> +	/* If we are the first owner, we need to grab the lock. */
> >> +	if (atomic_inc_not_zero(&shmem->pages_use_count))
> >> +		return 0;
> >> +  
> > 
> > Unless I'm misunderstanding I think this introduces a race where two
> > threads call drm_gem_shmem_pin() at the same time:
> > 
> > Thread1				| Thread 2
> > --------------------------------+------------------------------
> > drm_gem_shmem_pin()		|
> >  - pages_use_count == 0 so not  |
> >    incremented                  |
> >  - lock taken			|
> > drm_gem_shmem_pin_locked()	|
> > drm_gem_shmem_get_pages()	|
> >  - pages_use_count incremented	|
> > <thread descheduled>            | drm_gem_shmem_pin()
> >                                 |  - pages_use_count == 1 so it is
> > 				|    incremented and returns early
> > 				|    without taking the lock
> > 				| Code tries to use shmem->pages
> > <thread rescheduled>		| and blows up
> > drm_gem_get_pages()		|
> > shmem->pages populated		|
> > lock released			|
> > 
> > I think you need to modify drm_gem_shmem_get_pages() to only increment
> > pages_use_count when shmem->pages has been populated.

Oops, didn't spot that race. Thanks for pointing it out.

> 
> This is correct; both pin() and get_pages() should use
> atomic_inc_not_zero().
> 
> Note that we shouldn't open-code atomic functions; there is a kref
> helper for that, which uses refcount_t underneath and has additional
> checks/warnings for count underflow/overflow. I'm going to post patches
> converting drm-shmem to kref around next week, Boris is aware about it
> and we should then sync shrinker/panthor patchsets to the common
> drm-shmem base.

Thanks, I'll have a look at these patches pretty soon.
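
In the meantime, here is roughly the reordering Steven suggested, for
reference (completely untested, ignoring memory-ordering details, and
likely to be superseded by the kref/refcount_t conversion):

	static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
	{
		struct drm_gem_object *obj = &shmem->base;
		struct page **pages;

		dma_resv_assert_held(shmem->base.resv);

		/* Fast path: pages already allocated, just take a ref. */
		if (atomic_inc_not_zero(&shmem->pages_use_count))
			return 0;

		pages = drm_gem_get_pages(obj);
		if (IS_ERR(pages)) {
			drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
				    PTR_ERR(pages));
			return PTR_ERR(pages);
		}

		shmem->pages = pages;

		/* Only publish the ref once shmem->pages is populated, so a
		 * lockless pin() can't see a non-zero count before the pages
		 * array is there.
		 */
		atomic_set(&shmem->pages_use_count, 1);
		return 0;
	}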

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 13/15] drm/panthor: Allow driver compilation
  2023-08-21 17:56           ` Robin Murphy
  2023-08-23  9:17             ` Steven Price
@ 2023-08-29 12:51             ` Boris Brezillon
  1 sibling, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 12:51 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Steven Price, Clément Péron,
	Marty E . Plummer, Faith Ekstrand

On Mon, 21 Aug 2023 18:56:21 +0100
Robin Murphy <robin.murphy@arm.com> wrote:

> > And, to be honest, for a fair bit of the common code in
> > panfrost/panthor it's common to a few other drivers too. So the correct
> > answer might well be to try to add more generic helpers (devfreq,
> > clocks, power domains all spring to mind - there's a lot of boiler plate
> > and nothing very special about Mali).  
> 
> That much is true, however I guess there's also stuff like perf counter 
> support which is less likely to be DRM-level generic but perhaps still 
> sufficiently similar between JM and CSF. The main thing I don't know, 
> and thus feel compelled to poke at, is whether there's any possibility 
> that once the new UAPI is mature, it might eventually become preferable 
> to move Job Manager support over to some subset of that rather than 
> maintain two whole UAPIs in parallel (particularly at the Mesa end). My 
> (limited) understanding is that all the BO-wrangling and MMU code is 
> primarily different here for the sake of supporting new shiny UAPI 
> features, not because of anything inherent to CSF itself (other than CSF 
> being the thing which makes supporting said features feasible).

You nailed it. The fact we went for a new driver is not so much about
supporting CSF HW (though, supporting CSF with the panfrost model is
challenging to be honest, even more if we want a zero-regression
guarantee for pre-existing users), but more about starting from a green
field so we don't have to think about supporting both GL and Vulkan
models (explicit vs implicit VM maintenance, explicit vs implicit
synchronization everywhere, and probably other things I forgot about).
Those are things that are hard to reconcile, which makes the code even
more complicated to comprehend, and more likely to break in subtle ways.

Intel went for this 'new driver' approach with Xe, Nouveau didn't. I
can't guarantee we took the right decision, but it definitely makes the
bringup phase less painful/risky, since we don't have to make sure we
don't regress existing users, and we don't have to implement
wrappers/bridges for the old uAPI.

As for supporting JM with the new driver, that's something we are
considering, especially if we want proper Vulkan support on
bifrost/valhall-non-csf at some point, but that's clearly not the
priority right now.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 03/15] drm/panthor: Add GPU register definitions
  2023-08-11 14:13   ` Steven Price
@ 2023-08-29 13:00     ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 13:00 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Fri, 11 Aug 2023 15:13:23 +0100
Steven Price <steven.price@arm.com> wrote:

> > +#define AS_TRANSCFG_LO(as)				(MMU_AS(as) + 0x30)
> > +#define AS_TRANSCFG_HI(as)				(MMU_AS(as) + 0x34)
> > +#define   AS_TRANSCFG_ADRMODE_LEGACY			(0 << 0)  
> 
> I don't believe legacy mode exists any more (it's not in my copy of the
> spec).

Oops, I'll drop it.

> 
> > +#define   AS_TRANSCFG_ADRMODE_UNMAPPED			(1 << 0)
> > +#define   AS_TRANSCFG_ADRMODE_IDENTITY			(2 << 0)
> > +#define   AS_TRANSCFG_ADRMODE_AARCH64_4K		(6 << 0)
> > +#define   AS_TRANSCFG_ADRMODE_AARCH64_64K		(8 << 0)
> > +#define   AS_TRANSCFG_INA_BITS(x)			((x) << 6)
> > +#define   AS_TRANSCFG_OUTA_BITS(x)			((x) << 14)
> > +#define   AS_TRANSCFG_SL_CONCAT				BIT(22)
> > +#define   AS_TRANSCFG_PTW_MEMATTR_NC			(1 << 24)
> > +#define   AS_TRANSCFG_PTW_MEMATTR_WB			(2 << 24)
> > +#define   AS_TRANSCFG_PTW_SH_NS				(0 << 28)
> > +#define   AS_TRANSCFG_PTW_SH_OS				(2 << 28)
> > +#define   AS_TRANSCFG_PTW_SH_IS				(3 << 28)
> > +#define   AS_TRANSCFG_PTW_RA				BIT(30)
> > +#define   AS_TRANSCFG_DISABLE_HIER_AP			BIT(33)
> > +#define   AS_TRANSCFG_DISABLE_AF_FAULT			BIT(34)
> > +#define   AS_TRANSCFG_WXN				BIT(35)
> > +#define   AS_TRANSCFG_XREADABLE				BIT(36)
> > +#define AS_FAULTEXTRA_LO(as)				(MMU_AS(as) + 0x38)
> > +#define AS_FAULTEXTRA_HI(as)				(MMU_AS(as) + 0x3C)
> > +
> > +#define CSF_GPU_LATEST_FLUSH_ID				0x10000
> > +#define CSF_GPU_LATEST_FLUSH_ID_DEFAULT			0xffffe0  
> 
> I'm not sure why we need the default value of this register? Seems an
> odd thing to include.

I'm using it to set the dummy FLUSH_ID page to the reset value on
suspend, which you suggested setting to zero or one. If we agree on that
(I still want to explain the reasoning before we take a decision), I'll
drop this definition.

> 
> Steve
> 
> > +
> > +#define CSF_DOORBELL(i)					(0x80000 + ((i) * 0x10000))
> > +#define CSF_GLB_DOORBELL_ID				0
> > +
> > +#define gpu_write(dev, reg, data) \
> > +	writel(data, (dev)->iomem + (reg))
> > +
> > +#define gpu_read(dev, reg) \
> > +	readl((dev)->iomem + (reg))
> > +
> > +#endif  
> 


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 04/15] drm/panthor: Add the device logical block
  2023-08-11 15:47   ` Steven Price
@ 2023-08-29 14:00     ` Boris Brezillon
  2023-08-30 13:17       ` Steven Price
  0 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 14:00 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Fri, 11 Aug 2023 16:47:56 +0100
Steven Price <steven.price@arm.com> wrote:

> On 09/08/2023 17:53, Boris Brezillon wrote:
> > The panthor driver is designed in a modular way, where each logical
> > block is dealing with a specific HW-block or software feature. In order
> > for those blocks to communicate with each other, we need a central
> > panthor_device collecting all the blocks, and exposing some common
> > features, like interrupt handling, power management, reset, ...
> > 
> > This is what this panthor_device logical block is about.
> > 
> > v2:
> > - Rename the driver (pancsf -> panthor)
> > - Change the license (GPL2 -> MIT + GPL2)
> > - Split the driver addition commit
> > - Add devfreq/PM support
> > - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_device.c | 479 +++++++++++++++++++++++
> >  drivers/gpu/drm/panthor/panthor_device.h | 354 +++++++++++++++++
> >  2 files changed, 833 insertions(+)
> >  create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
> >  create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > new file mode 100644
> > index 000000000000..15f102116fa0
> > --- /dev/null
> > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > @@ -0,0 +1,479 @@
> > +// SPDX-License-Identifier: GPL-2.0 or MIT
> > +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> > +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> > +/* Copyright 2023 Collabora ltd. */
> > +
> > +#include <linux/clk.h>
> > +#include <linux/reset.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/pm_domain.h>
> > +#include <linux/pm_runtime.h>
> > +#include <linux/regulator/consumer.h>
> > +
> > +#include <drm/drm_drv.h>
> > +#include <drm/drm_managed.h>
> > +
> > +#include "panthor_sched.h"
> > +#include "panthor_device.h"
> > +#include "panthor_devfreq.h"
> > +#include "panthor_gpu.h"
> > +#include "panthor_fw.h"
> > +#include "panthor_mmu.h"
> > +#include "panthor_regs.h"
> > +
> > +static int panthor_clk_init(struct panthor_device *ptdev)
> > +{
> > +	ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
> > +	if (IS_ERR(ptdev->clks.core)) {
> > +		drm_err(&ptdev->base, "get 'core' clock failed %ld\n",
> > +			PTR_ERR(ptdev->clks.core));  
> 
> I suspect it would be a good idea to use dev_err_probe() here (and
> below) as I believe devm_clk_get can return -EPROBE_DEFER.

Nice, I didn't know there was a logging function that silences
probe-defer errors.
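
For reference, a minimal sketch of what the dev_err_probe() conversion
could look like (same structure as the current panthor_clk_init(); this
is not the final code):

	static int panthor_clk_init(struct panthor_device *ptdev)
	{
		ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
		if (IS_ERR(ptdev->clks.core))
			return dev_err_probe(ptdev->base.dev,
					     PTR_ERR(ptdev->clks.core),
					     "get 'core' clock failed\n");

		ptdev->clks.stacks = devm_clk_get_optional(ptdev->base.dev, "stacks");
		if (IS_ERR(ptdev->clks.stacks))
			return dev_err_probe(ptdev->base.dev,
					     PTR_ERR(ptdev->clks.stacks),
					     "get 'stacks' clock failed\n");

		/* Same pattern for the 'coregroup' clock. */
		...
	}

dev_err_probe() stays silent on -EPROBE_DEFER (it records the deferral
reason instead) and returns the error, so each error path collapses
into a single statement.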

> 
> > +		return PTR_ERR(ptdev->clks.core);
> > +	}
> > +
> > +	ptdev->clks.stacks = devm_clk_get_optional(ptdev->base.dev, "stacks");
> > +	if (IS_ERR(ptdev->clks.stacks)) {
> > +		drm_err(&ptdev->base, "get 'stacks' clock failed %ld\n",
> > +			PTR_ERR(ptdev->clks.stacks));
> > +		return PTR_ERR(ptdev->clks.stacks);
> > +	}
> > +
> > +	ptdev->clks.coregroup = devm_clk_get_optional(ptdev->base.dev, "coregroup");
> > +	if (IS_ERR(ptdev->clks.coregroup)) {
> > +		drm_err(&ptdev->base, "get 'coregroup' clock failed %ld\n",
> > +			PTR_ERR(ptdev->clks.coregroup));
> > +		return PTR_ERR(ptdev->clks.coregroup);
> > +	}
> > +
> > +	drm_info(&ptdev->base, "clock rate = %lu\n", clk_get_rate(ptdev->clks.core));
> > +	return 0;
> > +}
> > +
> > +void panthor_device_unplug(struct panthor_device *ptdev)
> > +{
> > +	/* FIXME: This is racy. */  
> 
> Can we fix this? From a quick look it seems like a sequence like below
> should avoid the race.
> 
> 	if (!drm_dev_enter())
> 		/* Already unplugged */
> 		return;
> 	ptdev->base.unplugged = true;
> 	drm_dev_exit();
> 
> Although possibly that should be in the DRM core rather than open-coded
> here.

Are you sure that's protecting us against two concurrent calls to
drm_dev_unplug() (drm_dev_enter() is taking a read-lock)? And that's not
the only thing I need actually. If there are 2 threads entering
panthor_device_unplug(), I need to make sure the one that loses (arrived
after unplugged was already set to true) is waiting for all operations after
the drm_dev_unplug() call to be done, otherwise we might return from
platform_driver->remove() before the unplug cleanups are done, and
there might still be threads/workqueues accessing device resources
while/after they get released by the device-model.
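
One possible way to serialize concurrent callers would be something
like this (just a sketch, assuming a new `unplugged` atomic and an
`unplugged_done` completion added to panthor_device, not necessarily
the final solution):

	void panthor_device_unplug(struct panthor_device *ptdev)
	{
		if (atomic_cmpxchg(&ptdev->unplugged, 0, 1)) {
			/* Lost the race: wait for the winner to finish
			 * the cleanup before returning to the caller.
			 */
			wait_for_completion(&ptdev->unplugged_done);
			return;
		}

		drm_dev_unplug(&ptdev->base);

		/* ... unplug the scheduler/FW/MMU/GPU blocks and drop
		 * the PM references as today ...
		 */

		complete_all(&ptdev->unplugged_done);
	}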

> 
> > +	if (drm_dev_is_unplugged(&ptdev->base))
> > +		return;
> > +
> > +	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> > +
> > +	/* Call drm_dev_unplug() so any access to HW block happening after
> > +	 * that point get rejected.
> > +	 */
> > +	drm_dev_unplug(&ptdev->base);
> > +
> > +	/* Now, try to cleanly shutdown the GPU before the device resources
> > +	 * get reclaimed.
> > +	 */
> > +	panthor_sched_unplug(ptdev);
> > +	panthor_fw_unplug(ptdev);
> > +	panthor_mmu_unplug(ptdev);
> > +	panthor_gpu_unplug(ptdev);
> > +
> > +	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> > +	pm_runtime_put_sync_suspend(ptdev->base.dev);
> > +}
> > +
> > +static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
> > +{
> > +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> > +
> > +	cancel_work_sync(&ptdev->reset.work);
> > +	destroy_workqueue(ptdev->reset.wq);
> > +}
> > +
> > +static void panthor_device_reset_work(struct work_struct *work)
> > +{
> > +	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
> > +	int ret, cookie;
> > +
> > +	if (!drm_dev_enter(&ptdev->base, &cookie))
> > +		return;
> > +
> > +	panthor_sched_pre_reset(ptdev);
> > +	panthor_fw_pre_reset(ptdev, true);
> > +	panthor_mmu_pre_reset(ptdev);
> > +	panthor_gpu_soft_reset(ptdev);
> > +	panthor_gpu_l2_power_on(ptdev);
> > +	panthor_mmu_post_reset(ptdev);
> > +	ret = panthor_fw_post_reset(ptdev);
> > +	if (ret)
> > +		goto out;
> > +
> > +	atomic_set(&ptdev->reset.pending, 0);
> > +	panthor_sched_post_reset(ptdev);
> > +	drm_dev_exit(cookie);
> > +
> > +out:
> > +	if (ret) {  
> 
> This looks like a race condition too - is there a need for a
> drm_dev_exit_and_unplug() function?

drm_dev_exit() is just releasing the read-lock. drm_dev_unplug()
waits for all readers to be done and sets the unplugged value to true.
So we only get readers/writer synchronization here, but nothing doing
writer/writer sync. I guess the drm core leaves that to drivers, given
drm_dev_unplug() is usually called from xxx_driver->remove() hook, on
which serialization is guaranteed by the device-model.

TLDR; yes, it's racy, but I don't think drm_dev_exit_and_unplug() would
help solve the existing race.

It's worth noting that we currently have only 2 paths calling
panthor_device_unplug(): the platform_driver->remove() hook and the
reset worker. Calling drm_dev_unplug() might not be the right thing to
do, I just thought it was a good match to reflect the fact the device
becomes inaccessible, without adding yet another kind of device-lost
field.

> 
> > +		panthor_device_unplug(ptdev);
> > +		drm_err(&ptdev->base, "Failed to boot MCU after reset, making device unusable.");
> > +	}
> > +}
> > +
> > +static bool panthor_device_is_initialized(struct panthor_device *ptdev)
> > +{
> > +	return !!ptdev->scheduler;
> > +}
> > +
> > +static void panthor_device_free_page(struct drm_device *ddev, void *data)
> > +{
> > +	free_page((unsigned long)data);
> > +}
> > +
> > +int panthor_device_init(struct panthor_device *ptdev)
> > +{
> > +	struct resource *res;
> > +	struct page *p;
> > +	int ret;
> > +
> > +	ptdev->coherent = device_get_dma_attr(ptdev->base.dev) == DEV_DMA_COHERENT;
> > +
> > +	drmm_mutex_init(&ptdev->base, &ptdev->pm.lock);
> > +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > +	p = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +	if (!p)
> > +		return -ENOMEM;
> > +
> > +	ptdev->pm.dummy_latest_flush = page_address(p);
> > +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_free_page,
> > +				       ptdev->pm.dummy_latest_flush);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Set the dummy page to the default LATEST_FLUSH value. This
> > +	 * will be updated on the next suspend.
> > +	 */
> > +	*ptdev->pm.dummy_latest_flush = CSF_GPU_LATEST_FLUSH_ID_DEFAULT;  
> 
> I see why this register default value was defined. Although I'm not sure
> it has any benefit over just using zero... If the GPU is off when user
> space reads the FLUSH_ID then the GPU's caches are definitely empty so
> any flush ID is valid.

Zero means we'll force a cache flush for all CSs that were created while
the device was suspended, which isn't ideal.

> 
> Interestingly looking at kbase it seems to use an initial value of 1
> (POWER_DOWN_LATEST_FLUSH_VALUE). I guess zero is less ideal because
> FLUSH_CACHE2 would then unconditionally do a flush.

I guess a value of 1 would work. It just means we'll get a spurious
flush if the CS is submitted after 32 flushes have happened; on the
other hand, we also get a spurious flush on the first submitted CS when
we use POWER_DOWN_LATEST_FLUSH_VALUE. I'll switch to 1, drop the default
definition, and update the comment accordingly.
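
Concretely, the init would then become something like (sketch; final
comment wording TBD):

	/* Set the dummy page to 1. This forces at most one spurious
	 * flush for command streams prepared while the GPU was
	 * suspended; any value is safe here since the GPU caches are
	 * empty at that point.
	 */
	*ptdev->pm.dummy_latest_flush = 1;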

> 
> > +
> > +	INIT_WORK(&ptdev->reset.work, panthor_device_reset_work);
> > +	ptdev->reset.wq = alloc_ordered_workqueue("panthor-reset-wq", 0);
> > +	if (!ptdev->reset.wq)
> > +		return -ENOMEM;
> > +
> > +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_reset_cleanup, NULL);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = panthor_clk_init(ptdev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = panthor_devfreq_init(ptdev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ptdev->iomem = devm_platform_get_and_ioremap_resource(to_platform_device(ptdev->base.dev),
> > +							      0, &res);
> > +	if (IS_ERR(ptdev->iomem))
> > +		return PTR_ERR(ptdev->iomem);
> > +
> > +	ptdev->phys_addr = res->start;
> > +
> > +	ret = devm_pm_runtime_enable(ptdev->base.dev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = panthor_gpu_init(ptdev);
> > +	if (ret)
> > +		goto err_rpm_put;
> > +
> > +	ret = panthor_mmu_init(ptdev);
> > +	if (ret)
> > +		goto err_rpm_put;
> > +
> > +	ret = panthor_fw_init(ptdev);
> > +	if (ret)
> > +		goto err_rpm_put;
> > +
> > +	ret = panthor_sched_init(ptdev);
> > +	if (ret)
> > +		goto err_rpm_put;
> > +
> > +	/* ~3 frames */
> > +	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
> > +	pm_runtime_use_autosuspend(ptdev->base.dev);
> > +	pm_runtime_put_autosuspend(ptdev->base.dev);
> > +	return 0;
> > +
> > +err_rpm_put:
> > +	pm_runtime_put_sync_suspend(ptdev->base.dev);
> > +	return ret;
> > +}
> > +
> > +#define PANTHOR_EXCEPTION(id) \
> > +	[DRM_PANTHOR_EXCEPTION_ ## id] = { \
> > +		.name = #id, \
> > +	}
> > +
> > +struct panthor_exception_info {
> > +	const char *name;
> > +};
> > +
> > +static const struct panthor_exception_info panthor_exception_infos[] = {
> > +	PANTHOR_EXCEPTION(OK),
> > +	PANTHOR_EXCEPTION(TERMINATED),
> > +	PANTHOR_EXCEPTION(KABOOM),
> > +	PANTHOR_EXCEPTION(EUREKA),
> > +	PANTHOR_EXCEPTION(ACTIVE),
> > +	PANTHOR_EXCEPTION(CS_RES_TERM),
> > +	PANTHOR_EXCEPTION(CS_CONFIG_FAULT),
> > +	PANTHOR_EXCEPTION(CS_ENDPOINT_FAULT),
> > +	PANTHOR_EXCEPTION(CS_BUS_FAULT),
> > +	PANTHOR_EXCEPTION(CS_INSTR_INVALID),
> > +	PANTHOR_EXCEPTION(CS_CALL_STACK_OVERFLOW),
> > +	PANTHOR_EXCEPTION(CS_INHERIT_FAULT),
> > +	PANTHOR_EXCEPTION(INSTR_INVALID_PC),
> > +	PANTHOR_EXCEPTION(INSTR_INVALID_ENC),
> > +	PANTHOR_EXCEPTION(INSTR_BARRIER_FAULT),
> > +	PANTHOR_EXCEPTION(DATA_INVALID_FAULT),
> > +	PANTHOR_EXCEPTION(TILE_RANGE_FAULT),
> > +	PANTHOR_EXCEPTION(ADDR_RANGE_FAULT),
> > +	PANTHOR_EXCEPTION(IMPRECISE_FAULT),
> > +	PANTHOR_EXCEPTION(OOM),
> > +	PANTHOR_EXCEPTION(CSF_FW_INTERNAL_ERROR),
> > +	PANTHOR_EXCEPTION(CSF_RES_EVICTION_TIMEOUT),
> > +	PANTHOR_EXCEPTION(GPU_BUS_FAULT),
> > +	PANTHOR_EXCEPTION(GPU_SHAREABILITY_FAULT),
> > +	PANTHOR_EXCEPTION(SYS_SHAREABILITY_FAULT),
> > +	PANTHOR_EXCEPTION(GPU_CACHEABILITY_FAULT),
> > +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_0),
> > +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_1),
> > +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_2),
> > +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_3),
> > +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_4),
> > +	PANTHOR_EXCEPTION(PERM_FAULT_0),
> > +	PANTHOR_EXCEPTION(PERM_FAULT_1),
> > +	PANTHOR_EXCEPTION(PERM_FAULT_2),
> > +	PANTHOR_EXCEPTION(PERM_FAULT_3),
> > +	PANTHOR_EXCEPTION(ACCESS_FLAG_1),
> > +	PANTHOR_EXCEPTION(ACCESS_FLAG_2),
> > +	PANTHOR_EXCEPTION(ACCESS_FLAG_3),
> > +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_IN),
> > +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
> > +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
> > +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
> > +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
> > +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_0),
> > +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_1),
> > +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_2),
> > +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_3),
> > +};
> > +
> > +const char *panthor_exception_name(struct panthor_device *ptdev, u32 exception_code)
> > +{
> > +	if (drm_WARN_ON(&ptdev->base,  
> 
> I'm not convinced this should be a WARN_ON as I suspect it's probably
> possible to inject values from user space (although I'm not completely
> sure on that).

Normally no (it's something returned by the FW), unless userspace gets
access to the kernel <-> FW interface, which would be worrisome :-).

> It's certainly not a driver error as such if we can't
> decode the value.

Ack on dropping the WARN_ON().

> 
> > +			exception_code >= ARRAY_SIZE(panthor_exception_infos) ||
> > +			!panthor_exception_infos[exception_code].name))
> > +		return "Unknown exception type";
> > +
> > +	return panthor_exception_infos[exception_code].name;
> > +}
> > +
> > +static vm_fault_t panthor_mmio_vm_fault(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct panthor_device *ptdev = vma->vm_private_data;
> > +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
> > +	unsigned long pfn;
> > +	pgprot_t pgprot;
> > +	vm_fault_t ret;
> > +	bool active;
> > +	int cookie;
> > +
> > +	if (!drm_dev_enter(&ptdev->base, &cookie))
> > +		return VM_FAULT_SIGBUS;
> > +
> > +	mutex_lock(&ptdev->pm.lock);
> > +	active = atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE;
> > +
> > +	switch (id) {
> > +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
> > +		if (active)
> > +			pfn = __phys_to_pfn(ptdev->phys_addr + CSF_GPU_LATEST_FLUSH_ID);
> > +		else
> > +			pfn = virt_to_pfn(ptdev->pm.dummy_latest_flush);
> > +		break;
> > +
> > +	default:
> > +		ret = VM_FAULT_SIGBUS;
> > +		goto out_unlock;
> > +	}
> > +
> > +	pgprot = vma->vm_page_prot;
> > +	if (active)
> > +		pgprot = pgprot_noncached(pgprot);
> > +
> > +	ret = vmf_insert_pfn_prot(vma, vmf->address, pfn, pgprot);
> > +
> > +out_unlock:
> > +	mutex_unlock(&ptdev->pm.lock);
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +
> > +static const struct vm_operations_struct panthor_mmio_vm_ops = {
> > +	.fault = panthor_mmio_vm_fault,
> > +};
> > +
> > +int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *vma)
> > +{
> > +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
> > +
> > +	switch (id) {
> > +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
> > +		if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
> > +		    (vma->vm_flags & (VM_WRITE | VM_EXEC)))
> > +			return -EINVAL;
> > +
> > +		break;
> > +
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* Defer actual mapping to the fault handler. */
> > +	vma->vm_private_data = ptdev;
> > +	vma->vm_ops = &panthor_mmio_vm_ops;
> > +	vm_flags_set(vma,
> > +		     VM_IO | VM_DONTCOPY | VM_DONTEXPAND |
> > +		     VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP);
> > +	return 0;
> > +}
> > +
> > +#ifdef CONFIG_PM
> > +int panthor_device_resume(struct device *dev)
> > +{
> > +	struct panthor_device *ptdev = dev_get_drvdata(dev);
> > +	int ret, cookie;
> > +
> > +	mutex_lock(&ptdev->pm.lock);
> > +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_RESUMING);
> > +
> > +	ret = clk_prepare_enable(ptdev->clks.core);
> > +	if (ret)
> > +		goto err_unlock;
> > +
> > +	ret = clk_prepare_enable(ptdev->clks.stacks);
> > +	if (ret)
> > +		goto err_disable_core_clk;
> > +
> > +	ret = clk_prepare_enable(ptdev->clks.coregroup);
> > +	if (ret)
> > +		goto err_disable_stacks_clk;
> > +
> > +	ret = panthor_devfreq_resume(ptdev);
> > +	if (ret)
> > +		goto err_disable_coregroup_clk;
> > +
> > +	if (panthor_device_is_initialized(ptdev) &&
> > +	    drm_dev_enter(&ptdev->base, &cookie)) {
> > +		panthor_gpu_resume(ptdev);
> > +		panthor_mmu_resume(ptdev);
> > +		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> > +		if (!ret)
> > +			panthor_sched_resume(ptdev);
> > +
> > +		drm_dev_exit(cookie);
> > +
> > +		if (ret)
> > +			goto err_devfreq_suspend;
> > +	}
> > +
> > +	/* Clear all IOMEM mappings pointing to this device after we've
> > +	 * resumed. This way the fake mappings pointing to the dummy pages
> > +	 * are removed and the real iomem mapping will be restored on next
> > +	 * access.
> > +	 */
> > +	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> > +			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> > +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);  
> 
> Is the ordering here correct? I think we need to set ACTIVE before the
> unmap_mapping_range otherwise there is a (very small) race where user
> space could fault the page (and get the dummy mapping) before the
> atomic_set.

We take the pm.lock in panthor_mmio_vm_fault().

> 
> Hmm, actually we have the pm.lock, so no this isn't racy. In which case
> is there a good reason that you're using atomics? I can see two accesses
> which aren't protected by pm.lock:
> 
>   * the early out in panthor_device_suspend() - which could easily be
> moved inside the lock.

When we're in suspend(), we're the ones in control of pm.state, so no
race is expected here.

> 
>   * panthor_device_schedule_reset() - this looks racy (the power down
> could happen immediately after the atomic_read()), so I suspect it would
> be better moving the check into panthor_device_reset_work() and
> performing it with the pm.lock held.

I think the main reason for it being an atomic is that I didn't have
PM locking in the initial implementation, but I ended up adding locking
at some point because I didn't really have a choice. I thought the race
didn't exist because of the workqueue synchronization/work cancellation
that happens in panthor_sched_suspend(), but I see now that it's not
protecting us (the thread queuing the job could be paused just after
checking the PM state and resumed after the suspend happened). That
being said, we might have a lock ordering issue if we take the PM lock
in that path (I need to check that).

> 
> > +	if (atomic_read(&ptdev->reset.pending))
> > +		queue_work(ptdev->reset.wq, &ptdev->reset.work);
> > +
> > +	mutex_unlock(&ptdev->pm.lock);
> > +	return 0;
> > +
> > +err_devfreq_suspend:
> > +	panthor_devfreq_suspend(ptdev);
> > +
> > +err_disable_coregroup_clk:
> > +	clk_disable_unprepare(ptdev->clks.coregroup);
> > +
> > +err_disable_stacks_clk:
> > +	clk_disable_unprepare(ptdev->clks.stacks);
> > +
> > +err_disable_core_clk:
> > +	clk_disable_unprepare(ptdev->clks.core);
> > +
> > +err_unlock:
> > +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > +	mutex_unlock(&ptdev->pm.lock);
> > +	return ret;
> > +}
> > +
> > +int panthor_device_suspend(struct device *dev)
> > +{
> > +	struct panthor_device *ptdev = dev_get_drvdata(dev);
> > +	int ret, cookie;
> > +
> > +	if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
> > +		return 0;
> > +
> > +	mutex_lock(&ptdev->pm.lock);
> > +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDING);
> > +
> > +	/* Clear all IOMEM mappings pointing to this device before we
> > +	 * shutdown the power-domain and clocks. Failing to do that results
> > +	 * in external aborts when the process accesses the iomem region.
> > +	 */
> > +	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> > +			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> > +
> > +	if (panthor_device_is_initialized(ptdev) &&
> > +	    drm_dev_enter(&ptdev->base, &cookie)) {
> > +		cancel_work_sync(&ptdev->reset.work);
> > +
> > +		/* We prepare everything as if we were resetting the GPU.
> > +		 * The end of the reset will happen in the resume path though.
> > +		 */
> > +		panthor_sched_suspend(ptdev);
> > +		panthor_fw_suspend(ptdev);
> > +		panthor_mmu_suspend(ptdev);
> > +		panthor_gpu_suspend(ptdev);
> > +		drm_dev_exit(cookie);
> > +	}
> > +
> > +	ret = panthor_devfreq_suspend(ptdev);
> > +	if (ret) {
> > +		if (panthor_device_is_initialized(ptdev) &&
> > +		    drm_dev_enter(&ptdev->base, &cookie)) {
> > +			panthor_gpu_resume(ptdev);
> > +			panthor_mmu_resume(ptdev);
> > +			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> > +			panthor_sched_resume(ptdev);
> > +			drm_dev_exit(cookie);
> > +		}
> > +
> > +		atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
> > +		goto out_unlock;
> > +	}
> > +
> > +	/* Before we suspend, update the dummy_latest_flush page, so accesses
> > +	 * to this dummy page return the value the HW would have returned.
> > +	 */
> > +	*ptdev->pm.dummy_latest_flush = gpu_read(ptdev, CSF_GPU_LATEST_FLUSH_ID);  
> 
> As above, I don't believe it is important for user space to know the
> value the HW would have returned during a suspend. Indeed if the
> hardware was successfully suspended the flush ID is likely to be reset -
> so this would be inaccurate. However any value should be safe if the
> work was prepared while the GPU was off as the caches will be empty.

Agreed.

> 
> > +
> > +	clk_disable_unprepare(ptdev->clks.coregroup);
> > +	clk_disable_unprepare(ptdev->clks.stacks);
> > +	clk_disable_unprepare(ptdev->clks.core);
> > +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > +
> > +out_unlock:
> > +	mutex_unlock(&ptdev->pm.lock);
> > +	return ret;
> > +}
> > +#endif
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > new file mode 100644
> > index 000000000000..e0e1be263eb9
> > --- /dev/null
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -0,0 +1,354 @@
> > +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> > +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> > +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> > +/* Copyright 2023 Collabora ltd. */
> > +
> > +#ifndef __PANTHOR_DEVICE_H__
> > +#define __PANTHOR_DEVICE_H__
> > +
> > +#include <linux/atomic.h>
> > +#include <linux/io-pgtable.h>
> > +#include <linux/regulator/consumer.h>
> > +#include <linux/spinlock.h>
> > +#include <drm/drm_device.h>
> > +#include <drm/drm_mm.h>
> > +#include <drm/gpu_scheduler.h>
> > +#include <drm/panthor_drm.h>
> > +
> > +struct panthor_csf;
> > +struct panthor_csf_ctx;
> > +struct panthor_device;
> > +struct panthor_gpu;
> > +struct panthor_group_pool;
> > +struct panthor_heap_pool;
> > +struct panthor_job;
> > +struct panthor_mmu;
> > +struct panthor_fw;
> > +struct panthor_perfcnt;
> > +struct panthor_vm;
> > +struct panthor_vm_pool;
> > +
> > +/**
> > + * enum panthor_device_pm_state - PM state
> > + */
> > +enum panthor_device_pm_state {
> > +	/** @PANTHOR_DEVICE_PM_STATE_SUSPENDED: Device is suspended. */
> > +	PANTHOR_DEVICE_PM_STATE_SUSPENDED = 0,
> > +
> > +	/** @PANTHOR_DEVICE_PM_STATE_RESUMING: Device is being resumed. */
> > +	PANTHOR_DEVICE_PM_STATE_RESUMING,
> > +
> > +	/** @PANTHOR_DEVICE_PM_STATE_ACTIVE: Device is active. */
> > +	PANTHOR_DEVICE_PM_STATE_ACTIVE,
> > +
> > +	/** @PANTHOR_DEVICE_PM_STATE_SUSPENDING: Device is being suspended. */
> > +	PANTHOR_DEVICE_PM_STATE_SUSPENDING,
> > +};
> > +
> > +/**
> > + * struct panthor_irq - IRQ data
> > + *
> > + * Used to automate IRQ handling for the 3 different IRQs we have in this driver.
> > + */
> > +struct panthor_irq {
> > +	/** @ptdev: Panthor device */
> > +	struct panthor_device *ptdev;
> > +
> > +	/** @irq: IRQ number. */
> > +	int irq;
> > +
> > +	/** @mask: Current mask being applied to xxx_INT_MASK. */
> > +	u32 mask;
> > +
> > +	/** @suspended: Set to true when the IRQ is suspended. */
> > +	atomic_t suspended;
> > +};
> > +
> > +/**
> > + * struct panthor_device - Panthor device
> > + */
> > +struct panthor_device {
> > +	/** @base: Base drm_device. */
> > +	struct drm_device base;
> > +
> > +	/** @phys_addr: Physical address of the iomem region. */
> > +	phys_addr_t phys_addr;
> > +
> > +	/** @iomem: CPU mapping of the IOMEM region. */
> > +	void __iomem *iomem;
> > +
> > +	/** @clks: GPU clocks. */
> > +	struct {
> > +		/** @core: Core clock. */
> > +		struct clk *core;
> > +
> > +		/** @stacks: Stacks clock. This clock is optional. */
> > +		struct clk *stacks;
> > +
> > +		/** @coregroup: Core group clock. This clock is optional. */
> > +		struct clk *coregroup;
> > +	} clks;
> > +
> > +	/** @coherent: True if the CPU/GPU are memory coherent. */
> > +	bool coherent;
> > +
> > +	/** @gpu_info: GPU information. */
> > +	struct drm_panthor_gpu_info gpu_info;
> > +
> > +	/** @csif_info: Command stream interface information. */
> > +	struct drm_panthor_csif_info csif_info;
> > +
> > +	/** @gpu: GPU management data. */
> > +	struct panthor_gpu *gpu;
> > +
> > +	/** @fw: FW management data. */
> > +	struct panthor_fw *fw;
> > +
> > +	/** @mmu: MMU management data. */
> > +	struct panthor_mmu *mmu;
> > +
> > +	/** @scheduler: Scheduler management data. */
> > +	struct panthor_scheduler *scheduler;
> > +
> > +	/** @devfreq: Device frequency scaling management data. */
> > +	struct panthor_devfreq *devfreq;
> > +
> > +	/** @reset: Reset related fields. */
> > +	struct {
> > +		/** @wq: Ordered workqueue used to schedule reset operations. */
> > +		struct workqueue_struct *wq;
> > +
> > +		/** @work: Reset work. */
> > +		struct work_struct work;
> > +
> > +		/** @pending: Set to true if a reset is pending. */
> > +		atomic_t pending;
> > +	} reset;
> > +
> > +	/** @pm: Power management related data. */
> > +	struct {
> > +		/** @state: Power state, see panthor_device_pm_state. */
> > +		atomic_t state;
> > +
> > +		/**
> > +		 * @lock: Lock protecting the suspend/resume operations.
> > +		 *
> > +		 * This is needed to ensure we map the dummy IO pages when
> > +		 * the device is being suspended, and the real IO pages when
> > +		 * the device is being resumed. We can't just do with the
> > +		 * state atomicity to deal with this race.
> > +		 */
> > +		struct mutex lock;
> > +
> > +		/**
> > +		 * @dummy_latest_flush: Dummy LATEST_FLUSH page.
> > +		 *
> > +		 * Used to replace the real LATEST_FLUSH page when the GPU
> > +		 * is suspended.
> > +		 */
> > +		u32 *dummy_latest_flush;
> > +	} pm;
> > +};
> > +
> > +/**
> > + * struct panthor_file - Panthor file
> > + */
> > +struct panthor_file {
> > +	/** @ptdev: Device attached to this file. */
> > +	struct panthor_device *ptdev;
> > +
> > +	/** @vms: VM pool attached to this file. */
> > +	struct panthor_vm_pool *vms;
> > +
> > +	/** @groups: Scheduling group pool attached to this file. */
> > +	struct panthor_group_pool *groups;
> > +};
> > +
> > +int panthor_device_init(struct panthor_device *ptdev);
> > +void panthor_device_unplug(struct panthor_device *ptdev);
> > +
> > +/**
> > + * panthor_device_schedule_reset() - Schedules a reset operation
> > + */
> > +static inline void panthor_device_schedule_reset(struct panthor_device *ptdev)
> > +{
> > +	if (atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE &&  
> 
> As above - this is a racy check. Although it might be safe because of
> the cancel_work_sync() call in panthor_device_suspend(). But if we get
> rid of this check we don't need the atomic variable.

As mentioned above, I don't think cancel_work_sync() solves the race:
the check might be done just before suspend, with the thread paused and
then resumed after cancel_work_sync() was called in the suspend path.
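
For reference, the alternative being discussed would look roughly like
this (a sketch only, assuming the pm.lock ordering actually allows it):

	static void panthor_device_reset_work(struct work_struct *work)
	{
		struct panthor_device *ptdev =
			container_of(work, struct panthor_device, reset.work);

		mutex_lock(&ptdev->pm.lock);
		if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE) {
			/* Suspend already started/completed; the pending
			 * reset will be picked up on resume.
			 */
			mutex_unlock(&ptdev->pm.lock);
			return;
		}

		/* ... existing pre_reset/soft_reset/post_reset sequence ... */

		mutex_unlock(&ptdev->pm.lock);
	}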

> 
> > +	    !atomic_cmpxchg(&ptdev->reset.pending, 0, 1))
> > +		queue_work(ptdev->reset.wq, &ptdev->reset.work);
> > +}
> > +
> > +/**
> > + * panthor_device_reset_is_pending() - Checks if a reset is pending.
> > + *
> > + * Return: true if a reset is pending, false otherwise.
> > + */
> > +static inline bool panthor_device_reset_is_pending(struct panthor_device *ptdev)
> > +{
> > +	return atomic_read(&ptdev->reset.pending) != 0;
> > +}
> > +
> > +int panthor_device_mmap_io(struct panthor_device *ptdev,
> > +			   struct vm_area_struct *vma);
> > +
> > +int panthor_device_resume(struct device *dev);
> > +int panthor_device_suspend(struct device *dev);
> > +
> > +enum drm_panthor_exception_type {
> > +	DRM_PANTHOR_EXCEPTION_OK = 0x00,
> > +	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> > +	DRM_PANTHOR_EXCEPTION_KABOOM = 0x05,
> > +	DRM_PANTHOR_EXCEPTION_EUREKA = 0x06,
> > +	DRM_PANTHOR_EXCEPTION_ACTIVE = 0x08,
> > +	DRM_PANTHOR_EXCEPTION_CS_RES_TERM = 0x0f,
> > +	DRM_PANTHOR_EXCEPTION_MAX_NON_FAULT = 0x3f,
> > +	DRM_PANTHOR_EXCEPTION_CS_CONFIG_FAULT = 0x40,
> > +	DRM_PANTHOR_EXCEPTION_CS_ENDPOINT_FAULT = 0x44,
> > +	DRM_PANTHOR_EXCEPTION_CS_BUS_FAULT = 0x48,
> > +	DRM_PANTHOR_EXCEPTION_CS_INSTR_INVALID = 0x49,
> > +	DRM_PANTHOR_EXCEPTION_CS_CALL_STACK_OVERFLOW = 0x4a,
> > +	DRM_PANTHOR_EXCEPTION_CS_INHERIT_FAULT = 0x4b,
> > +	DRM_PANTHOR_EXCEPTION_INSTR_INVALID_PC = 0x50,
> > +	DRM_PANTHOR_EXCEPTION_INSTR_INVALID_ENC = 0x51,
> > +	DRM_PANTHOR_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
> > +	DRM_PANTHOR_EXCEPTION_DATA_INVALID_FAULT = 0x58,
> > +	DRM_PANTHOR_EXCEPTION_TILE_RANGE_FAULT = 0x59,
> > +	DRM_PANTHOR_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
> > +	DRM_PANTHOR_EXCEPTION_IMPRECISE_FAULT = 0x5b,
> > +	DRM_PANTHOR_EXCEPTION_OOM = 0x60,
> > +	DRM_PANTHOR_EXCEPTION_CSF_FW_INTERNAL_ERROR = 0x68,
> > +	DRM_PANTHOR_EXCEPTION_CSF_RES_EVICTION_TIMEOUT = 0x69,
> > +	DRM_PANTHOR_EXCEPTION_GPU_BUS_FAULT = 0x80,
> > +	DRM_PANTHOR_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
> > +	DRM_PANTHOR_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
> > +	DRM_PANTHOR_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
> > +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
> > +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
> > +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
> > +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
> > +	DRM_PANTHOR_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
> > +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_0 = 0xc8,
> > +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_1 = 0xc9,
> > +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_2 = 0xca,
> > +	DRM_PANTHOR_EXCEPTION_PERM_FAULT_3 = 0xcb,
> > +	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
> > +	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_2 = 0xda,
> > +	DRM_PANTHOR_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
> > +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_IN = 0xe0,
> > +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
> > +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
> > +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
> > +	DRM_PANTHOR_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
> > +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
> > +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
> > +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
> > +	DRM_PANTHOR_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
> > +};
> > +
> > +/**
> > + * panthor_exception_is_fault() - Checks if an exception is a fault.
> > + *
> > + * Return: true if the exception is a fault, false otherwise.
> > + */
> > +static inline bool
> > +panthor_exception_is_fault(u32 exception_code)
> > +{
> > +	return exception_code > DRM_PANTHOR_EXCEPTION_MAX_NON_FAULT;
> > +}
> > +
> > +const char *panthor_exception_name(struct panthor_device *ptdev,
> > +				   u32 exception_code);
> > +
> > +/**
> > + * PANTHOR_IRQ_HANDLER() - Define interrupt handlers and the interrupt
> > + * registration function.
> > + *
> > + * The boiler-plate to gracefully deal with shared interrupts is
> > + * auto-generated. All you have to do is call PANTHOR_IRQ_HANDLER()
> > + * just after you actual handler. The handler prototype is:  
> s/you/your/ or probably s/you/the/ since we don't expect people to be
> adding more ;)
> 
> > + *
> > + * void (*handler)(struct panthor_device *, u32 status);
> > + */
> > +#define PANTHOR_IRQ_HANDLER(__name, __reg_prefix, __handler)					\
> > +static irqreturn_t panthor_ ## __name ## _irq_raw_handler(int irq, void *data)			\
> > +{												\
> > +	struct panthor_irq *pirq = data;							\
> > +	struct panthor_device *ptdev = pirq->ptdev;						\  
> 
> Maybe I'm missing something, but I was expecting a check here for if the
> irq has been suspended and to avoid the register reads if it was.

I thought the INT_MASK=0 + synchronize_irq() sequence in
panthor_xxx_irq_suspend() would guarantee that the handler can't be
called after panthor_xxx_irq_suspend() returns.

> Otherwise I'm not entirely sure I follow what all this code is for.

I'm not entirely sure which code we're talking about. The reason we
don't use the default raw IRQ handler is that it doesn't work if the
irq line is shared: in that case, we need to mask all of our interrupts
to make sure other handlers on the same irq line don't get spammed with
our IRQs.
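
For completeness, if we wanted an explicit check in the raw handler on
top of that guarantee, it would only be a couple of lines. A sketch
based on the existing PANTHOR_IRQ_HANDLER() macro (XXX stands for the
per-block register prefix the macro substitutes):

	static irqreturn_t panthor_xxx_irq_raw_handler(int irq, void *data)
	{
		struct panthor_irq *pirq = data;
		struct panthor_device *ptdev = pirq->ptdev;

		/* The IRQ line is shared: bail out early if this
		 * block's interrupts are suspended.
		 */
		if (atomic_read(&pirq->suspended))
			return IRQ_NONE;

		if (!gpu_read(ptdev, XXX_INT_STAT))
			return IRQ_NONE;

		/* Mask our interrupts and let the threaded handler do
		 * the actual processing.
		 */
		gpu_write(ptdev, XXX_INT_MASK, 0);
		return IRQ_WAKE_THREAD;
	}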

> 
> Steve
> 
> > +												\
> > +	if (!gpu_read(ptdev, __reg_prefix ## _INT_STAT))					\
> > +		return IRQ_NONE;								\
> > +												\
> > +	gpu_write(ptdev, __reg_prefix ## _INT_MASK, 0);						\
> > +	return IRQ_WAKE_THREAD;									\
> > +}												\
> > +												\
> > +static irqreturn_t panthor_ ## __name ## _irq_threaded_handler(int irq, void *data)		\
> > +{												\
> > +	struct panthor_irq *pirq = data;							\
> > +	struct panthor_device *ptdev = pirq->ptdev;						\
> > +	irqreturn_t ret = IRQ_NONE;								\
> > +												\
> > +	while (true) {										\
> > +		u32 status = gpu_read(ptdev, __reg_prefix ## _INT_RAWSTAT) & pirq->mask;	\
> > +												\
> > +		if (!status)									\
> > +			break;									\
> > +												\
> > +		gpu_write(ptdev, __reg_prefix ## _INT_CLEAR, status);				\
> > +												\
> > +		__handler(ptdev, status);							\
> > +		ret = IRQ_HANDLED;								\
> > +	}											\
> > +												\
> > +	if (!atomic_read(&pirq->suspended))							\
> > +		gpu_write(ptdev, __reg_prefix ## _INT_MASK, pirq->mask);			\
> > +												\
> > +	return ret;										\
> > +}												\
> > +												\
> > +static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)			\
> > +{												\
> > +	int cookie;										\
> > +												\
> > +	atomic_set(&pirq->suspended, true);							\
> > +												\
> > +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> > +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\
> > +		synchronize_irq(pirq->irq);							\
> > +		drm_dev_exit(cookie);								\
> > +	}											\
> > +												\
> > +	pirq->mask = 0;										\
> > +}												\
> > +												\
> > +static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\
> > +{												\
> > +	int cookie;										\
> > +												\
> > +	atomic_set(&pirq->suspended, false);							\
> > +	pirq->mask = mask;									\
> > +												\
> > +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> > +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_CLEAR, mask);			\
> > +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, mask);			\
> > +		drm_dev_exit(cookie);								\
> > +	}											\
> > +}												\
> > +												\
> > +static int panthor_request_ ## __name ## _irq(struct panthor_device *ptdev,			\
> > +					      struct panthor_irq *pirq,				\
> > +					      int irq, u32 mask)				\
> > +{												\
> > +	pirq->ptdev = ptdev;									\
> > +	pirq->irq = irq;									\
> > +	panthor_ ## __name ## _irq_resume(pirq, mask);						\
> > +												\
> > +	return devm_request_threaded_irq(ptdev->base.dev, irq,					\
> > +					 panthor_ ## __name ## _irq_raw_handler,		\
> > +					 panthor_ ## __name ## _irq_threaded_handler,		\
> > +					 IRQF_SHARED, KBUILD_MODNAME "-" # __name,		\
> > +					 pirq);							\
> > +}
> > +
> > +extern struct workqueue_struct *panthor_cleanup_wq;
> > +
> > +#endif  
> 


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 05/15] drm/panthor: Add the GPU logical block
  2023-08-14 10:54   ` Steven Price
  2023-08-21 16:09     ` Robin Murphy
@ 2023-08-29 14:40     ` Boris Brezillon
  1 sibling, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 14:40 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Mon, 14 Aug 2023 11:54:27 +0100
Steven Price <steven.price@arm.com> wrote:

> On 09/08/2023 17:53, Boris Brezillon wrote:
> > Handles everything that's not related to the FW, the MMU or the
> > scheduler. This is the block dealing with the GPU property retrieval,
> > the GPU block power on/off logic, and some global operations, like
> > global cache flushing.
> > 
> > v2:
> > - Rename the driver (pancsf -> panthor)
> > - Change the license (GPL2 -> MIT + GPL2)
> > - Split the driver addition commit
> > - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> > - Use the panthor_irq layer to manage/process IRQs
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_gpu.c | 463 ++++++++++++++++++++++++++
> >  drivers/gpu/drm/panthor/panthor_gpu.h |  52 +++
> >  2 files changed, 515 insertions(+)
> >  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
> >  create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> > new file mode 100644
> > index 000000000000..47d15334b46e
> > --- /dev/null
> > +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> > @@ -0,0 +1,463 @@
> > +// SPDX-License-Identifier: GPL-2.0 or MIT
> > +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> > +/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
> > +/* Copyright 2019 Collabora ltd. */
> > +
> > +#include <linux/bitfield.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/delay.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/io.h>
> > +#include <linux/iopoll.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/pm_runtime.h>
> > +
> > +#include <drm/drm_drv.h>
> > +#include <drm/drm_managed.h>
> > +
> > +#include "panthor_device.h"
> > +#include "panthor_gpu.h"
> > +#include "panthor_regs.h"
> > +
> > +/**
> > + * struct panthor_gpu - GPU block management data.
> > + */
> > +struct panthor_gpu {
> > +	/** @irq: GPU irq. */
> > +	struct panthor_irq irq;
> > +
> > +	/** @reqs_lock: Lock protecting access to pending_reqs. */
> > +	spinlock_t reqs_lock;
> > +
> > +	/** @pending_reqs: Pending GPU requests. */
> > +	u32 pending_reqs;
> > +
> > +	/** @reqs_acked: GPU request wait queue. */
> > +	wait_queue_head_t reqs_acked;
> > +};
> > +
> > +/**
> > + * struct panthor_model - GPU model description
> > + */
> > +struct panthor_model {
> > +	/** @name: Model name. */
> > +	const char *name;
> > +
> > +	/** @id: Model ID. */
> > +	u32 id;
> > +};
> > +
> > +/**
> > + * GPU_MODEL() - Define a GPU model.
> > + */
> > +#define GPU_MODEL(_name, _id, ...) \
> > +{\
> > +	.name = __stringify(_name),				\
> > +	.id = _id,						\
> > +}
> > +
> > +#define GPU_MODEL_ID_MASK		0xf00f0000  
> 
> I would be nice if we had defines for the two components that make this
> up (ARCH_MAJOR | PRODUCT_MAJOR). It might even be easier to read the
> model list below if we split ID into arch/product combinations (which
> can then be written in decimal rather than hex).

Sure, I can do that.
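
Something along these lines, perhaps (hypothetical names; the encoding
matches the current 0xf00f0000 mask, i.e. arch_major in bits 28-31 and
product_major in bits 16-19):

	#define GPU_ARCH_MAJOR(x)	((u32)(x) << 28)
	#define GPU_PROD_MAJOR(x)	((u32)(x) << 16)

	#define GPU_MODEL(_name, _arch_major, _prod_major) \
	{ \
		.name = __stringify(_name), \
		.id = GPU_ARCH_MAJOR(_arch_major) | GPU_PROD_MAJOR(_prod_major), \
	}

	static const struct panthor_model gpu_models[] = {
		GPU_MODEL(g610, 10, 7),	/* == 0xa0070000 */
		{},
	};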

> 
> > +
> > +static const struct panthor_model gpu_models[] = {
> > +	GPU_MODEL(g610, 0xa0070000),
> > +	{},
> > +};
> > +
> > +#define GPU_INTERRUPTS_MASK	\
> > +	(GPU_IRQ_FAULT | \
> > +	 GPU_IRQ_PROTM_FAULT | \
> > +	 GPU_IRQ_RESET_COMPLETED | \
> > +	 GPU_IRQ_MCU_STATUS_CHANGED | \  
> 
> The code doesn't seem to use the MCU_STATUS_CHANGED interrupt, if it's
> not used then it doesn't make sense to be in the mask.

Oops, I intended to use it and probably never did (I think I'm polling
some register to check the MCU status change). I'll drop this interrupt.

> 
> > +	 GPU_IRQ_CLEAN_CACHES_COMPLETED)
> > +
> > +static void panthor_gpu_init_info(struct panthor_device *ptdev)
> > +{
> > +	const struct panthor_model *model;
> > +	u32 major, minor, status;
> > +	unsigned int i;
> > +
> > +	ptdev->gpu_info.gpu_id = gpu_read(ptdev, GPU_ID);
> > +	ptdev->gpu_info.csf_id = gpu_read(ptdev, GPU_CSF_ID);
> > +	ptdev->gpu_info.gpu_rev = gpu_read(ptdev, GPU_REVID);
> > +	ptdev->gpu_info.l2_features = gpu_read(ptdev, GPU_L2_FEATURES);
> > +	ptdev->gpu_info.tiler_features = gpu_read(ptdev, GPU_TILER_FEATURES);
> > +	ptdev->gpu_info.mem_features = gpu_read(ptdev, GPU_MEM_FEATURES);
> > +	ptdev->gpu_info.mmu_features = gpu_read(ptdev, GPU_MMU_FEATURES);
> > +	ptdev->gpu_info.thread_features = gpu_read(ptdev, GPU_THREAD_FEATURES);
> > +	ptdev->gpu_info.max_threads = gpu_read(ptdev, GPU_THREAD_MAX_THREADS);
> > +	ptdev->gpu_info.thread_max_workgroup_size = gpu_read(ptdev, GPU_THREAD_MAX_WORKGROUP_SIZE);
> > +	ptdev->gpu_info.thread_max_barrier_size = gpu_read(ptdev, GPU_THREAD_MAX_BARRIER_SIZE);
> > +	ptdev->gpu_info.coherency_features = gpu_read(ptdev, GPU_COHERENCY_FEATURES);
> > +	for (i = 0; i < 4; i++)
> > +		ptdev->gpu_info.texture_features[i] = gpu_read(ptdev, GPU_TEXTURE_FEATURES(i));
> > +
> > +	ptdev->gpu_info.as_present = gpu_read(ptdev, GPU_AS_PRESENT);
> > +
> > +	ptdev->gpu_info.shader_present = gpu_read(ptdev, GPU_SHADER_PRESENT_LO);
> > +	ptdev->gpu_info.shader_present |= (u64)gpu_read(ptdev, GPU_SHADER_PRESENT_HI) << 32;
> > +
> > +	ptdev->gpu_info.tiler_present = gpu_read(ptdev, GPU_TILER_PRESENT_LO);
> > +	ptdev->gpu_info.tiler_present |= (u64)gpu_read(ptdev, GPU_TILER_PRESENT_HI) << 32;
> > +
> > +	ptdev->gpu_info.l2_present = gpu_read(ptdev, GPU_L2_PRESENT_LO);
> > +	ptdev->gpu_info.l2_present |= (u64)gpu_read(ptdev, GPU_L2_PRESENT_HI) << 32;
> > +	ptdev->gpu_info.core_group_count = hweight64(ptdev->gpu_info.l2_present);  
> 
> Do we want to expose 'computed' properties like this? My experience in
> the past with kbase is that they can cause problems and are practically
> impossible to kill off once added.

I actually wondered the same. I only did that because panfrost does it.
I can drop it and let mesa calculate the group count if it ever needs it.

> 
> AFAICT it isn't used by the current Mesa driver so I would suggest
> dropping core_group_count (which also enables us to drop the 'pad' field
> which is a nice side-effect).
> 
> > +
> > +	major = (ptdev->gpu_info.gpu_id >> 12) & 0xf;
> > +	minor = (ptdev->gpu_info.gpu_id >> 4) & 0xff;
> > +	status = ptdev->gpu_info.gpu_id & 0xf;
> > +
> > +	for (model = gpu_models; model->name; model++) {
> > +		if (model->id == (ptdev->gpu_info.gpu_id & GPU_MODEL_ID_MASK))
> > +			break;
> > +	}
> > +
> > +	drm_info(&ptdev->base,
> > +		 "mali-%s id 0x%x major 0x%x minor 0x%x status 0x%x",
> > +		 model->name ?: "unknown", ptdev->gpu_info.gpu_id >> 16,
> > +		 major, minor, status);
> > +
> > +	drm_info(&ptdev->base,
> > +		 "Features: L2:0x%08x Tiler:0x%08x Mem:0x%0x MMU:0x%08x AS:0x%x",  
> 
> There's an odd mix of format strings here. "%0x" for Mem and just "%x"
> for AS.

Sure, I can make it consistent, just tell me which version you prefer
;-).

> 
> > +		 ptdev->gpu_info.l2_features,
> > +		 ptdev->gpu_info.tiler_features,
> > +		 ptdev->gpu_info.mem_features,
> > +		 ptdev->gpu_info.mmu_features,
> > +		 ptdev->gpu_info.as_present);
> > +
> > +	drm_info(&ptdev->base,
> > +		 "shader_present=0x%0llx l2_present=0x%0llx tiler_present=0x%0llx",
> > +		 ptdev->gpu_info.shader_present, ptdev->gpu_info.l2_present,
> > +		 ptdev->gpu_info.tiler_present);
> > +}
> > +
> > +static void panthor_gpu_irq_handler(struct panthor_device *ptdev, u32 status)
> > +{
> > +	if (status & (GPU_IRQ_FAULT | GPU_IRQ_PROTM_FAULT)) {  
> 
> The spec states that GPU_FAULTSTATUS "does not update for
> GPU_PROTECTED_FAULT interrupts" - so I don't think we want
> GPU_IRQ_PROTM_FAULT in that condition. Or at least printing the
> exception information should ideally be avoided.

Right.

> 
> If I understand correctly a protected fault interrupt is basically
> saying the fault is the same as a GPU_IRQ_FAULT but the GPU isn't going
> to tell us the details because it was in protected mode (and it doesn't
> to accidentally leak the 'super secret' content).

That's my understanding too. I'll add a separate if () block for the
protm-fault.
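
i.e. something like this sketch:

	if (status & GPU_IRQ_FAULT) {
		u32 fault_status = gpu_read(ptdev, GPU_FAULT_STATUS);
		u64 address = ((u64)gpu_read(ptdev, GPU_FAULT_ADDR_HI) << 32) |
			      gpu_read(ptdev, GPU_FAULT_ADDR_LO);

		drm_warn(&ptdev->base, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
			 fault_status,
			 panthor_exception_name(ptdev, fault_status & 0xFF),
			 address);
	}

	if (status & GPU_IRQ_PROTM_FAULT) {
		/* GPU_FAULT_STATUS is not updated for protected-mode
		 * faults, so there are no details to decode here.
		 */
		drm_warn(&ptdev->base, "GPU Fault in protected mode\n");
	}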

> 
> > +		u32 fault_status = gpu_read(ptdev, GPU_FAULT_STATUS);
> > +		u64 address = ((u64)gpu_read(ptdev, GPU_FAULT_ADDR_HI) << 32) |
> > +			      gpu_read(ptdev, GPU_FAULT_ADDR_LO);
> > +
> > +		drm_warn(&ptdev->base, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
> > +			 fault_status, panthor_exception_name(ptdev, fault_status & 0xFF),
> > +			 address);
> > +	}
> > +
> > +	spin_lock(&ptdev->gpu->reqs_lock);
> > +	if (status & ptdev->gpu->pending_reqs) {
> > +		ptdev->gpu->pending_reqs &= ~status;
> > +		wake_up_all(&ptdev->gpu->reqs_acked);
> > +	}
> > +	spin_unlock(&ptdev->gpu->reqs_lock);
> > +}
> > +PANTHOR_IRQ_HANDLER(gpu, GPU, panthor_gpu_irq_handler);
> > +
> > +/**
> > + * panthor_gpu_unplug() - Called when the GPU is unplugged.
> > + */
> > +void panthor_gpu_unplug(struct panthor_device *ptdev)
> > +{
> > +	unsigned long flags;
> > +
> > +	/* Make sure the IRQ handler is not running after that point. */
> > +	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> > +
> > +	/* Wake-up all waiters. */
> > +	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> > +	ptdev->gpu->pending_reqs = 0;
> > +	wake_up_all(&ptdev->gpu->reqs_acked);
> > +	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> > +}
> > +
> > +/**
> > + * panthor_gpu_init() - Initialize the GPU block
> > + * @ptdev: Device.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_gpu_init(struct panthor_device *ptdev)
> > +{
> > +	struct panthor_gpu *gpu;
> > +	u32 pa_bits;
> > +	int ret, irq;
> > +
> > +	gpu = drmm_kzalloc(&ptdev->base, sizeof(*gpu), GFP_KERNEL);
> > +	if (!gpu)
> > +		return -ENOMEM;
> > +
> > +	spin_lock_init(&gpu->reqs_lock);
> > +	init_waitqueue_head(&gpu->reqs_acked);
> > +	ptdev->gpu = gpu;
> > +	panthor_gpu_init_info(ptdev);
> > +
> > +	dma_set_max_seg_size(ptdev->base.dev, UINT_MAX);
> > +	pa_bits = GPU_MMU_FEATURES_PA_BITS(ptdev->gpu_info.mmu_features);
> > +	ret = dma_set_mask_and_coherent(ptdev->base.dev, DMA_BIT_MASK(pa_bits));
> > +	if (ret)
> > +		return ret;
> > +
> > +	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "gpu");
> > +	if (irq <= 0)
> > +		return ret;
> > +
> > +	ret = panthor_request_gpu_irq(ptdev, &ptdev->gpu->irq, irq, GPU_INTERRUPTS_MASK);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_gpu_block_power_off() - Power-off a specific block of the GPU
> > + * @ptdev: Device.
> > + * @blk_name: Block name.
> > + * @pwroff_reg: Power-off register for this block.
> > + * @pwrtrans_reg: Power transition register for this block.
> > + * @mask: Sub-elements to power-off.
> > + * @timeout_ms: Timeout in milliseconds.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_gpu_block_power_off(struct panthor_device *ptdev,
> > +				const char *blk_name,
> > +				u32 pwroff_reg, u32 pwrtrans_reg,
> > +				u64 mask, u32 timeout_us)
> > +{
> > +	u32 val, i;
> > +	int ret;
> > +
> > +	for (i = 0; i < 2; i++) {
> > +		u32 mask32 = mask >> (i * 32);
> > +
> > +		if (!mask32)
> > +			continue;
> > +
> > +		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
> > +						 val, !(mask32 & val),
> > +						 100, timeout_us);
> > +		if (ret) {
> > +			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
> > +				blk_name, mask);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	if (mask & GENMASK(31, 0))
> > +		gpu_write(ptdev, pwroff_reg, mask);
> > +
> > +	if (mask >> 32)
> > +		gpu_write(ptdev, pwroff_reg, mask >> 32);  
> 
> This should be pwroff_reg + 4.

Oh, I'm lucky to not have cores above bit 31 in the G610 :-).

> 
> > +
> > +	for (i = 0; i < 2; i++) {
> > +		u32 mask32 = mask >> (i * 32);
> > +
> > +		if (!mask32)
> > +			continue;
> > +
> > +		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
> > +						 val, !(mask & val),
> > +						 100, timeout_us);
> > +		if (ret) {
> > +			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
> > +				blk_name, mask);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_gpu_block_power_on() - Power-on a specific block of the GPU
> > + * @ptdev: Device.
> > + * @blk_name: Block name.
> > + * @pwron_reg: Power-on register for this block.
> > + * @pwrtrans_reg: Power transition register for this block.
> > + * @mask: Sub-elements to power-on.
> > + * @timeout_ms: Timeout in milliseconds.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_gpu_block_power_on(struct panthor_device *ptdev,
> > +			       const char *blk_name,
> > +			       u32 pwron_reg, u32 pwrtrans_reg,
> > +			       u32 rdy_reg, u64 mask, u32 timeout_us)
> > +{
> > +	u32 val, i;
> > +	int ret;
> > +
> > +	for (i = 0; i < 2; i++) {
> > +		u32 mask32 = mask >> (i * 32);
> > +
> > +		if (!mask32)
> > +			continue;
> > +
> > +		ret = readl_relaxed_poll_timeout(ptdev->iomem + pwrtrans_reg + (i * 4),
> > +						 val, !(mask32 & val),
> > +						 100, timeout_us);
> > +		if (ret) {
> > +			drm_err(&ptdev->base, "timeout waiting on %s:%llx power transition",
> > +				blk_name, mask);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	if (mask & GENMASK(31, 0))
> > +		gpu_write(ptdev, pwron_reg, mask);
> > +
> > +	if (mask >> 32)
> > +		gpu_write(ptdev, pwron_reg + 4, mask >> 32);
> > +
> > +	for (i = 0; i < 2; i++) {
> > +		u32 mask32 = mask >> (i * 32);
> > +
> > +		if (!mask32)
> > +			continue;
> > +
> > +		ret = readl_relaxed_poll_timeout(ptdev->iomem + rdy_reg + (i * 4),
> > +						 val, (mask32 & val) == mask32,
> > +						 100, timeout_us);
> > +		if (ret) {
> > +			drm_err(&ptdev->base, "timeout waiting on %s:%llx readyness",
> > +				blk_name, mask);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_gpu_l2_power_on() - Power-on the L2-cache
> > + * @ptdev: Device.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
> > +{
> > +	u64 core_mask = U64_MAX;
> > +
> > +	if (ptdev->gpu_info.l2_present != 1) {
> > +		/*
> > +		 * Only support one core group now.
> > +		 * ~(l2_present - 1) unsets all bits in l2_present except
> > +		 * the bottom bit. (l2_present - 2) has all the bits in
> > +		 * the first core group set. AND them together to generate
> > +		 * a mask of cores in the first core group.
> > +		 */
> > +		core_mask = ~(ptdev->gpu_info.l2_present - 1) &
> > +			     (ptdev->gpu_info.l2_present - 2);
> > +		drm_info_once(&ptdev->base, "using only 1st core group (%lu cores from %lu)\n",
> > +			      hweight64(core_mask),
> > +			      hweight64(ptdev->gpu_info.shader_present));  
> 
> I'm not sure what the point of this complexity is.

Copied directly from panfrost. I didn't even try to understand why
this was written like that :-).

> This boils down to
> the equivalent of:
> 
> 	if (ptdev->gpu_info.l2_present != 1)
> 		core_mask = 1;

I think what it does is create a mask for the first core group only. So,
an equivalent to this logic would be:

	first_core_group_mask = find_second_bit_set(l2_mask) - 1;
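
Or, spelling that pseudocode out with existing kernel helpers (assuming
l2_present has at least two bits set when we get here):

	u64 l2_present = ptdev->gpu_info.l2_present;

	/* All bits below the second set bit of l2_present, i.e. the
	 * cores attached to the first core group.
	 */
	core_mask = BIT_ULL(__ffs64(l2_present & (l2_present - 1))) - 1;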

> 
> If we were doing shader-core power management manually (like on pre-CSF
> GPUs, rather than letting the firmware control it) then the computed
> core_mask would be useful.

I agree with your new proposal, assuming s/core_mask/l2_mask/.

> So I guess it comes down to the
> drm_info_once() output and counting the cores - which is nice to have
> but it took me some time figuring out what was going on here.

If we only wanted to count the cores, we'd just do
hweight64(ptdev->gpu_info.shader_present). Here we reflect the fact
that only cores from the first group are usable. I don't remember what
the problem was with core_group > 1 though.

> 
> > +	}
> > +
> > +	return panthor_gpu_power_on(ptdev, L2,
> > +				    ptdev->gpu_info.l2_present & core_mask,
> > +				    20000);
> > +}
> > +
> > +/**
> > + * panthor_gpu_flush_caches() - Flush caches
> > + * @ptdev: Device.
> > + * @l2: L2 flush type.
> > + * @lsc: LSC flush type.
> > + * @other: Other flush type.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_gpu_flush_caches(struct panthor_device *ptdev,
> > +			     u32 l2, u32 lsc, u32 other)
> > +{
> > +	bool timedout = false;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> > +	if (!drm_WARN_ON(&ptdev->base,
> > +			 ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED)) {
> > +		ptdev->gpu->pending_reqs |= GPU_IRQ_CLEAN_CACHES_COMPLETED;
> > +		gpu_write(ptdev, GPU_CMD, GPU_FLUSH_CACHES(l2, lsc, other));
> > +	}
> > +	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> > +
> > +	if (!wait_event_timeout(ptdev->gpu->reqs_acked,
> > +				!(ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED),
> > +				msecs_to_jiffies(100))) {
> > +		spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> > +		if ((ptdev->gpu->pending_reqs & GPU_IRQ_CLEAN_CACHES_COMPLETED) != 0 &&
> > +		    !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_CLEAN_CACHES_COMPLETED))
> > +			timedout = true;
> > +		spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> > +	}
> > +
> > +	if (timedout) {
> > +		drm_err(&ptdev->base, "Flush caches timeout");
> > +		return -ETIMEDOUT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_gpu_soft_reset() - Issue a soft-reset
> > + * @ptdev: Device.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_gpu_soft_reset(struct panthor_device *ptdev)
> > +{
> > +	bool timedout = false;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> > +	if (!drm_WARN_ON(&ptdev->base,
> > +			 ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED)) {
> > +		ptdev->gpu->pending_reqs |= GPU_IRQ_RESET_COMPLETED;
> > +		gpu_write(ptdev, GPU_INT_CLEAR, GPU_IRQ_RESET_COMPLETED);
> > +		gpu_write(ptdev, GPU_CMD, GPU_SOFT_RESET);
> > +	}
> > +	spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> > +
> > +	if (!wait_event_timeout(ptdev->gpu->reqs_acked,
> > +				!(ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED),
> > +				msecs_to_jiffies(100))) {
> > +		spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> > +		if ((ptdev->gpu->pending_reqs & GPU_IRQ_RESET_COMPLETED) != 0 &&
> > +		    !(gpu_read(ptdev, GPU_INT_RAWSTAT) & GPU_IRQ_RESET_COMPLETED))
> > +			timedout = true;
> > +		spin_unlock_irqrestore(&ptdev->gpu->reqs_lock, flags);
> > +	}
> > +
> > +	if (timedout) {
> > +		drm_err(&ptdev->base, "Soft reset timeout");
> > +		return -ETIMEDOUT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_gpu_suspend() - Suspend the GPU block.
> > + * @ptdev: Device.
> > + *
> > + * Soft reset and suspend the GPU irq. This should be called last
> > + * in the suspend procedure, after all other blocks have been suspended.
> > + */
> > +void panthor_gpu_suspend(struct panthor_device *ptdev)
> > +{
> > +	panthor_gpu_soft_reset(ptdev);  
> 
> I'm not sure why we need to soft-reset when suspending? I guess this is
> instead of manually powering off the L2? It might be the right action,
> but it would be good to have a comment explaining why.

When the MCU halt operation fails, we have to issue a soft-reset, and
since that also worked for successful suspensions, I kept the same logic
for both. It might indeed be preferable to do a soft-reset only when the
MCU wasn't suspended properly, and manually power off the L2s otherwise.
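
Roughly like this sketch (panthor_fw_mcu_halted() being a hypothetical
helper reporting whether the MCU halted cleanly):

	void panthor_gpu_suspend(struct panthor_device *ptdev)
	{
		/* Only fall back to a soft-reset if the MCU didn't halt
		 * cleanly, otherwise just power the L2 off.
		 */
		if (panthor_fw_mcu_halted(ptdev))
			panthor_gpu_power_off(ptdev, L2,
					      ptdev->gpu_info.l2_present,
					      20000);
		else
			panthor_gpu_soft_reset(ptdev);

		panthor_gpu_irq_suspend(&ptdev->gpu->irq);
	}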

> 
> Steve
> 
> > +	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> > +}
> > +
> > +/**
> > + * panthor_gpu_resume() - Resume the GPU block.
> > + *
> > + * Resume the IRQ handler and power-on the L2-cache.
> > + * The FW takes care of powering the other blocks.
> > + */
> > +void panthor_gpu_resume(struct panthor_device *ptdev)
> > +{
> > +	panthor_gpu_irq_resume(&ptdev->gpu->irq, GPU_INTERRUPTS_MASK);
> > +	panthor_gpu_l2_power_on(ptdev);
> > +}
> > diff --git a/drivers/gpu/drm/panthor/panthor_gpu.h b/drivers/gpu/drm/panthor/panthor_gpu.h
> > new file mode 100644
> > index 000000000000..bba7555dd3c6
> > --- /dev/null
> > +++ b/drivers/gpu/drm/panthor/panthor_gpu.h
> > @@ -0,0 +1,52 @@
> > +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> > +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> > +/* Copyright 2019 Collabora ltd. */
> > +
> > +#ifndef __PANTHOR_GPU_H__
> > +#define __PANTHOR_GPU_H__
> > +
> > +struct panthor_device;
> > +
> > +int panthor_gpu_init(struct panthor_device *ptdev);
> > +void panthor_gpu_unplug(struct panthor_device *ptdev);
> > +void panthor_gpu_suspend(struct panthor_device *ptdev);
> > +void panthor_gpu_resume(struct panthor_device *ptdev);
> > +
> > +int panthor_gpu_block_power_on(struct panthor_device *ptdev,
> > +			       const char *blk_name,
> > +			       u32 pwron_reg, u32 pwrtrans_reg,
> > +			       u32 rdy_reg, u64 mask, u32 timeout_us);
> > +int panthor_gpu_block_power_off(struct panthor_device *ptdev,
> > +				const char *blk_name,
> > +				u32 pwroff_reg, u32 pwrtrans_reg,
> > +				u64 mask, u32 timeout_us);
> > +
> > +/**
> > + * panthor_gpu_power_on() - Power on the GPU block.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +#define panthor_gpu_power_on(ptdev, type, mask, timeout_us) \
> > +	panthor_gpu_block_power_on(ptdev, #type, \
> > +				  type ## _PWRON_LO, \
> > +				  type ## _PWRTRANS_LO, \
> > +				  type ## _READY_LO, \
> > +				  mask, timeout_us)
> > +
> > +/**
> > + * panthor_gpu_power_off() - Power off the GPU block.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +#define panthor_gpu_power_off(ptdev, type, mask, timeout_us) \
> > +	panthor_gpu_block_power_off(ptdev, #type, \
> > +				   type ## _PWROFF_LO, \
> > +				   type ## _PWRTRANS_LO, \
> > +				   mask, timeout_us)
> > +
> > +int panthor_gpu_l2_power_on(struct panthor_device *ptdev);
> > +int panthor_gpu_flush_caches(struct panthor_device *ptdev,
> > +			     u32 l2, u32 lsc, u32 other);
> > +int panthor_gpu_soft_reset(struct panthor_device *ptdev);
> > +
> > +#endif  
> 


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 05/15] drm/panthor: Add the GPU logical block
  2023-08-21 16:09     ` Robin Murphy
  2023-08-23  8:48       ` Steven Price
@ 2023-08-29 14:42       ` Boris Brezillon
  1 sibling, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 14:42 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Steven Price, Clément Péron,
	Marty E . Plummer, Faith Ekstrand

On Mon, 21 Aug 2023 17:09:49 +0100
Robin Murphy <robin.murphy@arm.com> wrote:

> On 2023-08-14 11:54, Steven Price wrote:
> [...]
> >> +/**
> >> + * panthor_gpu_l2_power_on() - Power-on the L2-cache
> >> + * @ptdev: Device.
> >> + *
> >> + * Return: 0 on success, a negative error code otherwise.
> >> + */
> >> +int panthor_gpu_l2_power_on(struct panthor_device *ptdev)
> >> +{
> >> +	u64 core_mask = U64_MAX;
> >> +
> >> +	if (ptdev->gpu_info.l2_present != 1) {
> >> +		/*
> >> +		 * Only support one core group now.
> >> +		 * ~(l2_present - 1) unsets all bits in l2_present except
> >> +		 * the bottom bit. (l2_present - 2) has all the bits in
> >> +		 * the first core group set. AND them together to generate
> >> +		 * a mask of cores in the first core group.
> >> +		 */
> >> +		core_mask = ~(ptdev->gpu_info.l2_present - 1) &
> >> +			     (ptdev->gpu_info.l2_present - 2);
> >> +		drm_info_once(&ptdev->base, "using only 1st core group (%lu cores from %lu)\n",
> >> +			      hweight64(core_mask),
> >> +			      hweight64(ptdev->gpu_info.shader_present));  
> > 
> > I'm not sure what the point of this complexity is. This boils down to
> > the equivalent of:
> > 
> > 	if (ptdev->gpu_info.l2_present != 1)
> > 		core_mask = 1;  
> 
> Hmm, that doesn't look right - the idiom here should be to set all bits 
> of the output below the *second* set bit of the input, i.e. 0x11 -> 
> 0x0f.

Ah ah, I should really read the whole thread before replying :-).
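
For anyone else following along, a quick worked example of the intended
idiom with Robin's 0x11 case (illustrative only):

	u64 l2_present = 0x11;		/* two L2s: bits 0 and 4 */
	u64 a = ~(l2_present - 1);	/* ~0x10: clears bit 4, keeps bit 0 */
	u64 b = l2_present - 2;		/* 0x0f: cores below the 2nd L2 */
	u64 core_mask = a & b;		/* 0x0f, i.e. 0x11 -> 0x0f */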

> However since panthor is (somewhat ironically) unlikely to ever 
> run on T628, and everything newer should pretend to have a single L2 
> because software-managed coherency is a terrible idea, I would agree 
> that ultimately it does all seem a bit pointless.

Okay, good to know.

> 
> > If we were doing shader-core power management manually (like on pre-CSF
> > GPUs, rather than letting the firmware control it) then the computed
> > core_mask would be useful. So I guess it comes down to the
> > drm_info_once() output and counting the cores - which is nice to have
> > but it took me some time figuring out what was going on here.  
> As for the complexity, I'd suggest you can have some choice words with 
> the guy who originally suggested that code[1] ;)

:'-)

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 06/15] drm/panthor: Add GEM logical block
  2023-08-14 13:40   ` Steven Price
@ 2023-08-29 14:45     ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 14:45 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Mon, 14 Aug 2023 14:40:25 +0100
Steven Price <steven.price@arm.com> wrote:

> > +/**
> > + * panthor_gem_create_with_handle() - Create a GEM object and attach it to a handle.
> > + * @file: DRM file.
> > + * @ddev: DRM device.
> > + * @exclusive_vm: Exclusive VM. Not NULL if the GEM object can't be shared.
> > + * @size: Size of the GEM object to allocate.
> > + * @flags: Combination of drm_panthor_bo_flags flags.
> > + * @handle: Pointer holding the handle pointing to the new GEM object.
> > + *
> > + * Return: A valid pointer on success, an ERR_PTR() otherwise.
> > + */
> > +struct panthor_gem_object *
> > +panthor_gem_create_with_handle(struct drm_file *file,
> > +			       struct drm_device *ddev,
> > +			       struct panthor_vm *exclusive_vm,
> > +			       size_t size,
> > +			       u32 flags, u32 *handle)
> > +{
> > +	int ret;
> > +	struct drm_gem_shmem_object *shmem;
> > +	struct panthor_gem_object *bo;
> > +
> > +	shmem = drm_gem_shmem_create(ddev, size);
> > +	if (IS_ERR(shmem))
> > +		return ERR_CAST(shmem);
> > +
> > +	bo = to_panthor_bo(&shmem->base);
> > +	bo->flags = flags;
> > +
> > +	if (exclusive_vm) {
> > +		bo->exclusive_vm = panthor_vm_get(exclusive_vm);
> > +		bo->base.base.resv = panthor_vm_resv(exclusive_vm);
> > +	}
> > +
> > +	/*
> > +	 * Allocate an id of idr table where the obj is registered
> > +	 * and handle has the id what user can see.
> > +	 */
> > +	ret = drm_gem_handle_create(file, &shmem->base, handle);
> > +	/* drop reference from allocate - handle holds it now. */
> > +	drm_gem_object_put(&shmem->base);
> > +	if (ret)
> > +		return ERR_PTR(ret);
> > +
> > +	return bo;
> > +}  
> 
> This function might be better just returning a simple int. The
> "with_handle" approach means that doing anything much with the returned
> object is dodgy (because another user space thread could have already
> guessed the handle), and anyway the only caller
> (panthor_ioctl_bo_create()) doesn't use the object and just extracts the
> error code (if any).

Totally agree. I'll change the return type to an int.
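
Roughly (untested sketch based on the code quoted above):

	int panthor_gem_create_with_handle(struct drm_file *file,
					   struct drm_device *ddev,
					   struct panthor_vm *exclusive_vm,
					   size_t size, u32 flags, u32 *handle)
	{
		struct drm_gem_shmem_object *shmem;
		struct panthor_gem_object *bo;
		int ret;

		shmem = drm_gem_shmem_create(ddev, size);
		if (IS_ERR(shmem))
			return PTR_ERR(shmem);

		bo = to_panthor_bo(&shmem->base);
		bo->flags = flags;

		if (exclusive_vm) {
			bo->exclusive_vm = panthor_vm_get(exclusive_vm);
			bo->base.base.resv = panthor_vm_resv(exclusive_vm);
		}

		ret = drm_gem_handle_create(file, &shmem->base, handle);

		/* Drop the reference we own: on success the handle holds
		 * one, on failure this releases the BO.
		 */
		drm_gem_object_put(&shmem->base);
		return ret;
	}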

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block
  2023-08-14 15:53   ` Steven Price
@ 2023-08-29 15:33     ` Boris Brezillon
  2023-08-30 14:12       ` Steven Price
  0 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 15:33 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Mon, 14 Aug 2023 16:53:09 +0100
Steven Price <steven.price@arm.com> wrote:

> > +
> > +/**
> > + * struct panthor_vm_op_ctx - VM operation context
> > + *
> > + * With VM operations potentially taking place in a dma-signaling path, we
> > + * need to make sure everything that might require resource allocation is
> > + * pre-allocated upfront. This is what this operation context is for.
> > + *
> > + * We also collect resources that have been freed, so we can release them
> > + * asynchronously, and let the VM_BIND scheduler process the next VM_BIND
> > + * request.
> > + */
> > +struct panthor_vm_op_ctx {
> > +	/** @rsvd_page_tables: Pages reserved for the MMU page table update. */
> > +	struct {
> > +		/** @count: Number of pages reserved. */
> > +		u32 count;
> > +
> > +		/** @ptr: Point to the first unused page in the @pages table. */
> > +		u32 ptr;
> > +
> > +		/**
> > +		 * @page: Array of pages that can be used for an MMU page table update.
> > +		 *
> > +		 * After a VM operation, there might be free pages left in this array.
> > +		 * They should be returned to the pt_cache as part of the op_ctx cleanup.
> > +		 */
> > +		void **pages;
> > +	} rsvd_page_tables;  
> 
> Two questions:
> 
> 1) Would a mempool simplify the implementation? It looks like a
> reasonable match.

Not sure what you mean by mempool, but I'm using a kmem_cache here for
all page table allocations. The pages that are passed to
panthor_vm_op_ctx::rsvd_page_tables::pages are allocated from this
pool. It's just that for each VM operation we pre-allocate page-tables,
and release those that were not used when the operation is done (we
over-allocate for the worst case scenario).
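
To illustrate (simplified sketch with hypothetical helper names; the
real code also has to unwind on allocation failure, which is omitted
here):

	/* Reserve enough page tables from the global pt_cache to cover
	 * the worst case for this VM operation.
	 */
	static int panthor_vm_op_ctx_prealloc_pts(struct panthor_vm_op_ctx *op_ctx,
						  u32 worst_case_pt_count)
	{
		u32 i;

		op_ctx->rsvd_page_tables.pages =
			kcalloc(worst_case_pt_count, sizeof(void *), GFP_KERNEL);
		if (!op_ctx->rsvd_page_tables.pages)
			return -ENOMEM;

		for (i = 0; i < worst_case_pt_count; i++) {
			op_ctx->rsvd_page_tables.pages[i] =
				kmem_cache_zalloc(pt_cache, GFP_KERNEL);
			if (!op_ctx->rsvd_page_tables.pages[i])
				return -ENOMEM;
		}

		op_ctx->rsvd_page_tables.count = worst_case_pt_count;
		op_ctx->rsvd_page_tables.ptr = 0;
		return 0;
	}

	/* Once the operation is done, pages between ptr and count were
	 * not consumed and go back to the cache.
	 */
	static void panthor_vm_op_ctx_return_unused_pts(struct panthor_vm_op_ctx *op_ctx)
	{
		u32 i;

		for (i = op_ctx->rsvd_page_tables.ptr;
		     i < op_ctx->rsvd_page_tables.count; i++)
			kmem_cache_free(pt_cache, op_ctx->rsvd_page_tables.pages[i]);
	}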

> 
> 2) Does it really make sense to have a separate pool of memory for every
> operation? Instead of having a separate pool for each operation, it
> would be possible to just keep track of the total number needed for all
> outstanding operations. Then a single (per device or maybe per-VM if
> necessary) mempool could be resized to ensure it has the right amount of
> space.

The pool is per-driver (see the global pt_cache). rsvd_page_tables just
holds pages needed for a specific VM operation. To be more specific, it
holds pages for the worst case (page table tree is empty, except for the
root page table).

> 
> I'm also a little wary that the VM_BIND infrastructure could potentially
> be abused to trigger a large amount of kernel allocation as it allocates
> up-front for the worst case but those pages are not charged to the
> process (AFAICT). But I haven't fully got my head round that yet.

Yep, that's problematic, indeed. I considered allocating page tables
as GEM objects, but the overhead of a GEM object is quite big
(hundreds of bytes of meta-data) compared to the size of a page table
(4k), and kmem_cache was just super convenient for this page table
cache :-).

> 
> > +
> > +	/** @flags: Combination of drm_panthor_vm_bind_op_flags. */
> > +	u32 flags;
> > +
> > +	/** @va: Virtual range targeted by the VM operation. */
> > +	struct {
> > +		/** @addr: Start address. */
> > +		u64 addr;
> > +
> > +		/** @range: Range size. */
> > +		u64 range;
> > +	} va;
> > +
> > +	/**
> > +	 * @returned_vmas: List of panthor_vma objects returned after a VM operation.
> > +	 *
> > +	 * For unmap operations, this will contain all VMAs that were covered by the
> > +	 * specified VA range.
> > +	 *
> > +	 * For map operations, this will contain all VMAs that previously mapped to
> > +	 * the specified VA range.
> > +	 *
> > +	 * Those VMAs, and the resources they point to will be released as part of
> > +	 * the op_ctx cleanup operation.
> > +	 */
> > +	struct list_head returned_vmas;
> > +
> > +	/** @map: Fields specific to a map operation. */
> > +	struct {
> > +		/** @gem: GEM object information. */
> > +		struct {
> > +			/** @obj: GEM object to map. */
> > +			struct drm_gem_object *obj;
> > +
> > +			/** @offset: Offset in the GEM object. */
> > +			u64 offset;
> > +		} gem;
> > +
> > +		/**
> > +		 * @sgt: sg-table pointing to pages backing the GEM object.
> > +		 *
> > +		 * This is gathered at job creation time, such that we don't have
> > +		 * to allocate in ::run_job().
> > +		 */
> > +		struct sg_table *sgt;
> > +
> > +		/**
> > +		 * @prev_vma: Pre-allocated VMA object to deal with a remap situation.
> > +		 *
> > +		 * If the map request covers a region that's inside another VMA, the
> > +		 * previous VMA will be split, requiring instantiation of a maximum of
> > +		 * two new VMA objects.
> > +		 */
> > +		struct panthor_vma *prev_vma;
> > +
> > +		/**
> > +		 * @new_vma: The new VMA object that will be inserted to the VA tree.
> > +		 */
> > +		struct panthor_vma *new_vma;
> > +
> > +		/**
> > +		 * @next_vma: Pre-allocated VMA object to deal with a remap situation.
> > +		 *
> > +		 * See @prev_vma.
> > +		 */
> > +		struct panthor_vma *next_vma;  
> 
> It's probably premature optimization, but it feels like having a cache
> of these VMA structures might be an idea.

If it's needed, I'll probably go for a kmem_cache, but I need to
check if it's worth it first (if the closest kmalloc cache is
significantly bigger than the struct size).

> I'm also struggling to
> understand how both a new prev and new next VMA are needed - but I
> haven't dug into the GPU VA manager.

prev/next are for mapping splits: an object is already mapped, and a new
object is mapped in the middle of this pre-existing mapping. In that
case, we need two VMA objects for the preceding and succeeding mappings,
since the old mapping object will be released.

new_vma is for the new mapping.
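
Maybe a small picture helps. A map request B landing in the middle of an
existing mapping A:

	before:   |------------------ A ------------------|
	map B:                |------- B -------|
	after:    |--- A' ---||------- B -------||--- A'' ---|
	            prev_vma        new_vma         next_vma

A gets unlinked and released as part of the op_ctx cleanup, so up to
three new panthor_vma objects can be needed for a single map operation.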

> 
> > +	} map;
> > +};
> > +

[...]

> > +/**
> > + * panthor_vm_active() - Flag a VM as active
> > + * @VM: VM to flag as active.
> > + *
> > + * Assigns an address space to a VM so it can be used by the GPU/MCU.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_vm_active(struct panthor_vm *vm)
> > +{
> > +	struct panthor_device *ptdev = vm->ptdev;
> > +	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg;
> > +	int ret = 0, as, cookie;
> > +	u64 transtab, transcfg;
> > +
> > +	if (!drm_dev_enter(&ptdev->base, &cookie))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&ptdev->mmu->as.slots_lock);
> > +
> > +	as = vm->as.id;
> > +	if (as >= 0) {
> > +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> > +
> > +		if (ptdev->mmu->as.faulty_mask & mask) {
> > +			/* Unhandled pagefault on this AS, the MMU was
> > +			 * disabled. We need to re-enable the MMU after
> > +			 * clearing+unmasking the AS interrupts.
> > +			 */
> > +			gpu_write(ptdev, MMU_INT_CLEAR, mask);
> > +			ptdev->mmu->as.faulty_mask &= ~mask;
> > +			gpu_write(ptdev, MMU_INT_MASK, ~ptdev->mmu->as.faulty_mask);
> > +			goto out_enable_as;
> > +		}
> > +
> > +		goto out_unlock;
> > +	}
> > +
> > +	/* Check for a free AS */
> > +	if (vm->for_mcu) {
> > +		drm_WARN_ON(&ptdev->base, ptdev->mmu->as.alloc_mask & BIT(0));
> > +		as = 0;
> > +	} else {
> > +		as = ffz(ptdev->mmu->as.alloc_mask | BIT(0));
> > +	}
> > +
> > +	if (!(BIT(as) & ptdev->gpu_info.as_present)) {
> > +		struct panthor_vm *lru_vm;
> > +
> > +		lru_vm = list_first_entry_or_null(&ptdev->mmu->as.lru_list,
> > +						  struct panthor_vm,
> > +						  as.lru_node);
> > +		if (drm_WARN_ON(&ptdev->base, !lru_vm)) {
> > +			ret = -EBUSY;
> > +			goto out_unlock;
> > +		}
> > +
> > +		list_del_init(&lru_vm->as.lru_node);
> > +		as = lru_vm->as.id;  
> 
> Should this not set lru_vm->as.id = -1, so that the code knows the VM no
> longer has an address space?

Good catch!
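
I'll add the missing

	lru_vm->as.id = -1;

right after the list_del_init() in the eviction path.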

> 
> > +	} else {
> > +		set_bit(as, &ptdev->mmu->as.alloc_mask);
> > +	}
> > +
> > +	/* Assign the free or reclaimed AS to the FD */
> > +	vm->as.id = as;
> > +	ptdev->mmu->as.slots[as].vm = vm;
> > +
> > +out_enable_as:
> > +	transtab = cfg->arm_lpae_s1_cfg.ttbr;
> > +	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
> > +		   AS_TRANSCFG_PTW_RA |
> > +		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
> > +	if (ptdev->coherent)
> > +		transcfg |= AS_TRANSCFG_PTW_SH_OS;
> > +
> > +	ret = panthor_mmu_as_enable(vm->ptdev, vm->as.id, transtab, transcfg, vm->memattr);
> > +
> > +out_unlock:
> > +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +

[...]

> > +
> > +static void panthor_mmu_irq_handler(struct panthor_device *ptdev, u32 status)
> > +{
> > +	status = panthor_mmu_fault_mask(ptdev, status);
> > +	while (status) {
> > +		u32 as = ffs(status | (status >> 16)) - 1;
> > +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> > +		u32 new_int_mask;
> > +		u64 addr;
> > +		u32 fault_status;
> > +		u32 exception_type;
> > +		u32 access_type;
> > +		u32 source_id;
> > +
> > +		fault_status = gpu_read(ptdev, AS_FAULTSTATUS(as));
> > +		addr = gpu_read(ptdev, AS_FAULTADDRESS_LO(as));
> > +		addr |= (u64)gpu_read(ptdev, AS_FAULTADDRESS_HI(as)) << 32;
> > +
> > +		/* decode the fault status */
> > +		exception_type = fault_status & 0xFF;
> > +		access_type = (fault_status >> 8) & 0x3;
> > +		source_id = (fault_status >> 16);
> > +
> > +		/* Page fault only */  
> 
> This comment makes no sense - it looks like it's copied over from panfrost.

Uh, it made sense before I dropped map/alloc-on-fault :-).

> 
> If I understand correctly we don't (currently) support growing on page
> fault - and it's not really needed now the MCU can handle the tiler heaps.

Exactly. Map/alloc on fault is a bit challenging because of the whole
'we have to guarantee that a job is done in finite time, and we must
make sure fence signaling is not blocked on allocation'. Given
drm_gem_get_pages() doesn't do non-blocking allocations, I thought it'd
be preferable to postpone map-on-fault until we actually decide we need
it. Note that i915 seems to have some sort of non-blocking page
allocator in shmem_sg_alloc_table()[1].

> 
> > +		mutex_lock(&ptdev->mmu->as.slots_lock);
> > +
> > +		new_int_mask =
> > +			panthor_mmu_fault_mask(ptdev, ~ptdev->mmu->as.faulty_mask);
> > +
> > +		/* terminal fault, print info about the fault */
> > +		drm_err(&ptdev->base,
> > +			"Unhandled Page fault in AS%d at VA 0x%016llX\n"
> > +			"raw fault status: 0x%X\n"
> > +			"decoded fault status: %s\n"
> > +			"exception type 0x%X: %s\n"
> > +			"access type 0x%X: %s\n"
> > +			"source id 0x%X\n",
> > +			as, addr,
> > +			fault_status,
> > +			(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
> > +			exception_type, panthor_exception_name(ptdev, exception_type),
> > +			access_type, access_type_name(ptdev, fault_status),
> > +			source_id);
> > +
> > +		/* Ignore MMU interrupts on this AS until it's been
> > +		 * re-enabled.
> > +		 */
> > +		ptdev->mmu->irq.mask = new_int_mask;
> > +		gpu_write(ptdev, MMU_INT_MASK, new_int_mask);
> > +
> > +		/* Disable the MMU to kill jobs on this AS. */
> > +		panthor_mmu_as_disable(ptdev, as);
> > +		mutex_unlock(&ptdev->mmu->as.slots_lock);
> > +
> > +		status &= ~mask;
> > +	}
> > +}
> > +PANTHOR_IRQ_HANDLER(mmu, MMU, panthor_mmu_irq_handler);
> > +

[...]

> > +
> > +/**
> > + * panthor_mmu_unplug() - Unplug the MMU logic
> > + * @ptdev: Device.
> > + *
> > + * No access to the MMU regs should be done after this function is called.
> > + * We suspend the IRQ and disable all VMs to guarantee that.
> > + */
> > +void panthor_mmu_unplug(struct panthor_device *ptdev)
> > +{
> > +	if (ptdev->mmu->irq.irq > 0)  
> 
> In what situation is this not true? AFAICT the driver probe will fail if
> the IRQ can't be obtained.

Right, I'll drop this test.

[1]https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/i915/gem/i915_gem_shmem.c#L63

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 09/15] drm/panthor: Add the FW logical block
  2023-08-16 16:01   ` Steven Price
@ 2023-08-29 16:15     ` Boris Brezillon
  2023-08-30 15:20       ` Steven Price
  0 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 16:15 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Wed, 16 Aug 2023 17:01:56 +0100
Steven Price <steven.price@arm.com> wrote:

> On 09/08/2023 17:53, Boris Brezillon wrote:
> > Contains everything that's FW related, that includes the code dealing
> > with the microcontroller unit (MCU) that's running the FW, and anything
> > related to allocating memory shared between the FW and the CPU.
> > 
> > A few global FW events are processed in the IRQ handler, the rest is
> > forwarded to the scheduler, since scheduling is the primary reason for
> > the FW existence, and also the main source of FW <-> kernel
> > interactions.
> > 
> > v2:
> > - Rename the driver (pancsf -> panthor)
> > - Rename the file (_mcu -> _fw)
> > - Change the license (GPL2 -> MIT + GPL2)
> > - Split the driver addition commit
> > - Document the code
> > - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> > - Use the panthor_irq layer to manage/process IRQs
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_fw.c | 1417 ++++++++++++++++++++++++++
> >  drivers/gpu/drm/panthor/panthor_fw.h |  505 +++++++++
> >  2 files changed, 1922 insertions(+)
> >  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
> >  create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> > new file mode 100644
> > index 000000000000..359a68f7af03
> > --- /dev/null
> > +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> > @@ -0,0 +1,1417 @@
> > +// SPDX-License-Identifier: GPL-2.0 or MIT
> > +/* Copyright 2023 Collabora ltd. */
> > +
> > +#include <linux/clk.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/firmware.h>
> > +#include <linux/iopoll.h>
> > +#include <linux/iosys-map.h>
> > +#include <linux/mutex.h>
> > +#include <linux/platform_device.h>
> > +
> > +#include <drm/drm_drv.h>
> > +#include <drm/drm_managed.h>
> > +
> > +#include "panthor_device.h"
> > +#include "panthor_gem.h"
> > +#include "panthor_gpu.h"
> > +#include "panthor_regs.h"
> > +#include "panthor_fw.h"
> > +#include "panthor_mmu.h"
> > +#include "panthor_sched.h"
> > +
> > +#define CSF_FW_NAME "mali_csffw.bin"
> > +
> > +#define PING_INTERVAL_MS			12000
> > +#define PROGRESS_TIMEOUT_CYCLES			(5ull * 500 * 1024 * 1024)
> > +#define PROGRESS_TIMEOUT_SCALE_SHIFT		10
> > +#define IDLE_HYSTERESIS_US			800
> > +#define PWROFF_HYSTERESIS_US			10000
> > +
> > +/**
> > + * struct panthor_fw_mem - FW memory
> > + */
> > +struct panthor_fw_mem {
> > +	/** @bo: Buffer object backing the FW memory. */
> > +	struct panthor_gem_object *bo;
> > +
> > +	/** @kmap: Kernel CPU mapping of the FW memory. */
> > +	void *kmap;
> > +
> > +	/** @va: MCU mapping of the FW memory. */
> > +	u64 va;
> > +};
> > +
> > +/**
> > + * struct panthor_fw_binary_hdr - Firmware binary header.
> > + */
> > +struct panthor_fw_binary_hdr {
> > +	/** @magic: Magic value to check binary validity. */
> > +	u32 magic;
> > +#define CSF_FW_BINARY_HEADER_MAGIC		0xc3f13a6e
> > +
> > +	/** @minor: Minor FW version. */
> > +	u8 minor;
> > +
> > +	/** @major: Major FW version. */
> > +	u8 major;
> > +#define CSF_FW_BINARY_HEADER_MAJOR_MAX		0
> > +
> > +	/** @padding1: MBZ. */
> > +	u16 padding1;
> > +
> > +	/** @version_hash: FW version hash. */
> > +	u32 version_hash;
> > +
> > +	/** @padding2: MBZ. */
> > +	u32 padding2;
> > +
> > +	/** @size: FW binary size. */
> > +	u32 size;
> > +};
> > +
> > +/**
> > + * enum panthor_fw_binary_entry_type - Firmware binary entry type
> > + */
> > +enum panthor_fw_binary_entry_type {
> > +	/** @CSF_FW_BINARY_ENTRY_TYPE_IFACE: Host <-> FW interface. */
> > +	CSF_FW_BINARY_ENTRY_TYPE_IFACE = 0,
> > +
> > +	/** @CSF_FW_BINARY_ENTRY_TYPE_CONFIG: FW config. */
> > +	CSF_FW_BINARY_ENTRY_TYPE_CONFIG = 1,
> > +
> > +	/** @CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST: Unit-tests. */
> > +	CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST = 2,
> > +
> > +	/** @CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER: Trace buffer interface. */
> > +	CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER = 3,
> > +
> > +	/** @CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA: Timeline metadata interface. */
> > +	CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA = 4,
> > +};
> > +
> > +#define CSF_FW_BINARY_ENTRY_TYPE(ehdr)					((ehdr) & 0xff)
> > +#define CSF_FW_BINARY_ENTRY_SIZE(ehdr)					(((ehdr) >> 8) & 0xff)
> > +#define CSF_FW_BINARY_ENTRY_UPDATE					BIT(30)
> > +#define CSF_FW_BINARY_ENTRY_OPTIONAL					BIT(31)
> > +
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_RD					BIT(0)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_WR					BIT(1)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_EX					BIT(2)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_NONE			(0 << 3)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED			(1 << 3)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_UNCACHED_COHERENT	(2 << 3)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED_COHERENT		(3 << 3)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK			GENMASK(4, 3)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_PROT				BIT(5)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED				BIT(30)
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO				BIT(31)
> > +
> > +#define CSF_FW_BINARY_IFACE_ENTRY_RD_SUPPORTED_FLAGS			\
> > +	(CSF_FW_BINARY_IFACE_ENTRY_RD_RD |				\
> > +	 CSF_FW_BINARY_IFACE_ENTRY_RD_WR |				\
> > +	 CSF_FW_BINARY_IFACE_ENTRY_RD_EX |				\
> > +	 CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK |			\
> > +	 CSF_FW_BINARY_IFACE_ENTRY_RD_PROT |				\
> > +	 CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED  |				\
> > +	 CSF_FW_BINARY_IFACE_ENTRY_RD_ZERO)
> > +
> > +/**
> > + * struct panthor_fw_binary_section_entry_hdr - Describes a section of FW binary
> > + */
> > +struct panthor_fw_binary_section_entry_hdr {
> > +	/** @flags: Section flags. */
> > +	u32 flags;
> > +
> > +	/** @va: MCU virtual range to map this binary section to. */
> > +	struct {
> > +		/** @start: Start address. */
> > +		u32 start;
> > +
> > +		/** @end: End address. */
> > +		u32 end;
> > +	} va;
> > +
> > +	/** @data: Data to initialize the FW section with. */
> > +	struct {
> > +		/** @start: Start offset in the FW binary. */
> > +		u32 start;
> > +
> > +		/** @end: End offset in the FW binary. */
> > +		u32 end;
> > +	} data;
> > +};
> > +
> > +/**
> > + * struct panthor_fw_binary_iter - Firmware binary iterator
> > + *
> > + * Used to parse a firmware binary.
> > + */
> > +struct panthor_fw_binary_iter {
> > +	/** @data: FW binary data. */
> > +	const void *data;
> > +
> > +	/** @size: FW binary size. */
> > +	size_t size;
> > +
> > +	/** @offset: Iterator offset. */
> > +	size_t offset;
> > +};
> > +
> > +/**
> > + * struct panthor_fw_section - FW section
> > + */
> > +struct panthor_fw_section {
> > +	/** @node: Used to keep track of FW sections. */
> > +	struct list_head node;
> > +
> > +	/** @flags: Section flags, as encoded in the FW binary. */
> > +	u32 flags;
> > +
> > +	/** @mem: Section memory. */
> > +	struct panthor_fw_mem *mem;
> > +
> > +	/**
> > +	 * @name: Name of the section, as specified in the binary.
> > +	 *
> > +	 * Can be NULL.
> > +	 */
> > +	const char *name;
> > +
> > +	/**
> > +	 * @data: Initial data copied to the FW memory.
> > +	 *
> > +	 * We keep data around so we can reload sections after a reset.
> > +	 */
> > +	struct {
> > +		/** @buf: Buffed used to store init data. */
> > +		const void *buf;
> > +
> > +		/** @size: Size of @buf in bytes. */
> > +		size_t size;
> > +	} data;
> > +};
> > +
> > +#define CSF_MCU_SHARED_REGION_START		0x04000000ULL
> > +#define CSF_MCU_SHARED_REGION_SIZE		0x04000000ULL
> > +
> > +#define MIN_CS_PER_CSG				8
> > +#define MIN_CSGS				3
> > +#define MAX_CSG_PRIO				0xf
> > +
> > +#define CSF_IFACE_VERSION(major, minor, patch)	\
> > +	(((major) << 24) | ((minor) << 16) | (patch))
> > +#define CSF_IFACE_VERSION_MAJOR(v)		((v) >> 24)
> > +#define CSF_IFACE_VERSION_MINOR(v)		(((v) >> 16) & 0xff)
> > +#define CSF_IFACE_VERSION_PATCH(v)		((v) & 0xffff)
> > +
> > +#define CSF_GROUP_CONTROL_OFFSET		0x1000
> > +#define CSF_STREAM_CONTROL_OFFSET		0x40
> > +#define CSF_UNPRESERVED_REG_COUNT		4
> > +
> > +/**
> > + * struct panthor_fw_iface - FW interfaces
> > + */
> > +struct panthor_fw_iface {
> > +	/** @global: Global interface. */
> > +	struct panthor_fw_global_iface global;
> > +
> > +	/** @groups: Group slot interfaces. */
> > +	struct panthor_fw_csg_iface groups[MAX_CSGS];
> > +
> > +	/** @streams: Command stream slot interfaces. */
> > +	struct panthor_fw_cs_iface streams[MAX_CSGS][MAX_CS_PER_CSG];
> > +};
> > +
> > +/**
> > + * struct panthor_fw - Firmware management
> > + */
> > +struct panthor_fw {
> > +	/** @vm: MCU VM. */
> > +	struct panthor_vm *vm;
> > +
> > +	/** @sections: List of FW sections. */
> > +	struct list_head sections;
> > +
> > +	/** @shared_section: The section containing the FW interfaces. */
> > +	struct panthor_fw_section *shared_section;
> > +
> > +	/** @iface: FW interfaces. */
> > +	struct panthor_fw_iface iface;
> > +
> > +	/** @watchdog: Collection of fields relating to the FW watchdog. */
> > +	struct {
> > +		/** @ping_work: Delayed work used to ping the FW. */
> > +		struct delayed_work ping_work;
> > +	} watchdog;
> > +
> > +	/**
> > +	 * @waitqueues: Request waitqueues.
> > +	 *
> > +	 * Everytime a request is sent to a command stream group or the global
> > +	 * interface, the caller will first busy wait for the request to be
> > +	 * acknowledged, and then fallback to a sleeping wait.
> > +	 *
> > +	 * Those wait queues are here to support the sleeping wait flavor.
> > +	 *
> > +	 * Entry 31 is the global waitqueue, the other ones are the command
> > +	 * stream group slot waitqueues.
> > +	 */
> > +	wait_queue_head_t waitqueues[32];
> > +
> > +	/** @booted: True if the FW is booted */
> > +	bool booted;
> > +
> > +	/**
> > +	 * @fast_reset: True if the post_reset logic can proceed with a fast reset.
> > +	 *
> > +	 * A fast reset is just a reset where the driver doesn't reload the FW sections.
> > +	 *
> > +	 * Any time the firmware is properly suspended, a fast reset can take place.
> > +	 * On the other hand, if the halt operation failed, the driver will reload
> > +	 * all sections to make sure we start from a fresh state.
> > +	 */
> > +	bool fast_reset;
> > +
> > +	/** @irq: Job irq data. */
> > +	struct panthor_irq irq;
> > +};
> > +
> > +/**
> > + * panthor_fw_get_glb_iface() - Get the global interface
> > + * @ptdev: Device.
> > + *
> > + * Return: The global interface.
> > + */
> > +struct panthor_fw_global_iface *
> > +panthor_fw_get_glb_iface(struct panthor_device *ptdev)
> > +{
> > +	return &ptdev->fw->iface.global;
> > +}
> > +
> > +/**
> > + * panthor_fw_get_csg_iface() - Get a command stream group slot interface
> > + * @ptdev: Device.
> > + * @csg_slot: Index of the command stream group slot.
> > + *
> > + * Return: The command stream group slot interface.
> > + */
> > +struct panthor_fw_csg_iface *
> > +panthor_fw_get_csg_iface(struct panthor_device *ptdev, u32 csg_slot)
> > +{
> > +	if (drm_WARN_ON(&ptdev->base, csg_slot >= MAX_CSGS))
> > +		return NULL;
> > +
> > +	return &ptdev->fw->iface.groups[csg_slot];
> > +}
> > +
> > +/**
> > + * panthor_fw_get_cs_iface() - Get a command stream slot interface
> > + * @ptdev: Device.
> > + * @csg_slot: Index of the command stream group slot.
> > + * @cs_slot: Index of the command stream slot.
> > + *
> > + * Return: The command stream slot interface.
> > + */
> > +struct panthor_fw_cs_iface *
> > +panthor_fw_get_cs_iface(struct panthor_device *ptdev, u32 csg_slot, u32 cs_slot)
> > +{
> > +	if (drm_WARN_ON(&ptdev->base, csg_slot >= MAX_CSGS || cs_slot > MAX_CS_PER_CSG))
> > +		return NULL;
> > +
> > +	return &ptdev->fw->iface.streams[csg_slot][cs_slot];
> > +}
> > +
> > +/**
> > + * panthor_fw_conv_timeout() - Convert a timeout into a cycle-count
> > + * @ptdev: Device.
> > + * @timeout_us: Timeout expressed in micro-seconds.
> > + *
> > + * The FW has two timer sources: the GPU counter or arch-timer. We need
> > + * to express timeouts in term of number of cycles and specify which
> > + * timer source should be used.
> > + *
> > + * Return: A value suitable for timeout fields in the global interface.
> > + */
> > +static u32 panthor_fw_conv_timeout(struct panthor_device *ptdev, u32 timeout_us)
> > +{
> > +	bool use_cycle_counter = false;
> > +	u32 timer_rate = 0;
> > +	u64 cycles;
> > +
> > +#ifdef CONFIG_ARM_ARCH_TIMER
> > +	timer_rate = arch_timer_get_cntfrq();
> > +#endif
> > +
> > +	if (!timer_rate) {
> > +		use_cycle_counter = true;
> > +		timer_rate = clk_get_rate(ptdev->clks.core);
> > +	}
> > +
> > +	if (drm_WARN_ON(&ptdev->base, !timer_rate)) {
> > +		/* We couldn't get a valid clock rate, let's just pick the
> > +		 * maximum value so the FW still handles the core
> > +		 * power on/off requests.
> > +		 */
> > +		return GLB_TIMER_VAL(0x7fffffff) |  
> 
> NIT: This feels like a magic number that could be included in the
> header. Or it could be rewritten as GLB_TIMER_VAL(~0) to more clearly
> represent 'maximum'.

I'll go for the latter, after checking GLB_TIMER_VAL() has a valid mask
operation.

> 
> > +		       GLB_TIMER_SOURCE_GPU_COUNTER;
> > +	}
> > +
> > +	cycles = DIV_ROUND_UP_ULL((u64)timeout_us * timer_rate, 1000000);
> > +	return GLB_TIMER_VAL(cycles >> 10) |  
> 
> NIT: This isn't quite as ideal as it could be. The round up is done
> before the shift. Plus it's technically possible to overflow the 31 bits
> available (although that requires a several minute timeout and the
> fastest possible clock).
> 
> I'd be tempted to rewrite as:
> 
> 	mod_cycles = DIV_ROUND_UP_ULL((u64)timeout_us * timer_rate,
> 				      1000000 << 10);
> 
> I'm not sure if the theorectical overflow is worth considering, but it
> can be handled as:
> 
> 	if (drm_WARN_ON(&ptdev->base, mod_cycles >= (1 << 31)))
> 		mod_cycles = (1 << 31) - 1;
> 
> or following the style I suggested above:
> 
> 	if (drm_WARN_ON(&ptdev->base, mod_cycles > GLB_TIMER_VAL(~0)))
> 		mod_cycles = GFB_TIMER_VAL(~0);

Ack.
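
So, folding both suggestions in, something like (untested, and assuming
GLB_TIMER_VAL() masks its argument to the timer-value field width):

	mod_cycles = DIV_ROUND_UP_ULL((u64)timeout_us * timer_rate,
				      1000000ull << 10);
	if (drm_WARN_ON(&ptdev->base, mod_cycles > GLB_TIMER_VAL(~0)))
		mod_cycles = GLB_TIMER_VAL(~0);

	return GLB_TIMER_VAL(mod_cycles) |
	       (use_cycle_counter ? GLB_TIMER_SOURCE_GPU_COUNTER : 0);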

> 
> > +	       (use_cycle_counter ? GLB_TIMER_SOURCE_GPU_COUNTER : 0);
> > +}
> > +

[...]

> > +/**
> > + * panthor_fw_mem_alloc() - Allocate a FW memory object and map it to the MCU VM.
> > + * @ptdev: Device.
> > + * @size: Size of the memory block.
> > + * @bo_flags: BO flags.
> > + * @vm_map_flags: VM_MAP flags.
> > + * @va: Virtual address of the MCU mapping.
> > + * Set to PANTHOR_GEM_ALLOC_VA for automatic VA-assignment. In that case, the
> > + * VA will be allocated in the shared VA space.
> > + *
> > + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
> > + */
> > +static struct panthor_fw_mem *
> > +panthor_fw_mem_alloc(struct panthor_device *ptdev, size_t size,
> > +		     u32 bo_flags, u32 vm_map_flags, u64 va)
> > +{
> > +	struct panthor_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> > +	int ret;
> > +
> > +	if (!mem)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	mem->bo = panthor_gem_create_and_map(ptdev, ptdev->fw->vm,
> > +					     size, bo_flags, vm_map_flags,
> > +					     &va, NULL);
> > +	if (IS_ERR(mem->bo)) {
> > +		ret = PTR_ERR(mem->bo);
> > +		mem->bo = NULL;
> > +		goto err_free_mem;
> > +	}
> > +
> > +	mem->va = va;
> > +	return mem;
> > +
> > +err_free_mem:
> > +	panthor_fw_mem_free(ptdev, mem);
> > +	return ERR_PTR(ret);  
> 
> The error handling seems more complex than needed, how about:
> 
> 	struct panthor_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
> 	struct panthor_gem_object *bo;
> 	int ret;
> 
> 	if (!mem)
> 		return ERR_PTR(-ENOMEM);
> 
> 	bo = panthor_gem_create_and_map(ptdev, ptdev->fw->vm,
> 					size, bo_flags, vm_map_flags,
> 					&va, NULL);
> 
> 	if (IS_ERR(bo)) {
> 		kfree(mem);
> 		return ERR_CAST(bo);
> 	}
> 
> 	mem->bo = bo;
> 	mem->va = va;
> 	return mem;
> 	
> Which I think also means we don't need the "if (mem->bo)" case in
> panthor_fw_mem_free().

Not so sure about that one. I've been adding code to existing functions,
and having a structured error path, with free functions that can deal
with partially initialized objects, makes those additions less
error-prone. I agree on the local bo variable to avoid the mem->bo
re-initialization, though.

> 
> > +}
> > +

[...]

> > +/**
> > + * panthor_fw_alloc_suspend_buf_mem() - Allocate a suspend buffer for a command stream group.
> > + * @ptdev: Device.
> > + * @size: Size of the suspend buffer.
> > + *
> > + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
> > + */
> > +struct panthor_fw_mem *
> > +panthor_fw_alloc_suspend_buf_mem(struct panthor_device *ptdev, size_t size)
> > +{
> > +	return panthor_fw_mem_alloc(ptdev, size,
> > +				    DRM_PANTHOR_BO_NO_MMAP,
> > +				    DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
> > +				    PANTHOR_GEM_ALLOC_VA);
> > +}
> > +
> > +static int panthor_fw_load_section_entry(struct panthor_device *ptdev,
> > +					 const struct firmware *fw,
> > +					 struct panthor_fw_binary_iter *iter,
> > +					 u32 ehdr)
> > +{
> > +	struct panthor_fw_binary_section_entry_hdr hdr;
> > +	struct panthor_fw_section *section;
> > +	u32 section_size;
> > +	u32 name_len;
> > +	int ret;
> > +
> > +	ret = panthor_fw_binary_iter_read(ptdev, iter, &hdr, sizeof(hdr));
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (hdr.data.end < hdr.data.start) {
> > +		drm_err(&ptdev->base, "Firmware corrupted, data.end < data.start (0x%x < 0x%x)\n",
> > +			hdr.data.end, hdr.data.start);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (hdr.va.end < hdr.va.start) {
> > +		drm_err(&ptdev->base, "Firmware corrupted, hdr.va.end < hdr.va.start (0x%x < 0x%x)\n",
> > +			hdr.va.end, hdr.va.start);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (hdr.data.end > fw->size) {
> > +		drm_err(&ptdev->base, "Firmware corrupted, file truncated? data_end=0x%x > fw size=0x%zx\n",
> > +			hdr.data.end, fw->size);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if ((hdr.va.start & ~PAGE_MASK) != 0 ||
> > +	    (hdr.va.end & ~PAGE_MASK) != 0) {
> > +		drm_err(&ptdev->base, "Firmware corrupted, virtual addresses not page aligned: 0x%x-0x%x\n",
> > +			hdr.va.start, hdr.va.end);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (hdr.flags & ~CSF_FW_BINARY_IFACE_ENTRY_RD_SUPPORTED_FLAGS) {
> > +		drm_err(&ptdev->base, "Firmware contains interface with unsupported flags (0x%x)\n",
> > +			hdr.flags);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_PROT) {
> > +		drm_warn(&ptdev->base,
> > +			 "Firmware protected mode entry not supported, ignoring");
> > +		return 0;
> > +	}
> > +
> > +	if (hdr.va.start == CSF_MCU_SHARED_REGION_START &&
> > +	    !(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED)) {
> > +		drm_err(&ptdev->base,
> > +			"Interface at 0x%llx must be shared", CSF_MCU_SHARED_REGION_START);
> > +		return -EINVAL;
> > +	}
> > +
> > +	name_len = iter->size - iter->offset;
> > +
> > +	section = drmm_kzalloc(&ptdev->base, sizeof(*section), GFP_KERNEL);
> > +	if (!section)
> > +		return -ENOMEM;
> > +
> > +	list_add_tail(&section->node, &ptdev->fw->sections);
> > +	section->flags = hdr.flags;
> > +	section->data.size = hdr.data.end - hdr.data.start;
> > +
> > +	if (section->data.size > 0) {
> > +		void *data = drmm_kmalloc(&ptdev->base, section->data.size, GFP_KERNEL);
> > +
> > +		if (!data)
> > +			return -ENOMEM;
> > +
> > +		memcpy(data, fw->data + hdr.data.start, section->data.size);
> > +		section->data.buf = data;
> > +	}
> > +
> > +	if (name_len > 0) {
> > +		char *name = drmm_kmalloc(&ptdev->base, name_len + 1, GFP_KERNEL);
> > +
> > +		if (!name)
> > +			return -ENOMEM;
> > +
> > +		memcpy(name, iter->data + iter->offset, name_len);
> > +		name[name_len] = '\0';
> > +		section->name = name;
> > +	}
> > +
> > +	section_size = hdr.va.end - hdr.va.start;
> > +	if (section_size) {
> > +		u32 cache_mode = hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK;
> > +		u32 vm_map_flags = 0;
> > +		struct sg_table *sgt;
> > +		u64 va = hdr.va.start;
> > +
> > +		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
> > +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_READONLY;
> > +
> > +		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_EX))
> > +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC;
> > +
> > +		/* TODO: CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_*_COHERENT are mapped to
> > +		 * non-cacheable for now. We might want to introduce a new
> > +		 * IOMMU_xxx flag (or abuse IOMMU_MMIO, which maps to device
> > +		 * memory and is currently not used by our driver) for
> > +		 * AS_MEMATTR_AARCH64_SHARED memory, so we can take benefit
> > +		 * of IO-coherent systems.
> > +		 */
> > +		if (cache_mode != CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED)
> > +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED;
> > +
> > +		/* Shared section is in the auto-VA range. We need to
> > +		 * reserve the VA range so it's not allocated to someone else.
> > +		 */
> > +		if (va >= CSF_MCU_SHARED_REGION_START &&
> > +		    va < CSF_MCU_SHARED_REGION_START + CSF_MCU_SHARED_REGION_SIZE)
> > +			va = PANTHOR_GEM_ALLOC_VA;
> > +
> > +		section->mem = panthor_fw_mem_alloc(ptdev, section_size,
> > +						    DRM_PANTHOR_BO_NO_MMAP,
> > +						    vm_map_flags, va);
> > +		if (IS_ERR(section->mem))
> > +			return PTR_ERR(section->mem);
> > +
> > +		if (drm_WARN_ON(&ptdev->base, section->mem->va != hdr.va.start))
> > +			return -EINVAL;
> > +
> > +		panthor_fw_init_section_mem(ptdev, section);
> > +
> > +		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
> > +		if (IS_ERR(sgt))
> > +			return PTR_ERR(section->mem);
> > +
> > +		dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
> > +
> > +		if (section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED) {
> > +			if (!panthor_fw_mem_vmap(section->mem))  
> 
> Moving this before panthor_fw_init_section_mem() would avoid an
> unnecessary unmap/remap - althought this isn't exactly a performance path...

Sure, I can do that.

> 
> > +				return -ENOMEM;
> > +		}
> > +	}
> > +
> > +	if (hdr.va.start == CSF_MCU_SHARED_REGION_START)
> > +		ptdev->fw->shared_section = section;
> > +
> > +	return 0;
> > +}
> > +
> > +static void
> > +panthor_reload_fw_sections(struct panthor_device *ptdev, bool full_reload)
> > +{
> > +	struct panthor_fw_section *section;
> > +
> > +	list_for_each_entry(section, &ptdev->fw->sections, node) {
> > +		struct sg_table *sgt;
> > +
> > +		if (!full_reload && !(section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
> > +			continue;
> > +
> > +		panthor_fw_init_section_mem(ptdev, section);
> > +		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
> > +		if (!drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(sgt)))
> > +			dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
> > +	}
> > +}
> > +
> > +static int panthor_fw_load_entry(struct panthor_device *ptdev,
> > +				 const struct firmware *fw,
> > +				 struct panthor_fw_binary_iter *iter)
> > +{
> > +	struct panthor_fw_binary_iter eiter;
> > +	u32 ehdr;
> > +	int ret;
> > +
> > +	ret = panthor_fw_binary_iter_read(ptdev, iter, &ehdr, sizeof(ehdr));
> > +	if (ret)
> > +		return ret;
> > +
> > +	if ((iter->offset % sizeof(u32)) ||
> > +	    (CSF_FW_BINARY_ENTRY_SIZE(ehdr) % sizeof(u32))) {
> > +		drm_err(&ptdev->base, "Firmware entry isn't 32 bit aligned, offset=0x%x size=0x%x\n",
> > +			(u32)(iter->offset - sizeof(u32)), CSF_FW_BINARY_ENTRY_SIZE(ehdr));
> > +		return -EINVAL;
> > +	}
> > +
> > +	eiter.offset = 0;
> > +	eiter.data = iter->data + iter->offset;
> > +	eiter.size = CSF_FW_BINARY_ENTRY_SIZE(ehdr) - sizeof(ehdr);
> > +	iter->offset += eiter.size;  
> 
> There should really be a check like:
> 
> 	if (iter->offset < eiter.size)
> 		return -EINVAL;

Uh, I thought I had added size checks everywhere, but I apparently
missed some places.

> 
> otherwise I think it's possible for a corrupt firmware to cause us to
> run off the end of the buffer. Ideally the check would look something
> more like the one in panthor_fw_binary_iter_read() (dealing with
> potential overflow). I'm wondering if it makes sense to allow
> panthor_fw_binary_iter_read() with a NULL 'out' and check the return
> value. That way we can replace "iter->offset += eiter.size" with:
> 
> 	ret = panthor_fw_binary_iter_read(ptdev, iter, NULL,
> 					  eiter.size);
> 	if (ret)
> 		return ret;
> 
> (or have a new _skip() function)

It might make sense to add a panthor_fw_binary_sub_iter_init() helper
that would take care of doing the size check on the main iter, unless
you see other places requiring a size check that are not expressed as
sub-iterators.
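
Something like this, roughly (untested, exact name/signature to be
settled when I rework the parser):

	static int
	panthor_fw_binary_sub_iter_init(struct panthor_device *ptdev,
					struct panthor_fw_binary_iter *iter,
					struct panthor_fw_binary_iter *sub_iter,
					size_t size)
	{
		/* Reject sub-iterators that would run past the end of the
		 * parent iterator.
		 */
		if (iter->offset > iter->size ||
		    size > iter->size - iter->offset) {
			drm_err(&ptdev->base, "Firmware entry too long\n");
			return -EINVAL;
		}

		sub_iter->offset = 0;
		sub_iter->data = iter->data + iter->offset;
		sub_iter->size = size;
		iter->offset += size;
		return 0;
	}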

> 
> > +
> > +	switch (CSF_FW_BINARY_ENTRY_TYPE(ehdr)) {
> > +	case CSF_FW_BINARY_ENTRY_TYPE_IFACE:
> > +		return panthor_fw_load_section_entry(ptdev, fw, &eiter, ehdr);
> > +
> > +	/* FIXME: handle those entry types? */
> > +	case CSF_FW_BINARY_ENTRY_TYPE_CONFIG:
> > +	case CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST:
> > +	case CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER:
> > +	case CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA:
> > +		return 0;
> > +	default:
> > +		break;
> > +	}
> > +
> > +	if (ehdr & CSF_FW_BINARY_ENTRY_OPTIONAL)
> > +		return 0;
> > +
> > +	drm_err(&ptdev->base,
> > +		"Unsupported non-optional entry type %u in firmware\n",
> > +		CSF_FW_BINARY_ENTRY_TYPE(ehdr));
> > +	return -EINVAL;
> > +}

[...]

> > +static int panthor_init_cs_iface(struct panthor_device *ptdev,
> > +				 unsigned int csg_idx, unsigned int cs_idx)
> > +{
> > +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, csg_idx);
> > +	struct panthor_fw_cs_iface *cs_iface = &ptdev->fw->iface.streams[csg_idx][cs_idx];
> > +	u64 shared_section_sz = ptdev->fw->shared_section->mem->bo->base.base.size;
> > +	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET +
> > +			   (csg_idx * glb_iface->control->group_stride) +
> > +			   CSF_STREAM_CONTROL_OFFSET +
> > +			   (cs_idx * csg_iface->control->stream_stride);
> > +
> > +	if (iface_offset + sizeof(*cs_iface) >= shared_section_sz)
> > +		return -EINVAL;
> > +
> > +	spin_lock_init(&cs_iface->lock);
> > +	cs_iface->control = ptdev->fw->shared_section->mem->kmap + iface_offset;
> > +	cs_iface->input = iface_fw_to_cpu_addr(ptdev, cs_iface->control->input_va);
> > +	cs_iface->output = iface_fw_to_cpu_addr(ptdev, cs_iface->control->output_va);
> > +
> > +	if (!cs_iface->input || !cs_iface->output) {
> > +		drm_err(&ptdev->base, "Invalid stream control interface input/output VA");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (csg_idx > 0 || cs_idx > 0) {
> > +		struct panthor_fw_cs_iface *first_cs_iface =
> > +			panthor_fw_get_cs_iface(ptdev, 0, 0);
> > +
> > +		if (cs_iface->control->features != first_cs_iface->control->features) {
> > +			drm_err(&ptdev->base, "Expecting identical CS slots");
> > +			return -EINVAL;
> > +		}
> > +	} else {
> > +		u32 reg_count = CS_FEATURES_WORK_REGS(cs_iface->control->features);
> > +
> > +		ptdev->csif_info.cs_reg_count = reg_count;
> > +		ptdev->csif_info.unpreserved_cs_reg_count = CSF_UNPRESERVED_REG_COUNT;
> > +	}  
> 
> Minor NIT: Both of these could be made unconditional. I feel the neatest
> thing could be to move the 'else' part to panthor_fw_init_ifaces()
> rather than including it as a special case here.
> 
> The conditional could be left as is, removed, or maybe the below is clearer?
> 
> 	struct panthor_fw_cs_iface *first_cs_iface =
> 			panthor_fw_get_cs_iface(ptdev, 0, 0);
> 
> 	if (cs_iface != first_cs_iface) {
> 		if (cs_iface->control->features !=
> 		    first_cs_iface->control->features) {
> 
> I've no strong views, it's just this bit of code looks very clunky to me.

No objection.

> 
> > +
> > +	return 0;
> > +}
> > +
> > +static int panthor_init_csg_iface(struct panthor_device *ptdev,
> > +				  unsigned int csg_idx)
> > +{
> > +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +	struct panthor_fw_csg_iface *csg_iface = &ptdev->fw->iface.groups[csg_idx];
> > +	u64 shared_section_sz = ptdev->fw->shared_section->mem->bo->base.base.size;
> > +	u32 iface_offset = CSF_GROUP_CONTROL_OFFSET + (csg_idx * glb_iface->control->group_stride);
> > +	unsigned int i;
> > +
> > +	if (iface_offset + sizeof(*csg_iface) >= shared_section_sz)
> > +		return -EINVAL;
> > +
> > +	spin_lock_init(&csg_iface->lock);
> > +	csg_iface->control = ptdev->fw->shared_section->mem->kmap + iface_offset;
> > +	csg_iface->input = iface_fw_to_cpu_addr(ptdev, csg_iface->control->input_va);
> > +	csg_iface->output = iface_fw_to_cpu_addr(ptdev, csg_iface->control->output_va);
> > +
> > +	if (csg_iface->control->stream_num < MIN_CS_PER_CSG ||
> > +	    csg_iface->control->stream_num > MAX_CS_PER_CSG)
> > +		return -EINVAL;
> > +
> > +	if (!csg_iface->input || !csg_iface->output) {
> > +		drm_err(&ptdev->base, "Invalid group control interface input/output VA");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (csg_idx > 0) {
> > +		struct panthor_fw_csg_iface *first_csg_iface =
> > +			panthor_fw_get_csg_iface(ptdev, 0);
> > +		u32 first_protm_suspend_size = first_csg_iface->control->protm_suspend_size;
> > +
> > +		if (first_csg_iface->control->features != csg_iface->control->features ||
> > +		    first_csg_iface->control->suspend_size != csg_iface->control->suspend_size ||
> > +		    first_protm_suspend_size != csg_iface->control->protm_suspend_size ||
> > +		    first_csg_iface->control->stream_num != csg_iface->control->stream_num) {
> > +			drm_err(&ptdev->base, "Expecting identical CSG slots");
> > +			return -EINVAL;
> > +		}  
> 
> As above, I also wonder whether factoring out a "compare_csg()" function
> could make this mode readable - it could take the "->control" members to
> keep the line length in check. The special case for
> "first_protm_suspend_size" is somewhat ugly.

Sure, I can do that.
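
Probably something along the lines of (guessing at the control struct
name here):

	static bool panthor_fw_csg_ifaces_match(const struct panthor_fw_csg_control_iface *a,
						const struct panthor_fw_csg_control_iface *b)
	{
		return a->features == b->features &&
		       a->suspend_size == b->suspend_size &&
		       a->protm_suspend_size == b->protm_suspend_size &&
		       a->stream_num == b->stream_num;
	}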

> 
> > +	}
> > +
> > +	for (i = 0; i < csg_iface->control->stream_num; i++) {
> > +		int ret = panthor_init_cs_iface(ptdev, csg_idx, i);
> > +
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static u32 panthor_get_instr_features(struct panthor_device *ptdev)
> > +{
> > +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +
> > +	if (glb_iface->control->version < CSF_IFACE_VERSION(1, 1, 0))
> > +		return 0;
> > +
> > +	return glb_iface->control->instr_features;
> > +}
> > +
> > +static int panthor_fw_init_ifaces(struct panthor_device *ptdev)
> > +{
> > +	struct panthor_fw_global_iface *glb_iface = &ptdev->fw->iface.global;
> > +	unsigned int i;
> > +
> > +	if (!ptdev->fw->shared_section->mem->kmap)
> > +		return -EINVAL;
> > +
> > +	spin_lock_init(&glb_iface->lock);
> > +	glb_iface->control = ptdev->fw->shared_section->mem->kmap;
> > +
> > +	if (!glb_iface->control->version) {
> > +		drm_err(&ptdev->base, "Invalid CSF interface version %d.%d.%d (%x)",
> > +			CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
> > +			CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
> > +			CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
> > +			glb_iface->control->version);  
> 
> This looks wrong - we print this message only with version == 0, so the
> version number isn't very interesting ;)
> 
> I see kbase has this message: "Version check failed. Firmware may have
> failed to boot." Which seems much more informative.

Makes sense.

> 
> > +		return -EINVAL;
> > +	}
> > +
> > +	glb_iface->input = iface_fw_to_cpu_addr(ptdev, glb_iface->control->input_va);
> > +	glb_iface->output = iface_fw_to_cpu_addr(ptdev, glb_iface->control->output_va);
> > +	if (!glb_iface->input || !glb_iface->output) {
> > +		drm_err(&ptdev->base, "Invalid global control interface input/output VA");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (glb_iface->control->group_num > MAX_CSGS ||
> > +	    glb_iface->control->group_num < MIN_CSGS) {
> > +		drm_err(&ptdev->base, "Invalid number of control groups");
> > +		return -EINVAL;
> > +	}
> > +
> > +	for (i = 0; i < glb_iface->control->group_num; i++) {
> > +		int ret = panthor_init_csg_iface(ptdev, i);
> > +
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	drm_info(&ptdev->base, "CSF FW v%d.%d.%d, Features %x Instrumentation features %x",  
> 
> NIT: Prefix %x with 0x (or use %#x).

Will do.

> 
> > +		 CSF_IFACE_VERSION_MAJOR(glb_iface->control->version),
> > +		 CSF_IFACE_VERSION_MINOR(glb_iface->control->version),
> > +		 CSF_IFACE_VERSION_PATCH(glb_iface->control->version),
> > +		 glb_iface->control->features,
> > +		 panthor_get_instr_features(ptdev));
> > +	return 0;
> > +}
> > +
> > +static void panthor_fw_init_global_iface(struct panthor_device *ptdev)
> > +{
> > +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +
> > +	/* Enable all cores. */
> > +	glb_iface->input->core_en_mask = ptdev->gpu_info.shader_present;
> > +
> > +	/* Setup timers. */
> > +	glb_iface->input->poweroff_timer = panthor_fw_conv_timeout(ptdev, PWROFF_HYSTERESIS_US);
> > +	glb_iface->input->progress_timer = PROGRESS_TIMEOUT_CYCLES >> PROGRESS_TIMEOUT_SCALE_SHIFT;
> > +	glb_iface->input->idle_timer = panthor_fw_conv_timeout(ptdev, IDLE_HYSTERESIS_US);
> > +
> > +	/* Enable interrupts we care about. */
> > +	glb_iface->input->ack_irq_mask = GLB_CFG_ALLOC_EN |
> > +					 GLB_PING |
> > +					 GLB_CFG_PROGRESS_TIMER |
> > +					 GLB_CFG_POWEROFF_TIMER |
> > +					 GLB_IDLE_EN |
> > +					 GLB_IDLE;
> > +
> > +	panthor_fw_update_reqs(glb_iface, req, GLB_IDLE_EN, GLB_IDLE_EN);
> > +	panthor_fw_toggle_reqs(glb_iface, req, ack,
> > +			       GLB_CFG_ALLOC_EN |
> > +			       GLB_CFG_POWEROFF_TIMER |
> > +			       GLB_CFG_PROGRESS_TIMER);
> > +
> > +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> > +
> > +	/* Kick the watchdog. */
> > +	mod_delayed_work(ptdev->reset.wq, &ptdev->fw->watchdog.ping_work,
> > +			 msecs_to_jiffies(PING_INTERVAL_MS));
> > +}
> > +
> > +static void panthor_fw_process_global_irq(struct panthor_device *ptdev)
> > +{
> > +	/* If the FW is not booted, don't process IRQs, just flag the FW as booted. */
> > +	if (!ptdev->fw->booted)
> > +		ptdev->fw->booted = true;
> > +	else
> > +		panthor_sched_process_global_irq(ptdev);
> > +
> > +	wake_up_all(&ptdev->fw->waitqueues[31]);
> > +}
> > +
> > +static void panthor_fw_process_csg_irq(struct panthor_device *ptdev, u32 csg_slot)
> > +{
> > +	panthor_sched_process_csg_irq(ptdev, csg_slot);
> > +	wake_up_all(&ptdev->fw->waitqueues[csg_slot]);
> > +}
> > +
> > +static void panthor_job_irq_handler(struct panthor_device *ptdev, u32 status)
> > +{
> > +	if (status & JOB_INT_GLOBAL_IF) {
> > +		panthor_fw_process_global_irq(ptdev);
> > +		status &= ~JOB_INT_GLOBAL_IF;
> > +	}
> > +
> > +	while (status) {
> > +		u32 csg_id = ffs(status) - 1;
> > +
> > +		panthor_fw_process_csg_irq(ptdev, csg_id);
> > +		status &= ~BIT(csg_id);  
> 
> NIT: s/BIT/JOB_INT_CSG_IF/ (since it exists...)

Will use JOB_INT_CSG_IF here.

> 
> > +	}
> > +}
> > +PANTHOR_IRQ_HANDLER(job, JOB, panthor_job_irq_handler);
> > +
> > +static int panthor_fw_start(struct panthor_device *ptdev)
> > +{
> > +	bool timedout = false;
> > +
> > +	ptdev->fw->booted = false;
> > +	panthor_job_irq_resume(&ptdev->fw->irq, ~0);
> > +	gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_AUTO);
> > +
> > +	if (!wait_event_timeout(ptdev->fw->waitqueues[31],
> > +				ptdev->fw->booted,
> > +				msecs_to_jiffies(1000))) {
> > +		if (!ptdev->fw->booted &&
> > +		    !(gpu_read(ptdev, JOB_INT_STAT) & JOB_INT_GLOBAL_IF))
> > +			timedout = true;
> > +	}
> > +
> > +	if (timedout) {
> > +		drm_err(&ptdev->base, "Failed to boot MCU");
> > +		return -ETIMEDOUT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void panthor_fw_stop(struct panthor_device *ptdev)
> > +{
> > +	u32 status;
> > +
> > +	gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_DISABLE);
> > +	if (readl_poll_timeout(ptdev->iomem + MCU_CONTROL, status,
> > +			       status == MCU_CONTROL_DISABLE, 10, 100000))  
> 
> I suspect this should be checking MCU_STATUS not MCU_CONTROL

Yes, it should.
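
Will fix, i.e. (assuming an MCU_STATUS_DISABLED definition next to the
other MCU_STATUS_* values in panthor_regs.h):

	gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_DISABLE);
	if (readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
			       status == MCU_STATUS_DISABLED, 10, 100000))
		drm_err(&ptdev->base, "Failed to stop MCU");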

> 
> > +		drm_err(&ptdev->base, "Failed to stop MCU");
> > +}
> > +
> > +/**
> > + * panthor_fw_pre_reset() - Call before a reset.
> > + * @ptdev: Device.
> > + * @on_hang: true if the reset was triggered on a GPU hang.
> > + *
> > + * If the reset is not triggered on a hang, we try to gracefully halt the
> > + * MCU, so we can do a fast-reset when panthor_fw_post_reset() is called.
> > + */
> > +void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
> > +{
> > +	/* Make sure we won't be woken up by a ping. */
> > +	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
> > +
> > +	ptdev->fw->fast_reset = false;
> > +
> > +	if (!on_hang) {
> > +		struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +		u32 status;
> > +
> > +		panthor_fw_update_reqs(glb_iface, req, GLB_HALT, GLB_HALT);
> > +		gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> > +		if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
> > +					status == MCU_STATUS_HALT, 10, 100000) &&
> > +		    glb_iface->output->halt_status == PANTHOR_FW_HALT_OK) {
> > +			ptdev->fw->fast_reset = true;
> > +		} else {
> > +			drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
> > +		}
> > +
> > +		/* The FW detects 0 -> 1 transitions. Make sure we reset
> > +		 * the HALT bit before the FW is rebooted.
> > +		 */
> > +		panthor_fw_update_reqs(glb_iface, req, 0, GLB_HALT);
> > +	}
> > +
> > +	panthor_job_irq_suspend(&ptdev->fw->irq);
> > +}
> > +
> > +/**
> > + * panthor_fw_post_reset() - Call after a reset.
> > + * @ptdev: Device.
> > + *
> > + * Start the FW. If this is not a fast reset, all FW sections are reloaded to
> > + * make sure we can recover from a memory corruption.
> > + */
> > +int panthor_fw_post_reset(struct panthor_device *ptdev)
> > +{
> > +	int ret;
> > +
> > +	/* Make the MCU VM active. */
> > +	ret = panthor_vm_active(ptdev->fw->vm);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* Reload all sections, including RO ones. We're not supposed
> > +	 * to end up here anyway, let's just assume the overhead of
> > +	 * reloading everything is acceptable.
> > +	 */
> > +	if (!ptdev->fw->fast_reset)
> > +		panthor_reload_fw_sections(ptdev, true);
> > +
> > +	ret = panthor_fw_start(ptdev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* We must re-initialize the global interface even on fast-reset. */
> > +	panthor_fw_init_global_iface(ptdev);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_fw_unplug() - Called when the device is unplugged.
> > + * @ptdev: Device.
> > + *
> > + * This function must make sure all pending operations are flushed before
> > + * will release device resources, thus preventing any interaction with
> > + * the HW.
> > + *
> > + * If there are still FW-relates works running after this function returns,  
> 
> s/relates/related/ or maybe even "If there is still FW-related work"

Will fix.

> 
> > + * they must use drm_dev_{enter,exit}() and skip any HW access when
> > + * drm_dev_enter() returns false.
> > + */
> > +void panthor_fw_unplug(struct panthor_device *ptdev)
> > +{
> > +	struct panthor_fw_section *section;
> > +
> > +	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
> > +
> > +	/* Make sure the IRQ handler can be called after that point. */
> > +	if (ptdev->fw->irq.irq)
> > +		panthor_job_irq_suspend(&ptdev->fw->irq);
> > +
> > +	panthor_fw_stop(ptdev);
> > +
> > +	if (ptdev->fw->vm)
> > +		panthor_vm_idle(ptdev->fw->vm);
> > +
> > +	list_for_each_entry(section, &ptdev->fw->sections, node) {
> > +		panthor_fw_mem_free(ptdev, section->mem);
> > +	}
> > +
> > +	panthor_vm_put(ptdev->fw->vm);
> > +
> > +	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
> > +}
> > +
> > +/**
> > + * panthor_fw_wait_acks() - Wait for requests to be acknowledged by the FW.
> > + * @req_ptr: Pointer to the req register.
> > + * @ack_ptr: Pointer to the ack register.
> > + * @wq: Wait queue to use for the sleeping wait.
> > + * @req_mask: Mask of requests to wait for.
> > + * @acked: Pointer to field that's updated with the acked requests.
> > + * If the function returns 0, *acked == req_mask.
> > + * @timeout_ms: Timeout expressed in milliseconds.
> > + *
> > + * Return: 0 on success, -ETIMEDOUT otherwise.
> > + */
> > +static int panthor_fw_wait_acks(const u32 *req_ptr, const u32 *ack_ptr,
> > +				wait_queue_head_t *wq,
> > +				u32 req_mask, u32 *acked,
> > +				u32 timeout_ms)
> > +{
> > +	u32 ack, req = READ_ONCE(*req_ptr) & req_mask;
> > +	int ret;
> > +
> > +	/* Busy wait for a few µsecs before falling back to a sleeping wait. */
> > +	*acked = req_mask;
> > +	ret = read_poll_timeout_atomic(READ_ONCE, ack,
> > +				       (ack & req_mask) == req,
> > +				       0, 10, 0,
> > +				       *ack_ptr);
> > +	if (!ret)
> > +		return 0;
> > +
> > +	if (wait_event_timeout(*wq, (READ_ONCE(*ack_ptr) & req_mask) == req,
> > +			       msecs_to_jiffies(timeout_ms)))
> > +		return 0;
> > +
> > +	/* Check one last time, in case we were not woken up for some reason. */
> > +	ack = READ_ONCE(*ack_ptr);
> > +	if ((ack & req_mask) == req)
> > +		return 0;
> > +
> > +	*acked = ~(req ^ ack) & req_mask;
> > +	return -ETIMEDOUT;
> > +}
> > +
> > +/**
> > + * panthor_fw_glb_wait_acks() - Wait for global requests to be acknowledged.
> > + * @ptdev: Device.
> > + * @req_mask: Mask of requests to wait for.
> > + * @acked: Pointer to field that's updated with the acked requests.
> > + * If the function returns 0, *acked == req_mask.
> > + * @timeout_ms: Timeout expressed in milliseconds.
> > + *
> > + * Return: 0 on success, -ETIMEDOUT otherwise.
> > + */
> > +int panthor_fw_glb_wait_acks(struct panthor_device *ptdev,
> > +			     u32 req_mask, u32 *acked,
> > +			     u32 timeout_ms)
> > +{
> > +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +
> > +	/* GLB_HALT doesn't get acked through the FW interface. */
> > +	if (drm_WARN_ON(&ptdev->base, req_mask & (~GLB_REQ_MASK | GLB_HALT)))
> > +		return -EINVAL;
> > +
> > +	return panthor_fw_wait_acks(&glb_iface->input->req,
> > +				    &glb_iface->output->ack,
> > +				    &ptdev->fw->waitqueues[31],
> > +				    req_mask, acked, timeout_ms);
> > +}
> > +
> > +/**
> > + * panthor_fw_glb_wait_acks() - Wait for command stream group requests to be acknowledged.
> > + * @ptdev: Device.
> > + * @req_mask: Mask of requests to wait for.
> > + * @acked: Pointer to field that's updated with the acked requests.
> > + * If the function returns 0, *acked == req_mask.
> > + * @timeout_ms: Timeout expressed in milliseconds.
> > + *
> > + * Return: 0 on success, -ETIMEDOUT otherwise.
> > + */
> > +int panthor_fw_csg_wait_acks(struct panthor_device *ptdev, u32 csg_slot,
> > +			     u32 req_mask, u32 *acked, u32 timeout_ms)
> > +{
> > +	struct panthor_fw_csg_iface *csg_iface = panthor_fw_get_csg_iface(ptdev, csg_slot);
> > +	int ret;
> > +
> > +	if (drm_WARN_ON(&ptdev->base, req_mask & ~CSG_REQ_MASK))
> > +		return -EINVAL;
> > +
> > +	ret = panthor_fw_wait_acks(&csg_iface->input->req,
> > +				   &csg_iface->output->ack,
> > +				   &ptdev->fw->waitqueues[csg_slot],
> > +				   req_mask, acked, timeout_ms);
> > +
> > +	if (ret && (*acked & CSG_STATE_MASK) != CSG_STATE_MASK)
> > +		*acked &= ~CSG_STATE_MASK;  
> 
> I think this could do with a comment, it took me a while to work out
> what this was about. If I understand correctly this is attempting to
> check that all the bits in the STATE field were updated, and if any
> mismatch then clearing all those bits in the 'acked' mask. This enables
> code to do a "acked & CSG_STATE_MASK" check and get the right value
> (rather than having to do "(acked & CSG_STATE_MASK) == CSG_STATE_MASK").

Right. I'll add a comment.

> 
> AFAICT the "ret &&" part is also redundant.

Indeed, I'll drop it.
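
So the tail of panthor_fw_csg_wait_acks() would look something like
(untested):

	ret = panthor_fw_wait_acks(&csg_iface->input->req,
				   &csg_iface->output->ack,
				   &ptdev->fw->waitqueues[csg_slot],
				   req_mask, acked, timeout_ms);

	/* The FW only acks the STATE transition as a whole: if not all
	 * STATE bits were updated, clear them all, so callers can test
	 * 'acked & CSG_STATE_MASK' directly and get a meaningful value.
	 */
	if ((*acked & CSG_STATE_MASK) != CSG_STATE_MASK)
		*acked &= ~CSG_STATE_MASK;

	return ret;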

> 
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * panthor_fw_ring_csg_doorbells() - Ring command stream group doorbells.
> > + * @ptdev: Device.
> > + * @csg_mask: Bitmask encoding the command stream group doorbells to ring.
> > + *
> > + * This function is toggling bits in the doorbell_req and ringing the
> > + * global doorbell. It doesn't require a user doorbell to be attached to
> > + * the group.
> > + */
> > +void panthor_fw_ring_csg_doorbells(struct panthor_device *ptdev, u32 csg_mask)
> > +{
> > +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +
> > +	panthor_fw_toggle_reqs(glb_iface, doorbell_req, doorbell_ack, csg_mask);
> > +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> > +}
> > +
> > +static void panthor_fw_ping_work(struct work_struct *work)
> > +{
> > +	struct panthor_fw *fw = container_of(work, struct panthor_fw, watchdog.ping_work.work);
> > +	struct panthor_device *ptdev = fw->irq.ptdev;
> > +	struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> > +	u32 acked;
> > +	int ret;
> > +
> > +	if (panthor_device_reset_is_pending(ptdev))
> > +		return;
> > +
> > +	panthor_fw_toggle_reqs(glb_iface, req, ack, GLB_PING);
> > +	gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
> > +
> > +	ret = panthor_fw_glb_wait_acks(ptdev, GLB_PING, &acked, 100);
> > +	if (ret) {
> > +		panthor_device_schedule_reset(ptdev);
> > +		drm_err(&ptdev->base, "FW ping timeout, scheduling a reset");
> > +	} else {
> > +		mod_delayed_work(ptdev->reset.wq, &fw->watchdog.ping_work,
> > +				 msecs_to_jiffies(PING_INTERVAL_MS));
> > +	}
> > +}
> > +
> > +/**
> > + * panthor_fw_init() - Initialize FW related data.
> > + * @ptdev: Device.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_fw_init(struct panthor_device *ptdev)
> > +{
> > +	struct panthor_fw *fw;
> > +	int ret, irq;
> > +
> > +	fw = drmm_kzalloc(&ptdev->base, sizeof(*fw), GFP_KERNEL);
> > +	if (!fw)
> > +		return -ENOMEM;
> > +
> > +	ptdev->fw = fw;
> > +	for (u32 i = 0; i < ARRAY_SIZE(fw->waitqueues); i++)
> > +		init_waitqueue_head(&fw->waitqueues[i]);
> > +
> > +	INIT_LIST_HEAD(&fw->sections);
> > +	INIT_DELAYED_WORK(&fw->watchdog.ping_work, panthor_fw_ping_work);
> > +
> > +	irq = platform_get_irq_byname(to_platform_device(ptdev->base.dev), "job");
> > +	if (irq <= 0)
> > +		return -ENODEV;
> > +
> > +	ret = panthor_request_job_irq(ptdev, &fw->irq, irq, 0);
> > +	if (ret) {
> > +		drm_err(&ptdev->base, "failed to request job irq");
> > +		return ret;
> > +	}
> > +
> > +	ret = panthor_gpu_l2_power_on(ptdev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	fw->vm = panthor_vm_create(ptdev, true,
> > +				   CSF_MCU_SHARED_REGION_START,
> > +				   CSF_MCU_SHARED_REGION_SIZE);
> > +	if (IS_ERR(fw->vm)) {
> > +		ret = PTR_ERR(fw->vm);
> > +		fw->vm = NULL;
> > +		goto err_unplug_fw;
> > +	}
> > +
> > +	ret = panthor_fw_load(ptdev);
> > +	if (ret)
> > +		goto err_unplug_fw;
> > +
> > +	ret = panthor_vm_active(fw->vm);
> > +	if (ret)
> > +		goto err_unplug_fw;
> > +
> > +	ret = panthor_fw_start(ptdev);
> > +	if (ret)
> > +		goto err_unplug_fw;
> > +
> > +	ret = panthor_fw_init_ifaces(ptdev);
> > +	if (ret)
> > +		goto err_unplug_fw;
> > +
> > +	panthor_fw_init_global_iface(ptdev);
> > +	return 0;
> > +
> > +err_unplug_fw:
> > +	panthor_fw_unplug(ptdev);
> > +	return ret;
> > +}
> > diff --git a/drivers/gpu/drm/panthor/panthor_fw.h b/drivers/gpu/drm/panthor/panthor_fw.h
> > new file mode 100644
> > index 000000000000..929760c2a46b
> > --- /dev/null
> > +++ b/drivers/gpu/drm/panthor/panthor_fw.h
> > @@ -0,0 +1,505 @@
> > +/* SPDX-License-Identifier: GPL-2.0 or MIT */
> > +/* Copyright 2023 Collabora ltd. */
> > +
> > +#ifndef __PANTHOR_MCU_H__
> > +#define __PANTHOR_MCU_H__
> > +
> > +#include <linux/types.h>
> > +
> > +#include "panthor_device.h"
> > +
> > +struct panthor_fw_mem;
> > +
> > +#define MAX_CSGS				31
> > +#define MAX_CS_PER_CSG                          32
> > +
> > +struct panthor_fw_ringbuf_input_iface {
> > +	u64 insert;
> > +	u64 extract;
> > +} __packed;
> > +
> > +struct panthor_fw_ringbuf_output_iface {
> > +	u64 extract;
> > +	u32 active;
> > +} __packed;  
> 
> Is there a good reason for these to be marked '__packed'? They are
> naturally aligned so there's no padding, and we guarantee they are page
> aligned. The compiler might have more freedom if they are not marked
> __packed.

Nope, no good reason.

> 
> > +
> > +struct panthor_fw_cs_control_iface {
> > +#define CS_FEATURES_WORK_REGS(x)		(((x) & GENMASK(7, 0)) + 1)
> > +#define CS_FEATURES_SCOREBOARDS(x)		(((x) & GENMASK(15, 8)) >> 8)
> > +#define CS_FEATURES_COMPUTE			BIT(16)
> > +#define CS_FEATURES_FRAGMENT			BIT(17)
> > +#define CS_FEATURES_TILER			BIT(18)
> > +	u32 features;
> > +	u32 input_va;
> > +	u32 output_va;
> > +} __packed;  
> 
> Here I have to admit I can't find a statement in the spec saying that
> the stride must be a multiple of 4 bytes... but kbase makes that assumption.

The stride of?

> 
> > +
> > +struct panthor_fw_cs_input_iface {
> > +#define CS_STATE_MASK				GENMASK(2, 0)
> > +#define CS_STATE_STOP				0
> > +#define CS_STATE_START				1
> > +#define CS_EXTRACT_EVENT			BIT(4)
> > +#define CS_IDLE_SYNC_WAIT			BIT(8)
> > +#define CS_IDLE_PROTM_PENDING			BIT(9)
> > +#define CS_IDLE_EMPTY				BIT(10)
> > +#define CS_IDLE_RESOURCE_REQ			BIT(11)
> > +#define CS_TILER_OOM				BIT(26)
> > +#define CS_PROTM_PENDING			BIT(27)
> > +#define CS_FATAL				BIT(30)
> > +#define CS_FAULT				BIT(31)
> > +#define CS_REQ_MASK				(CS_STATE_MASK | \
> > +						 CS_EXTRACT_EVENT | \
> > +						 CS_IDLE_SYNC_WAIT | \
> > +						 CS_IDLE_PROTM_PENDING | \
> > +						 CS_IDLE_EMPTY | \
> > +						 CS_IDLE_RESOURCE_REQ)
> > +#define CS_EVT_MASK				(CS_TILER_OOM | \
> > +						 CS_PROTM_PENDING | \
> > +						 CS_FATAL | \
> > +						 CS_FAULT)
> > +	u32 req;
> > +
> > +#define CS_CONFIG_PRIORITY(x)			((x) & GENMASK(3, 0))
> > +#define CS_CONFIG_DOORBELL(x)			(((x) << 8) & GENMASK(15, 8))
> > +	u32 config;
> > +	u32 reserved1;
> > +	u32 ack_irq_mask;
> > +	u64 ringbuf_base;
> > +	u32 ringbuf_size;
> > +	u32 reserved2;
> > +	u64 heap_start;
> > +	u64 heap_end;
> > +	u64 ringbuf_input;
> > +	u64 ringbuf_output;
> > +	u32 instr_config;
> > +	u32 instrbuf_size;
> > +	u64 instrbuf_base;
> > +	u64 instrbuf_offset_ptr;
> > +} __packed;  
> 
> The spec says this has a minimal alignment of 64 bytes. Although I guess
> the code should check this if we remove __packed and rely on it.

The allocation granularity is 4k, and we're not even in control of the
offset inside the FW interface section. So yes, we can check it when
parsing the FW sections, but there's no point adding __aligned() here.

> 
> > +
> > +struct panthor_fw_cs_output_iface {
> > +	u32 ack;
> > +	u32 reserved1[15];
> > +	u64 status_cmd_ptr;
> > +
> > +#define CS_STATUS_WAIT_SB_MASK			GENMASK(15, 0)
> > +#define CS_STATUS_WAIT_SB_SRC_MASK		GENMASK(19, 16)
> > +#define CS_STATUS_WAIT_SB_SRC_NONE		(0 << 16)
> > +#define CS_STATUS_WAIT_SB_SRC_WAIT		(8 << 16)
> > +#define CS_STATUS_WAIT_SYNC_COND_LE		(0 << 24)
> > +#define CS_STATUS_WAIT_SYNC_COND_GT		(1 << 24)
> > +#define CS_STATUS_WAIT_SYNC_COND_MASK		GENMASK(27, 24)
> > +#define CS_STATUS_WAIT_PROGRESS			BIT(28)
> > +#define CS_STATUS_WAIT_PROTM			BIT(29)
> > +#define CS_STATUS_WAIT_SYNC_64B			BIT(30)
> > +#define CS_STATUS_WAIT_SYNC			BIT(31)
> > +	u32 status_wait;
> > +	u32 status_req_resource;
> > +	u64 status_wait_sync_ptr;
> > +	u32 status_wait_sync_value;
> > +	u32 status_scoreboards;
> > +
> > +#define CS_STATUS_BLOCKED_REASON_UNBLOCKED	0
> > +#define CS_STATUS_BLOCKED_REASON_SB_WAIT	1
> > +#define CS_STATUS_BLOCKED_REASON_PROGRESS_WAIT	2
> > +#define CS_STATUS_BLOCKED_REASON_SYNC_WAIT	3
> > +#define CS_STATUS_BLOCKED_REASON_DEFERRED	5
> > +#define CS_STATUS_BLOCKED_REASON_RES		6
> > +#define CS_STATUS_BLOCKED_REASON_FLUSH		7
> > +#define CS_STATUS_BLOCKED_REASON_MASK		GENMASK(3, 0)
> > +	u32 status_blocked_reason;
> > +	u32 status_wait_sync_value_hi;
> > +	u32 reserved2[6];
> > +
> > +#define CS_EXCEPTION_TYPE(x)			((x) & GENMASK(7, 0))
> > +#define CS_EXCEPTION_DATA(x)			(((x) >> 8) & GENMASK(23, 0))
> > +	u32 fault;
> > +	u32 fatal;
> > +	u64 fault_info;
> > +	u64 fatal_info;
> > +	u32 reserved3[10];
> > +	u32 heap_vt_start;
> > +	u32 heap_vt_end;
> > +	u32 reserved4;
> > +	u32 heap_frag_end;
> > +	u64 heap_address;
> > +} __packed;  
> 
> output is the same as input.

You mean in terms of alignment?

> 
> > +
> > +struct panthor_fw_csg_control_iface {
> > +	u32 features;
> > +	u32 input_va;
> > +	u32 output_va;
> > +	u32 suspend_size;
> > +	u32 protm_suspend_size;
> > +	u32 stream_num;
> > +	u32 stream_stride;
> > +} __packed;  
> 
> The spec is ambiguous here. In one place it states the stride is 256
> bytes, but in another that you need to look at the GLB_GROUP_STRIDE
> value. In practice we can rely on 4 byte alignment.
> 
> I'm beginning to wonder if it's worth worrying about, I think I'll stop
> here ;)

Hehe. I'll add checks where I can in the parsing logic. I guess having
things naturally aligned and making sure there's no overlap with other
interfaces is a minimum.
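
Just to give an idea of the kind of check I have in mind in the
section/interface parsing code (field names purely illustrative):

	/* Interfaces embedded in the FW sections must at least keep
	 * their u64 fields naturally aligned.
	 */
	if (!IS_ALIGNED(iface_offset, sizeof(u64)) ||
	    !IS_ALIGNED(iface_stride, sizeof(u64)))
		return -EINVAL;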

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 10/15] drm/panthor: Add the heap logical block
  2023-08-18 14:39   ` Steven Price
@ 2023-08-29 16:21     ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 16:21 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Fri, 18 Aug 2023 15:39:03 +0100
Steven Price <steven.price@arm.com> wrote:

> I'm not sure whether we should really be describing this structure in
> the kernel. Beyond the size the kernel has no reason to be looking at
> the internals and the spec does have a warning that the layout may change.

Yeah, I guess I just wanted to have that documented somewhere, so
people understand what this heap context is about. Was quite obscure to
me before that. Anyway, I can move that to some mesa doc, that's not a
big deal.

> 
> Interestingly kbase also rounds this size up to ensure that it is at
> least a cache line. Which I guess might be required if the CPU and GPU
> are not coherent as we zero the context (from the CPU) before use.

Makes sense.
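
So something like this at the allocation site (sketch; heap_ctx_size
being whatever variable holds the per-context size):

	/* The context is zeroed from the CPU before use, so make sure it
	 * never shares a cache line with anything else when the CPU and
	 * GPU are not coherent (same rounding kbase applies).
	 */
	heap_ctx_size = ALIGN(heap_ctx_size, cache_line_size());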

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 11/15] drm/panthor: Add the scheduler logical block
  2023-08-18 15:38   ` Steven Price
@ 2023-08-29 16:36     ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 16:36 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Fri, 18 Aug 2023 16:38:57 +0100
Steven Price <steven.price@arm.com> wrote:

> > +/**
> > + * sched_queue_work() - Queue a scheduler work.
> > + * @sched: Scheduler object.
> > + * @wname: Work name.
> > + *
> > + * Conditionally queues a scheduler work if no reset is pending/in-progress.
> > + */
> > +#define sched_queue_work(sched, wname) \
> > +	do { \
> > +		if (sched->reset.in_progress || \  
> 
> Is this missing a '!'? This executes if a reset is in progress.

What?! I wonder how this went unnoticed. I guess the fact that I only
use scheduler-level work items for user sync object signaling (which are
not used yet) and for the ping (I'm sure I tested it, but it must have
been before I extended the reset logic...) could explain that, but
still...
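
For the record, the intended version is presumably:

#define sched_queue_work(sched, wname) \
	do { \
		if (!(sched)->reset.in_progress && \
		    !panthor_device_reset_is_pending((sched)->ptdev)) \
			queue_work((sched)->wq, &(sched)->wname ## _work); \
	} while (0)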

> 
> > +		    !panthor_device_reset_is_pending((sched)->ptdev)) \
> > +			queue_work((sched)->wq, &(sched)->wname ## _work); \
> > +	} while (0)

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 12/15] drm/panthor: Add the driver frontend block
  2023-08-21 11:31   ` Steven Price
@ 2023-08-29 17:46     ` Boris Brezillon
  2023-08-31 14:42       ` Steven Price
  0 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 17:46 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Mon, 21 Aug 2023 12:31:29 +0100
Steven Price <steven.price@arm.com> wrote:

> On 09/08/2023 17:53, Boris Brezillon wrote:
> > This is the last piece missing to expose the driver to the outside
> > world.
> > 
> > This is basically a wrapper between the ioctls and the other logical
> > blocks.
> > 
> > v2:
> > - Rename the driver (pancsf -> panthor)
> > - Change the license (GPL2 -> MIT + GPL2)
> > - Split the driver addition commit
> > - Document the code
> > - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> > - Fix various bugs
> > - Refactored the code to make job submission re-usable for VM_BIND
> >   jobs
> > - Add user object copy helpers
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_drv.c | 1540 +++++++++++++++++++++++++
> >  1 file changed, 1540 insertions(+)
> >  create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> > new file mode 100644
> > index 000000000000..377ebea4c0e8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> > @@ -0,0 +1,1540 @@
> > +// SPDX-License-Identifier: GPL-2.0 or MIT
> > +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> > +/* Copyright 2019 Linaro, Ltd., Rob Herring <robh@kernel.org> */
> > +/* Copyright 2019 Collabora ltd. */
> > +
> > +#include <linux/module.h>
> > +#include <linux/of_platform.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/pm_runtime.h>
> > +#include <linux/xarray.h>
> > +
> > +#include <drm/drm_drv.h>
> > +#include <drm/drm_exec.h>
> > +#include <drm/drm_ioctl.h>
> > +#include <drm/drm_syncobj.h>
> > +#include <drm/drm_utils.h>
> > +#include <drm/drm_debugfs.h>
> > +#include <drm/gpu_scheduler.h>
> > +#include <drm/panthor_drm.h>
> > +
> > +#include "panthor_sched.h"
> > +#include "panthor_device.h"
> > +#include "panthor_gem.h"
> > +#include "panthor_heap.h"
> > +#include "panthor_fw.h"
> > +#include "panthor_mmu.h"
> > +#include "panthor_gpu.h"
> > +#include "panthor_regs.h"
> > +
> > +/**
> > + * DOC: user <-> kernel object copy helpers.
> > + */
> > +
> > +/**
> > + * panthor_set_uobj() - Copy kernel object to user object.
> > + * @usr_ptr: Users pointer.
> > + * @usr_size: Size of the user object.
> > + * @min_size: Minimum size for this object.
> > + * @kern_size: Size of the kernel object.
> > + * @in: Address of the kernel object to copy.
> > + *
> > + * Helper automating kernel -> user object copies.
> > + *
> > + * Don't use this function directly, use PANTHOR_UOBJ_SET() instead.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const void *in)
> > +{
> > +	/* User size shouldn't be smaller than the minimal object size. */
> > +	if (usr_size < min_size)
> > +		return -EINVAL;
> > +
> > +	if (copy_to_user(u64_to_user_ptr(usr_ptr), in, min_t(u32, usr_size, kern_size)))
> > +		return -EFAULT;
> > +
> > +	/* When the kernel object is smaller than the user object, we fill the gap with
> > +	 * zeros.
> > +	 */
> > +	if (usr_size > kern_size &&
> > +	    clear_user(u64_to_user_ptr(usr_ptr + kern_size), usr_size - kern_size)) {
> > +		return -EFAULT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_get_uobj_array() - Copy a user object array into a kernel accessible object array.
> > + * @in: The object array to copy.
> > + * @min_stride: Minimum array stride.
> > + * @obj_kernel: Kernel object size.
> > + * @out: Pointer to a variable that will hold the newly allocated object array.
> > + *
> > + * Helper automating user -> kernel object copies.
> > + *
> > + * Don't use this function directly, use PANTHOR_UOBJ_ARRAY_GET() instead.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
> > +		       u32 obj_size, void **out)  
> 
> Instead of having 'out' as a return parameter you could use ERR_PTR()s 
> for the error cases. I know why you haven't, but see below.
> 
> > +{
> > +	int ret = 0;
> > +	void *out_alloc;
> > +
> > +	/* User stride must be at least the minimum object size, otherwise it might
> > +	 * lack useful information.
> > +	 */
> > +	if (in->stride < min_stride)
> > +		return -EINVAL;
> > +
> > +	if (!in->count)
> > +		return 0;
> > +
> > +	out_alloc = kvmalloc_array(in->count, obj_size, GFP_KERNEL);
> > +	if (!out_alloc)
> > +		return -ENOMEM;
> > +
> > +	if (obj_size == in->stride) {
> > +		/* Fast path when user/kernel have the same uAPI header version. */
> > +		if (copy_from_user(out_alloc, u64_to_user_ptr(in->array),
> > +				   (unsigned long)obj_size * in->count))
> > +			ret = -EFAULT;
> > +	} else {
> > +		void __user *in_ptr = u64_to_user_ptr(in->array);
> > +		void *out_ptr = out_alloc;
> > +
> > +		/* If the sizes differ, we need to copy elements one by one. */
> > +		for (u32 i = 0; i < in->count; i++) {
> > +			ret = copy_struct_from_user(out_ptr, obj_size, in_ptr, in->stride);
> > +			if (ret)
> > +				break;
> > +
> > +			out_ptr += obj_size;
> > +			in_ptr += in->stride;
> > +		}
> > +	}
> > +
> > +	if (ret) {
> > +		kvfree(out_alloc);
> > +		return ret;
> > +	}
> > +
> > +	*out = out_alloc;
> > +	return 0;
> > +}
> > +
> > +/**
> > + * PANTHOR_UOBJ_MIN_SIZE_INTERNAL() - Get the minimum user object size
> > + * @_typename: Object type.
> > + * @_last_mandatory_field: Last mandatory field.
> > + *
> > + * Get the minimum user object size based on the last mandatory field name,
> > + * A.K.A, the name of the last field of the structure at the time this
> > + * structure was added to the uAPI.
> > + *
> > + * Don't use directly, use PANTHOR_UOBJ_DECL() instead.
> > + */
> > +#define PANTHOR_UOBJ_MIN_SIZE_INTERNAL(_typename, _last_mandatory_field) \
> > +	(offsetof(_typename, _last_mandatory_field) + \
> > +	 sizeof(((_typename *)NULL)->_last_mandatory_field))
> > +
> > +/**
> > + * PANTHOR_UOBJ_DECL() - Declare a new uAPI object whose subject to
> > + * evolutions.
> > + * @_typename: Object type.
> > + * @_last_mandatory_field: Last mandatory field.
> > + *
> > + * Should be used to extend the PANTHOR_UOBJ_MIN_SIZE() list.
> > + */
> > +#define PANTHOR_UOBJ_DECL(_typename, _last_mandatory_field) \
> > +	_typename : PANTHOR_UOBJ_MIN_SIZE_INTERNAL(_typename, _last_mandatory_field)
> > +
> > +/**
> > + * PANTHOR_UOBJ_MIN_SIZE() - Get the minimum size of a given uAPI object
> > + * @_obj_name: Object to get the minimum size of.
> > + *
> > + * Don't use this macro directly, it's automatically called by
> > + * PANTHOR_UOBJ_{SET,GET_ARRAY}().
> > + */
> > +#define PANTHOR_UOBJ_MIN_SIZE(_obj_name) \
> > +	_Generic(_obj_name, \
> > +		 PANTHOR_UOBJ_DECL(struct drm_panthor_gpu_info, tiler_present), \
> > +		 PANTHOR_UOBJ_DECL(struct drm_panthor_csif_info, pad), \
> > +		 PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), \
> > +		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
> > +		 PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, ringbuf_size), \
> > +		 PANTHOR_UOBJ_DECL(struct drm_panthor_vm_bind_op, syncs))
> > +
> > +/**
> > + * PANTHOR_UOBJ_SET() - Copy a kernel object to a user object.
> > + * @_dest_usr_ptr: User pointer to copy to.
> > + * @_usr_size: Size of the user object.
> > + * @_src_obj: Kernel object to copy (not a pointer).
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +#define PANTHOR_UOBJ_SET(_dest_usr_ptr, _usr_size, _src_obj) \
> > +	panthor_set_uobj(_dest_usr_ptr, _usr_size, \
> > +			 PANTHOR_UOBJ_MIN_SIZE(_src_obj), \
> > +			 sizeof(_src_obj), &(_src_obj))
> > +
> > +/**
> > + * PANTHOR_UOBJ_GET_ARRAY() - Copy a user object array to a kernel accessible
> > + * object array.
> > + * @_dest_array: Local variable that will hold the newly allocated kernel
> > + * object array.
> > + * @_uobj_array: The drm_panthor_obj_array object describing the user object
> > + * array.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) \
> > +	panthor_get_uobj_array(_uobj_array, \
> > +			       PANTHOR_UOBJ_MIN_SIZE((_dest_array)[0]), \
> > +			       sizeof((_dest_array)[0]), (void **)&(_dest_array))  
> 
> Here you have an ugly cast to make the output pointer work. The below 
> patch avoids this by changing panthor_get_uobj_array() to return an 
> ERR_PTR:
> 
> ----8<----
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 377ebea4c0e8..ff749832f344 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -79,9 +79,9 @@ panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const v
>   *
>   * Return: 0 on success, a negative error code otherwise.
>   */
> -static int
> +static void *
>  panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
> -		       u32 obj_size, void **out)
> +		       u32 obj_size)
>  {
>  	int ret = 0;
>  	void *out_alloc;
> @@ -90,14 +90,14 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>  	 * lack useful information.
>  	 */
>  	if (in->stride < min_stride)
> -		return -EINVAL;
> +		return ERR_PTR(-EINVAL);
>  
>  	if (!in->count)
> -		return 0;
> +		return NULL;
>  
>  	out_alloc = kvmalloc_array(in->count, obj_size, GFP_KERNEL);
>  	if (!out_alloc)
> -		return -ENOMEM;
> +		return ERR_PTR(-ENOMEM);
>  
>  	if (obj_size == in->stride) {
>  		/* Fast path when user/kernel have the same uAPI header version. */
> @@ -121,11 +121,10 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>  
>  	if (ret) {
>  		kvfree(out_alloc);
> -		return ret;
> +		return ERR_PTR(ret);
>  	}
>  
> -	*out = out_alloc;
> -	return 0;
> +	return out_alloc;
>  }
>  
>  /**
> @@ -193,10 +192,12 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
>   *
>   * Return: 0 on success, a negative error code otherwise.
>   */
> -#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) \
> -	panthor_get_uobj_array(_uobj_array, \
> +#define PANTHOR_UOBJ_GET_ARRAY(_dest_array, _uobj_array) ({\
> +	_dest_array = panthor_get_uobj_array(_uobj_array, \
>  			       PANTHOR_UOBJ_MIN_SIZE((_dest_array)[0]), \
> -			       sizeof((_dest_array)[0]), (void **)&(_dest_array))
> +			       sizeof((_dest_array)[0])); \
> +	IS_ERR(_dest_array) ? PTR_ERR(_dest_array) : 0; \
> +	})
>  
>  /**
>   * DOC: Job submission helpers.
> ---8<----
> 
> TBH, I'd also be tempted to make PANTHOR_UOBJ_GET_ARRAY simply return 
> the ERR_PTR and change the call sites appropriately. That way you avoid 
> the 'magic' of passing an lvalue.

Yep, I've considered doing that too actually, it's just that you get

if (IS_ERR(out))
	out = NULL;

if you want to call kvfree(out) without testing in the error path. Maybe
that's a small price to pay if it helps clean up the helpers a bit.
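
That is, call sites would end up looking roughly like this (sketch of
the ERR_PTR-returning variant, exact macro signature still to be
figured out):

	sync_ops = PANTHOR_UOBJ_GET_ARRAY(sync_ops, syncs);
	if (IS_ERR(sync_ops)) {
		ret = PTR_ERR(sync_ops);
		/* Reset the pointer so the common error path can keep
		 * calling kvfree() unconditionally.
		 */
		sync_ops = NULL;
		goto err_cleanup;
	}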

> 
> > +
> > +/**
> > + * DOC: Job submission helpers.
> > + *
> > + * Here is the workflow for atomic submission of multiple jobs. By atomic,
> > + * we mean that we either submit the whole batch, or nothing. This requires
> > + * doing things in multiple steps, each step operating on all jobs belonging
> > + * to a batch.
> > + *
> > + * int xxx_submit_ioctl(...)
> > + * {
> > + *	...
> > + *
> > + *	// Initialize the submission context.
> > + *	ret = panthor_submit_ctx_init(&ctx, file, job_count);
> > + *	if (ret)
> > + *		return ret;
> > + *
> > + *	// Create jobs and attach sync operations.
> > + *	for (u32 i = 0; i < job_count; i++) {
> > + *		...
> > + *
> > + *		// Create job
> > + *		job = job_create(pfile, ...);
> > + *		if (IS_ERR(job)) {
> > + *			ret = PTR_ERR(job);
> > + *			goto out_cleanup_submit_ctx;
> > + *		}
> > + *
> > + *		// Add job to the submit context
> > + *		ret = panthor_submit_ctx_add_job(&ctx, i, job, sync_ops);
> > + *		if (ret)
> > + *			goto out_cleanup_submit_ctx;
> > + *	}
> > + *
> > + *	// Collect signal operations on all jobs, such that each job can pick
> > + *	// from it for its dependencies and update the fence to signal when
> > + *	// the job is submitted.  
> 
> I can't figure out here how we avoid dependency loops within a batch.
> What stops two jobs from each depending on each other?
> 
> Or do we "allow" this but rely on the loop in panthor_submit_ctx_add_deps_and_arm_jobs()
> to effectively enforce that a job cannot actually depend on a job
> which is later in the batch.

You can't have circular dependencies, because a job's fence is created
after its dependencies have been registered, so a job at the beginning
of the array can't depend on a job that comes after it. It might be
passed the same syncobj, but since a syncobj is just a container, the
fence attached to that syncobj at the time the earlier job adds it as a
dependency will be a different dma_fence.

> In which case why bother with this
> complexity rather than just performing all the steps on each job
> in order?

Because, before submitting a set of jobs, we want to make sure all jobs
passed to a submit request are valid and that enough resources are
available for their execution to proceed. We could allow partial
execution (that's actually the approach I had taken in one of the
patches I proposed to allow submitting multiple jobs in one call to
panfrost), but then you potentially have to figure out where things
failed, not to mention that the syncobjs might point to intermediate
dma_fence objects instead of the final one.

> 
> Being able to submit a forward dependency, but then having it
> ignored seems like an odd design. So I feel like I must be
> missing something.

It's not about allowing forward dependencies (that would be a mess),
but about allowing one job to take a dependency on a job appearing
earlier in the job array of the same submit call.

> 
> > + *	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);

Here panthor_submit_ctx_collect_jobs_signal_ops() is not registering
job out_fences to the syncobjs; it's just collecting all the signal
operations from all jobs in an array. Each entry in this array contains
the syncobj handle, the syncobj object, and the fence that was attached
to it at the time the collection happens, and that's it.

Now, when a job is populated, and after we've made sure it has
everything it needs to be submitted, we update, for each signal
operation passed to this specific job, the corresponding entry in the
signal array with the job's finished fence. The syncobj itself is not
updated at that point, because we want to make sure all jobs belonging
to a submit can be submitted before exposing their fences to the
outside world.

For jobs appearing later in the array, when we see a WAIT operation,
we first check the signal array to see if there's a corresponding
entry cached there for the given syncobj handle. If there is, we take
the dma_fence from there (it might come from a job submitted earlier in
this submit context, or it might be the fence that was attached
initially); if not, we call drm_syncobj_find_fence() to get the
dependency.

Once all jobs have been parsed/checked/populated, we start the
non-failing step => job submission. After that point, we can start
exposing the job fences to the outside world. This is what happens in
panthor_submit_ctx_push_fences(): we iterate over the signal
operations, and update each syncobj with the fence that was last
attached to it (that is, the fence of the last job in the submit array
that has a SIGNAL operation on that syncobj).

> > + *	if (ret)
> > + *		goto out_cleanup_submit_ctx;
> > + *
> > + *	// We acquire/prepare revs on all jobs before proceeding with the
> > + *	// dependency registration.
> > + *	//
> > + *	// This is solving two problems:
> > + *	// 1. drm_sched_job_arm() and drm_sched_entity_push_job() must be protected
> > + *	//    by a lock to make sure no concurrent access to the same entity get
> > + *	//    interleaved, which would mess up with the fence seqno ordering.
> > + *	//    Luckily, one of the resv being acquired is the VM resv, and a scheduling
> > + *	//    entity is only bound to a single VM. As soon as we acquire the VM resv,
> > + *	//    we should be safe.
> > + *	// 2. Jobs might depend on fences that were issued by previous jobs in the
> > + *	//    same batch, so we can't add dependencies on all jobs before arming
> > + *	//    previous jobs and registering the fence to the signal array, otherwise
> > + *	//    we might miss dependencies, or point to an outdated fence.
> > + *	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_job_prepare_resvs);
> > + *	if (ret)
> > + *		goto out_cleanup_submit_ctx;
> > + *
> > + *	// Now that resvs are locked/prepared, we can iterate over each job to add
> > + *	// the dependencies, arm the job fence, register the job fence to the signal
> > + *	// array.
> > + *	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_job_add_resvs_deps);
> > + *	if (ret)
> > + *		goto out_cleanup_submit_ctx;
> > + *
> > + *	// Nothing can fail after that point, so we can make our job fences visible to the
> > + *	// outside world. Push jobs and set the job fences to the resv slots we reserved.
> > + *	// This also pushes the fences to the syncobjs that are part of the signal array.
> > + *	panthor_submit_ctx_push_jobs(&ctx, panthor_job_update_resvs);
> > + *
> > + * out_cleanup_submit_ctx:
> > + *	// Cleanup the context.
> > + *	panthor_submit_ctx_cleanup(&ctx, panthor_job_put);
> > + *	...
> > + *	return ret;
> > + *}  
> 
> I'm not sure it's beneficial to have this 'pseudo-code' version of the 
> submit function here. Can we not have the relevant comments in the 
> panthor_ioctl_group_submit() function instead? My main concern is that
> this is going to get out of sync with the code over time - the function 
> names are already not a complete match.

Given the same logic is used for GPU and VM_BIND jobs, I thought it'd
be good to have the workflow documented in a central place, but I get
your point.

> 
> > + */
> > +
> > +/**
> > + * struct panthor_sync_signal - Represent a synchronization object point to attach
> > + * our job fence to.
> > + *
> > + * This structure is here to keep track of fences that are currently bound to
> > + * a specific syncobj point.
> > + *
> > + * At the beginning of a job submission, the fence
> > + * is retrieved from the syncobj itself, and can be NULL if no fence was attached
> > + * to this point.
> > + *
> > + * At the end, it points to the fence of the last job that had a
> > + * %DRM_PANTHOR_SYNC_OP_SIGNAL on this syncobj.
> > + *
> > + * With jobs being submitted in batches, the fence might change several times during
> > + * the process, allowing one job to wait on a job that's part of the same submission
> > + * be appears earlier in the drm_panthor_group_submit::queue_submits array.  
> 
> s/be/but/
> 
> > + */
> > +struct panthor_sync_signal {
> > +	/** @handle: The syncobj handle. */
> > +	u32 handle;
> > +
> > +	/**
> > +	 * @point: The syncobj point.
> > +	 *
> > +	 * Zero for regular syncobjs, and non-zero for timeline syncobjs.
> > +	 */
> > +	u64 point;
> > +
> > +	/**
> > +	 * @syncobj: The sync object pointed by @handle.
> > +	 */
> > +	struct drm_syncobj *syncobj;
> > +
> > +	/**
> > +	 * @chain: Chain object used to link the new fence to an existing
> > +	 * timeline syncobj.
> > +	 *
> > +	 * NULL for regular syncobj, non-NULL for timeline syncobjs.
> > +	 */
> > +	struct dma_fence_chain *chain;
> > +
> > +	/**
> > +	 * @fence: The fence to assign to the syncobj or syncobj-point.
> > +	 */
> > +	struct dma_fence *fence;
> > +};
> > +
> > +/**
> > + * struct panthor_job_ctx - Job context
> > + */
> > +struct panthor_job_ctx {
> > +	/** @job: The job that is about to be submitted to drm_sched. */
> > +	struct drm_sched_job *job;
> > +
> > +	/** @syncobjs: Array of sync operations. */
> > +	struct drm_panthor_sync_op *syncops;
> > +
> > +	/** @syncop_count: Number of sync operations. */
> > +	u32 syncop_count;
> > +};
> > +
> > +/**
> > + * struct panthor_submit_ctx - Submission context
> > + *
> > + * Anything that's related to a submission (%DRM_IOCTL_PANTHOR_VM_BIND or
> > + * %DRM_IOCTL_PANTHOR_GROUP_SUBMIT) is kept here, so we can automate the
> > + * initialization and cleanup steps.
> > + */
> > +struct panthor_submit_ctx {
> > +	/** @file: DRM file this submission happens on. */
> > +	struct drm_file *file;
> > +
> > +	/**
> > +	 * @signal: Array of panthor_sync_signal objects.
> > +	 *
> > +	 * %DRM_PANTHOR_SYNC_OP_SIGNAL operations will be recorded here,
> > +	 * and %DRM_PANTHOR_SYNC_OP_WAIT will first check if an entry
> > +	 * matching the syncobj+point exists before calling
> > +	 * drm_syncobj_find_fence(). This allows us to describe dependencies
> > +	 * existing between jobs that are part of the same batch.
> > +	 */
> > +	struct xarray signal;  
> 
> This feels like the wrong data structure - it's simply used as a list. I 
> suspect it would be better to simply add a list_head to struct
> panthor_sync_signal.

I think I initially planned to use a raw array to make things
cache-friendly, and diverged to an xarray so I wouldn't have to bother
calculating the number of entries needed, but because of the
indirection (the array contains pointers to the signal objects, not the
signal objects themselves), it kinda defeats the original goal... :-/

So yeah, moving to a list is probably a good thing. We'll see if we
ever end up spending a lot of time iterating the signal list (that
happens when searching for dependencies and updating signal entries, as
explained above).
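
For reference, the list-based version would roughly look like this
(untested sketch, with the xarray replaced by a list_head named
'signals'):

struct panthor_sync_signal {
	/** @node: Used to insert the object in the signal list. */
	struct list_head node;

	/* ... same fields as before ... */
};

static struct panthor_sync_signal *
panthor_submit_ctx_search_sync_signal(struct panthor_submit_ctx *ctx,
				      u32 handle, u64 point)
{
	struct panthor_sync_signal *sig_sync;

	list_for_each_entry(sig_sync, &ctx->signals, node) {
		if (sig_sync->handle == handle && sig_sync->point == point)
			return sig_sync;
	}

	return NULL;
}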

> 
> > +
> > +	/** @jobs: Array of jobs. */
> > +	struct panthor_job_ctx *jobs;
> > +
> > +	/** @job_count: Number of entries in the @jobs array. */
> > +	u32 job_count;
> > +
> > +	/** @exec: drm_exec context used to acquire and prepare resv objects. */
> > +	struct drm_exec exec;
> > +};
> > +
> > +#define PANTHOR_SYNC_OP_FLAGS_MASK \
> > +	(DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK | DRM_PANTHOR_SYNC_OP_SIGNAL)
> > +
> > +/**
> > + * panthor_check_sync_op() - Check drm_panthor_sync_op fields
> > + * @sync_op: The sync operation to check.
> > + *
> > + * Return: 0 on success, -EINVAL otherwise.
> > + */
> > +static int
> > +panthor_check_sync_op(const struct drm_panthor_sync_op *sync_op)
> > +{
> > +	u8 handle_type;
> > +
> > +	if (sync_op->flags & ~PANTHOR_SYNC_OP_FLAGS_MASK)
> > +		return -EINVAL;
> > +
> > +	handle_type = sync_op->flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK;
> > +	if (handle_type != DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ &&
> > +	    handle_type != DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ)
> > +		return -EINVAL;
> > +
> > +	if (handle_type == DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ &&
> > +	    sync_op->timeline_value != 0)
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_sync_signal_free() - Release resources and free a panthor_sync_signal object
> > + * @sig_sync: Signal object to free.
> > + */
> > +static void
> > +panthor_sync_signal_free(struct panthor_sync_signal *sig_sync)
> > +{
> > +	if (!sig_sync)
> > +		return;
> > +
> > +	drm_syncobj_put(sig_sync->syncobj);
> > +	dma_fence_chain_free(sig_sync->chain);
> > +	dma_fence_put(sig_sync->fence);
> > +	kfree(sig_sync);
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_add_sync_signal() - Add a signal operation to a submit context
> > + * @ctx: Context to add the signal operation to.
> > + * @handle: Syncobj handle.
> > + * @point: Syncobj point.
> > + *
> > + * Return: A valid panthor_sync_signal object on success, an ERR_PTR() otherwise.  
> 
> The only part of the return used is the ERR_PTR() part, so make this a simple int.

I see.

> 
> > + */
> > +static struct panthor_sync_signal *
> > +panthor_submit_ctx_add_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
> > +{
> > +	struct panthor_sync_signal *sig_sync;
> > +	struct dma_fence *cur_fence;
> > +	int ret;
> > +	u32 id;
> > +
> > +	sig_sync = kzalloc(sizeof(*sig_sync), GFP_KERNEL);
> > +	if (!sig_sync)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	sig_sync->handle = handle;
> > +	sig_sync->point = point;
> > +
> > +	if (point > 0) {
> > +		sig_sync->chain = dma_fence_chain_alloc();
> > +		if (!sig_sync->chain) {
> > +			ret = -ENOMEM;
> > +			goto err_free_sig_sync;
> > +		}
> > +	}
> > +
> > +	sig_sync->syncobj = drm_syncobj_find(ctx->file, handle);
> > +	if (!sig_sync->syncobj) {
> > +		ret = -EINVAL;
> > +		goto err_free_sig_sync;
> > +	}
> > +
> > +	/* Retrieve the current fence attached to that point. It's
> > +	 * perfectly fine to get a NULL fence here, it just means there's
> > +	 * no fence attached to that point yet.
> > +	 */
> > +	if (!drm_syncobj_find_fence(ctx->file, handle, point, 0, &cur_fence))
> > +		sig_sync->fence = cur_fence;
> > +
> > +	ret = xa_alloc(&ctx->signal, &id, sig_sync, xa_limit_32b, GFP_KERNEL);
> > +	if (ret)
> > +		goto err_free_sig_sync;
> > +
> > +	return sig_sync;
> > +
> > +err_free_sig_sync:
> > +	panthor_sync_signal_free(sig_sync);
> > +	return ERR_PTR(ret);
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_search_sync_signal() - Search an existing signal operation in a
> > + * submit context.
> > + * @ctx: Context to search the signal operation in.
> > + * @handle: Syncobj handle.
> > + * @point: Syncobj point.
> > + *
> > + * Return: A valid panthor_sync_signal object if found, NULL otherwise.
> > + */
> > +static struct panthor_sync_signal *
> > +panthor_submit_ctx_search_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
> > +{
> > +	struct panthor_sync_signal *sig_sync;
> > +	unsigned long i;
> > +
> > +	xa_for_each(&ctx->signal, i, sig_sync) {
> > +		if (handle == sig_sync->handle && point == sig_sync->point)
> > +			return sig_sync;
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_add_job() - Add a job to a submit context
> > + * @ctx: Context to search the signal operation in.
> > + * @idx: Index of the job in the context.
> > + * @job: Job to add.
> > + * @syncs: Sync operations provided by userspace.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_submit_ctx_add_job(struct panthor_submit_ctx *ctx, u32 idx,
> > +			   struct drm_sched_job *job,
> > +			   const struct drm_panthor_obj_array *syncs)
> > +{
> > +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
> > +						    struct panthor_device,
> > +						    base);
> > +	int ret;
> > +
> > +	if (drm_WARN_ON(&ptdev->base,
> > +			idx >= ctx->job_count ||
> > +			ctx->jobs[idx].job ||
> > +			ctx->jobs[idx].syncops ||
> > +			ctx->jobs[idx].syncop_count))
> > +		return -EINVAL;
> > +
> > +	ctx->jobs[idx].job = job;  
> 
> While the WARN_ON obviously shouldn't happen, this positioning of the 
> ctx->jobs[].job assignment means the caller has no idea if the 
> assignment has happened. AFAICT in the case of the WARN_ON the job isn't 
> cleaned up properly.

It's not really about cleanup not happening; it's more about being
passed an index that was already populated.

> 
> The options I can see are to move this line further down (and make the 
> caller clean up that one job if this function fails), or to clean up the 
> job in the case where the WARN_ON fails.

Maybe I should drop this WARN_ON() and assume the caller passed a valid
index...
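
Or move the assignment after the last failure point, so the caller
always knows it still owns the job if this function fails, i.e.:

	ret = PANTHOR_UOBJ_GET_ARRAY(ctx->jobs[idx].syncops, syncs);
	if (ret)
		return ret;

	ctx->jobs[idx].syncop_count = syncs->count;

	/* Only transfer job ownership to the context once nothing can fail. */
	ctx->jobs[idx].job = job;
	return 0;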

> 
> > +
> > +	ret = PANTHOR_UOBJ_GET_ARRAY(ctx->jobs[idx].syncops, syncs);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ctx->jobs[idx].syncop_count = syncs->count;
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_get_sync_signal() - Search signal operation and add one if none was found.
> > + * @ctx: Context to search the signal operation in.
> > + * @handle: Syncobj handle.
> > + * @point: Syncobj point.
> > + *
> > + * Return: A valid panthor_sync_signal object on success, an ERR_PTR() otherwise.  
> 
> As above, no need to return the object, just an int error code.
> 
> > + */
> > +static struct panthor_sync_signal *
> > +panthor_submit_ctx_get_sync_signal(struct panthor_submit_ctx *ctx, u32 handle, u64 point)
> > +{
> > +	struct panthor_sync_signal *sig_sync;
> > +
> > +	sig_sync = panthor_submit_ctx_search_sync_signal(ctx, handle, point);
> > +	if (sig_sync)
> > +		return sig_sync;
> > +
> > +	return panthor_submit_ctx_add_sync_signal(ctx, handle, point);
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_update_job_sync_signal_fences() - Update fences
> > + * on the signal operations specified by a job.
> > + * @ctx: Context to search the signal operation in.
> > + * @job_idx: Index of the job to operate on.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_submit_ctx_update_job_sync_signal_fences(struct panthor_submit_ctx *ctx,
> > +						 u32 job_idx)
> > +{
> > +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
> > +						    struct panthor_device,
> > +						    base);
> > +	struct dma_fence *done_fence = &ctx->jobs[job_idx].job->s_fence->finished;
> > +	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
> > +	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
> > +
> > +	for (u32 i = 0; i < sync_op_count; i++) {
> > +		struct dma_fence *old_fence;
> > +		struct panthor_sync_signal *sig_sync;
> > +
> > +		if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))
> > +			continue;
> > +
> > +		sig_sync = panthor_submit_ctx_search_sync_signal(ctx, sync_ops[i].handle,
> > +								 sync_ops[i].timeline_value);
> > +		if (drm_WARN_ON(&ptdev->base, !sig_sync))
> > +			return -EINVAL;
> > +
> > +		old_fence = sig_sync->fence;
> > +		sig_sync->fence = dma_fence_get(done_fence);
> > +		dma_fence_put(old_fence);
> > +
> > +		if (drm_WARN_ON(&ptdev->base, !sig_sync->fence))
> > +			return -EINVAL;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_collect_job_signal_ops() - Iterate over all job signal operations
> > + * and add them to the context.
> > + * @ctx: Context to search the signal operation in.
> > + * @job_idx: Index of the job to operate on.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_submit_ctx_collect_job_signal_ops(struct panthor_submit_ctx *ctx,
> > +					  u32 job_idx)
> > +{
> > +	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
> > +	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
> > +
> > +	for (u32 i = 0; i < sync_op_count; i++) {
> > +		struct panthor_sync_signal *sig_sync;
> > +		int ret;
> > +
> > +		if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))
> > +			continue;
> > +
> > +		ret = panthor_check_sync_op(&sync_ops[i]);
> > +		if (ret)
> > +			return ret;
> > +
> > +		sig_sync = panthor_submit_ctx_get_sync_signal(ctx,
> > +							      sync_ops[i].handle,
> > +							      sync_ops[i].timeline_value);
> > +		if (IS_ERR(sig_sync))
> > +			return PTR_ERR(sig_sync);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_push_fences() - Iterate over the signal array, and for each entry, push
> > + * the currently assigned fence to the associated syncobj.
> > + * @ctx: Context to push fences on.
> > + *
> > + * This is the last step of a submission procedure, and is done once we know the submission
> > + * is effective and job fences are guaranteed to be signaled in finite time.
> > + */
> > +static void
> > +panthor_submit_ctx_push_fences(struct panthor_submit_ctx *ctx)
> > +{
> > +	struct panthor_sync_signal *sig_sync;
> > +	unsigned long i;
> > +
> > +	xa_for_each(&ctx->signal, i, sig_sync) {
> > +		if (sig_sync->chain) {
> > +			drm_syncobj_add_point(sig_sync->syncobj, sig_sync->chain,
> > +					      sig_sync->fence, sig_sync->point);
> > +			sig_sync->chain = NULL;
> > +		} else {
> > +			drm_syncobj_replace_fence(sig_sync->syncobj, sig_sync->fence);
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_add_sync_deps_to_job() - Add sync wait operations as
> > + * job dependencies.
> > + * @ctx: Submit context.
> > + * @job_idx: Index of the job to operate on.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_submit_ctx_add_sync_deps_to_job(struct panthor_submit_ctx *ctx,
> > +					u32 job_idx)
> > +{
> > +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
> > +						    struct panthor_device,
> > +						    base);
> > +	const struct drm_panthor_sync_op *sync_ops = ctx->jobs[job_idx].syncops;
> > +	struct drm_sched_job *job = ctx->jobs[job_idx].job;
> > +	u32 sync_op_count = ctx->jobs[job_idx].syncop_count;
> > +	int ret = 0;
> > +
> > +	if (!sync_op_count)
> > +		return 0;  
> 
> Not needed - the for loop will be skipped in this case anyway.
> 
> > +
> > +	for (u32 i = 0; i < sync_op_count; i++) {
> > +		struct panthor_sync_signal *sig_sync;
> > +		struct dma_fence *fence;
> > +
> > +		if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL)
> > +			continue;  
> 
> NIT: It might be worth having a helper for the operation type. It's a 
> little confusing that we have !(flags & SIGNAL) and (flags & SIGNAL) but 
> not (flags & WAIT) - obviously looking at the definition shows why. Also 
> there'll be a lot of careful refactoring needed if a third operation is 
> ever added.

I had the operation as a separate field initially, but I couldn't think
of any other operations we could do on a syncobj, so I decided to make
it a flag, and mimic what Xe does.
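
That said, a couple of trivial helpers would keep the call sites
readable if a third operation ever shows up, e.g.:

static bool sync_op_is_signal(const struct drm_panthor_sync_op *sync_op)
{
	return !!(sync_op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL);
}

static bool sync_op_is_wait(const struct drm_panthor_sync_op *sync_op)
{
	/* A wait op is currently just the absence of the SIGNAL flag. */
	return !(sync_op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL);
}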

> 
> > +
> > +		ret = panthor_check_sync_op(&sync_ops[i]);
> > +		if (ret)
> > +			return ret;
> > +
> > +		sig_sync = panthor_submit_ctx_search_sync_signal(ctx, sync_ops[i].handle,
> > +								 sync_ops[i].timeline_value);
> > +		if (sig_sync) {
> > +			if (drm_WARN_ON(&ptdev->base, !sig_sync->fence))
> > +				return -EINVAL;
> > +
> > +			fence = dma_fence_get(sig_sync->fence);
> > +		} else {
> > +			ret = drm_syncobj_find_fence(ctx->file, sync_ops[i].handle,
> > +						     sync_ops[i].timeline_value,
> > +						     0, &fence);
> > +			if (ret)
> > +				return ret;
> > +		}
> > +
> > +		ret = drm_sched_job_add_dependency(job, fence);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_collect_jobs_signal_ops() - Collect all signal operations
> > + * and add them to the submit context.
> > + * @ctx: Submit context.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_submit_ctx_collect_jobs_signal_ops(struct panthor_submit_ctx *ctx)
> > +{
> > +	for (u32 i = 0; i < ctx->job_count; i++) {
> > +		int ret;
> > +
> > +		ret = panthor_submit_ctx_collect_job_signal_ops(ctx, i);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_add_deps_and_arm_jobs() - Add jobs dependencies and arm jobs
> > + * @ctx: Submit context.
> > + * @add_resvs_deps: Callback used to add implicit job dependencies.
> > + *
> > + * Must be called after panthor_submit_ctx_prepare_resvs().
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_submit_ctx_add_deps_and_arm_jobs(struct panthor_submit_ctx *ctx,
> > +					 int (*add_resvs_deps)(struct drm_sched_job *))
> > +{
> > +	for (u32 i = 0; i < ctx->job_count; i++) {
> > +		int ret;
> > +
> > +		ret = add_resvs_deps(ctx->jobs[i].job);
> > +		if (ret)
> > +			return ret;
> > +
> > +		ret = panthor_submit_ctx_add_sync_deps_to_job(ctx, i);
> > +		if (ret)
> > +			return ret;
> > +
> > +		drm_sched_job_arm(ctx->jobs[i].job);
> > +
> > +		ret = panthor_submit_ctx_update_job_sync_signal_fences(ctx, i);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_prepare_resvs() - Lock/prepare reservation objects for all jobs.
> > + * @ctx: Submit context.
> > + * @prep_resvs: Callback used to prepare reservation objects associated to a job.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int
> > +panthor_submit_ctx_prepare_resvs(struct panthor_submit_ctx *ctx,
> > +				 int (*prep_resvs)(struct drm_exec *, struct drm_sched_job *))
> > +{
> > +	drm_exec_until_all_locked(&ctx->exec) {
> > +		for (u32 i = 0; i < ctx->job_count; i++) {
> > +			int ret = prep_resvs(&ctx->exec, ctx->jobs[i].job);
> > +
> > +			drm_exec_retry_on_contention(&ctx->exec);
> > +			if (ret)
> > +				return ret;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_push_jobs() - Push jobs to their scheduling entities.
> > + * @ctx: Submit context.
> > + * @upd_resvs: Callback used to update reservation objects that were prepared in
> > + * panthor_submit_ctx_prepare_resvs().
> > + */
> > +static void
> > +panthor_submit_ctx_push_jobs(struct panthor_submit_ctx *ctx,
> > +			     void (*upd_resvs)(struct drm_sched_job *))
> > +{
> > +	for (u32 i = 0; i < ctx->job_count; i++) {
> > +		upd_resvs(ctx->jobs[i].job);
> > +		drm_sched_entity_push_job(ctx->jobs[i].job);
> > +
> > +		/* Job is owned by the scheduler now. */
> > +		ctx->jobs[i].job = NULL;
> > +	}
> > +
> > +	panthor_submit_ctx_push_fences(ctx);
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_init() - Initializes a submission context
> > + * @ctx: Submit context to initialize.
> > + * @file: drm_file this submission happens on.
> > + * @job_count: Number of jobs that will be submitted.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +static int panthor_submit_ctx_init(struct panthor_submit_ctx *ctx,
> > +				   struct drm_file *file, u32 job_count)
> > +{
> > +	ctx->jobs = kvmalloc_array(job_count, sizeof(*ctx->jobs),
> > +				   GFP_KERNEL | __GFP_ZERO);
> > +	if (!ctx->jobs)
> > +		return -ENOMEM;
> > +
> > +	ctx->file = file;
> > +	ctx->job_count = job_count;
> > +	xa_init_flags(&ctx->signal, XA_FLAGS_ALLOC);
> > +	drm_exec_init(&ctx->exec, DRM_EXEC_INTERRUPTIBLE_WAIT | DRM_EXEC_IGNORE_DUPLICATES);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * panthor_submit_ctx_cleanup() - Cleanup a submission context
> > + * @ctx: Submit context to cleanup.
> > + */
> > +static void panthor_submit_ctx_cleanup(struct panthor_submit_ctx *ctx,
> > +				       void (*job_put)(struct drm_sched_job *))
> > +{
> > +	struct panthor_sync_signal *sig_sync;
> > +	unsigned long i;
> > +
> > +	drm_exec_fini(&ctx->exec);
> > +
> > +	xa_for_each(&ctx->signal, i, sig_sync)
> > +		panthor_sync_signal_free(sig_sync);
> > +
> > +	xa_destroy(&ctx->signal);
> > +
> > +	for (i = 0; i < ctx->job_count; i++) {
> > +		job_put(ctx->jobs[i].job);
> > +		kvfree(ctx->jobs[i].syncops);
> > +	}
> > +
> > +	kvfree(ctx->jobs);
> > +}
> > +
> > +static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, struct drm_file *file)
> > +{
> > +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> > +	struct drm_panthor_dev_query *args = data;
> > +
> > +	if (!args->pointer) {
> > +		switch (args->type) {
> > +		case DRM_PANTHOR_DEV_QUERY_GPU_INFO:
> > +			args->size = sizeof(ptdev->gpu_info);
> > +			return 0;
> > +
> > +		case DRM_PANTHOR_DEV_QUERY_CSIF_INFO:
> > +			args->size = sizeof(ptdev->csif_info);
> > +			return 0;
> > +
> > +		default:
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	switch (args->type) {
> > +	case DRM_PANTHOR_DEV_QUERY_GPU_INFO:
> > +		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->gpu_info);
> > +
> > +	case DRM_PANTHOR_DEV_QUERY_CSIF_INFO:
> > +		return PANTHOR_UOBJ_SET(args->pointer, args->size, ptdev->csif_info);
> > +
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +}
> > +
> > +#define PANTHOR_VM_CREATE_FLAGS			0
> > +
> > +static int panthor_ioctl_vm_create(struct drm_device *ddev, void *data,
> > +				   struct drm_file *file)
> > +{
> > +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> > +	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_vm_create *args = data;
> > +	u64 kernel_va_start = 0;
> > +	int cookie, ret;
> > +
> > +	if (!drm_dev_enter(ddev, &cookie))
> > +		return -ENODEV;
> > +
> > +	if (args->flags & ~PANTHOR_VM_CREATE_FLAGS) {
> > +		ret = -EINVAL;
> > +		goto out_dev_exit;
> > +	}
> > +
> > +	if (drm_WARN_ON(ddev, !va_bits) || args->kernel_va_range > (1ull << (va_bits - 1))) {  
> 
> The check for !va_bits would be better done at probe time. I'd also be 
> tempted to move the change for kernel_va_range down to 
> panthor_vm_create() as that has to repeat the va_bits calculation.
> 
> > +		ret = -EINVAL;
> > +		goto out_dev_exit;
> > +	}
> > +
> > +	if (args->kernel_va_range)
> > +		kernel_va_start = (1 << (va_bits - 1)) - args->kernel_va_range;  
> 
> And also push the calculation of va_start down to 
> panthor_vm_create() as well.

panthor_vm_create() is used internally, for the MCU VM creation, and
I'd prefer to keep it uAPI agnostic. I don't mind moving it to
panthor_vm_pool_create_vm() but we'd still have to do the va_bits
calculation twice.
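
Just so we're talking about the same thing, the result would look
roughly like this (sketch only, the panthor_vm_pool_create_vm()
prototype and the pool type name are guesses):

	int panthor_vm_pool_create_vm(struct panthor_device *ptdev,
				      struct panthor_vm_pool *pool,
				      u64 kernel_va_range)
	{
		u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
		u64 kernel_va_start = 0;

		if (kernel_va_range > (1ull << (va_bits - 1)))
			return -EINVAL;

		if (kernel_va_range)
			kernel_va_start = (1ull << (va_bits - 1)) - kernel_va_range;

		/* ... create the VM with kernel_va_start/kernel_va_range and
		 * register it in the pool, as done today ...
		 */
	}

which indeed duplicates the GPU_MMU_FEATURES_VA_BITS() extraction
already done in panthor_vm_create().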

> 
> > +
> > +	ret = panthor_vm_pool_create_vm(ptdev, pfile->vms,
> > +					kernel_va_start, args->kernel_va_range);
> > +	if (ret >= 0) {
> > +		args->id = ret;
> > +		ret = 0;
> > +	}
> > +
> > +out_dev_exit:
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_vm_destroy(struct drm_device *ddev, void *data,
> > +				    struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_vm_destroy *args = data;
> > +
> > +	if (args->pad)
> > +		return -EINVAL;
> > +
> > +	return panthor_vm_pool_destroy_vm(pfile->vms, args->id);
> > +}
> > +
> > +#define PANTHOR_BO_FLAGS		DRM_PANTHOR_BO_NO_MMAP
> > +
> > +static int panthor_ioctl_bo_create(struct drm_device *ddev, void *data,
> > +				   struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct panthor_gem_object *bo;
> > +	struct drm_panthor_bo_create *args = data;
> > +	struct panthor_vm *vm = NULL;
> > +	int cookie, ret;
> > +
> > +	if (!drm_dev_enter(ddev, &cookie))
> > +		return -ENODEV;
> > +
> > +	if (!args->size || args->pad ||
> > +	    (args->flags & ~PANTHOR_BO_FLAGS)) {
> > +		ret = -EINVAL;
> > +		goto out_dev_exit;
> > +	}
> > +
> > +	if (args->exclusive_vm_id) {
> > +		vm = panthor_vm_pool_get_vm(pfile->vms, args->exclusive_vm_id);
> > +		if (!vm) {
> > +			ret = -EINVAL;
> > +			goto out_dev_exit;
> > +		}
> > +	}
> > +
> > +	bo = panthor_gem_create_with_handle(file, ddev, vm, args->size, args->flags,
> > +					    &args->handle);  
> 
> As mentioned before, we should have a function which just returns the 
> handle, we don't need/want the BO here.

Sure, will do that.
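
FWIW I was thinking of something along these lines (just a sketch, the
final prototype might differ):

	int panthor_gem_create_with_handle(struct drm_file *file,
					   struct drm_device *ddev,
					   struct panthor_vm *exclusive_vm,
					   u64 size, u32 flags,
					   u32 *handle);

so the ioctl path simply becomes:

	ret = panthor_gem_create_with_handle(file, ddev, vm, args->size,
					     args->flags, &args->handle);
	panthor_vm_put(vm);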

> 
> > +
> > +	panthor_vm_put(vm);
> > +
> > +	if (IS_ERR(bo))
> > +		ret = PTR_ERR(bo);
> > +	else
> > +		ret = 0;
> > +
> > +out_dev_exit:
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_bo_mmap_offset(struct drm_device *ddev, void *data,
> > +					struct drm_file *file)
> > +{
> > +	struct drm_panthor_bo_mmap_offset *args = data;
> > +	struct drm_gem_object *obj;
> > +	int ret;
> > +
> > +	if (args->pad)
> > +		return -EINVAL;
> > +
> > +	obj = drm_gem_object_lookup(file, args->handle);
> > +	if (!obj)
> > +		return -ENOENT;
> > +
> > +	ret = drm_gem_create_mmap_offset(obj);
> > +	if (ret)
> > +		goto out;
> > +
> > +	args->offset = drm_vma_node_offset_addr(&obj->vma_node);
> > +
> > +out:
> > +	drm_gem_object_put(obj);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_group_submit(struct drm_device *ddev, void *data,
> > +				      struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_group_submit *args = data;
> > +	struct drm_panthor_queue_submit *jobs_args;
> > +	struct panthor_submit_ctx ctx;
> > +	int ret = 0, cookie;
> > +
> > +	if (args->pad)
> > +		return -EINVAL;
> > +
> > +	if (!drm_dev_enter(ddev, &cookie))
> > +		return -ENODEV;
> > +
> > +	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->queue_submits);
> > +	if (ret)
> > +		goto out_dev_exit;
> > +
> > +	ret = panthor_submit_ctx_init(&ctx, file, args->queue_submits.count);
> > +	if (ret)
> > +		goto out_free_jobs_args;
> > +
> > +	for (u32 i = 0; i < args->queue_submits.count; i++) {
> > +		const struct drm_panthor_queue_submit *qsubmit = &jobs_args[i];
> > +		struct drm_sched_job *job;
> > +
> > +		job = panthor_job_create(pfile, args->group_handle, qsubmit);
> > +		if (IS_ERR(job)) {
> > +			ret = PTR_ERR(job);
> > +			goto out_cleanup_submit_ctx;
> > +		}
> > +
> > +		ret = panthor_submit_ctx_add_job(&ctx, i, job, &qsubmit->syncs);
> > +		if (ret)
> > +			goto out_cleanup_submit_ctx;
> > +	}
> > +
> > +	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
> > +	if (ret)
> > +		goto out_cleanup_submit_ctx;
> > +
> > +	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_job_prepare_resvs);
> > +	if (ret)
> > +		goto out_cleanup_submit_ctx;
> > +
> > +	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_job_add_resvs_deps);
> > +	if (ret)
> > +		goto out_cleanup_submit_ctx;
> > +
> > +	/* Nothing can fail after that point. */
> > +	panthor_submit_ctx_push_jobs(&ctx, panthor_job_update_resvs);
> > +
> > +out_cleanup_submit_ctx:
> > +	panthor_submit_ctx_cleanup(&ctx, panthor_job_put);
> > +
> > +out_free_jobs_args:
> > +	kvfree(jobs_args);
> > +
> > +out_dev_exit:
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_group_destroy(struct drm_device *ddev, void *data,
> > +				       struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_group_destroy *args = data;
> > +
> > +	if (args->pad)
> > +		return -EINVAL;
> > +
> > +	return panthor_group_destroy(pfile, args->group_handle);
> > +}
> > +
> > +static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
> > +				      struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_group_create *args = data;
> > +	struct drm_panthor_queue_create *queue_args;
> > +	int ret;
> > +
> > +	if (!args->queues.count)
> > +		return -EINVAL;
> > +
> > +	ret = PANTHOR_UOBJ_GET_ARRAY(queue_args, &args->queues);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = panthor_group_create(pfile, args, queue_args);
> > +	if (ret >= 0) {
> > +		args->group_handle = ret;
> > +		ret = 0;
> > +	}
> > +
> > +	kvfree(queue_args);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_group_get_state(struct drm_device *ddev, void *data,
> > +					 struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_group_get_state *args = data;
> > +
> > +	return panthor_group_get_state(pfile, args);
> > +}
> > +
> > +static int panthor_ioctl_tiler_heap_create(struct drm_device *ddev, void *data,
> > +					   struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_tiler_heap_create *args = data;
> > +	struct panthor_heap_pool *pool;
> > +	struct panthor_vm *vm;
> > +	int ret;
> > +
> > +	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
> > +	if (!vm)
> > +		return -EINVAL;
> > +
> > +	pool = panthor_vm_get_heap_pool(vm, true);
> > +	if (IS_ERR(pool)) {
> > +		ret = PTR_ERR(pool);
> > +		goto out_put_vm;
> > +	}
> > +
> > +	ret = panthor_heap_create(pool,
> > +				  args->initial_chunk_count,
> > +				  args->chunk_size,
> > +				  args->max_chunks,
> > +				  args->target_in_flight,
> > +				  &args->tiler_heap_ctx_gpu_va,
> > +				  &args->first_heap_chunk_gpu_va);
> > +	if (ret < 0)
> > +		goto out_put_heap_pool;
> > +
> > +	/* Heap pools are per-VM. We combine the VM and HEAP id to make
> > +	 * a unique heap handle.
> > +	 */
> > +	args->handle = (args->vm_id << 16) | ret;
> > +	ret = 0;
> > +
> > +out_put_heap_pool:
> > +	panthor_heap_pool_put(pool);
> > +
> > +out_put_vm:
> > +	panthor_vm_put(vm);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_tiler_heap_destroy(struct drm_device *ddev, void *data,
> > +					    struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_tiler_heap_destroy *args = data;
> > +	struct panthor_heap_pool *pool;
> > +	struct panthor_vm *vm;
> > +	int ret;
> > +
> > +	if (args->pad)
> > +		return -EINVAL;
> > +
> > +	vm = panthor_vm_pool_get_vm(pfile->vms, args->handle >> 16);
> > +	if (!vm)
> > +		return -EINVAL;
> > +
> > +	pool = panthor_vm_get_heap_pool(vm, false);
> > +	if (!pool) {
> > +		ret = -EINVAL;
> > +		goto out_put_vm;
> > +	}
> > +
> > +	ret = panthor_heap_destroy(pool, args->handle & GENMASK(15, 0));
> > +	panthor_heap_pool_put(pool);
> > +
> > +out_put_vm:
> > +	panthor_vm_put(vm);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_vm_bind_async(struct drm_device *ddev,
> > +				       struct drm_panthor_vm_bind *args,
> > +				       struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_vm_bind_op *jobs_args;
> > +	struct panthor_submit_ctx ctx;
> > +	struct panthor_vm *vm;
> > +	int ret = 0;
> > +
> > +	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
> > +	if (!vm)
> > +		return -EINVAL;
> > +
> > +	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->ops);
> > +	if (ret)
> > +		goto out_put_vm;
> > +
> > +	ret = panthor_submit_ctx_init(&ctx, file, args->ops.count);
> > +	if (ret)
> > +		goto out_free_jobs_args;
> > +
> > +	for (u32 i = 0; i < args->ops.count; i++) {
> > +		struct drm_panthor_vm_bind_op *op = &jobs_args[i];
> > +		struct drm_sched_job *job;
> > +
> > +		job = panthor_vm_bind_job_create(file, vm, op);
> > +		if (IS_ERR(job)) {
> > +			ret = PTR_ERR(job);
> > +			goto out_cleanup_submit_ctx;
> > +		}
> > +
> > +		ret = panthor_submit_ctx_add_job(&ctx, i, job, &op->syncs);
> > +		if (ret)
> > +			goto out_cleanup_submit_ctx;
> > +	}
> > +
> > +	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
> > +	if (ret)
> > +		goto out_cleanup_submit_ctx;
> > +
> > +	ret = panthor_submit_ctx_prepare_resvs(&ctx, panthor_vm_bind_job_prepare_resvs);
> > +	if (ret)
> > +		goto out_cleanup_submit_ctx;
> > +
> > +	ret = panthor_submit_ctx_add_deps_and_arm_jobs(&ctx, panthor_vm_bind_job_add_resvs_deps);
> > +	if (ret)
> > +		goto out_cleanup_submit_ctx;
> > +
> > +	/* Nothing can fail after that point. */
> > +	panthor_submit_ctx_push_jobs(&ctx, panthor_vm_bind_job_update_resvs);
> > +
> > +out_cleanup_submit_ctx:
> > +	panthor_submit_ctx_cleanup(&ctx, panthor_vm_bind_job_put);
> > +
> > +out_free_jobs_args:
> > +	kvfree(jobs_args);
> > +
> > +out_put_vm:
> > +	panthor_vm_put(vm);
> > +	return ret;
> > +}
> > +
> > +static int panthor_ioctl_vm_bind_sync(struct drm_device *ddev,
> > +				      struct drm_panthor_vm_bind *args,
> > +				      struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_vm_bind_op *jobs_args;
> > +	struct panthor_vm *vm;
> > +	int ret;
> > +
> > +	vm = panthor_vm_pool_get_vm(pfile->vms, args->vm_id);
> > +	if (!vm)
> > +		return -EINVAL;
> > +
> > +	ret = PANTHOR_UOBJ_GET_ARRAY(jobs_args, &args->ops);
> > +	if (ret)
> > +		goto out_put_vm;
> > +
> > +	for (u32 i = 0; i < args->ops.count; i++) {
> > +		ret = panthor_vm_bind_exec_sync_op(file, vm, &jobs_args[i]);
> > +		if (ret) {
> > +			/* Update ops.count so the user knows where things failed. */  
> 
> It might be worth mentioning this in the UAPI header as the array count
> wouldn't usually be modified.

Will do. Note that it's only the case for synchronous operations.
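
Something along these lines on the ops field in panthor_drm.h, I guess
(wording and exact kernel-doc placement to be refined):

	/**
	 * @ops: Array of struct drm_panthor_vm_bind_op.
	 *
	 * For synchronous requests (DRM_PANTHOR_VM_BIND_ASYNC not set),
	 * ops.count is updated on failure to reflect the number of
	 * operations that were successfully executed, so userspace knows
	 * where things stopped.
	 */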

> 
> > +			args->ops.count = i;
> > +			break;
> > +		}
> > +	}
> > +
> > +	kvfree(jobs_args);
> > +
> > +out_put_vm:
> > +	panthor_vm_put(vm);
> > +	return ret;
> > +}
> > +
> > +#define PANTHOR_VM_BIND_FLAGS DRM_PANTHOR_VM_BIND_ASYNC
> > +
> > +static int panthor_ioctl_vm_bind(struct drm_device *ddev, void *data,
> > +				 struct drm_file *file)
> > +{
> > +	struct drm_panthor_vm_bind *args = data;
> > +	int cookie, ret;
> > +
> > +	if (!drm_dev_enter(ddev, &cookie))
> > +		return -ENODEV;
> > +
> > +	if (args->flags & DRM_PANTHOR_VM_BIND_ASYNC)
> > +		ret = panthor_ioctl_vm_bind_async(ddev, args, file);
> > +	else
> > +		ret = panthor_ioctl_vm_bind_sync(ddev, args, file);
> > +
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +
> > +static int
> > +panthor_open(struct drm_device *ddev, struct drm_file *file)
> > +{
> > +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> > +	struct panthor_file *pfile;
> > +	int ret;
> > +
> > +	if (!try_module_get(THIS_MODULE))
> > +		return -EINVAL;
> > +
> > +	pfile = kzalloc(sizeof(*pfile), GFP_KERNEL);
> > +	if (!pfile) {
> > +		ret = -ENOMEM;
> > +		goto err_put_mod;
> > +	}
> > +
> > +	pfile->ptdev = ptdev;
> > +
> > +	ret = panthor_vm_pool_create(pfile);
> > +	if (ret)
> > +		goto err_free_file;
> > +
> > +	ret = panthor_group_pool_create(pfile);
> > +	if (ret)
> > +		goto err_destroy_vm_pool;
> > +
> > +	file->driver_priv = pfile;
> > +	return 0;
> > +
> > +err_destroy_vm_pool:
> > +	panthor_vm_pool_destroy(pfile);
> > +
> > +err_free_file:
> > +	kfree(pfile);
> > +
> > +err_put_mod:
> > +	module_put(THIS_MODULE);
> > +	return ret;
> > +}
> > +
> > +static void
> > +panthor_postclose(struct drm_device *ddev, struct drm_file *file)
> > +{
> > +	struct panthor_file *pfile = file->driver_priv;
> > +
> > +	panthor_group_pool_destroy(pfile);
> > +	panthor_vm_pool_destroy(pfile);
> > +
> > +	kfree(pfile);
> > +	module_put(THIS_MODULE);
> > +}
> > +
> > +static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
> > +#define PANTHOR_IOCTL(n, func, flags) \
> > +	DRM_IOCTL_DEF_DRV(PANTHOR_##n, panthor_ioctl_##func, flags)
> > +
> > +	PANTHOR_IOCTL(DEV_QUERY, dev_query, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(VM_CREATE, vm_create, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(VM_DESTROY, vm_destroy, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(VM_BIND, vm_bind, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(BO_CREATE, bo_create, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(BO_MMAP_OFFSET, bo_mmap_offset, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(GROUP_CREATE, group_create, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(GROUP_DESTROY, group_destroy, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(GROUP_GET_STATE, group_get_state, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
> > +	PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
> > +};
> > +
> > +static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
> > +{
> > +	struct drm_file *file = filp->private_data;
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct panthor_device *ptdev = pfile->ptdev;
> > +	int ret, cookie;
> > +
> > +	if (!drm_dev_enter(file->minor->dev, &cookie))
> > +		return -ENODEV;
> > +
> > +	if (vma->vm_pgoff >= (DRM_PANTHOR_USER_MMIO_OFFSET >> PAGE_SHIFT))
> > +		ret = panthor_device_mmap_io(ptdev, vma);
> > +	else
> > +		ret = drm_gem_mmap(filp, vma);
> > +
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +
> > +static const struct file_operations panthor_drm_driver_fops = {
> > +	.open = drm_open,
> > +	.release = drm_release,
> > +	.unlocked_ioctl = drm_ioctl,
> > +	.compat_ioctl = drm_compat_ioctl,
> > +	.poll = drm_poll,
> > +	.read = drm_read,
> > +	.llseek = noop_llseek,
> > +	.mmap = panthor_mmap,
> > +};
> > +
> > +#ifdef CONFIG_DEBUG_FS
> > +void panthor_debugfs_init(struct drm_minor *minor)
> > +{
> > +	panthor_mmu_debugfs_init(minor);
> > +}
> > +#endif
> > +
> > +/*
> > + * PanCSF driver version:
> > + * - 1.0 - initial interface
> > + */
> > +static const struct drm_driver panthor_drm_driver = {
> > +	.driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
> > +			   DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
> > +	.open = panthor_open,
> > +	.postclose = panthor_postclose,
> > +	.ioctls = panthor_drm_driver_ioctls,
> > +	.num_ioctls = ARRAY_SIZE(panthor_drm_driver_ioctls),
> > +	.fops = &panthor_drm_driver_fops,
> > +	.name = "panthor",
> > +	.desc = "Panthor DRM driver",
> > +	.date = "20230801",
> > +	.major = 1,
> > +	.minor = 0,
> > +
> > +	.gem_create_object = panthor_gem_create_object,
> > +	.gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> > +#ifdef CONFIG_DEBUG_FS
> > +	.debugfs_init = panthor_debugfs_init,
> > +#endif
> > +};
> > +
> > +static int panthor_probe(struct platform_device *pdev)
> > +{
> > +	struct panthor_device *ptdev;
> > +	int ret;
> > +
> > +	ptdev = devm_drm_dev_alloc(&pdev->dev, &panthor_drm_driver,
> > +				   struct panthor_device, base);
> > +	if (!ptdev)
> > +		return -ENOMEM;
> > +
> > +	platform_set_drvdata(pdev, ptdev);
> > +
> > +	ret = panthor_device_init(ptdev);
> > +	if (ret)
> > +		return ret;
> > +
> > +	return drm_dev_register(&ptdev->base, 0);
> > +}
> > +
> > +static void panthor_remove(struct platform_device *pdev)
> > +{
> > +	struct panthor_device *ptdev = platform_get_drvdata(pdev);
> > +
> > +	panthor_device_unplug(ptdev);
> > +}
> > +
> > +static const struct of_device_id dt_match[] = {
> > +	{ .compatible = "rockchip,rk3588-mali" },
> > +	{ .compatible = "arm,mali-valhall-csf" },
> > +	{}
> > +};
> > +MODULE_DEVICE_TABLE(of, dt_match);
> > +
> > +static DEFINE_RUNTIME_DEV_PM_OPS(panthor_pm_ops,
> > +				 panthor_device_suspend,
> > +				 panthor_device_resume,
> > +				 NULL);
> > +
> > +static struct platform_driver panthor_driver = {
> > +	.probe = panthor_probe,
> > +	.remove_new = panthor_remove,
> > +	.driver = {
> > +		.name = "panthor",
> > +		.pm = &panthor_pm_ops,
> > +		.of_match_table = dt_match,
> > +	},
> > +};
> > +
> > +/**
> > + * @cleanup_wq: Workqueue used to cleanup stuff.
> > + *
> > + * We create a dedicated workqueue so we can drain on unplug and
> > + * make sure all resources are freed before the module is unloaded.
> > + */
> > +struct workqueue_struct *panthor_cleanup_wq;
> > +
> > +static int __init panthor_init(void)
> > +{
> > +	int ret;
> > +
> > +	ret = panthor_mmu_pt_cache_init();
> > +	if (ret)
> > +		return ret;
> > +
> > +	panthor_cleanup_wq = alloc_workqueue("panthor-cleanup", WQ_UNBOUND, 0);
> > +	if (!panthor_cleanup_wq) {
> > +		pr_err("panthor: Failed to allocate the workqueues");
> > +		ret = -ENOMEM;
> > +		goto err_mmu_pt_cache_fini;
> > +	}
> > +
> > +	ret = platform_driver_register(&panthor_driver);
> > +	if (ret)
> > +		goto err_destroy_cleanup_wq;
> > +
> > +	return ret;
> > +
> > +err_mmu_pt_cache_fini:
> > +	panthor_mmu_pt_cache_fini();
> > +
> > +err_destroy_cleanup_wq:
> > +	destroy_workqueue(panthor_cleanup_wq);  
> 
> This cleanup looks backwards.

Oops, will fix that.
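
For the record, the error labels should unwind in the reverse order of
the success path, i.e. something like:

	ret = platform_driver_register(&panthor_driver);
	if (ret)
		goto err_destroy_cleanup_wq;

	return 0;

err_destroy_cleanup_wq:
	destroy_workqueue(panthor_cleanup_wq);

err_mmu_pt_cache_fini:
	panthor_mmu_pt_cache_fini();
	return ret;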

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS
  2023-08-11 16:08   ` Steven Price
@ 2023-08-29 17:48     ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-29 17:48 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Fri, 11 Aug 2023 17:08:20 +0100
Steven Price <steven.price@arm.com> wrote:

> On 09/08/2023 17:53, Boris Brezillon wrote:
> > Add an entry for the Panthor driver to the MAINTAINERS file.
> > 
> > v2:
> > - New commit
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> > 
> > If anyone from Arm wants to volunteer to become a co-maintainer, that
> > would be highly appreciated  
> 
> *sticks his hand up* me me! ;) Seriously though I'm happy to help out
> with the maintenance.

Ah, that's awesome news!!!! Thanks for volunteering.

> 
> And I'll try to finish reviewing the patches next week. I gave it a
> quick spin on my Rock 5B and the GPU seems to work fine. I also need to
> rebase my user space submission work. And recover from coming back from
> holiday! Plus I'm sure I wasn't full-time on GPU related things before I
> went on holiday... ;)

Thanks a lot for the thorough review. I'll try to address all your
comments and update the branches on my repo with the new version.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 04/15] drm/panthor: Add the device logical block
  2023-08-29 14:00     ` Boris Brezillon
@ 2023-08-30 13:17       ` Steven Price
  2023-08-30 14:06         ` Boris Brezillon
  2023-09-04 11:46         ` Liviu Dudau
  0 siblings, 2 replies; 93+ messages in thread
From: Steven Price @ 2023-08-30 13:17 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 29/08/2023 15:00, Boris Brezillon wrote:
> On Fri, 11 Aug 2023 16:47:56 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 09/08/2023 17:53, Boris Brezillon wrote:
>>> The panthor driver is designed in a modular way, where each logical
>>> block is dealing with a specific HW-block or software feature. In order
>>> for those blocks to communicate with each other, we need a central
>>> panthor_device collecting all the blocks, and exposing some common
>>> features, like interrupt handling, power management, reset, ...
>>>
>>> This is what this panthor_device logical block is about.
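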
>>>
>>> v2:
>>> - Rename the driver (pancsf -> panthor)
>>> - Change the license (GPL2 -> MIT + GPL2)
>>> - Split the driver addition commit
>>> - Add devfreq/PM support
>>> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
>>>
>>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
>>> ---
>>>  drivers/gpu/drm/panthor/panthor_device.c | 479 +++++++++++++++++++++++
>>>  drivers/gpu/drm/panthor/panthor_device.h | 354 +++++++++++++++++
>>>  2 files changed, 833 insertions(+)
>>>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
>>>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
>>>
>>> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
>>> new file mode 100644
>>> index 000000000000..15f102116fa0
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/panthor/panthor_device.c
>>> @@ -0,0 +1,479 @@
>>> +// SPDX-License-Identifier: GPL-2.0 or MIT
>>> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
>>> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
>>> +/* Copyright 2023 Collabora ltd. */
>>> +
>>> +#include <linux/clk.h>
>>> +#include <linux/reset.h>
>>> +#include <linux/platform_device.h>
>>> +#include <linux/pm_domain.h>
>>> +#include <linux/pm_runtime.h>
>>> +#include <linux/regulator/consumer.h>
>>> +
>>> +#include <drm/drm_drv.h>
>>> +#include <drm/drm_managed.h>
>>> +
>>> +#include "panthor_sched.h"
>>> +#include "panthor_device.h"
>>> +#include "panthor_devfreq.h"
>>> +#include "panthor_gpu.h"
>>> +#include "panthor_fw.h"
>>> +#include "panthor_mmu.h"
>>> +#include "panthor_regs.h"
>>> +
>>> +static int panthor_clk_init(struct panthor_device *ptdev)
>>> +{
>>> +	ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
>>> +	if (IS_ERR(ptdev->clks.core)) {
>>> +		drm_err(&ptdev->base, "get 'core' clock failed %ld\n",
>>> +			PTR_ERR(ptdev->clks.core));  
>>
>> I suspect it would be a good idea to use dev_err_probe() here (and
>> below) as I believe devm_clk_get can return -EPROBE_DEFER.
> 
> Nice, didn't know there was a logging function that was silencing
> probe-defer errors.
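
Yeah, and since dev_err_probe() returns the error you pass it, each
branch collapses to something like (untested):

	ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
	if (IS_ERR(ptdev->clks.core))
		return dev_err_probe(ptdev->base.dev,
				     PTR_ERR(ptdev->clks.core),
				     "get 'core' clock failed\n");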
> 
>>
>>> +		return PTR_ERR(ptdev->clks.core);
>>> +	}
>>> +
>>> +	ptdev->clks.stacks = devm_clk_get_optional(ptdev->base.dev, "stacks");
>>> +	if (IS_ERR(ptdev->clks.stacks)) {
>>> +		drm_err(&ptdev->base, "get 'stacks' clock failed %ld\n",
>>> +			PTR_ERR(ptdev->clks.stacks));
>>> +		return PTR_ERR(ptdev->clks.stacks);
>>> +	}
>>> +
>>> +	ptdev->clks.coregroup = devm_clk_get_optional(ptdev->base.dev, "coregroup");
>>> +	if (IS_ERR(ptdev->clks.coregroup)) {
>>> +		drm_err(&ptdev->base, "get 'coregroup' clock failed %ld\n",
>>> +			PTR_ERR(ptdev->clks.coregroup));
>>> +		return PTR_ERR(ptdev->clks.coregroup);
>>> +	}
>>> +
>>> +	drm_info(&ptdev->base, "clock rate = %lu\n", clk_get_rate(ptdev->clks.core));
>>> +	return 0;
>>> +}
>>> +
>>> +void panthor_device_unplug(struct panthor_device *ptdev)
>>> +{
>>> +	/* FIXME: This is racy. */  
>>
>> Can we fix this? From a quick look it seems like a sequence like below
>> should avoid the race.
>>
>> 	if (!drm_dev_enter())
>> 		/* Already unplugged */
>> 		return;
>> 	ptdev->base.unplugged = true;
>> 	drm_dev_exit();
>>
>> Although possibly that should be in the DRM core rather than open-coded
>> here.
> 
> Are you sure that's protecting us against two concurrent calls to
> drm_dev_unplug() (drm_dev_enter() is taking a read-lock)?

Well now I'm not sure ;) This was based on the implementations of
drm_dev_is_unplugged() and drm_dev_unplug(). drm_dev_is_unplugged()
simply tries to enter then exit. drm_dev_unplug() sets dev->unplugged
(without first taking any locks). So my naïve combination resulted in
the above.

The part I was missing is the synchronize_srcu() call in
drm_dev_unplug() is what matches up with the read lock in drm_dev_enter().

> And that's not
> the only thing I need actually. If there are 2 threads entering
> panthor_device_unplug(), I need to make sure the one that loses (arrived
> after unplugged was set to true) is waiting for all operations after
> the drm_dev_unplug() call to be done, otherwise we might return from
> platform_driver->remove() before the unplug cleanups are done, and
> there might still be threads/workqueues accessing device resources
> while/after they get released by the device-model.

I can't figure out how to do this other than adding a new atomic status
bit into panthor. So something like:

	if (!drm_dev_enter())
		/* Already unplugged */
		return;

	if (atomic_cmpxchg(&unplugging, false, true)) {
		/* Racing with another thread */
		drm_dev_exit();
		/* Wait for other threads to exit */
		synchronize_srcu(&drm_unplug_srcu);
		return;
	}

	panthor_xxx_unplug()

	drm_dev_exit();

Or at least I think that might work. The need to synchronize with
drm_unplug_srcu means this really needs a new helper in drm_drv.c.

>>
>>> +	if (drm_dev_is_unplugged(&ptdev->base))
>>> +		return;
>>> +
>>> +	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
>>> +
>>> +	/* Call drm_dev_unplug() so any access to HW block happening after
>>> +	 * that point get rejected.
>>> +	 */
>>> +	drm_dev_unplug(&ptdev->base);
>>> +
>>> +	/* Now, try to cleanly shutdown the GPU before the device resources
>>> +	 * get reclaimed.
>>> +	 */
>>> +	panthor_sched_unplug(ptdev);
>>> +	panthor_fw_unplug(ptdev);
>>> +	panthor_mmu_unplug(ptdev);
>>> +	panthor_gpu_unplug(ptdev);
>>> +
>>> +	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
>>> +	pm_runtime_put_sync_suspend(ptdev->base.dev);
>>> +}
>>> +
>>> +static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
>>> +{
>>> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
>>> +
>>> +	cancel_work_sync(&ptdev->reset.work);
>>> +	destroy_workqueue(ptdev->reset.wq);
>>> +}
>>> +
>>> +static void panthor_device_reset_work(struct work_struct *work)
>>> +{
>>> +	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
>>> +	int ret, cookie;
>>> +
>>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
>>> +		return;
>>> +
>>> +	panthor_sched_pre_reset(ptdev);
>>> +	panthor_fw_pre_reset(ptdev, true);
>>> +	panthor_mmu_pre_reset(ptdev);
>>> +	panthor_gpu_soft_reset(ptdev);
>>> +	panthor_gpu_l2_power_on(ptdev);
>>> +	panthor_mmu_post_reset(ptdev);
>>> +	ret = panthor_fw_post_reset(ptdev);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	atomic_set(&ptdev->reset.pending, 0);
>>> +	panthor_sched_post_reset(ptdev);
>>> +	drm_dev_exit(cookie);
>>> +
>>> +out:
>>> +	if (ret) {  
>>
>> This looks like a race condition too - is there a need for a
>> drm_dev_exit_and_unplug() function?
> 
> drm_dev_exit() is just releasing the read-lock. drm_dev_unplug()
> waits for all readers to be done and sets the unplugged value to true.
> So we only get readers/writer synchronization here, but nothing doing
> writer/writer sync. I guess the drm core leaves that to drivers, given
> drm_dev_unplug() is usually called from xxx_driver->remove() hook, on
> which serialization is guaranteed by the device-model.
> 
> TLDR; yes, it's racy, but I don't think drm_dev_exit_and_unplug() would
> help solve the existing race.

Yeah, I hadn't really thought through the reader/writer locks.

> It's worth noting that we currently have only 2 paths calling
> panthor_device_unplug(): the platform_driver->remove() hook and the
> reset worker. Calling drm_dev_unplug() might not be the right thing to
> do, I just thought it was a good match to reflect the fact the device
> becomes inaccessible, without adding yet another kind of device-lost
> field.

I quite liked the unplugged approach, it hides the complexities of the
GPU breaking nicely.

However I do think this path needs fixing in some way, because of the
"goto out" we end up calling panthor_device_unplug() while in the
drm_dev_enter() section. Which, unless I'm mistaken, means
panthor_device_unplug() will call drm_dev_unplug() in that section -
which should produce a lockdep warning at the very least, if not an
actual deadlock.

Given it's only a read lock - I think simply moving drm_dev_exit() below
the "out:" label fixes the deadlock without making any races worse.
Whether the race here actually matters I'm not sure.
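
i.e. the tail of panthor_device_reset_work() would end up looking
something like this (just moving the label, nothing else):

	ret = panthor_fw_post_reset(ptdev);
	if (ret)
		goto out;

	atomic_set(&ptdev->reset.pending, 0);
	panthor_sched_post_reset(ptdev);

out:
	drm_dev_exit(cookie);

	if (ret) {
		panthor_device_unplug(ptdev);
		drm_err(&ptdev->base,
			"Failed to boot MCU after reset, making device unusable.");
	}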

>>
>>> +		panthor_device_unplug(ptdev);
>>> +		drm_err(&ptdev->base, "Failed to boot MCU after reset, making device unusable.");
>>> +	}
>>> +}
>>> +
>>> +static bool panthor_device_is_initialized(struct panthor_device *ptdev)
>>> +{
>>> +	return !!ptdev->scheduler;
>>> +}
>>> +
>>> +static void panthor_device_free_page(struct drm_device *ddev, void *data)
>>> +{
>>> +	free_page((unsigned long)data);
>>> +}
>>> +
>>> +int panthor_device_init(struct panthor_device *ptdev)
>>> +{
>>> +	struct resource *res;
>>> +	struct page *p;
>>> +	int ret;
>>> +
>>> +	ptdev->coherent = device_get_dma_attr(ptdev->base.dev) == DEV_DMA_COHERENT;
>>> +
>>> +	drmm_mutex_init(&ptdev->base, &ptdev->pm.lock);
>>> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
>>> +	p = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>> +	if (!p)
>>> +		return -ENOMEM;
>>> +
>>> +	ptdev->pm.dummy_latest_flush = page_address(p);
>>> +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_free_page,
>>> +				       ptdev->pm.dummy_latest_flush);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	/* Set the dummy page to the default LATEST_FLUSH value. This
>>> +	 * will be updated on the next suspend.
>>> +	 */
>>> +	*ptdev->pm.dummy_latest_flush = CSF_GPU_LATEST_FLUSH_ID_DEFAULT;  
>>
>> I see why this register default value was defined. Although I'm not sure
>> it has any benefit over just using zero... If the GPU is off when user
>> space reads the FLUSH_ID then the GPU's caches are definitely empty so
>> any flush ID is valid.
> 
> Zero means we'll force a cache flush for all CS that were created while
> the device was suspended, that's not ideal.
> 
>>
>> Interestingly looking at kbase it seems to use an initial value of 1
>> (POWER_DOWN_LATEST_FLUSH_VALUE). I guess zero is less ideal because
>> FLUSH_CACHE2 would then unconditionally do a flush.
> 
> I guess a value of 1 would work. It just means we'll get a spurious
> flush if the CS is submitted after 32 flushes happened, on the other
> hand we also a spurious flush on the first submitted CS when we use
> POWER_DOWN_LATEST_FLUSH_VALUE. I'll switch to 1, drop the default def,
> and update the comment accordingly.

Yeah, matching kbase is almost certainly the safest approach ;) Sorry, I
was reviewing the patches mostly in order and this looked really odd
until I started digging into it. Zero is clearly not the ideal value,
but the reset value is also just a weird value for hardware validation
(it enables easier checking of the wrap condition). Since kbase picks 1,
that must be a value which works well!
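
So presumably the init just becomes something like (comment wording up
to you):

	/* 1 (kbase's power-down value) rather than 0, so we don't force a
	 * cache flush for every CS created while the GPU was suspended.
	 */
	*ptdev->pm.dummy_latest_flush = 1;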

>>
>>> +
>>> +	INIT_WORK(&ptdev->reset.work, panthor_device_reset_work);
>>> +	ptdev->reset.wq = alloc_ordered_workqueue("panthor-reset-wq", 0);
>>> +	if (!ptdev->reset.wq)
>>> +		return -ENOMEM;
>>> +
>>> +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_reset_cleanup, NULL);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = panthor_clk_init(ptdev);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = panthor_devfreq_init(ptdev);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ptdev->iomem = devm_platform_get_and_ioremap_resource(to_platform_device(ptdev->base.dev),
>>> +							      0, &res);
>>> +	if (IS_ERR(ptdev->iomem))
>>> +		return PTR_ERR(ptdev->iomem);
>>> +
>>> +	ptdev->phys_addr = res->start;
>>> +
>>> +	ret = devm_pm_runtime_enable(ptdev->base.dev);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = pm_runtime_resume_and_get(ptdev->base.dev);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = panthor_gpu_init(ptdev);
>>> +	if (ret)
>>> +		goto err_rpm_put;
>>> +
>>> +	ret = panthor_mmu_init(ptdev);
>>> +	if (ret)
>>> +		goto err_rpm_put;
>>> +
>>> +	ret = panthor_fw_init(ptdev);
>>> +	if (ret)
>>> +		goto err_rpm_put;
>>> +
>>> +	ret = panthor_sched_init(ptdev);
>>> +	if (ret)
>>> +		goto err_rpm_put;
>>> +
>>> +	/* ~3 frames */
>>> +	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
>>> +	pm_runtime_use_autosuspend(ptdev->base.dev);
>>> +	pm_runtime_put_autosuspend(ptdev->base.dev);
>>> +	return 0;
>>> +
>>> +err_rpm_put:
>>> +	pm_runtime_put_sync_suspend(ptdev->base.dev);
>>> +	return ret;
>>> +}
>>> +
>>> +#define PANTHOR_EXCEPTION(id) \
>>> +	[DRM_PANTHOR_EXCEPTION_ ## id] = { \
>>> +		.name = #id, \
>>> +	}
>>> +
>>> +struct panthor_exception_info {
>>> +	const char *name;
>>> +};
>>> +
>>> +static const struct panthor_exception_info panthor_exception_infos[] = {
>>> +	PANTHOR_EXCEPTION(OK),
>>> +	PANTHOR_EXCEPTION(TERMINATED),
>>> +	PANTHOR_EXCEPTION(KABOOM),
>>> +	PANTHOR_EXCEPTION(EUREKA),
>>> +	PANTHOR_EXCEPTION(ACTIVE),
>>> +	PANTHOR_EXCEPTION(CS_RES_TERM),
>>> +	PANTHOR_EXCEPTION(CS_CONFIG_FAULT),
>>> +	PANTHOR_EXCEPTION(CS_ENDPOINT_FAULT),
>>> +	PANTHOR_EXCEPTION(CS_BUS_FAULT),
>>> +	PANTHOR_EXCEPTION(CS_INSTR_INVALID),
>>> +	PANTHOR_EXCEPTION(CS_CALL_STACK_OVERFLOW),
>>> +	PANTHOR_EXCEPTION(CS_INHERIT_FAULT),
>>> +	PANTHOR_EXCEPTION(INSTR_INVALID_PC),
>>> +	PANTHOR_EXCEPTION(INSTR_INVALID_ENC),
>>> +	PANTHOR_EXCEPTION(INSTR_BARRIER_FAULT),
>>> +	PANTHOR_EXCEPTION(DATA_INVALID_FAULT),
>>> +	PANTHOR_EXCEPTION(TILE_RANGE_FAULT),
>>> +	PANTHOR_EXCEPTION(ADDR_RANGE_FAULT),
>>> +	PANTHOR_EXCEPTION(IMPRECISE_FAULT),
>>> +	PANTHOR_EXCEPTION(OOM),
>>> +	PANTHOR_EXCEPTION(CSF_FW_INTERNAL_ERROR),
>>> +	PANTHOR_EXCEPTION(CSF_RES_EVICTION_TIMEOUT),
>>> +	PANTHOR_EXCEPTION(GPU_BUS_FAULT),
>>> +	PANTHOR_EXCEPTION(GPU_SHAREABILITY_FAULT),
>>> +	PANTHOR_EXCEPTION(SYS_SHAREABILITY_FAULT),
>>> +	PANTHOR_EXCEPTION(GPU_CACHEABILITY_FAULT),
>>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_0),
>>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_1),
>>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_2),
>>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_3),
>>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_4),
>>> +	PANTHOR_EXCEPTION(PERM_FAULT_0),
>>> +	PANTHOR_EXCEPTION(PERM_FAULT_1),
>>> +	PANTHOR_EXCEPTION(PERM_FAULT_2),
>>> +	PANTHOR_EXCEPTION(PERM_FAULT_3),
>>> +	PANTHOR_EXCEPTION(ACCESS_FLAG_1),
>>> +	PANTHOR_EXCEPTION(ACCESS_FLAG_2),
>>> +	PANTHOR_EXCEPTION(ACCESS_FLAG_3),
>>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_IN),
>>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
>>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
>>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
>>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
>>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_0),
>>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_1),
>>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_2),
>>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_3),
>>> +};
>>> +
>>> +const char *panthor_exception_name(struct panthor_device *ptdev, u32 exception_code)
>>> +{
>>> +	if (drm_WARN_ON(&ptdev->base,  
>>
>> I'm not convinced this should be a WARN_ON as I suspect it's probably
>> possible to inject values from user space (although I'm not completely
>> sure on that).
> 
> Normally no (it's something returned by the FW), unless userspace gets
> access to the kernel <-> FW interface, which would be worrisome :-).

I've no idea if it's actually possible, but it feels like it should be
possible to create a firmware synchronisation object with an error code
chosen by the user and possibly then propagate that error code back to
the kernel. It's certainly not trivial though. Either way the WARN is
unnecessary.

>> It's certainly not a driver error as such if we can't
>> decode the value.
> 
> Ack on dropping the WARN_ON().
> 
>>
>>> +			exception_code >= ARRAY_SIZE(panthor_exception_infos) ||
>>> +			!panthor_exception_infos[exception_code].name))
>>> +		return "Unknown exception type";
>>> +
>>> +	return panthor_exception_infos[exception_code].name;
>>> +}
>>> +
>>> +static vm_fault_t panthor_mmio_vm_fault(struct vm_fault *vmf)
>>> +{
>>> +	struct vm_area_struct *vma = vmf->vma;
>>> +	struct panthor_device *ptdev = vma->vm_private_data;
>>> +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
>>> +	unsigned long pfn;
>>> +	pgprot_t pgprot;
>>> +	vm_fault_t ret;
>>> +	bool active;
>>> +	int cookie;
>>> +
>>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
>>> +		return VM_FAULT_SIGBUS;
>>> +
>>> +	mutex_lock(&ptdev->pm.lock);
>>> +	active = atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE;
>>> +
>>> +	switch (id) {
>>> +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
>>> +		if (active)
>>> +			pfn = __phys_to_pfn(ptdev->phys_addr + CSF_GPU_LATEST_FLUSH_ID);
>>> +		else
>>> +			pfn = virt_to_pfn(ptdev->pm.dummy_latest_flush);
>>> +		break;
>>> +
>>> +	default:
>>> +		ret = VM_FAULT_SIGBUS;
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	pgprot = vma->vm_page_prot;
>>> +	if (active)
>>> +		pgprot = pgprot_noncached(pgprot);
>>> +
>>> +	ret = vmf_insert_pfn_prot(vma, vmf->address, pfn, pgprot);
>>> +
>>> +out_unlock:
>>> +	mutex_unlock(&ptdev->pm.lock);
>>> +	drm_dev_exit(cookie);
>>> +	return ret;
>>> +}
>>> +
>>> +static const struct vm_operations_struct panthor_mmio_vm_ops = {
>>> +	.fault = panthor_mmio_vm_fault,
>>> +};
>>> +
>>> +int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *vma)
>>> +{
>>> +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
>>> +
>>> +	switch (id) {
>>> +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
>>> +		if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
>>> +		    (vma->vm_flags & (VM_WRITE | VM_EXEC)))
>>> +			return -EINVAL;
>>> +
>>> +		break;
>>> +
>>> +	default:
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	/* Defer actual mapping to the fault handler. */
>>> +	vma->vm_private_data = ptdev;
>>> +	vma->vm_ops = &panthor_mmio_vm_ops;
>>> +	vm_flags_set(vma,
>>> +		     VM_IO | VM_DONTCOPY | VM_DONTEXPAND |
>>> +		     VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP);
>>> +	return 0;
>>> +}
>>> +
>>> +#ifdef CONFIG_PM
>>> +int panthor_device_resume(struct device *dev)
>>> +{
>>> +	struct panthor_device *ptdev = dev_get_drvdata(dev);
>>> +	int ret, cookie;
>>> +
>>> +	mutex_lock(&ptdev->pm.lock);
>>> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_RESUMING);
>>> +
>>> +	ret = clk_prepare_enable(ptdev->clks.core);
>>> +	if (ret)
>>> +		goto err_unlock;
>>> +
>>> +	ret = clk_prepare_enable(ptdev->clks.stacks);
>>> +	if (ret)
>>> +		goto err_disable_core_clk;
>>> +
>>> +	ret = clk_prepare_enable(ptdev->clks.coregroup);
>>> +	if (ret)
>>> +		goto err_disable_stacks_clk;
>>> +
>>> +	ret = panthor_devfreq_resume(ptdev);
>>> +	if (ret)
>>> +		goto err_disable_coregroup_clk;
>>> +
>>> +	if (panthor_device_is_initialized(ptdev) &&
>>> +	    drm_dev_enter(&ptdev->base, &cookie)) {
>>> +		panthor_gpu_resume(ptdev);
>>> +		panthor_mmu_resume(ptdev);
>>> +		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
>>> +		if (!ret)
>>> +			panthor_sched_resume(ptdev);
>>> +
>>> +		drm_dev_exit(cookie);
>>> +
>>> +		if (ret)
>>> +			goto err_devfreq_suspend;
>>> +	}
>>> +
>>> +	/* Clear all IOMEM mappings pointing to this device after we've
>>> +	 * resumed. This way the fake mappings pointing to the dummy pages
>>> +	 * are removed and the real iomem mapping will be restored on next
>>> +	 * access.
>>> +	 */
>>> +	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
>>> +			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
>>> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);  
>>
>> Is the ordering here correct? I think we need to set ACTIVE before the
>> unmap_mapping_range otherwise there is a (very small) race where user
>> space could fault the page (and get the dummy mapping) before the
>> atomic_set.
> 
> We take the pm.lock in panthor_mmio_vm_fault().
> 
>>
>> Hmm, actually we have the pm.lock, so no this isn't racy. In which case
>> is there a good reason that you're using atomics? I can see two accesses
>> which aren't protected by pm.lock:
>>
>>   * the early out in panthor_device_suspend() - which could easily be
>> moved inside the lock.
> 
> When we're in suspend() we are the one in control of the pm.state, so
> no race expected here.
> 
>>
>>   * panthor_device_schedule_reset() - this looks racy (the power down
>> could happen immediately after the atomic_read()), so I suspect it would
>> be better moving the check into panthor_device_reset_work() and
>> performing it with the pm.lock held.
> 
> I think the main reason for it being an atomic is because I didn't
> have PM locking in the initial implementation, but I ended adding
> locking at some point because I didn't really have choice. I thought
> the race didn't exist because of the workqueue synchronization/work
> cancellation that happens in panthor_sched_suspend(), but I see now
> that it's not protecting us (thread queuing the job could be paused
> just after checking the PM state and resumed after the suspend
> happened). This being said, we might have a lock ordering issue if we
> take the PM lock in that path (I need to check that).

Yeah I didn't bother to check whether it would create ordering issues...
;) I'll leave you to figure out the fix - the whole atomic + mutex was
confusing and doesn't seem to have quite worked.

[...]

>>> +
>>> +/**
>>> + * PANTHOR_IRQ_HANDLER() - Define interrupt handlers and the interrupt
>>> + * registration function.
>>> + *
>>> + * The boiler-plate to gracefully deal with shared interrupts is
>>> + * auto-generated. All you have to do is call PANTHOR_IRQ_HANDLER()
>>> + * just after you actual handler. The handler prototype is:  
>> s/you/your/ or probably s/you/the/ since we don't expect people to be
>> adding more ;)
>>
>>> + *
>>> + * void (*handler)(struct panthor_device *, u32 status);
>>> + */
>>> +#define PANTHOR_IRQ_HANDLER(__name, __reg_prefix, __handler)					\
>>> +static irqreturn_t panthor_ ## __name ## _irq_raw_handler(int irq, void *data)			\
>>> +{												\
>>> +	struct panthor_irq *pirq = data;							\
>>> +	struct panthor_device *ptdev = pirq->ptdev;						\  
>>
>> Maybe I'm missing something, but I was expecting a check here for if the
>> irq has been suspended and to avoid the register reads if it was.
> 
> Thought the INT_MASK=0 + synchronize_irq() in panthor_xxx_irq_suspend()
> would guarantee that the handler can't be called after
> panthor_xxx_irq_suspend() was called.

If the IRQ is shared then Linux doesn't know which device caused the
interrupt, so another device's (shared) interrupt could cause our
handler to be run.

>> Otherwise I'm not entirely sure I follow what all this code is for.
> 
> Not entirely sure which code we're talking about. The reason we
> don't use the default raw IRQ handler is because it doesn't work if the
> irq line is shared. In that case, we need to mask all interrupts to
> make sure other handlers on the same irq line don't get spammed with
> our IRQs.

What I'm not following is why we need all this extra infrastructure for
IRQs. The 'setting the mask to 0' during suspend is simple enough and
could be included in code which now calls panthor_xxx_irq_suspend()
(equally for restoring the mask on resume). But there's loads more
code here.

My initial thought when I looked at this was that you were trying to
solve the issue of a shared IRQ where Mali might get powered off, but
the IRQ is then triggered by another device. In that case touching the
Mali registers would be problematic, so I was expecting some code in
_irq_raw_handler() to check whether the IRQ couldn't possibly be for us
(i.e. mask==0) and early out with IRQ_NONE. kbase has a concept like
this "gpu_powered" for exactly this reason.

But I can't see anything in the code to handle that case. And the
"spamming" of other drivers during suspend shouldn't really happen
(there's something odd going on if the hardware is generating interrupts
when it's meant to be suspended).

But maybe I'm just missing something - it's a while since I've dealt
with interrupt code in Linux.

Steve

>>
>> Steve
>>
>>> +												\
>>> +	if (!gpu_read(ptdev, __reg_prefix ## _INT_STAT))					\
>>> +		return IRQ_NONE;								\
>>> +												\
>>> +	gpu_write(ptdev, __reg_prefix ## _INT_MASK, 0);						\
>>> +	return IRQ_WAKE_THREAD;									\
>>> +}												\
>>> +												\
>>> +static irqreturn_t panthor_ ## __name ## _irq_threaded_handler(int irq, void *data)		\
>>> +{												\
>>> +	struct panthor_irq *pirq = data;							\
>>> +	struct panthor_device *ptdev = pirq->ptdev;						\
>>> +	irqreturn_t ret = IRQ_NONE;								\
>>> +												\
>>> +	while (true) {										\
>>> +		u32 status = gpu_read(ptdev, __reg_prefix ## _INT_RAWSTAT) & pirq->mask;	\
>>> +												\
>>> +		if (!status)									\
>>> +			break;									\
>>> +												\
>>> +		gpu_write(ptdev, __reg_prefix ## _INT_CLEAR, status);				\
>>> +												\
>>> +		__handler(ptdev, status);							\
>>> +		ret = IRQ_HANDLED;								\
>>> +	}											\
>>> +												\
>>> +	if (!atomic_read(&pirq->suspended))							\
>>> +		gpu_write(ptdev, __reg_prefix ## _INT_MASK, pirq->mask);			\
>>> +												\
>>> +	return ret;										\
>>> +}												\
>>> +												\
>>> +static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)			\
>>> +{												\
>>> +	int cookie;										\
>>> +												\
>>> +	atomic_set(&pirq->suspended, true);							\
>>> +												\
>>> +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
>>> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\
>>> +		synchronize_irq(pirq->irq);							\
>>> +		drm_dev_exit(cookie);								\
>>> +	}											\
>>> +												\
>>> +	pirq->mask = 0;										\
>>> +}												\
>>> +												\
>>> +static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\
>>> +{												\
>>> +	int cookie;										\
>>> +												\
>>> +	atomic_set(&pirq->suspended, false);							\
>>> +	pirq->mask = mask;									\
>>> +												\
>>> +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
>>> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_CLEAR, mask);			\
>>> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, mask);			\
>>> +		drm_dev_exit(cookie);								\
>>> +	}											\
>>> +}												\
>>> +												\
>>> +static int panthor_request_ ## __name ## _irq(struct panthor_device *ptdev,			\
>>> +					      struct panthor_irq *pirq,				\
>>> +					      int irq, u32 mask)				\
>>> +{												\
>>> +	pirq->ptdev = ptdev;									\
>>> +	pirq->irq = irq;									\
>>> +	panthor_ ## __name ## _irq_resume(pirq, mask);						\
>>> +												\
>>> +	return devm_request_threaded_irq(ptdev->base.dev, irq,					\
>>> +					 panthor_ ## __name ## _irq_raw_handler,		\
>>> +					 panthor_ ## __name ## _irq_threaded_handler,		\
>>> +					 IRQF_SHARED, KBUILD_MODNAME "-" # __name,		\
>>> +					 pirq);							\
>>> +}
>>> +
>>> +extern struct workqueue_struct *panthor_cleanup_wq;
>>> +
>>> +#endif  
>>
> 


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 04/15] drm/panthor: Add the device logical block
  2023-08-30 13:17       ` Steven Price
@ 2023-08-30 14:06         ` Boris Brezillon
  2023-09-04 11:46         ` Liviu Dudau
  1 sibling, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-30 14:06 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Wed, 30 Aug 2023 14:17:57 +0100
Steven Price <steven.price@arm.com> wrote:

> >>> +static void panthor_device_reset_work(struct work_struct *work)
> >>> +{
> >>> +	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
> >>> +	int ret, cookie;
> >>> +
> >>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> >>> +		return;
> >>> +
> >>> +	panthor_sched_pre_reset(ptdev);
> >>> +	panthor_fw_pre_reset(ptdev, true);
> >>> +	panthor_mmu_pre_reset(ptdev);
> >>> +	panthor_gpu_soft_reset(ptdev);
> >>> +	panthor_gpu_l2_power_on(ptdev);
> >>> +	panthor_mmu_post_reset(ptdev);
> >>> +	ret = panthor_fw_post_reset(ptdev);
> >>> +	if (ret)
> >>> +		goto out;
> >>> +
> >>> +	atomic_set(&ptdev->reset.pending, 0);
> >>> +	panthor_sched_post_reset(ptdev);
> >>> +	drm_dev_exit(cookie);
> >>> +
> >>> +out:
> >>> +	if (ret) {    
> >>
> >> This looks like a race condition too - is there a need for a
> >> drm_dev_exit_and_unplug() function?  
> > 
> > drm_dev_exit() is just releasing the read-lock. drm_dev_unplug()
> > waits for all readers to be done and sets the unplugged value to true.
> > So we only get readers/writer synchronization here, but nothing doing
> > writer/writer sync. I guess the drm core leaves that to drivers, given
> > drm_dev_unplug() is usually called from xxx_driver->remove() hook, on
> > which serialization is guaranteed by the device-model.
> > 
> > TLDR; yes, it's racy, but I don't think drm_dev_exit_and_unplug() would
> > help solve the existing race.  
> 
> Yeah, I hadn't really thought through the reader/writer locks.
> 
> > It's worth noting that we currently have only 2 paths calling
> > panthor_device_unplug(): the platform_driver->remove() hook and the
> > reset worker. Calling drm_dev_unplug() might not be the right thing to
> > do, I just thought it was a good match to reflect the fact the device
> > becomes inaccessible, without adding yet another kind of device-lost
> > field.  
> 
> I quite liked the unplugged approach, it hides the complexities of the
> GPU breaking nicely.
> 
> However I do think this path needs fixing in some way, because of the
> "goto out" we end up calling panthor_device_unplug() while in the
> drm_dev_enter() section. Which, unless I'm mistaken, means
> panthor_device_unplug() will call drm_dev_unplug() in that section -
> which should produce a lockdep warning at the very least, if not an
> actual deadlock.
> 
> Given it's only a read lock - I think simply moving drm_dev_exit() below
> the "out:" label fixes the deadlock without making any races worse.

Oh, yeah, I didn't realize this is what you were complaining about. We
definitely need to move the out label before drm_dev_exit().

> Whether the race here actually matters I'm not sure.

It does if we want to be safe against removal. Maybe what we should do
instead is synchronize the reset work in the platform->remove() path,
and make sure it can't be scheduled after the synchronization happened.
This way we don't have to worry about concurrent calls to
panthor_device_unplug(), and we can keep the existing is_unplugged
check.
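
Very rough sketch of what I have in mind (the reset.blocked field
doesn't exist yet, name to be defined):

	static void panthor_remove(struct platform_device *pdev)
	{
		struct panthor_device *ptdev = platform_get_drvdata(pdev);

		/* Prevent new reset works from being queued and wait for a
		 * pending one to finish, so the reset path can't race with
		 * the unplug below.
		 */
		atomic_set(&ptdev->reset.blocked, 1);
		cancel_work_sync(&ptdev->reset.work);

		panthor_device_unplug(ptdev);
	}

with panthor_device_schedule_reset() checking reset.blocked before
queuing the work.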


> >>> +
> >>> +/**
> >>> + * PANTHOR_IRQ_HANDLER() - Define interrupt handlers and the interrupt
> >>> + * registration function.
> >>> + *
> >>> + * The boiler-plate to gracefully deal with shared interrupts is
> >>> + * auto-generated. All you have to do is call PANTHOR_IRQ_HANDLER()
> >>> + * just after you actual handler. The handler prototype is:    
> >> s/you/your/ or probably s/you/the/ since we don't expect people to be
> >> adding more ;)
> >>  
> >>> + *
> >>> + * void (*handler)(struct panthor_device *, u32 status);
> >>> + */
> >>> +#define PANTHOR_IRQ_HANDLER(__name, __reg_prefix, __handler)					\
> >>> +static irqreturn_t panthor_ ## __name ## _irq_raw_handler(int irq, void *data)			\
> >>> +{												\
> >>> +	struct panthor_irq *pirq = data;							\
> >>> +	struct panthor_device *ptdev = pirq->ptdev;						\    
> >>
> >> Maybe I'm missing something, but I was expecting a check here for if the
> >> irq has been suspended and to avoid the register reads if it was.  
> > 
> > Thought the INT_MASK=0 + synchronize_irq() in panthor_xxx_irq_suspend()
> > would guarantee that the handler can't be called after
> > panthor_xxx_irq_suspend() was called.  
> 
> If the IRQ is shared then Linux doesn't know which device caused the
> interrupt, so another device's (shared) interrupt could cause our
> handler to be run.

Uh, that's correct. We definitely need to check the ->suspended value
before reading the register...
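
Something like this at the top of the raw handler should do, on top of
the existing masking:

	if (atomic_read(&pirq->suspended))
		return IRQ_NONE;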

> 
> >> Otherwise I'm not entirely sure I follow what all this code is for.  
> > 
> > Not entirely sure which code we're talking about. The reason we
> > don't use the default raw IRQ handler is because it doesn't work if the
> > irq line is shared. In that case, we need to mask all interrupts to
> > make sure other handlers on the same irq line don't get spammed with
> > our IRQs.  
> 
> What I'm not following is why we need all this extra infrastructure for
> IRQs. The 'setting the mask to 0' during suspend is simple enough and
> could be included in code which now calls panthor_xxx_irq_suspend()
> (equally for restoring the mask on resume). But there's loads more
> code here.

It's not just setting the mask to 0, but also making sure all pending
interrupts have been processed, otherwise we might have our threaded
handler called after we've supposedly suspended the IRQ (set _INT_MASK
to 0), which might trigger access to HW after it's been suspended. It's
pretty easy to forget to do things in the suspend/resume path, even if
those things are supposed to be trivial/obvious (I learnt it the hard
way). By having helpers, we reduce the risk of making silly mistakes,
and we control that in a central place.
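
Concretely, the suspend side is meant to boil down to something like
this (simplified sketch, MMU instance taken as an example, with the
->suspended flag mentioned above):

static void panthor_mmu_irq_suspend(struct panthor_irq *pirq)
{
	/* Stop the GPU from raising new MMU interrupts... */
	pirq->mask = 0;
	gpu_write(pirq->ptdev, MMU_INT_MASK, 0);

	/* ...and wait for any handler that already started, including
	 * the threaded part, to complete before declaring the IRQ
	 * suspended and letting the caller power things down.
	 */
	synchronize_irq(pirq->irq);
	atomic_set(&pirq->suspended, 1);
}

static void panthor_mmu_irq_resume(struct panthor_irq *pirq, u32 mask)
{
	atomic_set(&pirq->suspended, 0);
	pirq->mask = mask;
	gpu_write(pirq->ptdev, MMU_INT_CLEAR, mask);
	gpu_write(pirq->ptdev, MMU_INT_MASK, mask);
}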

> 
> My initial thought when I looked at this was that you were trying to
> solve the issue of a shared IRQ where Mali might get powered off, but
> the IRQ is then triggered by another device. In that case touching the
> Mali registers would be problematic, so I was expecting some code in
> _irq_raw_handler() to check whether the IRQ couldn't possibly be for us
> (i.e. mask==0) and early out with IRQ_NONE. kbase has a concept like
> this "gpu_powered" for exactly this reason.

Yeah, that was one of the goals, I just didn't really think of the
case where the IRQ line is shared by different devices, not just the
GPU-related irqs being muxed on one line. In this case, the regs
are still accessible until the clk driving the APB interface is
disabled, which happens when all GPU irqs have been suspended.

> 
> But I can't see anything in the code to handle that case. And the
> "spamming" of other drivers during suspend shouldn't really happen
> (there's something odd going on if the hardware is generating interrupts
> when it's meant to be suspended).

The masking is here to make sure we don't receive interrupts after
calling synchronize_irq(). After that point, we just want to ignore any
IRQs until the device is resumed. As for spamming other drivers or
other GPU components, it can still happen between the moment we suspend
the IRQ, and the moment the device is actually shut down.

> 
> But maybe I'm just missing something - it's a while since I've dealt
> with interrupt code in Linux.

No, you're not missing anything, it's really just about irq
synchronization in the suspend/resume path, which might appear trivial
to you, but is very easy to get wrong in subtle ways.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block
  2023-08-29 15:33     ` Boris Brezillon
@ 2023-08-30 14:12       ` Steven Price
  2023-08-30 14:53         ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-08-30 14:12 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 29/08/2023 16:33, Boris Brezillon wrote:
> On Mon, 14 Aug 2023 16:53:09 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>>> +
>>> +/**
>>> + * struct panthor_vm_op_ctx - VM operation context
>>> + *
>>> + * With VM operations potentially taking place in a dma-signaling path, we
>>> + * need to make sure everything that might require resource allocation is
>>> + * pre-allocated upfront. This is what this operation context is for.
>>> + *
>>> + * We also collect resources that have been freed, so we can release them
>>> + * asynchronously, and let the VM_BIND scheduler process the next VM_BIND
>>> + * request.
>>> + */
>>> +struct panthor_vm_op_ctx {
>>> +	/** @rsvd_page_tables: Pages reserved for the MMU page table update. */
>>> +	struct {
>>> +		/** @count: Number of pages reserved. */
>>> +		u32 count;
>>> +
>>> +		/** @ptr: Points to the first unused page in the @pages table. */
>>> +		u32 ptr;
>>> +
>>> +		/**
>>> +		 * @pages: Array of pages that can be used for an MMU page table update.
>>> +		 *
>>> +		 * After a VM operation, there might be free pages left in this array.
>>> +		 * They should be returned to the pt_cache as part of the op_ctx cleanup.
>>> +		 */
>>> +		void **pages;
>>> +	} rsvd_page_tables;  
>>
>> Two questions:
>>
>> 1) Would a mempool simplify the implementation? It looks like a
>> reasonable match.
> 
> Not sure what you mean by mempool,

See include/linux/mempool.h

> but I'm using a kmem_cache here for
> all page table allocations. The pages that are passed to
> panthor_vm_op_ctx::rsvd_page_tables::pages are allocated from this
> pool. It's just that for each VM operation we pre-allocate page-tables,
> and release those that were not used when the operation is done (we
> over-allocate for the worst case scenario).

The mempool could, potentially, replace the rsvd_page_tables structure.
The kmem_cache you would still want as that's per-driver.
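
Roughly what I had in mind, sitting on top of the existing pt_cache
(untested sketch - everything except the mempool API is made up, and
locking/error handling are omitted):

#include <linux/mempool.h>

static mempool_t *pt_pool;		/* reserve on top of pt_cache */
static u32 pt_pool_reserved;		/* worst-case pages for queued ops */

static int panthor_pt_pool_init(struct kmem_cache *pt_cache)
{
	/* Elements still come from (and go back to) pt_cache, the pool
	 * just guarantees a minimum number of them is always available.
	 */
	pt_pool = mempool_create_slab_pool(32, pt_cache);
	return pt_pool ? 0 : -ENOMEM;
}

/* Called when a VM_BIND job is queued: grow the guaranteed reserve
 * instead of stashing worst-case pages in each op_ctx.
 */
static int panthor_pt_pool_reserve(u32 worst_case_pages)
{
	pt_pool_reserved += worst_case_pages;
	return mempool_resize(pt_pool, pt_pool_reserved);
}

/* And the opposite when the operation completes. */
static void panthor_pt_pool_unreserve(u32 worst_case_pages)
{
	pt_pool_reserved -= worst_case_pages;
	mempool_resize(pt_pool, max_t(u32, pt_pool_reserved, 32));
}

Page-table allocations would then go through mempool_alloc(pt_pool,
GFP_KERNEL)/mempool_free() rather than hitting pt_cache directly.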

>>
>> 2) Does it really make sense to have a separate pool of memory for every
>> operation? Instead of having a separate pool for each operation, it
>> would be possible to just keep track of the total number needed for all
>> outstanding operations. Then a single (per device or maybe per-VM if
>> necessary) mempool could be resized to ensure it has the right amount of
>> space.
> 
> The pool is per-driver (see the global pt_cache). rsvd_page_tables just
> holds pages needed for a specific VM operation. To be more specific, it
> holds pages for the worst case (page table tree is empty, except for the
> root page table).

What I'm wondering is whether we need to keep the pages for each operation in
separate pools. So instead of having a rsvd_page_tables for each
operation, can we have one global one which is sized appropriately for
all operations that are in flight for the device? The operations are
serialized so there's no contention. Or at least a per-VM pool if we can
operate on multiple VMs at once.

>>
>> I'm also a little wary that the VM_BIND infrastructure could potentially
>> be abused to trigger a large amount of kernel allocation as it allocates
>> up-front for the worst case but those pages are not charged to the
>> process (AFAICT). But I haven't fully got my head round that yet.
> 
> Yep, that's problematic, indeed. I considered allocating page tables
> as GEM objects, but the overhead of a GEM object is quite big
> (hundreds of bytes of meta-data) compared to the size of a page table
> (4k), and kmem_cache was just super convenient for this page table
> cache :-).

I think page tables as GEM objects is likely to be overkill, we
obviously also have to be careful not to allow user space to get access
to the contents - whereas GEM objects are usually to provide user space
access ;) I'm not sure quite what the best solution here is, clearly one
'solution' is to just cap the number of outstanding VM_BINDs.

>>
>>> +
>>> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags. */
>>> +	u32 flags;
>>> +
>>> +	/** @va: Virtual range targeted by the VM operation. */
>>> +	struct {
>>> +		/** @addr: Start address. */
>>> +		u64 addr;
>>> +
>>> +		/** @range: Range size. */
>>> +		u64 range;
>>> +	} va;
>>> +
>>> +	/**
>>> +	 * @returned_vmas: List of panthor_vma objects returned after a VM operation.
>>> +	 *
>>> +	 * For unmap operations, this will contain all VMAs that were covered by the
>>> +	 * specified VA range.
>>> +	 *
>>> +	 * For map operations, this will contain all VMAs that previously mapped to
>>> +	 * the specified VA range.
>>> +	 *
>>> +	 * Those VMAs, and the resources they point to will be released as part of
>>> +	 * the op_ctx cleanup operation.
>>> +	 */
>>> +	struct list_head returned_vmas;
>>> +
>>> +	/** @map: Fields specific to a map operation. */
>>> +	struct {
>>> +		/** @gem: GEM object information. */
>>> +		struct {
>>> +			/** @obj: GEM object to map. */
>>> +			struct drm_gem_object *obj;
>>> +
>>> +			/** @offset: Offset in the GEM object. */
>>> +			u64 offset;
>>> +		} gem;
>>> +
>>> +		/**
>>> +		 * @sgt: sg-table pointing to pages backing the GEM object.
>>> +		 *
>>> +		 * This is gathered at job creation time, such that we don't have
>>> +		 * to allocate in ::run_job().
>>> +		 */
>>> +		struct sg_table *sgt;
>>> +
>>> +		/**
>>> +		 * @prev_vma: Pre-allocated VMA object to deal with a remap situation.
>>> +		 *
>>> +		 * If the map request covers a region that's inside another VMA, the
>>> +		 * previous VMA will be split, requiring instantiation of a maximum of
>>> +		 * two new VMA objects.
>>> +		 */
>>> +		struct panthor_vma *prev_vma;
>>> +
>>> +		/**
>>> +		 * @new_vma: The new VMA object that will be inserted to the VA tree.
>>> +		 */
>>> +		struct panthor_vma *new_vma;
>>> +
>>> +		/**
>>> +		 * @next_vma: Pre-allocated VMA object to deal with a remap situation.
>>> +		 *
>>> +		 * See @prev_vma.
>>> +		 */
>>> +		struct panthor_vma *next_vma;  
>>
>> It's probably premature optimization, but it feels like having a cache
>> of these VMA structures might be an idea.
> 
> If it's needed, I'll probably go for a kmem_cache, but I need to
> check if it's worth it first (if the closest kmalloc cache is
> significantly bigger than the struct size).
> 
>> I'm also struggling to
>> understand how both a new prev and new next VMA are needed - but I
>> haven't dug into the GPU VA manager.
> 
> prev/next are for mapping splits: an object is already mapped, and a new
> object is mapped in the middle of this pre-existing mapping. In that
> case, we need 2 VMA objects for the preceding and succeeding mappings,
> since the old mapping object will be released.
> 
> new_vma is for the new mapping.

Yeah, looking into the GPU VA manager I see now. My problem was that I
assumed in the case of a split one of the original mappings would simply
be resized, so you'd only need one new VMA (plus the one being added).
But AFAICT that resize doesn't happen and instead new VMAs are created.
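
So, for my own benefit, the picture is roughly:

   before:        [------------- old VMA -------------]
   map request:              [  new_vma  ]
   after remap:   [ prev_vma ][  new_vma  ][ next_vma ]

with the old VMA unmapped and released, which is why up to three
objects have to be pre-allocated per map operation.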

>>
>>> +	} map;
>>> +};
>>> +
> 
> [...]
> 
>>> +/**
>>> + * panthor_vm_active() - Flag a VM as active
>>> + * @VM: VM to flag as active.
>>> + *
>>> + * Assigns an address space to a VM so it can be used by the GPU/MCU.
>>> + *
>>> + * Return: 0 on success, a negative error code otherwise.
>>> + */
>>> +int panthor_vm_active(struct panthor_vm *vm)
>>> +{
>>> +	struct panthor_device *ptdev = vm->ptdev;
>>> +	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg;
>>> +	int ret = 0, as, cookie;
>>> +	u64 transtab, transcfg;
>>> +
>>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
>>> +		return -ENODEV;
>>> +
>>> +	mutex_lock(&ptdev->mmu->as.slots_lock);
>>> +
>>> +	as = vm->as.id;
>>> +	if (as >= 0) {
>>> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
>>> +
>>> +		if (ptdev->mmu->as.faulty_mask & mask) {
>>> +			/* Unhandled pagefault on this AS, the MMU was
>>> +			 * disabled. We need to re-enable the MMU after
>>> +			 * clearing+unmasking the AS interrupts.
>>> +			 */
>>> +			gpu_write(ptdev, MMU_INT_CLEAR, mask);
>>> +			ptdev->mmu->as.faulty_mask &= ~mask;
>>> +			gpu_write(ptdev, MMU_INT_MASK, ~ptdev->mmu->as.faulty_mask);
>>> +			goto out_enable_as;
>>> +		}
>>> +
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	/* Check for a free AS */
>>> +	if (vm->for_mcu) {
>>> +		drm_WARN_ON(&ptdev->base, ptdev->mmu->as.alloc_mask & BIT(0));
>>> +		as = 0;
>>> +	} else {
>>> +		as = ffz(ptdev->mmu->as.alloc_mask | BIT(0));
>>> +	}
>>> +
>>> +	if (!(BIT(as) & ptdev->gpu_info.as_present)) {
>>> +		struct panthor_vm *lru_vm;
>>> +
>>> +		lru_vm = list_first_entry_or_null(&ptdev->mmu->as.lru_list,
>>> +						  struct panthor_vm,
>>> +						  as.lru_node);
>>> +		if (drm_WARN_ON(&ptdev->base, !lru_vm)) {
>>> +			ret = -EBUSY;
>>> +			goto out_unlock;
>>> +		}
>>> +
>>> +		list_del_init(&lru_vm->as.lru_node);
>>> +		as = lru_vm->as.id;  
>>
>> Should this not set lru_vm->as.id = -1, so that the code knows the VM no
>> longer has an address space?
> 
> Good catch!
> 
>>
>>> +	} else {
>>> +		set_bit(as, &ptdev->mmu->as.alloc_mask);
>>> +	}
>>> +
>>> +	/* Assign the free or reclaimed AS to the FD */
>>> +	vm->as.id = as;
>>> +	ptdev->mmu->as.slots[as].vm = vm;
>>> +
>>> +out_enable_as:
>>> +	transtab = cfg->arm_lpae_s1_cfg.ttbr;
>>> +	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
>>> +		   AS_TRANSCFG_PTW_RA |
>>> +		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
>>> +	if (ptdev->coherent)
>>> +		transcfg |= AS_TRANSCFG_PTW_SH_OS;
>>> +
>>> +	ret = panthor_mmu_as_enable(vm->ptdev, vm->as.id, transtab, transcfg, vm->memattr);
>>> +
>>> +out_unlock:
>>> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
>>> +	drm_dev_exit(cookie);
>>> +	return ret;
>>> +}
>>> +
> 
> [...]
> 
>>> +
>>> +static void panthor_mmu_irq_handler(struct panthor_device *ptdev, u32 status)
>>> +{
>>> +	status = panthor_mmu_fault_mask(ptdev, status);
>>> +	while (status) {
>>> +		u32 as = ffs(status | (status >> 16)) - 1;
>>> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
>>> +		u32 new_int_mask;
>>> +		u64 addr;
>>> +		u32 fault_status;
>>> +		u32 exception_type;
>>> +		u32 access_type;
>>> +		u32 source_id;
>>> +
>>> +		fault_status = gpu_read(ptdev, AS_FAULTSTATUS(as));
>>> +		addr = gpu_read(ptdev, AS_FAULTADDRESS_LO(as));
>>> +		addr |= (u64)gpu_read(ptdev, AS_FAULTADDRESS_HI(as)) << 32;
>>> +
>>> +		/* decode the fault status */
>>> +		exception_type = fault_status & 0xFF;
>>> +		access_type = (fault_status >> 8) & 0x3;
>>> +		source_id = (fault_status >> 16);
>>> +
>>> +		/* Page fault only */  
>>
>> This comment makes no sense - it looks like it's copied over from panfrost.
> 
> Uh, it made sense before I dropped map/alloc-on-fault :-).

:)

>>
>> If I understand correctly we don't (currently) support growing on page
>> fault - and it's not really needed now the MCU can handle the tiler heaps.
> 
> Exactly. Map/alloc on fault is a bit challenging because of the whole
> 'we have to guarantee that a job is done in finite time, and we must
> make sure fence signaling is not blocked on allocation'. Given
> drm_gem_get_pages() doesn't do non-blocking allocations, I thought it'd
> be preferable to postpone map-on-fault until we actually decide we need
> it. Note that i915 seems to have some sort of non-blocking page
> allocator in shmem_sg_alloc_table()[1].

Agreed, the intention is definitely to move away from map/alloc-on-fault
- handling page faults from the GPU on the CPU is expensive even without
the can-of-worms of fence signalling.

Steve

>>
>>> +		mutex_lock(&ptdev->mmu->as.slots_lock);
>>> +
>>> +		new_int_mask =
>>> +			panthor_mmu_fault_mask(ptdev, ~ptdev->mmu->as.faulty_mask);
>>> +
>>> +		/* terminal fault, print info about the fault */
>>> +		drm_err(&ptdev->base,
>>> +			"Unhandled Page fault in AS%d at VA 0x%016llX\n"
>>> +			"raw fault status: 0x%X\n"
>>> +			"decoded fault status: %s\n"
>>> +			"exception type 0x%X: %s\n"
>>> +			"access type 0x%X: %s\n"
>>> +			"source id 0x%X\n",
>>> +			as, addr,
>>> +			fault_status,
>>> +			(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
>>> +			exception_type, panthor_exception_name(ptdev, exception_type),
>>> +			access_type, access_type_name(ptdev, fault_status),
>>> +			source_id);
>>> +
>>> +		/* Ignore MMU interrupts on this AS until it's been
>>> +		 * re-enabled.
>>> +		 */
>>> +		ptdev->mmu->irq.mask = new_int_mask;
>>> +		gpu_write(ptdev, MMU_INT_MASK, new_int_mask);
>>> +
>>> +		/* Disable the MMU to kill jobs on this AS. */
>>> +		panthor_mmu_as_disable(ptdev, as);
>>> +		mutex_unlock(&ptdev->mmu->as.slots_lock);
>>> +
>>> +		status &= ~mask;
>>> +	}
>>> +}
>>> +PANTHOR_IRQ_HANDLER(mmu, MMU, panthor_mmu_irq_handler);
>>> +
> 
> [...]
> 
>>> +
>>> +/**
>>> + * panthor_mmu_unplug() - Unplug the MMU logic
>>> + * @ptdev: Device.
>>> + *
>>> + * No access to the MMU regs should be done after this function is called.
>>> + * We suspend the IRQ and disable all VMs to guarantee that.
>>> + */
>>> +void panthor_mmu_unplug(struct panthor_device *ptdev)
>>> +{
>>> +	if (ptdev->mmu->irq.irq > 0)  
>>
>> In what situation is this not true? AFAICT the driver probe will fail if
>> the IRQ can't be obtained.
> 
> Right, I'll drop this test.
> 
> [1]https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/i915/gem/i915_gem_shmem.c#L63


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block
  2023-08-30 14:12       ` Steven Price
@ 2023-08-30 14:53         ` Boris Brezillon
  2023-08-30 15:55           ` Steven Price
  0 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-08-30 14:53 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Wed, 30 Aug 2023 15:12:43 +0100
Steven Price <steven.price@arm.com> wrote:

> On 29/08/2023 16:33, Boris Brezillon wrote:
> > On Mon, 14 Aug 2023 16:53:09 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >>> +
> >>> +/**
> >>> + * struct panthor_vm_op_ctx - VM operation context
> >>> + *
> >>> + * With VM operations potentially taking place in a dma-signaling path, we
> >>> + * need to make sure everything that might require resource allocation is
> >>> + * pre-allocated upfront. This is what this operation context is for.
> >>> + *
> >>> + * We also collect resources that have been freed, so we can release them
> >>> + * asynchronously, and let the VM_BIND scheduler process the next VM_BIND
> >>> + * request.
> >>> + */
> >>> +struct panthor_vm_op_ctx {
> >>> +	/** @rsvd_page_tables: Pages reserved for the MMU page table update. */
> >>> +	struct {
> >>> +		/** @count: Number of pages reserved. */
> >>> +		u32 count;
> >>> +
> >>> +		/** @ptr: Points to the first unused page in the @pages table. */
> >>> +		u32 ptr;
> >>> +
> >>> +		/**
> >>> +		 * @pages: Array of pages that can be used for an MMU page table update.
> >>> +		 *
> >>> +		 * After a VM operation, there might be free pages left in this array.
> >>> +		 * They should be returned to the pt_cache as part of the op_ctx cleanup.
> >>> +		 */
> >>> +		void **pages;
> >>> +	} rsvd_page_tables;    
> >>
> >> Two questions:
> >>
> >> 1) Would a mempool simplify the implementation? It looks like a
> >> reasonable match.  
> > 
> > Not sure what you mean by mempool,  
> 
> See include/linux/mempool.h

Oh, okay.

> 
> > but I'm using a kmem_cache here for
> > all page table allocations. The pages that are passed to
> > panthor_vm_op_ctx::rsvd_page_tables::pages are allocated from this
> > pool. It's just that for each VM operation we pre-allocate page-tables,
> > and release those that were not used when the operation is done (we
> > over-allocate for the worst case scenario).  
> 
> The mempool could, potentially, replace the rsvd_page_tables structure.
> The kmem_cache you would still want as that's per-driver.

Need to have a closer look at the API to make up my mind, but at first
glance it seems to be overkill for what I initially had in mind.

> 
> >>
> >> 2) Does it really make sense to have a separate pool of memory for every
> >> operation? Instead of having a separate pool for each operation, it
> >> would be possible to just keep track of the total number needed for all
> >> outstanding operations. Then a single (per device or maybe per-VM if
> >> necessary) mempool could be resized to ensure it has the right amount of
> >> space.  
> > 
> > The pool is per-driver (see the global pt_cache). rsvd_page_tables just
> > holds pages needed for a specific VM operation. To be more specific, it
> > holds pages for the worst case (page table tree is empty, except for the
> > root page table).  
> 
> What I'm wondering is whether we need to keep the pages for each operation in
> separate pools.

I was not really considering it a pool, more a set of pages that will
be used by the VM operation, some of them being returned to the
kmem_cache pool if we end up using less (over-provisioning). If we have
a mempool, say, at the VM level, that means we have 2 levels of caching:
the kmem_cache itself, and the mempool attached to the VM. Is there any
benefit here? Do we expect kmem_cache to be too slow for fast/already
allocated pages?

I do see how over-provisioning can cause us to allocate a lot of pages
that end up being unused, but I fail to see how VM/device level caching
would solve that, because we still have to dequeue some operations to
return pages to the intermediate pool, at which point, we've already
lost, because already queued operations reserved the amount of pages
they thought they needed for the worst case scenario.
Operations being queued after that can pick from the returned pages of
course, but that's already the case right now, because we return pages
to the kmem_cache as soon as we're done executing a VM operation.

The only thing that might help is limiting the number of in-flight
VM_BIND jobs per VM (or globally), and then have the submit path return
EBUSY or EAGAIN so the userspace driver knows it has to retry at a later
time.
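
Something as simple as this would probably be enough (sketch; the
field name and the limit are completely made up):

#define PANTHOR_MAX_INFLIGHT_VM_BINDS	32

/* Called from the VM_BIND submit path, before any pre-allocation. */
static int panthor_vm_bind_throttle(struct panthor_vm *vm)
{
	if (atomic_inc_return(&vm->inflight_bind_jobs) >
	    PANTHOR_MAX_INFLIGHT_VM_BINDS) {
		atomic_dec(&vm->inflight_bind_jobs);
		return -EBUSY;
	}

	return 0;
}

with the counter decremented when the job completes and its op_ctx is
cleaned up.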

> So instead of having a rsvd_page_tables for each
> operation, can we have one global one which is sized appropriately for
> all operations that are in flight for the device? The operations are
> serialized so there's no contention. Or at least a per-VM pool if we can
> operate on multiple VMs at once.

We can operate on multiple VMs at once (VM is basically your VkDevice,
AKA the logical device), but I'm not too worried about the
synchronization that would be incurred by the caching at the
panthor_device level. I'm just curious to know what value it would add.
I'm also worried that it makes the reservation logic more complex:
we need to track what's still reserved and how many pages can be
re-used because they were never reclaimed by the VM operation that had
reserved it. Also didn't check how mempool plays with memory reclaim,
but if the memory in a mempool is not reclaimable, that might be
another problem.

To sum up, I'd really like to refrain from adding an intermediate cache
until the per-driver kmem cache is proven to be too slow.

> 
> >>
> >> I'm also a little wary that the VM_BIND infrastructure could potentially
> >> be abused to trigger a large amount of kernel allocation as it allocates
> >> up-front for the worst case but those pages are not charged to the
> >> process (AFAICT). But I haven't fully got my head round that yet.  
> > 
> > Yep, that's problematic, indeed. I considered allocating page tables
> > as GEM objects, but the overhead of a GEM object is quite big
> > (hundreds of bytes of meta-data) compared to the size of a page table
> > (4k), and kmem_cache was just super convenient for this page table
> > cache :-).  
> 
> I think page tables as GEM objects is likely to be overkill,

I agree it's overkill/not easy to interface with io_pgtbl, but there
was one aspect I was interested in, more than the memory accounting:
being able to reclaim page tables the same way we reclaim regular GEMs,
which would simplify the shrinker/reclaim logic. After discussing it
with Robin, I realized it was pretty much useless, because reclaiming
the GEM will also tear down all VM mappings, which will return the page
table memory to the kmem_cache, and then the kmem_cache layer can
reclaim it.

> we
> obviously also have to be careful not to allow user space to get access
> to the contents - whereas GEM objects are usually to provide user space
> access ;).

GEMs can be hidden from userspace if we want. We are the ones in control
of the mmap() (I think there's a BO flag for preventing user access
already).

> I'm not sure quite what the best solution here is, clearly one
> 'solution' is to just cap the number of outstanding VM_BINDs.

I think that'd make sense.

> 
> >>  
> >>> +
> >>> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags. */
> >>> +	u32 flags;
> >>> +
> >>> +	/** @va: Virtual range targeted by the VM operation. */
> >>> +	struct {
> >>> +		/** @addr: Start address. */
> >>> +		u64 addr;
> >>> +
> >>> +		/** @range: Range size. */
> >>> +		u64 range;
> >>> +	} va;
> >>> +
> >>> +	/**
> >>> +	 * @returned_vmas: List of panthor_vma objects returned after a VM operation.
> >>> +	 *
> >>> +	 * For unmap operations, this will contain all VMAs that were covered by the
> >>> +	 * specified VA range.
> >>> +	 *
> >>> +	 * For map operations, this will contain all VMAs that previously mapped to
> >>> +	 * the specified VA range.
> >>> +	 *
> >>> +	 * Those VMAs, and the resources they point to will be released as part of
> >>> +	 * the op_ctx cleanup operation.
> >>> +	 */
> >>> +	struct list_head returned_vmas;
> >>> +
> >>> +	/** @map: Fields specific to a map operation. */
> >>> +	struct {
> >>> +		/** @gem: GEM object information. */
> >>> +		struct {
> >>> +			/** @obj: GEM object to map. */
> >>> +			struct drm_gem_object *obj;
> >>> +
> >>> +			/** @offset: Offset in the GEM object. */
> >>> +			u64 offset;
> >>> +		} gem;
> >>> +
> >>> +		/**
> >>> +		 * @sgt: sg-table pointing to pages backing the GEM object.
> >>> +		 *
> >>> +		 * This is gathered at job creation time, such that we don't have
> >>> +		 * to allocate in ::run_job().
> >>> +		 */
> >>> +		struct sg_table *sgt;
> >>> +
> >>> +		/**
> >>> +		 * @prev_vma: Pre-allocated VMA object to deal with a remap situation.
> >>> +		 *
> >>> +		 * If the map request covers a region that's inside another VMA, the
> >>> +		 * previous VMA will be split, requiring instantiation of a maximum of
> >>> +		 * two new VMA objects.
> >>> +		 */
> >>> +		struct panthor_vma *prev_vma;
> >>> +
> >>> +		/**
> >>> +		 * @new_vma: The new VMA object that will be inserted to the VA tree.
> >>> +		 */
> >>> +		struct panthor_vma *new_vma;
> >>> +
> >>> +		/**
> >>> +		 * @next_vma: Pre-allocated VMA object to deal with a remap situation.
> >>> +		 *
> >>> +		 * See @prev_vma.
> >>> +		 */
> >>> +		struct panthor_vma *next_vma;    
> >>
> >> It's probably premature optimization, but it feels like having a cache
> >> of these VMA structures might be an idea.  
> > 
> > If it's needed, I'll probably go for a kmem_cache, but I need to
> > check if it's worth it first (if the closest kmalloc cache is
> > significantly bigger than the struct size).
> >   
> >> I'm also struggling to
> >> understand how both a new prev and new next VMA are needed - but I
> >> haven't dug into the GPU VA manager.  
> > 
> > prev/next are for mapping splits: an object is already mapped, and a new
> > object is mapped in the middle of this pre-existing mapping. In that
> > case, we need 2 VMA objects for the preceding and succeeding mappings,
> > since the old mapping object will be released.
> > 
> > new_vma is for the new mapping.  
> 
> Yeah, looking into the GPU VA manager I see now. My problem was that I
> assumed in the case of a split one of the original mappings would simply
> be resized, so you'd only need one new VMA (plus the one being added).
> But AFAICT that resize doesn't happen and instead new VMAs are created.

Yes. On the other hand, if we have a kmem_cache for panthor_vma
objects, that shouldn't make a big difference.

> 
> >>  
> >>> +	} map;
> >>> +};
> >>> +  
> > 
> > [...]
> >   
> >>> +/**
> >>> + * panthor_vm_active() - Flag a VM as active
> >>> + * @VM: VM to flag as active.
> >>> + *
> >>> + * Assigns an address space to a VM so it can be used by the GPU/MCU.
> >>> + *
> >>> + * Return: 0 on success, a negative error code otherwise.
> >>> + */
> >>> +int panthor_vm_active(struct panthor_vm *vm)
> >>> +{
> >>> +	struct panthor_device *ptdev = vm->ptdev;
> >>> +	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg;
> >>> +	int ret = 0, as, cookie;
> >>> +	u64 transtab, transcfg;
> >>> +
> >>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> >>> +		return -ENODEV;
> >>> +
> >>> +	mutex_lock(&ptdev->mmu->as.slots_lock);
> >>> +
> >>> +	as = vm->as.id;
> >>> +	if (as >= 0) {
> >>> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> >>> +
> >>> +		if (ptdev->mmu->as.faulty_mask & mask) {
> >>> +			/* Unhandled pagefault on this AS, the MMU was
> >>> +			 * disabled. We need to re-enable the MMU after
> >>> +			 * clearing+unmasking the AS interrupts.
> >>> +			 */
> >>> +			gpu_write(ptdev, MMU_INT_CLEAR, mask);
> >>> +			ptdev->mmu->as.faulty_mask &= ~mask;
> >>> +			gpu_write(ptdev, MMU_INT_MASK, ~ptdev->mmu->as.faulty_mask);
> >>> +			goto out_enable_as;
> >>> +		}
> >>> +
> >>> +		goto out_unlock;
> >>> +	}
> >>> +
> >>> +	/* Check for a free AS */
> >>> +	if (vm->for_mcu) {
> >>> +		drm_WARN_ON(&ptdev->base, ptdev->mmu->as.alloc_mask & BIT(0));
> >>> +		as = 0;
> >>> +	} else {
> >>> +		as = ffz(ptdev->mmu->as.alloc_mask | BIT(0));
> >>> +	}
> >>> +
> >>> +	if (!(BIT(as) & ptdev->gpu_info.as_present)) {
> >>> +		struct panthor_vm *lru_vm;
> >>> +
> >>> +		lru_vm = list_first_entry_or_null(&ptdev->mmu->as.lru_list,
> >>> +						  struct panthor_vm,
> >>> +						  as.lru_node);
> >>> +		if (drm_WARN_ON(&ptdev->base, !lru_vm)) {
> >>> +			ret = -EBUSY;
> >>> +			goto out_unlock;
> >>> +		}
> >>> +
> >>> +		list_del_init(&lru_vm->as.lru_node);
> >>> +		as = lru_vm->as.id;    
> >>
> >> Should this not set lru_vm->as.id = -1, so that the code knows the VM no
> >> longer has an address space?  
> > 
> > Good catch!
> >   
> >>  
> >>> +	} else {
> >>> +		set_bit(as, &ptdev->mmu->as.alloc_mask);
> >>> +	}
> >>> +
> >>> +	/* Assign the free or reclaimed AS to the FD */
> >>> +	vm->as.id = as;
> >>> +	ptdev->mmu->as.slots[as].vm = vm;
> >>> +
> >>> +out_enable_as:
> >>> +	transtab = cfg->arm_lpae_s1_cfg.ttbr;
> >>> +	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
> >>> +		   AS_TRANSCFG_PTW_RA |
> >>> +		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
> >>> +	if (ptdev->coherent)
> >>> +		transcfg |= AS_TRANSCFG_PTW_SH_OS;
> >>> +
> >>> +	ret = panthor_mmu_as_enable(vm->ptdev, vm->as.id, transtab, transcfg, vm->memattr);
> >>> +
> >>> +out_unlock:
> >>> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> >>> +	drm_dev_exit(cookie);
> >>> +	return ret;
> >>> +}
> >>> +  
> > 
> > [...]
> >   
> >>> +
> >>> +static void panthor_mmu_irq_handler(struct panthor_device *ptdev, u32 status)
> >>> +{
> >>> +	status = panthor_mmu_fault_mask(ptdev, status);
> >>> +	while (status) {
> >>> +		u32 as = ffs(status | (status >> 16)) - 1;
> >>> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> >>> +		u32 new_int_mask;
> >>> +		u64 addr;
> >>> +		u32 fault_status;
> >>> +		u32 exception_type;
> >>> +		u32 access_type;
> >>> +		u32 source_id;
> >>> +
> >>> +		fault_status = gpu_read(ptdev, AS_FAULTSTATUS(as));
> >>> +		addr = gpu_read(ptdev, AS_FAULTADDRESS_LO(as));
> >>> +		addr |= (u64)gpu_read(ptdev, AS_FAULTADDRESS_HI(as)) << 32;
> >>> +
> >>> +		/* decode the fault status */
> >>> +		exception_type = fault_status & 0xFF;
> >>> +		access_type = (fault_status >> 8) & 0x3;
> >>> +		source_id = (fault_status >> 16);
> >>> +
> >>> +		/* Page fault only */    
> >>
> >> This comment makes no sense - it looks like it's copied over from panfrost.  
> > 
> > Uh, it made sense before I dropped map/alloc-on-fault :-).  
> 
> :)
> 
> >>
> >> If I understand correctly we don't (currently) support growing on page
> >> fault - and it's not really needed now the MCU can handle the tiler heaps.  
> > 
> > Exactly. Map/alloc on fault is a bit challenging because of the whole
> > 'we have to guarantee that a job is done in finite time, and we must
> > make sure fence signaling is not blocked on allocation'. Given
> > drm_gem_get_pages() doesn't do non-blocking allocations, I thought it'd
> > be preferable to postpone map-on-fault until we actually decide we need
> > it. Note that i915 seems to have some sort of non-blocking page
> > allocator in shmem_sg_alloc_table()[1].  
> 
> Agreed, the intention is definitely to move away from map/alloc-on-fault
> - handling page faults from the GPU on the CPU is expensive even without
> the can-of-worms of fence signalling.

Yeah, I agree, but I'd bet on Khronos members being inventive enough
to come up with a use case for this map/alloc-on-fault feature :-).
Anyway, that's not something we have to worry about just yet.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 09/15] drm/panthor: Add the FW logical block
  2023-08-29 16:15     ` Boris Brezillon
@ 2023-08-30 15:20       ` Steven Price
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-30 15:20 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 29/08/2023 17:15, Boris Brezillon wrote:
> On Wed, 16 Aug 2023 17:01:56 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 09/08/2023 17:53, Boris Brezillon wrote:

[...]

>>> +/**
>>> + * panthor_fw_mem_alloc() - Allocate a FW memory object and map it to the MCU VM.
>>> + * @ptdev: Device.
>>> + * @size: Size of the memory block.
>>> + * @bo_flags: BO flags.
>>> + * @vm_map_flags: VM_MAP flags.
>>> + * @va: Virtual address of the MCU mapping.
>>> + * Set to PANTHOR_GEM_ALLOC_VA for automatic VA-assignment. In that case, the
>>> + * VA will be allocated in the shared VA space.
>>> + *
>>> + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
>>> + */
>>> +static struct panthor_fw_mem *
>>> +panthor_fw_mem_alloc(struct panthor_device *ptdev, size_t size,
>>> +		     u32 bo_flags, u32 vm_map_flags, u64 va)
>>> +{
>>> +	struct panthor_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>>> +	int ret;
>>> +
>>> +	if (!mem)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	mem->bo = panthor_gem_create_and_map(ptdev, ptdev->fw->vm,
>>> +					     size, bo_flags, vm_map_flags,
>>> +					     &va, NULL);
>>> +	if (IS_ERR(mem->bo)) {
>>> +		ret = PTR_ERR(mem->bo);
>>> +		mem->bo = NULL;
>>> +		goto err_free_mem;
>>> +	}
>>> +
>>> +	mem->va = va;
>>> +	return mem;
>>> +
>>> +err_free_mem:
>>> +	panthor_fw_mem_free(ptdev, mem);
>>> +	return ERR_PTR(ret);  
>>
>> The error handling seems more complex than needed, how about:
>>
>> 	struct panthor_fw_mem *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
>> 	struct panthor_gem_object *bo;
>> 	int ret;
>>
>> 	if (!mem)
>> 		return ERR_PTR(-ENOMEM);
>>
>> 	bo = panthor_gem_create_and_map(ptdev, ptdev->fw->vm,
>> 					size, bo_flags, vm_map_flags,
>> 					&va, NULL);
>>
>> 	if (IS_ERR(bo)) {
>> 		kfree(mem);
>> 		return ERR_CAST(bo);
>> 	}
>>
>> 	mem->bo = bo;
>> 	mem->va = va;
>> 	return mem;
>> 	
>> Which I think also means we don't need the "if (mem->bo)" case in
>> panthor_fw_mem_free().
> 
> Not so sure about that one. I've been adding code to existing functions
> and having a structured error path, with free functions that can deal
> with partially initialized objects, makes code additions less error-prone.
> I agree on the local bo variable to avoid mem->bo re-initialization
> though.

Yeah the "free accepting NULL" style is generally a good one, so leaving
the NULL check in panthor_fw_mem_free() is fine. It was just in this
case having to explicitly assign NULL before the call to
panthor_fw_mem_free() looked ugly.

>>
>>> +}
>>> +
> 
> [...]
> 
>>> +/**
>>> + * panthor_fw_alloc_suspend_buf_mem() - Allocate a suspend buffer for a command stream group.
>>> + * @ptdev: Device.
>>> + * @size: Size of the suspend buffer.
>>> + *
>>> + * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
>>> + */
>>> +struct panthor_fw_mem *
>>> +panthor_fw_alloc_suspend_buf_mem(struct panthor_device *ptdev, size_t size)
>>> +{
>>> +	return panthor_fw_mem_alloc(ptdev, size,
>>> +				    DRM_PANTHOR_BO_NO_MMAP,
>>> +				    DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC,
>>> +				    PANTHOR_GEM_ALLOC_VA);
>>> +}
>>> +
>>> +static int panthor_fw_load_section_entry(struct panthor_device *ptdev,
>>> +					 const struct firmware *fw,
>>> +					 struct panthor_fw_binary_iter *iter,
>>> +					 u32 ehdr)
>>> +{
>>> +	struct panthor_fw_binary_section_entry_hdr hdr;
>>> +	struct panthor_fw_section *section;
>>> +	u32 section_size;
>>> +	u32 name_len;
>>> +	int ret;
>>> +
>>> +	ret = panthor_fw_binary_iter_read(ptdev, iter, &hdr, sizeof(hdr));
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	if (hdr.data.end < hdr.data.start) {
>>> +		drm_err(&ptdev->base, "Firmware corrupted, data.end < data.start (0x%x < 0x%x)\n",
>>> +			hdr.data.end, hdr.data.start);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	if (hdr.va.end < hdr.va.start) {
>>> +		drm_err(&ptdev->base, "Firmware corrupted, hdr.va.end < hdr.va.start (0x%x < 0x%x)\n",
>>> +			hdr.va.end, hdr.va.start);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	if (hdr.data.end > fw->size) {
>>> +		drm_err(&ptdev->base, "Firmware corrupted, file truncated? data_end=0x%x > fw size=0x%zx\n",
>>> +			hdr.data.end, fw->size);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	if ((hdr.va.start & ~PAGE_MASK) != 0 ||
>>> +	    (hdr.va.end & ~PAGE_MASK) != 0) {
>>> +		drm_err(&ptdev->base, "Firmware corrupted, virtual addresses not page aligned: 0x%x-0x%x\n",
>>> +			hdr.va.start, hdr.va.end);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	if (hdr.flags & ~CSF_FW_BINARY_IFACE_ENTRY_RD_SUPPORTED_FLAGS) {
>>> +		drm_err(&ptdev->base, "Firmware contains interface with unsupported flags (0x%x)\n",
>>> +			hdr.flags);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	if (hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_PROT) {
>>> +		drm_warn(&ptdev->base,
>>> +			 "Firmware protected mode entry not supported, ignoring");
>>> +		return 0;
>>> +	}
>>> +
>>> +	if (hdr.va.start == CSF_MCU_SHARED_REGION_START &&
>>> +	    !(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED)) {
>>> +		drm_err(&ptdev->base,
>>> +			"Interface at 0x%llx must be shared", CSF_MCU_SHARED_REGION_START);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	name_len = iter->size - iter->offset;
>>> +
>>> +	section = drmm_kzalloc(&ptdev->base, sizeof(*section), GFP_KERNEL);
>>> +	if (!section)
>>> +		return -ENOMEM;
>>> +
>>> +	list_add_tail(&section->node, &ptdev->fw->sections);
>>> +	section->flags = hdr.flags;
>>> +	section->data.size = hdr.data.end - hdr.data.start;
>>> +
>>> +	if (section->data.size > 0) {
>>> +		void *data = drmm_kmalloc(&ptdev->base, section->data.size, GFP_KERNEL);
>>> +
>>> +		if (!data)
>>> +			return -ENOMEM;
>>> +
>>> +		memcpy(data, fw->data + hdr.data.start, section->data.size);
>>> +		section->data.buf = data;
>>> +	}
>>> +
>>> +	if (name_len > 0) {
>>> +		char *name = drmm_kmalloc(&ptdev->base, name_len + 1, GFP_KERNEL);
>>> +
>>> +		if (!name)
>>> +			return -ENOMEM;
>>> +
>>> +		memcpy(name, iter->data + iter->offset, name_len);
>>> +		name[name_len] = '\0';
>>> +		section->name = name;
>>> +	}
>>> +
>>> +	section_size = hdr.va.end - hdr.va.start;
>>> +	if (section_size) {
>>> +		u32 cache_mode = hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK;
>>> +		u32 vm_map_flags = 0;
>>> +		struct sg_table *sgt;
>>> +		u64 va = hdr.va.start;
>>> +
>>> +		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
>>> +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_READONLY;
>>> +
>>> +		if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_RD_EX))
>>> +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC;
>>> +
>>> +		/* TODO: CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_*_COHERENT are mapped to
>>> +		 * non-cacheable for now. We might want to introduce a new
>>> +		 * IOMMU_xxx flag (or abuse IOMMU_MMIO, which maps to device
>>> +		 * memory and is currently not used by our driver) for
>>> +		 * AS_MEMATTR_AARCH64_SHARED memory, so we can take benefit
>>> +		 * of IO-coherent systems.
>>> +		 */
>>> +		if (cache_mode != CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED)
>>> +			vm_map_flags |= DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED;
>>> +
>>> +		/* Shared section is in the auto-VA range. We need to
>>> +		 * reserve the VA range so it's not allocated to someone else.
>>> +		 */
>>> +		if (va >= CSF_MCU_SHARED_REGION_START &&
>>> +		    va < CSF_MCU_SHARED_REGION_START + CSF_MCU_SHARED_REGION_SIZE)
>>> +			va = PANTHOR_GEM_ALLOC_VA;
>>> +
>>> +		section->mem = panthor_fw_mem_alloc(ptdev, section_size,
>>> +						    DRM_PANTHOR_BO_NO_MMAP,
>>> +						    vm_map_flags, va);
>>> +		if (IS_ERR(section->mem))
>>> +			return PTR_ERR(section->mem);
>>> +
>>> +		if (drm_WARN_ON(&ptdev->base, section->mem->va != hdr.va.start))
>>> +			return -EINVAL;
>>> +
>>> +		panthor_fw_init_section_mem(ptdev, section);
>>> +
>>> +		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
>>> +		if (IS_ERR(sgt))
>>> +			return PTR_ERR(section->mem);
>>> +
>>> +		dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
>>> +
>>> +		if (section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_SHARED) {
>>> +			if (!panthor_fw_mem_vmap(section->mem))  
>>
>> Moving this before panthor_fw_init_section_mem() would avoid an
>> unnecessary unmap/remap - althought this isn't exactly a performance path...
> 
> Sure, I can do that.
> 
>>
>>> +				return -ENOMEM;
>>> +		}
>>> +	}
>>> +
>>> +	if (hdr.va.start == CSF_MCU_SHARED_REGION_START)
>>> +		ptdev->fw->shared_section = section;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void
>>> +panthor_reload_fw_sections(struct panthor_device *ptdev, bool full_reload)
>>> +{
>>> +	struct panthor_fw_section *section;
>>> +
>>> +	list_for_each_entry(section, &ptdev->fw->sections, node) {
>>> +		struct sg_table *sgt;
>>> +
>>> +		if (!full_reload && !(section->flags & CSF_FW_BINARY_IFACE_ENTRY_RD_WR))
>>> +			continue;
>>> +
>>> +		panthor_fw_init_section_mem(ptdev, section);
>>> +		sgt = drm_gem_shmem_get_pages_sgt(&section->mem->bo->base);
>>> +		if (!drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(sgt)))
>>> +			dma_sync_sgtable_for_device(ptdev->base.dev, sgt, DMA_TO_DEVICE);
>>> +	}
>>> +}
>>> +
>>> +static int panthor_fw_load_entry(struct panthor_device *ptdev,
>>> +				 const struct firmware *fw,
>>> +				 struct panthor_fw_binary_iter *iter)
>>> +{
>>> +	struct panthor_fw_binary_iter eiter;
>>> +	u32 ehdr;
>>> +	int ret;
>>> +
>>> +	ret = panthor_fw_binary_iter_read(ptdev, iter, &ehdr, sizeof(ehdr));
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	if ((iter->offset % sizeof(u32)) ||
>>> +	    (CSF_FW_BINARY_ENTRY_SIZE(ehdr) % sizeof(u32))) {
>>> +		drm_err(&ptdev->base, "Firmware entry isn't 32 bit aligned, offset=0x%x size=0x%x\n",
>>> +			(u32)(iter->offset - sizeof(u32)), CSF_FW_BINARY_ENTRY_SIZE(ehdr));
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	eiter.offset = 0;
>>> +	eiter.data = iter->data + iter->offset;
>>> +	eiter.size = CSF_FW_BINARY_ENTRY_SIZE(ehdr) - sizeof(ehdr);
>>> +	iter->offset += eiter.size;  
>>
>> There should really be a check like:
>>
>> 	if (iter->offset < eiter.size)
>> 		return -EINVAL;
> 
> Uh, I thought I had added size checks everywhere, but I apparently
> missed some places.
> 
>>
>> otherwise I think it's possible for a corrupt firmware to cause us to
>> run off the end of the buffer. Ideally the check would look something
>> more like the one in panthor_fw_binary_iter_read() (dealing with
>> potential overflow). I'm wondering if it makes sense to allow
>> panthor_fw_binary_iter_read() with a NULL 'out' and check the return
>> value. That way we can replace "iter->offset += eiter.size" with:
>>
>> 	ret = panthor_fw_binary_iter_read(ptdev, iter, NULL,
>> 					  eiter.size);
>> 	if (ret)
>> 		return ret;
>>
>> (or have a new _skip() function)
> 
> Might make sense to add a panthor_fw_binary_sub_iter_init() helper that
> would take care of doing the size check on the main iter, unless you
> see other places requiring a size check that are not expressed as
> sub-iterators.

It was only in the sub-iterators that I spotted the missing size check. A
helper for the sub-iterators is probably clearer than my 'skip' function.
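
Something along these lines is what I'd picture (untested, just
reusing the iterator fields from your code):

static int
panthor_fw_binary_sub_iter_init(struct panthor_device *ptdev,
				struct panthor_fw_binary_iter *iter,
				struct panthor_fw_binary_iter *sub_iter,
				size_t size)
{
	/* Reject a sub-entry that would run past the end of the parent
	 * iterator, i.e. past the end of the firmware image.
	 */
	if (iter->offset > iter->size ||
	    size > iter->size - iter->offset) {
		drm_err(&ptdev->base, "Firmware entry overflows the image\n");
		return -EINVAL;
	}

	sub_iter->offset = 0;
	sub_iter->data = iter->data + iter->offset;
	sub_iter->size = size;
	iter->offset += size;
	return 0;
}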

[...]

>>> +struct panthor_fw_ringbuf_input_iface {
>>> +	u64 insert;
>>> +	u64 extract;
>>> +} __packed;
>>> +
>>> +struct panthor_fw_ringbuf_output_iface {
>>> +	u64 extract;
>>> +	u32 active;
>>> +} __packed;  
>>
>> Is there a good reason for these to be marked '__packed'? They are
>> naturally aligned so there's no padding, and we guarantee they are page
>> aligned. The compiler might have more freedom if they are not marked
>> __packed.
> 
> Nope, no good reason.
> 
>>
>>> +
>>> +struct panthor_fw_cs_control_iface {
>>> +#define CS_FEATURES_WORK_REGS(x)		(((x) & GENMASK(7, 0)) + 1)
>>> +#define CS_FEATURES_SCOREBOARDS(x)		(((x) & GENMASK(15, 8)) >> 8)
>>> +#define CS_FEATURES_COMPUTE			BIT(16)
>>> +#define CS_FEATURES_FRAGMENT			BIT(17)
>>> +#define CS_FEATURES_TILER			BIT(18)
>>> +	u32 features;
>>> +	u32 input_va;
>>> +	u32 output_va;
>>> +} __packed;  
>>
>> Here I have to admit I can't find a statement in the spec saying that
>> the stride must be a multiple of 4 bytes... but kbase makes that assumption.
> 
> The stride of?

The stride of this structure (panthor_fw_cs_control_iface or
STREAM_CONTROL_BLOCK in the spec). The stride is defined by
GROUP_CONTROL_BLOCK::GROUP_STREAM_STRIDE
(panthor_fw_csg_control_iface->stream_stride here), but the spec doesn't
specify that the FW must obey any restrictions on the stride. For that
reason the use of __packed here is technically correct (the FW could
choose a stride which causes this structure to be mis-aligned).

In reality the firmware always aligns to 4 bytes and kbase depends on
this. And I've raised this internally, so hopefully a future spec will
include the 4 byte alignment requirement.

TLDR; the __packed specifiers shouldn't be needed on any of these
structures.
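
If/when the __packed annotations are dropped, a few build-time checks
would keep the layout assumptions explicit, e.g. (offsets taken from
the structures quoted above):

static_assert(offsetof(struct panthor_fw_ringbuf_input_iface, extract) == 8);
static_assert(offsetof(struct panthor_fw_ringbuf_output_iface, active) == 8);
static_assert(offsetof(struct panthor_fw_cs_control_iface, output_va) == 8);
static_assert(offsetof(struct panthor_fw_cs_input_iface, ringbuf_base) == 16);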

>>
>>> +
>>> +struct panthor_fw_cs_input_iface {
>>> +#define CS_STATE_MASK				GENMASK(2, 0)
>>> +#define CS_STATE_STOP				0
>>> +#define CS_STATE_START				1
>>> +#define CS_EXTRACT_EVENT			BIT(4)
>>> +#define CS_IDLE_SYNC_WAIT			BIT(8)
>>> +#define CS_IDLE_PROTM_PENDING			BIT(9)
>>> +#define CS_IDLE_EMPTY				BIT(10)
>>> +#define CS_IDLE_RESOURCE_REQ			BIT(11)
>>> +#define CS_TILER_OOM				BIT(26)
>>> +#define CS_PROTM_PENDING			BIT(27)
>>> +#define CS_FATAL				BIT(30)
>>> +#define CS_FAULT				BIT(31)
>>> +#define CS_REQ_MASK				(CS_STATE_MASK | \
>>> +						 CS_EXTRACT_EVENT | \
>>> +						 CS_IDLE_SYNC_WAIT | \
>>> +						 CS_IDLE_PROTM_PENDING | \
>>> +						 CS_IDLE_EMPTY | \
>>> +						 CS_IDLE_RESOURCE_REQ)
>>> +#define CS_EVT_MASK				(CS_TILER_OOM | \
>>> +						 CS_PROTM_PENDING | \
>>> +						 CS_FATAL | \
>>> +						 CS_FAULT)
>>> +	u32 req;
>>> +
>>> +#define CS_CONFIG_PRIORITY(x)			((x) & GENMASK(3, 0))
>>> +#define CS_CONFIG_DOORBELL(x)			(((x) << 8) & GENMASK(15, 8))
>>> +	u32 config;
>>> +	u32 reserved1;
>>> +	u32 ack_irq_mask;
>>> +	u64 ringbuf_base;
>>> +	u32 ringbuf_size;
>>> +	u32 reserved2;
>>> +	u64 heap_start;
>>> +	u64 heap_end;
>>> +	u64 ringbuf_input;
>>> +	u64 ringbuf_output;
>>> +	u32 instr_config;
>>> +	u32 instrbuf_size;
>>> +	u64 instrbuf_base;
>>> +	u64 instrbuf_offset_ptr;
>>> +} __packed;  
>>
>> The spec says this has a minimal alignment of 64 bytes. Although I guess
>> the code should check this if we remove __packed and rely on it.
> 
> The allocation granularity is 4k, and we're not even in control of the
> offset inside the FW interface section. So yes, we can check it when
> parsing the FW sections, but there's no point adding __aligned() here.

Sorry, no I wasn't intending that we'd add __aligned() - I was just
trying to justify (to myself) that the __packed wasn't necessary.

>>
>>> +
>>> +struct panthor_fw_cs_output_iface {
>>> +	u32 ack;
>>> +	u32 reserved1[15];
>>> +	u64 status_cmd_ptr;
>>> +
>>> +#define CS_STATUS_WAIT_SB_MASK			GENMASK(15, 0)
>>> +#define CS_STATUS_WAIT_SB_SRC_MASK		GENMASK(19, 16)
>>> +#define CS_STATUS_WAIT_SB_SRC_NONE		(0 << 16)
>>> +#define CS_STATUS_WAIT_SB_SRC_WAIT		(8 << 16)
>>> +#define CS_STATUS_WAIT_SYNC_COND_LE		(0 << 24)
>>> +#define CS_STATUS_WAIT_SYNC_COND_GT		(1 << 24)
>>> +#define CS_STATUS_WAIT_SYNC_COND_MASK		GENMASK(27, 24)
>>> +#define CS_STATUS_WAIT_PROGRESS			BIT(28)
>>> +#define CS_STATUS_WAIT_PROTM			BIT(29)
>>> +#define CS_STATUS_WAIT_SYNC_64B			BIT(30)
>>> +#define CS_STATUS_WAIT_SYNC			BIT(31)
>>> +	u32 status_wait;
>>> +	u32 status_req_resource;
>>> +	u64 status_wait_sync_ptr;
>>> +	u32 status_wait_sync_value;
>>> +	u32 status_scoreboards;
>>> +
>>> +#define CS_STATUS_BLOCKED_REASON_UNBLOCKED	0
>>> +#define CS_STATUS_BLOCKED_REASON_SB_WAIT	1
>>> +#define CS_STATUS_BLOCKED_REASON_PROGRESS_WAIT	2
>>> +#define CS_STATUS_BLOCKED_REASON_SYNC_WAIT	3
>>> +#define CS_STATUS_BLOCKED_REASON_DEFERRED	5
>>> +#define CS_STATUS_BLOCKED_REASON_RES		6
>>> +#define CS_STATUS_BLOCKED_REASON_FLUSH		7
>>> +#define CS_STATUS_BLOCKED_REASON_MASK		GENMASK(3, 0)
>>> +	u32 status_blocked_reason;
>>> +	u32 status_wait_sync_value_hi;
>>> +	u32 reserved2[6];
>>> +
>>> +#define CS_EXCEPTION_TYPE(x)			((x) & GENMASK(7, 0))
>>> +#define CS_EXCEPTION_DATA(x)			(((x) >> 8) & GENMASK(23, 0))
>>> +	u32 fault;
>>> +	u32 fatal;
>>> +	u64 fault_info;
>>> +	u64 fatal_info;
>>> +	u32 reserved3[10];
>>> +	u32 heap_vt_start;
>>> +	u32 heap_vt_end;
>>> +	u32 reserved4;
>>> +	u32 heap_frag_end;
>>> +	u64 heap_address;
>>> +} __packed;  
>>
>> output is the same as input.
> 
> You mean in term of alignment?

Yep. (Sorry I did a terrible job of explaining myself here - I got
rather distracted trying to work out what alignment was guaranteed by
the spec for all these different structures).

>>
>>> +
>>> +struct panthor_fw_csg_control_iface {
>>> +	u32 features;
>>> +	u32 input_va;
>>> +	u32 output_va;
>>> +	u32 suspend_size;
>>> +	u32 protm_suspend_size;
>>> +	u32 stream_num;
>>> +	u32 stream_stride;
>>> +} __packed;  
>>
>> The spec is ambiguous here. In one place it states the stride is 256
>> bytes, but in another that you need to look at the GLB_GROUP_STRIDE
>> value. In practice we can rely on 4 byte alignment.
>>
>> I'm beginning to wonder if it's worth worrying about, I think I'll stop
>> here ;)
> 
> Hehe. I'll add checks where I can in the parsing logic. I guess having
> things naturally aligned and making sure there's no overlap with other
> interfaces is a minimum.

Yes that would be good, and like I said there should be a clarification
in later specs that everything is (at least) 4 byte aligned.

Apparently the 256 byte stride mentioned in one place was due to the way
the structure was expressed in the XML and the XML->HTML tool
calculating it. Or in one word: 'wrong'! ;)

Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block
  2023-08-30 14:53         ` Boris Brezillon
@ 2023-08-30 15:55           ` Steven Price
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-30 15:55 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 30/08/2023 15:53, Boris Brezillon wrote:
> On Wed, 30 Aug 2023 15:12:43 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 29/08/2023 16:33, Boris Brezillon wrote:
>>> On Mon, 14 Aug 2023 16:53:09 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>>   
>>>>> +
>>>>> +/**
>>>>> + * struct panthor_vm_op_ctx - VM operation context
>>>>> + *
>>>>> + * With VM operations potentially taking place in a dma-signaling path, we
>>>>> + * need to make sure everything that might require resource allocation is
>>>>> + * pre-allocated upfront. This is what this operation context is for.
>>>>> + *
>>>>> + * We also collect resources that have been freed, so we can release them
>>>>> + * asynchronously, and let the VM_BIND scheduler process the next VM_BIND
>>>>> + * request.
>>>>> + */
>>>>> +struct panthor_vm_op_ctx {
>>>>> +	/** @rsvd_page_tables: Pages reserved for the MMU page table update. */
>>>>> +	struct {
>>>>> +		/** @count: Number of pages reserved. */
>>>>> +		u32 count;
>>>>> +
>>>>> +		/** @ptr: Points to the first unused page in the @pages table. */
>>>>> +		u32 ptr;
>>>>> +
>>>>> +		/**
>>>>> +		 * @pages: Array of pages that can be used for an MMU page table update.
>>>>> +		 *
>>>>> +		 * After a VM operation, there might be free pages left in this array.
>>>>> +		 * They should be returned to the pt_cache as part of the op_ctx cleanup.
>>>>> +		 */
>>>>> +		void **pages;
>>>>> +	} rsvd_page_tables;    
>>>>
>>>> Two questions:
>>>>
>>>> 1) Would a mempool simplify the implementation? It looks like a
>>>> reasonable match.  
>>>
>>> Not sure what you mean by mempool,  
>>
>> See include/linux/mempool.h
> 
> Oh, okay.
> 
>>
>>> but I'm using a kmem_cache here for
>>> all page table allocations. The pages that are passed to
>>> panthor_vm_op_ctx::rsvd_page_tables::pages are allocated from this
>>> pool. It's just that for each VM operation we pre-allocate page-tables,
>>> and release those that were not used when the operation is done (we
>>> over-allocate for the worst case scenario).  
>>
>> The mempool could, potentially, replace the rsvd_page_tables structure.
>> The kmem_cache you would still want as that's per-driver.
> 
> Need to have a closer look at the API to make up my mind, but at first
> glance it seems to be overkill for what I initially had in mind.

I agree it has more functionality than is needed here - but equally the
code is already there. struct rsvd_page_tables looks to me to be a very
simplified version of it, so rather than reinventing it we could just
use the existing code.

Having said that I haven't tried the conversion, so perhaps there are
issues with the mempool implementation which makes it better to roll our
own.

>>
>>>>
>>>> 2) Does it really make sense to have a separate pool of memory for every
>>>> operation? Instead of having a separate pool for each operation, it
>>>> would be possible to just keep track of the total number needed for all
>>>> outstanding operations. Then a single (per device or maybe per-VM if
>>>> necessary) mempool could be resized to ensure it has the right amount of
>>>> space.  
>>>
>>> The pool is per-driver (see the global pt_cache). rsvd_page_tables just
>>> holds pages needed for a specific VM operation. To be more specific, it
>>> holds pages for the worst case (page table tree is empty, except for the
>>> root page table).  
>>
>> What I'm wondering is whether we need to keep the pages for each operation in
>> separate pools.
> 
> I was not really considering it a pool, more a set of pages that will
> be used by the VM operation, some of them being returned to the
> kmem_cache pool if we end up using less (over-provisioning). If we have
> a mempool, say, at the VM level, that means we have 2 levels of caching:
> the kmem_cache itself, and the mempool attached to the VM. Is there any
> benefit here? Do we expect kmem_cache to be too slow for fast/already
> allocated pages?

My understanding is that we can't (easily) ensure that sufficient pages
remain in the kmem_cache. So we need a second level 'cache' to hold the
pages that might be used by the pending VM_BIND operations. That's what
the rsvd_page_tables struct does currently.

> I do see how over-provisioning can cause us to allocate a lot of pages
> that end up being unused, but I fail to see how VM/device level caching
> would solve that, because we still have to dequeue some operations to
> return pages to the intermediate pool, at which point, we've already
> lost, because already queued operations reserved the amount of pages
> they thought they needed for the worst case scenario.
> Operations being queued after that can pick from the returned pages of
> course, but that's already the case right now, because we return pages
> to the kmem_cache as soon as we're done executing a VM operation.

I don't think it would make a difference other than having one shared
data structure rather than many struct rsvd_page_tables instances.

> The only thing that might help is limiting the number of in-flight
> VM_BIND jobs per VM (or globally), and then have the submit path return
> EBUSY or EAGAIN so the userspace driver knows it has to retry at a later
> time.

I think we're going to need this in some form to avoid exposing an
effective DoS vector (unless some way can be found to account the pages).

>> So instead of having a rsvd_page_tables for each
>> operation, can we have one global one which is sized appropriately for
>> all operations that are in flight for the device? The operations are
>> serialized so there's no contention. Or at least a per-VM pool if we can
>> operate on multiple VMs at once.
> 
> We can operate on multiple VMs at once (VM is basically your VkDevice,
> AKA the logical device), but I'm not too worried about the
> synchronization that would be incurred by the caching at the
> panthor_device level. I'm just curious to know what value it would add.
> I'm also worried that it makes the reservation logic more complex:
> we need to track what's still reserved and how many pages can be
> re-used because they were never reclaimed by the VM operation that had
> reserved it.

Well every operation would need to keep (an equivalent of) the
rsvd_page_tables.count variable so that the number of pages in the
mempool could be kept correct (returning pages from the pool to kmem_cache).

> Also didn't check how mempool plays with memory reclaim,
> but if the memory in a mempool is not reclaimable, that might be
> another problem.

AFAIK mempool doesn't support reclaiming, and I don't think we'd want to
reclaim the pages which were reserved for pending VM_BIND ops.

> To sum up, I'd really like to refrain from adding an intermediate cache
> until the per-driver kmem cache is proven to be too slow.

To be honest, I'm having second thoughts - the code you've got works,
and I'm not sure it's worth the time/effort to investigate mempools now.
It just struck me during the review that I would have looked at using that.

>>
>>>>
>>>> I'm also a little wary that the VM_BIND infrastructure could potentially
>>>> be abused to trigger a large amount of kernel allocation as it allocates
>>>> up-front for the worst case but those pages are not charged to the
>>>> process (AFAICT). But I haven't fully got my head round that yet.  
>>>
>>> Yep, that's problematic, indeed. I considered allocating page tables
>>> as GEM objects, but the overhead of a GEM object is quite big
>>> (hundreds of bytes of meta-data) compared to the size of a page table
>>> (4k), and kmem_cache was just super convenient for this page table
>>> cache :-).  
>>
>> I think page tables as GEM objects is likely to be overkill,
> 
> I agree it's overkill/not easy to interface with io_pgtbl, but there
> was one aspect I was interested in, more than the memory accounting:
> being able to reclaim page tables the same way we reclaim regular GEMs,
> which would simplify the shrinker/reclaim logic. After discussing it
> with Robin, I realized it was pretty much useless, because reclaiming
> the GEM will also teardown all VM mappings, which will return the page
> table memory to the kmem_cache, and then the kmem_cache layer can
> reclaim it.
> 
>> we
>> obviously also have to be careful not to allow user space to get access
>> to the contents - whereas GEM objects are usually there to provide user space
>> access ;).
> 
> GEMs can be hidden from userspace if we want. We are the ones in control
> of the mmap() (I think there's a BO flag for preventing users access
> already).

True, it's just my gut reaction that I expect the contents of GEMs to be
controlled by user space in some way. Usually hidden GEM buffers are for
intermediate data that doesn't actually need to be protected from user
space, it's just that user space has no interest in it. I'd have to dig into
the security model to satisfy myself whether it was completely safe to
put page tables in there.

>> I'm not sure quite what the best solution here is, clearly one
>> 'solution' is to just cap the number of outstanding VM_BINDs.
> 
> I think that'd make sense.
> 
>>
>>>>  
>>>>> +
>>>>> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags. */
>>>>> +	u32 flags;
>>>>> +
>>>>> +	/** @va: Virtual range targeted by the VM operation. */
>>>>> +	struct {
>>>>> +		/** @addr: Start address. */
>>>>> +		u64 addr;
>>>>> +
>>>>> +		/** @range: Range size. */
>>>>> +		u64 range;
>>>>> +	} va;
>>>>> +
>>>>> +	/**
>>>>> +	 * @returned_vmas: List of panthor_vma objects returned after a VM operation.
>>>>> +	 *
>>>>> +	 * For unmap operations, this will contain all VMAs that were covered by the
>>>>> +	 * specified VA range.
>>>>> +	 *
>>>>> +	 * For map operations, this will contain all VMAs that previously mapped to
>>>>> +	 * the specified VA range.
>>>>> +	 *
>>>>> +	 * Those VMAs, and the resources they point to will be released as part of
>>>>> +	 * the op_ctx cleanup operation.
>>>>> +	 */
>>>>> +	struct list_head returned_vmas;
>>>>> +
>>>>> +	/** @map: Fields specific to a map operation. */
>>>>> +	struct {
>>>>> +		/** @gem: GEM object information. */
>>>>> +		struct {
>>>>> +			/** @obj: GEM object to map. */
>>>>> +			struct drm_gem_object *obj;
>>>>> +
>>>>> +			/** @offset: Offset in the GEM object. */
>>>>> +			u64 offset;
>>>>> +		} gem;
>>>>> +
>>>>> +		/**
>>>>> +		 * @sgt: sg-table pointing to pages backing the GEM object.
>>>>> +		 *
>>>>> +		 * This is gathered at job creation time, such that we don't have
>>>>> +		 * to allocate in ::run_job().
>>>>> +		 */
>>>>> +		struct sg_table *sgt;
>>>>> +
>>>>> +		/**
>>>>> +		 * @prev_vma: Pre-allocated VMA object to deal with a remap situation.
>>>>> +		 *
>>>>> +		 * If the map request covers a region that's inside another VMA, the
>>>>> +		 * previous VMA will be split, requiring instantiation of a maximum of
>>>>> +		 * two new VMA objects.
>>>>> +		 */
>>>>> +		struct panthor_vma *prev_vma;
>>>>> +
>>>>> +		/**
>>>>> +		 * @new_vma: The new VMA object that will be inserted to the VA tree.
>>>>> +		 */
>>>>> +		struct panthor_vma *new_vma;
>>>>> +
>>>>> +		/**
>>>>> +		 * @next_vma: Pre-allocated VMA object to deal with a remap situation.
>>>>> +		 *
>>>>> +		 * See @prev_vma.
>>>>> +		 */
>>>>> +		struct panthor_vma *next_vma;    
>>>>
>>>> It's probably premature optimization, but it feels like having a cache
>>>> of these VMA structures might be an idea.  
>>>
>>> If it's needed, I'll probably go for a kmem_cache, but I need to
>>> check if it's worth it first (if the closest kmalloc cache is
>>> significantly bigger than the struct size).
>>>   
>>>> I'm also struggling to
>>>> understand how both a new prev and new next VMA are needed - but I
>>>> haven't dug into the GPU VA manager.  
>>>
>>> prev/next are for mapping splits: an object is already mapped, and a new
>>> object is mapped in the middle of this pre-existing mapping. In that
>>> case, we need two VMA objects for the preceding and succeeding mappings,
>>> since the old mapping object will be released.
>>>
>>> new_vma is for the new mapping.  
>>
>> Yeah, looking into the GPU VA manager I see now. My problem was that I
>> assumed in the case of a split one of the original mappings would simply
>> be resized, so you'd only need one new VMA (plus the one being added).
>> But AFAICT that resize doesn't happen and instead new VMA are created.
> 
> Yes. On the other hand, if we have a kmem_cache for panthor_vma
> objects, that shouldn't make a big difference.

Yeah I can understand the design now - I just hadn't dug deep enough before.
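
For anyone else reading along, the split case looks roughly like this
(illustration only, addresses are arbitrary):

 /*
  *   existing mapping:  [0x100000, 0x180000)
  *   new map request:   [0x120000, 0x140000)
  *
  * The GPU VA manager doesn't resize the existing VMA, it emits a remap
  * op, so up to three pre-allocated panthor_vma objects are needed:
  *
  *   prev_vma: [0x100000, 0x120000)   <- left-over head of the old VMA
  *   new_vma:  [0x120000, 0x140000)   <- the mapping being created
  *   next_vma: [0x140000, 0x180000)   <- left-over tail of the old VMA
  *
  * and the old VMA object is released.
  */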

>>
>>>>  
>>>>> +	} map;
>>>>> +};
>>>>> +  
>>>
>>> [...]
>>>   
>>>>> +/**
>>>>> + * panthor_vm_active() - Flag a VM as active
>>>>> + * @VM: VM to flag as active.
>>>>> + *
>>>>> + * Assigns an address space to a VM so it can be used by the GPU/MCU.
>>>>> + *
>>>>> + * Return: 0 on success, a negative error code otherwise.
>>>>> + */
>>>>> +int panthor_vm_active(struct panthor_vm *vm)
>>>>> +{
>>>>> +	struct panthor_device *ptdev = vm->ptdev;
>>>>> +	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg;
>>>>> +	int ret = 0, as, cookie;
>>>>> +	u64 transtab, transcfg;
>>>>> +
>>>>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
>>>>> +		return -ENODEV;
>>>>> +
>>>>> +	mutex_lock(&ptdev->mmu->as.slots_lock);
>>>>> +
>>>>> +	as = vm->as.id;
>>>>> +	if (as >= 0) {
>>>>> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
>>>>> +
>>>>> +		if (ptdev->mmu->as.faulty_mask & mask) {
>>>>> +			/* Unhandled pagefault on this AS, the MMU was
>>>>> +			 * disabled. We need to re-enable the MMU after
>>>>> +			 * clearing+unmasking the AS interrupts.
>>>>> +			 */
>>>>> +			gpu_write(ptdev, MMU_INT_CLEAR, mask);
>>>>> +			ptdev->mmu->as.faulty_mask &= ~mask;
>>>>> +			gpu_write(ptdev, MMU_INT_MASK, ~ptdev->mmu->as.faulty_mask);
>>>>> +			goto out_enable_as;
>>>>> +		}
>>>>> +
>>>>> +		goto out_unlock;
>>>>> +	}
>>>>> +
>>>>> +	/* Check for a free AS */
>>>>> +	if (vm->for_mcu) {
>>>>> +		drm_WARN_ON(&ptdev->base, ptdev->mmu->as.alloc_mask & BIT(0));
>>>>> +		as = 0;
>>>>> +	} else {
>>>>> +		as = ffz(ptdev->mmu->as.alloc_mask | BIT(0));
>>>>> +	}
>>>>> +
>>>>> +	if (!(BIT(as) & ptdev->gpu_info.as_present)) {
>>>>> +		struct panthor_vm *lru_vm;
>>>>> +
>>>>> +		lru_vm = list_first_entry_or_null(&ptdev->mmu->as.lru_list,
>>>>> +						  struct panthor_vm,
>>>>> +						  as.lru_node);
>>>>> +		if (drm_WARN_ON(&ptdev->base, !lru_vm)) {
>>>>> +			ret = -EBUSY;
>>>>> +			goto out_unlock;
>>>>> +		}
>>>>> +
>>>>> +		list_del_init(&lru_vm->as.lru_node);
>>>>> +		as = lru_vm->as.id;    
>>>>
>>>> Should this not set lru_vm->as.id = -1, so that the code knows the VM no
>>>> longer has an address space?  
>>>
>>> Good catch!
>>>   
>>>>  
>>>>> +	} else {
>>>>> +		set_bit(as, &ptdev->mmu->as.alloc_mask);
>>>>> +	}
>>>>> +
>>>>> +	/* Assign the free or reclaimed AS to the FD */
>>>>> +	vm->as.id = as;
>>>>> +	ptdev->mmu->as.slots[as].vm = vm;
>>>>> +
>>>>> +out_enable_as:
>>>>> +	transtab = cfg->arm_lpae_s1_cfg.ttbr;
>>>>> +	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
>>>>> +		   AS_TRANSCFG_PTW_RA |
>>>>> +		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
>>>>> +	if (ptdev->coherent)
>>>>> +		transcfg |= AS_TRANSCFG_PTW_SH_OS;
>>>>> +
>>>>> +	ret = panthor_mmu_as_enable(vm->ptdev, vm->as.id, transtab, transcfg, vm->memattr);
>>>>> +
>>>>> +out_unlock:
>>>>> +	mutex_unlock(&ptdev->mmu->as.slots_lock);
>>>>> +	drm_dev_exit(cookie);
>>>>> +	return ret;
>>>>> +}
>>>>> +  
>>>
>>> [...]
>>>   
>>>>> +
>>>>> +static void panthor_mmu_irq_handler(struct panthor_device *ptdev, u32 status)
>>>>> +{
>>>>> +	status = panthor_mmu_fault_mask(ptdev, status);
>>>>> +	while (status) {
>>>>> +		u32 as = ffs(status | (status >> 16)) - 1;
>>>>> +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
>>>>> +		u32 new_int_mask;
>>>>> +		u64 addr;
>>>>> +		u32 fault_status;
>>>>> +		u32 exception_type;
>>>>> +		u32 access_type;
>>>>> +		u32 source_id;
>>>>> +
>>>>> +		fault_status = gpu_read(ptdev, AS_FAULTSTATUS(as));
>>>>> +		addr = gpu_read(ptdev, AS_FAULTADDRESS_LO(as));
>>>>> +		addr |= (u64)gpu_read(ptdev, AS_FAULTADDRESS_HI(as)) << 32;
>>>>> +
>>>>> +		/* decode the fault status */
>>>>> +		exception_type = fault_status & 0xFF;
>>>>> +		access_type = (fault_status >> 8) & 0x3;
>>>>> +		source_id = (fault_status >> 16);
>>>>> +
>>>>> +		/* Page fault only */    
>>>>
>>>> This comment makes no sense - it looks like it's copied over from panfrost.  
>>>
>>> Uh, it made sense before I dropped map/alloc-on-fault :-).  
>>
>> :)
>>
>>>>
>>>> If I understand correctly we don't (currently) support growing on page
>>>> fault - and it's not really needed now the MCU can handle the tiler heaps.  
>>>
>>> Exactly. Map/alloc on fault is a bit challenging because of the whole
>>> 'we have to guarantee that a job is done in finite time, and we must
>>> make sure fence signaling is not blocked on allocation'. Given
>>> drm_gem_get_pages() doesn't do non-blocking allocations, I thought it'd
>>> be preferable to postpone map-on-fault until we actually decide we need
>>> it. Note that i915 seems to have some sort of non-blocking page
>>> allocator in shmem_sg_alloc_table()[1].  
>>
>> Agreed, the intention is definitely to move away from map/alloc-on-fault
>> - handling page faults from the GPU on the CPU is expensive even without
>> the can-of-worms of fence signalling.
> 
> Yeah, I agree, but I'd bet on Khronos members being inventive enough
> to come up with a use case for this map/alloc-on-fault feature :-).
> Anyway, that's not something we have to worry about just yet.

:) Vulkan already has sparse textures ('Sparse Residency'), but AFAIK
that's not (yet) tied into userfaultfd... and now I've possibly put the
idea into someone's head! ;)

Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS
  2023-08-09 16:53 ` [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS Boris Brezillon
  2023-08-11 16:08   ` Steven Price
@ 2023-08-31 13:18   ` Liviu Dudau
  2023-08-31 13:25     ` Boris Brezillon
  1 sibling, 1 reply; 93+ messages in thread
From: Liviu Dudau @ 2023-08-31 13:18 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, dri-devel,
	Steven Price, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

Hi Boris,

On Wed, Aug 09, 2023 at 06:53:28PM +0200, Boris Brezillon wrote:
> Add an entry for the Panthor driver to the MAINTAINERS file.
> 
> v2:
> - New commit
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
> 
> If anyone from Arm wants to volunteer to become a co-maintainer, that
> would be highly appreciated
> ---
>  MAINTAINERS | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cd882b87a3c6..6149ab68d461 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1624,6 +1624,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
>  F:	drivers/gpu/drm/panfrost/
>  F:	include/uapi/drm/panfrost_drm.h
>  
> +ARM MALI PANTHOR DRM DRIVER
> +M:	Boris Brezillon <boris.brezillon@collabora.com>
> +L:	dri-devel@lists.freedesktop.org
> +S:	Supported
> +T:	git git://anongit.freedesktop.org/drm/drm-misc
> +F:	drivers/gpu/drm/panthor/
> +F:	include/uapi/drm/panthor_drm.h

Can we also add an entry for the bindings?

+F: Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml

Also, I would like to volunteer as maintainer alongside Steven, so can I
please get added too?

Best regards,
Liviu

> +
>  ARM MALI-DP DRM DRIVER
>  M:	Liviu Dudau <liviu.dudau@arm.com>
>  S:	Supported
> -- 
> 2.41.0
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS
  2023-08-31 13:18   ` Liviu Dudau
@ 2023-08-31 13:25     ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-08-31 13:25 UTC (permalink / raw)
  To: Liviu Dudau
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, dri-devel,
	Steven Price, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Thu, 31 Aug 2023 14:18:42 +0100
Liviu Dudau <Liviu.Dudau@arm.com> wrote:

> Hi Boris,
> 
> On Wed, Aug 09, 2023 at 06:53:28PM +0200, Boris Brezillon wrote:
> > Add an entry for the Panthor driver to the MAINTAINERS file.
> > 
> > v2:
> > - New commit
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> > 
> > If anyone from Arm wants to volunteer to become a co-maintainer, that
> > would be highly appreciated
> > ---
> >  MAINTAINERS | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index cd882b87a3c6..6149ab68d461 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -1624,6 +1624,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
> >  F:	drivers/gpu/drm/panfrost/
> >  F:	include/uapi/drm/panfrost_drm.h
> >  
> > +ARM MALI PANTHOR DRM DRIVER
> > +M:	Boris Brezillon <boris.brezillon@collabora.com>
> > +L:	dri-devel@lists.freedesktop.org
> > +S:	Supported
> > +T:	git git://anongit.freedesktop.org/drm/drm-misc
> > +F:	drivers/gpu/drm/panthor/
> > +F:	include/uapi/drm/panthor_drm.h  
> 
> Can we also add an entry for the bindings?
> 
> +F: Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml

Will do.

> 
> Also, I would like to volunteer as maintainer alongside Steven, so can I
> please get added too?

Sure, the more the merrier :-).

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 12/15] drm/panthor: Add the driver frontend block
  2023-08-29 17:46     ` Boris Brezillon
@ 2023-08-31 14:42       ` Steven Price
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-08-31 14:42 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 29/08/2023 18:46, Boris Brezillon wrote:
> On Mon, 21 Aug 2023 12:31:29 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 09/08/2023 17:53, Boris Brezillon wrote:

[...]

>>> + *	// Collect signal operations on all jobs, such that each job can pick
>>> + *	// from it for its dependencies and update the fence to signal when
>>> + *	// the job is submitted.  
>>
>> I can't figure out here how we avoid dependency loops within a batch.
>> What stops two jobs from each depending on each other?
>>
>> Or do we "allow" this but rely on the loop in panthor_submit_ctx_add_deps_and_arm_jobs()
>> to effectively enforce that a job cannot actually depend on a job
>> which is later in the batch.
> 
> You can't have circular dependencies because the job fence is created
> after its dependencies have been registered, so a job at the beginning
> of the array can't depend on a job that's coming after. It might be
> passed the same syncobj, but since a syncobj is just a container, the
> fence attached to the syncobj at the time the first job adds it as a
> dependency will point to a different dma_fence.
> 
>> In which case why bother with this
>> complexity rather than just performing all the steps on each job
>> in order?
> 
> Because, before submitting a set of jobs, we want to make sure all jobs
> that are passed to a submit request are valid and enough resources are
> available for their execution to proceed. We could allow partial
> execution (and that's actually the approach I had taken in one of the
> patches I proposed to allow submitting multiple jobs in one call to
> panfrost), but then you potentially have to figure out where things
> failed, not to mention that the syncobjs might point to intermediate
> dma_fence objects instead of the final one.
> 
>>
>> Being able to submit a forward dependency, but then having it
>> ignored seems like an odd design. So I feel like I must be
>> missing something.
> 
> It's not about allowing forward dependencies (that would be a mess), but
> allowing one job to take a dependency on a job that was appearing
> earlier in the job array of the same submit call.
> 
>>
>>> + *	ret = panthor_submit_ctx_collect_jobs_signal_ops(&ctx);
> 
> Here panthor_submit_ctx_collect_jobs_signal_ops() is not registering
> job out_fences to the syncobjs, it's just collecting all signal
> operations from all jobs in an array. Each entry in this array contains
> the syncobj handle, the syncobj object, and the fence that was attached
> to it at the time the collection happens, and that's it.
> 
> Now, when a job is populated, and after we've made sure it has
> everything it needs to be submitted, for each signal operation passed
> to this specific job, we update the corresponding entry in the signal
> array with the job finished fence, but the syncobj is not updated at
> that point, because we want to make sure all jobs belonging to a submit
> can be submitted before exposing their fences to the outside world.
> 
> For jobs appearing later in the array, when we see a WAIT operation,
> we first check the signal array to see if there's a corresponding
> entry cached there for the given syncobj handle. If there is, we take
> the dma_fence from there (this dma_fence might come from a job
> submitted earlier in this submit context, or it might be the fence
> that was there initially); if not, we call drm_syncobj_find_fence() to
> get the dependency.
> 
> Once all jobs have been parsed/checked/populated, we start the
> non-failing step => job submission. And after that point, we can start
> exposing the job fences to the outside world. This is what happens in
> panthor_submit_ctx_push_fences(): we iterate over the signal
> operations, and update each syncobj with the fence that was last
> attached to it (the last job in the submit array having a SIGNAL
> operation on that syncobj).

Thanks for the detailed explanation. I guess I hadn't considered the
benefits of checking everything is valid and obtaining resources before
submitting anything. That makes sense and I guess justifies this complexity.
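
Restating it back (comment-style summary to check my understanding, not
code from the patch):

 /*
  * 1. panthor_submit_ctx_collect_jobs_signal_ops() snapshots, for every
  *    SIGNAL op of every job, the (handle, syncobj, current fence)
  *    triplet into the signal array.
  * 2. Per job: WAIT ops are resolved against the signal array first (so
  *    they see fences from earlier jobs in the same submit), falling
  *    back to drm_syncobj_find_fence(); SIGNAL ops then update the
  *    matching signal array entries with the job's finished fence. The
  *    syncobjs themselves are untouched at this point.
  * 3. Non-failing step: all jobs are pushed to drm_sched.
  * 4. panthor_submit_ctx_push_fences() publishes the result, giving each
  *    syncobj the last fence attached to it in the signal array.
  */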

[...]

>>> +static int
>>> +panthor_submit_ctx_add_job(struct panthor_submit_ctx *ctx, u32 idx,
>>> +			   struct drm_sched_job *job,
>>> +			   const struct drm_panthor_obj_array *syncs)
>>> +{
>>> +	struct panthor_device *ptdev = container_of(ctx->file->minor->dev,
>>> +						    struct panthor_device,
>>> +						    base);
>>> +	int ret;
>>> +
>>> +	if (drm_WARN_ON(&ptdev->base,
>>> +			idx >= ctx->job_count ||
>>> +			ctx->jobs[idx].job ||
>>> +			ctx->jobs[idx].syncops ||
>>> +			ctx->jobs[idx].syncop_count))
>>> +		return -EINVAL;
>>> +
>>> +	ctx->jobs[idx].job = job;  
>>
>> While the WARN_ON obviously shouldn't happen, this positioning of the 
>> ctx->jobs[].job assignment means the caller has no idea if the 
>> assignment has happened. AFAICT in the case of the WARN_ON the job isn't 
>> cleaned up properly.
> 
> It's not really about cleanup not happening, more about being passed an
> index that was already populated.
> 
>>
>> The options I can see are to move this line further down (and make the 
>> caller clean up that one job if this function fails), or to clean up the 
>> job in the case where the WARN_ON fails.
> 
> Maybe I should drop this WARN_ON() and assume the caller passed a valid
> index...

I'd be fine with that. My reordering suggestion is a bit pointless I
must admit ;)

[...]

>>> +
>>> +	for (u32 i = 0; i < sync_op_count; i++) {
>>> +		struct panthor_sync_signal *sig_sync;
>>> +		struct dma_fence *fence;
>>> +
>>> +		if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL)
>>> +			continue;  
>>
>> NIT: It might be worth having a helper for the operation type. It's a 
>> little confusing that we have !(flags & SIGNAL) and (flags & SIGNAL) but 
>> not (flags & WAIT) - obviously looking at the definition shows why. Also 
>> there'll be a lot of careful refactoring needed if a third operation is 
>> ever added.
> 
> I had the operation as a separate field initially, but I couldn't think
> of any other operations we could do on a syncobj, so I decided to make
> it a flag, and mimic what Xe does.

A flag is fine, I just find it harder to read:

 if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL)
 [...]
 if (!(sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_SIGNAL))

vs

 bool is_signal_op(struct drm_panthor_sync_op *op)
 {
	return !!(op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL);
 }

 bool is_wait_op(struct drm_panthor_sync_op *op)
 {
	return !(op->flags & DRM_PANTHOR_SYNC_OP_SIGNAL);
 }

 if (is_signal_op(&sync_ops[i]))
 [...]
 if (is_wait_op(&sync_ops[i]))

And it avoids anyone accidentally writing:

 if (sync_ops[i].flags & DRM_PANTHOR_SYNC_OP_WAIT)

which in my quick test the compiler doesn't even warn on :(

Although on the subject of the flag, apparently the enumeration type
value doesn't compile with -pedantic as it overflows into the sign bit:

include/drm/panthor_drm.h:237:31: warning: enumerator value for
‘DRM_PANTHOR_SYNC_OP_SIGNAL’ is not an integer constant expression
[-Wpedantic]
  237 |  DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,

Changing it to "(int)(1u << 31)" seems to be a workaround. This affects
DRM_PANTHOR_VM_BIND_OP_TYPE_MASK too.
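
i.e. something like this (just showing the shape of the fix, not a
proper patch):

 -	DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,
 +	DRM_PANTHOR_SYNC_OP_SIGNAL = (int)(1u << 31),

 -	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = 0xf << 28,
 +	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = (int)(0xfu << 28),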

>>
[...]
>>> +#define PANTHOR_VM_CREATE_FLAGS			0
>>> +
>>> +static int panthor_ioctl_vm_create(struct drm_device *ddev, void *data,
>>> +				   struct drm_file *file)
>>> +{
>>> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
>>> +	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
>>> +	struct panthor_file *pfile = file->driver_priv;
>>> +	struct drm_panthor_vm_create *args = data;
>>> +	u64 kernel_va_start = 0;
>>> +	int cookie, ret;
>>> +
>>> +	if (!drm_dev_enter(ddev, &cookie))
>>> +		return -ENODEV;
>>> +
>>> +	if (args->flags & ~PANTHOR_VM_CREATE_FLAGS) {
>>> +		ret = -EINVAL;
>>> +		goto out_dev_exit;
>>> +	}
>>> +
>>> +	if (drm_WARN_ON(ddev, !va_bits) || args->kernel_va_range > (1ull << (va_bits - 1))) {  
>>
>> The check for !va_bits would be better done at probe time. I'd also be 
>> tempted to move the check for kernel_va_range down to 
>> panthor_vm_create() as that has to repeat the va_bits calculation.
>>
>>> +		ret = -EINVAL;
>>> +		goto out_dev_exit;
>>> +	}
>>> +
>>> +	if (args->kernel_va_range)
>>> +		kernel_va_start = (1 << (va_bits - 1)) - args->kernel_va_range;  
>>
>> And also push the calculation of va_start down to 
>> panthor_vm_create() as well.
> 
> panthor_vm_create() is used internally, for the MCU VM creation, and
> I'd prefer to keep it uAPI agnostic. I don't mind moving it to
> panthor_vm_pool_create_vm() but we'd still have to do the va_bits
> calculation twice.

Ah true, for panthor_vm_create() you need to be able to pass in the VA
range for the MCU. We do have the "for_mcu" flag so the
CSF_MCU_SHARED_REGION_START/SIZE #defines could be used directly in
panthor_vm_create(). But I'd be happy with it in
panthor_vm_pool_create_vm() if you'd prefer.

Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
  2023-08-11 14:13   ` Steven Price
@ 2023-09-01 13:59   ` Liviu Dudau
  2023-09-01 16:10   ` Boris Brezillon
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 93+ messages in thread
From: Liviu Dudau @ 2023-09-01 13:59 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, dri-devel,
	Steven Price, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

Hi Boris,

On Wed, Aug 09, 2023 at 06:53:15PM +0200, Boris Brezillon wrote:
> Panthor follows the lead of other recently submitted drivers with
> ioctls allowing us to support modern Vulkan features, like sparse memory
> binding:
> 
> - Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
>   with the 'exclusive-VM' bit to speed-up BO reservation on job submission
> - VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
>   ioctl is loosely based on the Xe model, and can handle both
>   asynchronous and synchronous requests
> - GPU execution context creation/destruction, tiler heap context creation
>   and job submission. Those ioctls reflect how the hardware/scheduler
>   works and are thus driver specific.
> 
> We also have a way to expose IO regions, such that the usermode driver
> can directly access specific/well-isolate registers, like the
> LATEST_FLUSH register used to implement cache-flush reduction.
> 
> This uAPI intentionally keeps usermode queues out of the scope, which
> explains why doorbell registers and command stream ring-buffers are not
> directly exposed to userspace.
> 
> v2:
> - Rename the driver (pancsf -> panthor)
> - Change the license (GPL2 -> MIT + GPL2)
> - Split the driver addition commit
> - Turn the VM_{MAP,UNMAP} ioctls into a VM_BIND ioctl
> - Add the concept of exclusive_vm at BO creation time
> - Add missing padding fields
> - Add documentation
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Minor fixes in addition to what Steve has alread flagged.

> ---
>  Documentation/gpu/driver-uapi.rst |   5 +
>  include/uapi/drm/panthor_drm.h    | 862 ++++++++++++++++++++++++++++++
>  2 files changed, 867 insertions(+)
>  create mode 100644 include/uapi/drm/panthor_drm.h
> 
> diff --git a/Documentation/gpu/driver-uapi.rst b/Documentation/gpu/driver-uapi.rst
> index c08bcbb95fb3..7a667901830f 100644
> --- a/Documentation/gpu/driver-uapi.rst
> +++ b/Documentation/gpu/driver-uapi.rst
> @@ -17,3 +17,8 @@ VM_BIND / EXEC uAPI
>      :doc: Overview
>  
>  .. kernel-doc:: include/uapi/drm/nouveau_drm.h
> +
> +drm/panthor uAPI
> +================
> +
> +.. kernel-doc:: include/uapi/drm/panthor_drm.h
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> new file mode 100644
> index 000000000000..e217eb5ad198
> --- /dev/null
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -0,0 +1,862 @@
> +/* SPDX-License-Identifier: MIT */
> +/* Copyright (C) 2023 Collabora ltd. */
> +#ifndef _PANTHOR_DRM_H_
> +#define _PANTHOR_DRM_H_
> +
> +#include "drm.h"
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/**
> + * DOC: Introduction
> + *
> + * This documentation describes the Panthor IOCTLs.
> + *
> + * Just a few generic rules about the data passed to the Panthor IOCTLs:
> + *
> + * - Structures must be aligned on 64-bit/8-byte. If the object is not
> + *   naturally aligned, a padding field must be added.
> + * - Fields must be explicitly aligned to their natural type alignment with
> + *   pad[0..N] fields.
> + * - All padding fields will be checked by the driver to make sure they are
> + *   zeroed.
> + * - Flags can be added, but not removed/replaced.
> + * - New fields can be added to the main structures (the structures
> + *   directly passed to the ioctl). Those fields can be added at the end of
> + *   the structure, or replace existing padding fields. Any new field being
> + *   added must preserve the behavior that existed before those fields were
> + *   added when a value of zero is passed.
> + * - New fields can be added to indirect objects (objects pointed by the
> + *   main structure), iff those objects are passed a size to reflect the
> + *   size known by the userspace driver (see drm_panthor_obj_array::stride
> + *   or drm_panthor_dev_query::size).
> + * - If the kernel driver is too old to know some fields, those will
> + *   be ignored (input) and set back to zero (output).
> + * - If userspace is too old to know some fields, those will be zeroed
> + *   (input) before the structure is parsed by the kernel driver.
> + * - Each new flag/field addition must come with a driver version update so
> + *   the userspace driver doesn't have to trial and error to know which
> + *   flags are supported.
> + * - Structures should not contain unions, as this would defeat the
> + *   extensibility of such structures.
> + * - IOCTLs can't be removed or replaced. New IOCTL IDs should be placed
> + *   at the end of the drm_panthor_ioctl_id enum.
> + */
> +
> +/**
> + * DOC: MMIO regions exposed to userspace.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> + *
> + * File offset for all MMIO regions being exposed to userspace. Don't use
> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> + *
> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> + * GPU cache flushing through CS instructions, but the flush reduction
> + * mechanism requires a flush_id. This flush_id could be queried with an
> + * ioctl, but Arm provides a well-isolated register page containing only this
> + * read-only register, so let's expose this page through a static mmap offset
> + * and allow direct mapping of this MMIO region so we can avoid the
> + * user <-> kernel round-trip.
> + */
> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)
> +
> +/**
> + * DOC: IOCTL IDs
> + *
> + * enum drm_panthor_ioctl_id - IOCTL IDs
> + *
> + * Place new ioctls at the end, don't re-order, don't replace or remove entries.
> + *
> + * These IDs are not meant to be used directly. Use the DRM_IOCTL_PANTHOR_xxx
> + * definitions instead.
> + */
> +enum drm_panthor_ioctl_id {
> +	/** @DRM_PANTHOR_DEV_QUERY: Query device information. */
> +	DRM_PANTHOR_DEV_QUERY = 0,
> +
> +	/** @DRM_PANTHOR_VM_CREATE: Create a VM. */
> +	DRM_PANTHOR_VM_CREATE,
> +
> +	/** @DRM_PANTHOR_VM_DESTROY: Destroy a VM. */
> +	DRM_PANTHOR_VM_DESTROY,
> +
> +	/** @DRM_PANTHOR_VM_BIND: Bind/unbind memory to a VM. */
> +	DRM_PANTHOR_VM_BIND,
> +
> +	/** @DRM_PANTHOR_BO_CREATE: Create a buffer object. */
> +	DRM_PANTHOR_BO_CREATE,
> +
> +	/**
> +	 * @DRM_PANTHOR_BO_MMAP_OFFSET: Get the file offset to pass to
> +	 * mmap to map a GEM object.
> +	 */
> +	DRM_PANTHOR_BO_MMAP_OFFSET,
> +
> +	/** @DRM_PANTHOR_GROUP_CREATE: Create a scheduling group. */
> +	DRM_PANTHOR_GROUP_CREATE,
> +
> +	/** @DRM_PANTHOR_GROUP_DESTROY: Destroy a scheduling group. */
> +	DRM_PANTHOR_GROUP_DESTROY,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_SUBMIT: Submit jobs to queues belonging
> +	 * to a specific scheduling group.
> +	 */
> +	DRM_PANTHOR_GROUP_SUBMIT,
> +
> +	/** @DRM_PANTHOR_GROUP_GET_STATE: Get the state of a scheduling group. */
> +	DRM_PANTHOR_GROUP_GET_STATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_CREATE: Create a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_CREATE,
> +
> +	/** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
> +	DRM_PANTHOR_TILER_HEAP_DESTROY,
> +};
> +
> +/**
> + * DRM_IOCTL_PANTHOR() - Build a Panthor IOCTL number
> + * @__access: Access type. Must be R, W or RW.
> + * @__id: One of the DRM_PANTHOR_xxx id.
> + * @__type: Suffix of the type being passed to the IOCTL.
> + *
> + * Don't use this macro directly, use the DRM_IOCTL_PANTHOR_xxx
> + * values instead.
> + *
> + * Return: An IOCTL number to be passed to ioctl() from userspace.
> + */
> +#define DRM_IOCTL_PANTHOR(__access, __id, __type) \
> +	DRM_IO ## __access(DRM_COMMAND_BASE + DRM_PANTHOR_ ## __id, \
> +			   struct drm_panthor_ ## __type)
> +
> +#define DRM_IOCTL_PANTHOR_DEV_QUERY \
> +	DRM_IOCTL_PANTHOR(WR, DEV_QUERY, dev_query)
> +#define DRM_IOCTL_PANTHOR_VM_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, VM_CREATE, vm_create)
> +#define DRM_IOCTL_PANTHOR_VM_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, VM_DESTROY, vm_destroy)
> +#define DRM_IOCTL_PANTHOR_VM_BIND \
> +	DRM_IOCTL_PANTHOR(WR, VM_BIND, vm_bind)
> +#define DRM_IOCTL_PANTHOR_BO_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, BO_CREATE, bo_create)
> +#define DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET \
> +	DRM_IOCTL_PANTHOR(WR, BO_MMAP_OFFSET, bo_mmap_offset)
> +#define DRM_IOCTL_PANTHOR_GROUP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_CREATE, group_create)
> +#define DRM_IOCTL_PANTHOR_GROUP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_DESTROY, group_destroy)
> +#define DRM_IOCTL_PANTHOR_GROUP_SUBMIT \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_SUBMIT, group_submit)
> +#define DRM_IOCTL_PANTHOR_GROUP_GET_STATE \
> +	DRM_IOCTL_PANTHOR(WR, GROUP_GET_STATE, group_get_state)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create)
> +#define DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY \
> +	DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy)
> +
> +/**
> + * DOC: IOCTL arguments
> + */
> +
> +/**
> + * struct drm_panthor_obj_array - Object array.
> + *
> + * This object is used to pass an array of objects whose size it subject to changes in

s/it subject/is subject/

> + * future versions of the driver. In order to support this mutability, we pass a stride
> + * describing the size of the object as known by userspace.
> + *
> + * You shouldn't fill drm_panthor_obj_array fields directly. You should instead use
> + * the DRM_PANTHOR_OBJ_ARRAY() macro that takes care of initializing the stride to
> + * the object size.
> + */
> +struct drm_panthor_obj_array {
> +	/** @stride: Stride of object struct. Used for versioning. */
> +	__u32 stride;
> +
> +	/** @count: Number of objects in the array. */
> +	__u32 count;
> +
> +	/** @array: User pointer to an array of objects. */
> +	__u64 array;
> +};
> +
> +/**
> + * DRM_PANTHOR_OBJ_ARRAY() - Initialize a drm_panthor_obj_array field.
> + * @cnt: Number of elements in the array.
> + * @ptr: Pointer to the array to pass to the kernel.
> + *
> + * Macro initializing a drm_panthor_obj_array based on the object size as known
> + * by userspace.
> + */
> +#define DRM_PANTHOR_OBJ_ARRAY(cnt, ptr) \
> +	{ .stride = sizeof((ptr)[0]), .count = (cnt), .array = (__u64)(uintptr_t)(ptr) }
> +
> +/**
> + * enum drm_panthor_sync_op_flags - Synchronization operation flags.
> + */
> +enum drm_panthor_sync_op_flags {
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK: Synchronization handle type mask. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK = 0xff,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ: Synchronization object type. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ = 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ: Timeline synchronization
> +	 * object type.
> +	 */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ = 1,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_WAIT: Wait operation. */
> +	DRM_PANTHOR_SYNC_OP_WAIT = 0 << 31,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_SIGNAL: Signal operation. */
> +	DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,

This gets flagged by GCC in pedantic mode as not an integer constant, see [1]. Fix is to use

	DRM_PANTHOR_SYNC_OP_SIGNAL = (int)(1u << 31),

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71803

> +};
> +
> +/**
> + * struct drm_panthor_sync_op - Synchronization operation.
> + */
> +struct drm_panthor_sync_op {
> +	/** @flags: Synchronization operation flags. Combination of DRM_PANTHOR_SYNC_OP values. */
> +	__u32 flags;
> +
> +	/** @handle: Sync handle. */
> +	__u32 handle;
> +
> +	/**
> +	 * @timeline_value: MBZ if
> +	 * (flags & DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK) !=
> +	 * DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ.
> +	 */
> +	__u64 timeline_value;
> +};
> +
> +/**
> + * enum drm_panthor_dev_query_type - Query type
> + *
> + * Place new types at the end, don't re-order, don't remove or replace.
> + */
> +enum drm_panthor_dev_query_type {
> +	/** @DRM_PANTHOR_DEV_QUERY_GPU_INFO: Query GPU information. */
> +	DRM_PANTHOR_DEV_QUERY_GPU_INFO = 0,
> +
> +	/** @DRM_PANTHOR_DEV_QUERY_CSIF_INFO: Query command-stream interface information. */
> +	DRM_PANTHOR_DEV_QUERY_CSIF_INFO,
> +};
> +
> +/**
> + * struct drm_panthor_gpu_info - GPU information
> + *
> + * Structure grouping all queryable information relating to the GPU.
> + */
> +struct drm_panthor_gpu_info {
> +	/** @gpu_id : GPU ID. */
> +	__u32 gpu_id;
> +#define DRM_PANTHOR_ARCH_MAJOR(x)		((x) >> 28)
> +#define DRM_PANTHOR_ARCH_MINOR(x)		(((x) >> 24) & 0xf)
> +#define DRM_PANTHOR_ARCH_REV(x)			(((x) >> 20) & 0xf)
> +#define DRM_PANTHOR_PRODUCT_MAJOR(x)		(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_VERSION_MAJOR(x)		(((x) >> 12) & 0xf)
> +#define DRM_PANTHOR_VERSION_MINOR(x)		(((x) >> 4) & 0xff)
> +#define DRM_PANTHOR_VERSION_STATUS(x)		((x) & 0xf)
> +
> +	/** @gpu_rev: GPU revision. */
> +	__u32 gpu_rev;
> +
> +	/** @csf_id: Command stream frontend ID. */
> +	__u32 csf_id;
> +#define DRM_PANTHOR_CSHW_MAJOR(x)		(((x) >> 26) & 0x3f)
> +#define DRM_PANTHOR_CSHW_MINOR(x)		(((x) >> 20) & 0x3f)
> +#define DRM_PANTHOR_CSHW_REV(x)			(((x) >> 16) & 0xf)
> +#define DRM_PANTHOR_MCU_MAJOR(x)		(((x) >> 10) & 0x3f)
> +#define DRM_PANTHOR_MCU_MINOR(x)		(((x) >> 4) & 0x3f)
> +#define DRM_PANTHOR_MCU_REV(x)			((x) & 0xf)
> +
> +	/** @l2_features: L2-cache features. */
> +	__u32 l2_features;
> +
> +	/** @tiler_features: Tiler features. */
> +	__u32 tiler_features;
> +
> +	/** @mem_features: Memory features. */
> +	__u32 mem_features;
> +
> +	/** @mmu_features: MMU features. */
> +	__u32 mmu_features;
> +#define DRM_PANTHOR_MMU_VA_BITS(x)		((x) & 0xff)
> +
> +	/** @thread_features: Thread features. */
> +	__u32 thread_features;
> +
> +	/** @max_threads: Maximum number of threads. */
> +	__u32 max_threads;
> +
> +	/** @thread_max_workgroup_size: Maximum workgroup size. */
> +	__u32 thread_max_workgroup_size;
> +
> +	/**
> +	 * @thread_max_barrier_size: Maximum number of threads that can wait
> +	 * simultaneously on a barrier.
> +	 */
> +	__u32 thread_max_barrier_size;
> +
> +	/** @coherency_features: Coherency features. */
> +	__u32 coherency_features;
> +
> +	/** @texture_features: Texture features. */
> +	__u32 texture_features[4];
> +
> +	/** @as_present: Bitmask encoding the number of address-space exposed by the MMU. */
> +	__u32 as_present;
> +
> +	/** @core_group_count: Number of core groups. */
> +	__u32 core_group_count;
> +
> +	/** @pad: Zero on return. */
> +	__u32 pad;
> +
> +	/** @shader_present: Bitmask encoding the shader cores exposed by the GPU. */
> +	__u64 shader_present;
> +
> +	/** @l2_present: Bitmask encoding the L2 caches exposed by the GPU. */
> +	__u64 l2_present;
> +
> +	/** @tiler_present: Bitmask encoding the tiler unit exposed by the GPU. */
> +	__u64 tiler_present;
> +};
> +
> +/**
> + * struct drm_panthor_csif_info - Command stream interface information
> + *
> + * Structure grouping all queryable information relating to the command stream interface.
> + */
> +struct drm_panthor_csif_info {
> +	/** @csg_slot_count: Number of command stream group slots exposed by the firmware. */
> +	__u32 csg_slot_count;
> +
> +	/** @cs_slot_count: Number of command stream slots per group. */
> +	__u32 cs_slot_count;
> +
> +	/** @cs_reg_count: Number of command stream registers. */
> +	__u32 cs_reg_count;
> +
> +	/** @scoreboard_slot_count: Number of scoreboard slots. */
> +	__u32 scoreboard_slot_count;
> +
> +	/**
> +	 * @unpreserved_cs_reg_count: Number of command stream registers reserved by
> +	 * the kernel driver to call a userspace command stream.
> +	 *
> +	 * All registers can be used by a userspace command stream, but the
> +	 * [cs_slot_count - unpreserved_cs_reg_count .. cs_slot_count] registers are
> +	 * used by the kernel when DRM_IOCTL_PANTHOR_GROUP_SUBMIT is called.
> +	 */
> +	__u32 unpreserved_cs_reg_count;
> +
> +	/**
> +	 * @pad: Padding field, set to zero.
> +	 */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_dev_query - Arguments passed to DRM_PANTHOR_IOCTL_DEV_QUERY
> + */
> +struct drm_panthor_dev_query {
> +	/** @type: the query type (see drm_panthor_dev_query_type). */
> +	__u32 type;
> +
> +	/**
> +	 * @size: size of the type being queried.
> +	 *
> +	 * If pointer is NULL, size is updated by the driver to provide the
> +	 * output structure size. If pointer is not NULL, the driver will
> +	 * only copy min(size, actual_structure_size) bytes to the pointer,
> +	 * and update the size accordingly. This allows us to extend query
> +	 * types without breaking userspace.
> +	 */
> +	__u32 size;
> +
> +	/**
> +	 * @pointer: user pointer to a query type struct.
> +	 *
> +	 * Pointer can be NULL, in which case, nothing is copied, but the
> +	 * actual structure size is returned. If not NULL, it must point to
> +	 * a location that's large enough to hold size bytes.
> +	 */
> +	__u64 pointer;
> +};
> +
> +/**
> + * struct drm_panthor_vm_create - Arguments passed to DRM_PANTHOR_IOCTL_VM_CREATE
> + */
> +struct drm_panthor_vm_create {
> +	/** @flags: VM flags, MBZ. */
> +	__u32 flags;
> +
> +	/** @id: Returned VM ID. */
> +	__u32 id;
> +
> +	/**
> +	 * @kernel_va_range: Size of the VA space reserved for kernel objects.
> +	 *
> +	 * If kernel_va_range is zero, we pick half of the VA space for kernel objects.
> +	 *
> +	 * Kernel VA space is always placed at the top of the supported VA range.
> +	 */
> +	__u64 kernel_va_range;
> +};
> +
> +/**
> + * struct drm_panthor_vm_destroy - Arguments passed to DRM_PANTHOR_IOCTL_VM_DESTROY
> + */
> +struct drm_panthor_vm_destroy {
> +	/** @id: ID of the VM to destroy. */
> +	__u32 id;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_op_flags - VM bind operation flags
> + */
> +enum drm_panthor_vm_bind_op_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_READONLY: Map the memory read-only.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_READONLY = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC: Map the memory not-executable.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC = 1 << 1,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED: Map the memory uncached.
> +	 *
> +	 * Only valid with DRM_PANTHOR_VM_BIND_OP_TYPE_MAP.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED = 1 << 2,
> +
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_TYPE_MASK: Mask used to determine the type of operation.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = 0xf << 28,

Same here for GCC being pedantic. Also, this value exceeds INT_MAX, so it overflows into the sign bit in the same way.

Rest of the file looks good to me.

Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>

Best regard,
Liviu

> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_MAP: Map operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MAP = 0 << 28,
> +
> +	/** @DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP: Unmap operation. */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_UNMAP = 1 << 28,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind_op - VM bind operation
> + */
> +struct drm_panthor_vm_bind_op {
> +	/** @flags: Combination of drm_panthor_vm_bind_op_flags flags. */
> +	__u32 flags;
> +
> +	/**
> +	 * @bo_handle: Handle of the buffer object to map.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u32 bo_handle;
> +
> +	/**
> +	 * @bo_offset: Buffer object offset.
> +	 * MBZ for unmap operations.
> +	 */
> +	__u64 bo_offset;
> +
> +	/**
> +	 * @va: Virtual address to map/unmap.
> +	 */
> +	__u64 va;
> +
> +	/** @size: Size to map/unmap. */
> +	__u64 size;
> +
> +	/**
> +	 * @syncs: Array of synchronization operations.
> +	 *
> +	 * This array must be empty if %DRM_PANTHOR_VM_BIND_ASYNC is not set on
> +	 * the drm_panthor_vm_bind object containing this VM bind operation.
> +	 */
> +	struct drm_panthor_obj_array syncs;
> +
> +};
> +
> +/**
> + * enum drm_panthor_vm_bind_flags - VM bind flags
> + */
> +enum drm_panthor_vm_bind_flags {
> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_ASYNC: VM bind operations are queued to the VM
> +	 * queue instead of being executed synchronously.
> +	 */
> +	DRM_PANTHOR_VM_BIND_ASYNC = 1 << 0,
> +};
> +
> +/**
> + * struct drm_panthor_vm_bind - Arguments passed to DRM_IOCTL_PANTHOR_VM_BIND
> + */
> +struct drm_panthor_vm_bind {
> +	/** @vm_id: VM targeted by the bind request. */
> +	__u32 vm_id;
> +
> +	/** @flags: Combination of drm_panthor_vm_bind_flags flags. */
> +	__u32 flags;
> +
> +	/** @ops: Array of bind operations. */
> +	struct drm_panthor_obj_array ops;
> +};
> +
> +/**
> + * enum drm_panthor_bo_flags - Buffer object flags, passed at creation time.
> + */
> +enum drm_panthor_bo_flags {
> +	/** @DRM_PANTHOR_BO_NO_MMAP: The buffer object will never be CPU-mapped in userspace. */
> +	DRM_PANTHOR_BO_NO_MMAP = (1 << 0),
> +};
> +
> +/**
> + * struct drm_panthor_bo_create - Arguments passed to DRM_IOCTL_PANTHOR_BO_CREATE.
> + */
> +struct drm_panthor_bo_create {
> +	/**
> +	 * @size: Requested size for the object
> +	 *
> +	 * The (page-aligned) allocated size for the object will be returned.
> +	 */
> +	__u64 size;
> +
> +	/**
> +	 * @flags: Flags. Must be a combination of drm_panthor_bo_flags flags.
> +	 */
> +	__u32 flags;
> +
> +	/**
> +	 * @exclusive_vm_id: Exclusive VM this buffer object will be mapped to.
> +	 *
> +	 * If not zero, the field must refer to a valid VM ID, and implies that:
> +	 *  - the buffer object will only ever be bound to that VM
> +	 *  - cannot be exported as a PRIME fd
> +	 */
> +	__u32 exclusive_vm_id;
> +
> +	/**
> +	 * @handle: Returned handle for the object.
> +	 *
> +	 * Object handles are nonzero.
> +	 */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_bo_mmap_offset - Arguments passed to DRM_IOCTL_PANTHOR_BO_MMAP_OFFSET.
> + */
> +struct drm_panthor_bo_mmap_offset {
> +	/** @handle: Handle of the object we want an mmap offset for. */
> +	__u32 handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @offset: The fake offset to use for subsequent mmap calls. */
> +	__u64 offset;
> +};
> +
> +/**
> + * struct drm_panthor_queue_create - Queue creation arguments.
> + */
> +struct drm_panthor_queue_create {
> +	/**
> +	 * @priority: Defines the priority of queues inside a group. Goes from 0 to 15,
> +	 * 15 being the highest priority.
> +	 */
> +	__u8 priority;
> +
> +	/** @pad: Padding fields, MBZ. */
> +	__u8 pad[3];
> +
> +	/** @ringbuf_size: Size of the ring buffer to allocate to this queue. */
> +	__u32 ringbuf_size;
> +};
> +
> +/**
> + * enum drm_panthor_group_priority - Scheduling group priority
> + */
> +enum drm_panthor_group_priority {
> +	/** @PANTHOR_GROUP_PRIORITY_LOW: Low priority group. */
> +	PANTHOR_GROUP_PRIORITY_LOW = 0,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
> +	PANTHOR_GROUP_PRIORITY_MEDIUM,
> +
> +	/** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
> +	PANTHOR_GROUP_PRIORITY_HIGH,
> +};
> +
> +/**
> + * struct drm_panthor_group_create - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_CREATE
> + */
> +struct drm_panthor_group_create {
> +	/** @queues: Array of drm_panthor_queue_create elements. */
> +	struct drm_panthor_obj_array queues;
> +
> +	/**
> +	 * @max_compute_cores: Maximum number of cores that can be used by compute
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less than or equal to the number of bits set in @compute_core_mask.
> +	 */
> +	__u8 max_compute_cores;
> +
> +	/**
> +	 * @max_fragment_cores: Maximum number of cores that can be used by fragment
> +	 * jobs across CS queues bound to this group.
> +	 *
> +	 * Must be less than or equal to the number of bits set in @fragment_core_mask.
> +	 */
> +	__u8 max_fragment_cores;
> +
> +	/**
> +	 * @max_tiler_cores: Maximum number of tilers that can be used by tiler jobs
> +	 * across CS queues bound to this group.
> +	 *
> +	 * Must be less than or equal to the number of bits set in @tiler_core_mask.
> +	 */
> +	__u8 max_tiler_cores;
> +
> +	/** @priority: Group priority (see drm_panthor_group_priority). */
> +	__u8 priority;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +
> +	/**
> +	 * @compute_core_mask: Mask encoding cores that can be used for compute jobs.
> +	 *
> +	 * This field must have at least @max_compute_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 compute_core_mask;
> +
> +	/**
> +	 * @fragment_core_mask: Mask encoding cores that can be used for fragment jobs.
> +	 *
> +	 * This field must have at least @max_fragment_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::shader_present.
> +	 */
> +	__u64 fragment_core_mask;
> +
> +	/**
> +	 * @tiler_core_mask: Mask encoding cores that can be used for tiler jobs.
> +	 *
> +	 * This field must have at least @max_tiler_cores bits set.
> +	 *
> +	 * The bits set here should also be set in drm_panthor_gpu_info::tiler_present.
> +	 */
> +	__u64 tiler_core_mask;
> +
> +	/**
> +	 * @vm_id: VM ID to bind this group to.
> +	 *
> +	 * All submission to queues bound to this group will use this VM.
> +	 */
> +	__u32 vm_id;
> +
> +	/**
> +	 * @group_handle: Returned group handle. Passed back when submitting jobs or
> +	 * destroying a group.
> +	 */
> +	__u32 group_handle;
> +};
> +
> +/**
> + * struct drm_panthor_group_destroy - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_DESTROY
> + */
> +struct drm_panthor_group_destroy {
> +	/** @group_handle: Group to destroy */
> +	__u32 group_handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_queue_submit - Job submission arguments.
> + *
> + * This is describing the userspace command stream to call from the kernel
> + * command stream ring-buffer. Queue submission is always part of a group
> + * submission, taking one or more jobs to submit to the underlying queues.
> + */
> +struct drm_panthor_queue_submit {
> +	/** @queue_index: Index of the queue inside a group. */
> +	__u32 queue_index;
> +
> +	/**
> +	 * @stream_size: Size of the command stream to execute.
> +	 *
> +	 * Must be 64-bit/8-byte aligned (the size of a CS instruction)
> +	 *
> +	 * Can be zero if stream_addr is zero too.
> +	 */
> +	__u32 stream_size;
> +
> +	/**
> +	 * @stream_addr: GPU address of the command stream to execute.
> +	 *
> +	 * Must be aligned on 64-byte.
> +	 *
> +	 * Can be zero if stream_size is zero too.
> +	 */
> +	__u64 stream_addr;
> +
> +	/**
> +	 * @latest_flush: FLUSH_ID read at the time the stream was built.
> +	 *
> +	 * This allows cache flush elimination for the automatic
> +	 * flush+invalidate(all) done at submission time, which is needed to
> +	 * ensure the GPU doesn't get garbage when reading the indirect command
> +	 * stream buffers. If you want the cache flush to happen
> +	 * unconditionally, pass a zero here.
> +	 */
> +	__u32 latest_flush;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @syncs: Array of sync operations. */
> +	struct drm_panthor_obj_array syncs;
> +};
> +
> +/**
> + * struct drm_panthor_group_submit - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_SUBMIT
> + */
> +struct drm_panthor_group_submit {
> +	/** @group_handle: Handle of the group to queue jobs to. */
> +	__u32 group_handle;
> +
> +	/** @pad: MBZ. */
> +	__u32 pad;
> +
> +	/** @queue_submits: Array of drm_panthor_queue_submit objects. */
> +	struct drm_panthor_obj_array queue_submits;
> +};
> +
> +/**
> + * enum drm_panthor_group_state_flags - Group state flags
> + */
> +enum drm_panthor_group_state_flags {
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_TIMEDOUT: Group had unfinished jobs.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_TIMEDOUT = 1 << 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_GROUP_STATE_FATAL_FAULT: Group had fatal faults.
> +	 *
> +	 * When a group ends up with this flag set, no jobs can be submitted to its queues.
> +	 */
> +	DRM_PANTHOR_GROUP_STATE_FATAL_FAULT = 1 << 1,
> +};
> +
> +/**
> + * struct drm_panthor_group_get_state - Arguments passed to DRM_IOCTL_PANTHOR_GROUP_GET_STATE
> + *
> + * Used to query the state of a group and decide whether a new group should be created to
> + * replace it.
> + */
> +struct drm_panthor_group_get_state {
> +	/** @group_handle: Handle of the group to query state on */
> +	__u32 group_handle;
> +
> +	/**
> +	 * @state: Combination of DRM_PANTHOR_GROUP_STATE_* flags encoding the
> +	 * group state.
> +	 */
> +	__u32 state;
> +
> +	/** @fatal_queues: Bitmask of queues that faced fatal faults. */
> +	__u32 fatal_queues;
> +
> +	/** @pad: MBZ */
> +	__u32 pad;
> +};
> +
> +/**
> + * struct drm_panthor_tiler_heap_create - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE
> + */
> +struct drm_panthor_tiler_heap_create {
> +	/** @vm_id: VM ID the tiler heap should be mapped to */
> +	__u32 vm_id;
> +
> +	/** @initial_chunk_count: Initial number of chunks to allocate. */
> +	__u32 initial_chunk_count;
> +
> +	/** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
> +	__u32 chunk_size;
> +
> +	/** @max_chunks: Maximum number of chunks that can be allocated. */
> +	__u32 max_chunks;
> +
> +	/**
> +	 * @target_in_flight: Maximum number of in-flight render passes.
> +	 *
> +	 * If the heap has more in-flight render passes than this, the FW will wait for render
> +	 * passes to finish before queuing new tiler jobs.
> +	 */
> +	__u32 target_in_flight;
> +
> +	/** @handle: Returned heap handle. Passed back to DESTROY_TILER_HEAP. */
> +	__u32 handle;
> +
> +	/** @tiler_heap_ctx_gpu_va: Returned GPU virtual address of the heap context. */
> +	__u64 tiler_heap_ctx_gpu_va;
> +
> +	/**
> +	 * @first_heap_chunk_gpu_va: First heap chunk.
> +	 *
> +	 * The tiler heap is made of heap chunks forming a singly-linked list. This
> +	 * is the first element in the list.
> +	 */
> +	__u64 first_heap_chunk_gpu_va;
> +};
> +
> +/**
> + * struct drm_panthor_tiler_heap_destroy - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY
> + */
> +struct drm_panthor_tiler_heap_destroy {
> +	/** @handle: Handle of the tiler heap to destroy */
> +	__u32 handle;
> +
> +	/** @pad: Padding field, MBZ. */
> +	__u32 pad;
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif /* _PANTHOR_DRM_H_ */
> -- 
> 2.41.0
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
  2023-08-11 14:13   ` Steven Price
  2023-09-01 13:59   ` Liviu Dudau
@ 2023-09-01 16:10   ` Boris Brezillon
  2023-09-04  7:42     ` Steven Price
  2023-09-04 16:06   ` Robin Murphy
  2023-09-06 12:18   ` Ketil Johnsen
  4 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-09-01 16:10 UTC (permalink / raw)
  To: dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Wed,  9 Aug 2023 18:53:15 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> +/**
> + * DOC: MMIO regions exposed to userspace.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> + *
> + * File offset for all MMIO regions being exposed to userspace. Don't use
> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> + *
> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> + *
> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> + * GPU cache flushing through CS instructions, but the flush reduction
> + * mechanism requires a flush_id. This flush_id could be queried with an
> + * ioctl, but Arm provides a well-isolated register page containing only this
> + * read-only register, so let's expose this page through a static mmap offset
> + * and allow direct mapping of this MMIO region so we can avoid the
> + * user <-> kernel round-trip.
> + */
> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)

I'm playing with a 32-bit kernel/userspace, and this is problematic,
because vm_pgoff is limited to 32-bit there, meaning we can only map up
to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
userspace set the mmio range?
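
Something along these lines is what I have in mind - hypothetical
sketch, the query type and struct/field names are made up:

 /* New drm_panthor_dev_query_type entry, say
  * DRM_PANTHOR_DEV_QUERY_USER_MMIO_INFO, returning:
  */
 struct drm_panthor_user_mmio_info {
 	/*
 	 * @offset: mmap() offset of the user MMIO region.
 	 *
 	 * 1ull << 56 on a 64-bit kernel, something that still fits in
 	 * vm_pgoff (below 1ull << (PAGE_SHIFT + 32)) on a 32-bit one.
 	 */
 	__u64 offset;

 	/* @size: Size of the user MMIO region, so it can grow later. */
 	__u64 size;
 };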

> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)
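To make the question more concrete, the kind of thing I have in mind is
(pure sketch, names invented):

/* DEV_QUERY arg reporting the MMIO window the kernel picked for this ABI,
 * so a 32-bit userspace doesn't have to assume the 1ull << 56 constant is
 * reachable.
 */
struct drm_panthor_mmio_info {
	/** @offset: mmap() offset of the MMIO window. */
	__u64 offset;

	/** @size: Size of the MMIO window, in bytes. */
	__u64 size;
};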

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-09-01 16:10   ` Boris Brezillon
@ 2023-09-04  7:42     ` Steven Price
  2023-09-04  8:26       ` Boris Brezillon
  2023-09-04  9:26       ` Boris Brezillon
  0 siblings, 2 replies; 93+ messages in thread
From: Steven Price @ 2023-09-04  7:42 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 01/09/2023 17:10, Boris Brezillon wrote:
> On Wed,  9 Aug 2023 18:53:15 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
>> +/**
>> + * DOC: MMIO regions exposed to userspace.
>> + *
>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>> + *
>> + * File offset for all MMIO regions being exposed to userspace. Don't use
>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
>> + *
>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>> + *
>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
>> + * GPU cache flushing through CS instructions, but the flush reduction
>> + * mechanism requires a flush_id. This flush_id could be queried with an
>> + * ioctl, but Arm provides a well-isolated register page containing only this
>> + * read-only register, so let's expose this page through a static mmap offset
>> + * and allow direct mapping of this MMIO region so we can avoid the
>> + * user <-> kernel round-trip.
>> + */
>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
> 
> I'm playing with a 32-bit kernel/userspace, and this is problematic,
> because vm_pgoff is limited to 32-bit there, meaning we can only map up
> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> userspace set the mmio range?

Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
user space is likely to still be needed.

I can't really think of anything better than letting user space set the
MMIO range. Having an ioctl which returned a special fd just for MMIO
would be one option (which would preserve the full 44 bit GPU VA) but
seems somewhat overkill. Hiding the mmap within an ioctl would of course
be bad as it breaks tools like Valgrind.

Oh and please do make it a range - user space submission will be adding
to the MMIO range ;)

Steve

>> +#define DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET	(DRM_PANTHOR_USER_MMIO_OFFSET | 0)


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-09-04  7:42     ` Steven Price
@ 2023-09-04  8:26       ` Boris Brezillon
  2023-09-04  9:26       ` Boris Brezillon
  1 sibling, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-09-04  8:26 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Mon, 4 Sep 2023 08:42:08 +0100
Steven Price <steven.price@arm.com> wrote:

> On 01/09/2023 17:10, Boris Brezillon wrote:
> > On Wed,  9 Aug 2023 18:53:15 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> >> +/**
> >> + * DOC: MMIO regions exposed to userspace.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> >> + *
> >> + * File offset for all MMIO regions being exposed to userspace. Don't use
> >> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> >> + *
> >> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> >> + * GPU cache flushing through CS instructions, but the flush reduction
> >> + * mechanism requires a flush_id. This flush_id could be queried with an
> >> + * ioctl, but Arm provides a well-isolated register page containing only this
> >> + * read-only register, so let's expose this page through a static mmap offset
> >> + * and allow direct mapping of this MMIO region so we can avoid the
> >> + * user <-> kernel round-trip.
> >> + */
> >> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)  
> > 
> > I'm playing with a 32-bit kernel/userspace, and this is problematic,
> > because vm_pgoff is limited to 32-bit there, meaning we can only map up
> > to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> > userspace set the mmio range?  
> 
> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
> user space is likely to still be needed.

Well, I can tell you some people are using 32-bit kernels ;-).

> 
> I can't really think of anything better than letting user space set the
> MMIO range. Having an ioctl which returned a special fd just for MMIO
> would be one option (which would preserve the full 44 bit GPU VA) but
> seems somewhat overkill.

Yeah, I don't think I like the separate-fd approach. Just feels like it
goes against the DRM-way of doing things. And, with 32-bit userspace,
we'd be limited by the CPU VA range anyway. Of course it's orthogonal
to the max file offset, and just because we can't map all buffers at
once, doesn't mean we don't want to be able to address more than 4G of
memory. But with 43 bits left (I think I'd prefer if we enforce a log2
value for the mmio offset/size, meaning that the max MMIO range would be
1ull << 43), that means we're still able to address 8TB of memory. I
guess that's more than enough for 32-bit users...
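To put numbers on it (sketch only, the _32BIT name is just a placeholder):

/* vm_pgoff is 32-bit on a 32-bit kernel, so the highest reachable file
 * offset is ((1ull << 32) - 1) << PAGE_SHIFT, which sits just below
 * 1ull << 44 with 4k pages. Placing the MMIO window at 1ull << 43 and
 * capping its size at 1ull << 43 fits below that limit, and still leaves
 * 8TB of file offsets for GEM objects.
 */
#define DRM_PANTHOR_USER_MMIO_OFFSET_32BIT	(1ull << 43)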

> Hiding the mmap within an ioctl would of course
> be bad as it breaks tools like Valgrind.

Don't like this idea either.

> 
> Oh and please do make it a range - user space submission will be adding
> to the MMIO range ;)

Yeah, that was the plan (I keep usermode submission in the back of my
mind ;-)).

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-09-04  7:42     ` Steven Price
  2023-09-04  8:26       ` Boris Brezillon
@ 2023-09-04  9:26       ` Boris Brezillon
  2023-09-04 15:22         ` Steven Price
  1 sibling, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-09-04  9:26 UTC (permalink / raw)
  To: Steven Price
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Mon, 4 Sep 2023 08:42:08 +0100
Steven Price <steven.price@arm.com> wrote:

> On 01/09/2023 17:10, Boris Brezillon wrote:
> > On Wed,  9 Aug 2023 18:53:15 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> >> +/**
> >> + * DOC: MMIO regions exposed to userspace.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> >> + *
> >> + * File offset for all MMIO regions being exposed to userspace. Don't use
> >> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> >> + *
> >> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> >> + *
> >> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> >> + * GPU cache flushing through CS instructions, but the flush reduction
> >> + * mechanism requires a flush_id. This flush_id could be queried with an
> >> + * ioctl, but Arm provides a well-isolated register page containing only this
> >> + * read-only register, so let's expose this page through a static mmap offset
> >> + * and allow direct mapping of this MMIO region so we can avoid the
> >> + * user <-> kernel round-trip.
> >> + */
> >> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)  
> > 
> > I'm playing with a 32-bit kernel/userspace, and this is problematic,
> > because vm_pgoff is limited to 32-bit there, meaning we can only map up
> > to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> > userspace set the mmio range?  
> 
> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
> user space is likely to still be needed.

Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
chance you could advise me on what to do here?

1. assume this limitation is here for a good reason, and limit the GPU
VA space to 32-bits on 32-bit kernels

or

2. update the interface to make iova an u64
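FWIW, #1 would boil down to something like this on the panthor side (sketch):

	/* Don't expose more GPU VA space than what io_pgtable_ops can
	 * address on this build.
	 */
	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);

	if (va_bits > BITS_PER_LONG)
		va_bits = BITS_PER_LONG;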

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 04/15] drm/panthor: Add the device logical block
  2023-08-30 13:17       ` Steven Price
  2023-08-30 14:06         ` Boris Brezillon
@ 2023-09-04 11:46         ` Liviu Dudau
  1 sibling, 0 replies; 93+ messages in thread
From: Liviu Dudau @ 2023-09-04 11:46 UTC (permalink / raw)
  To: Steven Price
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, dri-devel,
	Boris Brezillon, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Wed, Aug 30, 2023 at 02:17:57PM +0100, Steven Price wrote:
> On 29/08/2023 15:00, Boris Brezillon wrote:
> > On Fri, 11 Aug 2023 16:47:56 +0100
> > Steven Price <steven.price@arm.com> wrote:
> > 
> >> On 09/08/2023 17:53, Boris Brezillon wrote:
> >>> The panthor driver is designed in a modular way, where each logical
> >>> block is dealing with a specific HW-block or software feature. In order
> >>> for those blocks to communicate with each other, we need a central
> >>> panthor_device collecting all the blocks, and exposing some common
> >>> features, like interrupt handling, power management, reset, ...
> >>>
> >>> This is what the panthor_device logical block is about.
> >>>
> >>> v2:
> >>> - Rename the driver (pancsf -> panthor)
> >>> - Change the license (GPL2 -> MIT + GPL2)
> >>> - Split the driver addition commit
> >>> - Add devfreq/PM support
> >>> - Use drm_dev_{unplug,enter,exit}() to provide safe device removal
> >>>
> >>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> >>> ---
> >>>  drivers/gpu/drm/panthor/panthor_device.c | 479 +++++++++++++++++++++++
> >>>  drivers/gpu/drm/panthor/panthor_device.h | 354 +++++++++++++++++
> >>>  2 files changed, 833 insertions(+)
> >>>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
> >>>  create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
> >>>
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> >>> new file mode 100644
> >>> index 000000000000..15f102116fa0
> >>> --- /dev/null
> >>> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> >>> @@ -0,0 +1,479 @@
> >>> +// SPDX-License-Identifier: GPL-2.0 or MIT
> >>> +/* Copyright 2018 Marty E. Plummer <hanetzer@startmail.com> */
> >>> +/* Copyright 2019 Linaro, Ltd, Rob Herring <robh@kernel.org> */
> >>> +/* Copyright 2023 Collabora ltd. */
> >>> +
> >>> +#include <linux/clk.h>
> >>> +#include <linux/reset.h>
> >>> +#include <linux/platform_device.h>
> >>> +#include <linux/pm_domain.h>
> >>> +#include <linux/pm_runtime.h>
> >>> +#include <linux/regulator/consumer.h>
> >>> +
> >>> +#include <drm/drm_drv.h>
> >>> +#include <drm/drm_managed.h>
> >>> +
> >>> +#include "panthor_sched.h"
> >>> +#include "panthor_device.h"
> >>> +#include "panthor_devfreq.h"
> >>> +#include "panthor_gpu.h"
> >>> +#include "panthor_fw.h"
> >>> +#include "panthor_mmu.h"
> >>> +#include "panthor_regs.h"
> >>> +
> >>> +static int panthor_clk_init(struct panthor_device *ptdev)
> >>> +{
> >>> +	ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
> >>> +	if (IS_ERR(ptdev->clks.core)) {
> >>> +		drm_err(&ptdev->base, "get 'core' clock failed %ld\n",
> >>> +			PTR_ERR(ptdev->clks.core));  
> >>
> >> I suspect it would be a good idea to use dev_err_probe() here (and
> >> below) as I believe devm_clk_get can return -EPROBE_DEFER.
> > 
> > Nice, didn't know there was a logging function that was silencing
> > probe-defer errors.
> > 
> >>
> >>> +		return PTR_ERR(ptdev->clks.core);
> >>> +	}
> >>> +
> >>> +	ptdev->clks.stacks = devm_clk_get_optional(ptdev->base.dev, "stacks");
> >>> +	if (IS_ERR(ptdev->clks.stacks)) {
> >>> +		drm_err(&ptdev->base, "get 'stacks' clock failed %ld\n",
> >>> +			PTR_ERR(ptdev->clks.stacks));
> >>> +		return PTR_ERR(ptdev->clks.stacks);
> >>> +	}
> >>> +
> >>> +	ptdev->clks.coregroup = devm_clk_get_optional(ptdev->base.dev, "coregroup");
> >>> +	if (IS_ERR(ptdev->clks.coregroup)) {
> >>> +		drm_err(&ptdev->base, "get 'coregroup' clock failed %ld\n",
> >>> +			PTR_ERR(ptdev->clks.coregroup));
> >>> +		return PTR_ERR(ptdev->clks.coregroup);
> >>> +	}
> >>> +
> >>> +	drm_info(&ptdev->base, "clock rate = %lu\n", clk_get_rate(ptdev->clks.core));
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +void panthor_device_unplug(struct panthor_device *ptdev)
> >>> +{
> >>> +	/* FIXME: This is racy. */  
> >>
> >> Can we fix this? From a quick look it seems like a sequence like below
> >> should avoid the race.
> >>
> >> 	if (!drm_dev_enter())
> >> 		/* Already unplugged */
> >> 		return;
> >> 	ptdev->base.unplugged = true;
> >> 	drm_dev_exit();
> >>
> >> Although possibly that should be in the DRM core rather than open-coded
> >> here.
> > 
> > Are you sure that's protecting us against two concurrent calls to
> > drm_dev_unplug() (drm_dev_enter() is taking a read-lock)?
> 
> Well now I'm not sure ;) This was based on the implementations of
> drm_dev_is_unplugged() and drm_dev_unplug(). drm_dev_is_unplugged()
> simply tries to enter then exit. drm_dev_unplug() sets dev->unplugged
> (without first taking any locks). So my naïve combination resulted in
> the above.
> 
> The part I was missing is the synchronize_srcu() call in
> drm_dev_unplug() is what matches up with the read lock in drm_dev_enter().
> 
> > And that's not
> > the only thing I need actually. If there are 2 threads entering
> > panthor_device_unplug(), I need to make sure the one who lost (arrived
> > after unplugged was already set) is waiting for all operations after
> > the drm_dev_unplug() call to be done, otherwise we might return from
> > platform_driver->remove() before the unplug cleanups are done, and
> > there might still be threads/workqueues accessing device resources
> > while/after they get released by the device-model.
> 
> I can't figure out how to do this other than adding a new atomic status
> bit into panthor. So something like:
> 
> 	if (!drm_dev_enter())
> 		/* Already unplugged */
> 		return;
> 
> 	if (atomic_cmpxchg(&unplugging, false, true)) {
> 		/* Racing with another thread */
> 		drm_dev_exit();
> 		/* Wait for other threads to exit */
> 		synchronize_srcu(&drm_unplug_srcu);
> 		return;
> 	}
> 
> 	panthor_xxx_unplug()
> 
> 	drm_dev_exit();
> 
> Or at least I think that might work. The need to synchronize with
> drm_unplug_srcu means this really needs a new helper in drm_drv.c.
> 
> >>
> >>> +	if (drm_dev_is_unplugged(&ptdev->base))
> >>> +		return;
> >>> +
> >>> +	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> >>> +
> >>> +	/* Call drm_dev_unplug() so any access to HW block happening after
> >>> +	 * that point get rejected.
> >>> +	 */
> >>> +	drm_dev_unplug(&ptdev->base);
> >>> +
> >>> +	/* Now, try to cleanly shutdown the GPU before the device resources
> >>> +	 * get reclaimed.
> >>> +	 */
> >>> +	panthor_sched_unplug(ptdev);
> >>> +	panthor_fw_unplug(ptdev);
> >>> +	panthor_mmu_unplug(ptdev);
> >>> +	panthor_gpu_unplug(ptdev);
> >>> +
> >>> +	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> >>> +	pm_runtime_put_sync_suspend(ptdev->base.dev);
> >>> +}
> >>> +
> >>> +static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
> >>> +{
> >>> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> >>> +
> >>> +	cancel_work_sync(&ptdev->reset.work);
> >>> +	destroy_workqueue(ptdev->reset.wq);
> >>> +}
> >>> +
> >>> +static void panthor_device_reset_work(struct work_struct *work)
> >>> +{
> >>> +	struct panthor_device *ptdev = container_of(work, struct panthor_device, reset.work);
> >>> +	int ret, cookie;
> >>> +
> >>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> >>> +		return;
> >>> +
> >>> +	panthor_sched_pre_reset(ptdev);
> >>> +	panthor_fw_pre_reset(ptdev, true);
> >>> +	panthor_mmu_pre_reset(ptdev);
> >>> +	panthor_gpu_soft_reset(ptdev);
> >>> +	panthor_gpu_l2_power_on(ptdev);
> >>> +	panthor_mmu_post_reset(ptdev);
> >>> +	ret = panthor_fw_post_reset(ptdev);
> >>> +	if (ret)
> >>> +		goto out;
> >>> +
> >>> +	atomic_set(&ptdev->reset.pending, 0);
> >>> +	panthor_sched_post_reset(ptdev);
> >>> +	drm_dev_exit(cookie);
> >>> +
> >>> +out:
> >>> +	if (ret) {  
> >>
> >> This looks like a race condition too - is there a need for a
> >> drm_dev_exit_and_unplug() function?
> > 
> > drm_dev_exit() is just releasing the read-lock. drm_dev_unplug()
> > waits for all readers to be done and sets the unplugged value to true.
> > So we only get readers/writer synchronization here, but nothing doing
> > writer/writer sync. I guess the drm core leaves that to drivers, given
> > drm_dev_unplug() is usually called from xxx_driver->remove() hook, on
> > which serialization is guaranteed by the device-model.
> > 
> > TLDR; yes, it's racy, but I don't think drm_dev_exit_and_unplug() would
> > help solve the existing race.
> 
> Yeah, I hadn't really thought through the reader/writer locks.
> 
> > It's worth noting that we currently have only 2 paths calling
> > panthor_device_unplug(): the platform_driver->remove() hook and the
> > reset worker. Calling drm_dev_unplug() might not be the right thing to
> > do, I just thought it was a good match to reflect the fact the device
> > becomes inaccessible, without adding yet another kind of device-lost
> > field.
> 
> I quite liked the unplugged approach, it hides the complexities of the
> GPU breaking nicely.
> 
> However I do think this path needs fixing in some way, because of the
> "goto out" we end up calling panthor_device_unplug() while in the
> drm_dev_enter() section. Which, unless I'm mistaken, means
> panthor_device_unplug() will call drm_dev_unplug() in that section -
> which should produce a lockdep warning at the very least, if not an
> actual deadlock.
> 
> Given it's only a read lock - I think simply moving drm_dev_exit() below
> the "out:" label fixes the deadlock without making any races worse.
> Whether the race here actually matters I'm not sure.
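In other words, the tail of the reset worker would become (sketch based on
the quoted code, only the drm_dev_exit() call moves below the label):

	ret = panthor_fw_post_reset(ptdev);
	if (ret)
		goto out;

	atomic_set(&ptdev->reset.pending, 0);
	panthor_sched_post_reset(ptdev);

out:
	drm_dev_exit(cookie);

	if (ret) {
		panthor_device_unplug(ptdev);
		drm_err(&ptdev->base, "Failed to boot MCU after reset, making device unusable.");
	}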
> 
> >>
> >>> +		panthor_device_unplug(ptdev);
> >>> +		drm_err(&ptdev->base, "Failed to boot MCU after reset, making device unusable.");
> >>> +	}
> >>> +}
> >>> +
> >>> +static bool panthor_device_is_initialized(struct panthor_device *ptdev)
> >>> +{
> >>> +	return !!ptdev->scheduler;
> >>> +}
> >>> +
> >>> +static void panthor_device_free_page(struct drm_device *ddev, void *data)
> >>> +{
> >>> +	free_page((unsigned long)data);
> >>> +}
> >>> +
> >>> +int panthor_device_init(struct panthor_device *ptdev)
> >>> +{
> >>> +	struct resource *res;
> >>> +	struct page *p;
> >>> +	int ret;
> >>> +
> >>> +	ptdev->coherent = device_get_dma_attr(ptdev->base.dev) == DEV_DMA_COHERENT;
> >>> +
> >>> +	drmm_mutex_init(&ptdev->base, &ptdev->pm.lock);
> >>> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> >>> +	p = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >>> +	if (!p)
> >>> +		return -ENOMEM;
> >>> +
> >>> +	ptdev->pm.dummy_latest_flush = page_address(p);
> >>> +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_free_page,
> >>> +				       ptdev->pm.dummy_latest_flush);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	/* Set the dummy page to the default LATEST_FLUSH value. This
> >>> +	 * will be updated on the next suspend.
> >>> +	 */
> >>> +	*ptdev->pm.dummy_latest_flush = CSF_GPU_LATEST_FLUSH_ID_DEFAULT;  
> >>
> >> I see why this register default value was defined. Although I'm not sure
> >> it has any benefit over just using zero... If the GPU is off when user
> >> space reads the FLUSH_ID then the GPU's caches are definitely empty so
> >> any flush ID is valid.
> > 
> > Zero means we'll force a cache flush for all CS that were created while
> > the device was suspended, that's not ideal.
> > 
> >>
> >> Interestingly looking at kbase it seems to use an initial value of 1
> >> (POWER_DOWN_LATEST_FLUSH_VALUE). I guess zero is less ideal because
> >> FLUSH_CACHE2 would then unconditionally do a flush.
> > 
> > I guess a value of 1 would work. It just means we'll get a spurious
> > flush if the CS is submitted after 32 flushes happened; on the other
> > hand, we also get a spurious flush on the first submitted CS when we use
> > POWER_DOWN_LATEST_FLUSH_VALUE. I'll switch to 1, drop the default def,
> > and update the comment accordingly.
> 
> Yeah, matching kbase is almost certainly the safest approach ;) Sorry, I
> was reviewing the patches mostly in order and this looked really odd
> until I started digging into it. Zero is clearly not the ideal value,
> but the reset value is also just a weird value for hardware validation
> (it enables easier checking of the wrap condition). Since kbase picks 1,
> that must be a value which works well!
> 
> >>
> >>> +
> >>> +	INIT_WORK(&ptdev->reset.work, panthor_device_reset_work);
> >>> +	ptdev->reset.wq = alloc_ordered_workqueue("panthor-reset-wq", 0);
> >>> +	if (!ptdev->reset.wq)
> >>> +		return -ENOMEM;
> >>> +
> >>> +	ret = drmm_add_action_or_reset(&ptdev->base, panthor_device_reset_cleanup, NULL);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	ret = panthor_clk_init(ptdev);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	ret = panthor_devfreq_init(ptdev);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	ptdev->iomem = devm_platform_get_and_ioremap_resource(to_platform_device(ptdev->base.dev),
> >>> +							      0, &res);
> >>> +	if (IS_ERR(ptdev->iomem))
> >>> +		return PTR_ERR(ptdev->iomem);
> >>> +
> >>> +	ptdev->phys_addr = res->start;
> >>> +
> >>> +	ret = devm_pm_runtime_enable(ptdev->base.dev);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	ret = panthor_gpu_init(ptdev);
> >>> +	if (ret)
> >>> +		goto err_rpm_put;
> >>> +
> >>> +	ret = panthor_mmu_init(ptdev);
> >>> +	if (ret)
> >>> +		goto err_rpm_put;
> >>> +
> >>> +	ret = panthor_fw_init(ptdev);
> >>> +	if (ret)
> >>> +		goto err_rpm_put;
> >>> +
> >>> +	ret = panthor_sched_init(ptdev);
> >>> +	if (ret)
> >>> +		goto err_rpm_put;
> >>> +
> >>> +	/* ~3 frames */
> >>> +	pm_runtime_set_autosuspend_delay(ptdev->base.dev, 50);
> >>> +	pm_runtime_use_autosuspend(ptdev->base.dev);
> >>> +	pm_runtime_put_autosuspend(ptdev->base.dev);
> >>> +	return 0;
> >>> +
> >>> +err_rpm_put:
> >>> +	pm_runtime_put_sync_suspend(ptdev->base.dev);
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +#define PANTHOR_EXCEPTION(id) \
> >>> +	[DRM_PANTHOR_EXCEPTION_ ## id] = { \
> >>> +		.name = #id, \
> >>> +	}
> >>> +
> >>> +struct panthor_exception_info {
> >>> +	const char *name;
> >>> +};
> >>> +
> >>> +static const struct panthor_exception_info panthor_exception_infos[] = {
> >>> +	PANTHOR_EXCEPTION(OK),
> >>> +	PANTHOR_EXCEPTION(TERMINATED),
> >>> +	PANTHOR_EXCEPTION(KABOOM),
> >>> +	PANTHOR_EXCEPTION(EUREKA),
> >>> +	PANTHOR_EXCEPTION(ACTIVE),
> >>> +	PANTHOR_EXCEPTION(CS_RES_TERM),
> >>> +	PANTHOR_EXCEPTION(CS_CONFIG_FAULT),
> >>> +	PANTHOR_EXCEPTION(CS_ENDPOINT_FAULT),
> >>> +	PANTHOR_EXCEPTION(CS_BUS_FAULT),
> >>> +	PANTHOR_EXCEPTION(CS_INSTR_INVALID),
> >>> +	PANTHOR_EXCEPTION(CS_CALL_STACK_OVERFLOW),
> >>> +	PANTHOR_EXCEPTION(CS_INHERIT_FAULT),
> >>> +	PANTHOR_EXCEPTION(INSTR_INVALID_PC),
> >>> +	PANTHOR_EXCEPTION(INSTR_INVALID_ENC),
> >>> +	PANTHOR_EXCEPTION(INSTR_BARRIER_FAULT),
> >>> +	PANTHOR_EXCEPTION(DATA_INVALID_FAULT),
> >>> +	PANTHOR_EXCEPTION(TILE_RANGE_FAULT),
> >>> +	PANTHOR_EXCEPTION(ADDR_RANGE_FAULT),
> >>> +	PANTHOR_EXCEPTION(IMPRECISE_FAULT),
> >>> +	PANTHOR_EXCEPTION(OOM),
> >>> +	PANTHOR_EXCEPTION(CSF_FW_INTERNAL_ERROR),
> >>> +	PANTHOR_EXCEPTION(CSF_RES_EVICTION_TIMEOUT),
> >>> +	PANTHOR_EXCEPTION(GPU_BUS_FAULT),
> >>> +	PANTHOR_EXCEPTION(GPU_SHAREABILITY_FAULT),
> >>> +	PANTHOR_EXCEPTION(SYS_SHAREABILITY_FAULT),
> >>> +	PANTHOR_EXCEPTION(GPU_CACHEABILITY_FAULT),
> >>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_0),
> >>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_1),
> >>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_2),
> >>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_3),
> >>> +	PANTHOR_EXCEPTION(TRANSLATION_FAULT_4),
> >>> +	PANTHOR_EXCEPTION(PERM_FAULT_0),
> >>> +	PANTHOR_EXCEPTION(PERM_FAULT_1),
> >>> +	PANTHOR_EXCEPTION(PERM_FAULT_2),
> >>> +	PANTHOR_EXCEPTION(PERM_FAULT_3),
> >>> +	PANTHOR_EXCEPTION(ACCESS_FLAG_1),
> >>> +	PANTHOR_EXCEPTION(ACCESS_FLAG_2),
> >>> +	PANTHOR_EXCEPTION(ACCESS_FLAG_3),
> >>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_IN),
> >>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
> >>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
> >>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
> >>> +	PANTHOR_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
> >>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_0),
> >>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_1),
> >>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_2),
> >>> +	PANTHOR_EXCEPTION(MEM_ATTR_FAULT_3),
> >>> +};
> >>> +
> >>> +const char *panthor_exception_name(struct panthor_device *ptdev, u32 exception_code)
> >>> +{
> >>> +	if (drm_WARN_ON(&ptdev->base,  
> >>
> >> I'm not convinced this should be a WARN_ON as I suspect it's probably
> >> possible to inject values from user space (although I'm not completely
> >> sure on that).
> > 
> > Normally no (it's something returned by the FW), unless userspace gets
> > access to the kernel <-> FW interface, which would be worrisome :-).
> 
> I've no idea if it's actually possible, but it feels like it should be
> possible to create a firmware synchronisation object with an error code
> chosen by the user and possibly then propagate that error code back to
> the kernel. It's certainly not trivial though. Either way the WARN is
> unnecessary.
> 
> >> It's certainly not a driver error as such if we can't
> >> decode the value.
> > 
> > Ack on dropping the WARN_ON().
> > 
> >>
> >>> +			exception_code >= ARRAY_SIZE(panthor_exception_infos) ||
> >>> +			!panthor_exception_infos[exception_code].name))
> >>> +		return "Unknown exception type";
> >>> +
> >>> +	return panthor_exception_infos[exception_code].name;
> >>> +}
> >>> +
> >>> +static vm_fault_t panthor_mmio_vm_fault(struct vm_fault *vmf)
> >>> +{
> >>> +	struct vm_area_struct *vma = vmf->vma;
> >>> +	struct panthor_device *ptdev = vma->vm_private_data;
> >>> +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
> >>> +	unsigned long pfn;
> >>> +	pgprot_t pgprot;
> >>> +	vm_fault_t ret;
> >>> +	bool active;
> >>> +	int cookie;
> >>> +
> >>> +	if (!drm_dev_enter(&ptdev->base, &cookie))
> >>> +		return VM_FAULT_SIGBUS;
> >>> +
> >>> +	mutex_lock(&ptdev->pm.lock);
> >>> +	active = atomic_read(&ptdev->pm.state) == PANTHOR_DEVICE_PM_STATE_ACTIVE;
> >>> +
> >>> +	switch (id) {
> >>> +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
> >>> +		if (active)
> >>> +			pfn = __phys_to_pfn(ptdev->phys_addr + CSF_GPU_LATEST_FLUSH_ID);
> >>> +		else
> >>> +			pfn = virt_to_pfn(ptdev->pm.dummy_latest_flush);
> >>> +		break;
> >>> +
> >>> +	default:
> >>> +		ret = VM_FAULT_SIGBUS;
> >>> +		goto out_unlock;
> >>> +	}
> >>> +
> >>> +	pgprot = vma->vm_page_prot;
> >>> +	if (active)
> >>> +		pgprot = pgprot_noncached(pgprot);
> >>> +
> >>> +	ret = vmf_insert_pfn_prot(vma, vmf->address, pfn, pgprot);
> >>> +
> >>> +out_unlock:
> >>> +	mutex_unlock(&ptdev->pm.lock);
> >>> +	drm_dev_exit(cookie);
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static const struct vm_operations_struct panthor_mmio_vm_ops = {
> >>> +	.fault = panthor_mmio_vm_fault,
> >>> +};
> >>> +
> >>> +int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *vma)
> >>> +{
> >>> +	u64 id = vma->vm_pgoff << PAGE_SHIFT;
> >>> +
> >>> +	switch (id) {
> >>> +	case DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET:
> >>> +		if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
> >>> +		    (vma->vm_flags & (VM_WRITE | VM_EXEC)))
> >>> +			return -EINVAL;
> >>> +
> >>> +		break;
> >>> +
> >>> +	default:
> >>> +		return -EINVAL;
> >>> +	}
> >>> +
> >>> +	/* Defer actual mapping to the fault handler. */
> >>> +	vma->vm_private_data = ptdev;
> >>> +	vma->vm_ops = &panthor_mmio_vm_ops;
> >>> +	vm_flags_set(vma,
> >>> +		     VM_IO | VM_DONTCOPY | VM_DONTEXPAND |
> >>> +		     VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP);
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_PM
> >>> +int panthor_device_resume(struct device *dev)
> >>> +{
> >>> +	struct panthor_device *ptdev = dev_get_drvdata(dev);
> >>> +	int ret, cookie;
> >>> +
> >>> +	mutex_lock(&ptdev->pm.lock);
> >>> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_RESUMING);
> >>> +
> >>> +	ret = clk_prepare_enable(ptdev->clks.core);
> >>> +	if (ret)
> >>> +		goto err_unlock;
> >>> +
> >>> +	ret = clk_prepare_enable(ptdev->clks.stacks);
> >>> +	if (ret)
> >>> +		goto err_disable_core_clk;
> >>> +
> >>> +	ret = clk_prepare_enable(ptdev->clks.coregroup);
> >>> +	if (ret)
> >>> +		goto err_disable_stacks_clk;
> >>> +
> >>> +	ret = panthor_devfreq_resume(ptdev);
> >>> +	if (ret)
> >>> +		goto err_disable_coregroup_clk;
> >>> +
> >>> +	if (panthor_device_is_initialized(ptdev) &&
> >>> +	    drm_dev_enter(&ptdev->base, &cookie)) {
> >>> +		panthor_gpu_resume(ptdev);
> >>> +		panthor_mmu_resume(ptdev);
> >>> +		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> >>> +		if (!ret)
> >>> +			panthor_sched_resume(ptdev);
> >>> +
> >>> +		drm_dev_exit(cookie);
> >>> +
> >>> +		if (ret)
> >>> +			goto err_devfreq_suspend;
> >>> +	}
> >>> +
> >>> +	/* Clear all IOMEM mappings pointing to this device after we've
> >>> +	 * resumed. This way the fake mappings pointing to the dummy pages
> >>> +	 * are removed and the real iomem mapping will be restored on next
> >>> +	 * access.
> >>> +	 */
> >>> +	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> >>> +			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> >>> +	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);  
> >>
> >> Is the ordering here correct? I think we need to set ACTIVE before the
> >> unmap_mapping_range otherwise there is a (very small) race where user
> >> space could fault the page (and get the dummy mapping) before the
> >> atomic_set.
> > 
> > We take the pm.lock in panthor_mmio_vm_fault().
> > 
> >>
> >> Hmm, actually we have the pm.lock, so no this isn't racy. In which case
> >> is there a good reason that you're using atomics? I can see two accesses
> >> which aren't protected by pm.lock:
> >>
> >>   * the early out in panthor_device_suspend() - which could easily be
> >> moved inside the lock.
> > 
> > When we're in suspend() we are the one in control of the pm.state, so
> > no race expected here.
> > 
> >>
> >>   * panthor_device_schedule_reset() - this looks racy (the power down
> >> could happen immediately after the atomic_read()), so I suspect it would
> >> be better moving the check into panthor_device_reset_work() and
> >> performing it with the pm.lock held.
> > 
> > I think the main reason for it being an atomic is because I didn't
> > have PM locking in the initial implementation, but I ended up adding
> > locking at some point because I didn't really have a choice. I thought
> > the race didn't exist because of the workqueue synchronization/work
> > cancellation that happens in panthor_sched_suspend(), but I see now
> > that it's not protecting us (thread queuing the job could be paused
> > just after checking the PM state and resumed after the suspend
> > happened). This being said, we might have a lock ordering issue if we
> > take the PM lock in that path (I need to check that).
> 
> Yeah I didn't bother to check whether it would create ordering issues...
> ;) I'll leave you to figure out the fix - the whole atomic + mutex was
> confusing and doesn't seem to have quite worked.
> 
> [...]
> 
> >>> +
> >>> +/**
> >>> + * PANTHOR_IRQ_HANDLER() - Define interrupt handlers and the interrupt
> >>> + * registration function.
> >>> + *
> >>> + * The boiler-plate to gracefully deal with shared interrupts is
> >>> + * auto-generated. All you have to do is call PANTHOR_IRQ_HANDLER()
> >>> + * just after you actual handler. The handler prototype is:  
> >> s/you/your/ or probably s/you/the/ since we don't expect people to be
> >> adding more ;)
> >>
> >>> + *
> >>> + * void (*handler)(struct panthor_device *, u32 status);
> >>> + */
> >>> +#define PANTHOR_IRQ_HANDLER(__name, __reg_prefix, __handler)					\
> >>> +static irqreturn_t panthor_ ## __name ## _irq_raw_handler(int irq, void *data)			\
> >>> +{												\
> >>> +	struct panthor_irq *pirq = data;							\
> >>> +	struct panthor_device *ptdev = pirq->ptdev;						\  
> >>
> >> Maybe I'm missing something, but I was expecting a check here for if the
> >> irq has been suspended and to avoid the register reads if it was.
> > 
> > Thought the INT_MASK=0 + synchronize_irq() in panthor_xxx_irq_suspend()
> > would guarantee that the handler can't be called after
> > panthor_xxx_irq_suspend() was called.
> 
> If the IRQ is shared then Linux doesn't know which device caused the
> interrupt, so another device's (shared) interrupt could cause our
> handler to be run.
> 
> >> Otherwise I'm not entirely sure I follow what all this code is for.
> > 
> > Not entirely sure which code we're talking about. The reason we
> > don't use the default raw IRQ handler is because it doesn't work if the
> > irq line is shared. In that case, we need to mask all interrupts to
> > make sure other handlers on the same irq line don't get spammed with
> > our IRQs.
> 
> What I'm not following is why we need all this extra infrastructure for
> IRQs. The 'setting the mask to 0' during suspend is simple enough and
> could be included in code which now calls panthor_xxx_irq_suspend()
> (equally for restoring the mask on resume). But there's loads more
> code here.
> 
> My initial thought when I looked at this was that you were trying to
> solve the issue of a shared IRQ where Mali might get powered off, but
> the IRQ is then triggered by another device. In that case touching the
> Mali registers would be problematic, so I was expecting some code in
> _irq_raw_handler() to check whether the IRQ couldn't possibly be for us
> (i.e. mask==0) and early out with IRQ_NONE. kbase has a concept like
> this "gpu_powered" for exactly this reason.

There is also support for Juno setups where all the GPU IRQs are muxed into
a single interrupt out of the FPGAs, so you need to share the line between
blocks. I had initially suggested the generic approach to Boris over a
freedesktop MR, but even that one did not handle the case where you would
share the interrupt with another device (who would want to do that anyway,
right? :) )
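For completeness, the early-out Steve describes would look roughly like this
in the raw handler (sketch; XXX stands for the register prefix, and it reuses
the suspended flag that panthor_irq already carries):

static irqreturn_t panthor_xxx_irq_raw_handler(int irq, void *data)
{
	struct panthor_irq *pirq = data;
	struct panthor_device *ptdev = pirq->ptdev;

	/* The line is shared: if our block is suspended, don't touch the
	 * registers and let the other handlers have a look.
	 */
	if (atomic_read(&pirq->suspended))
		return IRQ_NONE;

	if (!gpu_read(ptdev, XXX_INT_STAT))
		return IRQ_NONE;

	gpu_write(ptdev, XXX_INT_MASK, 0);
	return IRQ_WAKE_THREAD;
}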

Best regards,
Liviu

> 
> But I can't see anything in the code to handle that case. And the
> "spamming" of other drivers during suspend shouldn't really happen
> (there's something odd going on if the hardware is generating interrupts
> when it's meant to be suspended).
> 
> But maybe I'm just missing something - it's a while since I've dealt
> with interrupt code in Linux.
> 
> Steve
> 
> >>
> >> Steve
> >>
> >>> +												\
> >>> +	if (!gpu_read(ptdev, __reg_prefix ## _INT_STAT))					\
> >>> +		return IRQ_NONE;								\
> >>> +												\
> >>> +	gpu_write(ptdev, __reg_prefix ## _INT_MASK, 0);						\
> >>> +	return IRQ_WAKE_THREAD;									\
> >>> +}												\
> >>> +												\
> >>> +static irqreturn_t panthor_ ## __name ## _irq_threaded_handler(int irq, void *data)		\
> >>> +{												\
> >>> +	struct panthor_irq *pirq = data;							\
> >>> +	struct panthor_device *ptdev = pirq->ptdev;						\
> >>> +	irqreturn_t ret = IRQ_NONE;								\
> >>> +												\
> >>> +	while (true) {										\
> >>> +		u32 status = gpu_read(ptdev, __reg_prefix ## _INT_RAWSTAT) & pirq->mask;	\
> >>> +												\
> >>> +		if (!status)									\
> >>> +			break;									\
> >>> +												\
> >>> +		gpu_write(ptdev, __reg_prefix ## _INT_CLEAR, status);				\
> >>> +												\
> >>> +		__handler(ptdev, status);							\
> >>> +		ret = IRQ_HANDLED;								\
> >>> +	}											\
> >>> +												\
> >>> +	if (!atomic_read(&pirq->suspended))							\
> >>> +		gpu_write(ptdev, __reg_prefix ## _INT_MASK, pirq->mask);			\
> >>> +												\
> >>> +	return ret;										\
> >>> +}												\
> >>> +												\
> >>> +static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq)			\
> >>> +{												\
> >>> +	int cookie;										\
> >>> +												\
> >>> +	atomic_set(&pirq->suspended, true);							\
> >>> +												\
> >>> +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> >>> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0);				\
> >>> +		synchronize_irq(pirq->irq);							\
> >>> +		drm_dev_exit(cookie);								\
> >>> +	}											\
> >>> +												\
> >>> +	pirq->mask = 0;										\
> >>> +}												\
> >>> +												\
> >>> +static inline void panthor_ ## __name ## _irq_resume(struct panthor_irq *pirq, u32 mask)	\
> >>> +{												\
> >>> +	int cookie;										\
> >>> +												\
> >>> +	atomic_set(&pirq->suspended, false);							\
> >>> +	pirq->mask = mask;									\
> >>> +												\
> >>> +	if (drm_dev_enter(&pirq->ptdev->base, &cookie)) {					\
> >>> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_CLEAR, mask);			\
> >>> +		gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, mask);			\
> >>> +		drm_dev_exit(cookie);								\
> >>> +	}											\
> >>> +}												\
> >>> +												\
> >>> +static int panthor_request_ ## __name ## _irq(struct panthor_device *ptdev,			\
> >>> +					      struct panthor_irq *pirq,				\
> >>> +					      int irq, u32 mask)				\
> >>> +{												\
> >>> +	pirq->ptdev = ptdev;									\
> >>> +	pirq->irq = irq;									\
> >>> +	panthor_ ## __name ## _irq_resume(pirq, mask);						\
> >>> +												\
> >>> +	return devm_request_threaded_irq(ptdev->base.dev, irq,					\
> >>> +					 panthor_ ## __name ## _irq_raw_handler,		\
> >>> +					 panthor_ ## __name ## _irq_threaded_handler,		\
> >>> +					 IRQF_SHARED, KBUILD_MODNAME "-" # __name,		\
> >>> +					 pirq);							\
> >>> +}
> >>> +
> >>> +extern struct workqueue_struct *panthor_cleanup_wq;
> >>> +
> >>> +#endif  
> >>
> > 
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-09-04  9:26       ` Boris Brezillon
@ 2023-09-04 15:22         ` Steven Price
  2023-09-04 16:16           ` Boris Brezillon
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Price @ 2023-09-04 15:22 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 04/09/2023 10:26, Boris Brezillon wrote:
> On Mon, 4 Sep 2023 08:42:08 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 01/09/2023 17:10, Boris Brezillon wrote:
>>> On Wed,  9 Aug 2023 18:53:15 +0200
>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>   
>>>> +/**
>>>> + * DOC: MMIO regions exposed to userspace.
>>>> + *
>>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>>>> + *
>>>> + * File offset for all MMIO regions being exposed to userspace. Don't use
>>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
>>>> + *
>>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>>>> + *
>>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
>>>> + * GPU cache flushing through CS instructions, but the flush reduction
>>>> + * mechanism requires a flush_id. This flush_id could be queried with an
>>>> + * ioctl, but Arm provides a well-isolated register page containing only this
>>>> + * read-only register, so let's expose this page through a static mmap offset
>>>> + * and allow direct mapping of this MMIO region so we can avoid the
>>>> + * user <-> kernel round-trip.
>>>> + */
>>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)  
>>>
>>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
>>> because vm_pgoff is limited to 32-bit there, meaning we can only map up
>>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
>>> userspace set the mmio range?  
>>
>> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
>> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
>> user space is likely to still be needed.
> 
> Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
> interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
> the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
> chance you could advise me on what to do here?
> 
> 1. assume this limitation is here for a good reason, and limit the GPU
> VA space to 32-bits on 32-bit kernels
> 
> or
> 
> 2. update the interface to make iova an u64

I'm not sure I can answer the question from a technical perspective,
hopefully Robin will be able to.

But why do we care about 32-bit kernels on a platform which is new
enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?

Given the other limitations present in a 32-bit kernel I'd be tempted to
say '1' just for simplicity. Especially since apparently we've lived
with this for panfrost which presumably has the same limitation (even
though all Bifrost/Midgard GPUs have at least 33 bits of VA space).

Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
                     ` (2 preceding siblings ...)
  2023-09-01 16:10   ` Boris Brezillon
@ 2023-09-04 16:06   ` Robin Murphy
  2023-09-06 12:18   ` Ketil Johnsen
  4 siblings, 0 replies; 93+ messages in thread
From: Robin Murphy @ 2023-09-04 16:06 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Steven Price, Clément Péron, Marty E . Plummer,
	Faith Ekstrand

On 2023-08-09 17:53, Boris Brezillon wrote:
[...]
> +/**
> + * struct drm_panthor_vm_create - Arguments passed to DRM_PANTHOR_IOCTL_VM_CREATE
> + */
> +struct drm_panthor_vm_create {
> +	/** @flags: VM flags, MBZ. */
> +	__u32 flags;
> +
> +	/** @id: Returned VM ID. */
> +	__u32 id;
> +
> +	/**
> +	 * @kernel_va_range: Size of the VA space reserved for kernel objects.
> +	 *
> +	 * If kernel_va_range is zero, we pick half of the VA space for kernel objects.
> +	 *
> +	 * Kernel VA space is always placed at the top of the supported VA range.
> +	 */
> +	__u64 kernel_va_range;

Off the back of the "IOVA as unsigned long" concern, Boris and I 
reasoned through the 64-bit vs. 32-bit vs. compat cases on IRC, and it 
seems like this kernel_va_range argument is a source of much of the pain.

Rather than have userspace specify a quantity which it shouldn't care 
about and depend on assumptions of kernel behaviour to infer the 
quantity which *is* relevant (i.e. how large the usable range of the VM 
will actually be), I think it would be considerably more logical for 
userspace to simply request the size of usable VM it actually wants. 
Then it would be straightforward and consistent to define the default 
value in terms of the minimum of half the GPU VA size or TASK_SIZE (the 
latter being the largest *meaningful* value in all 3 cases), and it's 
still easy enough for the kernel to deduce for itself whether there's a 
reasonable amount of space left between the requested limit and 
ULONG_MAX for it to use. 32-bit kernels should then get at least 1GB to 
play with, for compat the kernel BOs can get well out of the way into 
the >32-bit range, and it's only really 64-bit where userspace is liable 
to see "kernel" VA space impinging on usable process VAs. Even then 
we're not sure that's a significant concern beyond OpenCL SVM.
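In code terms, roughly (sketch; the user_va_range name is invented):

	/* Userspace asks for the usable VA size it wants; default to the
	 * minimum of half the GPU VA space and TASK_SIZE, and let the
	 * kernel keep whatever is left above for its own objects.
	 */
	u64 full_va_range = 1ull << va_bits;
	u64 user_va_range = args->user_va_range ?:
			    min_t(u64, full_va_range / 2, TASK_SIZE);
	u64 kernel_va_start = user_va_range;
	u64 kernel_va_range = full_va_range - user_va_range;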

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-09-04 15:22         ` Steven Price
@ 2023-09-04 16:16           ` Boris Brezillon
  2023-09-04 16:25             ` Robin Murphy
  0 siblings, 1 reply; 93+ messages in thread
From: Boris Brezillon @ 2023-09-04 16:16 UTC (permalink / raw)
  To: Steven Price
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On Mon, 4 Sep 2023 16:22:19 +0100
Steven Price <steven.price@arm.com> wrote:

> On 04/09/2023 10:26, Boris Brezillon wrote:
> > On Mon, 4 Sep 2023 08:42:08 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >> On 01/09/2023 17:10, Boris Brezillon wrote:  
> >>> On Wed,  9 Aug 2023 18:53:15 +0200
> >>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>>     
> >>>> +/**
> >>>> + * DOC: MMIO regions exposed to userspace.
> >>>> + *
> >>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
> >>>> + *
> >>>> + * File offset for all MMIO regions being exposed to userspace. Don't use
> >>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
> >>>> + *
> >>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
> >>>> + *
> >>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
> >>>> + * GPU cache flushing through CS instructions, but the flush reduction
> >>>> + * mechanism requires a flush_id. This flush_id could be queried with an
> >>>> + * ioctl, but Arm provides a well-isolated register page containing only this
> >>>> + * read-only register, so let's expose this page through a static mmap offset
> >>>> + * and allow direct mapping of this MMIO region so we can avoid the
> >>>> + * user <-> kernel round-trip.
> >>>> + */
> >>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)    
> >>>
> >>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
> >>> because vm_pgoff is limited to 32-bit there, meaning we can only map up
> >>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
> >>> userspace set the mmio range?    
> >>
> >> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
> >> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
> >> user space is likely to still be needed.  
> > 
> > Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
> > interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
> > the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
> > chance you could advise me on what to do here?
> > 
> > 1. assume this limitation is here for a good reason, and limit the GPU
> > VA space to 32-bits on 32-bit kernels
> > 
> > or
> > 
> > 2. update the interface to make iova an u64  
> 
> I'm not sure I can answer the question from a technical perspective,
> hopefully Robin will be able to.

Had a quick chat with Robin, and he's recommending going for #1 too.

> 
> But why do we care about 32-bit kernels on a platform which is new
> enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?

Apparently the memory you save by switching to a 32-bit kernel matters
to some people. To clarify, the CPU is aarch64, but they want to use it
in 32-bit mode.

> 
> Given the other limitations present in a 32-bit kernel I'd be tempted to
> say '1' just for simplicity. Especially since apparently we've lived
> with this for panfrost which presumably has the same limitation (even
> though all Bifrost/Midgard GPUs have at least 33 bits of VA space).

Well, Panfrost is simpler in that you don't have this kernel VA range,
and, IIRC, we are using the old format that naturally limits the GPU VA
space to 4G.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-09-04 16:16           ` Boris Brezillon
@ 2023-09-04 16:25             ` Robin Murphy
  2023-09-06 10:55               ` Steven Price
  0 siblings, 1 reply; 93+ messages in thread
From: Robin Murphy @ 2023-09-04 16:25 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Faith Ekstrand

On 2023-09-04 17:16, Boris Brezillon wrote:
> On Mon, 4 Sep 2023 16:22:19 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 04/09/2023 10:26, Boris Brezillon wrote:
>>> On Mon, 4 Sep 2023 08:42:08 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>>    
>>>> On 01/09/2023 17:10, Boris Brezillon wrote:
>>>>> On Wed,  9 Aug 2023 18:53:15 +0200
>>>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>>>      
>>>>>> +/**
>>>>>> + * DOC: MMIO regions exposed to userspace.
>>>>>> + *
>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>>>>>> + *
>>>>>> + * File offset for all MMIO regions being exposed to userspace. Don't use
>>>>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET values instead.
>>>>>> + *
>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>>>>>> + *
>>>>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
>>>>>> + * GPU cache flushing through CS instructions, but the flush reduction
>>>>>> + * mechanism requires a flush_id. This flush_id could be queried with an
>>>>>> + * ioctl, but Arm provides a well-isolated register page containing only this
>>>>>> + * read-only register, so let's expose this page through a static mmap offset
>>>>>> + * and allow direct mapping of this MMIO region so we can avoid the
>>>>>> + * user <-> kernel round-trip.
>>>>>> + */
>>>>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET		(0x1ull << 56)
>>>>>
>>>>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
>>>>> because vm_pgoff is limited to 32-bit there, meaning we can only map up
>>>>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
>>>>> userspace set the mmio range?
>>>>
>>>> Hmm, I was rather hoping we could ignore 32 bit these days ;) But while
>>>> I can't see why anyone would be running a 32 bit kernel, I guess 32 bit
>>>> user space is likely to still be needed.
>>>
>>> Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
>>> interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
>>> the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
>>> chance you could advise me on what to do here?
>>>
>>> 1. assume this limitation is here for a good reason, and limit the GPU
>>> VA space to 32-bits on 32-bit kernels
>>>
>>> or
>>>
>>> 2. update the interface to make iova an u64
>>
>> I'm not sure I can answer the question from a technical perspective,
>> hopefully Robin will be able to.
> 
> Had a quick chat with Robin, and he's recommending going for #1 too.
> 
>>
>> But why do we care about 32-bit kernels on a platform which is new
>> enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?
> 
> Apparently the memory you save by switching to a 32-bit kernel matters
> to some people. To clarify, the CPU is aarch64, but they want to use it
> in 32-bit mode.
> 
>>
>> Given the other limitations present in a 32-bit kernel I'd be tempted to
>> say '1' just for simplicity. Especially since apparently we've lived
>> with this for panfrost which presumably has the same limitation (even
>> though all Bifrost/Midgard GPUs have at least 33 bits of VA space).
> 
> Well, Panfrost is simpler in that you don't have this kernel VA range,
> and, IIRC, we are using the old format that naturally limits the GPU VA
> space to 4G.

FWIW the legacy pagetable format itself should be fine going up to 
however many bits the GPU supports, however there were various ISA 
limitations around crossing 4GB boundaries, and the easiest way to avoid 
having to think about those was to just not use more than 4GB of VA at 
all (minus chunks at the ends for similar weird ISA reasons).

Cheers,
Robin.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-09-04 16:25             ` Robin Murphy
@ 2023-09-06 10:55               ` Steven Price
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-09-06 10:55 UTC (permalink / raw)
  To: Robin Murphy, Boris Brezillon
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Clément Péron, Marty E . Plummer,
	Faith Ekstrand

On 04/09/2023 17:25, Robin Murphy wrote:
> On 2023-09-04 17:16, Boris Brezillon wrote:
>> On Mon, 4 Sep 2023 16:22:19 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> On 04/09/2023 10:26, Boris Brezillon wrote:
>>>> On Mon, 4 Sep 2023 08:42:08 +0100
>>>> Steven Price <steven.price@arm.com> wrote:
>>>>   
>>>>> On 01/09/2023 17:10, Boris Brezillon wrote:
>>>>>> On Wed,  9 Aug 2023 18:53:15 +0200
>>>>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>>>>     
>>>>>>> +/**
>>>>>>> + * DOC: MMIO regions exposed to userspace.
>>>>>>> + *
>>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
>>>>>>> + *
>>>>>>> + * File offset for all MMIO regions being exposed to userspace.
>>>>>>> Don't use
>>>>>>> + * this value directly, use DRM_PANTHOR_USER_<name>_OFFSET
>>>>>>> values instead.
>>>>>>> + *
>>>>>>> + * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
>>>>>>> + *
>>>>>>> + * File offset for the LATEST_FLUSH_ID register. The Userspace
>>>>>>> driver controls
>>>>>>> + * GPU cache flushing through CS instructions, but the flush
>>>>>>> reduction
>>>>>>> + * mechanism requires a flush_id. This flush_id could be queried
>>>>>>> with an
>>>>>>> + * ioctl, but Arm provides a well-isolated register page
>>>>>>> containing only this
>>>>>>> + * read-only register, so let's expose this page through a
>>>>>>> static mmap offset
>>>>>>> + * and allow direct mapping of this MMIO region so we can avoid the
>>>>>>> + * user <-> kernel round-trip.
>>>>>>> + */
>>>>>>> +#define DRM_PANTHOR_USER_MMIO_OFFSET        (0x1ull << 56)
>>>>>>
>>>>>> I'm playing with a 32-bit kernel/userspace, and this is problematic,
>>>>>> because vm_pgoff is limited to 32-bit there, meaning we can only
>>>>>> map up
>>>>>> to (1ull << (PAGE_SHIFT + 32)) - 1. Should we add a DEV_QUERY to let
>>>>>> userspace set the mmio range?
>>>>>
>>>>> Hmm, I was rather hoping we could ignore 32 bit these days ;) But
>>>>> while
>>>>> I can't see why anyone would be running a 32 bit kernel, I guess 32
>>>>> bit
>>>>> user space is likely to still be needed.
>>>>
>>>> Uh, I just hit a new problem with 32-bit kernels: the io-pgtable
>>>> interface (io_pgtable_ops) passes device VAs as unsigned longs, meaning
>>>> the GPU VA space is limited to 4G on a 32-bit build :-(. Robin, any
>>>> chance you could advise me on what to do here?
>>>>
>>>> 1. assume this limitation is here for a good reason, and limit the GPU
>>>> VA space to 32-bits on 32-bit kernels
>>>>
>>>> or
>>>>
>>>> 2. update the interface to make iova an u64
>>>
>>> I'm not sure I can answer the question from a technical perspective,
>>> hopefully Robin will be able to.
>>
>> Had a quick chat with Robin, and he's recommending going for #1 too.
>>
>>>
>>> But why do we care about 32-bit kernels on a platform which is new
>>> enough to have a CSF-GPU (and by extension a recent 64-bit CPU)?
>>
>> Apparently the memory you save by switching to a 32-bit kernel matters
>> to some people. To clarify, the CPU is aarch64, but they want to use it
>> in 32-bit mode.
>>
>>>
>>> Given the other limitations present in a 32-bit kernel I'd be tempted to
>>> say '1' just for simplicity. Especially since apparently we've lived
>>> with this for panfrost which presumably has the same limitation (even
>>> though all Bifrost/Midgard GPUs have at least 33 bits of VA space).
>>
>> Well, Panfrost is simpler in that you don't have this kernel VA range,
>> and, IIRC, we are using the old format that naturally limits the GPU VA
>> space to 4G.
> 
> FWIW the legacy pagetable format itself should be fine going up to
> however many bits the GPU supports, however there were various ISA
> limitations around crossing 4GB boundaries, and the easiest way to avoid
> having to think about those was to just not use more than 4GB of VA at
> all (minus chunks at the ends for similar weird ISA reasons).

Exactly right. The legacy pagetable format supports the full range of
VA. Indeed kbase used the legacy format for a long time.

However the ISA has special handling for addresses where bits 31:12 are
all 0 or all 1, so we have to avoid executable code landing on these
regions. kbase has a modified version of 'unmapped_area_topdown'[1] to
handle these additional restrictions.
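
A minimal sketch of that check, as I read the constraint above (illustrative
only, not kbase code):

#include <stdbool.h>
#include <stdint.h>

/*
 * An executable mapping must avoid GPU VAs whose bits [31:12] are all
 * zeros or all ones.  Since only bits 31:12 are inspected, the forbidden
 * windows repeat every 4 GiB of VA space.
 */
static bool exec_va_hits_reserved_region(uint64_t va)
{
	uint32_t bits_31_12 = (uint32_t)(va >> 12) & 0xfffff;

	return bits_31_12 == 0x00000 || bits_31_12 == 0xfffff;
}

For instance, VAs 0xFFFFF000 and 1ull << 32 both land in a reserved window,
while 0x80000000 does not.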

Steve

[1] kbase_unmapped_area_topdown()
https://android.googlesource.com/kernel/google-modules/gpu/+/refs/tags/android-12.0.0_r0.42/mali_kbase/thirdparty/mali_kbase_mmap.c#97

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 02/15] drm/panthor: Add uAPI
  2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
                     ` (3 preceding siblings ...)
  2023-09-04 16:06   ` Robin Murphy
@ 2023-09-06 12:18   ` Ketil Johnsen
  4 siblings, 0 replies; 93+ messages in thread
From: Ketil Johnsen @ 2023-09-06 12:18 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	Steven Price, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 8/9/23 18:53, Boris Brezillon wrote:

> +enum drm_panthor_sync_op_flags {
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK: Synchronization handle type mask. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_MASK = 0xff,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ: Synchronization object type. */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_SYNCOBJ = 0,
> +
> +	/**
> +	 * @DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ: Timeline synchronization
> +	 * object type.
> +	 */
> +	DRM_PANTHOR_SYNC_OP_HANDLE_TYPE_TIMELINE_SYNCOBJ = 1,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_WAIT: Wait operation. */
> +	DRM_PANTHOR_SYNC_OP_WAIT = 0 << 31,
> +
> +	/** @DRM_PANTHOR_SYNC_OP_SIGNAL: Signal operation. */
> +	DRM_PANTHOR_SYNC_OP_SIGNAL = 1 << 31,
> +};

We get an issue with --pedantic here:

warning: enumerator value for 'DRM_PANTHOR_SYNC_OP_SIGNAL' is not an 
integer constant expression [-Wpedantic]

It would be good to get rid of this, so user space can include this header
without disabling pedantic warnings. Either we can stop using the topmost
bit, or cast the value like "(int)(1U << 31)".

> +	/**
> +	 * @DRM_PANTHOR_VM_BIND_OP_TYPE_MASK: Mask used to determine the type of operation.
> +	 */
> +	DRM_PANTHOR_VM_BIND_OP_TYPE_MASK = 0xf << 28,

Same issue for this member. Either avoid using the topmost bit, or cast
the value like "(int)(0xfU << 28)" to avoid the pedantic warning.

--
Regards,
Ketil Johnsen

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 12/15] drm/panthor: Add the driver frontend block
  2023-08-09 16:53 ` [PATCH v2 12/15] drm/panthor: Add the driver frontend block Boris Brezillon
  2023-08-21 11:31   ` Steven Price
@ 2023-09-06 12:38   ` Ketil Johnsen
  2023-09-06 13:05     ` Boris Brezillon
  1 sibling, 1 reply; 93+ messages in thread
From: Ketil Johnsen @ 2023-09-06 12:38 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	Steven Price, Clément Péron, Marty E . Plummer,
	Robin Murphy, Faith Ekstrand

On 8/9/23 18:53, Boris Brezillon wrote:
> +static int panthor_ioctl_vm_create(struct drm_device *ddev, void *data,
> +				   struct drm_file *file)
> +{
> +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> +	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
> +	struct panthor_file *pfile = file->driver_priv;
> +	struct drm_panthor_vm_create *args = data;
> +	u64 kernel_va_start = 0;
> +	int cookie, ret;
> +
> +	if (!drm_dev_enter(ddev, &cookie))
> +		return -ENODEV;
> +
> +	if (args->flags & ~PANTHOR_VM_CREATE_FLAGS) {
> +		ret = -EINVAL;
> +		goto out_dev_exit;
> +	}
> +
> +	if (drm_WARN_ON(ddev, !va_bits) || args->kernel_va_range > (1ull << (va_bits - 1))) {
> +		ret = -EINVAL;
> +		goto out_dev_exit;
> +	}
> +
> +	if (args->kernel_va_range)
> +		kernel_va_start = (1 << (va_bits - 1)) - args->kernel_va_range;

Bug here if user space provides kernel_va_range, which is the intention 
of the current Mesa proposal.

I think the desired calculation should be something like:
kernel_va_start = (1ull << va_bits) - args->kernel_va_range;

PS: There is currently also a bug in the accompanying Mesa changes which
accidentally makes kernel_va_range always zero, thus bypassing this
kernel bug. The Mesa bug is due to va_bits always being zero because the
mmu_features field is not copied in panthor_dev_query_props().

> +	ret = panthor_vm_pool_create_vm(ptdev, pfile->vms,
> +					kernel_va_start, args->kernel_va_range);
> +	if (ret >= 0) {
> +		args->id = ret;
> +		ret = 0;
> +	}
> +
> +out_dev_exit:
> +	drm_dev_exit(cookie);
> +	return ret;
> +}

--
Regards,
Ketil Johnsen

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 12/15] drm/panthor: Add the driver frontend block
  2023-09-06 12:38   ` Ketil Johnsen
@ 2023-09-06 13:05     ` Boris Brezillon
  0 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-09-06 13:05 UTC (permalink / raw)
  To: Ketil Johnsen
  Cc: Neil Armstrong, Nicolas Boichat, Daniel Stone, Liviu Dudau,
	dri-devel, Steven Price, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On Wed, 6 Sep 2023 14:38:15 +0200
Ketil Johnsen <ketil.johnsen@arm.com> wrote:

> On 8/9/23 18:53, Boris Brezillon wrote:
> > +static int panthor_ioctl_vm_create(struct drm_device *ddev, void *data,
> > +				   struct drm_file *file)
> > +{
> > +	struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
> > +	u32 va_bits = GPU_MMU_FEATURES_VA_BITS(ptdev->gpu_info.mmu_features);
> > +	struct panthor_file *pfile = file->driver_priv;
> > +	struct drm_panthor_vm_create *args = data;
> > +	u64 kernel_va_start = 0;
> > +	int cookie, ret;
> > +
> > +	if (!drm_dev_enter(ddev, &cookie))
> > +		return -ENODEV;
> > +
> > +	if (args->flags & ~PANTHOR_VM_CREATE_FLAGS) {
> > +		ret = -EINVAL;
> > +		goto out_dev_exit;
> > +	}
> > +
> > +	if (drm_WARN_ON(ddev, !va_bits) || args->kernel_va_range > (1ull << (va_bits - 1))) {
> > +		ret = -EINVAL;
> > +		goto out_dev_exit;
> > +	}
> > +
> > +	if (args->kernel_va_range)
> > +		kernel_va_start = (1 << (va_bits - 1)) - args->kernel_va_range;  
> 
> Bug here if user space provides kernel_va_range, which is the intention 
> of the current Mesa proposal.
> 
> I think the desired calculation should be something like:
> kernel_va_start = (1ull << va_bits) - args->kernel_va_range;
> 
> PS: There is currently also a bug in the accompanying Mesa changes which
> accidentally makes kernel_va_range always zero, thus bypassing this
> kernel bug. The Mesa bug is due to va_bits always being zero because the
> mmu_features field is not copied in panthor_dev_query_props().

Yep, I noticed and fixed the problem recently, when working on 32-bit
enablement. Anyway, thanks for reporting those bugs.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-08-20  8:01     ` Krzysztof Kozlowski
@ 2023-09-20 13:41       ` Liviu Dudau
  -1 siblings, 0 replies; 93+ messages in thread
From: Liviu Dudau @ 2023-09-20 13:41 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Neil Armstrong, Conor Dooley, Nicolas Boichat, Daniel Stone,
	devicetree, dri-devel, Steven Price, Rob Herring,
	Boris Brezillon, Clément Péron, Krzysztof Kozlowski,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

Hi Krzysztof,

Thanks for taking the time to review this patch. I'm about to update it
to address your comments and I need some clarifications from you.

On Sun, Aug 20, 2023 at 10:01:25AM +0200, Krzysztof Kozlowski wrote:
> On 09/08/2023 18:53, Boris Brezillon wrote:
> > From: Liviu Dudau <liviu.dudau@arm.com>
> > 
> > Arm has introduced a new v10 GPU architecture that replaces the Job Manager
> > interface with a new Command Stream Frontend. It adds firmware driven
> > command stream queues that can be used by kernel and user space to submit
> > jobs to the GPU.
> > 
> > Add the initial schema for the device tree that is based on support for
> > RK3588 SoC. The minimum number of clocks is one for the IP, but on Rockchip
> > platforms they will tend to expose the semi-independent clocks for better
> > power management.
> 
> A nit, subject: drop second/last, redundant "bindings for". The
> "dt-bindings" prefix is already stating that these are bindings.
> 
> Also drop "driver" form the subject. Bindings are for hardware, not drivers.

Not exactly true as the 'compatible' string is for the driver, but I understand
your point.

> 
> > 
> > v2:
> > - New commit
> > 
> > Signed-off-by: Liviu Dudau <liviu.dudau@arm.com>
> 
> SoB chain is incomplete.
> 
> > Cc: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
> > Cc: Rob Herring <robh+dt@kernel.org>
> > Cc: Conor Dooley <conor+dt@kernel.org>
> > Cc: devicetree@vger.kernel.org
> > ---
> >  .../bindings/gpu/arm,mali-valhall-csf.yaml    | 148 ++++++++++++++++++
> >  1 file changed, 148 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
> > 
> > diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
> > new file mode 100644
> > index 000000000000..2b9f77aa0b7a
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
> > @@ -0,0 +1,148 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/gpu/arm,mali-valhall-csf.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: ARM Mali Valhall GPU
> > +
> > +maintainers:
> > +  - Liviu Dudau <liviu.dudau@arm.com>
> > +  - Boris Brezillon <boris.brezillon@collabora.com>
> > +
> > +properties:
> > +  $nodename:
> > +    pattern: '^gpu@[a-f0-9]+$'
> > +
> > +  compatible:
> > +    oneOf:
> 
> Drop oneOf.

The idea was to allow for future compatible strings to be added later, but
I guess we can re-introduce the oneOf entry later. Will remove it.

> 
> > +      - items:
> > +          - enum:
> > +              - rockchip,rk3588-mali
> > +          - const: arm,mali-valhall-csf   # Mali Valhall GPU model/revision is fully discoverable
> > +
> > +  reg:
> > +    maxItems: 1
> > +
> > +  interrupts:
> > +    items:
> > +      - description: Job interrupt
> > +      - description: MMU interrupt
> > +      - description: GPU interrupt
> > +
> > +  interrupt-names:
> > +    items:
> > +      - const: job
> > +      - const: mmu
> > +      - const: gpu
> > +
> > +  clocks:
> > +    minItems: 1
> > +    maxItems: 3
> > +
> > +  clock-names:
> > +    minItems: 1
> > +    items:
> > +      - const: core
> > +      - const: coregroup
> > +      - const: stacks
> > +
> > +  mali-supply: true
> > +
> > +  sram-supply: true
> > +
> > +  operating-points-v2: true
> 
> Missing opp-table.

This is the main topic I want to clarify. See further down for the main comment,
but I would like to understand what you are asking here. To copy the schema
from bindings/opp/opp-v2.yaml and bindings/opp/opp-v2-base.yaml?

> 
> > +
> > +  power-domains:
> > +    minItems: 1
> > +    maxItems: 5
> > +
> > +  power-domain-names:
> > +    minItems: 1
> > +    maxItems: 5
> > +
> > +  "#cooling-cells":
> > +    const: 2
> > +
> > +  dynamic-power-coefficient:
> > +    $ref: /schemas/types.yaml#/definitions/uint32
> > +    description:
> > +      A u32 value that represents the running time dynamic
> > +      power coefficient in units of uW/MHz/V^2. The
> > +      coefficient can either be calculated from power
> > +      measurements or derived by analysis.
> > +
> > +      The dynamic power consumption of the GPU is
> > +      proportional to the square of the Voltage (V) and
> > +      the clock frequency (f). The coefficient is used to
> > +      calculate the dynamic power as below -
> > +
> > +      Pdyn = dynamic-power-coefficient * V^2 * f
> > +
> > +      where voltage is in V, frequency is in MHz.
> > +
> > +  dma-coherent: true
> > +
> > +required:
> > +  - compatible
> > +  - reg
> > +  - interrupts
> > +  - interrupt-names
> > +  - clocks
> > +  - mali-supply
> > +
> > +additionalProperties: false
> > +
> > +allOf:
> > +  - if:
> > +      properties:
> > +        compatible:
> > +          contains:
> > +            const: rockchip,rk3588-mali
> > +    then:
> > +      properties:
> > +        clocks:
> > +          minItems: 3
> > +        clock-names:
> > +          items:
> > +            - const: core
> > +            - const: coregroup
> > +            - const: stacks
> 
> This duplicates top-level. Just minItems: 3.

Will remove the duplicated names.

> 
> Please describe also power domains - constrains and names.

I'm not sure the power domains and how to handle them have been
entirely settled for Rockchip, hence why they were not included. Will
check with Collabora to see if they have anything to add here, but
for non-Rockchip platforms (like Juno with FPGAs) the constraints
are going to be different.

> 
> > +
> > +examples:
> > +  - |
> > +    #include <dt-bindings/clock/rockchip,rk3588-cru.h>
> > +    #include <dt-bindings/interrupt-controller/irq.h>
> > +    #include <dt-bindings/interrupt-controller/arm-gic.h>
> > +    #include <dt-bindings/power/rk3588-power.h>
> > +
> > +    gpu: gpu@fb000000 {
> > +        compatible = "rockchip,rk3588-mali", "arm,mali-valhall-csf";
> > +        reg = <0xfb000000 0x200000>;
> > +        interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH 0>,
> > +                     <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH 0>,
> > +                     <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH 0>;
> > +        interrupt-names = "job", "mmu", "gpu";
> > +        clock-names = "core", "coregroup", "stacks";
> > +        clocks = <&cru CLK_GPU>, <&cru CLK_GPU_COREGROUP>,
> > +                 <&cru CLK_GPU_STACKS>;
> > +        power-domains = <&power RK3588_PD_GPU>;
> > +        operating-points-v2 = <&gpu_opp_table>;
> > +        mali-supply = <&vdd_gpu_s0>;
> > +        sram-supply = <&vdd_gpu_mem_s0>;
> > +        status = "disabled";
> 
> Drop status.

Will do.

> 
> > +    };
> > +
> > +    gpu_opp_table: opp-table {
> 
> Opp table should be inside the device node.

I cannot find any device tree that supports your suggested usage. Most (all?) of
the device trees that I can find have the opp table as a separate node from
the gpu and make use of the 'operating-points-v2 = <&opp_node_name>' reference
in the board fragment. To me that makes more sense as different boards can have
different operating points and is no reason to make them sub-nodes of the gpu.

> 
> > +        compatible = "operating-points-v2";
> > +        opp-300000000 {
> > +            opp-hz = /bits/ 64 <300000000>;
> > +            opp-microvolt = <675000 675000 850000>;
> > +        };
> 
> Best regards,
> Krzysztof
> 

Thanks again for your review!

Best regards,
Liviu

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-09-20 13:41       ` Liviu Dudau
@ 2023-09-20 13:51         ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 93+ messages in thread
From: Krzysztof Kozlowski @ 2023-09-20 13:51 UTC (permalink / raw)
  To: Liviu Dudau
  Cc: Boris Brezillon, dri-devel, Conor Dooley, Nicolas Boichat,
	Daniel Stone, Krzysztof Kozlowski, Neil Armstrong, Steven Price,
	devicetree, Rob Herring, Clément Péron,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On 20/09/2023 15:41, Liviu Dudau wrote:
>>> +properties:
>>> +  $nodename:
>>> +    pattern: '^gpu@[a-f0-9]+$'
>>> +
>>> +  compatible:
>>> +    oneOf:
>>
>> Drop oneOf.
> 
> The idea was to allow for future compatible strings to be added later, but
> I guess we can re-introduce the oneOf entry later. Will remove it.

If you already predict that new list will be added (so new fallback
compatible!), then it's fine.

> 
>>
>>> +      - items:
>>> +          - enum:
>>> +              - rockchip,rk3588-mali
>>> +          - const: arm,mali-valhall-csf   # Mali Valhall GPU model/revision is fully discoverable
>>> +
>>> +  reg:
>>> +    maxItems: 1
>>> +
>>> +  interrupts:
>>> +    items:
>>> +      - description: Job interrupt
>>> +      - description: MMU interrupt
>>> +      - description: GPU interrupt
>>> +
>>> +  interrupt-names:
>>> +    items:
>>> +      - const: job
>>> +      - const: mmu
>>> +      - const: gpu
>>> +
>>> +  clocks:
>>> +    minItems: 1
>>> +    maxItems: 3
>>> +
>>> +  clock-names:
>>> +    minItems: 1
>>> +    items:
>>> +      - const: core
>>> +      - const: coregroup
>>> +      - const: stacks
>>> +
>>> +  mali-supply: true
>>> +
>>> +  sram-supply: true
>>> +
>>> +  operating-points-v2: true
>>
>> Missing opp-table.
> 
> This is the main topic I want to clarify. See further down for the main comment,
> but I would like to understand what you are asking here. To copy the schema
> from bindings/opp/opp-v2.yaml and bindings/opp/opp-v2-base.yaml?

No, "opp-table" property.
git grep "opp-table:"

> 
>>
>>> +
>>> +  power-domains:
>>> +    minItems: 1
>>> +    maxItems: 5
>>> +
>>> +  power-domain-names:
>>> +    minItems: 1
>>> +    maxItems: 5
>>> +
>>> +  "#cooling-cells":
>>> +    const: 2
>>> +
>>> +  dynamic-power-coefficient:
>>> +    $ref: /schemas/types.yaml#/definitions/uint32
>>> +    description:
>>> +      A u32 value that represents the running time dynamic
>>> +      power coefficient in units of uW/MHz/V^2. The
>>> +      coefficient can either be calculated from power
>>> +      measurements or derived by analysis.
>>> +
>>> +      The dynamic power consumption of the GPU is
>>> +      proportional to the square of the Voltage (V) and
>>> +      the clock frequency (f). The coefficient is used to
>>> +      calculate the dynamic power as below -
>>> +
>>> +      Pdyn = dynamic-power-coefficient * V^2 * f
>>> +
>>> +      where voltage is in V, frequency is in MHz.
>>> +
>>> +  dma-coherent: true
>>> +
>>> +required:
>>> +  - compatible
>>> +  - reg
>>> +  - interrupts
>>> +  - interrupt-names
>>> +  - clocks
>>> +  - mali-supply
>>> +
>>> +additionalProperties: false
>>> +
>>> +allOf:
>>> +  - if:
>>> +      properties:
>>> +        compatible:
>>> +          contains:
>>> +            const: rockchip,rk3588-mali
>>> +    then:
>>> +      properties:
>>> +        clocks:
>>> +          minItems: 3
>>> +        clock-names:
>>> +          items:
>>> +            - const: core
>>> +            - const: coregroup
>>> +            - const: stacks
>>
>> This duplicates top-level. Just minItems: 3.
> 
> Will remove the duplicated names.
> 
>>
>> Please describe also power domains - constrains and names.
> 
> I'm not sure the power domains and how to handle them have been
> entirely settled for Rockchip, hence why they were not included. Will
> check with Collabora to see if they have anything to add here, but
> for non-Rockchip platforms (like Juno with FPGAs) the constraints
> are going to be different.
> 
>>
>>> +
>>> +examples:
>>> +  - |
>>> +    #include <dt-bindings/clock/rockchip,rk3588-cru.h>
>>> +    #include <dt-bindings/interrupt-controller/irq.h>
>>> +    #include <dt-bindings/interrupt-controller/arm-gic.h>
>>> +    #include <dt-bindings/power/rk3588-power.h>
>>> +
>>> +    gpu: gpu@fb000000 {
>>> +        compatible = "rockchip,rk3588-mali", "arm,mali-valhall-csf";
>>> +        reg = <0xfb000000 0x200000>;
>>> +        interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH 0>,
>>> +                     <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH 0>,
>>> +                     <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH 0>;
>>> +        interrupt-names = "job", "mmu", "gpu";
>>> +        clock-names = "core", "coregroup", "stacks";
>>> +        clocks = <&cru CLK_GPU>, <&cru CLK_GPU_COREGROUP>,
>>> +                 <&cru CLK_GPU_STACKS>;
>>> +        power-domains = <&power RK3588_PD_GPU>;
>>> +        operating-points-v2 = <&gpu_opp_table>;
>>> +        mali-supply = <&vdd_gpu_s0>;
>>> +        sram-supply = <&vdd_gpu_mem_s0>;
>>> +        status = "disabled";
>>
>> Drop status.
> 
> Will do.
> 
>>
>>> +    };
>>> +
>>> +    gpu_opp_table: opp-table {
>>
>> Opp table should be inside the device node.
> 
> I cannot find any device tree that supports your suggested usage. Most (all?) of

Really? All Qcom have it embedded.

> the device trees that I can find have the opp table as a separate node from
> the gpu and make use of the 'operating-points-v2 = <&opp_node_name>' reference

operating-points-v2 is needed anyway, I am not suggesting to drop it.

> in the board fragment. To me that makes more sense as different boards can have
> different operating points and is no reason to make them sub-nodes of the gpu.

How boards do it is independent. They can keep it inside, outside,
override it, etc.

For the majority of simple cases, the OPPs come from the SoC, thus they are
in the DTSI.

Best regards,
Krzysztof


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-09-20 13:41       ` Liviu Dudau
@ 2023-09-20 13:56         ` Boris Brezillon
  -1 siblings, 0 replies; 93+ messages in thread
From: Boris Brezillon @ 2023-09-20 13:56 UTC (permalink / raw)
  To: Liviu Dudau
  Cc: Neil Armstrong, Conor Dooley, Nicolas Boichat, Daniel Stone,
	devicetree, dri-devel, Steven Price, Krzysztof Kozlowski,
	Rob Herring, Clément Péron, Krzysztof Kozlowski,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On Wed, 20 Sep 2023 14:41:05 +0100
Liviu Dudau <Liviu.Dudau@arm.com> wrote:

> > 
> > Please describe also power domains - constrains and names.  
> 
> I'm not sure the power domains and how to handle them have been
> entirely settled for Rockchip, hence why they were not included. Will
> check with Collabora to see if they have anything to add here,

On rk3588, it's just one power domain feeding all GPU blocks.

> but
> for non-Rockchip platforms (like Juno with FPGAs) the constraints
> are going to be different.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-09-20 13:56         ` Boris Brezillon
  (?)
@ 2023-09-20 14:03         ` Liviu Dudau
  -1 siblings, 0 replies; 93+ messages in thread
From: Liviu Dudau @ 2023-09-20 14:03 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Neil Armstrong, Conor Dooley, Nicolas Boichat, Daniel Stone,
	devicetree, dri-devel, Steven Price, Krzysztof Kozlowski,
	Rob Herring, Clément Péron, Krzysztof Kozlowski,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On Wed, Sep 20, 2023 at 03:56:24PM +0200, Boris Brezillon wrote:
> On Wed, 20 Sep 2023 14:41:05 +0100
> Liviu Dudau <Liviu.Dudau@arm.com> wrote:
> 
> > > 
> > > Please describe also power domains - constrains and names.  
> > 
> > I'm not sure the power domains and how to handle them have been
> > entirely settled for Rockchip, hence why they were not included. Will
> > check with Collabora to see if they have anything to add here,
> 
> On rk3588, it's just one power domain feeding all GPU blocks.

Not sure if this has been missed, but earlier in the file there is an
entry for power-domains and power-domain-names, but with only a minimum
of requirements and no mandate on the actual names.

Best regards,
Liviu

> 
> > but
> > for non-Rockchip platforms (like Juno with FPGAs) the constraints
> > are going to be different.

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-09-20 13:51         ` Krzysztof Kozlowski
@ 2023-09-20 14:25           ` Liviu Dudau
  -1 siblings, 0 replies; 93+ messages in thread
From: Liviu Dudau @ 2023-09-20 14:25 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Neil Armstrong, Conor Dooley, Nicolas Boichat, Daniel Stone,
	devicetree, dri-devel, Steven Price, Rob Herring,
	Boris Brezillon, Clément Péron, Krzysztof Kozlowski,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On Wed, Sep 20, 2023 at 03:51:36PM +0200, Krzysztof Kozlowski wrote:
> On 20/09/2023 15:41, Liviu Dudau wrote:
> >>> +properties:
> >>> +  $nodename:
> >>> +    pattern: '^gpu@[a-f0-9]+$'
> >>> +
> >>> +  compatible:
> >>> +    oneOf:
> >>
> >> Drop oneOf.
> > 
> > The idea was to allow for future compatible strings to be added later, but
> > I guess we can re-introduce the oneOf entry later. Will remove it.
> 
> If you already predict that new list will be added (so new fallback
> compatible!), then it's fine.
> 
> > 
> >>
> >>> +      - items:
> >>> +          - enum:
> >>> +              - rockchip,rk3588-mali
> >>> +          - const: arm,mali-valhall-csf   # Mali Valhall GPU model/revision is fully discoverable
> >>> +
> >>> +  reg:
> >>> +    maxItems: 1
> >>> +
> >>> +  interrupts:
> >>> +    items:
> >>> +      - description: Job interrupt
> >>> +      - description: MMU interrupt
> >>> +      - description: GPU interrupt
> >>> +
> >>> +  interrupt-names:
> >>> +    items:
> >>> +      - const: job
> >>> +      - const: mmu
> >>> +      - const: gpu
> >>> +
> >>> +  clocks:
> >>> +    minItems: 1
> >>> +    maxItems: 3
> >>> +
> >>> +  clock-names:
> >>> +    minItems: 1
> >>> +    items:
> >>> +      - const: core
> >>> +      - const: coregroup
> >>> +      - const: stacks
> >>> +
> >>> +  mali-supply: true
> >>> +
> >>> +  sram-supply: true
> >>> +
> >>> +  operating-points-v2: true
> >>
> >> Missing opp-table.
> > 
> > This is the main topic I want to clarify. See further down for the main comment,
> > but I would like to understand what you are asking here. To copy the schema
> > from bindings/opp/opp-v2.yaml and bindings/opp/opp-v2-base.yaml?
> 
> No, "opp-table" property.
> git grep "opp-table:"

You mean adding

     opp-table:
       type: object

as property? What's the difference between opp-table: true (like in
'display/msm/dp-controller.yaml') and 'opp-table: type: object' like in other
places that I can find? Does that mean you need to have an opp-table node somewhere
but it doesn't have to be inside the gpu node? 'arm,mali-midgard.yaml', which was
my original source of inspiration, has an 'opp-table: type: object' in the properties,
but the example still shows a separate node for the table.

> 
> > 
> >>
> >>> +
> >>> +  power-domains:
> >>> +    minItems: 1
> >>> +    maxItems: 5
> >>> +
> >>> +  power-domain-names:
> >>> +    minItems: 1
> >>> +    maxItems: 5
> >>> +
> >>> +  "#cooling-cells":
> >>> +    const: 2
> >>> +
> >>> +  dynamic-power-coefficient:
> >>> +    $ref: /schemas/types.yaml#/definitions/uint32
> >>> +    description:
> >>> +      A u32 value that represents the running time dynamic
> >>> +      power coefficient in units of uW/MHz/V^2. The
> >>> +      coefficient can either be calculated from power
> >>> +      measurements or derived by analysis.
> >>> +
> >>> +      The dynamic power consumption of the GPU is
> >>> +      proportional to the square of the Voltage (V) and
> >>> +      the clock frequency (f). The coefficient is used to
> >>> +      calculate the dynamic power as below -
> >>> +
> >>> +      Pdyn = dynamic-power-coefficient * V^2 * f
> >>> +
> >>> +      where voltage is in V, frequency is in MHz.
> >>> +
> >>> +  dma-coherent: true
> >>> +
> >>> +required:
> >>> +  - compatible
> >>> +  - reg
> >>> +  - interrupts
> >>> +  - interrupt-names
> >>> +  - clocks
> >>> +  - mali-supply
> >>> +
> >>> +additionalProperties: false
> >>> +
> >>> +allOf:
> >>> +  - if:
> >>> +      properties:
> >>> +        compatible:
> >>> +          contains:
> >>> +            const: rockchip,rk3588-mali
> >>> +    then:
> >>> +      properties:
> >>> +        clocks:
> >>> +          minItems: 3
> >>> +        clock-names:
> >>> +          items:
> >>> +            - const: core
> >>> +            - const: coregroup
> >>> +            - const: stacks
> >>
> >> This duplicates top-level. Just minItems: 3.
> > 
> > Will remove the duplicated names.
> > 
> >>
> >> Please describe also power domains - constrains and names.
> > 
> > I'm not sure the power domains and how to handle them have been
> > entirely settled for Rockchip, hence why they were not included. Will
> > check with Collabora to see if they have anything to add here, but
> > for non-Rockchip platforms (like Juno with FPGAs) the constraints
> > are going to be different.
> > 
> >>
> >>> +
> >>> +examples:
> >>> +  - |
> >>> +    #include <dt-bindings/clock/rockchip,rk3588-cru.h>
> >>> +    #include <dt-bindings/interrupt-controller/irq.h>
> >>> +    #include <dt-bindings/interrupt-controller/arm-gic.h>
> >>> +    #include <dt-bindings/power/rk3588-power.h>
> >>> +
> >>> +    gpu: gpu@fb000000 {
> >>> +        compatible = "rockchip,rk3588-mali", "arm,mali-valhall-csf";
> >>> +        reg = <0xfb000000 0x200000>;
> >>> +        interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH 0>,
> >>> +                     <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH 0>,
> >>> +                     <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH 0>;
> >>> +        interrupt-names = "job", "mmu", "gpu";
> >>> +        clock-names = "core", "coregroup", "stacks";
> >>> +        clocks = <&cru CLK_GPU>, <&cru CLK_GPU_COREGROUP>,
> >>> +                 <&cru CLK_GPU_STACKS>;
> >>> +        power-domains = <&power RK3588_PD_GPU>;
> >>> +        operating-points-v2 = <&gpu_opp_table>;
> >>> +        mali-supply = <&vdd_gpu_s0>;
> >>> +        sram-supply = <&vdd_gpu_mem_s0>;
> >>> +        status = "disabled";
> >>
> >> Drop status.
> > 
> > Will do.
> > 
> >>
> >>> +    };
> >>> +
> >>> +    gpu_opp_table: opp-table {
> >>
> >> Opp table should be inside the device node.
> > 
> > I cannot find any device tree that supports your suggested usage. Most (all?) of
> 
> Really? All Qcom have it embedded.

The arm,mali-* ones seem to have them outside the gpu node. See "arm,mali-midgard.yaml"

Best regards,
Liviu

> 
> > the device trees that I can find have the opp table as a separate node from
> > the gpu and make use of the 'operating-points-v2 = <&opp_node_name>' reference
> 
> operating-points-v2 is needed anyway, I am not suggesting to drop it.
> 
> > in the board fragment. To me that makes more sense as different boards can have
> > different operating points and is no reason to make them sub-nodes of the gpu.
> 
> How boards do it is independent. They can keep it inside, outside,
> override it, etc.
> 
> For the majority of simple cases, the OPPs come from the SoC, thus they are
> in the DTSI.
> 
> Best regards,
> Krzysztof
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver
  2023-09-20 14:25           ` Liviu Dudau
@ 2023-09-20 15:31             ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 93+ messages in thread
From: Krzysztof Kozlowski @ 2023-09-20 15:31 UTC (permalink / raw)
  To: Liviu Dudau
  Cc: Neil Armstrong, Conor Dooley, Nicolas Boichat, Daniel Stone,
	devicetree, dri-devel, Steven Price, Rob Herring,
	Clément Péron, Krzysztof Kozlowski, Boris Brezillon,
	Marty E . Plummer, Robin Murphy, Faith Ekstrand

On 20/09/2023 16:25, Liviu Dudau wrote:
> On Wed, Sep 20, 2023 at 03:51:36PM +0200, Krzysztof Kozlowski wrote:
>> On 20/09/2023 15:41, Liviu Dudau wrote:
>>>>> +properties:
>>>>> +  $nodename:
>>>>> +    pattern: '^gpu@[a-f0-9]+$'
>>>>> +
>>>>> +  compatible:
>>>>> +    oneOf:
>>>>
>>>> Drop oneOf.
>>>
>>> The idea was to allow for future compatible strings to be added later, but
>>> I guess we can re-introduce the oneOf entry later. Will remove it.
>>
>> If you already predict that new list will be added (so new fallback
>> compatible!), then it's fine.
>>
>>>
>>>>
>>>>> +      - items:
>>>>> +          - enum:
>>>>> +              - rockchip,rk3588-mali
>>>>> +          - const: arm,mali-valhall-csf   # Mali Valhall GPU model/revision is fully discoverable
>>>>> +
>>>>> +  reg:
>>>>> +    maxItems: 1
>>>>> +
>>>>> +  interrupts:
>>>>> +    items:
>>>>> +      - description: Job interrupt
>>>>> +      - description: MMU interrupt
>>>>> +      - description: GPU interrupt
>>>>> +
>>>>> +  interrupt-names:
>>>>> +    items:
>>>>> +      - const: job
>>>>> +      - const: mmu
>>>>> +      - const: gpu
>>>>> +
>>>>> +  clocks:
>>>>> +    minItems: 1
>>>>> +    maxItems: 3
>>>>> +
>>>>> +  clock-names:
>>>>> +    minItems: 1
>>>>> +    items:
>>>>> +      - const: core
>>>>> +      - const: coregroup
>>>>> +      - const: stacks
>>>>> +
>>>>> +  mali-supply: true
>>>>> +
>>>>> +  sram-supply: true
>>>>> +
>>>>> +  operating-points-v2: true
>>>>
>>>> Missing opp-table.
>>>
>>> This is the main topic I want to clarify. See further down for the main comment,
>>> but I would like to understand what you are asking here. To copy the schema
>>> from bindings/opp/opp-v2.yaml and bindings/opp/opp-v2-base.yaml?
>>
>> No, "opp-table" property.
>> git grep "opp-table:"
> 
> You mean adding
> 
>      opp-table:
>        type: object
> 
> as property? What's the difference between opp-table: true (like in
> 'display/msm/dp-controller.yaml') and 'opp-table: type: object' like in other

There is no opp-table: true. Nowhere.

...

>>>
>>>>
>>>>> +    };
>>>>> +
>>>>> +    gpu_opp_table: opp-table {
>>>>
>>>> Opp table should be inside the device node.
>>>
>>> I cannot find any device tree that supports your suggested usage. Most (all?) of
>>
>> Really? All Qcom have it embedded.
> 
> The arm,mali-* ones seem to have them outside the gpu node. See "arm,mali-midgard.yaml"

You said you cannot find any, so I pointed out that's not true.

Best regards,
Krzysztof


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs
  2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
                   ` (15 preceding siblings ...)
  2023-08-09 20:22 ` [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Rob Herring
@ 2023-09-27 15:47 ` Steven Price
  16 siblings, 0 replies; 93+ messages in thread
From: Steven Price @ 2023-09-27 15:47 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Nicolas Boichat, Daniel Stone, Neil Armstrong, Liviu Dudau,
	Clément Péron, Marty E . Plummer, Robin Murphy,
	Faith Ekstrand

On 09/08/2023 17:53, Boris Brezillon wrote:
> Hello,
> 
> This is the second version of the kernel driver meant to support new Mali
> GPUs which are delegating the scheduling to a firmware.

[...]

> I tried to Cc anyone that was involved in any development of the code
> I picked from panfrost, so they can acknowledge the GPL2 -> MIT+GPL2
> change. If I missed someone, please let me know.

Regarding the relicensing, Arm agrees to the relicensing of the
parts they hold copyright on.

Acked-by: Steven Price <steven.price@arm.com>

Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2023-09-27 15:47 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-09 16:53 [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t Boris Brezillon
2023-08-11 13:08   ` Steven Price
2023-08-19  2:13     ` Dmitry Osipenko
2023-08-28  9:03       ` Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 02/15] drm/panthor: Add uAPI Boris Brezillon
2023-08-11 14:13   ` Steven Price
2023-09-01 13:59   ` Liviu Dudau
2023-09-01 16:10   ` Boris Brezillon
2023-09-04  7:42     ` Steven Price
2023-09-04  8:26       ` Boris Brezillon
2023-09-04  9:26       ` Boris Brezillon
2023-09-04 15:22         ` Steven Price
2023-09-04 16:16           ` Boris Brezillon
2023-09-04 16:25             ` Robin Murphy
2023-09-06 10:55               ` Steven Price
2023-09-04 16:06   ` Robin Murphy
2023-09-06 12:18   ` Ketil Johnsen
2023-08-09 16:53 ` [PATCH v2 03/15] drm/panthor: Add GPU register definitions Boris Brezillon
2023-08-11 14:13   ` Steven Price
2023-08-29 13:00     ` Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 04/15] drm/panthor: Add the device logical block Boris Brezillon
2023-08-11 15:47   ` Steven Price
2023-08-29 14:00     ` Boris Brezillon
2023-08-30 13:17       ` Steven Price
2023-08-30 14:06         ` Boris Brezillon
2023-09-04 11:46         ` Liviu Dudau
2023-08-09 16:53 ` [PATCH v2 05/15] drm/panthor: Add the GPU " Boris Brezillon
2023-08-14 10:54   ` Steven Price
2023-08-21 16:09     ` Robin Murphy
2023-08-23  8:48       ` Steven Price
2023-08-29 14:42       ` Boris Brezillon
2023-08-29 14:40     ` Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 06/15] drm/panthor: Add GEM " Boris Brezillon
2023-08-14 13:40   ` Steven Price
2023-08-29 14:45     ` Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 07/15] drm/panthor: Add the devfreq " Boris Brezillon
2023-08-14 13:45   ` Steven Price
2023-08-09 16:53 ` [PATCH v2 08/15] drm/panthor: Add the MMU/VM " Boris Brezillon
2023-08-14 15:53   ` Steven Price
2023-08-29 15:33     ` Boris Brezillon
2023-08-30 14:12       ` Steven Price
2023-08-30 14:53         ` Boris Brezillon
2023-08-30 15:55           ` Steven Price
2023-08-09 16:53 ` [PATCH v2 09/15] drm/panthor: Add the FW " Boris Brezillon
2023-08-16 16:01   ` Steven Price
2023-08-29 16:15     ` Boris Brezillon
2023-08-30 15:20       ` Steven Price
2023-08-09 16:53 ` [PATCH v2 10/15] drm/panthor: Add the heap " Boris Brezillon
2023-08-18 14:39   ` Steven Price
2023-08-29 16:21     ` Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 11/15] drm/panthor: Add the scheduler " Boris Brezillon
2023-08-18 15:38   ` Steven Price
2023-08-29 16:36     ` Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 12/15] drm/panthor: Add the driver frontend block Boris Brezillon
2023-08-21 11:31   ` Steven Price
2023-08-29 17:46     ` Boris Brezillon
2023-08-31 14:42       ` Steven Price
2023-09-06 12:38   ` Ketil Johnsen
2023-09-06 13:05     ` Boris Brezillon
2023-08-09 16:53 ` [PATCH v2 13/15] drm/panthor: Allow driver compilation Boris Brezillon
2023-08-11 16:35   ` Robin Murphy
2023-08-11 16:56     ` Daniel Stone
2023-08-11 19:26       ` Robin Murphy
2023-08-14 11:18         ` Steven Price
2023-08-21 17:56           ` Robin Murphy
2023-08-23  9:17             ` Steven Price
2023-08-29 12:51             ` Boris Brezillon
2023-08-21 12:47   ` Steven Price
2023-08-09 16:53 ` [PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver Boris Brezillon
2023-08-09 16:53   ` Boris Brezillon
2023-08-20  8:01   ` Krzysztof Kozlowski
2023-08-20  8:01     ` Krzysztof Kozlowski
2023-09-20 13:41     ` Liviu Dudau
2023-09-20 13:41       ` Liviu Dudau
2023-09-20 13:51       ` Krzysztof Kozlowski
2023-09-20 13:51         ` Krzysztof Kozlowski
2023-09-20 14:25         ` Liviu Dudau
2023-09-20 14:25           ` Liviu Dudau
2023-09-20 15:31           ` Krzysztof Kozlowski
2023-09-20 15:31             ` Krzysztof Kozlowski
2023-09-20 13:56       ` Boris Brezillon
2023-09-20 13:56         ` Boris Brezillon
2023-09-20 14:03         ` Liviu Dudau
2023-08-09 16:53 ` [PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS Boris Brezillon
2023-08-11 16:08   ` Steven Price
2023-08-29 17:48     ` Boris Brezillon
2023-08-31 13:18   ` Liviu Dudau
2023-08-31 13:25     ` Boris Brezillon
2023-08-09 20:22 ` [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs Rob Herring
2023-08-10 15:44   ` Boris Brezillon
2023-08-21 14:01     ` Rob Herring
2023-09-27 15:47 ` Steven Price
