[RFC PATCH v3 0/4] drm: Standardize device reset notification

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v3 0/4] drm: Standardize device reset notification
@ 2023-06-21  0:57 André Almeida
  2023-06-21  0:57 ` [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations André Almeida
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: André Almeida @ 2023-06-21  0:57 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, christian.koenig,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen, André Almeida

Hi,

This is a new version of the documentation for DRM device resets. As I dived
more in the subject, I started to believe that part of the problem was the lack
of a DRM API to get reset information from the driver. With an API, we can
better standardize reset queries, increase common code from both DRM and Mesa,
and make easier to write end-to-end tests.

So this patchset, along with the documentation, comes with a new IOCTL and two
implementations of it for amdgpu and i915 (although just the former was really
tested). This IOCTL uses the "context id" to query reset information, but this
might be not generic enough to be included in a DRM API.  At least for amdgpu,
this information is encapsulated by libdrm so one can't just call the ioctl
directly from the UMD as I was planning to, but a small refactor can be done to
expose the id. Anyway, I'm sharing it as it is to gather feedback if this seems
to work.

The amdgpu and i915 implementations are provided as a mean of testing and as
exemplification, and not as reference code yet, as the goal is more about the
interface itself then the driver parts.

For the documentation itself, after spending some time reading the reset path in
the kernel in Mesa, I decide to rewrite it to better reflect how it works, from
bottom to top.

You can check the userspace side of the IOCLT here:
 Mesa: https://gitlab.freedesktop.org/andrealmeid/mesa/-/commit/cd687b22fb32c21b23596c607003e2a495f465
 libdrm: https://gitlab.freedesktop.org/andrealmeid/libdrm/-/commit/b31e5404893ee9a85d1aa67e81c2f58c1dac3c46

For testing, I use this vulkan app that has an infinity loop in the shader:
https://github.com/andrealmeid/vulkan-triangle-v1

Feedbacks are welcomed!

Thanks,
		André

v2: https://lore.kernel.org/all/20230227204000.56787-1-andrealmeid@igalia.com/
v1: https://lore.kernel.org/all/20230123202646.356592-1-andrealmeid@igalia.com/

André Almeida (4):
  drm/doc: Document DRM device reset expectations
  drm: Create DRM_IOCTL_GET_RESET
  drm/amdgpu: Implement DRM_IOCTL_GET_RESET
  drm/i915: Implement DRM_IOCTL_GET_RESET

 Documentation/gpu/drm-uapi.rst                | 51 ++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 35 +++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h       |  5 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 12 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |  2 +
 drivers/gpu/drm/drm_debugfs.c                 |  2 +
 drivers/gpu/drm/drm_ioctl.c                   | 58 +++++++++++++++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 18 ++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.h   |  2 +
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  2 +
 drivers/gpu/drm/i915/i915_driver.c            |  2 +
 include/drm/drm_device.h                      |  3 +
 include/drm/drm_drv.h                         |  3 +
 include/uapi/drm/drm.h                        | 21 +++++++
 include/uapi/drm/drm_mode.h                   | 15 +++++
 17 files changed, 233 insertions(+), 3 deletions(-)

-- 
2.41.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations
  2023-06-21  0:57 [RFC PATCH v3 0/4] drm: Standardize device reset notification André Almeida
@ 2023-06-21  0:57 ` André Almeida
  2023-06-21  7:58   ` Pekka Paalanen
  2023-06-21  0:57 ` [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET André Almeida
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21  0:57 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, christian.koenig,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen, André Almeida

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..da4f8a694d8d 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
 mmapped regular files. Threads cause additional pain with signal
 handling as well.
 
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. To recover
+from this kind of state, sometimes is needed to reset the device. This section
+describes what's the expectations for DRM and usermode drivers when a device
+resets and how to propagate the reset status.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hung is detected when a job gets stuck executing. KMD
+then update it's internal reset tracking to be ready when userspace asks the
+kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
+for that.
+
+User Mode Driver
+----------------
+
+The UMD should check before submitting new commands to the KMD if the device has
+been reset, and this can be checked more often if it requires to. The
+DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
+detecting a reset, UMD will then proceed to report it to the application using
+the appropriated API error code, as explained in the bellow section about
+robustness.
+
+Robustness
+----------
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that is using.
+
+Graphical APIs provide ways to application to deal with device resets. However,
+there's no guarantee that the app will be correctly using such features, and UMD
+can implement policies to close the app if it's a repeating offender, likely in
+a broken loop. This is done to ensure that it doesn't keeps blocking the user
+interface to be correctly displayed.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
+tells if a reset has happened, and if so, all the context state is considered
+lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
+will terminate the app when a reset is detected, giving that the contexts are
+lost and the app won't be able to figure this out and recreate the contexts.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting resets causes
+-----------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET
  2023-06-21  0:57 [RFC PATCH v3 0/4] drm: Standardize device reset notification André Almeida
  2023-06-21  0:57 ` [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations André Almeida
@ 2023-06-21  0:57 ` André Almeida
  2023-06-21  8:09   ` Pekka Paalanen
  2023-06-21  0:57 ` [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET André Almeida
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21  0:57 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, christian.koenig,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen, André Almeida

Create a new DRM ioctl operation to get the numbers of resets for a
given context. The numbers reflect just the resets that happened after
the context was created, and not since the machine was booted.

Create a debugfs interface to make easier to test the API without real
resets.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/drm_debugfs.c |  2 ++
 drivers/gpu/drm/drm_ioctl.c   | 58 +++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h      |  3 ++
 include/drm/drm_drv.h         |  3 ++
 include/uapi/drm/drm.h        | 21 +++++++++++++
 include/uapi/drm/drm_mode.h   | 15 +++++++++
 6 files changed, 102 insertions(+)

diff --git a/drivers/gpu/drm/drm_debugfs.c b/drivers/gpu/drm/drm_debugfs.c
index 4855230ba2c6..316dce60434d 100644
--- a/drivers/gpu/drm/drm_debugfs.c
+++ b/drivers/gpu/drm/drm_debugfs.c
@@ -251,6 +251,8 @@ int drm_debugfs_init(struct drm_minor *minor, int minor_id,
 		list_del(&entry->list);
 	}
 
+	debugfs_create_bool("drm_reset_spoof", 0644, minor->debugfs_root, &dev->reset_spoof);
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index 7c9d66ee917d..23c282681ec7 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -528,6 +528,63 @@ int drm_version(struct drm_device *dev, void *data,
 	return err;
 }
 
+/**
+ * drm_spoof_reset - Spoof a fake reset
+ *
+ * @reset: reset struct to be spoofed
+ *
+ * Create a fake reset report for testing
+ */
+static void drm_spoof_reset(struct drm_get_reset *reset)
+{
+	reset->dev_reset_count = 1;
+	reset->ctx_reset_count = 0;
+	reset->flags = 0;
+	reset->ctx_id = 0;
+
+	DRM_INFO("[Spoofed] Reporting reset.ctx = %llu .dev = %llu\n",
+	       reset->ctx_reset_count, reset->dev_reset_count);
+}
+
+/**
+ * drm_getreset - Get reset information from a DRM device
+ *
+ * @dev DRM device
+ * @data user argument, pointing to a drm_get_reset structure
+ * @filp file pointer
+ *
+ * Return zero on success or negative number on failure.
+ *
+ * Fills in the reset information in data arg.
+ */
+int drm_getreset(struct drm_device *dev, void *data,
+			struct drm_file *file_priv)
+{
+	struct drm_get_reset *reset = data;
+	int ret = 0;
+
+	if (dev->reset_spoof) {
+		drm_spoof_reset(reset);
+		return 0;
+	}
+
+	if (!dev->driver->get_reset)
+		return -ENOSYS;
+
+	if (reset->flags)
+		return -EINVAL;
+
+	ret = dev->driver->get_reset(file_priv, dev, reset);
+
+	if (!ret)
+		DRM_INFO("Reporting reset.ctx = %llu .dev = %llu\n",
+		       reset->ctx_reset_count, reset->dev_reset_count);
+	else
+		DRM_WARN("%s failed with %d return\n", __func__, ret);
+
+	return ret;
+}
+
 static int drm_ioctl_permit(u32 flags, struct drm_file *file_priv)
 {
 	/* ROOT_ONLY is only for CAP_SYS_ADMIN */
@@ -716,6 +773,7 @@ static const struct drm_ioctl_desc drm_ioctls[] = {
 	DRM_IOCTL_DEF(DRM_IOCTL_MODE_LIST_LESSEES, drm_mode_list_lessees_ioctl, DRM_MASTER),
 	DRM_IOCTL_DEF(DRM_IOCTL_MODE_GET_LEASE, drm_mode_get_lease_ioctl, DRM_MASTER),
 	DRM_IOCTL_DEF(DRM_IOCTL_MODE_REVOKE_LEASE, drm_mode_revoke_lease_ioctl, DRM_MASTER),
+	DRM_IOCTL_DEF(DRM_IOCTL_GET_RESET, drm_getreset, DRM_RENDER_ALLOW),
 };
 
 #define DRM_CORE_IOCTL_COUNT	ARRAY_SIZE(drm_ioctls)
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 7cf4afae2e79..fcd7b5d45cde 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -326,6 +326,9 @@ struct drm_device {
 	 */
 	struct list_head debugfs_list;
 
+	/* Spoof device reset for testing */
+	bool reset_spoof;
+
 	/* Everything below here is for legacy driver, never use! */
 	/* private: */
 #if IS_ENABLED(CONFIG_DRM_LEGACY)
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 89e2706cac56..518a9db157fb 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -401,6 +401,9 @@ struct drm_driver {
 			       struct drm_device *dev, uint32_t handle,
 			       uint64_t *offset);
 
+	int (*get_reset)(struct drm_file *file_priv,
+			 struct drm_device *dev, struct drm_get_reset *reset);
+
 	/**
 	 * @show_fdinfo:
 	 *
diff --git a/include/uapi/drm/drm.h b/include/uapi/drm/drm.h
index a87bbbbca2d4..a84559aa0d77 100644
--- a/include/uapi/drm/drm.h
+++ b/include/uapi/drm/drm.h
@@ -1169,6 +1169,27 @@ extern "C" {
  */
 #define DRM_IOCTL_MODE_GETFB2		DRM_IOWR(0xCE, struct drm_mode_fb_cmd2)
 
+/**
+ * DRM_IOCTL_GET_RESET - Get information about device resets
+ *
+ * This operation requests from the device information about resets. It should
+ * consider only resets that happens after the context is created, therefore,
+ * the counter should be zero during context creation.
+ *
+ * dev_reset_count tells how many resets have happened on this device, and
+ * ctx_reset_count tells how many of such resets were caused by this context.
+ *
+ * Flags can be used to tell if a reset is in progress, and userspace should
+ * wait until it's not in progress anymore to be able to create a new context;
+ * and to tell if the VRAM is considered lost. There's no safe way to clean this
+ * flag so if a context see this flag set, it should be like that until the end
+ * of the context.
+ */
+#define DRM_IOCTL_GET_RESET		DRM_IOWR(0xCF, struct drm_get_reset)
+
+#define DRM_RESET_IN_PROGRESS	0x1
+#define DRM_RESET_VRAM_LOST	0x2
+
 /*
  * Device specific ioctls should only be in their respective headers
  * The device specific ioctl range is from 0x40 to 0x9f.
diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
index 43691058d28f..c3257bd1af9c 100644
--- a/include/uapi/drm/drm_mode.h
+++ b/include/uapi/drm/drm_mode.h
@@ -1308,6 +1308,21 @@ struct drm_mode_rect {
 	__s32 y2;
 };
 
+/**
+ * struct drm_get_reset - Get information about a DRM device resets
+ * @ctx_id: the context id to be queried about resets
+ * @flags: flags
+ * @dev_reset_count: global counter of resets for a given DRM device
+ * @ctx_reset_count: of all the resets counted by this device, how many were
+ * caused by this context.
+ */
+struct drm_get_reset {
+	__u32 ctx_id;
+	__u32 flags;
+	__u64 dev_reset_count;
+	__u64 ctx_reset_count;
+};
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET
  2023-06-21  0:57 [RFC PATCH v3 0/4] drm: Standardize device reset notification André Almeida
  2023-06-21  0:57 ` [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations André Almeida
  2023-06-21  0:57 ` [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET André Almeida
@ 2023-06-21  0:57 ` André Almeida
  2023-06-21  7:40   ` Christian König
  2023-06-21  0:57 ` [RFC PATCH v3 4/4] drm/i915: " André Almeida
  2023-06-21  7:42 ` [RFC PATCH v3 0/4] drm: Standardize device reset notification Christian König
  4 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21  0:57 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, christian.koenig,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen, André Almeida

Implement get_reset ioctl for amdgpu

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  4 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 35 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  5 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  2 ++
 6 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2eb2c66843a8..0ba26b4b039c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1262,8 +1262,10 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	uint64_t seq;
 	int r;
 
-	for (i = 0; i < p->gang_size; ++i)
+	for (i = 0; i < p->gang_size; ++i) {
+		p->jobs[i]->ctx = p->ctx;
 		drm_sched_job_arm(&p->jobs[i]->base);
+	}
 
 	for (i = 0; i < p->gang_size; ++i) {
 		struct dma_fence *fence;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index d2139ac12159..d3e292382d4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -322,6 +322,9 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
 	ctx->init_priority = priority;
 	ctx->override_priority = AMDGPU_CTX_PRIORITY_UNSET;
 
+	ctx->global_reset_counter = atomic_read(&mgr->adev->gpu_reset_counter);
+	ctx->local_reset_counter = 0;
+
 	r = amdgpu_ctx_get_stable_pstate(ctx, &current_stable_pstate);
 	if (r)
 		return r;
@@ -963,3 +966,35 @@ void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr,
 	}
 	mutex_unlock(&mgr->lock);
 }
+
+int amdgpu_get_reset(struct drm_file *filp, struct drm_device *dev,
+		     struct drm_get_reset *reset)
+{
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	struct amdgpu_ctx *ctx;
+	struct amdgpu_ctx_mgr *mgr;
+	unsigned int id = reset->ctx_id;
+	struct amdgpu_fpriv *fpriv = filp->driver_priv;
+
+	mgr = &fpriv->ctx_mgr;
+	mutex_lock(&mgr->lock);
+	ctx = idr_find(&mgr->ctx_handles, id);
+	if (!ctx) {
+		mutex_unlock(&mgr->lock);
+		return -EINVAL;
+	}
+
+	reset->dev_reset_count =
+		atomic_read(&adev->gpu_reset_counter) - ctx->global_reset_counter;
+
+	reset->ctx_reset_count = ctx->local_reset_counter;
+
+	if (amdgpu_in_reset(adev))
+		reset->flags |= DRM_RESET_IN_PROGRESS;
+
+	if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
+		reset->flags |= DRM_RESET_VRAM_LOST;
+
+	mutex_unlock(&mgr->lock);
+	return 0;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
index 0fa0e56daf67..0c9815695884 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
@@ -57,6 +57,9 @@ struct amdgpu_ctx {
 	unsigned long			ras_counter_ce;
 	unsigned long			ras_counter_ue;
 	uint32_t			stable_pstate;
+
+	uint64_t			global_reset_counter;
+	uint64_t			local_reset_counter;
 };
 
 struct amdgpu_ctx_mgr {
@@ -97,4 +100,6 @@ void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr);
 void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr,
 			  ktime_t usage[AMDGPU_HW_IP_NUM]);
 
+int amdgpu_get_reset(struct drm_file *file_priv, struct drm_device *dev,
+		     struct drm_get_reset *reset);
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index c9a41c997c6c..431791b2c3cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2805,6 +2805,7 @@ static const struct drm_driver amdgpu_kms_driver = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = amdgpu_show_fdinfo,
 #endif
+	.get_reset = amdgpu_get_reset,
 
 	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
 	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index c3d9d75143f4..1553a2633d46 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -35,11 +35,20 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
+	struct drm_sched_entity *entity = job->base.entity;
 	struct amdgpu_task_info ti;
 	struct amdgpu_device *adev = ring->adev;
 	int idx;
 	int r;
 
+	memset(&ti, 0, sizeof(struct amdgpu_task_info));
+	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
+
+	if (job->ctx) {
+		DRM_INFO("Increasing ctx reset count for %s (%d)\n", ti.process_name, ti.pid);
+		job->ctx->local_reset_counter++;
+	}
+
 	if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
 		DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
 			 __func__, s_job->sched->name);
@@ -48,7 +57,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		return DRM_GPU_SCHED_STAT_ENODEV;
 	}
 
-	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 	adev->job_hang = true;
 
 	if (amdgpu_gpu_recovery &&
@@ -58,7 +66,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		goto exit;
 	}
 
-	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
 	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
 		  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
 		  ring->fence_drv.sync_seq);
@@ -105,6 +112,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	 */
 	(*job)->base.sched = &adev->rings[0]->sched;
 	(*job)->vm = vm;
+	(*job)->ctx = NULL;
 
 	amdgpu_sync_create(&(*job)->explicit_sync);
 	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index 52f2e313ea17..0d463babaa60 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -63,6 +63,8 @@ struct amdgpu_job {
 	uint32_t		oa_base, oa_size;
 	uint32_t		vram_lost_counter;
 
+	struct amdgpu_ctx	*ctx;
+
 	/* user fence handling */
 	uint64_t		uf_addr;
 	uint64_t		uf_sequence;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v3 4/4] drm/i915: Implement DRM_IOCTL_GET_RESET
  2023-06-21  0:57 [RFC PATCH v3 0/4] drm: Standardize device reset notification André Almeida
                   ` (2 preceding siblings ...)
  2023-06-21  0:57 ` [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET André Almeida
@ 2023-06-21  0:57 ` André Almeida
  2023-06-21  7:38   ` Jani Nikula
  2023-06-21  7:42 ` [RFC PATCH v3 0/4] drm: Standardize device reset notification Christian König
  4 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21  0:57 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, christian.koenig,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Pekka Paalanen,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen, André Almeida

Implement get_reset ioctl for i915.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c    | 18 ++++++++++++++++++
 drivers/gpu/drm/i915/gem/i915_gem_context.h    |  2 ++
 .../gpu/drm/i915/gem/i915_gem_context_types.h  |  2 ++
 drivers/gpu/drm/i915/i915_driver.c             |  2 ++
 4 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 9a9ff84c90d7..fba8c9bbc7e9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -1666,6 +1666,8 @@ i915_gem_create_context(struct drm_i915_private *i915,
 		ctx->uses_protected_content = true;
 	}
 
+	ctx->dev_reset_counter = i915_reset_count(&i915->gpu_error);
+
 	trace_i915_context_create(ctx);
 
 	return ctx;
@@ -2558,6 +2560,22 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	return 0;
 }
 
+int i915_gem_get_reset(struct drm_file *filp, struct drm_device *dev,
+		       struct drm_get_reset *reset)
+{
+	struct i915_gem_context *ctx;
+
+	ctx = i915_gem_context_lookup(file->driver_priv, reset->ctx_id);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	reset->dev_reset_count = i915_reset_count(&i915->gpu_error) - ctx->dev_reset_count;
+	reset->ctx_reset_count = ctx->guilty_count;
+
+	i915_gem_context_put(ctx);
+	return 0;
+}
+
 /* GEM context-engines iterator: for_each_gem_engine() */
 struct intel_context *
 i915_gem_engines_iter_next(struct i915_gem_engines_iter *it)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h b/drivers/gpu/drm/i915/gem/i915_gem_context.h
index e5b0f66ea1fe..9ee119d8123f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
@@ -138,6 +138,8 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file_priv);
 int i915_gem_context_reset_stats_ioctl(struct drm_device *dev, void *data,
 				       struct drm_file *file);
+int i915_gem_get_reset(struct drm_file *file_priv, struct drm_device *dev,
+		       struct drm_get_reset *reset);
 
 struct i915_gem_context *
 i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index cb78214a7dcd..2e4cf0f0d3dc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -414,6 +414,8 @@ struct i915_gem_context {
 		/** @engines: list of stale engines */
 		struct list_head engines;
 	} stale;
+
+	uint64_t dev_reset_counter;
 };
 
 #endif /* __I915_GEM_CONTEXT_TYPES_H__ */
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index 97244541ec28..640304141ada 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1805,6 +1805,8 @@ static const struct drm_driver i915_drm_driver = {
 	.postclose = i915_driver_postclose,
 	.show_fdinfo = i915_drm_client_fdinfo,
 
+	.get_reset = i915_gem_get_reset,
+
 	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
 	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
 	.gem_prime_import = i915_gem_prime_import,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 4/4] drm/i915: Implement DRM_IOCTL_GET_RESET
  2023-06-21  0:57 ` [RFC PATCH v3 4/4] drm/i915: " André Almeida
@ 2023-06-21  7:38   ` Jani Nikula
  0 siblings, 0 replies; 20+ messages in thread
From: Jani Nikula @ 2023-06-21  7:38 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Michel Dänzer, Timur Kristóf, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, alexander.deucher, christian.koenig

On Tue, 20 Jun 2023, André Almeida <andrealmeid@igalia.com> wrote:
> Implement get_reset ioctl for i915.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c    | 18 ++++++++++++++++++
>  drivers/gpu/drm/i915/gem/i915_gem_context.h    |  2 ++
>  .../gpu/drm/i915/gem/i915_gem_context_types.h  |  2 ++
>  drivers/gpu/drm/i915/i915_driver.c             |  2 ++
>  4 files changed, 24 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 9a9ff84c90d7..fba8c9bbc7e9 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -1666,6 +1666,8 @@ i915_gem_create_context(struct drm_i915_private *i915,
>  		ctx->uses_protected_content = true;
>  	}
>  
> +	ctx->dev_reset_counter = i915_reset_count(&i915->gpu_error);
> +
>  	trace_i915_context_create(ctx);
>  
>  	return ctx;
> @@ -2558,6 +2560,22 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
>  	return 0;
>  }
>  
> +int i915_gem_get_reset(struct drm_file *filp, struct drm_device *dev,
> +		       struct drm_get_reset *reset)
> +{
> +	struct i915_gem_context *ctx;
> +
> +	ctx = i915_gem_context_lookup(file->driver_priv, reset->ctx_id);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +
> +	reset->dev_reset_count = i915_reset_count(&i915->gpu_error) - ctx->dev_reset_count;
> +	reset->ctx_reset_count = ctx->guilty_count;
> +
> +	i915_gem_context_put(ctx);

Usually return is preceded by a blank line.

> +	return 0;
> +}
> +
>  /* GEM context-engines iterator: for_each_gem_engine() */
>  struct intel_context *
>  i915_gem_engines_iter_next(struct i915_gem_engines_iter *it)
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h b/drivers/gpu/drm/i915/gem/i915_gem_context.h
> index e5b0f66ea1fe..9ee119d8123f 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
> @@ -138,6 +138,8 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  				    struct drm_file *file_priv);
>  int i915_gem_context_reset_stats_ioctl(struct drm_device *dev, void *data,
>  				       struct drm_file *file);
> +int i915_gem_get_reset(struct drm_file *file_priv, struct drm_device *dev,
> +		       struct drm_get_reset *reset);
>  
>  struct i915_gem_context *
>  i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id);
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> index cb78214a7dcd..2e4cf0f0d3dc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
> @@ -414,6 +414,8 @@ struct i915_gem_context {
>  		/** @engines: list of stale engines */
>  		struct list_head engines;
>  	} stale;
> +
> +	uint64_t dev_reset_counter;

u64 please. i915 only uses the kernel fixed types.

Please do wait for review on the actual content before reposting.

BR,
Jani.

>  };
>  
>  #endif /* __I915_GEM_CONTEXT_TYPES_H__ */
> diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
> index 97244541ec28..640304141ada 100644
> --- a/drivers/gpu/drm/i915/i915_driver.c
> +++ b/drivers/gpu/drm/i915/i915_driver.c
> @@ -1805,6 +1805,8 @@ static const struct drm_driver i915_drm_driver = {
>  	.postclose = i915_driver_postclose,
>  	.show_fdinfo = i915_drm_client_fdinfo,
>  
> +	.get_reset = i915_gem_get_reset,
> +
>  	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
>  	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
>  	.gem_prime_import = i915_gem_prime_import,

-- 
Jani Nikula, Intel Open Source Graphics Center

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET
  2023-06-21  0:57 ` [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET André Almeida
@ 2023-06-21  7:40   ` Christian König
  2023-06-21 16:38     ` André Almeida
  0 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2023-06-21  7:40 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, pierre-eric.pelloux-prayer,
	Simon Ser, Rob Clark, Pekka Paalanen, Daniel Vetter,
	Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Am 21.06.23 um 02:57 schrieb André Almeida:
> Implement get_reset ioctl for amdgpu

Well that pretty much won't work since the jobs are destroyed much later 
than the contexts.

Christian.

>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  4 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 35 +++++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  5 ++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  2 ++
>   6 files changed, 56 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 2eb2c66843a8..0ba26b4b039c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1262,8 +1262,10 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>   	uint64_t seq;
>   	int r;
>   
> -	for (i = 0; i < p->gang_size; ++i)
> +	for (i = 0; i < p->gang_size; ++i) {
> +		p->jobs[i]->ctx = p->ctx;
>   		drm_sched_job_arm(&p->jobs[i]->base);
> +	}
>   
>   	for (i = 0; i < p->gang_size; ++i) {
>   		struct dma_fence *fence;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index d2139ac12159..d3e292382d4a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -322,6 +322,9 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
>   	ctx->init_priority = priority;
>   	ctx->override_priority = AMDGPU_CTX_PRIORITY_UNSET;
>   
> +	ctx->global_reset_counter = atomic_read(&mgr->adev->gpu_reset_counter);
> +	ctx->local_reset_counter = 0;
> +
>   	r = amdgpu_ctx_get_stable_pstate(ctx, &current_stable_pstate);
>   	if (r)
>   		return r;
> @@ -963,3 +966,35 @@ void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr,
>   	}
>   	mutex_unlock(&mgr->lock);
>   }
> +
> +int amdgpu_get_reset(struct drm_file *filp, struct drm_device *dev,
> +		     struct drm_get_reset *reset)
> +{
> +	struct amdgpu_device *adev = drm_to_adev(dev);
> +	struct amdgpu_ctx *ctx;
> +	struct amdgpu_ctx_mgr *mgr;
> +	unsigned int id = reset->ctx_id;
> +	struct amdgpu_fpriv *fpriv = filp->driver_priv;
> +
> +	mgr = &fpriv->ctx_mgr;
> +	mutex_lock(&mgr->lock);
> +	ctx = idr_find(&mgr->ctx_handles, id);
> +	if (!ctx) {
> +		mutex_unlock(&mgr->lock);
> +		return -EINVAL;
> +	}
> +
> +	reset->dev_reset_count =
> +		atomic_read(&adev->gpu_reset_counter) - ctx->global_reset_counter;
> +
> +	reset->ctx_reset_count = ctx->local_reset_counter;
> +
> +	if (amdgpu_in_reset(adev))
> +		reset->flags |= DRM_RESET_IN_PROGRESS;
> +
> +	if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
> +		reset->flags |= DRM_RESET_VRAM_LOST;
> +
> +	mutex_unlock(&mgr->lock);
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
> index 0fa0e56daf67..0c9815695884 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
> @@ -57,6 +57,9 @@ struct amdgpu_ctx {
>   	unsigned long			ras_counter_ce;
>   	unsigned long			ras_counter_ue;
>   	uint32_t			stable_pstate;
> +
> +	uint64_t			global_reset_counter;
> +	uint64_t			local_reset_counter;
>   };
>   
>   struct amdgpu_ctx_mgr {
> @@ -97,4 +100,6 @@ void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr);
>   void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr,
>   			  ktime_t usage[AMDGPU_HW_IP_NUM]);
>   
> +int amdgpu_get_reset(struct drm_file *file_priv, struct drm_device *dev,
> +		     struct drm_get_reset *reset);
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index c9a41c997c6c..431791b2c3cb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2805,6 +2805,7 @@ static const struct drm_driver amdgpu_kms_driver = {
>   #ifdef CONFIG_PROC_FS
>   	.show_fdinfo = amdgpu_show_fdinfo,
>   #endif
> +	.get_reset = amdgpu_get_reset,
>   
>   	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
>   	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index c3d9d75143f4..1553a2633d46 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -35,11 +35,20 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   {
>   	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>   	struct amdgpu_job *job = to_amdgpu_job(s_job);
> +	struct drm_sched_entity *entity = job->base.entity;
>   	struct amdgpu_task_info ti;
>   	struct amdgpu_device *adev = ring->adev;
>   	int idx;
>   	int r;
>   
> +	memset(&ti, 0, sizeof(struct amdgpu_task_info));
> +	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
> +
> +	if (job->ctx) {
> +		DRM_INFO("Increasing ctx reset count for %s (%d)\n", ti.process_name, ti.pid);
> +		job->ctx->local_reset_counter++;
> +	}
> +
>   	if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
>   		DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
>   			 __func__, s_job->sched->name);
> @@ -48,7 +57,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		return DRM_GPU_SCHED_STAT_ENODEV;
>   	}
>   
> -	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>   	adev->job_hang = true;
>   
>   	if (amdgpu_gpu_recovery &&
> @@ -58,7 +66,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		goto exit;
>   	}
>   
> -	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>   	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>   		  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>   		  ring->fence_drv.sync_seq);
> @@ -105,6 +112,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>   	 */
>   	(*job)->base.sched = &adev->rings[0]->sched;
>   	(*job)->vm = vm;
> +	(*job)->ctx = NULL;
>   
>   	amdgpu_sync_create(&(*job)->explicit_sync);
>   	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index 52f2e313ea17..0d463babaa60 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -63,6 +63,8 @@ struct amdgpu_job {
>   	uint32_t		oa_base, oa_size;
>   	uint32_t		vram_lost_counter;
>   
> +	struct amdgpu_ctx	*ctx;
> +
>   	/* user fence handling */
>   	uint64_t		uf_addr;
>   	uint64_t		uf_sequence;


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 0/4] drm: Standardize device reset notification
  2023-06-21  0:57 [RFC PATCH v3 0/4] drm: Standardize device reset notification André Almeida
                   ` (3 preceding siblings ...)
  2023-06-21  0:57 ` [RFC PATCH v3 4/4] drm/i915: " André Almeida
@ 2023-06-21  7:42 ` Christian König
  2023-06-21 15:06   ` André Almeida
  4 siblings, 1 reply; 20+ messages in thread
From: Christian König @ 2023-06-21  7:42 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, pierre-eric.pelloux-prayer,
	Simon Ser, Rob Clark, Pekka Paalanen, Daniel Vetter,
	Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Am 21.06.23 um 02:57 schrieb André Almeida:
> Hi,
>
> This is a new version of the documentation for DRM device resets. As I dived
> more in the subject, I started to believe that part of the problem was the lack
> of a DRM API to get reset information from the driver. With an API, we can
> better standardize reset queries, increase common code from both DRM and Mesa,
> and make easier to write end-to-end tests.
>
> So this patchset, along with the documentation, comes with a new IOCTL and two
> implementations of it for amdgpu and i915 (although just the former was really
> tested). This IOCTL uses the "context id" to query reset information, but this
> might be not generic enough to be included in a DRM API.

Well the basic problem with that is that we don't have a standard DRM 
context defined.

If you want to do this you should probably start there first.

Apart from that this looks like a really really good idea to me, 
especially that we document the reset expectations.

Regards,
Christian.

>    At least for amdgpu,
> this information is encapsulated by libdrm so one can't just call the ioctl
> directly from the UMD as I was planning to, but a small refactor can be done to
> expose the id. Anyway, I'm sharing it as it is to gather feedback if this seems
> to work.
>
> The amdgpu and i915 implementations are provided as a mean of testing and as
> exemplification, and not as reference code yet, as the goal is more about the
> interface itself then the driver parts.
>
> For the documentation itself, after spending some time reading the reset path in
> the kernel in Mesa, I decide to rewrite it to better reflect how it works, from
> bottom to top.
>
> You can check the userspace side of the IOCLT here:
>   Mesa: https://gitlab.freedesktop.org/andrealmeid/mesa/-/commit/cd687b22fb32c21b23596c607003e2a495f465
>   libdrm: https://gitlab.freedesktop.org/andrealmeid/libdrm/-/commit/b31e5404893ee9a85d1aa67e81c2f58c1dac3c46
>
> For testing, I use this vulkan app that has an infinity loop in the shader:
> https://github.com/andrealmeid/vulkan-triangle-v1
>
> Feedbacks are welcomed!
>
> Thanks,
> 		André
>
> v2: https://lore.kernel.org/all/20230227204000.56787-1-andrealmeid@igalia.com/
> v1: https://lore.kernel.org/all/20230123202646.356592-1-andrealmeid@igalia.com/
>
> André Almeida (4):
>    drm/doc: Document DRM device reset expectations
>    drm: Create DRM_IOCTL_GET_RESET
>    drm/amdgpu: Implement DRM_IOCTL_GET_RESET
>    drm/i915: Implement DRM_IOCTL_GET_RESET
>
>   Documentation/gpu/drm-uapi.rst                | 51 ++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  4 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 35 +++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h       |  5 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 12 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |  2 +
>   drivers/gpu/drm/drm_debugfs.c                 |  2 +
>   drivers/gpu/drm/drm_ioctl.c                   | 58 +++++++++++++++++++
>   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 18 ++++++
>   drivers/gpu/drm/i915/gem/i915_gem_context.h   |  2 +
>   .../gpu/drm/i915/gem/i915_gem_context_types.h |  2 +
>   drivers/gpu/drm/i915/i915_driver.c            |  2 +
>   include/drm/drm_device.h                      |  3 +
>   include/drm/drm_drv.h                         |  3 +
>   include/uapi/drm/drm.h                        | 21 +++++++
>   include/uapi/drm/drm_mode.h                   | 15 +++++
>   17 files changed, 233 insertions(+), 3 deletions(-)
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations
  2023-06-21  0:57 ` [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations André Almeida
@ 2023-06-21  7:58   ` Pekka Paalanen
  2023-06-21 16:28     ` André Almeida
  0 siblings, 1 reply; 20+ messages in thread
From: Pekka Paalanen @ 2023-06-21  7:58 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

[-- Attachment #1: Type: text/plain, Size: 5277 bytes --]

On Tue, 20 Jun 2023 21:57:16 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>

Hi André,

nice to see this! I ended up giving lots of grammar comments, but I'm
not a native speaker. Generally it looks good to me.

> ---
>  Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..da4f8a694d8d 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. To recover
> +from this kind of state, sometimes is needed to reset the device. This section

It seems unclear what "this kind of state" refers to, so maybe just write "errors"?

Maybe:

	Some errors require resetting the device in order to make the
	device usable again.

I presume that recovery does not mean that the failed job could recover.

> +describes what's the expectations for DRM and usermode drivers when a device
> +resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD

s/hung/hang/ ?

> +then update it's internal reset tracking to be ready when userspace asks the

updates its

"update reset tracking"... do you mean that KMD records information
about the reset in case userspace asks for it later?

> +kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
> +for that.

At this point, I'm not sure what "reset tracking" or "reset
information" entails. Could something be said about those?

> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if it requires to. The
> +DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriated API error code, as explained in the bellow section about

s/bellow/below/

> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that is using.

that it is using.

> +
> +Graphical APIs provide ways to application to deal with device resets. However,

provide ways for applications to deal with

> +there's no guarantee that the app will be correctly using such features, and UMD
> +can implement policies to close the app if it's a repeating offender, likely in
> +a broken loop. This is done to ensure that it doesn't keeps blocking the user

does not keep

I think contractions are usually avoided in documents, but I'm not
bothering to flag them all.

> +interface to be correctly displayed.

interface from being correctly displayed.

> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
> +tells if a reset has happened, and if so, all the context state is considered
> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
> +will terminate the app when a reset is detected, giving that the contexts are
> +lost and the app won't be able to figure this out and recreate the contexts.

What about GL ES? Is GL_ARB_robustness implemented or even defined there?

What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
be enough from the app? The documented expectation is: "The application
must destroy all contexts and reinitialise OpenGL ES state and objects
to continue rendering."

> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting resets causes
> +-----------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>  
>  IOCTL Support on Device Nodes

What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
reset wipe that buffer? How would that be communicated?

The dmabuf may have originated in another process.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET
  2023-06-21  0:57 ` [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET André Almeida
@ 2023-06-21  8:09   ` Pekka Paalanen
  2023-06-21 16:33     ` André Almeida
  0 siblings, 1 reply; 20+ messages in thread
From: Pekka Paalanen @ 2023-06-21  8:09 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

[-- Attachment #1: Type: text/plain, Size: 3718 bytes --]

On Tue, 20 Jun 2023 21:57:17 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Create a new DRM ioctl operation to get the numbers of resets for a
> given context. The numbers reflect just the resets that happened after
> the context was created, and not since the machine was booted.
> 
> Create a debugfs interface to make easier to test the API without real
> resets.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  drivers/gpu/drm/drm_debugfs.c |  2 ++
>  drivers/gpu/drm/drm_ioctl.c   | 58 +++++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h      |  3 ++
>  include/drm/drm_drv.h         |  3 ++
>  include/uapi/drm/drm.h        | 21 +++++++++++++
>  include/uapi/drm/drm_mode.h   | 15 +++++++++
>  6 files changed, 102 insertions(+)

...

> diff --git a/include/uapi/drm/drm.h b/include/uapi/drm/drm.h
> index a87bbbbca2d4..a84559aa0d77 100644
> --- a/include/uapi/drm/drm.h
> +++ b/include/uapi/drm/drm.h
> @@ -1169,6 +1169,27 @@ extern "C" {
>   */
>  #define DRM_IOCTL_MODE_GETFB2		DRM_IOWR(0xCE, struct drm_mode_fb_cmd2)
>  
> +/**
> + * DRM_IOCTL_GET_RESET - Get information about device resets
> + *
> + * This operation requests from the device information about resets. It should
> + * consider only resets that happens after the context is created, therefore,
> + * the counter should be zero during context creation.
> + *
> + * dev_reset_count tells how many resets have happened on this device, and
> + * ctx_reset_count tells how many of such resets were caused by this context.
> + *
> + * Flags can be used to tell if a reset is in progress, and userspace should
> + * wait until it's not in progress anymore to be able to create a new context;
> + * and to tell if the VRAM is considered lost. There's no safe way to clean this
> + * flag so if a context see this flag set, it should be like that until the end
> + * of the context.

Is "this flag" the VRAM_LOST? Or any flag?

Does this mean that not all resets are fatal to the context? Is there
any kind of reset that should not be fatal to a context? All the
rendering APIs seem to assume that any reset is fatal and the context
must be destroyed.

> + */
> +#define DRM_IOCTL_GET_RESET		DRM_IOWR(0xCF, struct drm_get_reset)
> +
> +#define DRM_RESET_IN_PROGRESS	0x1
> +#define DRM_RESET_VRAM_LOST	0x2

Ok, so the dmabuf lost is being communicated here, but how would a
userspace process know on which device a dmabuf resides on?

Let's assume process A uses device 1 to draw, exports a dmabuf, sends
it to process B which imports it to device 2. Device 1 resets and loses
VRAM contents. How would process B notice that the dmabuf is lost when
it never touches device 1 itself?

> +
>  /*
>   * Device specific ioctls should only be in their respective headers
>   * The device specific ioctl range is from 0x40 to 0x9f.
> diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
> index 43691058d28f..c3257bd1af9c 100644
> --- a/include/uapi/drm/drm_mode.h
> +++ b/include/uapi/drm/drm_mode.h
> @@ -1308,6 +1308,21 @@ struct drm_mode_rect {
>  	__s32 y2;
>  };
>  
> +/**
> + * struct drm_get_reset - Get information about a DRM device resets
> + * @ctx_id: the context id to be queried about resets
> + * @flags: flags
> + * @dev_reset_count: global counter of resets for a given DRM device
> + * @ctx_reset_count: of all the resets counted by this device, how many were
> + * caused by this context.
> + */
> +struct drm_get_reset {
> +	__u32 ctx_id;
> +	__u32 flags;
> +	__u64 dev_reset_count;
> +	__u64 ctx_reset_count;
> +};
> +
>  #if defined(__cplusplus)
>  }
>  #endif

Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 0/4] drm: Standardize device reset notification
  2023-06-21  7:42 ` [RFC PATCH v3 0/4] drm: Standardize device reset notification Christian König
@ 2023-06-21 15:06   ` André Almeida
  2023-06-21 15:09     ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21 15:06 UTC (permalink / raw)
  To: Christian König
  Cc: kernel-dev, alexander.deucher, pierre-eric.pelloux-prayer,
	Simon Ser, linux-kernel, amd-gfx, Rob Clark, Pekka Paalanen,
	dri-devel, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Em 21/06/2023 04:42, Christian König escreveu:
> Am 21.06.23 um 02:57 schrieb André Almeida:
>> Hi,
>>
>> This is a new version of the documentation for DRM device resets. As I 
>> dived
>> more in the subject, I started to believe that part of the problem was 
>> the lack
>> of a DRM API to get reset information from the driver. With an API, we 
>> can
>> better standardize reset queries, increase common code from both DRM 
>> and Mesa,
>> and make easier to write end-to-end tests.
>>
>> So this patchset, along with the documentation, comes with a new IOCTL 
>> and two
>> implementations of it for amdgpu and i915 (although just the former 
>> was really
>> tested). This IOCTL uses the "context id" to query reset information, 
>> but this
>> might be not generic enough to be included in a DRM API.
> 
> Well the basic problem with that is that we don't have a standard DRM 
> context defined.
> 
> If you want to do this you should probably start there first.

Any idea on how to start this? I tried to find previous work about that, 
but I didn't find.

> 
> Apart from that this looks like a really really good idea to me, 
> especially that we document the reset expectations.

I think I'll submit just the doc for the next version then, given that 
the IOCTL will need a lot of rework.

> 
> Regards,
> Christian.
> 
>>    At least for amdgpu,
>> this information is encapsulated by libdrm so one can't just call the 
>> ioctl
>> directly from the UMD as I was planning to, but a small refactor can 
>> be done to
>> expose the id. Anyway, I'm sharing it as it is to gather feedback if 
>> this seems
>> to work.
>>
>> The amdgpu and i915 implementations are provided as a mean of testing 
>> and as
>> exemplification, and not as reference code yet, as the goal is more 
>> about the
>> interface itself then the driver parts.
>>
>> For the documentation itself, after spending some time reading the 
>> reset path in
>> the kernel in Mesa, I decide to rewrite it to better reflect how it 
>> works, from
>> bottom to top.
>>
>> You can check the userspace side of the IOCLT here:
>>   Mesa: 
>> https://gitlab.freedesktop.org/andrealmeid/mesa/-/commit/cd687b22fb32c21b23596c607003e2a495f465
>>   libdrm: 
>> https://gitlab.freedesktop.org/andrealmeid/libdrm/-/commit/b31e5404893ee9a85d1aa67e81c2f58c1dac3c46
>>
>> For testing, I use this vulkan app that has an infinity loop in the 
>> shader:
>> https://github.com/andrealmeid/vulkan-triangle-v1
>>
>> Feedbacks are welcomed!
>>
>> Thanks,
>>         André
>>
>> v2: 
>> https://lore.kernel.org/all/20230227204000.56787-1-andrealmeid@igalia.com/
>> v1: 
>> https://lore.kernel.org/all/20230123202646.356592-1-andrealmeid@igalia.com/
>>
>> André Almeida (4):
>>    drm/doc: Document DRM device reset expectations
>>    drm: Create DRM_IOCTL_GET_RESET
>>    drm/amdgpu: Implement DRM_IOCTL_GET_RESET
>>    drm/i915: Implement DRM_IOCTL_GET_RESET
>>
>>   Documentation/gpu/drm-uapi.rst                | 51 ++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 35 +++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h       |  5 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 12 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |  2 +
>>   drivers/gpu/drm/drm_debugfs.c                 |  2 +
>>   drivers/gpu/drm/drm_ioctl.c                   | 58 +++++++++++++++++++
>>   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 18 ++++++
>>   drivers/gpu/drm/i915/gem/i915_gem_context.h   |  2 +
>>   .../gpu/drm/i915/gem/i915_gem_context_types.h |  2 +
>>   drivers/gpu/drm/i915/i915_driver.c            |  2 +
>>   include/drm/drm_device.h                      |  3 +
>>   include/drm/drm_drv.h                         |  3 +
>>   include/uapi/drm/drm.h                        | 21 +++++++
>>   include/uapi/drm/drm_mode.h                   | 15 +++++
>>   17 files changed, 233 insertions(+), 3 deletions(-)
>>
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 0/4] drm: Standardize device reset notification
  2023-06-21 15:06   ` André Almeida
@ 2023-06-21 15:09     ` Christian König
  0 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2023-06-21 15:09 UTC (permalink / raw)
  To: André Almeida
  Cc: kernel-dev, alexander.deucher, pierre-eric.pelloux-prayer,
	Simon Ser, linux-kernel, amd-gfx, Rob Clark, Pekka Paalanen,
	dri-devel, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Am 21.06.23 um 17:06 schrieb André Almeida:
> Em 21/06/2023 04:42, Christian König escreveu:
>> Am 21.06.23 um 02:57 schrieb André Almeida:
>>> Hi,
>>>
>>> This is a new version of the documentation for DRM device resets. As 
>>> I dived
>>> more in the subject, I started to believe that part of the problem 
>>> was the lack
>>> of a DRM API to get reset information from the driver. With an API, 
>>> we can
>>> better standardize reset queries, increase common code from both DRM 
>>> and Mesa,
>>> and make easier to write end-to-end tests.
>>>
>>> So this patchset, along with the documentation, comes with a new 
>>> IOCTL and two
>>> implementations of it for amdgpu and i915 (although just the former 
>>> was really
>>> tested). This IOCTL uses the "context id" to query reset 
>>> information, but this
>>> might be not generic enough to be included in a DRM API.
>>
>> Well the basic problem with that is that we don't have a standard DRM 
>> context defined.
>>
>> If you want to do this you should probably start there first.
>
> Any idea on how to start this? I tried to find previous work about 
> that, but I didn't find.

I'm not aware of any work in this area, maybe ping on the Mesa list as well.

Could be that someone looked into that but never send anything out.

>
>>
>> Apart from that this looks like a really really good idea to me, 
>> especially that we document the reset expectations.
>
> I think I'll submit just the doc for the next version then, given that 
> the IOCTL will need a lot of rework.

Yeah, agree completely.

Thanks,
Christian.

>
>>
>> Regards,
>> Christian.
>>
>>>    At least for amdgpu,
>>> this information is encapsulated by libdrm so one can't just call 
>>> the ioctl
>>> directly from the UMD as I was planning to, but a small refactor can 
>>> be done to
>>> expose the id. Anyway, I'm sharing it as it is to gather feedback if 
>>> this seems
>>> to work.
>>>
>>> The amdgpu and i915 implementations are provided as a mean of 
>>> testing and as
>>> exemplification, and not as reference code yet, as the goal is more 
>>> about the
>>> interface itself then the driver parts.
>>>
>>> For the documentation itself, after spending some time reading the 
>>> reset path in
>>> the kernel in Mesa, I decide to rewrite it to better reflect how it 
>>> works, from
>>> bottom to top.
>>>
>>> You can check the userspace side of the IOCLT here:
>>>   Mesa: 
>>> https://gitlab.freedesktop.org/andrealmeid/mesa/-/commit/cd687b22fb32c21b23596c607003e2a495f465
>>>   libdrm: 
>>> https://gitlab.freedesktop.org/andrealmeid/libdrm/-/commit/b31e5404893ee9a85d1aa67e81c2f58c1dac3c46
>>>
>>> For testing, I use this vulkan app that has an infinity loop in the 
>>> shader:
>>> https://github.com/andrealmeid/vulkan-triangle-v1
>>>
>>> Feedbacks are welcomed!
>>>
>>> Thanks,
>>>         André
>>>
>>> v2: 
>>> https://lore.kernel.org/all/20230227204000.56787-1-andrealmeid@igalia.com/
>>> v1: 
>>> https://lore.kernel.org/all/20230123202646.356592-1-andrealmeid@igalia.com/
>>>
>>> André Almeida (4):
>>>    drm/doc: Document DRM device reset expectations
>>>    drm: Create DRM_IOCTL_GET_RESET
>>>    drm/amdgpu: Implement DRM_IOCTL_GET_RESET
>>>    drm/i915: Implement DRM_IOCTL_GET_RESET
>>>
>>>   Documentation/gpu/drm-uapi.rst                | 51 ++++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  4 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 35 +++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h       |  5 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 12 +++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |  2 +
>>>   drivers/gpu/drm/drm_debugfs.c                 |  2 +
>>>   drivers/gpu/drm/drm_ioctl.c                   | 58 
>>> +++++++++++++++++++
>>>   drivers/gpu/drm/i915/gem/i915_gem_context.c   | 18 ++++++
>>>   drivers/gpu/drm/i915/gem/i915_gem_context.h   |  2 +
>>>   .../gpu/drm/i915/gem/i915_gem_context_types.h |  2 +
>>>   drivers/gpu/drm/i915/i915_driver.c            |  2 +
>>>   include/drm/drm_device.h                      |  3 +
>>>   include/drm/drm_drv.h                         |  3 +
>>>   include/uapi/drm/drm.h                        | 21 +++++++
>>>   include/uapi/drm/drm_mode.h                   | 15 +++++
>>>   17 files changed, 233 insertions(+), 3 deletions(-)
>>>
>>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations
  2023-06-21  7:58   ` Pekka Paalanen
@ 2023-06-21 16:28     ` André Almeida
  2023-06-22  8:12       ` Pekka Paalanen
  0 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21 16:28 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Em 21/06/2023 04:58, Pekka Paalanen escreveu:
> On Tue, 20 Jun 2023 21:57:16 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> 
> Hi André,
> 
> nice to see this! I ended up giving lots of grammar comments, but I'm
> not a native speaker. Generally it looks good to me.

Thank you for your feedback :)

> 
>> ---
>>   Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 65 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..da4f8a694d8d 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>>   mmapped regular files. Threads cause additional pain with signal
>>   handling as well.
>>   
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware bugs,
>> +faulty applications and everything in between the many layers. To recover
>> +from this kind of state, sometimes is needed to reset the device. This section
> 
> It seems unclear what "this kind of state" refers to, so maybe just write "errors"?
> 
> Maybe:
> 
> 	Some errors require resetting the device in order to make the
> 	device usable again.
> 
> I presume that recovery does not mean that the failed job could recover.
> 
>> +describes what's the expectations for DRM and usermode drivers when a device
>> +resets and how to propagate the reset status.
>> +
>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and to perform
>> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD
> 
> s/hung/hang/ ?
> 
>> +then update it's internal reset tracking to be ready when userspace asks the
> 
> updates its
> 
> "update reset tracking"... do you mean that KMD records information
> about the reset in case userspace asks for it later?

Yes, kernel drivers do annotate whenever a reset happens, so it can 
report to userspace when it asks about resets.

For instance, this is the amdgpu implementation of 
AMDGPU_CTX_OP_QUERY_STATE2:

https://elixir.bootlin.com/linux/v6.3.8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c#L548 


You can see there stored information about resets.

> 
>> +kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
>> +for that.
> 
> At this point, I'm not sure what "reset tracking" or "reset
> information" entails. Could something be said about those?
>  >> +
>> +User Mode Driver
>> +----------------
>> +
>> +The UMD should check before submitting new commands to the KMD if the device has
>> +been reset, and this can be checked more often if it requires to. The
>> +DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
>> +detecting a reset, UMD will then proceed to report it to the application using
>> +the appropriated API error code, as explained in the bellow section about
> 
> s/bellow/below/
> 
>> +robustness.
>> +
>> +Robustness
>> +----------
>> +
>> +The only way to try to keep an application working after a reset is if it
>> +complies with the robustness aspects of the graphical API that is using.
> 
> that it is using.
> 
>> +
>> +Graphical APIs provide ways to application to deal with device resets. However,
> 
> provide ways for applications to deal with
> 
>> +there's no guarantee that the app will be correctly using such features, and UMD
>> +can implement policies to close the app if it's a repeating offender, likely in
>> +a broken loop. This is done to ensure that it doesn't keeps blocking the user
> 
> does not keep
> 
> I think contractions are usually avoided in documents, but I'm not
> bothering to flag them all.
> 
>> +interface to be correctly displayed.
> 
> interface from being correctly displayed.
> 
>> +
>> +OpenGL
>> +~~~~~~
>> +
>> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
>> +tells if a reset has happened, and if so, all the context state is considered
>> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
>> +will terminate the app when a reset is detected, giving that the contexts are
>> +lost and the app won't be able to figure this out and recreate the contexts.
> 
> What about GL ES? Is GL_ARB_robustness implemented or even defined there?
> 

I found this: 
https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt

"Since this is intended to be a version of ARB_robustness for OpenGL ES, 
it should be named accordingly."

I can add this to this paragraph.

> What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
> be enough from the app? The documented expectation is: "The application
> must destroy all contexts and reinitialise OpenGL ES state and objects
> to continue rendering."

I couldn't find the spec for EGL_CONTEXT_LOST, but I found for 
GL_CONTEXT_LOST, which I assume is similar.

GL_CONTEXT_LOST is only returned in some specific commands (that might 
cause a polling application to block indefinitely), so I don't think 
it's enough, given that the we can't guarantee that the application will 
call such commands after a reset, thus not being able to notice a reset.

https://registry.khronos.org/OpenGL-Refpages/gl4/html/glGetGraphicsResetStatus.xhtml

> 
>> +
>> +Vulkan
>> +~~~~~~
>> +
>> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
>> +This error code means, among other things, that a device reset has happened and
>> +it needs to recreate the contexts to keep going.
>> +
>> +Reporting resets causes
>> +-----------------------
>> +
>> +Apart from propagating the reset through the stack so apps can recover, it's
>> +really useful for driver developers to learn more about what caused the reset in
>> +first place. DRM devices should make use of devcoredump to store relevant
>> +information about the reset, so this information can be added to user bug
>> +reports.
>> +
>>   .. _drm_driver_ioctl:
>>   
>>   IOCTL Support on Device Nodes
> 
> What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
> reset wipe that buffer? How would that be communicated?
> 

Yes, it can.

> The dmabuf may have originated in another process.
> 

Indeed, I think we might need to add an error code for dmabuf calls so 
the buffer user knows that it's invalid now because a reset has happened 
in the other device. I will need to read more dmabuf code to make sure 
how this would be possible.

> 
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET
  2023-06-21  8:09   ` Pekka Paalanen
@ 2023-06-21 16:33     ` André Almeida
  2023-06-22  8:22       ` Pekka Paalanen
  0 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21 16:33 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Em 21/06/2023 05:09, Pekka Paalanen escreveu:
> On Tue, 20 Jun 2023 21:57:17 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
>> Create a new DRM ioctl operation to get the numbers of resets for a
>> given context. The numbers reflect just the resets that happened after
>> the context was created, and not since the machine was booted.
>>
>> Create a debugfs interface to make easier to test the API without real
>> resets.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>   drivers/gpu/drm/drm_debugfs.c |  2 ++
>>   drivers/gpu/drm/drm_ioctl.c   | 58 +++++++++++++++++++++++++++++++++++
>>   include/drm/drm_device.h      |  3 ++
>>   include/drm/drm_drv.h         |  3 ++
>>   include/uapi/drm/drm.h        | 21 +++++++++++++
>>   include/uapi/drm/drm_mode.h   | 15 +++++++++
>>   6 files changed, 102 insertions(+)
> 
> ...
> 
>> diff --git a/include/uapi/drm/drm.h b/include/uapi/drm/drm.h
>> index a87bbbbca2d4..a84559aa0d77 100644
>> --- a/include/uapi/drm/drm.h
>> +++ b/include/uapi/drm/drm.h
>> @@ -1169,6 +1169,27 @@ extern "C" {
>>    */
>>   #define DRM_IOCTL_MODE_GETFB2		DRM_IOWR(0xCE, struct drm_mode_fb_cmd2)
>>   
>> +/**
>> + * DRM_IOCTL_GET_RESET - Get information about device resets
>> + *
>> + * This operation requests from the device information about resets. It should
>> + * consider only resets that happens after the context is created, therefore,
>> + * the counter should be zero during context creation.
>> + *
>> + * dev_reset_count tells how many resets have happened on this device, and
>> + * ctx_reset_count tells how many of such resets were caused by this context.
>> + *
>> + * Flags can be used to tell if a reset is in progress, and userspace should
>> + * wait until it's not in progress anymore to be able to create a new context;
>> + * and to tell if the VRAM is considered lost. There's no safe way to clean this
>> + * flag so if a context see this flag set, it should be like that until the end
>> + * of the context.
> 
> Is "this flag" the VRAM_LOST? Or any flag?
> 
> Does this mean that not all resets are fatal to the context? Is there
> any kind of reset that should not be fatal to a context? All the
> rendering APIs seem to assume that any reset is fatal and the context
> must be destroyed.

I got this flag from the `AMDGPU_CTX_OP_QUERY_STATE2` operation, and 
it's used to notify that the reset was fatal for a giving context, 
although the idea of non-fatal resets seems to be a bit controversial 
for now, so I think it will be better if I leave this flag for latter 
improvements of the API.

> 
>> + */
>> +#define DRM_IOCTL_GET_RESET		DRM_IOWR(0xCF, struct drm_get_reset)
>> +
>> +#define DRM_RESET_IN_PROGRESS	0x1
>> +#define DRM_RESET_VRAM_LOST	0x2
> 
> Ok, so the dmabuf lost is being communicated here, but how would a
> userspace process know on which device a dmabuf resides on?
> 
> Let's assume process A uses device 1 to draw, exports a dmabuf, sends
> it to process B which imports it to device 2. Device 1 resets and loses
> VRAM contents. How would process B notice that the dmabuf is lost when
> it never touches device 1 itself?
> 
>> +
>>   /*
>>    * Device specific ioctls should only be in their respective headers
>>    * The device specific ioctl range is from 0x40 to 0x9f.
>> diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
>> index 43691058d28f..c3257bd1af9c 100644
>> --- a/include/uapi/drm/drm_mode.h
>> +++ b/include/uapi/drm/drm_mode.h
>> @@ -1308,6 +1308,21 @@ struct drm_mode_rect {
>>   	__s32 y2;
>>   };
>>   
>> +/**
>> + * struct drm_get_reset - Get information about a DRM device resets
>> + * @ctx_id: the context id to be queried about resets
>> + * @flags: flags
>> + * @dev_reset_count: global counter of resets for a given DRM device
>> + * @ctx_reset_count: of all the resets counted by this device, how many were
>> + * caused by this context.
>> + */
>> +struct drm_get_reset {
>> +	__u32 ctx_id;
>> +	__u32 flags;
>> +	__u64 dev_reset_count;
>> +	__u64 ctx_reset_count;
>> +};
>> +
>>   #if defined(__cplusplus)
>>   }
>>   #endif
> 
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET
  2023-06-21  7:40   ` Christian König
@ 2023-06-21 16:38     ` André Almeida
  2023-06-22  7:45       ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: André Almeida @ 2023-06-21 16:38 UTC (permalink / raw)
  To: Christian König
  Cc: kernel-dev, alexander.deucher, pierre-eric.pelloux-prayer,
	Simon Ser, Rob Clark, Pekka Paalanen, amd-gfx, dri-devel,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen, linux-kernel

Em 21/06/2023 04:40, Christian König escreveu:
> Am 21.06.23 um 02:57 schrieb André Almeida:
>> Implement get_reset ioctl for amdgpu
> 
> Well that pretty much won't work since the jobs are destroyed much later 
> than the contexts.
> 

Why does this prevents the code to work? If the context is detroyed, it 
can't be queried anyway.

> Christian.
> 
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  4 ++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 35 +++++++++++++++++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  5 ++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++++++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  2 ++
>>   6 files changed, 56 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 2eb2c66843a8..0ba26b4b039c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -1262,8 +1262,10 @@ static int amdgpu_cs_submit(struct 
>> amdgpu_cs_parser *p,
>>       uint64_t seq;
>>       int r;
>> -    for (i = 0; i < p->gang_size; ++i)
>> +    for (i = 0; i < p->gang_size; ++i) {
>> +        p->jobs[i]->ctx = p->ctx;
>>           drm_sched_job_arm(&p->jobs[i]->base);
>> +    }
>>       for (i = 0; i < p->gang_size; ++i) {
>>           struct dma_fence *fence;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index d2139ac12159..d3e292382d4a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -322,6 +322,9 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr 
>> *mgr, int32_t priority,
>>       ctx->init_priority = priority;
>>       ctx->override_priority = AMDGPU_CTX_PRIORITY_UNSET;
>> +    ctx->global_reset_counter = 
>> atomic_read(&mgr->adev->gpu_reset_counter);
>> +    ctx->local_reset_counter = 0;
>> +
>>       r = amdgpu_ctx_get_stable_pstate(ctx, &current_stable_pstate);
>>       if (r)
>>           return r;
>> @@ -963,3 +966,35 @@ void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr 
>> *mgr,
>>       }
>>       mutex_unlock(&mgr->lock);
>>   }
>> +
>> +int amdgpu_get_reset(struct drm_file *filp, struct drm_device *dev,
>> +             struct drm_get_reset *reset)
>> +{
>> +    struct amdgpu_device *adev = drm_to_adev(dev);
>> +    struct amdgpu_ctx *ctx;
>> +    struct amdgpu_ctx_mgr *mgr;
>> +    unsigned int id = reset->ctx_id;
>> +    struct amdgpu_fpriv *fpriv = filp->driver_priv;
>> +
>> +    mgr = &fpriv->ctx_mgr;
>> +    mutex_lock(&mgr->lock);
>> +    ctx = idr_find(&mgr->ctx_handles, id);
>> +    if (!ctx) {
>> +        mutex_unlock(&mgr->lock);
>> +        return -EINVAL;
>> +    }
>> +
>> +    reset->dev_reset_count =
>> +        atomic_read(&adev->gpu_reset_counter) - 
>> ctx->global_reset_counter;
>> +
>> +    reset->ctx_reset_count = ctx->local_reset_counter;
>> +
>> +    if (amdgpu_in_reset(adev))
>> +        reset->flags |= DRM_RESET_IN_PROGRESS;
>> +
>> +    if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>> +        reset->flags |= DRM_RESET_VRAM_LOST;
>> +
>> +    mutex_unlock(&mgr->lock);
>> +    return 0;
>> +}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>> index 0fa0e56daf67..0c9815695884 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>> @@ -57,6 +57,9 @@ struct amdgpu_ctx {
>>       unsigned long            ras_counter_ce;
>>       unsigned long            ras_counter_ue;
>>       uint32_t            stable_pstate;
>> +
>> +    uint64_t            global_reset_counter;
>> +    uint64_t            local_reset_counter;
>>   };
>>   struct amdgpu_ctx_mgr {
>> @@ -97,4 +100,6 @@ void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr);
>>   void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr,
>>                 ktime_t usage[AMDGPU_HW_IP_NUM]);
>> +int amdgpu_get_reset(struct drm_file *file_priv, struct drm_device *dev,
>> +             struct drm_get_reset *reset);
>>   #endif
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index c9a41c997c6c..431791b2c3cb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2805,6 +2805,7 @@ static const struct drm_driver amdgpu_kms_driver 
>> = {
>>   #ifdef CONFIG_PROC_FS
>>       .show_fdinfo = amdgpu_show_fdinfo,
>>   #endif
>> +    .get_reset = amdgpu_get_reset,
>>       .prime_handle_to_fd = drm_gem_prime_handle_to_fd,
>>       .prime_fd_to_handle = drm_gem_prime_fd_to_handle,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index c3d9d75143f4..1553a2633d46 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -35,11 +35,20 @@ static enum drm_gpu_sched_stat 
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   {
>>       struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>       struct amdgpu_job *job = to_amdgpu_job(s_job);
>> +    struct drm_sched_entity *entity = job->base.entity;
>>       struct amdgpu_task_info ti;
>>       struct amdgpu_device *adev = ring->adev;
>>       int idx;
>>       int r;
>> +    memset(&ti, 0, sizeof(struct amdgpu_task_info));
>> +    amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>> +
>> +    if (job->ctx) {
>> +        DRM_INFO("Increasing ctx reset count for %s (%d)\n", 
>> ti.process_name, ti.pid);
>> +        job->ctx->local_reset_counter++;
>> +    }
>> +
>>       if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
>>           DRM_INFO("%s - device unplugged skipping recovery on 
>> scheduler:%s",
>>                __func__, s_job->sched->name);
>> @@ -48,7 +57,6 @@ static enum drm_gpu_sched_stat 
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>           return DRM_GPU_SCHED_STAT_ENODEV;
>>       }
>> -    memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>       adev->job_hang = true;
>>       if (amdgpu_gpu_recovery &&
>> @@ -58,7 +66,6 @@ static enum drm_gpu_sched_stat 
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>           goto exit;
>>       }
>> -    amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>>       DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>>             job->base.sched->name, 
>> atomic_read(&ring->fence_drv.last_seq),
>>             ring->fence_drv.sync_seq);
>> @@ -105,6 +112,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, 
>> struct amdgpu_vm *vm,
>>        */
>>       (*job)->base.sched = &adev->rings[0]->sched;
>>       (*job)->vm = vm;
>> +    (*job)->ctx = NULL;
>>       amdgpu_sync_create(&(*job)->explicit_sync);
>>       (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> index 52f2e313ea17..0d463babaa60 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>> @@ -63,6 +63,8 @@ struct amdgpu_job {
>>       uint32_t        oa_base, oa_size;
>>       uint32_t        vram_lost_counter;
>> +    struct amdgpu_ctx    *ctx;
>> +
>>       /* user fence handling */
>>       uint64_t        uf_addr;
>>       uint64_t        uf_sequence;
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET
  2023-06-21 16:38     ` André Almeida
@ 2023-06-22  7:45       ` Christian König
  0 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2023-06-22  7:45 UTC (permalink / raw)
  To: André Almeida
  Cc: kernel-dev, alexander.deucher, pierre-eric.pelloux-prayer,
	Simon Ser, Rob Clark, Pekka Paalanen, amd-gfx, dri-devel,
	Daniel Vetter, Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen, linux-kernel

Am 21.06.23 um 18:38 schrieb André Almeida:
> Em 21/06/2023 04:40, Christian König escreveu:
>> Am 21.06.23 um 02:57 schrieb André Almeida:
>>> Implement get_reset ioctl for amdgpu
>>
>> Well that pretty much won't work since the jobs are destroyed much 
>> later than the contexts.
>>
>
> Why does this prevents the code to work? If the context is detroyed, 
> it can't be queried anyway.

Yeah, but you cause use after free issues with that!

The references are ctx->entit->job->fence, so that ctx and entity can be 
destroyed first without destroying the job or fence.

If the job has a back reference that whole stuff doesn't work any more 
and the pointer is potentially dangling.

Christian.

>
>> Christian.
>>
>>>
>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  4 ++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 35 
>>> +++++++++++++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  5 ++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 12 +++++++--
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  2 ++
>>>   6 files changed, 56 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 2eb2c66843a8..0ba26b4b039c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -1262,8 +1262,10 @@ static int amdgpu_cs_submit(struct 
>>> amdgpu_cs_parser *p,
>>>       uint64_t seq;
>>>       int r;
>>> -    for (i = 0; i < p->gang_size; ++i)
>>> +    for (i = 0; i < p->gang_size; ++i) {
>>> +        p->jobs[i]->ctx = p->ctx;
>>>           drm_sched_job_arm(&p->jobs[i]->base);
>>> +    }
>>>       for (i = 0; i < p->gang_size; ++i) {
>>>           struct dma_fence *fence;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index d2139ac12159..d3e292382d4a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -322,6 +322,9 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr 
>>> *mgr, int32_t priority,
>>>       ctx->init_priority = priority;
>>>       ctx->override_priority = AMDGPU_CTX_PRIORITY_UNSET;
>>> +    ctx->global_reset_counter = 
>>> atomic_read(&mgr->adev->gpu_reset_counter);
>>> +    ctx->local_reset_counter = 0;
>>> +
>>>       r = amdgpu_ctx_get_stable_pstate(ctx, &current_stable_pstate);
>>>       if (r)
>>>           return r;
>>> @@ -963,3 +966,35 @@ void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr 
>>> *mgr,
>>>       }
>>>       mutex_unlock(&mgr->lock);
>>>   }
>>> +
>>> +int amdgpu_get_reset(struct drm_file *filp, struct drm_device *dev,
>>> +             struct drm_get_reset *reset)
>>> +{
>>> +    struct amdgpu_device *adev = drm_to_adev(dev);
>>> +    struct amdgpu_ctx *ctx;
>>> +    struct amdgpu_ctx_mgr *mgr;
>>> +    unsigned int id = reset->ctx_id;
>>> +    struct amdgpu_fpriv *fpriv = filp->driver_priv;
>>> +
>>> +    mgr = &fpriv->ctx_mgr;
>>> +    mutex_lock(&mgr->lock);
>>> +    ctx = idr_find(&mgr->ctx_handles, id);
>>> +    if (!ctx) {
>>> +        mutex_unlock(&mgr->lock);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    reset->dev_reset_count =
>>> +        atomic_read(&adev->gpu_reset_counter) - 
>>> ctx->global_reset_counter;
>>> +
>>> +    reset->ctx_reset_count = ctx->local_reset_counter;
>>> +
>>> +    if (amdgpu_in_reset(adev))
>>> +        reset->flags |= DRM_RESET_IN_PROGRESS;
>>> +
>>> +    if (ctx->vram_lost_counter != 
>>> atomic_read(&adev->vram_lost_counter))
>>> +        reset->flags |= DRM_RESET_VRAM_LOST;
>>> +
>>> +    mutex_unlock(&mgr->lock);
>>> +    return 0;
>>> +}
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>> index 0fa0e56daf67..0c9815695884 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>>> @@ -57,6 +57,9 @@ struct amdgpu_ctx {
>>>       unsigned long            ras_counter_ce;
>>>       unsigned long            ras_counter_ue;
>>>       uint32_t            stable_pstate;
>>> +
>>> +    uint64_t            global_reset_counter;
>>> +    uint64_t            local_reset_counter;
>>>   };
>>>   struct amdgpu_ctx_mgr {
>>> @@ -97,4 +100,6 @@ void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr 
>>> *mgr);
>>>   void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr,
>>>                 ktime_t usage[AMDGPU_HW_IP_NUM]);
>>> +int amdgpu_get_reset(struct drm_file *file_priv, struct drm_device 
>>> *dev,
>>> +             struct drm_get_reset *reset);
>>>   #endif
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index c9a41c997c6c..431791b2c3cb 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -2805,6 +2805,7 @@ static const struct drm_driver 
>>> amdgpu_kms_driver = {
>>>   #ifdef CONFIG_PROC_FS
>>>       .show_fdinfo = amdgpu_show_fdinfo,
>>>   #endif
>>> +    .get_reset = amdgpu_get_reset,
>>>       .prime_handle_to_fd = drm_gem_prime_handle_to_fd,
>>>       .prime_fd_to_handle = drm_gem_prime_fd_to_handle,
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index c3d9d75143f4..1553a2633d46 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -35,11 +35,20 @@ static enum drm_gpu_sched_stat 
>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>   {
>>>       struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>       struct amdgpu_job *job = to_amdgpu_job(s_job);
>>> +    struct drm_sched_entity *entity = job->base.entity;
>>>       struct amdgpu_task_info ti;
>>>       struct amdgpu_device *adev = ring->adev;
>>>       int idx;
>>>       int r;
>>> +    memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>> +    amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>>> +
>>> +    if (job->ctx) {
>>> +        DRM_INFO("Increasing ctx reset count for %s (%d)\n", 
>>> ti.process_name, ti.pid);
>>> +        job->ctx->local_reset_counter++;
>>> +    }
>>> +
>>>       if (!drm_dev_enter(adev_to_drm(adev), &idx)) {
>>>           DRM_INFO("%s - device unplugged skipping recovery on 
>>> scheduler:%s",
>>>                __func__, s_job->sched->name);
>>> @@ -48,7 +57,6 @@ static enum drm_gpu_sched_stat 
>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>           return DRM_GPU_SCHED_STAT_ENODEV;
>>>       }
>>> -    memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>>       adev->job_hang = true;
>>>       if (amdgpu_gpu_recovery &&
>>> @@ -58,7 +66,6 @@ static enum drm_gpu_sched_stat 
>>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>           goto exit;
>>>       }
>>> -    amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>>>       DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>>>             job->base.sched->name, 
>>> atomic_read(&ring->fence_drv.last_seq),
>>>             ring->fence_drv.sync_seq);
>>> @@ -105,6 +112,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, 
>>> struct amdgpu_vm *vm,
>>>        */
>>>       (*job)->base.sched = &adev->rings[0]->sched;
>>>       (*job)->vm = vm;
>>> +    (*job)->ctx = NULL;
>>>       amdgpu_sync_create(&(*job)->explicit_sync);
>>>       (*job)->vram_lost_counter = 
>>> atomic_read(&adev->vram_lost_counter);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> index 52f2e313ea17..0d463babaa60 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>> @@ -63,6 +63,8 @@ struct amdgpu_job {
>>>       uint32_t        oa_base, oa_size;
>>>       uint32_t        vram_lost_counter;
>>> +    struct amdgpu_ctx    *ctx;
>>> +
>>>       /* user fence handling */
>>>       uint64_t        uf_addr;
>>>       uint64_t        uf_sequence;
>>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations
  2023-06-21 16:28     ` André Almeida
@ 2023-06-22  8:12       ` Pekka Paalanen
  2023-06-26 16:15         ` André Almeida
  0 siblings, 1 reply; 20+ messages in thread
From: Pekka Paalanen @ 2023-06-22  8:12 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

[-- Attachment #1: Type: text/plain, Size: 9681 bytes --]

On Wed, 21 Jun 2023 13:28:34 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Em 21/06/2023 04:58, Pekka Paalanen escreveu:
> > On Tue, 20 Jun 2023 21:57:16 -0300
> > André Almeida <andrealmeid@igalia.com> wrote:
> >   
> >> Create a section that specifies how to deal with DRM device resets for
> >> kernel and userspace drivers.
> >>
> >> Signed-off-by: André Almeida <andrealmeid@igalia.com>  
> > 
> > Hi André,
> > 
> > nice to see this! I ended up giving lots of grammar comments, but I'm
> > not a native speaker. Generally it looks good to me.  
> 
> Thank you for your feedback :)
> 
> >   
> >> ---
> >>   Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
> >>   1 file changed, 65 insertions(+)
> >>
> >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >> index 65fb3036a580..da4f8a694d8d 100644
> >> --- a/Documentation/gpu/drm-uapi.rst
> >> +++ b/Documentation/gpu/drm-uapi.rst
> >> @@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >>   mmapped regular files. Threads cause additional pain with signal
> >>   handling as well.
> >>   
> >> +Device reset
> >> +============
> >> +
> >> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> >> +faulty applications and everything in between the many layers. To recover
> >> +from this kind of state, sometimes is needed to reset the device. This section  
> > 
> > It seems unclear what "this kind of state" refers to, so maybe just write "errors"?
> > 
> > Maybe:
> > 
> > 	Some errors require resetting the device in order to make the
> > 	device usable again.
> > 
> > I presume that recovery does not mean that the failed job could recover.
> >   
> >> +describes what's the expectations for DRM and usermode drivers when a device
> >> +resets and how to propagate the reset status.
> >> +
> >> +Kernel Mode Driver
> >> +------------------
> >> +
> >> +The KMD is responsible for checking if the device needs a reset, and to perform
> >> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD  
> > 
> > s/hung/hang/ ?
> >   
> >> +then update it's internal reset tracking to be ready when userspace asks the  
> > 
> > updates its
> > 
> > "update reset tracking"... do you mean that KMD records information
> > about the reset in case userspace asks for it later?  
> 
> Yes, kernel drivers do annotate whenever a reset happens, so it can 
> report to userspace when it asks about resets.
> 
> For instance, this is the amdgpu implementation of 
> AMDGPU_CTX_OP_QUERY_STATE2:
> 
> https://elixir.bootlin.com/linux/v6.3.8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c#L548 
> 
> 
> You can see there stored information about resets.

Hi André,

right. What I mean is, if I have to ask this, then that implies that
the wording could be more clear.

I don't know if "reset tracking" is some sub-system that is turned on
and off as needed or what updating it would mean.

> >   
> >> +kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
> >> +for that.  
> > 
> > At this point, I'm not sure what "reset tracking" or "reset
> > information" entails. Could something be said about those?  
> >  >> +  
> >> +User Mode Driver
> >> +----------------
> >> +
> >> +The UMD should check before submitting new commands to the KMD if the device has
> >> +been reset, and this can be checked more often if it requires to. The
> >> +DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
> >> +detecting a reset, UMD will then proceed to report it to the application using
> >> +the appropriated API error code, as explained in the bellow section about  
> > 
> > s/bellow/below/
> >   
> >> +robustness.
> >> +
> >> +Robustness
> >> +----------
> >> +
> >> +The only way to try to keep an application working after a reset is if it
> >> +complies with the robustness aspects of the graphical API that is using.  
> > 
> > that it is using.
> >   
> >> +
> >> +Graphical APIs provide ways to application to deal with device resets. However,  
> > 
> > provide ways for applications to deal with
> >   
> >> +there's no guarantee that the app will be correctly using such features, and UMD
> >> +can implement policies to close the app if it's a repeating offender, likely in
> >> +a broken loop. This is done to ensure that it doesn't keeps blocking the user  
> > 
> > does not keep
> > 
> > I think contractions are usually avoided in documents, but I'm not
> > bothering to flag them all.
> >   
> >> +interface to be correctly displayed.  
> > 
> > interface from being correctly displayed.
> >   
> >> +
> >> +OpenGL
> >> +~~~~~~
> >> +
> >> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
> >> +tells if a reset has happened, and if so, all the context state is considered
> >> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
> >> +will terminate the app when a reset is detected, giving that the contexts are
> >> +lost and the app won't be able to figure this out and recreate the contexts.  
> > 
> > What about GL ES? Is GL_ARB_robustness implemented or even defined there?
> >   
> 
> I found this: 
> https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
> 
> "Since this is intended to be a version of ARB_robustness for OpenGL ES, 
> it should be named accordingly."
> 
> I can add this to this paragraph.

Yes, please!

I suppose there could be even more extensions with similar benefits, so
maybe these extension should be mentioned as examples. Right now the
wording sounds like these are the chosen extensions, and if you don't
use one, the process will be terminated.

> 
> > What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
> > be enough from the app? The documented expectation is: "The application
> > must destroy all contexts and reinitialise OpenGL ES state and objects
> > to continue rendering."  
> 
> I couldn't find the spec for EGL_CONTEXT_LOST, but I found for 
> GL_CONTEXT_LOST, which I assume is similar.

EGL Version 1.5 - August 27, 2014

Section 2.7 Power Management

	Following a power management event, calls to eglSwapBuffers,
	eglCopyBuffers, or eglMakeCurrent will indicate failure by
	returning EGL_FALSE. The error EGL_CONTEXT_LOST will be
	returned if a power management event has occurred.

	On detection of this error, the application must destroy all
	contexts (by calling eglDestroyContext for each context). To
	continue rendering the application must recreate any contexts
	it requires, and subsequently restore any client API state and
	objects it wishes to use.

It is talking about power management which is not quite GPU reset, but
I see so much similarity that I'd say it doesn't matter which one
actually happened. The only difference is that power management events
are not caused by application bugs, which means that the application
will simply re-initialize and retry, which may result in a reset loop.

You already wrote provision to handle reset loops, and I'm not sure
applications handling EGL_CONTEXT_LOST would/could ever infer that they
are the culprit without using robustness extensions.

I can see how EGL_CONTEXT_LOST could be deemed unsuitable for resets,
too.

> 
> GL_CONTEXT_LOST is only returned in some specific commands (that might 
> cause a polling application to block indefinitely), so I don't think 
> it's enough, given that the we can't guarantee that the application will 
> call such commands after a reset, thus not being able to notice a reset.
> 
> https://registry.khronos.org/OpenGL-Refpages/gl4/html/glGetGraphicsResetStatus.xhtml

Ok, another API for a similar thing.

So in that case, the app does not need to use a robustness extension if
it uses OpenGL 4.5 and bothers to check.

This makes the wording "If robustness is not in use" problematic,
because it seems complicated to determine if robusteness is in use in
any particular application. I suppose Mesa would track if the app ever
called glGetGraphicsResetStatus() before drawing after reset?


Thanks,
pq

> >   
> >> +
> >> +Vulkan
> >> +~~~~~~
> >> +
> >> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> >> +This error code means, among other things, that a device reset has happened and
> >> +it needs to recreate the contexts to keep going.
> >> +
> >> +Reporting resets causes
> >> +-----------------------
> >> +
> >> +Apart from propagating the reset through the stack so apps can recover, it's
> >> +really useful for driver developers to learn more about what caused the reset in
> >> +first place. DRM devices should make use of devcoredump to store relevant
> >> +information about the reset, so this information can be added to user bug
> >> +reports.
> >> +
> >>   .. _drm_driver_ioctl:
> >>   
> >>   IOCTL Support on Device Nodes  
> > 
> > What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
> > reset wipe that buffer? How would that be communicated?
> >   
> 
> Yes, it can.
> 
> > The dmabuf may have originated in another process.
> >   
> 
> Indeed, I think we might need to add an error code for dmabuf calls so 
> the buffer user knows that it's invalid now because a reset has happened 
> in the other device. I will need to read more dmabuf code to make sure 
> how this would be possible.
> 
> > 
> > Thanks,
> > pq  


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET
  2023-06-21 16:33     ` André Almeida
@ 2023-06-22  8:22       ` Pekka Paalanen
  2023-06-22  9:59         ` Christian König
  0 siblings, 1 reply; 20+ messages in thread
From: Pekka Paalanen @ 2023-06-22  8:22 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

[-- Attachment #1: Type: text/plain, Size: 4967 bytes --]

On Wed, 21 Jun 2023 13:33:56 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Em 21/06/2023 05:09, Pekka Paalanen escreveu:
> > On Tue, 20 Jun 2023 21:57:17 -0300
> > André Almeida <andrealmeid@igalia.com> wrote:
> >   
> >> Create a new DRM ioctl operation to get the numbers of resets for a
> >> given context. The numbers reflect just the resets that happened after
> >> the context was created, and not since the machine was booted.
> >>
> >> Create a debugfs interface to make easier to test the API without real
> >> resets.
> >>
> >> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> >> ---
> >>   drivers/gpu/drm/drm_debugfs.c |  2 ++
> >>   drivers/gpu/drm/drm_ioctl.c   | 58 +++++++++++++++++++++++++++++++++++
> >>   include/drm/drm_device.h      |  3 ++
> >>   include/drm/drm_drv.h         |  3 ++
> >>   include/uapi/drm/drm.h        | 21 +++++++++++++
> >>   include/uapi/drm/drm_mode.h   | 15 +++++++++
> >>   6 files changed, 102 insertions(+)  
> > 
> > ...
> >   
> >> diff --git a/include/uapi/drm/drm.h b/include/uapi/drm/drm.h
> >> index a87bbbbca2d4..a84559aa0d77 100644
> >> --- a/include/uapi/drm/drm.h
> >> +++ b/include/uapi/drm/drm.h
> >> @@ -1169,6 +1169,27 @@ extern "C" {
> >>    */
> >>   #define DRM_IOCTL_MODE_GETFB2		DRM_IOWR(0xCE, struct drm_mode_fb_cmd2)
> >>   
> >> +/**
> >> + * DRM_IOCTL_GET_RESET - Get information about device resets
> >> + *
> >> + * This operation requests from the device information about resets. It should
> >> + * consider only resets that happens after the context is created, therefore,
> >> + * the counter should be zero during context creation.
> >> + *
> >> + * dev_reset_count tells how many resets have happened on this device, and
> >> + * ctx_reset_count tells how many of such resets were caused by this context.
> >> + *
> >> + * Flags can be used to tell if a reset is in progress, and userspace should
> >> + * wait until it's not in progress anymore to be able to create a new context;
> >> + * and to tell if the VRAM is considered lost. There's no safe way to clean this
> >> + * flag so if a context see this flag set, it should be like that until the end
> >> + * of the context.  
> > 
> > Is "this flag" the VRAM_LOST? Or any flag?
> > 
> > Does this mean that not all resets are fatal to the context? Is there
> > any kind of reset that should not be fatal to a context? All the
> > rendering APIs seem to assume that any reset is fatal and the context
> > must be destroyed.  
> 
> I got this flag from the `AMDGPU_CTX_OP_QUERY_STATE2` operation, and 
> it's used to notify that the reset was fatal for a giving context, 
> although the idea of non-fatal resets seems to be a bit controversial 
> for now, so I think it will be better if I leave this flag for latter 
> improvements of the API.

Which flag is "this flag"? There are RESET_IN_PROGRESS and VRAM_LOST.
Both are fine by me to exist.

I think I made a wrong conclusion here. Somehow I read that it would be
possible to have a reset happen, and if VRAM is not lost, then the
context could work again.

Should there be some wording added to say the context is permanently
broken on any kind of reset? Or is that for UMD to decide?


Thanks,
pq

> >   
> >> + */
> >> +#define DRM_IOCTL_GET_RESET		DRM_IOWR(0xCF, struct drm_get_reset)
> >> +
> >> +#define DRM_RESET_IN_PROGRESS	0x1
> >> +#define DRM_RESET_VRAM_LOST	0x2  
> > 
> > Ok, so the dmabuf lost is being communicated here, but how would a
> > userspace process know on which device a dmabuf resides on?
> > 
> > Let's assume process A uses device 1 to draw, exports a dmabuf, sends
> > it to process B which imports it to device 2. Device 1 resets and loses
> > VRAM contents. How would process B notice that the dmabuf is lost when
> > it never touches device 1 itself?
> >   
> >> +
> >>   /*
> >>    * Device specific ioctls should only be in their respective headers
> >>    * The device specific ioctl range is from 0x40 to 0x9f.
> >> diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
> >> index 43691058d28f..c3257bd1af9c 100644
> >> --- a/include/uapi/drm/drm_mode.h
> >> +++ b/include/uapi/drm/drm_mode.h
> >> @@ -1308,6 +1308,21 @@ struct drm_mode_rect {
> >>   	__s32 y2;
> >>   };
> >>   
> >> +/**
> >> + * struct drm_get_reset - Get information about a DRM device resets
> >> + * @ctx_id: the context id to be queried about resets
> >> + * @flags: flags
> >> + * @dev_reset_count: global counter of resets for a given DRM device
> >> + * @ctx_reset_count: of all the resets counted by this device, how many were
> >> + * caused by this context.
> >> + */
> >> +struct drm_get_reset {
> >> +	__u32 ctx_id;
> >> +	__u32 flags;
> >> +	__u64 dev_reset_count;
> >> +	__u64 ctx_reset_count;
> >> +};
> >> +
> >>   #if defined(__cplusplus)
> >>   }
> >>   #endif  
> > 
> > Thanks,
> > pq  


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET
  2023-06-22  8:22       ` Pekka Paalanen
@ 2023-06-22  9:59         ` Christian König
  0 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2023-06-22  9:59 UTC (permalink / raw)
  To: Pekka Paalanen, André Almeida
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	pierre-eric.pelloux-prayer, Simon Ser, Rob Clark, Daniel Vetter,
	Daniel Stone, 'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Am 22.06.23 um 10:22 schrieb Pekka Paalanen:
> On Wed, 21 Jun 2023 13:33:56 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
>
>> Em 21/06/2023 05:09, Pekka Paalanen escreveu:
>>> On Tue, 20 Jun 2023 21:57:17 -0300
>>> André Almeida <andrealmeid@igalia.com> wrote:
>>>    
>>>> Create a new DRM ioctl operation to get the numbers of resets for a
>>>> given context. The numbers reflect just the resets that happened after
>>>> the context was created, and not since the machine was booted.
>>>>
>>>> Create a debugfs interface to make easier to test the API without real
>>>> resets.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>> ---
>>>>    drivers/gpu/drm/drm_debugfs.c |  2 ++
>>>>    drivers/gpu/drm/drm_ioctl.c   | 58 +++++++++++++++++++++++++++++++++++
>>>>    include/drm/drm_device.h      |  3 ++
>>>>    include/drm/drm_drv.h         |  3 ++
>>>>    include/uapi/drm/drm.h        | 21 +++++++++++++
>>>>    include/uapi/drm/drm_mode.h   | 15 +++++++++
>>>>    6 files changed, 102 insertions(+)
>>> ...
>>>    
>>>> diff --git a/include/uapi/drm/drm.h b/include/uapi/drm/drm.h
>>>> index a87bbbbca2d4..a84559aa0d77 100644
>>>> --- a/include/uapi/drm/drm.h
>>>> +++ b/include/uapi/drm/drm.h
>>>> @@ -1169,6 +1169,27 @@ extern "C" {
>>>>     */
>>>>    #define DRM_IOCTL_MODE_GETFB2		DRM_IOWR(0xCE, struct drm_mode_fb_cmd2)
>>>>    
>>>> +/**
>>>> + * DRM_IOCTL_GET_RESET - Get information about device resets
>>>> + *
>>>> + * This operation requests from the device information about resets. It should
>>>> + * consider only resets that happens after the context is created, therefore,
>>>> + * the counter should be zero during context creation.
>>>> + *
>>>> + * dev_reset_count tells how many resets have happened on this device, and
>>>> + * ctx_reset_count tells how many of such resets were caused by this context.
>>>> + *
>>>> + * Flags can be used to tell if a reset is in progress, and userspace should
>>>> + * wait until it's not in progress anymore to be able to create a new context;
>>>> + * and to tell if the VRAM is considered lost. There's no safe way to clean this
>>>> + * flag so if a context see this flag set, it should be like that until the end
>>>> + * of the context.
>>> Is "this flag" the VRAM_LOST? Or any flag?
>>>
>>> Does this mean that not all resets are fatal to the context? Is there
>>> any kind of reset that should not be fatal to a context? All the
>>> rendering APIs seem to assume that any reset is fatal and the context
>>> must be destroyed.
>> I got this flag from the `AMDGPU_CTX_OP_QUERY_STATE2` operation, and
>> it's used to notify that the reset was fatal for a giving context,
>> although the idea of non-fatal resets seems to be a bit controversial
>> for now, so I think it will be better if I leave this flag for latter
>> improvements of the API.
> Which flag is "this flag"? There are RESET_IN_PROGRESS and VRAM_LOST.
> Both are fine by me to exist.
>
> I think I made a wrong conclusion here. Somehow I read that it would be
> possible to have a reset happen, and if VRAM is not lost, then the
> context could work again.

Yeah, that's exactly what AMD tries to do.

And no, I'm absolutely not keen about that idea.

Regards,
Christian.

> Should there be some wording added to say the context is permanently
> broken on any kind of reset? Or is that for UMD to decide?
>
>
> Thanks,
> pq
>
>>>    
>>>> + */
>>>> +#define DRM_IOCTL_GET_RESET		DRM_IOWR(0xCF, struct drm_get_reset)
>>>> +
>>>> +#define DRM_RESET_IN_PROGRESS	0x1
>>>> +#define DRM_RESET_VRAM_LOST	0x2
>>> Ok, so the dmabuf lost is being communicated here, but how would a
>>> userspace process know on which device a dmabuf resides on?
>>>
>>> Let's assume process A uses device 1 to draw, exports a dmabuf, sends
>>> it to process B which imports it to device 2. Device 1 resets and loses
>>> VRAM contents. How would process B notice that the dmabuf is lost when
>>> it never touches device 1 itself?
>>>    
>>>> +
>>>>    /*
>>>>     * Device specific ioctls should only be in their respective headers
>>>>     * The device specific ioctl range is from 0x40 to 0x9f.
>>>> diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
>>>> index 43691058d28f..c3257bd1af9c 100644
>>>> --- a/include/uapi/drm/drm_mode.h
>>>> +++ b/include/uapi/drm/drm_mode.h
>>>> @@ -1308,6 +1308,21 @@ struct drm_mode_rect {
>>>>    	__s32 y2;
>>>>    };
>>>>    
>>>> +/**
>>>> + * struct drm_get_reset - Get information about a DRM device resets
>>>> + * @ctx_id: the context id to be queried about resets
>>>> + * @flags: flags
>>>> + * @dev_reset_count: global counter of resets for a given DRM device
>>>> + * @ctx_reset_count: of all the resets counted by this device, how many were
>>>> + * caused by this context.
>>>> + */
>>>> +struct drm_get_reset {
>>>> +	__u32 ctx_id;
>>>> +	__u32 flags;
>>>> +	__u64 dev_reset_count;
>>>> +	__u64 ctx_reset_count;
>>>> +};
>>>> +
>>>>    #if defined(__cplusplus)
>>>>    }
>>>>    #endif
>>> Thanks,
>>> pq


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations
  2023-06-22  8:12       ` Pekka Paalanen
@ 2023-06-26 16:15         ` André Almeida
  0 siblings, 0 replies; 20+ messages in thread
From: André Almeida @ 2023-06-26 16:15 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: dri-devel, amd-gfx, linux-kernel, kernel-dev, alexander.deucher,
	christian.koenig, pierre-eric.pelloux-prayer, Simon Ser,
	Rob Clark, Daniel Vetter, Daniel Stone,
	'Marek Olšák',
	Dave Airlie, Michel Dänzer, Samuel Pitoiset,
	Timur Kristóf, Bas Nieuwenhuizen

Em 22/06/2023 05:12, Pekka Paalanen escreveu:
> On Wed, 21 Jun 2023 13:28:34 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
>> Em 21/06/2023 04:58, Pekka Paalanen escreveu:
>>> On Tue, 20 Jun 2023 21:57:16 -0300
>>> André Almeida <andrealmeid@igalia.com> wrote:
>>>    
>>>> Create a section that specifies how to deal with DRM device resets for
>>>> kernel and userspace drivers.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>
>>> Hi André,
>>>
>>> nice to see this! I ended up giving lots of grammar comments, but I'm
>>> not a native speaker. Generally it looks good to me.
>>
>> Thank you for your feedback :)
>>
>>>    
>>>> ---
>>>>    Documentation/gpu/drm-uapi.rst | 65 ++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 65 insertions(+)
>>>>
>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>> index 65fb3036a580..da4f8a694d8d 100644
>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>> @@ -285,6 +285,71 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>>>>    mmapped regular files. Threads cause additional pain with signal
>>>>    handling as well.
>>>>    
>>>> +Device reset
>>>> +============
>>>> +
>>>> +The GPU stack is really complex and is prone to errors, from hardware bugs,
>>>> +faulty applications and everything in between the many layers. To recover
>>>> +from this kind of state, sometimes is needed to reset the device. This section
>>>
>>> It seems unclear what "this kind of state" refers to, so maybe just write "errors"?
>>>
>>> Maybe:
>>>
>>> 	Some errors require resetting the device in order to make the
>>> 	device usable again.
>>>
>>> I presume that recovery does not mean that the failed job could recover.
>>>    
>>>> +describes what's the expectations for DRM and usermode drivers when a device
>>>> +resets and how to propagate the reset status.
>>>> +
>>>> +Kernel Mode Driver
>>>> +------------------
>>>> +
>>>> +The KMD is responsible for checking if the device needs a reset, and to perform
>>>> +it as needed. Usually a hung is detected when a job gets stuck executing. KMD
>>>
>>> s/hung/hang/ ?
>>>    
>>>> +then update it's internal reset tracking to be ready when userspace asks the
>>>
>>> updates its
>>>
>>> "update reset tracking"... do you mean that KMD records information
>>> about the reset in case userspace asks for it later?
>>
>> Yes, kernel drivers do annotate whenever a reset happens, so it can
>> report to userspace when it asks about resets.
>>
>> For instance, this is the amdgpu implementation of
>> AMDGPU_CTX_OP_QUERY_STATE2:
>>
>> https://elixir.bootlin.com/linux/v6.3.8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c#L548
>>
>>
>> You can see there stored information about resets.
> 
> Hi André,
> 
> right. What I mean is, if I have to ask this, then that implies that
> the wording could be more clear.
> 
> I don't know if "reset tracking" is some sub-system that is turned on
> and off as needed or what updating it would mean.
> 

Understood, I'll rewrite it to be more clear.

>>>    
>>>> +kernel about reset information. Drivers should implement the DRM_IOCTL_GET_RESET
>>>> +for that.
>>>
>>> At this point, I'm not sure what "reset tracking" or "reset
>>> information" entails. Could something be said about those?
>>>   >> +
>>>> +User Mode Driver
>>>> +----------------
>>>> +
>>>> +The UMD should check before submitting new commands to the KMD if the device has
>>>> +been reset, and this can be checked more often if it requires to. The
>>>> +DRM_IOCTL_GET_RESET is the default interface for those kind of checks. After
>>>> +detecting a reset, UMD will then proceed to report it to the application using
>>>> +the appropriated API error code, as explained in the bellow section about
>>>
>>> s/bellow/below/
>>>    
>>>> +robustness.
>>>> +
>>>> +Robustness
>>>> +----------
>>>> +
>>>> +The only way to try to keep an application working after a reset is if it
>>>> +complies with the robustness aspects of the graphical API that is using.
>>>
>>> that it is using.
>>>    
>>>> +
>>>> +Graphical APIs provide ways to application to deal with device resets. However,
>>>
>>> provide ways for applications to deal with
>>>    
>>>> +there's no guarantee that the app will be correctly using such features, and UMD
>>>> +can implement policies to close the app if it's a repeating offender, likely in
>>>> +a broken loop. This is done to ensure that it doesn't keeps blocking the user
>>>
>>> does not keep
>>>
>>> I think contractions are usually avoided in documents, but I'm not
>>> bothering to flag them all.
>>>    
>>>> +interface to be correctly displayed.
>>>
>>> interface from being correctly displayed.
>>>    
>>>> +
>>>> +OpenGL
>>>> +~~~~~~
>>>> +
>>>> +Apps using OpenGL can rely on ``GL_ARB_robustness`` to be robust. This extension
>>>> +tells if a reset has happened, and if so, all the context state is considered
>>>> +lost and the app proceeds by creating new ones. If robustness isn't in use, UMD
>>>> +will terminate the app when a reset is detected, giving that the contexts are
>>>> +lost and the app won't be able to figure this out and recreate the contexts.
>>>
>>> What about GL ES? Is GL_ARB_robustness implemented or even defined there?
>>>    
>>
>> I found this:
>> https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
>>
>> "Since this is intended to be a version of ARB_robustness for OpenGL ES,
>> it should be named accordingly."
>>
>> I can add this to this paragraph.
> 
> Yes, please!
> 
> I suppose there could be even more extensions with similar benefits, so
> maybe these extension should be mentioned as examples. Right now the
> wording sounds like these are the chosen extensions, and if you don't
> use one, the process will be terminated.
> 
>>
>>> What about EGL returning errors like EGL_CONTEXT_LOST, would handling that not
>>> be enough from the app? The documented expectation is: "The application
>>> must destroy all contexts and reinitialise OpenGL ES state and objects
>>> to continue rendering."
>>
>> I couldn't find the spec for EGL_CONTEXT_LOST, but I found for
>> GL_CONTEXT_LOST, which I assume is similar.
> 
> EGL Version 1.5 - August 27, 2014
> 
> Section 2.7 Power Management
> 
> 	Following a power management event, calls to eglSwapBuffers,
> 	eglCopyBuffers, or eglMakeCurrent will indicate failure by
> 	returning EGL_FALSE. The error EGL_CONTEXT_LOST will be
> 	returned if a power management event has occurred.
> 
> 	On detection of this error, the application must destroy all
> 	contexts (by calling eglDestroyContext for each context). To
> 	continue rendering the application must recreate any contexts
> 	it requires, and subsequently restore any client API state and
> 	objects it wishes to use.
> 
> It is talking about power management which is not quite GPU reset, but
> I see so much similarity that I'd say it doesn't matter which one
> actually happened. The only difference is that power management events
> are not caused by application bugs, which means that the application
> will simply re-initialize and retry, which may result in a reset loop.
> 
> You already wrote provision to handle reset loops, and I'm not sure
> applications handling EGL_CONTEXT_LOST would/could ever infer that they
> are the culprit without using robustness extensions.
> 
> I can see how EGL_CONTEXT_LOST could be deemed unsuitable for resets,
> too.
> 

Indeed, this is tricky. However, I believe that the complex nature of 
the stack can lead to situations where an app is causing the hardware to 
change it's power management settings. Even though the app isn't doing 
it on purpose, as stated in the introduction of this section, resets can 
be caused by hardware bugs, so unfortunately apps might not be able to 
run correctly and will get reset notifications, and even get terminate.

>>
>> GL_CONTEXT_LOST is only returned in some specific commands (that might
>> cause a polling application to block indefinitely), so I don't think
>> it's enough, given that the we can't guarantee that the application will
>> call such commands after a reset, thus not being able to notice a reset.
>>
>> https://registry.khronos.org/OpenGL-Refpages/gl4/html/glGetGraphicsResetStatus.xhtml
> 
> Ok, another API for a similar thing.
> 
> So in that case, the app does not need to use a robustness extension if
> it uses OpenGL 4.5 and bothers to check.
> 
> This makes the wording "If robustness is not in use" problematic,
> because it seems complicated to determine if robusteness is in use in
> any particular application. I suppose Mesa would track if the app ever
> called glGetGraphicsResetStatus() before drawing after reset?
> 
> 

I agree, I think tracking glGetGraphicsResetStatus() should be enough, 
but I will let this part of the documentation as this for now:

If it's _possible to guarantee_ that robustness is not in use [...]

> Thanks,
> pq
> 
>>>    
>>>> +
>>>> +Vulkan
>>>> +~~~~~~
>>>> +
>>>> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
>>>> +This error code means, among other things, that a device reset has happened and
>>>> +it needs to recreate the contexts to keep going.
>>>> +
>>>> +Reporting resets causes
>>>> +-----------------------
>>>> +
>>>> +Apart from propagating the reset through the stack so apps can recover, it's
>>>> +really useful for driver developers to learn more about what caused the reset in
>>>> +first place. DRM devices should make use of devcoredump to store relevant
>>>> +information about the reset, so this information can be added to user bug
>>>> +reports.
>>>> +
>>>>    .. _drm_driver_ioctl:
>>>>    
>>>>    IOCTL Support on Device Nodes
>>>
>>> What about VRAM contents? If userspace holds a dmabuf handle, can a GPU
>>> reset wipe that buffer? How would that be communicated?
>>>    
>>
>> Yes, it can.
>>
>>> The dmabuf may have originated in another process.
>>>    
>>
>> Indeed, I think we might need to add an error code for dmabuf calls so
>> the buffer user knows that it's invalid now because a reset has happened
>> in the other device. I will need to read more dmabuf code to make sure
>> how this would be possible.
>>
>>>
>>> Thanks,
>>> pq
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-06-26 16:19 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-21  0:57 [RFC PATCH v3 0/4] drm: Standardize device reset notification André Almeida
2023-06-21  0:57 ` [RFC PATCH v3 1/4] drm/doc: Document DRM device reset expectations André Almeida
2023-06-21  7:58   ` Pekka Paalanen
2023-06-21 16:28     ` André Almeida
2023-06-22  8:12       ` Pekka Paalanen
2023-06-26 16:15         ` André Almeida
2023-06-21  0:57 ` [RFC PATCH v3 2/4] drm: Create DRM_IOCTL_GET_RESET André Almeida
2023-06-21  8:09   ` Pekka Paalanen
2023-06-21 16:33     ` André Almeida
2023-06-22  8:22       ` Pekka Paalanen
2023-06-22  9:59         ` Christian König
2023-06-21  0:57 ` [RFC PATCH v3 3/4] drm/amdgpu: Implement DRM_IOCTL_GET_RESET André Almeida
2023-06-21  7:40   ` Christian König
2023-06-21 16:38     ` André Almeida
2023-06-22  7:45       ` Christian König
2023-06-21  0:57 ` [RFC PATCH v3 4/4] drm/i915: " André Almeida
2023-06-21  7:38   ` Jani Nikula
2023-06-21  7:42 ` [RFC PATCH v3 0/4] drm: Standardize device reset notification Christian König
2023-06-21 15:06   ` André Almeida
2023-06-21 15:09     ` Christian König

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).