On 1/17/2022 3:49 PM, Christian König wrote:
Am 17.01.22 um 11:09 schrieb Somalapuram, Amaranath:
[AMD Official Use Only]



-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Monday, January 17, 2022 3:33 PM
To: Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Sharma, Shashank <Shashank.Sharma@amd.com>
Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset

Am 17.01.22 um 11:01 schrieb Somalapuram, Amaranath:
[AMD Official Use Only]



-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Monday, January 17, 2022 12:57 PM
To: Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>;
amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Sharma, Shashank
<Shashank.Sharma@amd.com>
Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU
reset

Am 17.01.22 um 07:33 schrieb Somalapuram Amaranath:
AMDGPURESET uevent added to notify userspace, collect dump_stack and
trace

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
---
    drivers/gpu/drm/amd/amdgpu/nv.c | 45 +++++++++++++++++++++++++++++++++
    1 file changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c
b/drivers/gpu/drm/amd/amdgpu/nv.c index 2ec1ffb36b1f..b73147ae41fb
100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev)
        }
    }
    +/**
+ * drm_sysfs_reset_event - generate a DRM uevent
+ * @dev: DRM device
+ *
+ * Send a uevent for the DRM device specified by @dev.  Currently we
+only
+ * set AMDGPURESET=1 in the uevent environment, but this could be
+expanded to
+ * deal with other types of events.
+ *
+ * Any new uapi should be using the
+drm_sysfs_connector_status_event()
+ * for uevents on connector status change.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev)
This should probably go directly into the DRM subsystem.

+{
+    char *event_string = "AMDGPURESET=1";
+    char *envp[2] = { event_string, NULL };
+
+    kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
As I said this approach is a clear NAK. We can't allocate any memory when we do a reset.

Regards,
Christian.

Can I do something like this:

void drm_sysfs_reset_event(struct drm_device *dev)
   {
-       char *event_string = "AMDGPURESET=1";
-       char *envp[2] = { event_string, NULL };
+       char **envp;
+
+       envp = kcalloc(2,sizeof(char *), GFP_ATOMIC);
+       envp[0] = kcalloc(30, sizeof(char), GFP_ATOMIC);
+       envp[1] = kcalloc(30, sizeof(char), GFP_ATOMIC);
No, not really. kobject_uevent_env() will still allocate memory without GFP_ATOMIC.

I think the whole approach of using udev won't work for this.

Regards,
Christian.

I have tested it with sample applications:
Gpu reset:
sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover

And I never missed the AMDGPURESET=1 event in user space,

That's not the problem. Allocating memory when we need to do a reset can cause a *HARD* kernel deadlock.

This is absolutely not something we can do and Daniel even tried to add a few lockdep annotations for this.

So automated testing scripts will complain that this won't work.

Regards,
Christian.
Any suggestion how we can notify user space during this situation.

Regards,

S.Amarnath


even with continues resets with sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover .



Regards,
S.Amarnath
+
+       strcpy(envp[0], "AMDGPURESET=1");
+       strcpy(envp[1], "");
+

          kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
envp);
+
+       kfree(envp[0]);
+       kfree(envp[1]);
+       kfree(envp);
   }

Regards,
S.Amarnath

+}
+
+void amdgpu_reset_dumps(struct amdgpu_device *adev) {
+    struct drm_device *ddev = adev_to_drm(adev);
+    int r = 0, i;
+
+    /* original raven doesn't have full asic reset */
+    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
+        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
+        return;
+    for (i = 0; i < adev->num_ip_blocks; i++) {
+        if (!adev->ip_blocks[i].status.valid)
+            continue;
+        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
+            continue;
+        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
+
+        if (r)
+            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
+                    adev->ip_blocks[i].version->funcs->name, r);
+    }
+
+    drm_sysfs_reset_event(ddev);
+    dump_stack();
+}
+
    static int nv_asic_reset(struct amdgpu_device *adev)
    {
        int ret = 0;
    +    amdgpu_reset_dumps(adev);
        switch (nv_asic_reset_method(adev)) {
        case AMD_RESET_METHOD_PCI:
            dev_info(adev->dev, "PCI reset\n");