On 1/17/2022 3:49 PM, Christian König wrote: > Am 17.01.22 um 11:09 schrieb Somalapuram, Amaranath: >> [AMD Official Use Only] >> >> >> >> -----Original Message----- >> From: Koenig, Christian >> Sent: Monday, January 17, 2022 3:33 PM >> To: Somalapuram, Amaranath ; >> amd-gfx@lists.freedesktop.org >> Cc: Deucher, Alexander ; Sharma, Shashank >> >> Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD >> GPU reset >> >> Am 17.01.22 um 11:01 schrieb Somalapuram, Amaranath: >>> [AMD Official Use Only] >>> >>> >>> >>> -----Original Message----- >>> From: Koenig, Christian >>> Sent: Monday, January 17, 2022 12:57 PM >>> To: Somalapuram, Amaranath ; >>> amd-gfx@lists.freedesktop.org >>> Cc: Deucher, Alexander ; Sharma, Shashank >>> >>> Subject: Re: [PATCH 2/2] drm/amdgpu: add AMDGPURESET uevent on AMD GPU >>> reset >>> >>> Am 17.01.22 um 07:33 schrieb Somalapuram Amaranath: >>>> AMDGPURESET uevent added to notify userspace, collect dump_stack and >>>> trace >>>> >>>> Signed-off-by: Somalapuram Amaranath >>>> --- >>>>     drivers/gpu/drm/amd/amdgpu/nv.c | 45 >>>> +++++++++++++++++++++++++++++++++ >>>>     1 file changed, 45 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c >>>> b/drivers/gpu/drm/amd/amdgpu/nv.c index 2ec1ffb36b1f..b73147ae41fb >>>> 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c >>>> @@ -529,10 +529,55 @@ nv_asic_reset_method(struct amdgpu_device *adev) >>>>         } >>>>     } >>>>     +/** >>>> + * drm_sysfs_reset_event - generate a DRM uevent >>>> + * @dev: DRM device >>>> + * >>>> + * Send a uevent for the DRM device specified by @dev. Currently we >>>> +only >>>> + * set AMDGPURESET=1 in the uevent environment, but this could be >>>> +expanded to >>>> + * deal with other types of events. >>>> + * >>>> + * Any new uapi should be using the >>>> +drm_sysfs_connector_status_event() >>>> + * for uevents on connector status change. >>>> + */ >>>> +void drm_sysfs_reset_event(struct drm_device *dev) >>> This should probably go directly into the DRM subsystem. >>> >>>> +{ >>>> +    char *event_string = "AMDGPURESET=1"; >>>> +    char *envp[2] = { event_string, NULL }; >>>> + >>>> + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp); >>> As I said this approach is a clear NAK. We can't allocate any memory >>> when we do a reset. >>> >>> Regards, >>> Christian. >>> >>> Can I do something like this: >>> >>> void drm_sysfs_reset_event(struct drm_device *dev) >>>    { >>> -       char *event_string = "AMDGPURESET=1"; >>> -       char *envp[2] = { event_string, NULL }; >>> +       char **envp; >>> + >>> +       envp = kcalloc(2,sizeof(char *), GFP_ATOMIC); >>> +       envp[0] = kcalloc(30, sizeof(char), GFP_ATOMIC); >>> +       envp[1] = kcalloc(30, sizeof(char), GFP_ATOMIC); >> No, not really. kobject_uevent_env() will still allocate memory >> without GFP_ATOMIC. >> >> I think the whole approach of using udev won't work for this. >> >> Regards, >> Christian. >> >> I have tested it with sample applications: >> Gpu reset: >> sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover >> >> And I never missed the AMDGPURESET=1 event in user space, > > That's not the problem. Allocating memory when we need to do a reset > can cause a *HARD* kernel deadlock. > > This is absolutely not something we can do and Daniel even tried to > add a few lockdep annotations for this. > > So automated testing scripts will complain that this won't work. > > Regards, > Christian. Any suggestion how we can notify user space during this situation. Regards, S.Amarnath > >> even with continues resets with sudo cat >> /sys/kernel/debug/dri/0/amdgpu_gpu_recover . > > >> >> Regards, >> S.Amarnath >>> + >>> +       strcpy(envp[0], "AMDGPURESET=1"); >>> +       strcpy(envp[1], ""); >>> + >>> >>> kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, >>> envp); >>> + >>> +       kfree(envp[0]); >>> +       kfree(envp[1]); >>> +       kfree(envp); >>>    } >>> >>> Regards, >>> S.Amarnath >>> >>>> +} >>>> + >>>> +void amdgpu_reset_dumps(struct amdgpu_device *adev) { >>>> +    struct drm_device *ddev = adev_to_drm(adev); >>>> +    int r = 0, i; >>>> + >>>> +    /* original raven doesn't have full asic reset */ >>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) && >>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2)) >>>> +        return; >>>> +    for (i = 0; i < adev->num_ip_blocks; i++) { >>>> +        if (!adev->ip_blocks[i].status.valid) >>>> +            continue; >>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps) >>>> +            continue; >>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev); >>>> + >>>> +        if (r) >>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n", >>>> + adev->ip_blocks[i].version->funcs->name, r); >>>> +    } >>>> + >>>> +    drm_sysfs_reset_event(ddev); >>>> +    dump_stack(); >>>> +} >>>> + >>>>     static int nv_asic_reset(struct amdgpu_device *adev) >>>>     { >>>>         int ret = 0; >>>>     +    amdgpu_reset_dumps(adev); >>>>         switch (nv_asic_reset_method(adev)) { >>>>         case AMD_RESET_METHOD_PCI: >>>>             dev_info(adev->dev, "PCI reset\n"); >