All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michel Dänzer" <michel@daenzer.net>
To: "Lazar, Lijo" <lijo.lazar@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>
Cc: Leo Liu <leo.liu@amd.com>, James Zhu <James.Zhu@amd.com>,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled
Date: Fri, 13 Aug 2021 16:40:03 +0200	[thread overview]
Message-ID: <31c9a190-8329-383e-bbea-3520add4d16a@daenzer.net> (raw)
In-Reply-To: <9ec17598-0b51-014c-c633-2e4e74c863e9@amd.com>

On 2021-08-13 4:14 p.m., Lazar, Lijo wrote:
> On 8/13/2021 7:04 PM, Michel Dänzer wrote:
>> On 2021-08-13 1:50 p.m., Lazar, Lijo wrote:
>>> On 8/13/2021 3:59 PM, Michel Dänzer wrote:
>>>> From: Michel Dänzer <mdaenzer@redhat.com>
>>>>
>>>> schedule_delayed_work does not push back the work if it was already
>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>> was disabled and re-enabled again during those 100 ms.
>>>>
>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>> disabling it again (for getting the GPU clock counter).
>>>>
>>>> To fix this, call cancel_delayed_work_sync when GFXOFF transitions from
>>>> enabled to disabled. This makes sure the delayed work will be scheduled
>>>> as intended in the reverse case.
>>>>
>>>> In order to avoid a deadlock, amdgpu_device_delay_enable_gfx_off needs
>>>> to use mutex_trylock instead of mutex_lock.
>>>>
>>>> v2:
>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>     mod_delayed_work.
>>>>
>>>> Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 ++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 13 +++++++------
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  3 +++
>>>>    3 files changed, 20 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f3fd5ec710b6..8b025f70706c 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2777,7 +2777,16 @@ static void amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>        struct amdgpu_device *adev =
>>>>            container_of(work, struct amdgpu_device, gfx.gfx_off_delay_work.work);
>>>>    -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>> +    /* mutex_lock could deadlock with cancel_delayed_work_sync in amdgpu_gfx_off_ctrl. */
>>>> +    if (!mutex_trylock(&adev->gfx.gfx_off_mutex)) {
>>>> +        /* If there's a bug which causes amdgpu_gfx_off_ctrl to be called with enable=true
>>>> +         * when adev->gfx.gfx_off_req_count is already 0, we might race with that.
>>>> +         * Re-schedule to make sure gfx off will be re-enabled in the HW eventually.
>>>> +         */
>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>> +        return;
>>>
>>> This is not needed and is just creating another thread to contend for mutex.
>>
>> Still not sure what you mean by that. What other thread?
> 
> Sorry, I meant it schedules another workitem and delays GFXOFF enablement further. For ex: if it was another function like gfx_off_status holding the lock at the time of check.
> 
>>
>>> The checks below take care of enabling gfxoff correctly. If it's already in gfx_off state, it doesn't do anything. So I don't see why this change is needed.
>>
>> mutex_trylock is needed to prevent the deadlock discussed before and below.
>>
>> schedule_delayed_work is needed due to this scenario hinted at by the comment:
>>
>> 1. amdgpu_gfx_off_ctrl locks mutex, calls schedule_delayed_work
>> 2. amdgpu_device_delay_enable_gfx_off runs, calls mutex_trylock, which fails
>>
>> GFXOFF would never get re-enabled in HW in this case (until amdgpu_gfx_off_ctrl calls schedule_delayed_work again).
>>
>> (cancel_delayed_work_sync guarantees there's no pending delayed work when it returns, even if amdgpu_device_delay_enable_gfx_off calls schedule_delayed_work)
>>
> 
> I think we need to explain based on the original code before. There is an asssumption here that the only other contention of this mutex is with the gfx_off_ctrl function.

Not really.


> As far as I understand if the work has already started running when schedule_delayed_work is called, it will insert another in the work queue after delay. Based on that understanding I didn't find a problem with the original code.

Original code as in without this patch or the mod_delayed_work patch? If so, the problem is not when the work has already started running. It's that when it hasn't started running yet, schedule_delayed_work doesn't change the timeout for the already scheduled work, so it ends up enabling GFXOFF earlier than intended (and thus at all in scenarios when it's not supposed to).


> [...], there could be cases where it could have gone to gfxoff right after gfx_off_status releases the lock, but it doesn't delaying it further. That would be the case if some other function is also introduced which takes this mutex.

I really don't think we need to worry about amdgpu_get_gfx_off_status, since it's only called from debugfs (and should be very short). If something hits that debugfs file and it causes higher energy consumption, that's a "doctor, it hurts if I do this" kind of problem.

We can worry about future users of the mutex when they show up.


>>>> @@ -569,9 +566,13 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
>>>>            adev->gfx.gfx_off_req_count--;
>>>>          if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>> +        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, AMDGPU_GFX_OFF_DELAY_ENABLE);
>>>> +    } else if (!enable) {
>>>> +        if (adev->gfx.gfx_off_req_count == 1 && !adev->gfx.gfx_off_state)
>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>
>>> This has the deadlock problem as discussed in the other thread.
>>
>> It does not. If amdgpu_device_delay_enable_gfx_off runs while amdgpu_gfx_off_ctrl holds the mutex,
>> mutex_trylock fails and the former bails.
> 
> Ok, but now it creates a case of re-arming the work item from the work.
> 
> TBH I didn't understand the problem on having to use _sync itself and not cancel_delayed_work().
> 
> The edge case you mentioned for a cancel_delayed_work looks like a rare case
> 
> amdgpu_gfx_off_ctrl(disable) gets the lock
> amdgpu_device_delay_enable_gfx_off - waits for the lock
> amdgpu_gfx_off_ctrl(enable) gets the lock again  (this has to be matching call for the previous disable)
> 
> This scenario looks highly improbable as in general we expect some other work that needs to be done done between disable/enable.

At least for the case that started me on this journey (reading the GFX clock counter), that should be very short, just a couple of register reads.

I agree it's highly improbable, I'm trying to make it impossible. :)


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

  reply	other threads:[~2021-08-13 14:40 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-11 16:52 [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Michel Dänzer
2021-08-11 16:52 ` [PATCH 2/2] drm/amdgpu: Use mod_delayed_work in JPEG/UVD/VCE/VCN ring_end_use hooks Michel Dänzer
2021-08-11 20:34   ` Alex Deucher
2021-08-11 21:00     ` Zhu, James
2021-08-11 21:34   ` AW: " Koenig, Christian
2021-08-11 22:12     ` Zhu, James
2021-08-11 22:22       ` Zhu, James
2021-08-12  2:42     ` Quan, Evan
2021-08-12  5:55       ` AW: " Koenig, Christian
2021-08-12  8:11         ` Michel Dänzer
2021-08-12 11:33           ` Lazar, Lijo
2021-08-12 16:54             ` Michel Dänzer
2021-08-13  4:23               ` Lazar, Lijo
2021-08-13 10:31                 ` Michel Dänzer
2021-08-13 11:18                   ` Lazar, Lijo
2021-08-16  7:33           ` Christian König
2021-08-12  2:43 ` [PATCH 1/2] drm/amdgpu: Use mod_delayed_work in amdgpu_gfx_off_ctrl Quan, Evan
2021-08-13 10:29 ` [PATCH] drm/amdgpu: Cancel delayed work when GFXOFF is disabled Michel Dänzer
2021-08-13 11:50   ` Lazar, Lijo
2021-08-13 13:34     ` Michel Dänzer
2021-08-13 14:14       ` Lazar, Lijo
2021-08-13 14:40         ` Michel Dänzer [this message]
2021-08-13 15:07           ` Lazar, Lijo
2021-08-13 16:00             ` Michel Dänzer
2021-08-16  4:13               ` Lazar, Lijo
2021-08-16 10:45                 ` Michel Dänzer
2021-08-16  7:38   ` Christian König
2021-08-16 10:38     ` Michel Dänzer
2021-08-16 10:20   ` Quan, Evan
2021-08-16 10:43     ` Michel Dänzer
2021-08-16 10:35   ` [PATCH v3] " Michel Dänzer
2021-08-16 11:33     ` Lazar, Lijo
2021-08-16 12:06       ` Christian König
2021-08-16 15:06         ` Michel Dänzer
2021-08-16 19:02           ` Alex Deucher
2021-08-17  7:51     ` Quan, Evan
2021-08-17  8:17       ` Lazar, Lijo
2021-08-17  8:35         ` Michel Dänzer
2021-08-17  8:23     ` [PATCH] " Michel Dänzer
2021-08-17  9:12       ` Lazar, Lijo
2021-08-17  9:26         ` Michel Dänzer
2021-08-17  9:37           ` Lazar, Lijo
2021-08-17  9:59             ` Michel Dänzer
2021-08-17 10:37               ` Lazar, Lijo
2021-08-17 11:06                 ` Michel Dänzer
2021-08-17 11:49                   ` Lazar, Lijo
2021-08-17 12:55                     ` Lazar, Lijo
2021-08-17  9:33       ` Quan, Evan
2021-08-18 21:56       ` Alex Deucher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=31c9a190-8329-383e-bbea-3520add4d16a@daenzer.net \
    --to=michel@daenzer.net \
    --cc=James.Zhu@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=leo.liu@amd.com \
    --cc=lijo.lazar@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.