From: Matthew Auld <matthew.auld@intel.com>
To: Rodrigo Vivi <rodrigo.vivi@kernel.org>
Cc: intel-xe@lists.freedesktop.org, Rodrigo Vivi <rodrigo.vivi@intel.com>
Subject: Re: [Intel-xe] [PATCH 2/3] drm/xe: fix xe_device_mem_access_get() race
Date: Thu, 11 May 2023 12:43:36 +0100	[thread overview]
Message-ID: <0a7f7d48-4423-1187-63d2-dfa6b85beba9@intel.com> (raw)
In-Reply-To: <ZFUyhLPt8MmBi+A8@rdvivi-mobl4>

On 05/05/2023 17:44, Rodrigo Vivi wrote:
> On Fri, May 05, 2023 at 04:38:53PM +0100, Matthew Auld wrote:
>> It looks like there is at least one race here, given that the
>> pm_runtime_suspended() check looks to return false if we are in the
>> process of suspending the device (RPM_SUSPENDING vs RPM_SUSPENDED). We
>> later also do xe_pm_runtime_get_if_active(), but since the device is
>> suspending or has now suspended, this doesn't do anything either.
>> Following from this we can potentially return from
>> xe_device_mem_access_get() with the device suspended or about to be,
>> leading to broken behaviour.
>>
>> Attempt to fix this by always grabbing the runtime ref when our internal
>> ref transitions from 0 -> 1, and then wrap the whole thing with a lock
>> to ensure callers are serialized.
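
The idea above can be sketched in a small userspace analogy (this is not the driver code; `rpm_resume_and_get()`/`rpm_put()` and the `xe_device_sim` struct are stand-ins for the real `pm_runtime_*` calls and `struct xe_device`): the runtime-PM reference is taken exactly on the 0 -> 1 transition, and the lock is held across the whole check-and-resume sequence so a concurrent suspend can't slip in between the check and the get.

```c
#include <pthread.h>
#include <stdbool.h>

/*
 * Userspace sketch of the serialization idea in this patch.
 * The lock covers both the refcount check and the resume, so the
 * device cannot start suspending between the two.
 */
struct xe_device_sim {
	pthread_mutex_t lock;
	int ref;
	bool hold_rpm;
	int rpm_usage;	/* stands in for the runtime-PM usage counter */
};

static bool rpm_resume_and_get(struct xe_device_sim *xe)
{
	xe->rpm_usage++;	/* real code: pm_runtime_resume_and_get() */
	return true;
}

static void rpm_put(struct xe_device_sim *xe)
{
	xe->rpm_usage--;	/* real code: pm_runtime_put_autosuspend() */
}

static void mem_access_get(struct xe_device_sim *xe)
{
	pthread_mutex_lock(&xe->lock);
	/* Grab the runtime ref only on the 0 -> 1 transition. */
	if (xe->ref == 0)
		xe->hold_rpm = rpm_resume_and_get(xe);
	xe->ref++;
	pthread_mutex_unlock(&xe->lock);
}

static void mem_access_put(struct xe_device_sim *xe)
{
	pthread_mutex_lock(&xe->lock);
	/* Drop the runtime ref only on the 1 -> 0 transition. */
	if (--xe->ref == 0 && xe->hold_rpm)
		rpm_put(xe);
	pthread_mutex_unlock(&xe->lock);
}
```

Nested gets then only ever hold a single runtime-PM reference, and the last put releases it.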
>>
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/258
>> Signed-off-by: Matthew Auld <matthew.auld@intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c       | 11 +++--------
>>   drivers/gpu/drm/xe/xe_device_types.h |  5 ++++-
>>   drivers/gpu/drm/xe/xe_pm.c           |  9 ++-------
>>   drivers/gpu/drm/xe/xe_pm.h           |  2 +-
>>   4 files changed, 10 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 01c497bcf9a5..0a18b41a0e1a 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -406,17 +406,12 @@ u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size)
>>   
>>   void xe_device_mem_access_get(struct xe_device *xe)
>>   {
>> -	bool resumed = xe_pm_runtime_resume_if_suspended(xe);
>> -
>>   	mutex_lock(&xe->mem_access.lock);
>> -	if (xe->mem_access.ref++ == 0)
>> -		xe->mem_access.hold_rpm = xe_pm_runtime_get_if_active(xe);
>> +	if (xe->mem_access.ref == 0)
>> +		xe->mem_access.hold_rpm = xe_pm_runtime_resume_and_get(xe);
>> +	xe->mem_access.ref++;
> 
> my memory is good enough to recall that I have tried this before...
> but not good enough to remember the issues I hit with this
> approach :(

Ok, it seems one big issue is around xe_pm_runtime_resume() et al. The 
lock fixes all the races, but AFAICT xe_pm_runtime_resume() calls 
xe_device_mem_access_{get,put}() in loads of places, and that is of 
course going to deadlock if we introduce a lock, since we are inside 
the callback. But I think even without that lock it will still 
deadlock, since the runtime_pm code will see that we are RPM_RESUMING 
and wait for itself. I'm guessing that explains why we had the 
conditional pm_runtime_suspended() and if_active() checks, since those 
prevent triggering the runtime_pm from within our own callbacks (we 
will either be RPM_SUSPENDING or RPM_RESUMING), but then we are of 
course left with all the nasty races. Any ideas? It seems like the 
resume/suspend callbacks should fundamentally never be calling 
xe_device_mem_access_{get,put}()?

> 
>>   	mutex_unlock(&xe->mem_access.lock);
>>   
>> -	/* The usage counter increased if device was immediately resumed */
>> -	if (resumed)
>> -		xe_pm_runtime_put(xe);
>> -
>>   	XE_WARN_ON(xe->mem_access.ref == S32_MAX);
>>   }
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index 59462933f67a..9e37189d5745 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -256,7 +256,10 @@ struct xe_device {
>>   	 * triggering additional actions when they occur.
>>   	 */
>>   	struct {
>> -		/** @lock: protect the ref count */
>> +		/**
>> +		 * @lock: Serialize xe_device_mem_access users,
>> +		 * and protect the below internal state, like @ref.
>> +		 */
>>   		struct mutex lock;
>>   		/** @ref: ref count of memory accesses */
>>   		s32 ref;
>> diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
>> index b7b57f10ba25..b2ffa001e6f7 100644
>> --- a/drivers/gpu/drm/xe/xe_pm.c
>> +++ b/drivers/gpu/drm/xe/xe_pm.c
>> @@ -210,14 +210,9 @@ int xe_pm_runtime_put(struct xe_device *xe)
>>   	return pm_runtime_put_autosuspend(xe->drm.dev);
>>   }
>>   
>> -/* Return true if resume operation happened and usage count was increased */
>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe)
>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe)
>>   {
>> -	/* In case we are suspended we need to immediately wake up */
>> -	if (pm_runtime_suspended(xe->drm.dev))
>> -		return !pm_runtime_resume_and_get(xe->drm.dev);
>> -
>> -	return false;
>> +	return !pm_runtime_resume_and_get(xe->drm.dev);
> 
> now with the similar name it feels a bit odd that we are not aligned
> with their return convention, although I prefer ours...
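
The mismatch being discussed is that the core pm_runtime_resume_and_get() returns 0 on success or a negative errno, while the xe wrapper inverts that into a success bool. A tiny sketch of just that inversion (fake_pm_runtime_resume_and_get() is a stand-in for the real call, with the error injected as a parameter):

```c
#include <stdbool.h>

/*
 * pm_runtime_resume_and_get() convention: 0 on success, negative
 * errno on failure. The xe wrapper flips this into a bool where
 * true means "resumed and reference held".
 */
static int fake_pm_runtime_resume_and_get(int simulated_err)
{
	return simulated_err;	/* 0 on success, -errno on failure */
}

static bool xe_pm_runtime_resume_and_get_sim(int simulated_err)
{
	/* The '!' turns the errno convention into a success bool. */
	return !fake_pm_runtime_resume_and_get(simulated_err);
}
```

So a caller checks `if (xe_pm_runtime_resume_and_get(...))` rather than `if (!pm_runtime_resume_and_get(...))`, which is the naming/return asymmetry noted above.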
> 
> Anyway, the code is right... if you are testing and it is working well
> let's move forward with this.
> 
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> (for the series)
> 
> but let's get an Ack from Maarten since he was kept as author on patch 3
> and it is modified from his merged one.
> 
>>   }
>>   
>>   int xe_pm_runtime_get_if_active(struct xe_device *xe)
>> diff --git a/drivers/gpu/drm/xe/xe_pm.h b/drivers/gpu/drm/xe/xe_pm.h
>> index 6a885585f653..1b4c15b5e71a 100644
>> --- a/drivers/gpu/drm/xe/xe_pm.h
>> +++ b/drivers/gpu/drm/xe/xe_pm.h
>> @@ -19,7 +19,7 @@ int xe_pm_runtime_suspend(struct xe_device *xe);
>>   int xe_pm_runtime_resume(struct xe_device *xe);
>>   int xe_pm_runtime_get(struct xe_device *xe);
>>   int xe_pm_runtime_put(struct xe_device *xe);
>> -bool xe_pm_runtime_resume_if_suspended(struct xe_device *xe);
>> +bool xe_pm_runtime_resume_and_get(struct xe_device *xe);
>>   int xe_pm_runtime_get_if_active(struct xe_device *xe);
>>   
>>   #endif
>> -- 
>> 2.40.0
>>

Thread overview: 15+ messages
2023-05-05 15:38 [Intel-xe] [PATCH 1/3] Revert "drm/xe: Use atomic instead of mutex for xe_device_mem_access_ongoing" Matthew Auld
2023-05-05 15:38 ` [Intel-xe] [PATCH 2/3] drm/xe: fix xe_device_mem_access_get() race Matthew Auld
2023-05-05 16:44   ` Rodrigo Vivi
2023-05-11 11:43     ` Matthew Auld [this message]
2023-05-12 14:32       ` Rodrigo Vivi
2023-05-12 15:56         ` Matthew Auld
2023-05-15  8:44           ` Thomas Hellström
2023-05-15  9:04             ` Matthew Auld
2023-05-15  9:22               ` Thomas Hellström
2023-05-15 10:15                 ` Matthew Auld
2023-05-05 15:38 ` [Intel-xe] [PATCH 3/3] drm/xe: Use atomic for mem_access.ref Matthew Auld
2023-05-05 15:56 ` [Intel-xe] ✓ CI.Patch_applied: success for series starting with [1/3] Revert "drm/xe: Use atomic instead of mutex for xe_device_mem_access_ongoing" Patchwork
2023-05-05 15:57 ` [Intel-xe] ✓ CI.KUnit: " Patchwork
2023-05-05 16:01 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-05-05 16:19 ` [Intel-xe] ○ CI.BAT: info " Patchwork
