dri-devel.lists.freedesktop.org archive mirror
* [PATCH v2 0/2] Fix for two GuC issues
@ 2022-11-02 19:21 John.C.Harrison
  2022-11-02 19:21 ` [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts John.C.Harrison
  2022-11-02 19:21 ` [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset John.C.Harrison
  0 siblings, 2 replies; 9+ messages in thread
From: John.C.Harrison @ 2022-11-02 19:21 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Fix for a deadlock issue between the GuC busyness stats worker and GT
resets. Also fix kernel contexts not getting the correct scheduling
priority at start of day.

v2: Rename existing uses of _trylock rather than adding a _noretry
version. Also improve the comment a bit.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (2):
  drm/i915/guc: Properly initialise kernel contexts
  drm/i915/guc: Don't deadlock busyness stats vs reset

 drivers/gpu/drm/i915/gem/i915_gem_mman.c       |  2 +-
 drivers/gpu/drm/i915/gt/intel_reset.c          | 18 ++++++++++++++++--
 drivers/gpu/drm/i915/gt/intel_reset.h          |  1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  |  7 ++++++-
 4 files changed, 24 insertions(+), 4 deletions(-)

-- 
2.37.3



* [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts
  2022-11-02 19:21 [PATCH v2 0/2] Fix for two GuC issues John.C.Harrison
@ 2022-11-02 19:21 ` John.C.Harrison
  2022-11-04 18:53   ` Ceraolo Spurio, Daniele
  2022-11-05  5:18   ` Lucas De Marchi
  2022-11-02 19:21 ` [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset John.C.Harrison
  1 sibling, 2 replies; 9+ messages in thread
From: John.C.Harrison @ 2022-11-02 19:21 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

If a context has already been registered prior to first submission
then the context init code was not being called. The noticeable effect
of that was that the scheduling priority was left at zero (meaning super
high priority) instead of being set to normal. This would occur with
kernel contexts at start of day as they are manually pinned up front
rather than on first submission. So add a call to initialise those
when they are pinned.
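
As an illustration only (not the driver code), the init-on-pin behaviour
described above can be modelled in plain C. The struct, the function names
and the numeric value used for "normal" priority are hypothetical; the
thread only states that zero means super high priority.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical priority values: only "zero == highest" comes from the thread. */
enum model_prio {
	MODEL_PRIO_HIGHEST = 0,
	MODEL_PRIO_NORMAL = 2,
};

/* Hypothetical stand-in for the parts of a context that matter here. */
struct model_context {
	bool guc_init;	/* models the CONTEXT_GUC_INIT flag bit */
	int prio;	/* zero-initialised, i.e. highest, until init runs */
};

/* Models the one-time context init: set a sane priority, mark as done. */
static void model_context_init(struct model_context *ce)
{
	assert(ce);
	ce->prio = MODEL_PRIO_NORMAL;
	ce->guc_init = true;
}

/* Models the pin path after this patch: init first if not yet done. */
static void model_kernel_context_pin(struct model_context *ce)
{
	if (!ce->guc_init)
		model_context_init(ce);
	/* context registration would follow here */
}
```

Pinning a zero-initialised model context moves its priority from highest
to normal, and a second pin leaves it alone because the flag is already
set.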

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4ccb29f9ac55c..941613be3b9dd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4111,6 +4111,9 @@ static inline void guc_kernel_context_pin(struct intel_guc *guc,
 	if (context_guc_id_invalid(ce))
 		pin_guc_id(guc, ce);
 
+	if (!test_bit(CONTEXT_GUC_INIT, &ce->flags))
+		guc_context_init(ce);
+
 	try_context_registration(ce, true);
 }
 
-- 
2.37.3



* [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-11-02 19:21 [PATCH v2 0/2] Fix for two GuC issues John.C.Harrison
  2022-11-02 19:21 ` [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts John.C.Harrison
@ 2022-11-02 19:21 ` John.C.Harrison
  2022-11-03 11:31   ` [Intel-gfx] " Tvrtko Ursulin
  1 sibling, 1 reply; 9+ messages in thread
From: John.C.Harrison @ 2022-11-02 19:21 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

The engine busyness stats code has a worker function to do things like
extending the 32-bit hardware counters to 64 bits. The GuC's reset
prepare function flushes out this worker function to ensure no
corruption happens during the reset. Unfortunately, the worker function
waits indefinitely for active resets to finish before doing its work. Thus
a deadlock would occur if the worker function had actually started
just as the reset starts.
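
For readers unfamiliar with why a periodic worker is needed at all, here
is a minimal userspace sketch of extending a wrapping 32-bit counter to 64
bits. It is an assumption-laden illustration, not the i915 implementation;
the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software shadow of one 32-bit hardware counter. */
struct counter64 {
	uint32_t last_lo;	/* low 32 bits seen at the previous sample */
	uint32_t hi;		/* number of wraps observed so far */
};

/*
 * Fold a fresh 32-bit sample into the running 64-bit value. This is only
 * correct if the counter is sampled at least once per wrap period, which
 * is the job of a periodic worker.
 */
static uint64_t counter64_update(struct counter64 *c, uint32_t sample)
{
	assert(c);
	if (sample < c->last_lo)	/* value went backwards: it wrapped */
		c->hi++;
	c->last_lo = sample;
	return ((uint64_t)c->hi << 32) | sample;
}
```

If two samples are more than one wrap period apart, the wrap count is
silently lost, so skipping a sample is only safe when the next one still
arrives in time.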

The function being used to lock the reset-in-progress mutex is called
intel_gt_reset_trylock(). However, as noted, it does not follow the
standard 'trylock' convention of exiting if already locked. So rename
the current _trylock function to intel_gt_reset_lock_interruptible(),
which is the behaviour it actually provides. In addition, add a new
implementation of _trylock and call that from the busyness stats
worker instead.

v2: Rename existing trylock to interruptible rather than trying to
preserve the existing (confusing) naming scheme (review comments from
Tvrtko).
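
The difference between the two entry points can be sketched in userspace
C as below. This is a simplified single-threaded model, assuming a plain
flag in place of I915_RESET_BACKOFF and a countdown in place of
wait_event_interruptible(); all names are hypothetical and SRCU is not
modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reset state of a GT. */
struct reset_model {
	bool backoff;		/* true while a reset is in progress */
	int waits_until_done;	/* wait steps before the reset completes */
};

/*
 * Models the shared helper: while a reset is in progress, either give up
 * immediately (retry == false) or keep waiting for it to finish.
 */
static int model_reset_lock(struct reset_model *r, bool retry)
{
	assert(r);
	while (r->backoff) {
		if (!retry)
			return -EBUSY;
		/* Stand-in for one sleep on the reset wait queue. */
		if (--r->waits_until_done <= 0)
			r->backoff = false;
	}
	return 0;
}

/* True trylock: never waits, reports -EBUSY during a reset. */
static int model_reset_trylock(struct reset_model *r)
{
	return model_reset_lock(r, false);
}

/* Waiting variant: returns only once the reset has finished. */
static int model_reset_lock_interruptible(struct reset_model *r)
{
	return model_reset_lock(r, true);
}
```

A caller that is itself flushed by the reset path has to use the trylock
form: it returns -EBUSY instead of waiting on something that is in turn
waiting for the caller.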

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_mman.c       |  2 +-
 drivers/gpu/drm/i915/gt/intel_reset.c          | 18 ++++++++++++++++--
 drivers/gpu/drm/i915/gt/intel_reset.h          |  1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c  |  4 +++-
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index e63329bc80659..c29efdef8313a 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -330,7 +330,7 @@ static vm_fault_t vm_fault_gtt(struct vm_fault *vmf)
 	if (ret)
 		goto err_rpm;
 
-	ret = intel_gt_reset_trylock(ggtt->vm.gt, &srcu);
+	ret = intel_gt_reset_lock_interruptible(ggtt->vm.gt, &srcu);
 	if (ret)
 		goto err_pages;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 3159df6cdd492..24736ebee17c2 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1407,15 +1407,19 @@ void intel_gt_handle_error(struct intel_gt *gt,
 	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
 }
 
-int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
+static int _intel_gt_reset_lock(struct intel_gt *gt, int *srcu, bool retry)
 {
 	might_lock(&gt->reset.backoff_srcu);
-	might_sleep();
+	if (retry)
+		might_sleep();
 
 	rcu_read_lock();
 	while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
 		rcu_read_unlock();
 
+		if (!retry)
+			return -EBUSY;
+
 		if (wait_event_interruptible(gt->reset.queue,
 					     !test_bit(I915_RESET_BACKOFF,
 						       &gt->reset.flags)))
@@ -1429,6 +1433,16 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
 	return 0;
 }
 
+int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
+{
+	return _intel_gt_reset_lock(gt, srcu, false);
+}
+
+int intel_gt_reset_lock_interruptible(struct intel_gt *gt, int *srcu)
+{
+	return _intel_gt_reset_lock(gt, srcu, true);
+}
+
 void intel_gt_reset_unlock(struct intel_gt *gt, int tag)
 __releases(&gt->reset.backoff_srcu)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
index adc734e673870..25c975b6e8fc0 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -39,6 +39,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine,
 void __i915_request_reset(struct i915_request *rq, bool guilty);
 
 int __must_check intel_gt_reset_trylock(struct intel_gt *gt, int *srcu);
+int __must_check intel_gt_reset_lock_interruptible(struct intel_gt *gt, int *srcu);
 void intel_gt_reset_unlock(struct intel_gt *gt, int tag);
 
 void intel_gt_set_wedged(struct intel_gt *gt);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 941613be3b9dd..92e514061d20b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1401,7 +1401,9 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 
 	/*
 	 * Synchronize with gt reset to make sure the worker does not
-	 * corrupt the engine/guc stats.
+	 * corrupt the engine/guc stats. NB: can't actually block waiting
+	 * for a reset to complete as the reset requires flushing out
+	 * this worker thread if started. So waiting would deadlock.
 	 */
 	ret = intel_gt_reset_trylock(gt, &srcu);
 	if (ret)
-- 
2.37.3



* Re: [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-11-02 19:21 ` [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset John.C.Harrison
@ 2022-11-03 11:31   ` Tvrtko Ursulin
  2022-11-03 18:45     ` John Harrison
  0 siblings, 1 reply; 9+ messages in thread
From: Tvrtko Ursulin @ 2022-11-03 11:31 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 02/11/2022 19:21, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> The engine busyness stats code has a worker function to do things like
> extending the 32-bit hardware counters to 64 bits. The GuC's reset
> prepare function flushes out this worker function to ensure no
> corruption happens during the reset. Unfortunately, the worker function
> waits indefinitely for active resets to finish before doing its work. Thus
> a deadlock would occur if the worker function had actually started
> just as the reset starts.
> 
> The function being used to lock the reset-in-progress mutex is called
> intel_gt_reset_trylock(). However, as noted, it does not follow the
> standard 'trylock' convention of exiting if already locked. So rename
> the current _trylock function to intel_gt_reset_lock_interruptible(),
> which is the behaviour it actually provides. In addition, add a new
> implementation of _trylock and call that from the busyness stats
> worker instead.
> 
> v2: Rename existing trylock to interruptible rather than trying to
> preserve the existing (confusing) naming scheme (review comments from
> Tvrtko).
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gem/i915_gem_mman.c       |  2 +-
>   drivers/gpu/drm/i915/gt/intel_reset.c          | 18 ++++++++++++++++--
>   drivers/gpu/drm/i915/gt/intel_reset.h          |  1 +
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c  |  4 +++-
>   4 files changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
> index e63329bc80659..c29efdef8313a 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
> @@ -330,7 +330,7 @@ static vm_fault_t vm_fault_gtt(struct vm_fault *vmf)
>   	if (ret)
>   		goto err_rpm;
>   
> -	ret = intel_gt_reset_trylock(ggtt->vm.gt, &srcu);
> +	ret = intel_gt_reset_lock_interruptible(ggtt->vm.gt, &srcu);
>   	if (ret)
>   		goto err_pages;
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 3159df6cdd492..24736ebee17c2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1407,15 +1407,19 @@ void intel_gt_handle_error(struct intel_gt *gt,
>   	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
>   }
>   
> -int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
> +static int _intel_gt_reset_lock(struct intel_gt *gt, int *srcu, bool retry)
>   {
>   	might_lock(&gt->reset.backoff_srcu);
> -	might_sleep();
> +	if (retry)
> +		might_sleep();
>   
>   	rcu_read_lock();
>   	while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>   		rcu_read_unlock();
>   
> +		if (!retry)
> +			return -EBUSY;
> +
>   		if (wait_event_interruptible(gt->reset.queue,
>   					     !test_bit(I915_RESET_BACKOFF,
>   						       &gt->reset.flags)))
> @@ -1429,6 +1433,16 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>   	return 0;
>   }
>   
> +int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
> +{
> +	return _intel_gt_reset_lock(gt, srcu, false);
> +}
> +
> +int intel_gt_reset_lock_interruptible(struct intel_gt *gt, int *srcu)
> +{
> +	return _intel_gt_reset_lock(gt, srcu, true);
> +}
> +
>   void intel_gt_reset_unlock(struct intel_gt *gt, int tag)
>   __releases(&gt->reset.backoff_srcu)
>   {
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
> index adc734e673870..25c975b6e8fc0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.h
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.h
> @@ -39,6 +39,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine,
>   void __i915_request_reset(struct i915_request *rq, bool guilty);
>   
>   int __must_check intel_gt_reset_trylock(struct intel_gt *gt, int *srcu);
> +int __must_check intel_gt_reset_lock_interruptible(struct intel_gt *gt, int *srcu);
>   void intel_gt_reset_unlock(struct intel_gt *gt, int tag);
>   
>   void intel_gt_set_wedged(struct intel_gt *gt);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 941613be3b9dd..92e514061d20b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1401,7 +1401,9 @@ static void guc_timestamp_ping(struct work_struct *wrk)
>   
>   	/*
>   	 * Synchronize with gt reset to make sure the worker does not
> -	 * corrupt the engine/guc stats.
> +	 * corrupt the engine/guc stats. NB: can't actually block waiting
> +	 * for a reset to complete as the reset requires flushing out
> +	 * this worker thread if started. So waiting would deadlock.
>   	 */
>   	ret = intel_gt_reset_trylock(gt, &srcu);
>   	if (ret)

LGTM, but I don't fully remember how the ping worker and reset interact, so
I'll let Umesh r-b. E.g. is it okay to skip the ping, or would we need to
re-schedule it ASAP due to wrap issues? Maybe reset makes that pointless, I
don't remember.

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-11-03 11:31   ` [Intel-gfx] " Tvrtko Ursulin
@ 2022-11-03 18:45     ` John Harrison
  2022-11-03 18:54       ` Umesh Nerlige Ramappa
  0 siblings, 1 reply; 9+ messages in thread
From: John Harrison @ 2022-11-03 18:45 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX, Umesh Nerlige Ramappa; +Cc: DRI-Devel

On 11/3/2022 04:31, Tvrtko Ursulin wrote:
>
> LGTM but I don't remember fully how ping worker and reset interact so 
> I'll let Umesh r-b. Like is it okay to skip the ping or we'd need to 
> re-schedule it ASAP due to wrap issues? Maybe reset makes that pointless, 
> I don't remember.
The reset is cancelling the worker anyway, and it will then be
rescheduled once the reset is done. The ping time is defined as 1/8th
of the wrap time (the wrap time being approx 223 seconds on current
platforms). So as long as the reset doesn't take longer than about
200s, there is no
issue. And if the reset did take longer than that then we have bigger 
issues than the busyness stats (which can't actually be counting anyway 
because nothing is running if the GT is in reset) being slightly off.
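
To make the arithmetic concrete, here is a rough sketch. It reads the 223
seconds as the wrap time of a 32-bit counter, and assumes a 19.2 MHz
timestamp clock purely for illustration (that frequency is not stated in
this thread). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed timestamp clock, for illustration only. */
#define ASSUMED_CLOCK_HZ UINT64_C(19200000)

/* Whole seconds until a free-running 32-bit counter wraps. */
static uint64_t wrap_seconds(uint64_t clock_hz)
{
	assert(clock_hz > 0);
	return (UINT64_C(1) << 32) / clock_hz;
}

/* Ping interval, taken as 1/8th of the wrap time. */
static uint64_t ping_seconds(uint64_t clock_hz)
{
	return wrap_seconds(clock_hz) / 8;
}

/*
 * Worst-case slack: if the last sample was taken a full ping interval
 * before the reset began, this is how long the reset may last before a
 * wrap could be missed.
 */
static uint64_t reset_slack_seconds(uint64_t clock_hz)
{
	return wrap_seconds(clock_hz) - ping_seconds(clock_hz);
}
```

With the assumed clock this gives a wrap time of 223 seconds, a ping
interval of 27 seconds and a slack of 196 seconds, which lines up with
the "about 200s" figure.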

John.

>
> Regards,
>
> Tvrtko



* Re: [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-11-03 18:45     ` John Harrison
@ 2022-11-03 18:54       ` Umesh Nerlige Ramappa
  0 siblings, 0 replies; 9+ messages in thread
From: Umesh Nerlige Ramappa @ 2022-11-03 18:54 UTC (permalink / raw)
  To: John Harrison; +Cc: Tvrtko Ursulin, Intel-GFX, DRI-Devel

On Thu, Nov 03, 2022 at 11:45:57AM -0700, John Harrison wrote:
>On 11/3/2022 04:31, Tvrtko Ursulin wrote:
>>
>>LGTM but I don't remember fully how ping worker and reset interact 
>>so I'll let Umesh r-b. Like is it okay to skip the ping or we'd need 
>>to re-schedule it ASAP due to wrap issues? Maybe reset makes that 
>>pointless, I don't remember.
>The reset is cancelling the worker anyway. And it will then be 
>rescheduled once the reset is done. And the ping time is defined as 
>1/8th the wrap time (being approx 223 seconds on current platforms). 
>So as long as the reset doesn't take longer than about 200s, there is 
>no issue. And if the reset did take longer than that then we have 
>bigger issues than the busyness stats (which can't actually be 
>counting anyway because nothing is running if the GT is in reset) 
>being slightly off.

In addition to canceling the ping worker, __reset_guc_busyness_stats
performs the same activities that the ping worker would do if it were
to run, so it should be safe to skip the worker when a reset is in
progress. So LGTM,

Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Thanks,
Umesh

>
>John.
>
>>
>>Regards,
>>
>>Tvrtko
>


* Re: [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts
  2022-11-02 19:21 ` [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts John.C.Harrison
@ 2022-11-04 18:53   ` Ceraolo Spurio, Daniele
  2022-11-04 18:58     ` John Harrison
  2022-11-05  5:18   ` Lucas De Marchi
  1 sibling, 1 reply; 9+ messages in thread
From: Ceraolo Spurio, Daniele @ 2022-11-04 18:53 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 11/2/2022 12:21 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> If a context has already been registered prior to first submission
> then context init code was not being called. The noticeable effect of
> that was the scheduling priority was left at zero (meaning super high
> priority) instead of being set to normal. This would occur with
> kernel contexts at start of day as they are manually pinned up front
> rather than on first submission. So add a call to initialise those
> when they are pinned.

Does this need a Fixes tag? On one hand, we were leaving the priority at
the wrong value, but on the other there were no actual consequences.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 4ccb29f9ac55c..941613be3b9dd 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4111,6 +4111,9 @@ static inline void guc_kernel_context_pin(struct intel_guc *guc,
>   	if (context_guc_id_invalid(ce))
>   		pin_guc_id(guc, ce);
>   
> +	if (!test_bit(CONTEXT_GUC_INIT, &ce->flags))
> +		guc_context_init(ce);
> +
>   	try_context_registration(ce, true);
>   }
>   



* Re: [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts
  2022-11-04 18:53   ` Ceraolo Spurio, Daniele
@ 2022-11-04 18:58     ` John Harrison
  0 siblings, 0 replies; 9+ messages in thread
From: John Harrison @ 2022-11-04 18:58 UTC (permalink / raw)
  To: Ceraolo Spurio, Daniele, Intel-GFX; +Cc: DRI-Devel

On 11/4/2022 11:53, Ceraolo Spurio, Daniele wrote:
> On 11/2/2022 12:21 PM, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> If a context has already been registered prior to first submission
>> then context init code was not being called. The noticeable effect of
>> that was the scheduling priority was left at zero (meaning super high
>> priority) instead of being set to normal. This would occur with
>> kernel contexts at start of day as they are manually pinned up front
>> rather than on first submission. So add a call to initialise those
>> when they are pinned.
>
> Does this need a Fixes tag? On one hand, we were leaving the priority at
> the wrong value, but on the other there were no actual consequences.
>
I think that's the point. There was no actual issue, it's just a
theoretical problem. So there is nothing to be gained by pushing this as
a fix. It seems like it would be a lot of unnecessary effort to push
it all the way back to 5.17.

John.


> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>
> Daniele
>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 4ccb29f9ac55c..941613be3b9dd 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4111,6 +4111,9 @@ static inline void 
>> guc_kernel_context_pin(struct intel_guc *guc,
>>       if (context_guc_id_invalid(ce))
>>           pin_guc_id(guc, ce);
>>   +    if (!test_bit(CONTEXT_GUC_INIT, &ce->flags))
>> +        guc_context_init(ce);
>> +
>>       try_context_registration(ce, true);
>>   }
>



* Re: [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts
  2022-11-02 19:21 ` [PATCH v2 1/2] drm/i915/guc: Properly initialise kernel contexts John.C.Harrison
  2022-11-04 18:53   ` Ceraolo Spurio, Daniele
@ 2022-11-05  5:18   ` Lucas De Marchi
  1 sibling, 0 replies; 9+ messages in thread
From: Lucas De Marchi @ 2022-11-05  5:18 UTC (permalink / raw)
  To: John.C.Harrison; +Cc: Intel-GFX, DRI-Devel

On Wed, Nov 02, 2022 at 12:21:08PM -0700, John.C.Harrison@Intel.com wrote:
>From: John Harrison <John.C.Harrison@Intel.com>
>
>If a context has already been registered prior to first submission
>then context init code was not being called. The noticeable effect of
>that was the scheduling priority was left at zero (meaning super high
>priority) instead of being set to normal. This would occur with
>kernel contexts at start of day as they are manually pinned up front
>rather than on first submission. So add a call to initialise those
>when they are pinned.
>
>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>

Lucas De Marchi <lucas.demarchi@intel.com>

>---
> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>index 4ccb29f9ac55c..941613be3b9dd 100644
>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>@@ -4111,6 +4111,9 @@ static inline void guc_kernel_context_pin(struct intel_guc *guc,
> 	if (context_guc_id_invalid(ce))
> 		pin_guc_id(guc, ce);
>
>+	if (!test_bit(CONTEXT_GUC_INIT, &ce->flags))
>+		guc_context_init(ce);
>+
> 	try_context_registration(ce, true);
> }
>
>-- 
>2.37.3
>

