dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/i915/gt: Restart the heartbeat timer when forcing a pulse
@ 2024-01-10 21:02 John.C.Harrison
  2024-01-31 18:48 ` Janusz Krzysztofik
  2024-02-13  0:29 ` Andi Shyti
  0 siblings, 2 replies; 4+ messages in thread
From: John.C.Harrison @ 2024-01-10 21:02 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

The context persistence code does things like send super high priority
heartbeat pulses to ensure any leaked context can still be pre-empted
and thus isn't a total denial of service but only a minor denial of
service. Unfortunately, it wasn't bothering to restart the heatbeat
worker with a fresh timeout. Thus, if a persistent context happened to
be closed just before the heartbeat was going to go ping anyway then
the forced pulse would get a negligble execution time. And as the
forced pulse is super high priority, the worker thread's next step is
a reset. Which means a potentially innocent system randomly goes boom
when attempting to close a context. So, force a re-schedule of the
worker thread with the appropriate timeout.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 1a8e2b7db0138..4ae2fa0b61dd4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine)
 	heartbeat_commit(rq, &attr);
 	GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER);
 
+	/* Ensure the forced pulse gets a full period to execute */
+	next_heartbeat(engine);
+
 	return 0;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/i915/gt: Restart the heartbeat timer when forcing a pulse
  2024-01-10 21:02 [PATCH] drm/i915/gt: Restart the heartbeat timer when forcing a pulse John.C.Harrison
@ 2024-01-31 18:48 ` Janusz Krzysztofik
  2024-01-31 22:07   ` John Harrison
  2024-02-13  0:29 ` Andi Shyti
  1 sibling, 1 reply; 4+ messages in thread
From: Janusz Krzysztofik @ 2024-01-31 18:48 UTC (permalink / raw)
  To: Intel-GFX, intel-gfx; +Cc: John.C.Harrison, DRI-Devel

Hi John,

On Wednesday, 10 January 2024 22:02:16 CET John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> The context persistence code does things like send super high priority
> heartbeat pulses to ensure any leaked context can still be pre-empted
> and thus isn't a total denial of service but only a minor denial of
> service. Unfortunately, it wasn't bothering to restart the heatbeat
> worker with a fresh timeout. Thus, if a persistent context happened to
> be closed just before the heartbeat was going to go ping anyway then
> the forced pulse would get a negligble execution time. And as the
> forced pulse is super high priority, the worker thread's next step is
> a reset. Which means a potentially innocent system randomly goes boom
> when attempting to close a context. So, force a re-schedule of the
> worker thread with the appropriate timeout.

I haven't looked too much in heartbeat pulses code before, but I think I can 
understand your change.  I've also got a positive opinion from Chris on it.  
I can provide my Ack, assuming the pre-merge failure reported by CI is not 
related, but could you please comment that failure first and/or ask BUG Filing 
to handle it so we have it cleaned up?

Thanks,
Janusz


> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> index 1a8e2b7db0138..4ae2fa0b61dd4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine)
>  	heartbeat_commit(rq, &attr);
>  	GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER);
>  
> +	/* Ensure the forced pulse gets a full period to execute */
> +	next_heartbeat(engine);
> +
>  	return 0;
>  }
>  
> 





^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/i915/gt: Restart the heartbeat timer when forcing a pulse
  2024-01-31 18:48 ` Janusz Krzysztofik
@ 2024-01-31 22:07   ` John Harrison
  0 siblings, 0 replies; 4+ messages in thread
From: John Harrison @ 2024-01-31 22:07 UTC (permalink / raw)
  To: Janusz Krzysztofik, Intel-GFX, Andi Shyti; +Cc: DRI-Devel

On 1/31/2024 10:48, Janusz Krzysztofik wrote:
> Hi John,
>
> On Wednesday, 10 January 2024 22:02:16 CET John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> The context persistence code does things like send super high priority
>> heartbeat pulses to ensure any leaked context can still be pre-empted
>> and thus isn't a total denial of service but only a minor denial of
>> service. Unfortunately, it wasn't bothering to restart the heatbeat
>> worker with a fresh timeout. Thus, if a persistent context happened to
>> be closed just before the heartbeat was going to go ping anyway then
>> the forced pulse would get a negligble execution time. And as the
>> forced pulse is super high priority, the worker thread's next step is
>> a reset. Which means a potentially innocent system randomly goes boom
>> when attempting to close a context. So, force a re-schedule of the
>> worker thread with the appropriate timeout.
> I haven't looked too much in heartbeat pulses code before, but I think I can
> understand your change.  I've also got a positive opinion from Chris on it.
> I can provide my Ack, assuming the pre-merge failure reported by CI is not
> related, but could you please comment that failure first and/or ask BUG Filing
> to handle it so we have it cleaned up?
Pretty confident the CI failure is unrelated. Not seeing how a change to 
the heartbeat timing of persistence context clean up could cause a HDMI 
test to fail to complete.

However, I was really hoping for a full code review by someone who 
understands this code and would be able to say whether there could be 
unexpected side effects of the change. Or even if there is a better / 
safer way to fix the problem.

@Andi Shyti, you were fingered as being someone who might have such 
knowledge. Can you comment?

Thanks,
John.

> Thanks,
> Janusz
>
>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>> index 1a8e2b7db0138..4ae2fa0b61dd4 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
>> @@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine)
>>   	heartbeat_commit(rq, &attr);
>>   	GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER);
>>   
>> +	/* Ensure the forced pulse gets a full period to execute */
>> +	next_heartbeat(engine);
>> +
>>   	return 0;
>>   }
>>   
>>
>
>
>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/i915/gt: Restart the heartbeat timer when forcing a pulse
  2024-01-10 21:02 [PATCH] drm/i915/gt: Restart the heartbeat timer when forcing a pulse John.C.Harrison
  2024-01-31 18:48 ` Janusz Krzysztofik
@ 2024-02-13  0:29 ` Andi Shyti
  1 sibling, 0 replies; 4+ messages in thread
From: Andi Shyti @ 2024-02-13  0:29 UTC (permalink / raw)
  To: John.C.Harrison; +Cc: Intel-GFX, DRI-Devel

Hi John,

On Wed, Jan 10, 2024 at 01:02:16PM -0800, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> The context persistence code does things like send super high priority
> heartbeat pulses to ensure any leaked context can still be pre-empted
> and thus isn't a total denial of service but only a minor denial of
> service. Unfortunately, it wasn't bothering to restart the heatbeat

/heatbeat/heartbeat/

> worker with a fresh timeout. Thus, if a persistent context happened to
> be closed just before the heartbeat was going to go ping anyway then
> the forced pulse would get a negligble execution time. And as the
> forced pulse is super high priority, the worker thread's next step is
> a reset. Which means a potentially innocent system randomly goes boom
> when attempting to close a context. So, force a re-schedule of the
> worker thread with the appropriate timeout.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> index 1a8e2b7db0138..4ae2fa0b61dd4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine)
>  	heartbeat_commit(rq, &attr);
>  	GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER);
>  
> +	/* Ensure the forced pulse gets a full period to execute */
> +	next_heartbeat(engine);

I think it makes sense to have this extra heardbeat here and,
as I've been mulling over it, I don't any harm.

The failure doesn't look related, either.

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,
Andi

>  	return 0;
>  }
>  
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-02-13  0:29 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-10 21:02 [PATCH] drm/i915/gt: Restart the heartbeat timer when forcing a pulse John.C.Harrison
2024-01-31 18:48 ` Janusz Krzysztofik
2024-01-31 22:07   ` John Harrison
2024-02-13  0:29 ` Andi Shyti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).