* [PATCH] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
@ 2023-03-31  0:06 ` Danilo Krummrich
From: Danilo Krummrich @ 2023-03-31  0:06 UTC
  To: luben.tuikov, airlied, daniel, l.stach, christian.koenig
  Cc: Danilo Krummrich, linux-kernel, dri-devel

It has already happened a few times that patches slipped through which
accessed an entity through a job that had already been removed from the
entity's queue. Since jobs and entities might have different lifecycles,
this can potentially cause UAF bugs.

In order to make it obvious that a job's entity pointer must not be
accessed after drm_sched_entity_pop_job() was called successfully, set
the job's entity pointer to NULL once the job is removed from the entity
queue.

Moreover, debugging a plain NULL pointer dereference is far easier than
debugging memory corruption caused by a UAF.

Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
I'm aware that drivers could already use job->entity in arbitrary places, since
they are in control of when the entity is actually freed. A quick grep didn't
turn up any place where this is actually the case, but maybe I just missed it.

If we therefore don't want to set job->entity to NULL, I think we should at
least add a comment somewhere.
---

 drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 15d04a0ec623..a9c6118e534b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 			drm_sched_rq_update_fifo(entity, next->submit_ts);
 	}
 
+	/* Jobs and entities might have different lifecycles. Since we're
+	 * removing the job from the entities queue, set the jobs entity pointer
+	 * to NULL to prevent any future access of the entity through this job.
+	 */
+	sched_job->entity = NULL;
+
 	return sched_job;
 }
 
-- 
2.39.2
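
For illustration, here is a minimal sketch of the driver-side pattern the commit
message warns about, using an invented callback (not taken from any in-tree
driver): a run_job() implementation that still dereferences job->entity even
though the job has already left the entity's queue.

#include <linux/printk.h>
#include <drm/gpu_scheduler.h>

/*
 * Hypothetical driver callback -- not from any in-tree driver. By the time
 * run_job() is called, drm_sched_entity_pop_job() has already removed the
 * job from the entity's queue, so the entity may already be torn down.
 */
static struct dma_fence *hypothetical_run_job(struct drm_sched_job *sched_job)
{
	struct drm_sched_entity *entity = sched_job->entity;

	/*
	 * Before this patch: a potential use-after-free if the entity is gone.
	 * After this patch: an obvious NULL pointer dereference instead.
	 */
	pr_info("job submitted with priority %d\n", entity->priority);

	return NULL; /* a real driver would return its hardware fence here */
}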


* Re: [PATCH] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
  2023-03-31  0:06 ` Danilo Krummrich
@ 2023-03-31  5:59   ` Christian König
From: Christian König @ 2023-03-31  5:59 UTC
  To: Danilo Krummrich, luben.tuikov, airlied, daniel, l.stach,
	Prosyak, Vitaly
  Cc: dri-devel, linux-kernel

On 2023-03-31 02:06, Danilo Krummrich wrote:
> It already happend a few times that patches slipped through which
> implemented access to an entity through a job that was already removed
> from the entities queue. Since jobs and entities might have different
> lifecycles, this can potentially cause UAF bugs.
>
> In order to make it obvious that a jobs entity pointer shouldn't be
> accessed after drm_sched_entity_pop_job() was called successfully, set
> the jobs entity pointer to NULL once the job is removed from the entity
> queue.
>
> Moreover, debugging a potential NULL pointer dereference is way easier
> than potentially corrupted memory through a UAF.
>
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>

In general "YES PLEASE!", but I fear that this will break amdgpu's reset
sequence.

On the other hand, if amdgpu still relies on that pointer, that's clearly
a bug (which I have pointed out plenty of times before).

Luben, any opinion on that? Could you drive cleaning that up as well?

Thanks,
Christian.

> ---
> I'm aware that drivers could already use job->entity in arbitrary places, since
> they in control of when the entity is actually freed. A quick grep didn't give
> me any results where this would actually be the case, however maybe I also just
> didn't catch it.
>
> If, therefore, we don't want to set job->entity to NULL I think we should at
> least add a comment somewhere.
> ---
>
>   drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 15d04a0ec623..a9c6118e534b 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>   			drm_sched_rq_update_fifo(entity, next->submit_ts);
>   	}
>   
> +	/* Jobs and entities might have different lifecycles. Since we're
> +	 * removing the job from the entities queue, set the jobs entity pointer
> +	 * to NULL to prevent any future access of the entity through this job.
> +	 */
> +	sched_job->entity = NULL;
> +
>   	return sched_job;
>   }
>   
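
To make the concern about reset paths concrete, here is a purely illustrative
sketch (not actual amdgpu code; the function and the guilty-entity logic are
invented) of a recovery helper that still relies on job->entity while walking
the pending list. With this patch applied it would trip over a NULL pointer
rather than risk a use-after-free:

#include <linux/list.h>
#include <drm/gpu_scheduler.h>

/* Purely illustrative -- not amdgpu's actual reset code. */
static void hypothetical_reset_mark_guilty(struct drm_gpu_scheduler *sched,
					   struct drm_sched_entity *guilty)
{
	struct drm_sched_job *job;

	list_for_each_entry(job, &sched->pending_list, list) {
		/*
		 * Every job on the pending list has already been popped from
		 * its entity queue, so job->entity is now NULL here (and
		 * previously risked pointing at freed memory).
		 */
		if (job->entity == guilty)
			drm_sched_increase_karma(job);
	}
}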


* Re: [PATCH] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
  2023-03-31  5:59   ` Christian König
@ 2023-03-31 12:57     ` Luben Tuikov
From: Luben Tuikov @ 2023-03-31 12:57 UTC
  To: Christian König, Danilo Krummrich, airlied, daniel, l.stach,
	Prosyak, Vitaly
  Cc: dri-devel, linux-kernel

On 2023-03-31 01:59, Christian König wrote:
> On 2023-03-31 02:06, Danilo Krummrich wrote:
>> It already happend a few times that patches slipped through which
>> implemented access to an entity through a job that was already removed
>> from the entities queue. Since jobs and entities might have different
>> lifecycles, this can potentially cause UAF bugs.
>>
>> In order to make it obvious that a jobs entity pointer shouldn't be
>> accessed after drm_sched_entity_pop_job() was called successfully, set
>> the jobs entity pointer to NULL once the job is removed from the entity
>> queue.
>>
>> Moreover, debugging a potential NULL pointer dereference is way easier
>> than potentially corrupted memory through a UAF.
>>
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> 
> In general "YES PLEASE!", but I fear that this will break amdgpus reset 
> sequence.
> 
> On the other hand when amdgpu still relies on that pointer it's clearly 
> a bug (which I pointed out tons of times before).
> 
> Luben any opinion on that? Could you drive cleaning that up as well?

Hi Christian,

No worries, yes, I'll take a look at this after breakfast.

> 
> Thanks,
> Christian.
> 
>> ---
>> I'm aware that drivers could already use job->entity in arbitrary places, since
>> they in control of when the entity is actually freed. A quick grep didn't give
>> me any results where this would actually be the case, however maybe I also just
>> didn't catch it.
>>
>> If, therefore, we don't want to set job->entity to NULL I think we should at
>> least add a comment somewhere.

I agree with the sentiment of this patch. I'll have to take a closer look at it,
because there was some indirect pointer dependency due to the way the FIFO was
implemented, and I review the code every 3-6 months to remind myself of that--maybe
it's related, maybe not. But this looks like something we can delve into and, at
best, come up with a comment explaining what's going on and why.

We haven't seen any oopses so far with the code as it is, and any new patches
which provoke an oops may be doing something they shouldn't. I'll take a look.

Any indication of what these new patches were doing?

Regards,
Luben

>> ---
>>
>>   drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 15d04a0ec623..a9c6118e534b 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>   			drm_sched_rq_update_fifo(entity, next->submit_ts);
>>   	}
>>   
>> +	/* Jobs and entities might have different lifecycles. Since we're
>> +	 * removing the job from the entities queue, set the jobs entity pointer
>> +	 * to NULL to prevent any future access of the entity through this job.
>> +	 */
>> +	sched_job->entity = NULL;
>> +
>>   	return sched_job;
>>   }
>>   
> 


* Re: [PATCH] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
  2023-03-31  5:59   ` Christian König
@ 2023-04-05 17:39     ` Luben Tuikov
From: Luben Tuikov @ 2023-04-05 17:39 UTC
  To: Christian König, Danilo Krummrich, airlied, daniel, l.stach,
	Prosyak, Vitaly
  Cc: dri-devel, linux-kernel

On 2023-03-31 01:59, Christian König wrote:
> On 2023-03-31 02:06, Danilo Krummrich wrote:
>> It already happend a few times that patches slipped through which
>> implemented access to an entity through a job that was already removed
>> from the entities queue. Since jobs and entities might have different
>> lifecycles, this can potentially cause UAF bugs.
>>
>> In order to make it obvious that a jobs entity pointer shouldn't be
>> accessed after drm_sched_entity_pop_job() was called successfully, set
>> the jobs entity pointer to NULL once the job is removed from the entity
>> queue.
>>
>> Moreover, debugging a potential NULL pointer dereference is way easier
>> than potentially corrupted memory through a UAF.
>>
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> 
> In general "YES PLEASE!", but I fear that this will break amdgpus reset 
> sequence.
> 
> On the other hand when amdgpu still relies on that pointer it's clearly 
> a bug (which I pointed out tons of times before).
> 
> Luben any opinion on that? Could you drive cleaning that up as well?

I didn't find any references to the scheduling entity after the job
is submitted to the hardware. (I commented the same in the other
thread; we just need to decide which way to go.)

Regards,
Luben

> 
> Thanks,
> Christian.
> 
>> ---
>> I'm aware that drivers could already use job->entity in arbitrary places, since
>> they in control of when the entity is actually freed. A quick grep didn't give
>> me any results where this would actually be the case, however maybe I also just
>> didn't catch it.
>>
>> If, therefore, we don't want to set job->entity to NULL I think we should at
>> least add a comment somewhere.
>> ---
>>
>>   drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 15d04a0ec623..a9c6118e534b 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>   			drm_sched_rq_update_fifo(entity, next->submit_ts);
>>   	}
>>   
>> +	/* Jobs and entities might have different lifecycles. Since we're
>> +	 * removing the job from the entities queue, set the jobs entity pointer
>> +	 * to NULL to prevent any future access of the entity through this job.
>> +	 */
>> +	sched_job->entity = NULL;
>> +
>>   	return sched_job;
>>   }
>>   
> 


* Re: [PATCH] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
  2023-04-05 17:39     ` Luben Tuikov
@ 2023-04-11 18:13       ` Danilo Krummrich
From: Danilo Krummrich @ 2023-04-11 18:13 UTC
  To: Luben Tuikov, Christian König, airlied, daniel, l.stach,
	Prosyak, Vitaly
  Cc: linux-kernel, dri-devel

On 4/5/23 19:39, Luben Tuikov wrote:
> On 2023-03-31 01:59, Christian König wrote:
>> On 2023-03-31 02:06, Danilo Krummrich wrote:
>>> It already happend a few times that patches slipped through which
>>> implemented access to an entity through a job that was already removed
>>> from the entities queue. Since jobs and entities might have different
>>> lifecycles, this can potentially cause UAF bugs.
>>>
>>> In order to make it obvious that a jobs entity pointer shouldn't be
>>> accessed after drm_sched_entity_pop_job() was called successfully, set
>>> the jobs entity pointer to NULL once the job is removed from the entity
>>> queue.
>>>
>>> Moreover, debugging a potential NULL pointer dereference is way easier
>>> than potentially corrupted memory through a UAF.
>>>
>>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>>
>> In general "YES PLEASE!", but I fear that this will break amdgpus reset
>> sequence.
>>
>> On the other hand when amdgpu still relies on that pointer it's clearly
>> a bug (which I pointed out tons of times before).
>>
>> Luben any opinion on that? Could you drive cleaning that up as well?
> 
> I didn't find any references to scheduling entity after the job
> is submitted to the hardware. (I commented the same in the other
> thread, we just need to decide which way to go.)

AFAICS from the other mail thread, the consensus seems to be not to
ref-count entities and to handle job statistics differently.

Should we go ahead and take this patch then? Maybe it also makes sense
to send a V2 that additionally adds a comment to the drm_sched_job
structure mentioning that .entity must not be used after the job has
been taken from the entity's queue.

- Danilo

> 
> Regards,
> Luben
> 
>>
>> Thanks,
>> Christian.
>>
>>> ---
>>> I'm aware that drivers could already use job->entity in arbitrary places, since
>>> they in control of when the entity is actually freed. A quick grep didn't give
>>> me any results where this would actually be the case, however maybe I also just
>>> didn't catch it.
>>>
>>> If, therefore, we don't want to set job->entity to NULL I think we should at
>>> least add a comment somewhere.
>>> ---
>>>
>>>    drivers/gpu/drm/scheduler/sched_entity.c | 6 ++++++
>>>    1 file changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index 15d04a0ec623..a9c6118e534b 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -448,6 +448,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>>    			drm_sched_rq_update_fifo(entity, next->submit_ts);
>>>    	}
>>>    
>>> +	/* Jobs and entities might have different lifecycles. Since we're
>>> +	 * removing the job from the entities queue, set the jobs entity pointer
>>> +	 * to NULL to prevent any future access of the entity through this job.
>>> +	 */
>>> +	sched_job->entity = NULL;
>>> +
>>>    	return sched_job;
>>>    }
>>>    
>>
> 
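
As a possible starting point for the comment Danilo suggests above for a v2,
here is a rough sketch of a kernel-doc note next to the entity field of
struct drm_sched_job in include/drm/gpu_scheduler.h (excerpt only; the wording
is merely a suggestion and not taken from any posted patch):

	/**
	 * @entity: The entity this job was submitted through.
	 *
	 * Only valid while the job is queued on the entity. Once
	 * drm_sched_entity_pop_job() removes the job from the entity's
	 * queue, i.e. once the job is on its way to the hardware and the
	 * pending list, this pointer is set to NULL and must not be
	 * dereferenced anymore, since the entity may not outlive the job.
	 */
	struct drm_sched_entity		*entity;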


* Re: [PATCH] drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()
  2023-04-11 18:13       ` Danilo Krummrich
@ 2023-04-11 20:06         ` Luben Tuikov
From: Luben Tuikov @ 2023-04-11 20:06 UTC
  To: Danilo Krummrich, Christian König, airlied, daniel, l.stach,
	Prosyak, Vitaly
  Cc: linux-kernel, dri-devel

On 2023-04-11 14:13, Danilo Krummrich wrote:
> On 4/5/23 19:39, Luben Tuikov wrote:
>> On 2023-03-31 01:59, Christian König wrote:
>>> On 2023-03-31 02:06, Danilo Krummrich wrote:
>>>> It already happend a few times that patches slipped through which
>>>> implemented access to an entity through a job that was already removed
>>>> from the entities queue. Since jobs and entities might have different
>>>> lifecycles, this can potentially cause UAF bugs.
>>>>
>>>> In order to make it obvious that a jobs entity pointer shouldn't be
>>>> accessed after drm_sched_entity_pop_job() was called successfully, set
>>>> the jobs entity pointer to NULL once the job is removed from the entity
>>>> queue.
>>>>
>>>> Moreover, debugging a potential NULL pointer dereference is way easier
>>>> than potentially corrupted memory through a UAF.
>>>>
>>>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>>>
>>> In general "YES PLEASE!", but I fear that this will break amdgpus reset
>>> sequence.
>>>
>>> On the other hand when amdgpu still relies on that pointer it's clearly
>>> a bug (which I pointed out tons of times before).
>>>
>>> Luben any opinion on that? Could you drive cleaning that up as well?
>>
>> I didn't find any references to scheduling entity after the job
>> is submitted to the hardware. (I commented the same in the other
>> thread, we just need to decide which way to go.)
> 
> AFAICS from the other mail thread it seems to be consensus to not 
> ref-count entities and handle job statistics differently.
> 
> Should we go ahead and take this patch then? Maybe it also makes sense 
> to send a V2 additionally adding a comment to the drm_sched_job 
> structure mentioning that .entity must not be used after the job was 
> taken from the entities queue.

Yes, please send a v2, but instead of (or in addition to) mentioning
that the job was taken from the "entity's queue", mention that once
the job is pushed to the hardware, i.e. the pending queue, the
"entity" should not be referenced anymore. IOW, we want to say that
it is going from X to Y, as opposed to just that it's taken from X,
because the latter begs the question "Where is it going to then?".

For the record, I think that using kref would give us the most
stability, even if we skipped "entity" and let the scheduler, or
even the controller, keep a kref on submitted commands down to
the GPU. On reset, we block command submission to the GPU from
the outermost boundary, and then start peeling the layers from
the innermost boundary.

Using kref also forces us to think objectively and set explicit
(clear) dependencies from the outset--i.e. who gets freed, when,
and under what circumstances.

And using kref might even expose wrong dependencies, thus
prompting a redesign and a fix.
-- 
Regards,
Luben
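
To make the kref idea above a bit more concrete, here is a rough driver-side
sketch under invented names (drm_sched_entity itself has no reference count,
so the wrapper and helpers below are purely hypothetical, not a proposal for
the actual API):

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical driver-side wrapper giving the entity a refcounted lifetime. */
struct refcounted_entity {
	struct drm_sched_entity base;
	struct kref refcount;		/* kref_init() at entity creation */
};

static void refcounted_entity_release(struct kref *kref)
{
	struct refcounted_entity *e =
		container_of(kref, struct refcounted_entity, refcount);

	drm_sched_entity_destroy(&e->base);
	kfree(e);
}

/*
 * Each job would then hold a reference from submission until free_job(), e.g.:
 *
 *   kref_get(&e->refcount);                       // before pushing the job
 *   ...
 *   kref_put(&e->refcount, refcounted_entity_release);   // in free_job()
 *
 * so the entity cannot disappear underneath a job that is still in flight.
 */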

