All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] drm/amdkfd: new event wait support
@ 2023-06-06 15:57 James Zhu
  2023-06-06 15:57 ` [PATCH 1/3] drm/amdkfd: add event age tracking James Zhu
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: James Zhu @ 2023-06-06 15:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix.kuehling, jamesz

In kernel amdgpu driver, kfd_wait_on_events is used to support user space signal event wait
function. For multiple threads waiting on same event scenery, race condition could occur
since some threads after checking signal condition, before calling kfd_wait_on_events, the
event interrupt could be fired and wake up other thread which are sleeping on this event.
Then those threads could fall into sleep without waking up again. Adding event age tracking
in both kernel and user mode, will help avoiding this race condition.

The changes for The user space ROCT-Thunk-Interface/ROCR-Runtime are listed below for
review togehter with kernel mode changes.

ROCT-Thunk-Interface:
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/compare/master...zhums:ROCT-Thunk-Interface:new_event_wait_review
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/commit/efdbf6cfbc026bd68ac3c35d00dacf84370eb81e
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/commit/1820ae0a2db85b6f584611dc0cde1a00e7c22915

ROCR-Runtime:
https://github.com/RadeonOpenCompute/ROCR-Runtime/compare/master...zhums:ROCR-Runtime:new_event_wait_review
https://github.com/RadeonOpenCompute/ROCR-Runtime/commit/e1f5bdb88eb882ac798aeca2c00ea3fbb2dba459
https://github.com/RadeonOpenCompute/ROCR-Runtime/commit/7d26afd14107b5c2a754c1a3f415d89f3aabb503

James Zhu (3):
  drm/amdkfd: add event age tracking
  drm/amdkfd: add event_age tracking when receiving interrupt
  drm/amdkfd: don't sleep when event age unmatch

 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 21 +++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_events.h |  1 +
 include/uapi/linux/kfd_ioctl.h          | 13 +++++++++++--
 3 files changed, 33 insertions(+), 2 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] drm/amdkfd: add event age tracking
  2023-06-06 15:57 [PATCH 0/3] drm/amdkfd: new event wait support James Zhu
@ 2023-06-06 15:57 ` James Zhu
  2023-06-06 15:57 ` [PATCH 2/3] drm/amdkfd: add event_age tracking when receiving interrupt James Zhu
  2023-06-06 15:58 ` [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch James Zhu
  2 siblings, 0 replies; 8+ messages in thread
From: James Zhu @ 2023-06-06 15:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix.kuehling, jamesz

Add event age tracking

Signed-off-by: James Zhu <James.Zhu@amd.com>
---
 include/uapi/linux/kfd_ioctl.h | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 1781e7669982..eeb2fdcbdcb7 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -39,9 +39,10 @@
  * - 1.11 - Add unified memory for ctx save/restore area
  * - 1.12 - Add DMA buf export ioctl
  * - 1.13 - Add debugger API
+ * - 1.14 - Update kfd_event_data
  */
 #define KFD_IOCTL_MAJOR_VERSION 1
-#define KFD_IOCTL_MINOR_VERSION 13
+#define KFD_IOCTL_MINOR_VERSION 14
 
 struct kfd_ioctl_get_version_args {
 	__u32 major_version;	/* from KFD */
@@ -320,12 +321,20 @@ struct kfd_hsa_hw_exception_data {
 	__u32 gpu_id;
 };
 
+/* hsa signal event data */
+struct kfd_hsa_signal_event_data {
+	__u64 last_event_age;	/* to and from KFD */
+};
+
 /* Event data */
 struct kfd_event_data {
 	union {
+		/* From KFD */
 		struct kfd_hsa_memory_exception_data memory_exception_data;
 		struct kfd_hsa_hw_exception_data hw_exception_data;
-	};				/* From KFD */
+		/* To and From KFD */
+		struct kfd_hsa_signal_event_data signal_event_data;
+	};
 	__u64 kfd_event_data_ext;	/* pointer to an extension structure
 					   for future exception types */
 	__u32 event_id;		/* to KFD */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] drm/amdkfd: add event_age tracking when receiving interrupt
  2023-06-06 15:57 [PATCH 0/3] drm/amdkfd: new event wait support James Zhu
  2023-06-06 15:57 ` [PATCH 1/3] drm/amdkfd: add event age tracking James Zhu
@ 2023-06-06 15:57 ` James Zhu
  2023-06-06 15:58 ` [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch James Zhu
  2 siblings, 0 replies; 8+ messages in thread
From: James Zhu @ 2023-06-06 15:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix.kuehling, jamesz

Add event_age tracking when receiving interrupt.

Signed-off-by: James Zhu <James.Zhu@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 6 ++++++
 drivers/gpu/drm/amd/amdkfd/kfd_events.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index 7ff5c4e1b7e2..c7689181cc22 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -431,6 +431,7 @@ int kfd_event_create(struct file *devkfd, struct kfd_process *p,
 	if (!ret) {
 		*event_id = ev->event_id;
 		*event_trigger_data = ev->event_id;
+		ev->event_age = 1;
 	} else {
 		kfree(ev);
 	}
@@ -629,6 +630,11 @@ static void set_event(struct kfd_event *ev)
 	 * updating the wait queues in kfd_wait_on_events.
 	 */
 	ev->signaled = !ev->auto_reset || !waitqueue_active(&ev->wq);
+	if (!(++ev->event_age)) {
+		/* Never wrap back to reserved/default event age 0/1 */
+		ev->event_age = 2;
+		WARN_ONCE(1, "event_age wrap back!");
+	}
 
 	list_for_each_entry(waiter, &ev->wq.head, wait.entry)
 		WRITE_ONCE(waiter->activated, true);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.h b/drivers/gpu/drm/amd/amdkfd/kfd_events.h
index 1c62c8dd6460..52ccfd397c2b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.h
@@ -53,6 +53,7 @@ struct signal_page;
 
 struct kfd_event {
 	u32 event_id;
+	u64 event_age;
 
 	bool signaled;
 	bool auto_reset;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch
  2023-06-06 15:57 [PATCH 0/3] drm/amdkfd: new event wait support James Zhu
  2023-06-06 15:57 ` [PATCH 1/3] drm/amdkfd: add event age tracking James Zhu
  2023-06-06 15:57 ` [PATCH 2/3] drm/amdkfd: add event_age tracking when receiving interrupt James Zhu
@ 2023-06-06 15:58 ` James Zhu
  2 siblings, 0 replies; 8+ messages in thread
From: James Zhu @ 2023-06-06 15:58 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix.kuehling, jamesz

Don't sleep when event age unmatch, and update last_event_age.
It is only for KFD_EVENT_TYPE_SIGNAL which is checked by user space.

Signed-off-by: James Zhu <James.Zhu@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index c7689181cc22..f4ceb5be78ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -952,6 +952,21 @@ int kfd_wait_on_events(struct kfd_process *p,
 					event_data.event_id);
 		if (ret)
 			goto out_unlock;
+
+		/* last_event_age = 0 reserved for backward compatible */
+		if (event_data.signal_event_data.last_event_age &&
+			event_waiters[i].event->event_age !=
+				event_data.signal_event_data.last_event_age) {
+			event_data.signal_event_data.last_event_age =
+				event_waiters[i].event->event_age;
+			WRITE_ONCE(event_waiters[i].activated, true);
+
+			if (copy_to_user(&events[i], &event_data,
+				sizeof(struct kfd_event_data))) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+		}
 	}
 
 	/* Check condition once. */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch
  2023-06-01 22:06     ` James Zhu
@ 2023-06-01 22:08       ` Felix Kuehling
  0 siblings, 0 replies; 8+ messages in thread
From: Felix Kuehling @ 2023-06-01 22:08 UTC (permalink / raw)
  To: James Zhu, James Zhu, amd-gfx

On 2023-06-01 18:06, James Zhu wrote:
>
>
> On 2023-06-01 17:17, Felix Kuehling wrote:
>> On 2023-06-01 16:47, James Zhu wrote:
>>> Don't sleep when event age unmatch, and update last_event_age.
>>> It is only for KFD_EVENT_TYPE_SIGNAL which is checked by user space.
>>>
>>> Signed-off-by: James Zhu <James.Zhu@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_events.c | 13 +++++++++++++
>>>   1 file changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
>>> index b27a79c5f826..23e154811471 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
>>> @@ -964,6 +964,19 @@ int kfd_wait_on_events(struct kfd_process *p,
>>>                       event_data.event_id);
>>>           if (ret)
>>>               goto out_unlock;
>>> +
>>> +        /* last_event_age = 0 reserved for backward compatible */
>>> +        if (event_data.last_event_age &&
>>> +            event_waiters[i].event->event_age != 
>>> event_data.last_event_age) {
>>> +            event_data.last_event_age = 
>>> event_waiters[i].event->event_age;
>>> +            WRITE_ONCE(event_waiters[i].activated, true);
>>
>> I think this is probably not necessary if you're returning before the 
>> first call to test_event_condition.
>
> [JZ] Currently, the returning is with test_event_condition. Since some 
> conditions needs return with all events signaled.
>
> so all waiters need check and update if any event interrupts are 
> missing here(unmatched event age).
>
>>
>>
>>> +
>>> +            if (copy_to_user(&events[i], &event_data,
>>> +                sizeof(struct kfd_event_data))) {
>>> +                ret = -EFAULT;
>>> +                goto out_unlock;
>>> +            }
>>> +        }
>>
>> When waiting on multiple events, it would be more efficient to catch 
>> all events with outdated age in a single system call, instead of 
>> returning after finding the first one. Then return after the loop if 
>> it found one or more outdated events.
> [JZ] Yes, the code is working in this way, when all events' waiters 
> are activated, the following test_event_condition == 
> KFD_IOC_WAIT_RESULT_COMPLETE, then return to user mode without sleep.
>>
>>
>> Also EFAULT is the wrong error code. It means "bad address". I think 
>> we could return 0 here because there is really no error. It's just a 
>> normal condition to allow user mode to update its event information 
>> before going to sleep. If you want a non-0 return code, I'd recommend 
>> -EDEADLK, because sleeping without outdated event information can 
>> lead to deadlocks. We'd also need to translate that into a meaningful 
>> status code in the Thunk to let ROCr know what's going on. I don't 
>> see a suitable status code in the current Thunk API for this.
> [JZ] Basically, this failure is for copy_to_user. It will lead to busy 
> wait instead of deadlock.

Oh, right, I missed that this is only for the case that copy_to_user 
fails. EFAULT is the correct error code for this. Then this patch looks 
good as is.

Regards,
   Felix


>> **
>> Regards,
>>   Felix
>>
>>
>>>       }
>>>         /* Check condition once. */

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch
  2023-06-01 21:17   ` Felix Kuehling
@ 2023-06-01 22:06     ` James Zhu
  2023-06-01 22:08       ` Felix Kuehling
  0 siblings, 1 reply; 8+ messages in thread
From: James Zhu @ 2023-06-01 22:06 UTC (permalink / raw)
  To: Felix Kuehling, James Zhu, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 3086 bytes --]


On 2023-06-01 17:17, Felix Kuehling wrote:
> On 2023-06-01 16:47, James Zhu wrote:
>> Don't sleep when event age unmatch, and update last_event_age.
>> It is only for KFD_EVENT_TYPE_SIGNAL which is checked by user space.
>>
>> Signed-off-by: James Zhu <James.Zhu@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_events.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
>> index b27a79c5f826..23e154811471 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
>> @@ -964,6 +964,19 @@ int kfd_wait_on_events(struct kfd_process *p,
>>                       event_data.event_id);
>>           if (ret)
>>               goto out_unlock;
>> +
>> +        /* last_event_age = 0 reserved for backward compatible */
>> +        if (event_data.last_event_age &&
>> +            event_waiters[i].event->event_age != 
>> event_data.last_event_age) {
>> +            event_data.last_event_age = 
>> event_waiters[i].event->event_age;
>> +            WRITE_ONCE(event_waiters[i].activated, true);
>
> I think this is probably not necessary if you're returning before the 
> first call to test_event_condition.

[JZ] Currently, the returning is with test_event_condition. Since some 
conditions needs return with all events signaled.

so all waiters need check and update if any event interrupts are missing 
here(unmatched event age).

>
>
>> +
>> +            if (copy_to_user(&events[i], &event_data,
>> +                sizeof(struct kfd_event_data))) {
>> +                ret = -EFAULT;
>> +                goto out_unlock;
>> +            }
>> +        }
>
> When waiting on multiple events, it would be more efficient to catch 
> all events with outdated age in a single system call, instead of 
> returning after finding the first one. Then return after the loop if 
> it found one or more outdated events.
[JZ] Yes, the code is working in this way, when all events' waiters are 
activated, the following test_event_condition == 
KFD_IOC_WAIT_RESULT_COMPLETE, then return to user mode without sleep.
>
>
> Also EFAULT is the wrong error code. It means "bad address". I think 
> we could return 0 here because there is really no error. It's just a 
> normal condition to allow user mode to update its event information 
> before going to sleep. If you want a non-0 return code, I'd recommend 
> -EDEADLK, because sleeping without outdated event information can lead 
> to deadlocks. We'd also need to translate that into a meaningful 
> status code in the Thunk to let ROCr know what's going on. I don't see 
> a suitable status code in the current Thunk API for this.
[JZ] Basically, this failure is for copy_to_user. It will lead to busy 
wait instead of deadlock.
> **
> Regards,
>   Felix
>
>
>>       }
>>         /* Check condition once. */

[-- Attachment #2: Type: text/html, Size: 5487 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch
  2023-06-01 20:47 ` [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch James Zhu
@ 2023-06-01 21:17   ` Felix Kuehling
  2023-06-01 22:06     ` James Zhu
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Kuehling @ 2023-06-01 21:17 UTC (permalink / raw)
  To: James Zhu, amd-gfx; +Cc: jamesz

On 2023-06-01 16:47, James Zhu wrote:
> Don't sleep when event age unmatch, and update last_event_age.
> It is only for KFD_EVENT_TYPE_SIGNAL which is checked by user space.
>
> Signed-off-by: James Zhu <James.Zhu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_events.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
> index b27a79c5f826..23e154811471 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
> @@ -964,6 +964,19 @@ int kfd_wait_on_events(struct kfd_process *p,
>   					event_data.event_id);
>   		if (ret)
>   			goto out_unlock;
> +
> +		/* last_event_age = 0 reserved for backward compatible */
> +		if (event_data.last_event_age &&
> +			event_waiters[i].event->event_age != event_data.last_event_age) {
> +			event_data.last_event_age = event_waiters[i].event->event_age;
> +			WRITE_ONCE(event_waiters[i].activated, true);

I think this is probably not necessary if you're returning before the 
first call to test_event_condition.


> +
> +			if (copy_to_user(&events[i], &event_data,
> +				sizeof(struct kfd_event_data))) {
> +				ret = -EFAULT;
> +				goto out_unlock;
> +			}
> +		}

When waiting on multiple events, it would be more efficient to catch all 
events with outdated age in a single system call, instead of returning 
after finding the first one. Then return after the loop if it found one 
or more outdated events.

Also EFAULT is the wrong error code. It means "bad address". I think we 
could return 0 here because there is really no error. It's just a normal 
condition to allow user mode to update its event information before 
going to sleep. If you want a non-0 return code, I'd recommend -EDEADLK, 
because sleeping without outdated event information can lead to 
deadlocks. We'd also need to translate that into a meaningful status 
code in the Thunk to let ROCr know what's going on. I don't see a 
suitable status code in the current Thunk API for this.

Regards,
   Felix


>   	}
>   
>   	/* Check condition once. */

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch
  2023-06-01 20:47 [PATCH 1/3] drm/amdkfd: add event age tracking James Zhu
@ 2023-06-01 20:47 ` James Zhu
  2023-06-01 21:17   ` Felix Kuehling
  0 siblings, 1 reply; 8+ messages in thread
From: James Zhu @ 2023-06-01 20:47 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix.kuehling, jamesz

Don't sleep when event age unmatch, and update last_event_age.
It is only for KFD_EVENT_TYPE_SIGNAL which is checked by user space.

Signed-off-by: James Zhu <James.Zhu@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index b27a79c5f826..23e154811471 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -964,6 +964,19 @@ int kfd_wait_on_events(struct kfd_process *p,
 					event_data.event_id);
 		if (ret)
 			goto out_unlock;
+
+		/* last_event_age = 0 reserved for backward compatible */
+		if (event_data.last_event_age &&
+			event_waiters[i].event->event_age != event_data.last_event_age) {
+			event_data.last_event_age = event_waiters[i].event->event_age;
+			WRITE_ONCE(event_waiters[i].activated, true);
+
+			if (copy_to_user(&events[i], &event_data,
+				sizeof(struct kfd_event_data))) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+		}
 	}
 
 	/* Check condition once. */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-06-06 16:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-06 15:57 [PATCH 0/3] drm/amdkfd: new event wait support James Zhu
2023-06-06 15:57 ` [PATCH 1/3] drm/amdkfd: add event age tracking James Zhu
2023-06-06 15:57 ` [PATCH 2/3] drm/amdkfd: add event_age tracking when receiving interrupt James Zhu
2023-06-06 15:58 ` [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch James Zhu
  -- strict thread matches above, loose matches on Subject: below --
2023-06-01 20:47 [PATCH 1/3] drm/amdkfd: add event age tracking James Zhu
2023-06-01 20:47 ` [PATCH 3/3] drm/amdkfd: don't sleep when event age unmatch James Zhu
2023-06-01 21:17   ` Felix Kuehling
2023-06-01 22:06     ` James Zhu
2023-06-01 22:08       ` Felix Kuehling

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.