* [PATCH] drm/amdkfd: handle errors from svm validate and map
@ 2023-09-13 15:16 Philip Yang
  2023-09-13 16:14 ` Felix Kuehling
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Philip Yang @ 2023-09-13 15:16 UTC (permalink / raw)
  To: amd-gfx; +Cc: alex.sierra, Philip Yang, Felix.Kuehling, james.zhu

If a new range added to the update list is split into multiple pranges
at max_svm_range_pages alignment, and svm validate and map returns an
error for the first prange, then on the caller's retry the pranges with
prange->is_error_flag set, and the pranges without prange->mapped_to_gpu
set, should be added to the update list so the GPU mapping is updated
for the entire range.

Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: James Zhu <james.zhu@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 61dd66bddc3c..8871329e9cbd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -825,7 +825,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
 		}
 	}
 
-	return !prange->is_error_flag;
+	return true;
 }
 
 /**
@@ -2228,7 +2228,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 		next = interval_tree_iter_next(node, start, last);
 		next_start = min(node->last, last) + 1;
 
-		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
+		if (!prange->is_error_flag && prange->mapped_to_gpu &&
+		    svm_range_is_same_attrs(p, prange, nattr, attrs)) {
 			/* nothing to do */
 		} else if (node->start < start || node->last > last) {
 			/* node intersects the update range and its attributes
-- 
2.35.1
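
For illustration only, not part of the patch: a minimal, self-contained C
model of the decision svm_range_add makes once this change is applied. The
struct and helper below are simplified stand-ins rather than the real
kfd_svm.c definitions; the point is just that a prange is skipped only when
it has no pending error, is already mapped to the GPUs, and its attributes
are unchanged, so a retry picks up every prange a failed pass left behind.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct svm_range; only the fields that matter here. */
struct prange_model {
	bool is_error_flag;   /* validate and map failed on a previous pass */
	bool mapped_to_gpu;   /* range is currently mapped on the GPUs */
	bool same_attrs;      /* requested attributes equal the current ones */
};

/* Mirrors the patched check in svm_range_add(): skip ("nothing to do") only
 * when there is no pending error, the range is already mapped, and the
 * attributes are unchanged; otherwise put the prange on the update_list so
 * the retry validates and maps it again. */
static bool needs_update(const struct prange_model *p)
{
	return p->is_error_flag || !p->mapped_to_gpu || !p->same_attrs;
}

int main(void)
{
	/* One large range split at max_svm_range_pages alignment into three
	 * pranges; the first one failed validate and map on the last pass. */
	struct prange_model split[3] = {
		{ .is_error_flag = true,  .mapped_to_gpu = false, .same_attrs = true },
		{ .is_error_flag = false, .mapped_to_gpu = false, .same_attrs = true },
		{ .is_error_flag = false, .mapped_to_gpu = true,  .same_attrs = true },
	};

	for (int i = 0; i < 3; i++)
		printf("prange %d: %s\n", i,
		       needs_update(&split[i]) ? "add to update_list" : "nothing to do");
	return 0;
}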



* Re: [PATCH] drm/amdkfd: handle errors from svm validate and map
  2023-09-13 15:16 [PATCH] drm/amdkfd: handle errors from svm validate and map Philip Yang
@ 2023-09-13 16:14 ` Felix Kuehling
  2023-09-13 17:24   ` Philip Yang
  2023-09-13 17:33   ` Philip Yang
  2023-09-15 13:28 ` [PATCH v2] " Philip Yang
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Felix Kuehling @ 2023-09-13 16:14 UTC (permalink / raw)
  To: Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu

On 2023-09-13 11:16, Philip Yang wrote:
> If new range is added to update list, splited to multiple pranges with
> max_svm_range_pages alignment, and svm validate and map returns error
> for the first prange, then the caller retry should add pranges with
> prange->is_error_flag or prange without prange->mapped_to_gpu to the
> update list, to update GPU mapping for the entire range.

It looks like the only new thing here is to remove the "same attribute" 
optimization for ranges that are not mapped on the GPU. I don't fully 
understand the scenario you're describing here, but it feels like this 
change has a bigger impact than it needs to have. Your description 
specifically talks about ranges split at max_svm_range_pages boundaries. 
But your patch affects all ranges not mapped on the GPU, even if 
prange->is_error_flag is not set.

Maybe that's OK, because the expensive thing is updating existing 
mappings unnecessarily. If there is no existing mapping yet, it's 
probably not a big deal. I just don't understand the scenario that 
requires a retry without the prange->is_error_flag being set. Maybe a 
better fix would be to ensure that prange->is_error_flag gets set in 
your scenario.

Regards,
   Felix
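
A stand-alone sketch of one scenario that could leave a prange needing a
retry without is_error_flag being set (illustrative only; the names are
made-up stand-ins and -11 is just a placeholder error value): with set_attr
stopping at the first validate and map error (the "if (r) break;" that later
revisions of this patch change), a failure on the first split prange means
the later pranges are never validated at all, so they end up with neither
the error flag nor mapped_to_gpu set.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct svm_range. */
struct prange_model {
	bool is_error_flag;
	bool mapped_to_gpu;
};

/* Pretend validate and map fails on the first prange of the split. */
static int validate_and_map(int i)
{
	return i == 0 ? -11 : 0;
}

int main(void)
{
	struct prange_model split[3] = { { false, false } };   /* one range, three pranges */
	int r = 0;

	for (int i = 0; i < 3; i++) {
		r = validate_and_map(i);
		split[i].is_error_flag = !!r;
		split[i].mapped_to_gpu = !r;
		if (r)
			break;   /* current behaviour: stop at the first error */
	}

	/* pranges 1 and 2 were never reached: no error flag, and not mapped. */
	for (int i = 0; i < 3; i++)
		printf("prange %d: error=%d mapped=%d\n",
		       i, split[i].is_error_flag, split[i].mapped_to_gpu);
	return 0;
}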


>
> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Tested-by: James Zhu <james.zhu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 61dd66bddc3c..8871329e9cbd 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -825,7 +825,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
>   		}
>   	}
>   
> -	return !prange->is_error_flag;
> +	return true;
>   }
>   
>   /**
> @@ -2228,7 +2228,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
>   		next = interval_tree_iter_next(node, start, last);
>   		next_start = min(node->last, last) + 1;
>   
> -		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
> +		if (!prange->is_error_flag && prange->mapped_to_gpu &&
> +		    svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>   			/* nothing to do */
>   		} else if (node->start < start || node->last > last) {
>   			/* node intersects the update range and its attributes


* Re: [PATCH] drm/amdkfd: handle errors from svm validate and map
  2023-09-13 16:14 ` Felix Kuehling
@ 2023-09-13 17:24   ` Philip Yang
  2023-09-13 17:33   ` Philip Yang
  1 sibling, 0 replies; 18+ messages in thread
From: Philip Yang @ 2023-09-13 17:24 UTC (permalink / raw)
  To: Felix Kuehling, Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu

[-- Attachment #1: Type: text/html, Size: 5369 bytes --]


* Re: [PATCH] drm/amdkfd: handle errors from svm validate and map
  2023-09-13 16:14 ` Felix Kuehling
  2023-09-13 17:24   ` Philip Yang
@ 2023-09-13 17:33   ` Philip Yang
  2023-09-13 18:27     ` Felix Kuehling
  1 sibling, 1 reply; 18+ messages in thread
From: Philip Yang @ 2023-09-13 17:33 UTC (permalink / raw)
  To: Felix Kuehling, Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu

[-- Attachment #1: Type: text/html, Size: 5000 bytes --]


* Re: [PATCH] drm/amdkfd: handle errors from svm validate and map
  2023-09-13 17:33   ` Philip Yang
@ 2023-09-13 18:27     ` Felix Kuehling
  0 siblings, 0 replies; 18+ messages in thread
From: Felix Kuehling @ 2023-09-13 18:27 UTC (permalink / raw)
  To: Philip Yang, Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu


On 2023-09-13 13:33, Philip Yang wrote:
>
>
> On 2023-09-13 12:14, Felix Kuehling wrote:
>> On 2023-09-13 11:16, Philip Yang wrote:
>>> If new range is added to update list, splited to multiple pranges with
>>> max_svm_range_pages alignment, and svm validate and map returns error
>>> for the first prange, then the caller retry should add pranges with
>>> prange->is_error_flag or prange without prange->mapped_to_gpu to the
>>> update list, to update GPU mapping for the entire range.
>>
>> It looks like the only new thing here is to remove the "same 
>> attribute" optimization for ranges that are not mapped on the GPU. I 
>> don't fully understand the scenario you're describing here, but it 
>> feels like this change has a bigger impact than it needs to have. 
>> Your description specifically talks about ranges split at 
>> max_svm_range_pages boundaries. But your patch affects all ranges not 
>> mapped on the GPU, even it prange->is_error_flag is not set.
>>
>> Maybe that's OK, because the expensive thing is updating existing 
>> mappings unnecessarily. If there is no existing mapping yet, it's 
>> probably not a big deal. I just don't understand the scenario that 
>> requires a retry  without the prange->is_error_flag being set. Maybe 
>> a better fix would be to ensure that prange->is_error_flag gets set 
>> in your scenario.
>
> Another way to fix the issue: have set_attr continue to call 
> svm_range_validate_and_map for all pranges on the update_list after 
> svm_range_validate_and_map returns an error; this change would have 
> fewer side effects.
>
Another option is to get rid of the is_error_flag completely. Instead 
only use mapped_to_gpu and set it to false if validate_and_map fails.

Regards,
   Felix


> Regards,
>
> Philip
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from 
>>> svm validate and map")
>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>> Tested-by: James Zhu <james.zhu@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 61dd66bddc3c..8871329e9cbd 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -825,7 +825,7 @@ svm_range_is_same_attrs(struct kfd_process *p, 
>>> struct svm_range *prange,
>>>           }
>>>       }
>>>   -    return !prange->is_error_flag;
>>> +    return true;
>>>   }
>>>     /**
>>> @@ -2228,7 +2228,8 @@ svm_range_add(struct kfd_process *p, uint64_t 
>>> start, uint64_t size,
>>>           next = interval_tree_iter_next(node, start, last);
>>>           next_start = min(node->last, last) + 1;
>>>   -        if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>>> +        if (!prange->is_error_flag && prange->mapped_to_gpu &&
>>> +            svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>>>               /* nothing to do */
>>>           } else if (node->start < start || node->last > last) {
>>>               /* node intersects the update range and its attributes


* [PATCH v2] drm/amdkfd: handle errors from svm validate and map
  2023-09-13 15:16 [PATCH] drm/amdkfd: handle errors from svm validate and map Philip Yang
  2023-09-13 16:14 ` Felix Kuehling
@ 2023-09-15 13:28 ` Philip Yang
  2023-09-15 21:06   ` Chen, Xiaogang
  2023-09-19 14:21 ` [PATCH v3] drm/amdkfd: Handle " Philip Yang
  2023-09-20 15:45 ` [PATCH v4] " Philip Yang
  3 siblings, 1 reply; 18+ messages in thread
From: Philip Yang @ 2023-09-15 13:28 UTC (permalink / raw)
  To: amd-gfx; +Cc: alex.sierra, Philip Yang, Felix.Kuehling, james.zhu

If a new range is split into multiple pranges at max_svm_range_pages
alignment and added to the update_list, svm validate and map should keep
going after an error to make sure the mapped_to_gpu flag is up to date
for the whole range.

svm validate and map now sets prange->mapped_to_gpu after mapping to the
GPUs successfully, and otherwise clears the prange->mapped_to_gpu flag
(for the update mapping case) instead of setting an error flag, so we
can remove the redundant error flag to simplify the code.

Update the prange->mapped_to_gpu flag inside svm_range_lock, to guarantee
we always set the prange invalid flag to evict queues or unmap from GPUs
before the system memory is moved.

After svm validate and map returns an error, the caller's retry will
update the mapping for the whole range.

Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 18 ++++++++----------
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 5d7ba7dbf6ce..26018b1d6138 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -818,7 +818,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
 		}
 	}
 
-	return !prange->is_error_flag;
+	return true;
 }
 
 /**
@@ -1724,20 +1724,17 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
 					  ctx->bitmap, wait, flush_tlb);
 
 unlock_out:
+		prange->mapped_to_gpu = !r;
 		svm_range_unlock(prange);
 
 		addr = next;
 	}
 
-	if (addr == end) {
+	if (addr == end)
 		prange->validated_once = true;
-		prange->mapped_to_gpu = true;
-	}
 
 unreserve_out:
 	svm_range_unreserve_bos(ctx);
-
-	prange->is_error_flag = !!r;
 	if (!r)
 		prange->validate_timestamp = ktime_get_boottime();
 
@@ -2106,7 +2103,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 		next = interval_tree_iter_next(node, start, last);
 		next_start = min(node->last, last) + 1;
 
-		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
+		if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
+		    prange->mapped_to_gpu) {
 			/* nothing to do */
 		} else if (node->start < start || node->last > last) {
 			/* node intersects the update range and its attributes
@@ -3519,7 +3517,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 	struct svm_range *next;
 	bool update_mapping = false;
 	bool flush_tlb;
-	int r = 0;
+	int r, ret = 0;
 
 	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
 		 p->pasid, &p->svms, start, start + size - 1, size);
@@ -3607,7 +3605,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 out_unlock_range:
 		mutex_unlock(&prange->migrate_mutex);
 		if (r)
-			break;
+			ret = r;
 	}
 
 	dynamic_svm_range_dump(svms);
@@ -3620,7 +3618,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", p->pasid,
 		 &p->svms, start, start + size - 1, r);
 
-	return r;
+	return ret ? ret : r;
 }
 
 static int
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 9e668eeefb32..91715bf3233c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -134,7 +134,6 @@ struct svm_range {
 	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
 	bool				validated_once;
 	bool				mapped_to_gpu;
-	bool				is_error_flag;
 };
 
 static inline void svm_range_lock(struct svm_range *prange)
-- 
2.35.1
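
One detail of the diff above, spelled out: svm_range_set_attr now remembers
the error and keeps walking the update_list instead of breaking out, and
only reports the remembered error once the loop is done. A minimal
stand-alone sketch of that pattern (generic C, not the kernel code;
process_item and the -11 value are made up):

#include <stdio.h>

/* Made-up per-item operation standing in for validate and map. */
static int process_item(int i)
{
	return i == 1 ? -11 : 0;   /* pretend the second item fails */
}

int main(void)
{
	int r = 0, ret = 0;

	/* Keep going after an error so every item still gets processed;
	 * remember the error instead of breaking out of the loop. */
	for (int i = 0; i < 4; i++) {
		r = process_item(i);
		if (r)
			ret = r;   /* was "break;" before this patch */
	}

	printf("final status: %d\n", ret ? ret : r);
	return 0;
}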



* Re: [PATCH v2] drm/amdkfd: handle errors from svm validate and map
  2023-09-15 13:28 ` [PATCH v2] " Philip Yang
@ 2023-09-15 21:06   ` Chen, Xiaogang
  2023-09-15 21:20     ` Philip Yang
  0 siblings, 1 reply; 18+ messages in thread
From: Chen, Xiaogang @ 2023-09-15 21:06 UTC (permalink / raw)
  To: Philip Yang, amd-gfx; +Cc: alex.sierra, Felix.Kuehling, james.zhu


On 9/15/2023 8:28 AM, Philip Yang wrote:
>
> If new range is splited to multiple pranges with max_svm_range_pages
> alignment and added to update_list, svm validate and map should keep
> going after error to make sure maps_to_gpu flag is up to date for the
> whole range.
>
> svm validate and map update set prange->mapped_to_gpu after mapping to
> GPUs successfully, otherwise clear prange->mapped_to_gpu flag (for
> update mapping case) instead of setting error flag, we can remove
> the redundant error flag to simpliy code.
>
> Update prange->mapped_to_gpu flag inside svm_range_lock, to guarant we
> always set prange invalid flag to evict queues or unmap from GPUs before
> the system memory is moved.
>
> After svm validate and map return error, the caller retry will update
> the mapping for the whole range.
>
> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 18 ++++++++----------
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
>   2 files changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 5d7ba7dbf6ce..26018b1d6138 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -818,7 +818,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
>                  }
>          }
>
> -       return !prange->is_error_flag;
> +       return true;
>   }
>
>   /**
> @@ -1724,20 +1724,17 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>                                            ctx->bitmap, wait, flush_tlb);
>
>   unlock_out:
> +               prange->mapped_to_gpu = !r;

I do not understand why prange->mapped_to_gpu is set here. It handles one 
vma, not the whole prange. If there are multiple vmas, the first vma is 
handled ok, and the second vma fails in amdgpu_hmm_range_get_pages or 
svm_range_dma_map, you would end up with prange->mapped_to_gpu true even 
though only part of the prange got mapped, right?


Regards

Xiaogang

>                  svm_range_unlock(prange);
>
>                  addr = next;
>          }
>
> -       if (addr == end) {
> +       if (addr == end)
>                  prange->validated_once = true;
> -               prange->mapped_to_gpu = true;
> -       }
>
>   unreserve_out:
>          svm_range_unreserve_bos(ctx);
> -
> -       prange->is_error_flag = !!r;
>          if (!r)
>                  prange->validate_timestamp = ktime_get_boottime();
>
> @@ -2106,7 +2103,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
>                  next = interval_tree_iter_next(node, start, last);
>                  next_start = min(node->last, last) + 1;
>
> -               if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
> +               if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
> +                   prange->mapped_to_gpu) {
>                          /* nothing to do */
>                  } else if (node->start < start || node->last > last) {
>                          /* node intersects the update range and its attributes
> @@ -3519,7 +3517,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>          struct svm_range *next;
>          bool update_mapping = false;
>          bool flush_tlb;
> -       int r = 0;
> +       int r, ret = 0;
>
>          pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
>                   p->pasid, &p->svms, start, start + size - 1, size);
> @@ -3607,7 +3605,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   out_unlock_range:
>                  mutex_unlock(&prange->migrate_mutex);
>                  if (r)
> -                       break;
> +                       ret = r;
>          }
>
>          dynamic_svm_range_dump(svms);
> @@ -3620,7 +3618,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>          pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", p->pasid,
>                   &p->svms, start, start + size - 1, r);
>
> -       return r;
> +       return ret ? ret : r;
>   }
>
>   static int
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> index 9e668eeefb32..91715bf3233c 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> @@ -134,7 +134,6 @@ struct svm_range {
>          DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
>          bool                            validated_once;
>          bool                            mapped_to_gpu;
> -       bool                            is_error_flag;
>   };
>
>   static inline void svm_range_lock(struct svm_range *prange)
> --
> 2.35.1
>
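
To make the concern raised above concrete, a tiny stand-alone model of the
per-vma loop (illustrative only; the helper names and the -16 value are made
up). With the assignment placed inside the loop as in this v2, an error path
that bails out before reaching it leaves mapped_to_gpu at whatever the
previous iteration set:

#include <stdbool.h>
#include <stdio.h>

#define NCHUNKS 3

/* Made-up stand-ins: chunk 0 succeeds, chunk 1 fails in "get_pages". */
static int get_pages(int chunk) { return chunk == 1 ? -16 : 0; }
static int map_chunk(int chunk) { (void)chunk; return 0; }

int main(void)
{
	bool mapped_to_gpu = false;
	int r = 0;

	for (int chunk = 0; !r && chunk < NCHUNKS; chunk++) {
		r = get_pages(chunk);
		if (r)
			break;              /* models the "goto unreserve_out" path */
		r = map_chunk(chunk);
		mapped_to_gpu = !r;         /* v2 placement: updated per iteration */
	}

	/* Chunk 1 failed, but the flag still reflects chunk 0's success. */
	printf("r=%d mapped_to_gpu=%d\n", r, mapped_to_gpu);
	return 0;
}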


* Re: [PATCH v2] drm/amdkfd: handle errors from svm validate and map
  2023-09-15 21:06   ` Chen, Xiaogang
@ 2023-09-15 21:20     ` Philip Yang
  2023-09-15 21:33       ` Chen, Xiaogang
  0 siblings, 1 reply; 18+ messages in thread
From: Philip Yang @ 2023-09-15 21:20 UTC (permalink / raw)
  To: Chen, Xiaogang, Philip Yang, amd-gfx
  Cc: alex.sierra, Felix.Kuehling, james.zhu

[-- Attachment #1: Type: text/html, Size: 11750 bytes --]


* Re: [PATCH v2] drm/amdkfd: handle errors from svm validate and map
  2023-09-15 21:20     ` Philip Yang
@ 2023-09-15 21:33       ` Chen, Xiaogang
  2023-09-18 13:27         ` Philip Yang
  0 siblings, 1 reply; 18+ messages in thread
From: Chen, Xiaogang @ 2023-09-15 21:33 UTC (permalink / raw)
  To: Philip Yang, Philip Yang, amd-gfx; +Cc: alex.sierra, Felix.Kuehling, james.zhu


On 9/15/2023 4:20 PM, Philip Yang wrote:
>
>
> On 2023-09-15 17:06, Chen, Xiaogang wrote:
>>
>> On 9/15/2023 8:28 AM, Philip Yang wrote:
>>>
>>> If new range is splited to multiple pranges with max_svm_range_pages
>>> alignment and added to update_list, svm validate and map should keep
>>> going after error to make sure maps_to_gpu flag is up to date for the
>>> whole range.
>>>
>>> svm validate and map update set prange->mapped_to_gpu after mapping to
>>> GPUs successfully, otherwise clear prange->mapped_to_gpu flag (for
>>> update mapping case) instead of setting error flag, we can remove
>>> the redundant error flag to simpliy code.
>>>
>>> Update prange->mapped_to_gpu flag inside svm_range_lock, to guarant we
>>> always set prange invalid flag to evict queues or unmap from GPUs 
>>> before
>>> the system memory is moved.
>>>
>>> After svm validate and map return error, the caller retry will update
>>> the mapping for the whole range.
>>>
>>> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from 
>>> svm validate and map")
>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 18 ++++++++----------
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
>>>   2 files changed, 8 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 5d7ba7dbf6ce..26018b1d6138 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -818,7 +818,7 @@ svm_range_is_same_attrs(struct kfd_process *p, 
>>> struct svm_range *prange,
>>>                  }
>>>          }
>>>
>>> -       return !prange->is_error_flag;
>>> +       return true;
>>>   }
>>>
>>>   /**
>>> @@ -1724,20 +1724,17 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>                                            ctx->bitmap, wait, 
>>> flush_tlb);
>>>
>>>   unlock_out:
>>> +               prange->mapped_to_gpu = !r;
>>
>> I do not understand why set prange->mapped_to_gpu here? It handles 
>> one vma, not for all prange. If there are multiple vma and first vma 
>> handle is ok, and second vma handle fail at 
>> amdgpu_hmm_range_get_pages or svm_range_dma_map, you would get 
>> prange->mapped_to_gpu be true, but only part of pragne got mapped, 
>> right?
>
> If all vmas map to the gpu successfully, prange->mapped_to_gpu is set to 
> true; otherwise prange->mapped_to_gpu is set to false, svm validate and 
> map to gpu returns an error, and the caller will retry if the error code 
> is -EAGAIN.
>
If the second vma fails in amdgpu_hmm_range_get_pages, 
prange->mapped_to_gpu would still be true, since the code would jump out 
of the loop, right?

Regards

Xiaogang

> Regards,
>
> Philip
>
>>
>>
>> Regards
>>
>> Xiaogang
>>
>>> svm_range_unlock(prange);
>>>
>>>                  addr = next;
>>>          }
>>>
>>> -       if (addr == end) {
>>> +       if (addr == end)
>>>                  prange->validated_once = true;
>>> -               prange->mapped_to_gpu = true;
>>> -       }
>>>
>>>   unreserve_out:
>>>          svm_range_unreserve_bos(ctx);
>>> -
>>> -       prange->is_error_flag = !!r;
>>>          if (!r)
>>>                  prange->validate_timestamp = ktime_get_boottime();
>>>
>>> @@ -2106,7 +2103,8 @@ svm_range_add(struct kfd_process *p, uint64_t 
>>> start, uint64_t size,
>>>                  next = interval_tree_iter_next(node, start, last);
>>>                  next_start = min(node->last, last) + 1;
>>>
>>> -               if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>>> +               if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
>>> +                   prange->mapped_to_gpu) {
>>>                          /* nothing to do */
>>>                  } else if (node->start < start || node->last > last) {
>>>                          /* node intersects the update range and its 
>>> attributes
>>> @@ -3519,7 +3517,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>          struct svm_range *next;
>>>          bool update_mapping = false;
>>>          bool flush_tlb;
>>> -       int r = 0;
>>> +       int r, ret = 0;
>>>
>>>          pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 
>>> 0x%llx\n",
>>>                   p->pasid, &p->svms, start, start + size - 1, size);
>>> @@ -3607,7 +3605,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>   out_unlock_range:
>>>                  mutex_unlock(&prange->migrate_mutex);
>>>                  if (r)
>>> -                       break;
>>> +                       ret = r;
>>>          }
>>>
>>>          dynamic_svm_range_dump(svms);
>>> @@ -3620,7 +3618,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>          pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, 
>>> r=%d\n", p->pasid,
>>>                   &p->svms, start, start + size - 1, r);
>>>
>>> -       return r;
>>> +       return ret ? ret : r;
>>>   }
>>>
>>>   static int
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> index 9e668eeefb32..91715bf3233c 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> @@ -134,7 +134,6 @@ struct svm_range {
>>>          DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
>>>          bool                            validated_once;
>>>          bool                            mapped_to_gpu;
>>> -       bool                            is_error_flag;
>>>   };
>>>
>>>   static inline void svm_range_lock(struct svm_range *prange)
>>> -- 
>>> 2.35.1
>>>


* Re: [PATCH v2] drm/amdkfd: handle errors from svm validate and map
  2023-09-15 21:33       ` Chen, Xiaogang
@ 2023-09-18 13:27         ` Philip Yang
  0 siblings, 0 replies; 18+ messages in thread
From: Philip Yang @ 2023-09-18 13:27 UTC (permalink / raw)
  To: Chen, Xiaogang, Philip Yang, amd-gfx
  Cc: alex.sierra, Felix.Kuehling, james.zhu

[-- Attachment #1: Type: text/html, Size: 13679 bytes --]


* [PATCH v3] drm/amdkfd: Handle errors from svm validate and map
  2023-09-13 15:16 [PATCH] drm/amdkfd: handle errors from svm validate and map Philip Yang
  2023-09-13 16:14 ` Felix Kuehling
  2023-09-15 13:28 ` [PATCH v2] " Philip Yang
@ 2023-09-19 14:21 ` Philip Yang
  2023-09-19 21:15   ` Felix Kuehling
  2023-09-20 15:45 ` [PATCH v4] " Philip Yang
  3 siblings, 1 reply; 18+ messages in thread
From: Philip Yang @ 2023-09-19 14:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: alex.sierra, Philip Yang, Felix.Kuehling, james.zhu

If a new range is split into multiple pranges at max_svm_range_pages
alignment and added to the update_list, svm validate and map should keep
going after an error to make sure the prange->mapped_to_gpu flag is up
to date for the whole range.

svm validate and map now sets prange->mapped_to_gpu after mapping to the
GPUs successfully, and otherwise clears the prange->mapped_to_gpu flag
(for the update mapping case) instead of setting an error flag, so we
can remove the redundant error flag to simplify the code.

Refactor to remove the gotos and to update the prange->mapped_to_gpu flag
inside svm_range_lock, to guarantee we always evict queues or unmap from
GPUs if there are invalid ranges.

After svm validate and map returns -EAGAIN, the caller's retry will
update the mapping for the whole range again.

Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 78 +++++++++++++---------------
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
 2 files changed, 36 insertions(+), 43 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 50c29fd844fb..4812f4ac5579 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -818,7 +818,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
 		}
 	}
 
-	return !prange->is_error_flag;
+	return true;
 }
 
 /**
@@ -1671,7 +1671,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
 
 	start = prange->start << PAGE_SHIFT;
 	end = (prange->last + 1) << PAGE_SHIFT;
-	for (addr = start; addr < end && !r; ) {
+	for (addr = start; !r && addr < end; ) {
 		struct hmm_range *hmm_range;
 		struct vm_area_struct *vma;
 		unsigned long next;
@@ -1680,62 +1680,55 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
 		bool readonly;
 
 		vma = vma_lookup(mm, addr);
-		if (!vma) {
+		if (vma) {
+			readonly = !(vma->vm_flags & VM_WRITE);
+
+			next = min(vma->vm_end, end);
+			npages = (next - addr) >> PAGE_SHIFT;
+			WRITE_ONCE(p->svms.faulting_task, current);
+			r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
+						       readonly, owner, NULL,
+						       &hmm_range);
+			WRITE_ONCE(p->svms.faulting_task, NULL);
+			if (r) {
+				pr_debug("failed %d to get svm range pages\n", r);
+				if (r == -EBUSY)
+					r = -EAGAIN;
+			}
+		} else {
 			r = -EFAULT;
-			goto unreserve_out;
-		}
-		readonly = !(vma->vm_flags & VM_WRITE);
-
-		next = min(vma->vm_end, end);
-		npages = (next - addr) >> PAGE_SHIFT;
-		WRITE_ONCE(p->svms.faulting_task, current);
-		r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
-					       readonly, owner, NULL,
-					       &hmm_range);
-		WRITE_ONCE(p->svms.faulting_task, NULL);
-		if (r) {
-			pr_debug("failed %d to get svm range pages\n", r);
-			if (r == -EBUSY)
-				r = -EAGAIN;
-			goto unreserve_out;
 		}
 
-		offset = (addr - start) >> PAGE_SHIFT;
-		r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
-				      hmm_range->hmm_pfns);
-		if (r) {
-			pr_debug("failed %d to dma map range\n", r);
-			goto unreserve_out;
+		if (!r) {
+			offset = (addr - start) >> PAGE_SHIFT;
+			r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
+					      hmm_range->hmm_pfns);
+			if (r)
+				pr_debug("failed %d to dma map range\n", r);
 		}
 
 		svm_range_lock(prange);
-		if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
+		if (!r && amdgpu_hmm_range_get_pages_done(hmm_range)) {
 			pr_debug("hmm update the range, need validate again\n");
 			r = -EAGAIN;
-			goto unlock_out;
 		}
-		if (!list_empty(&prange->child_list)) {
+
+		if (!r && !list_empty(&prange->child_list)) {
 			pr_debug("range split by unmap in parallel, validate again\n");
 			r = -EAGAIN;
-			goto unlock_out;
 		}
 
-		r = svm_range_map_to_gpus(prange, offset, npages, readonly,
-					  ctx->bitmap, wait, flush_tlb);
+		if (!r)
+			r = svm_range_map_to_gpus(prange, offset, npages, readonly,
+						  ctx->bitmap, wait, flush_tlb);
 
-unlock_out:
+		prange->mapped_to_gpu = !r;
 		svm_range_unlock(prange);
 
 		addr = next;
 	}
 
-	if (addr == end)
-		prange->mapped_to_gpu = true;
-
-unreserve_out:
 	svm_range_unreserve_bos(ctx);
-
-	prange->is_error_flag = !!r;
 	if (!r)
 		prange->validate_timestamp = ktime_get_boottime();
 
@@ -2104,7 +2097,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 		next = interval_tree_iter_next(node, start, last);
 		next_start = min(node->last, last) + 1;
 
-		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
+		if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
+		    prange->mapped_to_gpu) {
 			/* nothing to do */
 		} else if (node->start < start || node->last > last) {
 			/* node intersects the update range and its attributes
@@ -3517,7 +3511,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 	struct svm_range *next;
 	bool update_mapping = false;
 	bool flush_tlb;
-	int r = 0;
+	int r, ret = 0;
 
 	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
 		 p->pasid, &p->svms, start, start + size - 1, size);
@@ -3605,7 +3599,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 out_unlock_range:
 		mutex_unlock(&prange->migrate_mutex);
 		if (r)
-			break;
+			ret = r;
 	}
 
 	dynamic_svm_range_dump(svms);
@@ -3618,7 +3612,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", p->pasid,
 		 &p->svms, start, start + size - 1, r);
 
-	return r;
+	return ret ? ret : r;
 }
 
 static int
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index c216c8dd13c6..25f711905738 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -133,7 +133,6 @@ struct svm_range {
 	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
 	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
 	bool				mapped_to_gpu;
-	bool				is_error_flag;
 };
 
 static inline void svm_range_lock(struct svm_range *prange)
-- 
2.35.1



* Re: [PATCH v3] drm/amdkfd: Handle errors from svm validate and map
  2023-09-19 14:21 ` [PATCH v3] drm/amdkfd: Handle " Philip Yang
@ 2023-09-19 21:15   ` Felix Kuehling
  2023-09-20 14:20     ` Philip Yang
  0 siblings, 1 reply; 18+ messages in thread
From: Felix Kuehling @ 2023-09-19 21:15 UTC (permalink / raw)
  To: Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu


On 2023-09-19 10:21, Philip Yang wrote:
> If new range is splited to multiple pranges with max_svm_range_pages
> alignment and added to update_list, svm validate and map should keep
> going after error to make sure prange->mapped_to_gpu flag is up to date
> for the whole range.
>
> svm validate and map update set prange->mapped_to_gpu after mapping to
> GPUs successfully, otherwise clear prange->mapped_to_gpu flag (for
> update mapping case) instead of setting error flag, we can remove
> the redundant error flag to simpliy code.
>
> Refactor to remove goto and update prange->mapped_to_gpu flag inside
> svm_range_lock, to guarant we always evict queues or unmap from GPUs if
> there are invalid ranges.
>
> After svm validate and map return error -EAGIN, the caller retry will
> update the mapping for the whole range again.
>
> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 78 +++++++++++++---------------
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
>   2 files changed, 36 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 50c29fd844fb..4812f4ac5579 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -818,7 +818,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
>   		}
>   	}
>   
> -	return !prange->is_error_flag;
> +	return true;
>   }
>   
>   /**
> @@ -1671,7 +1671,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>   
>   	start = prange->start << PAGE_SHIFT;
>   	end = (prange->last + 1) << PAGE_SHIFT;
> -	for (addr = start; addr < end && !r; ) {
> +	for (addr = start; !r && addr < end; ) {
>   		struct hmm_range *hmm_range;
>   		struct vm_area_struct *vma;
>   		unsigned long next;
> @@ -1680,62 +1680,55 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>   		bool readonly;
>   
>   		vma = vma_lookup(mm, addr);
> -		if (!vma) {
> +		if (vma) {
> +			readonly = !(vma->vm_flags & VM_WRITE);
> +
> +			next = min(vma->vm_end, end);
> +			npages = (next - addr) >> PAGE_SHIFT;
> +			WRITE_ONCE(p->svms.faulting_task, current);
> +			r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
> +						       readonly, owner, NULL,
> +						       &hmm_range);
> +			WRITE_ONCE(p->svms.faulting_task, NULL);
> +			if (r) {
> +				pr_debug("failed %d to get svm range pages\n", r);
> +				if (r == -EBUSY)
> +					r = -EAGAIN;
> +			}
> +		} else {
>   			r = -EFAULT;
> -			goto unreserve_out;
> -		}
> -		readonly = !(vma->vm_flags & VM_WRITE);
> -
> -		next = min(vma->vm_end, end);
> -		npages = (next - addr) >> PAGE_SHIFT;
> -		WRITE_ONCE(p->svms.faulting_task, current);
> -		r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
> -					       readonly, owner, NULL,
> -					       &hmm_range);
> -		WRITE_ONCE(p->svms.faulting_task, NULL);
> -		if (r) {
> -			pr_debug("failed %d to get svm range pages\n", r);
> -			if (r == -EBUSY)
> -				r = -EAGAIN;
> -			goto unreserve_out;
>   		}
>   
> -		offset = (addr - start) >> PAGE_SHIFT;
> -		r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
> -				      hmm_range->hmm_pfns);
> -		if (r) {
> -			pr_debug("failed %d to dma map range\n", r);
> -			goto unreserve_out;
> +		if (!r) {
> +			offset = (addr - start) >> PAGE_SHIFT;
> +			r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
> +					      hmm_range->hmm_pfns);
> +			if (r)
> +				pr_debug("failed %d to dma map range\n", r);
>   		}
>   
>   		svm_range_lock(prange);
> -		if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
> +		if (!r && amdgpu_hmm_range_get_pages_done(hmm_range)) {
>   			pr_debug("hmm update the range, need validate again\n");
>   			r = -EAGAIN;
> -			goto unlock_out;
>   		}
> -		if (!list_empty(&prange->child_list)) {
> +
> +		if (!r && !list_empty(&prange->child_list)) {
>   			pr_debug("range split by unmap in parallel, validate again\n");
>   			r = -EAGAIN;
> -			goto unlock_out;
>   		}
>   
> -		r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> -					  ctx->bitmap, wait, flush_tlb);
> +		if (!r)
> +			r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> +						  ctx->bitmap, wait, flush_tlb);
>   
> -unlock_out:
> +		prange->mapped_to_gpu = !r;

I'm still concerned that this can update prange->mapped_to_gpu to "true" 
before the entire range has been successfully mapped. This could cause 
race conditions if someone looks at this variable while a 
validate_and_map is in progress. This would avoid such race conditions:

		if (!r && next == end)
			prange->mapped_to_gpu = true;

Regards,
   Felix


>   		svm_range_unlock(prange);
>   
>   		addr = next;
>   	}
>   
> -	if (addr == end)
> -		prange->mapped_to_gpu = true;
> -
> -unreserve_out:
>   	svm_range_unreserve_bos(ctx);
> -
> -	prange->is_error_flag = !!r;
>   	if (!r)
>   		prange->validate_timestamp = ktime_get_boottime();
>   
> @@ -2104,7 +2097,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
>   		next = interval_tree_iter_next(node, start, last);
>   		next_start = min(node->last, last) + 1;
>   
> -		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
> +		if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
> +		    prange->mapped_to_gpu) {
>   			/* nothing to do */
>   		} else if (node->start < start || node->last > last) {
>   			/* node intersects the update range and its attributes
> @@ -3517,7 +3511,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   	struct svm_range *next;
>   	bool update_mapping = false;
>   	bool flush_tlb;
> -	int r = 0;
> +	int r, ret = 0;
>   
>   	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
>   		 p->pasid, &p->svms, start, start + size - 1, size);
> @@ -3605,7 +3599,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   out_unlock_range:
>   		mutex_unlock(&prange->migrate_mutex);
>   		if (r)
> -			break;
> +			ret = r;
>   	}
>   
>   	dynamic_svm_range_dump(svms);
> @@ -3618,7 +3612,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", p->pasid,
>   		 &p->svms, start, start + size - 1, r);
>   
> -	return r;
> +	return ret ? ret : r;
>   }
>   
>   static int
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> index c216c8dd13c6..25f711905738 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> @@ -133,7 +133,6 @@ struct svm_range {
>   	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
>   	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
>   	bool				mapped_to_gpu;
> -	bool				is_error_flag;
>   };
>   
>   static inline void svm_range_lock(struct svm_range *prange)


* Re: [PATCH v3] drm/amdkfd: Handle errors from svm validate and map
  2023-09-19 21:15   ` Felix Kuehling
@ 2023-09-20 14:20     ` Philip Yang
  2023-09-20 14:35       ` Felix Kuehling
  0 siblings, 1 reply; 18+ messages in thread
From: Philip Yang @ 2023-09-20 14:20 UTC (permalink / raw)
  To: Felix Kuehling, Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu

[-- Attachment #1: Type: text/html, Size: 18664 bytes --]


* Re: [PATCH v3] drm/amdkfd: Handle errors from svm validate and map
  2023-09-20 14:20     ` Philip Yang
@ 2023-09-20 14:35       ` Felix Kuehling
  2023-09-20 15:38         ` Philip Yang
  0 siblings, 1 reply; 18+ messages in thread
From: Felix Kuehling @ 2023-09-20 14:35 UTC (permalink / raw)
  To: Philip Yang, Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu


On 2023-09-20 10:20, Philip Yang wrote:
>
>
> On 2023-09-19 17:15, Felix Kuehling wrote:
>>
>> On 2023-09-19 10:21, Philip Yang wrote:
>>> If new range is splited to multiple pranges with max_svm_range_pages
>>> alignment and added to update_list, svm validate and map should keep
>>> going after error to make sure prange->mapped_to_gpu flag is up to date
>>> for the whole range.
>>>
>>> svm validate and map update set prange->mapped_to_gpu after mapping to
>>> GPUs successfully, otherwise clear prange->mapped_to_gpu flag (for
>>> update mapping case) instead of setting error flag, we can remove
>>> the redundant error flag to simpliy code.
>>>
>>> Refactor to remove goto and update prange->mapped_to_gpu flag inside
>>> svm_range_lock, to guarant we always evict queues or unmap from GPUs if
>>> there are invalid ranges.
>>>
>>> After svm validate and map return error -EAGIN, the caller retry will
>>> update the mapping for the whole range again.
>>>
>>> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from 
>>> svm validate and map")
>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 78 
>>> +++++++++++++---------------
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
>>>   2 files changed, 36 insertions(+), 43 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 50c29fd844fb..4812f4ac5579 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -818,7 +818,7 @@ svm_range_is_same_attrs(struct kfd_process *p, 
>>> struct svm_range *prange,
>>>           }
>>>       }
>>>   -    return !prange->is_error_flag;
>>> +    return true;
>>>   }
>>>     /**
>>> @@ -1671,7 +1671,7 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>         start = prange->start << PAGE_SHIFT;
>>>       end = (prange->last + 1) << PAGE_SHIFT;
>>> -    for (addr = start; addr < end && !r; ) {
>>> +    for (addr = start; !r && addr < end; ) {
>>>           struct hmm_range *hmm_range;
>>>           struct vm_area_struct *vma;
>>>           unsigned long next;
>>> @@ -1680,62 +1680,55 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>           bool readonly;
>>>             vma = vma_lookup(mm, addr);
>>> -        if (!vma) {
>>> +        if (vma) {
>>> +            readonly = !(vma->vm_flags & VM_WRITE);
>>> +
>>> +            next = min(vma->vm_end, end);
>>> +            npages = (next - addr) >> PAGE_SHIFT;
>>> +            WRITE_ONCE(p->svms.faulting_task, current);
>>> +            r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, 
>>> npages,
>>> +                               readonly, owner, NULL,
>>> +                               &hmm_range);
>>> +            WRITE_ONCE(p->svms.faulting_task, NULL);
>>> +            if (r) {
>>> +                pr_debug("failed %d to get svm range pages\n", r);
>>> +                if (r == -EBUSY)
>>> +                    r = -EAGAIN;
>>> +            }
>>> +        } else {
>>>               r = -EFAULT;
>>> -            goto unreserve_out;
>>> -        }
>>> -        readonly = !(vma->vm_flags & VM_WRITE);
>>> -
>>> -        next = min(vma->vm_end, end);
>>> -        npages = (next - addr) >> PAGE_SHIFT;
>>> -        WRITE_ONCE(p->svms.faulting_task, current);
>>> -        r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, 
>>> npages,
>>> -                           readonly, owner, NULL,
>>> -                           &hmm_range);
>>> -        WRITE_ONCE(p->svms.faulting_task, NULL);
>>> -        if (r) {
>>> -            pr_debug("failed %d to get svm range pages\n", r);
>>> -            if (r == -EBUSY)
>>> -                r = -EAGAIN;
>>> -            goto unreserve_out;
>>>           }
>>>   -        offset = (addr - start) >> PAGE_SHIFT;
>>> -        r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
>>> -                      hmm_range->hmm_pfns);
>>> -        if (r) {
>>> -            pr_debug("failed %d to dma map range\n", r);
>>> -            goto unreserve_out;
>>> +        if (!r) {
>>> +            offset = (addr - start) >> PAGE_SHIFT;
>>> +            r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
>>> +                          hmm_range->hmm_pfns);
>>> +            if (r)
>>> +                pr_debug("failed %d to dma map range\n", r);
>>>           }
>>>             svm_range_lock(prange);
>>> -        if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
>>> +        if (!r && amdgpu_hmm_range_get_pages_done(hmm_range)) {
>>>               pr_debug("hmm update the range, need validate again\n");
>>>               r = -EAGAIN;
>>> -            goto unlock_out;
>>>           }
>>> -        if (!list_empty(&prange->child_list)) {
>>> +
>>> +        if (!r && !list_empty(&prange->child_list)) {
>>>               pr_debug("range split by unmap in parallel, validate 
>>> again\n");
>>>               r = -EAGAIN;
>>> -            goto unlock_out;
>>>           }
>>>   -        r = svm_range_map_to_gpus(prange, offset, npages, readonly,
>>> -                      ctx->bitmap, wait, flush_tlb);
>>> +        if (!r)
>>> +            r = svm_range_map_to_gpus(prange, offset, npages, 
>>> readonly,
>>> +                          ctx->bitmap, wait, flush_tlb);
>>>   -unlock_out:
>>> +        prange->mapped_to_gpu = !r;
>>
>> I'm still concerned that this can update prange->mapped_to_gpu to 
>> "true" before the entire range has been successfully mapped. This 
>> could cause race conditions if someone looks at this variable while a 
>> validate_and_map is in progress. This would avoid such race conditions:
>>
>>         if (!r && next == end)
>>             prange->mapped_to_gpu = true;
>>
> thanks, will also add an else path for error handling, to exit the loop 
> with the correct flag.
>
>           else if (r)
>
>              prange->mapped_to_gpu = false;
>
I thought about that. I think the flag should be false going into the 
function. There should be no need to validate and map the range if it's 
already mapped and valid. So if anything, we should set the flag to 
false in the beginning.

Regards,
   Felix


> Regards,
>
> Philip
>
>> Regards,
>>   Felix
>>
>>
>>>           svm_range_unlock(prange);
>>>             addr = next;
>>>       }
>>>   -    if (addr == end)
>>> -        prange->mapped_to_gpu = true;
>>> -
>>> -unreserve_out:
>>>       svm_range_unreserve_bos(ctx);
>>> -
>>> -    prange->is_error_flag = !!r;
>>>       if (!r)
>>>           prange->validate_timestamp = ktime_get_boottime();
>>>   @@ -2104,7 +2097,8 @@ svm_range_add(struct kfd_process *p, 
>>> uint64_t start, uint64_t size,
>>>           next = interval_tree_iter_next(node, start, last);
>>>           next_start = min(node->last, last) + 1;
>>>   -        if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>>> +        if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
>>> +            prange->mapped_to_gpu) {
>>>               /* nothing to do */
>>>           } else if (node->start < start || node->last > last) {
>>>               /* node intersects the update range and its attributes
>>> @@ -3517,7 +3511,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>       struct svm_range *next;
>>>       bool update_mapping = false;
>>>       bool flush_tlb;
>>> -    int r = 0;
>>> +    int r, ret = 0;
>>>         pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
>>>            p->pasid, &p->svms, start, start + size - 1, size);
>>> @@ -3605,7 +3599,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>   out_unlock_range:
>>>           mutex_unlock(&prange->migrate_mutex);
>>>           if (r)
>>> -            break;
>>> +            ret = r;
>>>       }
>>>         dynamic_svm_range_dump(svms);
>>> @@ -3618,7 +3612,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>       pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", 
>>> p->pasid,
>>>            &p->svms, start, start + size - 1, r);
>>>   -    return r;
>>> +    return ret ? ret : r;
>>>   }
>>>     static int
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> index c216c8dd13c6..25f711905738 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> @@ -133,7 +133,6 @@ struct svm_range {
>>>       DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
>>>       DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
>>>       bool                mapped_to_gpu;
>>> -    bool                is_error_flag;
>>>   };
>>>     static inline void svm_range_lock(struct svm_range *prange)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] drm/amdkfd: Handle errors from svm validate and map
  2023-09-20 14:35       ` Felix Kuehling
@ 2023-09-20 15:38         ` Philip Yang
  0 siblings, 0 replies; 18+ messages in thread
From: Philip Yang @ 2023-09-20 15:38 UTC (permalink / raw)
  To: Felix Kuehling, Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu

[-- Attachment #1: Type: text/html, Size: 21459 bytes --]


* [PATCH v4] drm/amdkfd: Handle errors from svm validate and map
  2023-09-13 15:16 [PATCH] drm/amdkfd: handle errors from svm validate and map Philip Yang
                   ` (2 preceding siblings ...)
  2023-09-19 14:21 ` [PATCH v3] drm/amdkfd: Handle " Philip Yang
@ 2023-09-20 15:45 ` Philip Yang
  2023-09-21 19:32   ` Felix Kuehling
  2023-09-25 13:39   ` James Zhu
  3 siblings, 2 replies; 18+ messages in thread
From: Philip Yang @ 2023-09-20 15:45 UTC (permalink / raw)
  To: amd-gfx; +Cc: alex.sierra, Philip Yang, Felix.Kuehling, james.zhu

If a new range is split into multiple pranges at max_svm_range_pages
alignment and added to the update_list, svm validate and map should keep
going after an error to make sure the prange->mapped_to_gpu flag is up
to date for the whole range.

svm validate and map now sets prange->mapped_to_gpu after mapping to the
GPUs successfully, and otherwise clears the prange->mapped_to_gpu flag
(for the update mapping case) instead of setting an error flag, so we
can remove the redundant error flag to simplify the code.

Refactor to remove the gotos and to update the prange->mapped_to_gpu flag
inside svm_range_lock, to guarantee we always evict queues or unmap from
GPUs if there are invalid ranges.

After svm validate and map returns -EAGAIN, the caller's retry will
update the mapping for the whole range again.

Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 80 +++++++++++++---------------
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
 2 files changed, 38 insertions(+), 43 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index fb55cf80d74e..0b6a70171320 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -827,7 +827,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
 		}
 	}
 
-	return !prange->is_error_flag;
+	return true;
 }
 
 /**
@@ -1680,7 +1680,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
 
 	start = prange->start << PAGE_SHIFT;
 	end = (prange->last + 1) << PAGE_SHIFT;
-	for (addr = start; addr < end && !r; ) {
+	for (addr = start; !r && addr < end; ) {
 		struct hmm_range *hmm_range;
 		struct vm_area_struct *vma;
 		unsigned long next;
@@ -1689,62 +1689,57 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
 		bool readonly;
 
 		vma = vma_lookup(mm, addr);
-		if (!vma) {
+		if (vma) {
+			readonly = !(vma->vm_flags & VM_WRITE);
+
+			next = min(vma->vm_end, end);
+			npages = (next - addr) >> PAGE_SHIFT;
+			WRITE_ONCE(p->svms.faulting_task, current);
+			r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
+						       readonly, owner, NULL,
+						       &hmm_range);
+			WRITE_ONCE(p->svms.faulting_task, NULL);
+			if (r) {
+				pr_debug("failed %d to get svm range pages\n", r);
+				if (r == -EBUSY)
+					r = -EAGAIN;
+			}
+		} else {
 			r = -EFAULT;
-			goto unreserve_out;
-		}
-		readonly = !(vma->vm_flags & VM_WRITE);
-
-		next = min(vma->vm_end, end);
-		npages = (next - addr) >> PAGE_SHIFT;
-		WRITE_ONCE(p->svms.faulting_task, current);
-		r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
-					       readonly, owner, NULL,
-					       &hmm_range);
-		WRITE_ONCE(p->svms.faulting_task, NULL);
-		if (r) {
-			pr_debug("failed %d to get svm range pages\n", r);
-			if (r == -EBUSY)
-				r = -EAGAIN;
-			goto unreserve_out;
 		}
 
-		offset = (addr - start) >> PAGE_SHIFT;
-		r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
-				      hmm_range->hmm_pfns);
-		if (r) {
-			pr_debug("failed %d to dma map range\n", r);
-			goto unreserve_out;
+		if (!r) {
+			offset = (addr - start) >> PAGE_SHIFT;
+			r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
+					      hmm_range->hmm_pfns);
+			if (r)
+				pr_debug("failed %d to dma map range\n", r);
 		}
 
 		svm_range_lock(prange);
-		if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
+		if (!r && amdgpu_hmm_range_get_pages_done(hmm_range)) {
 			pr_debug("hmm update the range, need validate again\n");
 			r = -EAGAIN;
-			goto unlock_out;
 		}
-		if (!list_empty(&prange->child_list)) {
+
+		if (!r && !list_empty(&prange->child_list)) {
 			pr_debug("range split by unmap in parallel, validate again\n");
 			r = -EAGAIN;
-			goto unlock_out;
 		}
 
-		r = svm_range_map_to_gpus(prange, offset, npages, readonly,
-					  ctx->bitmap, wait, flush_tlb);
+		if (!r)
+			r = svm_range_map_to_gpus(prange, offset, npages, readonly,
+						  ctx->bitmap, wait, flush_tlb);
+
+		if (!r && next == end)
+			prange->mapped_to_gpu = true;
 
-unlock_out:
 		svm_range_unlock(prange);
 
 		addr = next;
 	}
 
-	if (addr == end)
-		prange->mapped_to_gpu = true;
-
-unreserve_out:
 	svm_range_unreserve_bos(ctx);
-
-	prange->is_error_flag = !!r;
 	if (!r)
 		prange->validate_timestamp = ktime_get_boottime();
 
@@ -2113,7 +2108,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 		next = interval_tree_iter_next(node, start, last);
 		next_start = min(node->last, last) + 1;
 
-		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
+		if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
+		    prange->mapped_to_gpu) {
 			/* nothing to do */
 		} else if (node->start < start || node->last > last) {
 			/* node intersects the update range and its attributes
@@ -3526,7 +3522,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 	struct svm_range *next;
 	bool update_mapping = false;
 	bool flush_tlb;
-	int r = 0;
+	int r, ret = 0;
 
 	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
 		 p->pasid, &p->svms, start, start + size - 1, size);
@@ -3614,7 +3610,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 out_unlock_range:
 		mutex_unlock(&prange->migrate_mutex);
 		if (r)
-			break;
+			ret = r;
 	}
 
 	dynamic_svm_range_dump(svms);
@@ -3627,7 +3623,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", p->pasid,
 		 &p->svms, start, start + size - 1, r);
 
-	return r;
+	return ret ? ret : r;
 }
 
 static int
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 5fd958a97a28..c528df1d0ba2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -133,7 +133,6 @@ struct svm_range {
 	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
 	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
 	bool				mapped_to_gpu;
-	bool				is_error_flag;
 };
 
 static inline void svm_range_lock(struct svm_range *prange)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] drm/amdkfd: Handle errors from svm validate and map
  2023-09-20 15:45 ` [PATCH v4] " Philip Yang
@ 2023-09-21 19:32   ` Felix Kuehling
  2023-09-25 13:39   ` James Zhu
  1 sibling, 0 replies; 18+ messages in thread
From: Felix Kuehling @ 2023-09-21 19:32 UTC (permalink / raw)
  To: Philip Yang, amd-gfx; +Cc: alex.sierra, james.zhu

On 2023-09-20 11:45, Philip Yang wrote:
> If a new range is split into multiple pranges with max_svm_range_pages
> alignment and added to update_list, svm validate and map should keep
> going after an error to make sure the prange->mapped_to_gpu flag is up
> to date for the whole range.
>
> svm validate and map sets prange->mapped_to_gpu after mapping to GPUs
> successfully, otherwise it clears prange->mapped_to_gpu (for the update
> mapping case) instead of setting an error flag, so the redundant error
> flag can be removed to simplify the code.
>
> Refactor to remove the goto and update the prange->mapped_to_gpu flag
> inside svm_range_lock, to guarantee we always evict queues or unmap
> from GPUs if there are invalid ranges.
>
> After svm validate and map returns -EAGAIN, the caller's retry will
> update the mapping for the whole range again.
>
> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 80 +++++++++++++---------------
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
>   2 files changed, 38 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index fb55cf80d74e..0b6a70171320 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -827,7 +827,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
>   		}
>   	}
>   
> -	return !prange->is_error_flag;
> +	return true;
>   }
>   
>   /**
> @@ -1680,7 +1680,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>   
>   	start = prange->start << PAGE_SHIFT;
>   	end = (prange->last + 1) << PAGE_SHIFT;
> -	for (addr = start; addr < end && !r; ) {
> +	for (addr = start; !r && addr < end; ) {
>   		struct hmm_range *hmm_range;
>   		struct vm_area_struct *vma;
>   		unsigned long next;
> @@ -1689,62 +1689,57 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>   		bool readonly;
>   
>   		vma = vma_lookup(mm, addr);
> -		if (!vma) {
> +		if (vma) {
> +			readonly = !(vma->vm_flags & VM_WRITE);
> +
> +			next = min(vma->vm_end, end);
> +			npages = (next - addr) >> PAGE_SHIFT;
> +			WRITE_ONCE(p->svms.faulting_task, current);
> +			r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
> +						       readonly, owner, NULL,
> +						       &hmm_range);
> +			WRITE_ONCE(p->svms.faulting_task, NULL);
> +			if (r) {
> +				pr_debug("failed %d to get svm range pages\n", r);
> +				if (r == -EBUSY)
> +					r = -EAGAIN;
> +			}
> +		} else {
>   			r = -EFAULT;
> -			goto unreserve_out;
> -		}
> -		readonly = !(vma->vm_flags & VM_WRITE);
> -
> -		next = min(vma->vm_end, end);
> -		npages = (next - addr) >> PAGE_SHIFT;
> -		WRITE_ONCE(p->svms.faulting_task, current);
> -		r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
> -					       readonly, owner, NULL,
> -					       &hmm_range);
> -		WRITE_ONCE(p->svms.faulting_task, NULL);
> -		if (r) {
> -			pr_debug("failed %d to get svm range pages\n", r);
> -			if (r == -EBUSY)
> -				r = -EAGAIN;
> -			goto unreserve_out;
>   		}
>   
> -		offset = (addr - start) >> PAGE_SHIFT;
> -		r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
> -				      hmm_range->hmm_pfns);
> -		if (r) {
> -			pr_debug("failed %d to dma map range\n", r);
> -			goto unreserve_out;
> +		if (!r) {
> +			offset = (addr - start) >> PAGE_SHIFT;
> +			r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
> +					      hmm_range->hmm_pfns);
> +			if (r)
> +				pr_debug("failed %d to dma map range\n", r);
>   		}
>   
>   		svm_range_lock(prange);
> -		if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
> +		if (!r && amdgpu_hmm_range_get_pages_done(hmm_range)) {
>   			pr_debug("hmm update the range, need validate again\n");
>   			r = -EAGAIN;
> -			goto unlock_out;
>   		}
> -		if (!list_empty(&prange->child_list)) {
> +
> +		if (!r && !list_empty(&prange->child_list)) {
>   			pr_debug("range split by unmap in parallel, validate again\n");
>   			r = -EAGAIN;
> -			goto unlock_out;
>   		}
>   
> -		r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> -					  ctx->bitmap, wait, flush_tlb);
> +		if (!r)
> +			r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> +						  ctx->bitmap, wait, flush_tlb);
> +
> +		if (!r && next == end)
> +			prange->mapped_to_gpu = true;
>   
> -unlock_out:
>   		svm_range_unlock(prange);
>   
>   		addr = next;
>   	}
>   
> -	if (addr == end)
> -		prange->mapped_to_gpu = true;
> -
> -unreserve_out:
>   	svm_range_unreserve_bos(ctx);
> -
> -	prange->is_error_flag = !!r;
>   	if (!r)
>   		prange->validate_timestamp = ktime_get_boottime();
>   
> @@ -2113,7 +2108,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
>   		next = interval_tree_iter_next(node, start, last);
>   		next_start = min(node->last, last) + 1;
>   
> -		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
> +		if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
> +		    prange->mapped_to_gpu) {
>   			/* nothing to do */
>   		} else if (node->start < start || node->last > last) {
>   			/* node intersects the update range and its attributes
> @@ -3526,7 +3522,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   	struct svm_range *next;
>   	bool update_mapping = false;
>   	bool flush_tlb;
> -	int r = 0;
> +	int r, ret = 0;
>   
>   	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
>   		 p->pasid, &p->svms, start, start + size - 1, size);
> @@ -3614,7 +3610,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   out_unlock_range:
>   		mutex_unlock(&prange->migrate_mutex);
>   		if (r)
> -			break;
> +			ret = r;
>   	}
>   
>   	dynamic_svm_range_dump(svms);
> @@ -3627,7 +3623,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", p->pasid,
>   		 &p->svms, start, start + size - 1, r);
>   
> -	return r;
> +	return ret ? ret : r;
>   }
>   
>   static int
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> index 5fd958a97a28..c528df1d0ba2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> @@ -133,7 +133,6 @@ struct svm_range {
>   	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
>   	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
>   	bool				mapped_to_gpu;
> -	bool				is_error_flag;
>   };
>   
>   static inline void svm_range_lock(struct svm_range *prange)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] drm/amdkfd: Handle errors from svm validate and map
  2023-09-20 15:45 ` [PATCH v4] " Philip Yang
  2023-09-21 19:32   ` Felix Kuehling
@ 2023-09-25 13:39   ` James Zhu
  1 sibling, 0 replies; 18+ messages in thread
From: James Zhu @ 2023-09-25 13:39 UTC (permalink / raw)
  To: Philip Yang, amd-gfx; +Cc: alex.sierra, Felix.Kuehling, james.zhu


Tested-by: James Zhu <James.Zhu@amd.com> for this patch


James Zhu
On 2023-09-20 11:45, Philip Yang wrote:
> If a new range is split into multiple pranges with max_svm_range_pages
> alignment and added to update_list, svm validate and map should keep
> going after an error to make sure the prange->mapped_to_gpu flag is up
> to date for the whole range.
>
> svm validate and map sets prange->mapped_to_gpu after mapping to GPUs
> successfully, otherwise it clears prange->mapped_to_gpu (for the update
> mapping case) instead of setting an error flag, so the redundant error
> flag can be removed to simplify the code.
>
> Refactor to remove the goto and update the prange->mapped_to_gpu flag
> inside svm_range_lock, to guarantee we always evict queues or unmap
> from GPUs if there are invalid ranges.
>
> After svm validate and map returns -EAGAIN, the caller's retry will
> update the mapping for the whole range again.
>
> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from svm validate and map")
> Signed-off-by: Philip Yang<Philip.Yang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 80 +++++++++++++---------------
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
>   2 files changed, 38 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index fb55cf80d74e..0b6a70171320 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -827,7 +827,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
>   		}
>   	}
>   
> -	return !prange->is_error_flag;
> +	return true;
>   }
>   
>   /**
> @@ -1680,7 +1680,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>   
>   	start = prange->start << PAGE_SHIFT;
>   	end = (prange->last + 1) << PAGE_SHIFT;
> -	for (addr = start; addr < end && !r; ) {
> +	for (addr = start; !r && addr < end; ) {
>   		struct hmm_range *hmm_range;
>   		struct vm_area_struct *vma;
>   		unsigned long next;
> @@ -1689,62 +1689,57 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>   		bool readonly;
>   
>   		vma = vma_lookup(mm, addr);
> -		if (!vma) {
> +		if (vma) {
> +			readonly = !(vma->vm_flags & VM_WRITE);
> +
> +			next = min(vma->vm_end, end);
> +			npages = (next - addr) >> PAGE_SHIFT;
> +			WRITE_ONCE(p->svms.faulting_task, current);
> +			r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
> +						       readonly, owner, NULL,
> +						       &hmm_range);
> +			WRITE_ONCE(p->svms.faulting_task, NULL);
> +			if (r) {
> +				pr_debug("failed %d to get svm range pages\n", r);
> +				if (r == -EBUSY)
> +					r = -EAGAIN;
> +			}
> +		} else {
>   			r = -EFAULT;
> -			goto unreserve_out;
> -		}
> -		readonly = !(vma->vm_flags & VM_WRITE);
> -
> -		next = min(vma->vm_end, end);
> -		npages = (next - addr) >> PAGE_SHIFT;
> -		WRITE_ONCE(p->svms.faulting_task, current);
> -		r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
> -					       readonly, owner, NULL,
> -					       &hmm_range);
> -		WRITE_ONCE(p->svms.faulting_task, NULL);
> -		if (r) {
> -			pr_debug("failed %d to get svm range pages\n", r);
> -			if (r == -EBUSY)
> -				r = -EAGAIN;
> -			goto unreserve_out;
>   		}
>   
> -		offset = (addr - start) >> PAGE_SHIFT;
> -		r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
> -				      hmm_range->hmm_pfns);
> -		if (r) {
> -			pr_debug("failed %d to dma map range\n", r);
> -			goto unreserve_out;
> +		if (!r) {
> +			offset = (addr - start) >> PAGE_SHIFT;
> +			r = svm_range_dma_map(prange, ctx->bitmap, offset, npages,
> +					      hmm_range->hmm_pfns);
> +			if (r)
> +				pr_debug("failed %d to dma map range\n", r);
>   		}
>   
>   		svm_range_lock(prange);
> -		if (amdgpu_hmm_range_get_pages_done(hmm_range)) {
> +		if (!r && amdgpu_hmm_range_get_pages_done(hmm_range)) {
>   			pr_debug("hmm update the range, need validate again\n");
>   			r = -EAGAIN;
> -			goto unlock_out;
>   		}
> -		if (!list_empty(&prange->child_list)) {
> +
> +		if (!r && !list_empty(&prange->child_list)) {
>   			pr_debug("range split by unmap in parallel, validate again\n");
>   			r = -EAGAIN;
> -			goto unlock_out;
>   		}
>   
> -		r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> -					  ctx->bitmap, wait, flush_tlb);
> +		if (!r)
> +			r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> +						  ctx->bitmap, wait, flush_tlb);
> +
> +		if (!r && next == end)
> +			prange->mapped_to_gpu = true;
>   
> -unlock_out:
>   		svm_range_unlock(prange);
>   
>   		addr = next;
>   	}
>   
> -	if (addr == end)
> -		prange->mapped_to_gpu = true;
> -
> -unreserve_out:
>   	svm_range_unreserve_bos(ctx);
> -
> -	prange->is_error_flag = !!r;
>   	if (!r)
>   		prange->validate_timestamp = ktime_get_boottime();
>   
> @@ -2113,7 +2108,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
>   		next = interval_tree_iter_next(node, start, last);
>   		next_start = min(node->last, last) + 1;
>   
> -		if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
> +		if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
> +		    prange->mapped_to_gpu) {
>   			/* nothing to do */
>   		} else if (node->start < start || node->last > last) {
>   			/* node intersects the update range and its attributes
> @@ -3526,7 +3522,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   	struct svm_range *next;
>   	bool update_mapping = false;
>   	bool flush_tlb;
> -	int r = 0;
> +	int r, ret = 0;
>   
>   	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
>   		 p->pasid, &p->svms, start, start + size - 1, size);
> @@ -3614,7 +3610,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   out_unlock_range:
>   		mutex_unlock(&prange->migrate_mutex);
>   		if (r)
> -			break;
> +			ret = r;
>   	}
>   
>   	dynamic_svm_range_dump(svms);
> @@ -3627,7 +3623,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>   	pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, r=%d\n", p->pasid,
>   		 &p->svms, start, start + size - 1, r);
>   
> -	return r;
> +	return ret ? ret : r;
>   }
>   
>   static int
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> index 5fd958a97a28..c528df1d0ba2 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
> @@ -133,7 +133,6 @@ struct svm_range {
>   	DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
>   	DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
>   	bool				mapped_to_gpu;
> -	bool				is_error_flag;
>   };
>   
>   static inline void svm_range_lock(struct svm_range *prange)


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-09-25 13:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-13 15:16 [PATCH] drm/amdkfd: handle errors from svm validate and map Philip Yang
2023-09-13 16:14 ` Felix Kuehling
2023-09-13 17:24   ` Philip Yang
2023-09-13 17:33   ` Philip Yang
2023-09-13 18:27     ` Felix Kuehling
2023-09-15 13:28 ` [PATCH v2] " Philip Yang
2023-09-15 21:06   ` Chen, Xiaogang
2023-09-15 21:20     ` Philip Yang
2023-09-15 21:33       ` Chen, Xiaogang
2023-09-18 13:27         ` Philip Yang
2023-09-19 14:21 ` [PATCH v3] drm/amdkfd: Handle " Philip Yang
2023-09-19 21:15   ` Felix Kuehling
2023-09-20 14:20     ` Philip Yang
2023-09-20 14:35       ` Felix Kuehling
2023-09-20 15:38         ` Philip Yang
2023-09-20 15:45 ` [PATCH v4] " Philip Yang
2023-09-21 19:32   ` Felix Kuehling
2023-09-25 13:39   ` James Zhu
