* [PATCH 1/6] drm/amdkfd: retry validation to recover range @ 2021-04-20 20:21 Philip Yang 2021-04-20 20:21 ` [PATCH 2/6] drm/amdgpu: return IH ring drain finished if ring is empty Philip Yang ` (3 more replies) 0 siblings, 4 replies; 13+ messages in thread From: Philip Yang @ 2021-04-20 20:21 UTC (permalink / raw) To: amd-gfx; +Cc: Philip Yang GPU vm retry fault recover range need retry validation if 1. range is split in parallel by unmap while recover 2. range migrate to system memory and range is updated in system memory while recover Signed-off-by: Philip Yang <Philip.Yang@amd.com> --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 0e0b4ffd20ab..40ef5709d0a7 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1402,11 +1402,13 @@ static int svm_range_validate_and_map(struct mm_struct *mm, svm_range_lock(prange); if (!prange->actual_loc) { if (amdgpu_hmm_range_get_pages_done(hmm_range)) { + pr_debug("hmm update the range, need validate again\n"); r = -EAGAIN; goto unlock_out; } } if (!list_empty(&prange->child_list)) { + pr_debug("range split by unmap in parallel, validate again\n"); r = -EAGAIN; goto unlock_out; } @@ -2254,6 +2256,10 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, out: kfd_unref_process(p); + if (r == -EAGAIN) { + pr_debug("recover vm fault later\n"); + r = 0; + } return r; } -- 2.17.1 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/6] drm/amdgpu: return IH ring drain finished if ring is empty 2021-04-20 20:21 [PATCH 1/6] drm/amdkfd: retry validation to recover range Philip Yang @ 2021-04-20 20:21 ` Philip Yang 2021-04-20 20:21 ` [PATCH 3/6] drm/amdkfd: handle stale retry fault Philip Yang ` (2 subsequent siblings) 3 siblings, 0 replies; 13+ messages in thread From: Philip Yang @ 2021-04-20 20:21 UTC (permalink / raw) To: amd-gfx; +Cc: Philip Yang Sometimes IH do not setup ring wptr overflow flag after wptr exceed rptr. As a workaround, if IH rptr equals to wptr, ring is empty, return true to indicate IH ring checkpoint is processed, IH ring drain is finished. Signed-off-by: Philip Yang <Philip.Yang@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c index faaa6aa2faaf..a36e191cf086 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c @@ -175,7 +175,9 @@ static bool amdgpu_ih_has_checkpoint_processed(struct amdgpu_device *adev, cur_rptr += ih->ptr_mask + 1; *prev_rptr = cur_rptr; - return cur_rptr >= checkpoint_wptr; + /* check ring is empty to workaround missing wptr overflow flag */ + return cur_rptr >= checkpoint_wptr || + (cur_rptr & ih->ptr_mask) == amdgpu_ih_get_wptr(adev, ih); } /** -- 2.17.1 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/6] drm/amdkfd: handle stale retry fault 2021-04-20 20:21 [PATCH 1/6] drm/amdkfd: retry validation to recover range Philip Yang 2021-04-20 20:21 ` [PATCH 2/6] drm/amdgpu: return IH ring drain finished if ring is empty Philip Yang @ 2021-04-20 20:21 ` Philip Yang 2021-04-20 20:21 ` [PATCH 4/6] drm/amdgpu: address remove from fault filter Philip Yang 2021-04-20 20:21 ` [PATCH 5/6] drm/amdkfd: enable subsequent retry fault Philip Yang 3 siblings, 0 replies; 13+ messages in thread From: Philip Yang @ 2021-04-20 20:21 UTC (permalink / raw) To: amd-gfx; +Cc: Philip Yang Retry fault interrupt maybe pending in IH ring after GPU page table is updated to recover the vm fault, because each page of the range generate retry fault interrupt. There is race if application unmap range to remove and free the range first and then retry fault work restore_pages handle the retry fault interrupt, because range can not be found, this vm fault can not be recovered and report incorrect GPU vm fault to application. Before unmap to remove and free range, drain retry fault interrupt from IH ring1 to ensure no retry fault comes after the range is removed. Drain retry fault interrupt skip the range which is on deferred list to remove, or the range is child range, which is split by unmap, does not add to svms and have interval notifier. Signed-off-by: Philip Yang <Philip.Yang@amd.com> --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 69 +++++++++++++++++++++++++++- 1 file changed, 68 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 40ef5709d0a7..45dd055118eb 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1830,6 +1830,28 @@ svm_range_handle_list_op(struct svm_range_list *svms, struct svm_range *prange) } } +static void svm_range_drain_retry_fault(struct svm_range_list *svms) +{ + struct kfd_process_device *pdd; + struct amdgpu_device *adev; + struct kfd_process *p; + uint32_t i; + + p = container_of(svms, struct kfd_process, svms); + + for (i = 0; i < p->n_pdds; i++) { + pdd = p->pdds[i]; + if (!pdd) + continue; + + pr_debug("drain retry fault gpu %d svms %p\n", i, svms); + adev = (struct amdgpu_device *)pdd->dev->kgd; + + amdgpu_ih_wait_on_checkpoint_process(adev, &adev->irq.ih1); + pr_debug("drain retry fault gpu %d svms 0x%p done\n", i, svms); + } +} + static void svm_range_deferred_list_work(struct work_struct *work) { struct svm_range_list *svms; @@ -1847,6 +1869,10 @@ static void svm_range_deferred_list_work(struct work_struct *work) pr_debug("prange 0x%p [0x%lx 0x%lx] op %d\n", prange, prange->start, prange->last, prange->work_item.op); + /* Make sure no stale retry fault coming after range is freed */ + if (prange->work_item.op == SVM_OP_UNMAP_RANGE) + svm_range_drain_retry_fault(prange->svms); + mm = prange->work_item.mm; mmap_write_lock(mm); mutex_lock(&svms->lock); @@ -2154,6 +2180,44 @@ svm_range_best_restore_location(struct svm_range *prange, return -1; } +/* svm_range_skip_recover - decide if prange can be recovered + * @prange: svm range structure + * + * GPU vm retry fault handle skip recover the range for cases: + * 1. prange is on deferred list to be removed after unmap, it is stale fault, + * deferred list work will drain the stale fault before free the prange. + * 2. prange is on deferred list to add interval notifier after split, or + * 3. prange is child range, it is split from parent prange, recover later + * after interval notifier is added. + * + * Return: true to skip recover, false to recover + */ +static bool svm_range_skip_recover(struct svm_range *prange) +{ + struct svm_range_list *svms = prange->svms; + + spin_lock(&svms->deferred_list_lock); + if (list_empty(&prange->deferred_list) && + list_empty(&prange->child_list)) { + spin_unlock(&svms->deferred_list_lock); + return false; + } + spin_unlock(&svms->deferred_list_lock); + + if (prange->work_item.op == SVM_OP_UNMAP_RANGE) { + pr_debug("svms 0x%p prange 0x%p [0x%lx 0x%lx] unmapped\n", + svms, prange, prange->start, prange->last); + return true; + } + if (prange->work_item.op == SVM_OP_ADD_RANGE_AND_MAP || + prange->work_item.op == SVM_OP_ADD_RANGE) { + pr_debug("svms 0x%p prange 0x%p [0x%lx 0x%lx] not added yet\n", + svms, prange, prange->start, prange->last); + return true; + } + return false; +} + int svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, uint64_t addr) @@ -2189,7 +2253,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, mmap_read_lock(mm); mutex_lock(&svms->lock); prange = svm_range_from_addr(svms, addr, NULL); - if (!prange) { pr_debug("failed to find prange svms 0x%p address [0x%llx]\n", svms, addr); @@ -2198,6 +2261,10 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, } mutex_lock(&prange->migrate_mutex); + + if (svm_range_skip_recover(prange)) + goto out_unlock_range; + timestamp = ktime_to_us(ktime_get()) - prange->validate_timestamp; /* skip duplicate vm fault on different pages of same range */ if (timestamp < AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING) { -- 2.17.1 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/6] drm/amdgpu: address remove from fault filter 2021-04-20 20:21 [PATCH 1/6] drm/amdkfd: retry validation to recover range Philip Yang 2021-04-20 20:21 ` [PATCH 2/6] drm/amdgpu: return IH ring drain finished if ring is empty Philip Yang 2021-04-20 20:21 ` [PATCH 3/6] drm/amdkfd: handle stale retry fault Philip Yang @ 2021-04-20 20:21 ` Philip Yang 2021-04-21 1:20 ` Felix Kuehling 2021-04-21 7:22 ` Christian König 2021-04-20 20:21 ` [PATCH 5/6] drm/amdkfd: enable subsequent retry fault Philip Yang 3 siblings, 2 replies; 13+ messages in thread From: Philip Yang @ 2021-04-20 20:21 UTC (permalink / raw) To: amd-gfx; +Cc: Philip Yang Add interface to remove address from fault filter ring by resetting fault ring entry of the fault address timestamp to 0, then future vm fault on the address will be processed to recover. Check fault address from fault ring, add address into fault ring and remove address from fault ring are serialized in same interrupt deferred work, don't have race condition. Signed-off-by: Philip Yang <Philip.Yang@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 ++++++++++++++++++++++++ drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++ 2 files changed, 26 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index c39ed9eb0987..338e45fa66cb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, return false; } +/** + * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter + * + * @adev: amdgpu device structure + * @addr: address of the VM fault + * @pasid: PASID of the process causing the fault + * + * Remove the address from fault filter, then future vm fault on this address + * will pass to retry fault handler to recover. + */ +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, + uint16_t pasid) +{ + struct amdgpu_gmc *gmc = &adev->gmc; + + uint64_t key = addr << 4 | pasid; + struct amdgpu_gmc_fault *fault; + uint32_t hash; + + hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); + fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; + fault->timestamp = 0; +} + int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev) { int r; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index 9d11c02a3938..498a7a0d5a9e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev, struct amdgpu_gmc *mc); bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, uint16_t pasid, uint64_t timestamp); +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, + uint16_t pasid); int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev); void amdgpu_gmc_ras_fini(struct amdgpu_device *adev); int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev); -- 2.17.1 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter 2021-04-20 20:21 ` [PATCH 4/6] drm/amdgpu: address remove from fault filter Philip Yang @ 2021-04-21 1:20 ` Felix Kuehling 2021-04-21 7:55 ` Christian König 2021-04-21 7:22 ` Christian König 1 sibling, 1 reply; 13+ messages in thread From: Felix Kuehling @ 2021-04-21 1:20 UTC (permalink / raw) To: amd-gfx, Yang, Philip Am 2021-04-20 um 4:21 p.m. schrieb Philip Yang: > Add interface to remove address from fault filter ring by resetting > fault ring entry of the fault address timestamp to 0, then future vm > fault on the address will be processed to recover. > > Check fault address from fault ring, add address into fault ring and > remove address from fault ring are serialized in same interrupt deferred > work, don't have race condition. > > Signed-off-by: Philip Yang <Philip.Yang@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 ++++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++ > 2 files changed, 26 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > index c39ed9eb0987..338e45fa66cb 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, > return false; > } > > +/** > + * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter > + * > + * @adev: amdgpu device structure > + * @addr: address of the VM fault > + * @pasid: PASID of the process causing the fault > + * > + * Remove the address from fault filter, then future vm fault on this address > + * will pass to retry fault handler to recover. > + */ > +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, > + uint16_t pasid) > +{ > + struct amdgpu_gmc *gmc = &adev->gmc; > + > + uint64_t key = addr << 4 | pasid; > + struct amdgpu_gmc_fault *fault; > + uint32_t hash; > + > + hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); > + fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; You need to loop over the fault ring to find a fault with the matching key since there may be hash collisions. You also need to make sure you don't break the single link list of keys with the same hash when you remove an entry. I think the easier way to remove an entry without breaking this ring+closed hashing structure is to reset the fault->key rather than the fault->timestamp. Finally, you need to add locking to the fault ring structure. Currently it's not protected by any locks because only one thread (the interrupt handler) accesses it. Now you have another thread that can remove entries, so you need to protect it with a lock. If you are handling retry faults, you know that the interrupt handler is really a worker thread, so you can use a mutex or a spin-lock, but it doesn't need to be interrupt-safe. Regards, Felix > + fault->timestamp = 0; > +} > + > int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev) > { > int r; > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > index 9d11c02a3938..498a7a0d5a9e 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev, > struct amdgpu_gmc *mc); > bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, > uint16_t pasid, uint64_t timestamp); > +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, > + uint16_t pasid); > int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev); > void amdgpu_gmc_ras_fini(struct amdgpu_device *adev); > int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev); _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter 2021-04-21 1:20 ` Felix Kuehling @ 2021-04-21 7:55 ` Christian König 2021-04-21 15:29 ` Felix Kuehling 0 siblings, 1 reply; 13+ messages in thread From: Christian König @ 2021-04-21 7:55 UTC (permalink / raw) To: Felix Kuehling, amd-gfx, Yang, Philip Am 21.04.21 um 03:20 schrieb Felix Kuehling: > Am 2021-04-20 um 4:21 p.m. schrieb Philip Yang: >> Add interface to remove address from fault filter ring by resetting >> fault ring entry of the fault address timestamp to 0, then future vm >> fault on the address will be processed to recover. >> >> Check fault address from fault ring, add address into fault ring and >> remove address from fault ring are serialized in same interrupt deferred >> work, don't have race condition. >> >> Signed-off-by: Philip Yang <Philip.Yang@amd.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 ++++++++++++++++++++++++ >> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++ >> 2 files changed, 26 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >> index c39ed9eb0987..338e45fa66cb 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >> @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, >> return false; >> } >> >> +/** >> + * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter >> + * >> + * @adev: amdgpu device structure >> + * @addr: address of the VM fault >> + * @pasid: PASID of the process causing the fault >> + * >> + * Remove the address from fault filter, then future vm fault on this address >> + * will pass to retry fault handler to recover. >> + */ >> +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, >> + uint16_t pasid) >> +{ >> + struct amdgpu_gmc *gmc = &adev->gmc; >> + >> + uint64_t key = addr << 4 | pasid; >> + struct amdgpu_gmc_fault *fault; >> + uint32_t hash; >> + >> + hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); >> + fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; > You need to loop over the fault ring to find a fault with the matching > key since there may be hash collisions. > > You also need to make sure you don't break the single link list of keys > with the same hash when you remove an entry. I think the easier way to > remove an entry without breaking this ring+closed hashing structure is > to reset the fault->key rather than the fault->timestamp. > > Finally, you need to add locking to the fault ring structure. Currently > it's not protected by any locks because only one thread (the interrupt > handler) accesses it. Now you have another thread that can remove > entries, so you need to protect it with a lock. If you are handling > retry faults, you know that the interrupt handler is really a worker > thread, so you can use a mutex or a spin-lock, but it doesn't need to be > interrupt-safe. I don't think you need a lock at all. Just using cmpxchg() to update the key should do it. Something like this: hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; while (fault->timestamp >= stamp) { uint64_t tmp; cmpxchg(&fault->key, key, 0); tmp = fault->timestamp; fault = &gmc->fault_ring[fault->next]; /* Check if the entry was reused */ if (fault->timestamp >= tmp) break; } Regards, Christian. > > Regards, > Felix > > >> + fault->timestamp = 0; >> +} >> + >> int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev) >> { >> int r; >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h >> index 9d11c02a3938..498a7a0d5a9e 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h >> @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev, >> struct amdgpu_gmc *mc); >> bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, >> uint16_t pasid, uint64_t timestamp); >> +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, >> + uint16_t pasid); >> int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev); >> void amdgpu_gmc_ras_fini(struct amdgpu_device *adev); >> int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev); > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter 2021-04-21 7:55 ` Christian König @ 2021-04-21 15:29 ` Felix Kuehling 2021-04-23 1:52 ` philip yang 0 siblings, 1 reply; 13+ messages in thread From: Felix Kuehling @ 2021-04-21 15:29 UTC (permalink / raw) To: Christian König, amd-gfx, Yang, Philip On 2021-04-21 3:55 a.m., Christian König wrote: > Am 21.04.21 um 03:20 schrieb Felix Kuehling: >> Am 2021-04-20 um 4:21 p.m. schrieb Philip Yang: >>> Add interface to remove address from fault filter ring by resetting >>> fault ring entry of the fault address timestamp to 0, then future vm >>> fault on the address will be processed to recover. >>> >>> Check fault address from fault ring, add address into fault ring and >>> remove address from fault ring are serialized in same interrupt >>> deferred >>> work, don't have race condition. >>> >>> Signed-off-by: Philip Yang <Philip.Yang@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 ++++++++++++++++++++++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++ >>> 2 files changed, 26 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>> index c39ed9eb0987..338e45fa66cb 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>> @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct >>> amdgpu_device *adev, uint64_t addr, >>> return false; >>> } >>> +/** >>> + * amdgpu_gmc_filter_faults_remove - remove address from VM faults >>> filter >>> + * >>> + * @adev: amdgpu device structure >>> + * @addr: address of the VM fault >>> + * @pasid: PASID of the process causing the fault >>> + * >>> + * Remove the address from fault filter, then future vm fault on >>> this address >>> + * will pass to retry fault handler to recover. >>> + */ >>> +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, >>> uint64_t addr, >>> + uint16_t pasid) >>> +{ >>> + struct amdgpu_gmc *gmc = &adev->gmc; >>> + >>> + uint64_t key = addr << 4 | pasid; >>> + struct amdgpu_gmc_fault *fault; >>> + uint32_t hash; >>> + >>> + hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); >>> + fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; >> You need to loop over the fault ring to find a fault with the matching >> key since there may be hash collisions. >> >> You also need to make sure you don't break the single link list of keys >> with the same hash when you remove an entry. I think the easier way to >> remove an entry without breaking this ring+closed hashing structure is >> to reset the fault->key rather than the fault->timestamp. >> >> Finally, you need to add locking to the fault ring structure. Currently >> it's not protected by any locks because only one thread (the interrupt >> handler) accesses it. Now you have another thread that can remove >> entries, so you need to protect it with a lock. If you are handling >> retry faults, you know that the interrupt handler is really a worker >> thread, so you can use a mutex or a spin-lock, but it doesn't need to be >> interrupt-safe. > > I don't think you need a lock at all. > > Just using cmpxchg() to update the key should do it. > > Something like this: > > hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); > fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; > while (fault->timestamp >= stamp) { > uint64_t tmp; > > cmpxchg(&fault->key, key, 0); Good idea. Then we should probably use READ_ONCE and WRITE_ONCE to access fault->key in other places. Regards, Felix > > tmp = fault->timestamp; > fault = &gmc->fault_ring[fault->next]; > > /* Check if the entry was reused */ > if (fault->timestamp >= tmp) > break; > } > > Regards, > Christian. > >> >> Regards, >> Felix >> >> >>> + fault->timestamp = 0; >>> +} >>> + >>> int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev) >>> { >>> int r; >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h >>> index 9d11c02a3938..498a7a0d5a9e 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h >>> @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct >>> amdgpu_device *adev, >>> struct amdgpu_gmc *mc); >>> bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t >>> addr, >>> uint16_t pasid, uint64_t timestamp); >>> +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, >>> uint64_t addr, >>> + uint16_t pasid); >>> int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev); >>> void amdgpu_gmc_ras_fini(struct amdgpu_device *adev); >>> int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev); >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7Cfelix.kuehling%40amd.com%7C7d19870014ff40b6d80b08d9049ae8a0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637545885642357593%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=b0LnO9SQPgYskIRH0vCjKebvh%2FYzFfidRne15q2WIXw%3D&reserved=0 >> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter 2021-04-21 15:29 ` Felix Kuehling @ 2021-04-23 1:52 ` philip yang 0 siblings, 0 replies; 13+ messages in thread From: philip yang @ 2021-04-23 1:52 UTC (permalink / raw) To: Felix Kuehling, Christian König, amd-gfx, Yang, Philip [-- Attachment #1: Type: text/html, Size: 11026 bytes --] [-- Attachment #2: Type: text/plain, Size: 154 bytes --] _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter 2021-04-20 20:21 ` [PATCH 4/6] drm/amdgpu: address remove from fault filter Philip Yang 2021-04-21 1:20 ` Felix Kuehling @ 2021-04-21 7:22 ` Christian König 2021-04-23 2:00 ` philip yang 1 sibling, 1 reply; 13+ messages in thread From: Christian König @ 2021-04-21 7:22 UTC (permalink / raw) To: Philip Yang, amd-gfx Am 20.04.21 um 22:21 schrieb Philip Yang: > Add interface to remove address from fault filter ring by resetting > fault ring entry of the fault address timestamp to 0, then future vm > fault on the address will be processed to recover. > > Check fault address from fault ring, add address into fault ring and > remove address from fault ring are serialized in same interrupt deferred > work, don't have race condition. That might not work on Vega20. We call amdgpu_gmc_filter_faults() from the the IH while the fault handling id done from the delegated IH processing. More comments below. > Signed-off-by: Philip Yang <Philip.Yang@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 ++++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++ > 2 files changed, 26 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > index c39ed9eb0987..338e45fa66cb 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, > return false; > } > > +/** > + * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter > + * > + * @adev: amdgpu device structure > + * @addr: address of the VM fault > + * @pasid: PASID of the process causing the fault > + * > + * Remove the address from fault filter, then future vm fault on this address > + * will pass to retry fault handler to recover. > + */ > +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, > + uint16_t pasid) > +{ > + struct amdgpu_gmc *gmc = &adev->gmc; > + > + uint64_t key = addr << 4 | pasid; We should probably have a function for this now. > + struct amdgpu_gmc_fault *fault; > + uint32_t hash; > + > + hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER); > + fault = &gmc->fault_ring[gmc->fault_hash[hash].idx]; > + fault->timestamp = 0; There is no guarantee that the ring entry you found for the fault is the one for this address. After all that is just an 8 bit hash for a 64bit values :) You need to double check the key and walk the chain by looking at the next entry to eventually find the right one. Christian. > +} > + > int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev) > { > int r; > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > index 9d11c02a3938..498a7a0d5a9e 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev, > struct amdgpu_gmc *mc); > bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr, > uint16_t pasid, uint64_t timestamp); > +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr, > + uint16_t pasid); > int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev); > void amdgpu_gmc_ras_fini(struct amdgpu_device *adev); > int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev); _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter 2021-04-21 7:22 ` Christian König @ 2021-04-23 2:00 ` philip yang 0 siblings, 0 replies; 13+ messages in thread From: philip yang @ 2021-04-23 2:00 UTC (permalink / raw) To: Christian König, Philip Yang, amd-gfx [-- Attachment #1: Type: text/html, Size: 6550 bytes --] [-- Attachment #2: Type: text/plain, Size: 154 bytes --] _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 5/6] drm/amdkfd: enable subsequent retry fault 2021-04-20 20:21 [PATCH 1/6] drm/amdkfd: retry validation to recover range Philip Yang ` (2 preceding siblings ...) 2021-04-20 20:21 ` [PATCH 4/6] drm/amdgpu: address remove from fault filter Philip Yang @ 2021-04-20 20:21 ` Philip Yang 2021-04-21 1:22 ` Felix Kuehling 3 siblings, 1 reply; 13+ messages in thread From: Philip Yang @ 2021-04-20 20:21 UTC (permalink / raw) To: amd-gfx; +Cc: Philip Yang After draining the stale retry fault, or failed to validate the range to recover, have to remove the fault address from fault filter ring, to be able to handle subsequent retry interrupt on same address. Otherwise the retry fault will not be processed to recover until timeout passed. Signed-off-by: Philip Yang <Philip.Yang@amd.com> --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 45dd055118eb..d90e0cb6e573 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -2262,8 +2262,10 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, mutex_lock(&prange->migrate_mutex); - if (svm_range_skip_recover(prange)) + if (svm_range_skip_recover(prange)) { + amdgpu_gmc_filter_faults_remove(adev, addr, pasid); goto out_unlock_range; + } timestamp = ktime_to_us(ktime_get()) - prange->validate_timestamp; /* skip duplicate vm fault on different pages of same range */ @@ -2325,6 +2327,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, if (r == -EAGAIN) { pr_debug("recover vm fault later\n"); + amdgpu_gmc_filter_faults_remove(adev, addr, pasid); r = 0; } return r; -- 2.17.1 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 5/6] drm/amdkfd: enable subsequent retry fault 2021-04-20 20:21 ` [PATCH 5/6] drm/amdkfd: enable subsequent retry fault Philip Yang @ 2021-04-21 1:22 ` Felix Kuehling 2021-04-21 2:08 ` philip yang 0 siblings, 1 reply; 13+ messages in thread From: Felix Kuehling @ 2021-04-21 1:22 UTC (permalink / raw) To: Philip Yang, amd-gfx Am 2021-04-20 um 4:21 p.m. schrieb Philip Yang: > After draining the stale retry fault, or failed to validate the range > to recover, have to remove the fault address from fault filter ring, to > be able to handle subsequent retry interrupt on same address. Otherwise > the retry fault will not be processed to recover until timeout passed. > > Signed-off-by: Philip Yang <Philip.Yang@amd.com> Patches 1-3 and patch 5 are Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> I didn't see a patch 6. Was the email lost or not send intentionally? > --- > drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > index 45dd055118eb..d90e0cb6e573 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > @@ -2262,8 +2262,10 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, > > mutex_lock(&prange->migrate_mutex); > > - if (svm_range_skip_recover(prange)) > + if (svm_range_skip_recover(prange)) { > + amdgpu_gmc_filter_faults_remove(adev, addr, pasid); > goto out_unlock_range; > + } > > timestamp = ktime_to_us(ktime_get()) - prange->validate_timestamp; > /* skip duplicate vm fault on different pages of same range */ > @@ -2325,6 +2327,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, > > if (r == -EAGAIN) { > pr_debug("recover vm fault later\n"); > + amdgpu_gmc_filter_faults_remove(adev, addr, pasid); > r = 0; > } > return r; _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 5/6] drm/amdkfd: enable subsequent retry fault 2021-04-21 1:22 ` Felix Kuehling @ 2021-04-21 2:08 ` philip yang 0 siblings, 0 replies; 13+ messages in thread From: philip yang @ 2021-04-21 2:08 UTC (permalink / raw) To: Felix Kuehling, Philip Yang, amd-gfx [-- Attachment #1: Type: text/html, Size: 2703 bytes --] [-- Attachment #2: Type: text/plain, Size: 154 bytes --] _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2021-04-23 2:00 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-20 20:21 [PATCH 1/6] drm/amdkfd: retry validation to recover range Philip Yang 2021-04-20 20:21 ` [PATCH 2/6] drm/amdgpu: return IH ring drain finished if ring is empty Philip Yang 2021-04-20 20:21 ` [PATCH 3/6] drm/amdkfd: handle stale retry fault Philip Yang 2021-04-20 20:21 ` [PATCH 4/6] drm/amdgpu: address remove from fault filter Philip Yang 2021-04-21 1:20 ` Felix Kuehling 2021-04-21 7:55 ` Christian König 2021-04-21 15:29 ` Felix Kuehling 2021-04-23 1:52 ` philip yang 2021-04-21 7:22 ` Christian König 2021-04-23 2:00 ` philip yang 2021-04-20 20:21 ` [PATCH 5/6] drm/amdkfd: enable subsequent retry fault Philip Yang 2021-04-21 1:22 ` Felix Kuehling 2021-04-21 2:08 ` philip yang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).