amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: "Chen, Xiaogang" <xiaogang.chen@amd.com>
To: Philip Yang <Philip.Yang@amd.com>, amd-gfx@lists.freedesktop.org
Cc: Felix.Kuehling@amd.com, christian.koenig@amd.com
Subject: Re: [PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault
Date: Wed, 1 May 2024 23:09:05 -0500	[thread overview]
Message-ID: <d949949b-fb30-45c7-a53d-a4d32874b3c7@amd.com> (raw)
In-Reply-To: <20240501225655.5215-1-Philip.Yang@amd.com>


On 5/1/2024 5:56 PM, Philip Yang wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On system with khugepaged enabled and user cases with THP buffer, the
> hmm_range_fault may takes > 15 seconds to return -EBUSY, the arbitrary
> timeout value is not accurate, cause memory allocation failure.
>
> Remove the arbitrary timeout value, return EAGAIN to application if
> hmm_range_fault return EBUSY, then userspace libdrm and Thunk will call
> ioctl again.

Wonder why letting user space do retry is better? Seems this issue is 
caused by hugepage merging, so how user space can avoid it?

And applications may not use Thunk or libdrm, instead, use ioctl directly.

Regards

Xiaogang

> Change EAGAIN to debug message as this is not error.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |  5 ++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c          | 12 +++---------
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c             |  5 +----
>   3 files changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 54198c3928c7..02696c2102f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1087,7 +1087,10 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr,
>
>          ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages, &range);
>          if (ret) {
> -               pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
> +               if (ret == -EAGAIN)
> +                       pr_debug("Failed to get user pages, try again\n");
> +               else
> +                       pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
>                  goto unregister_out;
>          }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 431ec72655ec..e36fede7f74c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> @@ -202,20 +202,12 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
>                  pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
>                          hmm_range->start, hmm_range->end);
>
> -               /* Assuming 64MB takes maximum 1 second to fault page address */
> -               timeout = max((hmm_range->end - hmm_range->start) >> 26, 1UL);
> -               timeout *= HMM_RANGE_DEFAULT_TIMEOUT;
> -               timeout = jiffies + msecs_to_jiffies(timeout);
> +               timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>
>   retry:
>                  hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>                  r = hmm_range_fault(hmm_range);
>                  if (unlikely(r)) {
> -                       schedule();
> -                       /*
> -                        * FIXME: This timeout should encompass the retry from
> -                        * mmu_interval_read_retry() as well.
> -                        */
>                          if (r == -EBUSY && !time_after(jiffies, timeout))
>                                  goto retry;
>                          goto out_free_pfns;
> @@ -247,6 +239,8 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
>   out_free_range:
>          kfree(hmm_range);
>
> +       if (r == -EBUSY)
> +               r = -EAGAIN;
>          return r;
>   }
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 94f83be2232d..e7040f809f33 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1670,11 +1670,8 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>                                                         readonly, owner, NULL,
>                                                         &hmm_range);
>                          WRITE_ONCE(p->svms.faulting_task, NULL);
> -                       if (r) {
> +                       if (r)
>                                  pr_debug("failed %d to get svm range pages\n", r);
> -                               if (r == -EBUSY)
> -                                       r = -EAGAIN;
> -                       }
>                  } else {
>                          r = -EFAULT;
>                  }
> --
> 2.43.2
>

  reply	other threads:[~2024-05-02  4:09 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-01 22:56 [PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault Philip Yang
2024-05-02  4:09 ` Chen, Xiaogang [this message]
2024-05-02 14:06   ` Philip Yang
2024-05-02 12:42 ` James Zhu
2024-05-02 15:05   ` Philip Yang
2024-05-06 19:54 ` Felix Kuehling

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d949949b-fb30-45c7-a53d-a4d32874b3c7@amd.com \
    --to=xiaogang.chen@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Philip.Yang@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).