All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Kuehling, Felix" <Felix.Kuehling-5C7GfCeVMHo@public.gmane.org>
To: "Christian König"
	<ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"Olsak, Marek" <Marek.Olsak-5C7GfCeVMHo@public.gmane.org>,
	"Zhou,
	David(ChunMing)" <David1.Zhou-5C7GfCeVMHo@public.gmane.org>,
	"Liang, Prike" <Prike.Liang-5C7GfCeVMHo@public.gmane.org>,
	"dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH 06/10] drm/ttm: fix busy memory to fail other user v10
Date: Wed, 26 Jun 2019 06:36:27 +0000	[thread overview]
Message-ID: <dbda70c1-d5a7-69e3-19c4-b6ad4a4275c4@amd.com> (raw)
In-Reply-To: <20190522125947.4592-6-christian.koenig-5C7GfCeVMHo@public.gmane.org>

I believe I found a live-lock due to this patch when running our KFD 
eviction test in a loop. I pretty reliably hangs on the second loop 
iteration. If I revert this patch, the problem disappears.

With some added instrumentation, I see that amdgpu_cs_list_validate in 
amdgpu_cs_parser_bos returns -EAGAIN repeatedly. User mode dutifully 
retries the command submission, but it never succeeds.

I think the problem is the fact that ttm_mem_evict_wait_busy returns 
-EAGAIN when it fails to lock the busy BO. That will cause 
ttm_bo_mem_space to give up instead of trying an alternate placement. A 
straight-forward fix is to modify ttm_mem_evict_wait_busy to return 
-EBUSY, which will maintain the old behaviour of trying alternate 
placements when eviction is not possible. I'll send out a patch for that.

See also inline.

On 2019-05-22 8:59 a.m., Christian König wrote:
> [CAUTION: External Email]
>
> BOs on the LRU might be blocked during command submission
> and cause OOM situations.
>
> Avoid this by blocking for the first busy BO not locked by
> the same ticket as the BO we are searching space for.
>
> v10: completely start over with the patch since we didn't
>       handled a whole bunch of corner cases.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/ttm/ttm_bo.c | 77 ++++++++++++++++++++++++++++++------
>   1 file changed, 66 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 4c6389d849ed..861facac33d4 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -771,32 +771,72 @@ EXPORT_SYMBOL(ttm_bo_eviction_valuable);
>    * b. Otherwise, trylock it.
>    */
>   static bool ttm_bo_evict_swapout_allowable(struct ttm_buffer_object *bo,
> -                       struct ttm_operation_ctx *ctx, bool *locked)
> +                       struct ttm_operation_ctx *ctx, bool *locked, bool *busy)
>   {
>          bool ret = false;
>
> -       *locked = false;
>          if (bo->resv == ctx->resv) {
>                  reservation_object_assert_held(bo->resv);
>                  if (ctx->flags & TTM_OPT_FLAG_ALLOW_RES_EVICT
>                      || !list_empty(&bo->ddestroy))
>                          ret = true;
> +               *locked = false;
> +               if (busy)
> +                       *busy = false;
>          } else {
> -               *locked = reservation_object_trylock(bo->resv);
> -               ret = *locked;
> +               ret = reservation_object_trylock(bo->resv);
> +               *locked = ret;
> +               if (busy)
> +                       *busy = !ret;
>          }
>
>          return ret;
>   }
>
> +/**
> + * ttm_mem_evict_wait_busy - wait for a busy BO to become available
> + *
> + * @busy_bo: BO which couldn't be locked with trylock
> + * @ctx: operation context
> + * @ticket: acquire ticket
> + *
> + * Try to lock a busy buffer object to avoid failing eviction.
> + */
> +static int ttm_mem_evict_wait_busy(struct ttm_buffer_object *busy_bo,
> +                                  struct ttm_operation_ctx *ctx,
> +                                  struct ww_acquire_ctx *ticket)
> +{
> +       int r;
> +
> +       if (!busy_bo || !ticket)
> +               return -EBUSY;
> +
> +       if (ctx->interruptible)
> +               r = reservation_object_lock_interruptible(busy_bo->resv,
> +                                                         ticket);
> +       else
> +               r = reservation_object_lock(busy_bo->resv, ticket);
> +
> +       /*
> +        * TODO: It would be better to keep the BO locked until allocation is at
> +        * least tried one more time, but that would mean a much larger rework
> +        * of TTM.
> +        */
> +       if (!r)
> +               reservation_object_unlock(busy_bo->resv);
> +
> +       return r == -EDEADLK ? -EAGAIN : r;

If locking fails, this returns -EAGAIN.


> +}
> +
>   static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
>                                 uint32_t mem_type,
>                                 const struct ttm_place *place,
> -                              struct ttm_operation_ctx *ctx)
> +                              struct ttm_operation_ctx *ctx,
> +                              struct ww_acquire_ctx *ticket)
>   {
> +       struct ttm_buffer_object *bo = NULL, *busy_bo = NULL;
>          struct ttm_bo_global *glob = bdev->glob;
>          struct ttm_mem_type_manager *man = &bdev->man[mem_type];
> -       struct ttm_buffer_object *bo = NULL;
>          bool locked = false;
>          unsigned i;
>          int ret;
> @@ -804,8 +844,15 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
>          spin_lock(&glob->lru_lock);
>          for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
>                  list_for_each_entry(bo, &man->lru[i], lru) {
> -                       if (!ttm_bo_evict_swapout_allowable(bo, ctx, &locked))
> +                       bool busy;
> +
> +                       if (!ttm_bo_evict_swapout_allowable(bo, ctx, &locked,
> +                                                           &busy)) {
> +                               if (busy && !busy_bo &&
> +                                   bo->resv->lock.ctx != ticket)
> +                                       busy_bo = bo;
>                                  continue;
> +                       }
>
>                          if (place && !bdev->driver->eviction_valuable(bo,
>                                                                        place)) {
> @@ -824,8 +871,13 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
>          }
>
>          if (!bo) {
> +               if (busy_bo)
> +                       ttm_bo_get(busy_bo);
>                  spin_unlock(&glob->lru_lock);
> -               return -EBUSY;

The old behaviour was to always return -EBUSY if eviction was not possible.


> +               ret = ttm_mem_evict_wait_busy(busy_bo, ctx, ticket);
> +               if (busy_bo)
> +                       ttm_bo_put(busy_bo);
> +               return ret;

Now this can return -EAGAIN. This prevents ttm_bo_mem_space from trying 
alternate placements.

Regards,
   Felix


>          }
>
>          kref_get(&bo->list_kref);
> @@ -911,7 +963,8 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
>                          return ret;
>                  if (mem->mm_node)
>                          break;
> -               ret = ttm_mem_evict_first(bdev, mem->mem_type, place, ctx);
> +               ret = ttm_mem_evict_first(bdev, mem->mem_type, place, ctx,
> +                                         bo->resv->lock.ctx);
>                  if (unlikely(ret != 0))
>                          return ret;
>          } while (1);
> @@ -1426,7 +1479,8 @@ static int ttm_bo_force_list_clean(struct ttm_bo_device *bdev,
>          for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
>                  while (!list_empty(&man->lru[i])) {
>                          spin_unlock(&glob->lru_lock);
> -                       ret = ttm_mem_evict_first(bdev, mem_type, NULL, &ctx);
> +                       ret = ttm_mem_evict_first(bdev, mem_type, NULL, &ctx,
> +                                                 NULL);
>                          if (ret)
>                                  return ret;
>                          spin_lock(&glob->lru_lock);
> @@ -1797,7 +1851,8 @@ int ttm_bo_swapout(struct ttm_bo_global *glob, struct ttm_operation_ctx *ctx)
>          spin_lock(&glob->lru_lock);
>          for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
>                  list_for_each_entry(bo, &glob->swap_lru[i], swap) {
> -                       if (ttm_bo_evict_swapout_allowable(bo, ctx, &locked)) {
> +                       if (ttm_bo_evict_swapout_allowable(bo, ctx, &locked,
> +                                                          NULL)) {
>                                  ret = 0;
>                                  break;
>                          }
> --
> 2.17.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  parent reply	other threads:[~2019-06-26  6:36 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-22 12:59 [PATCH 01/10] drm/ttm: Make LRU removal optional Christian König
2019-05-22 12:59 ` [PATCH 02/10] drm/ttm: return immediately in case of a signal Christian König
2019-05-22 12:59 ` [PATCH 03/10] drm/ttm: remove manual placement preference Christian König
     [not found] ` <20190522125947.4592-1-christian.koenig-5C7GfCeVMHo@public.gmane.org>
2019-05-22 12:59   ` [PATCH 04/10] drm/ttm: cleanup ttm_bo_mem_space Christian König
2019-05-22 12:59   ` [PATCH 05/10] drm/ttm: immediately move BOs to the new LRU v2 Christian König
2019-05-22 12:59   ` [PATCH 06/10] drm/ttm: fix busy memory to fail other user v10 Christian König
2019-05-23 10:24     ` zhoucm1
2019-05-23 11:03       ` Christian König
     [not found]         ` <16918096-1430-d581-7284-a987aacb89da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2019-05-23 11:50           ` Chunming Zhou
     [not found]             ` <5d68ba04-250d-918e-3633-ec45e5b18904-5C7GfCeVMHo@public.gmane.org>
2019-05-23 14:15               ` Koenig, Christian
2019-05-24  5:35         ` Liang, Prike
     [not found]           ` <MN2PR12MB35364235378F29899838CD80FB020-rweVpJHSKTovpq7YPKzLfQdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-05-24  8:49             ` Christian König
     [not found]     ` <20190522125947.4592-6-christian.koenig-5C7GfCeVMHo@public.gmane.org>
2019-06-26  6:36       ` Kuehling, Felix [this message]
2019-05-22 12:59   ` [PATCH 07/10] drm/amd/display: use ttm_eu_reserve_buffers instead of amdgpu_bo_reserve v2 Christian König
2019-05-22 12:59   ` [PATCH 08/10] drm/amdgpu: drop some validation failure messages Christian König
2019-05-22 12:59   ` [PATCH 09/10] drm/amdgpu: create GDS, GWS and OA in system domain Christian König
2019-05-23  9:15   ` [PATCH 01/10] drm/ttm: Make LRU removal optional zhoucm1
     [not found]     ` <fbb023f9-28e7-2ac8-994f-e262da597098-5C7GfCeVMHo@public.gmane.org>
2019-05-23  9:39       ` Christian König
2019-05-22 12:59 ` [PATCH 10/10] drm/amdgpu: stop removing BOs from the LRU v3 Christian König
     [not found]   ` <20190522125947.4592-10-christian.koenig-5C7GfCeVMHo@public.gmane.org>
2019-05-22 19:43     ` Kuehling, Felix
     [not found]       ` <48ac98a8-de22-3549-5d63-078a0effab72-5C7GfCeVMHo@public.gmane.org>
2019-05-23  9:06         ` Christian König
     [not found]           ` <eea6245e-616d-eb16-8521-2f21ce5d6d25-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2019-05-24 21:34             ` Kuehling, Felix
     [not found]               ` <776d29df-428f-ad98-8e38-4b191b602abb-5C7GfCeVMHo@public.gmane.org>
2019-05-27 10:51                 ` Koenig, Christian
2019-05-23  8:27     ` Liang, Prike
2019-05-28 16:25 [PATCH 01/10] drm/ttm: Make LRU removal optional v2 Christian König
2019-05-28 16:25 ` [PATCH 06/10] drm/ttm: fix busy memory to fail other user v10 Christian König
2019-05-29 12:26 [PATCH 01/10] drm/ttm: Make LRU removal optional v2 Christian König
2019-05-29 12:26 ` [PATCH 06/10] drm/ttm: fix busy memory to fail other user v10 Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dbda70c1-d5a7-69e3-19c4-b6ad4a4275c4@amd.com \
    --to=felix.kuehling-5c7gfcevmho@public.gmane.org \
    --cc=David1.Zhou-5C7GfCeVMHo@public.gmane.org \
    --cc=Marek.Olsak-5C7GfCeVMHo@public.gmane.org \
    --cc=Prike.Liang-5C7GfCeVMHo@public.gmane.org \
    --cc=amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
    --cc=ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    --cc=dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.