I just realized that a change I'm thinking about for a while would solve 
your problem as well, but keep concurrent allocation possible.

See ttm_mem_evict_first() unlocks the BO after evicting it:
>         ttm_bo_del_from_lru(bo);
>         spin_unlock(&glob->lru_lock);
>
>         ret = ttm_bo_evict(bo, ctx);
>         if (locked) {
>                 ttm_bo_unreserve(bo); <-------- here
>         } else {
>                 spin_lock(&glob->lru_lock);
>                 ttm_bo_add_to_lru(bo);
>                 spin_unlock(&glob->lru_lock);
>         }
>
>         kref_put(&bo->list_kref, ttm_bo_release_list);

The effect is that in your example process C can not only beat process B 
once, but many many times because we run into a ping/pong situation 
where B evicts resources while C moves them back in.

For a while now I'm thinking about dropping those reservations only 
after the original allocation succeeded.

The effect would be that process C can still beat process B initially, 
but sooner or process B would evict some resources from process C as 
well and then it can succeed with its allocation.

The problem is for this approach to work we need to core change to the 
ww_mutexes to be able to handle this efficiently.

Regards,
Christian.

Am 26.01.2018 um 14:59 schrieb Christian König:
> I know, but this has the same effect. You prevent concurrent 
> allocation from happening.
>
> What we could do is to pipeline reusing of deleted memory as well, 
> this makes it less likely to cause the problem you are seeing because 
> the evicting processes doesn't need to block for deleted BOs any more.
>
> But that other processes can grab memory during eviction is 
> intentional. Otherwise greedy processes would completely dominate 
> command submission.
>
> Regards,
> Christian.
>
> Am 26.01.2018 um 14:50 schrieb Zhou, David(ChunMing):
>> I don't want to prevent all, my new approach is to prevent the later 
>> allocation is trying and ahead of front to get the memory space that 
>> the front made from eviction.
>>
>>
>> 发自坚果 Pro
>>
>> Christian K鰊ig <ckoenig.leichtzumerken@gmail.com> 于 2018年1月26日 
>> 下午9:24写道：
>>
>> Yes, exactly that's the problem.
>>
>> See when you want to prevent a process B from allocating the memory 
>> process A has evicted, you need to prevent all concurrent allocation.
>>
>> And we don't do that because it causes a major performance drop.
>>
>> Regards,
>> Christian.
>>
>> Am 26.01.2018 um 14:21 schrieb Zhou, David(ChunMing):
>>> You patch will prevent concurrent allocation, and will result in 
>>> allocation performance drop much.
>>>
>>> 发自坚果 Pro
>>>
>>> Christian K鰊ig <ckoenig.leichtzumerken@gmail.com> 于 2018年1月26日 
>>> 下午9:04写道：
>>>
>>> Attached is what you actually want to do cleanly implemented. But as 
>>> I said this is a NO-GO.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 26.01.2018 um 13:43 schrieb Christian König:
>>>>> After my investigation, this issue should be detect of TTM design 
>>>>> self, which breaks scheduling balance.
>>>> Yeah, but again. This is indented design we can't change easily.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 26.01.2018 um 13:36 schrieb Zhou, David(ChunMing):
>>>>> I am off work, so reply mail by phone, the format could not be text.
>>>>>
>>>>> back to topic itself:
>>>>> the problem indeed happen on amdgpu driver, someone reports me 
>>>>> that application runs with two instances, the performance are 
>>>>> different.
>>>>> I also reproduced the issue with unit test(bo_eviction_test). They 
>>>>> always think our scheduler isn't working as expected.
>>>>>
>>>>> After my investigation, this issue should be detect of TTM design 
>>>>> self, which breaks scheduling balance.
>>>>>
>>>>> Further, if we run containers for our gpu, container A could run 
>>>>> high score, container B runs low score with same benchmark.
>>>>>
>>>>> So this is bug that we need fix.
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>
>>>>> 发自坚果 Pro
>>>>>
>>>>> Christian K鰊ig <ckoenig.leichtzumerken@gmail.com> 于 2018年1月26日 
>>>>> 下午6:31写道：
>>>>>
>>>>> Am 26.01.2018 um 11:22 schrieb Chunming Zhou:
>>>>> > there is a scheduling balance issue about get node like:
>>>>> > a. process A allocates full memory and use it for submission.
>>>>> > b. process B tries to allocates memory, will wait for process A 
>>>>> BO idle in eviction.
>>>>> > c. process A completes the job, process B eviction will put 
>>>>> process A BO node,
>>>>> > but in the meantime, process C is comming to allocate BO, whill 
>>>>> directly get node successfully, and do submission,
>>>>> > process B will again wait for process C BO idle.
>>>>> > d. repeat the above setps, process B could be delayed much more.
>>>>> >
>>>>> > later allocation must not be ahead of front in same place.
>>>>>
>>>>> Again NAK to the whole approach.
>>>>>
>>>>> At least with amdgpu the problem you described above never occurs
>>>>> because evictions are pipelined operations. We could only block for
>>>>> deleted regions to become free.
>>>>>
>>>>> But independent of that incoming memory requests while we make 
>>>>> room for
>>>>> eviction are intended to be served first.
>>>>>
>>>>> Changing that is certainly a no-go cause that would favor memory 
>>>>> hungry
>>>>> applications over small clients.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> >
>>>>> > Change-Id: I3daa892e50f82226c552cc008a29e55894a98f18
>>>>> > Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
>>>>> > ---
>>>>> >   drivers/gpu/drm/ttm/ttm_bo.c    | 69 
>>>>> +++++++++++++++++++++++++++++++++++++++--
>>>>> >   include/drm/ttm/ttm_bo_api.h    |  7 +++++
>>>>> >   include/drm/ttm/ttm_bo_driver.h |  7 +++++
>>>>> >   3 files changed, 80 insertions(+), 3 deletions(-)
>>>>> >
>>>>> > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c 
>>>>> b/drivers/gpu/drm/ttm/ttm_bo.c
>>>>> > index d33a6bb742a1..558ec2cf465d 100644
>>>>> > --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>>> > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>>> > @@ -841,6 +841,58 @@ static int ttm_bo_add_move_fence(struct 
>>>>> ttm_buffer_object *bo,
>>>>> >        return 0;
>>>>> >   }
>>>>> >
>>>>> > +static void ttm_man_init_waiter(struct ttm_bo_waiter *waiter,
>>>>> > +                             struct ttm_buffer_object *bo,
>>>>> > +                             const struct ttm_place *place)
>>>>> > +{
>>>>> > +     waiter->tbo = bo;
>>>>> > +     memcpy((void *)&waiter->place, (void *)place, sizeof(*place));
>>>>> > +     INIT_LIST_HEAD(&waiter->list);
>>>>> > +}
>>>>> > +
>>>>> > +static void ttm_man_add_waiter(struct ttm_mem_type_manager *man,
>>>>> > +                            struct ttm_bo_waiter *waiter)
>>>>> > +{
>>>>> > +     if (!waiter)
>>>>> > +             return;
>>>>> > +     spin_lock(&man->wait_lock);
>>>>> > +     list_add_tail(&waiter->list, &man->waiter_list);
>>>>> > +     spin_unlock(&man->wait_lock);
>>>>> > +}
>>>>> > +
>>>>> > +static void ttm_man_del_waiter(struct ttm_mem_type_manager *man,
>>>>> > +                            struct ttm_bo_waiter *waiter)
>>>>> > +{
>>>>> > +     if (!waiter)
>>>>> > +             return;
>>>>> > +     spin_lock(&man->wait_lock);
>>>>> > +     if (!list_empty(&waiter->list))
>>>>> > + list_del(&waiter->list);
>>>>> > +     spin_unlock(&man->wait_lock);
>>>>> > +     kfree(waiter);
>>>>> > +}
>>>>> > +
>>>>> > +int ttm_man_check_bo(struct ttm_mem_type_manager *man,
>>>>> > +                  struct ttm_buffer_object *bo,
>>>>> > +                  const struct ttm_place *place)
>>>>> > +{
>>>>> > +     struct ttm_bo_waiter *waiter, *tmp;
>>>>> > +
>>>>> > +     spin_lock(&man->wait_lock);
>>>>> > +     list_for_each_entry_safe(waiter, tmp, &man->waiter_list, 
>>>>> list) {
>>>>> > +             if ((bo != waiter->tbo) &&
>>>>> > +                 ((place->fpfn >= waiter->place.fpfn &&
>>>>> > +                   place->fpfn <= waiter->place.lpfn) ||
>>>>> > +                  (place->lpfn <= waiter->place.lpfn && 
>>>>> place->lpfn >=
>>>>> > + waiter->place.fpfn)))
>>>>> > +                 goto later_bo;
>>>>> > +     }
>>>>> > +     spin_unlock(&man->wait_lock);
>>>>> > +     return true;
>>>>> > +later_bo:
>>>>> > +     spin_unlock(&man->wait_lock);
>>>>> > +     return false;
>>>>> > +}
>>>>> >   /**
>>>>> >    * Repeatedly evict memory from the LRU for @mem_type until we 
>>>>> create enough
>>>>> >    * space, or we've evicted everything and there isn't enough 
>>>>> space.
>>>>> > @@ -853,17 +905,26 @@ static int ttm_bo_mem_force_space(struct 
>>>>> ttm_buffer_object *bo,
>>>>> >   {
>>>>> >        struct ttm_bo_device *bdev = bo->bdev;
>>>>> >        struct ttm_mem_type_manager *man = &bdev->man[mem_type];
>>>>> > +     struct ttm_bo_waiter waiter;
>>>>> >        int ret;
>>>>> >
>>>>> > +     ttm_man_init_waiter(&waiter, bo, place);
>>>>> > +     ttm_man_add_waiter(man, &waiter);
>>>>> >        do {
>>>>> >                ret = (*man->func->get_node)(man, bo, place, mem);
>>>>> > -             if (unlikely(ret != 0))
>>>>> > +             if (unlikely(ret != 0)) {
>>>>> > + ttm_man_del_waiter(man, &waiter);
>>>>> >                        return ret;
>>>>> > -             if (mem->mm_node)
>>>>> > +             }
>>>>> > +             if (mem->mm_node) {
>>>>> > + ttm_man_del_waiter(man, &waiter);
>>>>> >                        break;
>>>>> > +             }
>>>>> >                ret = ttm_mem_evict_first(bdev, mem_type, place, 
>>>>> ctx);
>>>>> > -             if (unlikely(ret != 0))
>>>>> > +             if (unlikely(ret != 0)) {
>>>>> > + ttm_man_del_waiter(man, &waiter);
>>>>> >                        return ret;
>>>>> > +             }
>>>>> >        } while (1);
>>>>> >        mem->mem_type = mem_type;
>>>>> >        return ttm_bo_add_move_fence(bo, man, mem);
>>>>> > @@ -1450,6 +1511,8 @@ int ttm_bo_init_mm(struct ttm_bo_device 
>>>>> *bdev, unsigned type,
>>>>> >        man->use_io_reserve_lru = false;
>>>>> > mutex_init(&man->io_reserve_mutex);
>>>>> > spin_lock_init(&man->move_lock);
>>>>> > + spin_lock_init(&man->wait_lock);
>>>>> > + INIT_LIST_HEAD(&man->waiter_list);
>>>>> > INIT_LIST_HEAD(&man->io_reserve_lru);
>>>>> >
>>>>> >        ret = bdev->driver->init_mem_type(bdev, type, man);
>>>>> > diff --git a/include/drm/ttm/ttm_bo_api.h 
>>>>> b/include/drm/ttm/ttm_bo_api.h
>>>>> > index 2cd025c2abe7..0fce4dbd02e7 100644
>>>>> > --- a/include/drm/ttm/ttm_bo_api.h
>>>>> > +++ b/include/drm/ttm/ttm_bo_api.h
>>>>> > @@ -40,6 +40,7 @@
>>>>> >   #include <linux/mm.h>
>>>>> >   #include <linux/bitmap.h>
>>>>> >   #include <linux/reservation.h>
>>>>> > +#include <drm/ttm/ttm_placement.h>
>>>>> >
>>>>> >   struct ttm_bo_device;
>>>>> >
>>>>> > @@ -232,6 +233,12 @@ struct ttm_buffer_object {
>>>>> >        struct mutex wu_mutex;
>>>>> >   };
>>>>> >
>>>>> > +struct ttm_bo_waiter {
>>>>> > +     struct ttm_buffer_object *tbo;
>>>>> > +     struct ttm_place place;
>>>>> > +     struct list_head list;
>>>>> > +};
>>>>> > +
>>>>> >   /**
>>>>> >    * struct ttm_bo_kmap_obj
>>>>> >    *
>>>>> > diff --git a/include/drm/ttm/ttm_bo_driver.h 
>>>>> b/include/drm/ttm/ttm_bo_driver.h
>>>>> > index 9b417eb2df20..dc6b8b4c9e06 100644
>>>>> > --- a/include/drm/ttm/ttm_bo_driver.h
>>>>> > +++ b/include/drm/ttm/ttm_bo_driver.h
>>>>> > @@ -293,6 +293,10 @@ struct ttm_mem_type_manager {
>>>>> >        bool io_reserve_fastpath;
>>>>> >        spinlock_t move_lock;
>>>>> >
>>>>> > +     /* waiters in list */
>>>>> > +     spinlock_t wait_lock;
>>>>> > +     struct list_head waiter_list;
>>>>> > +
>>>>> >        /*
>>>>> >         * Protected by @io_reserve_mutex:
>>>>> >         */
>>>>> > @@ -748,6 +752,9 @@ int ttm_bo_mem_space(struct 
>>>>> ttm_buffer_object *bo,
>>>>> >                     struct ttm_mem_reg *mem,
>>>>> >                     struct ttm_operation_ctx *ctx);
>>>>> >
>>>>> > +int ttm_man_check_bo(struct ttm_mem_type_manager *man,
>>>>> > +                  struct ttm_buffer_object *bo,
>>>>> > +                  const struct ttm_place *place);
>>>>> >   void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct 
>>>>> ttm_mem_reg *mem);
>>>>> >   void ttm_bo_mem_put_locked(struct ttm_buffer_object *bo,
>>>>> >                           struct ttm_mem_reg *mem);
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx