On 05.02.2018 09:22, Chunming Zhou wrote:
> On 2018-01-31 18:54, Christian König wrote:
>>> So I think preventing validation on the same place is a simpler way:
>>> process B's BO place is fpfn~lpfn, so it will only try to evict LRU
>>> BOs in that range. During that eviction we just prevent other
>>> validations into this range (fpfn~lpfn); outside of this range,
>>> allocation/validation can still go on.
>>>
>>> Any negatives?
>> That won't work either. The most common use of an fpfn~lpfn range is
>> to limit a BO to visible VRAM; the other use cases are to fulfill
>> hardware limitations.
>>
>> So blocking this would result in blocking all normal GTT and VRAM
>> allocations, and adding a mutex to validate would have the same
>> effect.
> No, a different effect: a mutex would serialize every allocation.
> My approach only affects the busy case. That is to say, only when
> space is used up does allocation become serialized in that range;
> otherwise it still runs in parallel.

That would still not allow for concurrent evictions. Not as bad as
completely blocking concurrent validation, but still not a good idea as
far as I can see.

Going to give it a try today to use the ww_mutex owner to detect
eviction of already reserved BOs. Maybe that can be used instead,
Christian.

>
> Regards,
> David Zhou
>
>
>>
>> Regards,
>> Christian.
>>
>> On 31.01.2018 11:30, Chunming Zhou wrote:
>>>
>>>
>>>
>>> On 2018-01-26 22:35, Christian König wrote:
>>>> I just realized that a change I've been thinking about for a while
>>>> would solve your problem as well, but keep concurrent allocation
>>>> possible.
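The fpfn~lpfn blocking idea above hinges on an interval-overlap test between a new placement and each waiter's placement. Here is a minimal userspace sketch; `place_range` and `ranges_overlap` are hypothetical stand-ins, not the TTM API (the real `ttm_place` treats lpfn == 0 as "no limit", which is not modeled here). Note that the canonical test also catches the case where one range fully contains the other, which a check that only tests the new range's endpoints can miss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fpfn~lpfn part of struct ttm_place.
 * Unlike the real ttm_place, lpfn is always a concrete inclusive
 * upper bound here (TTM's "lpfn == 0 means no limit" is ignored). */
struct place_range {
	uint32_t fpfn;	/* first allowed page frame number */
	uint32_t lpfn;	/* last allowed page frame number (inclusive) */
};

/* Two inclusive ranges overlap iff each one starts no later than the
 * other ends.  This also covers full containment of one range in the
 * other, which endpoint-only checks can miss. */
static bool ranges_overlap(const struct place_range *a,
			   const struct place_range *b)
{
	return a->fpfn <= b->lpfn && b->fpfn <= a->lpfn;
}
```

With a test like this, a waiter check would also flag a new placement that completely spans a waiter's range, not just one whose endpoints land inside it.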
>>>>
>>>> See ttm_mem_evict_first() unlocks the BO after evicting it:
>>>>>         ttm_bo_del_from_lru(bo);
>>>>>         spin_unlock(&glob->lru_lock);
>>>>>
>>>>>         ret = ttm_bo_evict(bo, ctx);
>>>>>         if (locked) {
>>>>>                 ttm_bo_unreserve(bo); <-------- here
>>>>>         } else {
>>>>>                 spin_lock(&glob->lru_lock);
>>>>>                 ttm_bo_add_to_lru(bo);
>>>>>                 spin_unlock(&glob->lru_lock);
>>>>>         }
>>>>>
>>>>>         kref_put(&bo->list_kref, ttm_bo_release_list);
>>>>
>>>> The effect is that in your example process C can beat process B not
>>>> just once, but many, many times, because we run into a ping/pong
>>>> situation where B evicts resources while C moves them back in.
>>> For the ping/pong case, I want to disable busy placement during the
>>> allocation period and only enable it for CS BO validation.
>>>
>>>>
>>>> For a while now I've been thinking about dropping those reservations
>>>> only after the original allocation succeeded.
>>>>
>>>> The effect would be that process C can still beat process B
>>>> initially, but sooner or later process B would evict some resources
>>>> from process C as well, and then it can succeed with its allocation.
>>> If it is from process C's CS validation, process B still needs to
>>> evict the resource only after process C's command submission
>>> completes.
>>>
>>>>
>>>> The problem is that for this approach to work we need a core change
>>>> to the ww_mutexes to be able to handle this efficiently.
>>> Yes, ww_mutex doesn't support this nested lock, which easily
>>> deadlocks without a ticket and class.
>>>
>>> So I think preventing validation on the same place is a simpler way:
>>> process B's BO place is fpfn~lpfn, so it will only try to evict LRU
>>> BOs in that range. During that eviction we just prevent other
>>> validations into this range (fpfn~lpfn); outside of this range,
>>> allocation/validation can still go on.
>>>
>>> Any negatives?
>>>
>>> Regards,
>>> David Zhou
>>>>
>>>> Regards,
>>>> Christian.
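Christian's "drop the reservations only after the original allocation succeeded" idea above can be sketched with a toy userspace model of a single memory slot. All names here are hypothetical; the `reserved_by` field crudely stands in for keeping the evicted BO's ww_mutex held across the whole allocation, which as noted would need core ww_mutex changes:

```c
#include <assert.h>
#include <stdbool.h>

/* One toy memory slot: owner holds the memory, reserved_by marks an
 * evictor that is still holding the space it freed.  0 means nobody. */
struct toy_slot {
	int owner;
	int reserved_by;
};

static bool toy_alloc(struct toy_slot *s, int pid)
{
	/* Space another process reserved while evicting is off-limits;
	 * otherwise allocation is first come, first served. */
	if (s->owner == 0 && (s->reserved_by == 0 || s->reserved_by == pid)) {
		s->owner = pid;
		s->reserved_by = 0;
		return true;
	}
	return false;
}

/* The evictor frees the victim's space but keeps it reserved. */
static void toy_evict(struct toy_slot *s, int evictor)
{
	s->owner = 0;
	s->reserved_by = evictor;
}

/* Replay the thread's scenario: B evicts A, C tries to jump ahead. */
static bool b_wins_when_reservation_is_held(void)
{
	struct toy_slot s = { .owner = 'A', .reserved_by = 0 };

	toy_evict(&s, 'B');	/* B makes room and keeps it reserved */
	bool c_stole = toy_alloc(&s, 'C');	/* C can't grab the space */
	bool b_got = toy_alloc(&s, 'B');	/* B's allocation succeeds */
	return !c_stole && b_got;
}
```

In this model C can no longer steal the space B made room for, at the cost that the space stays unusable by anyone else until B's allocation completes.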
>>>>
>>>> On 26.01.2018 14:59, Christian König wrote:
>>>>> I know, but this has the same effect. You prevent concurrent
>>>>> allocation from happening.
>>>>>
>>>>> What we could do is to pipeline reusing of deleted memory as well;
>>>>> this makes it less likely to cause the problem you are seeing,
>>>>> because the evicting processes don't need to block for deleted
>>>>> BOs any more.
>>>>>
>>>>> But that other processes can grab memory during eviction is
>>>>> intentional. Otherwise greedy processes would completely dominate
>>>>> command submission.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 26.01.2018 14:50, Zhou, David(ChunMing) wrote:
>>>>>> I don't want to prevent all of it; my new approach is to prevent a
>>>>>> later allocation from getting ahead of an earlier one and taking
>>>>>> the memory space that the earlier one made available through
>>>>>> eviction.
>>>>>>
>>>>>>
>>>>>> Sent from Smartisan Pro
>>>>>>
>>>>>> Christian König wrote on 2018-01-26 9:24 PM:
>>>>>>
>>>>>> Yes, exactly that's the problem.
>>>>>>
>>>>>> See, when you want to prevent a process B from allocating the
>>>>>> memory process A has evicted, you need to prevent all concurrent
>>>>>> allocation.
>>>>>>
>>>>>> And we don't do that because it causes a major performance drop.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 26.01.2018 14:21, Zhou, David(ChunMing) wrote:
>>>>>>> Your patch will prevent concurrent allocation, and will result in
>>>>>>> a big allocation performance drop.
>>>>>>>
>>>>>>> Sent from Smartisan Pro
>>>>>>>
>>>>>>> Christian König wrote on 2018-01-26 9:04 PM:
>>>>>>>
>>>>>>> Attached is what you actually want to do, cleanly implemented.
>>>>>>> But as I said, this is a NO-GO.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 26.01.2018 13:43, Christian König wrote:
>>>>>>>>> After my investigation, this issue should be a defect of the
>>>>>>>>> TTM design itself, which breaks scheduling balance.
>>>>>>>> Yeah, but again, this is intended design that we can't change
>>>>>>>> easily.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 26.01.2018 13:36, Zhou, David(ChunMing) wrote:
>>>>>>>>> I am off work, so I'm replying by phone; the format may not be
>>>>>>>>> plain text.
>>>>>>>>>
>>>>>>>>> Back to the topic itself:
>>>>>>>>> the problem does happen with the amdgpu driver. Someone
>>>>>>>>> reported to me that when an application runs with two
>>>>>>>>> instances, their performance differs.
>>>>>>>>> I also reproduced the issue with a unit test
>>>>>>>>> (bo_eviction_test). They always think our scheduler isn't
>>>>>>>>> working as expected.
>>>>>>>>>
>>>>>>>>> After my investigation, this issue should be a defect of the
>>>>>>>>> TTM design itself, which breaks scheduling balance.
>>>>>>>>>
>>>>>>>>> Further, if we run containers for our GPU, container A could
>>>>>>>>> get a high score while container B gets a low score with the
>>>>>>>>> same benchmark.
>>>>>>>>>
>>>>>>>>> So this is a bug that we need to fix.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> David Zhou
>>>>>>>>>
>>>>>>>>> Sent from Smartisan Pro
>>>>>>>>>
>>>>>>>>> Christian König wrote on 2018-01-26 6:31 PM:
>>>>>>>>>
>>>>>>>>> On 26.01.2018 11:22, Chunming Zhou wrote:
>>>>>>>>> > There is a scheduling balance issue with get_node, like:
>>>>>>>>> > a. process A allocates all of memory and uses it for
>>>>>>>>> submission.
>>>>>>>>> > b. process B tries to allocate memory and will wait in
>>>>>>>>> eviction for process A's BO to idle.
>>>>>>>>> > c. process A completes its job, and process B's eviction
>>>>>>>>> will put process A's BO node,
>>>>>>>>> > but in the meantime process C comes to allocate a BO, will
>>>>>>>>> directly get the node successfully, and does its submission;
>>>>>>>>> > process B will again wait for process C's BO to idle.
>>>>>>>>> > d. repeat the above steps, and process B could be delayed
>>>>>>>>> much more.
>>>>>>>>> >
>>>>>>>>> > A later allocation must not get ahead of an earlier one for
>>>>>>>>> the same place.
>>>>>>>>>
>>>>>>>>> Again, NAK to the whole approach.
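The starvation sequence in steps a.-d. can be replayed with a toy single-slot allocator that, like the behavior described, hands freed space to whoever asks first and keeps no record of waiters. This is a userspace sketch with hypothetical names, not the TTM code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-slot "memory manager": a free slot goes to whoever calls
 * toy_alloc() first; nothing records how long anyone has been waiting. */
struct toy_pool {
	int owner;	/* 0 == free, otherwise the owning process id */
};

static bool toy_alloc(struct toy_pool *p, int pid)
{
	if (p->owner != 0)
		return false;	/* caller must retry, i.e. keep waiting */
	p->owner = pid;
	return true;
}

static void toy_free(struct toy_pool *p)
{
	p->owner = 0;
}

/* Replay steps a.-d.: count how often process B loses the race. */
static int b_lost_count(void)
{
	struct toy_pool p = { 0 };
	int b_lost = 0;

	toy_alloc(&p, 'A');	/* a. A allocates all of memory */
	if (!toy_alloc(&p, 'B'))
		b_lost++;	/* b. B must wait for A's BO to idle */
	toy_free(&p);		/* c. A's job completes, the node frees... */
	toy_alloc(&p, 'C');	/*    ...but C arrives and gets it directly */
	if (!toy_alloc(&p, 'B'))
		b_lost++;	/* d. B now waits for C instead */
	return b_lost;
}
```

In this model every newly arriving process can win the slot B just helped free, which is exactly the ping/pong behavior the thread describes.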
>>>>>>>>>
>>>>>>>>> At least with amdgpu the problem you described above never
>>>>>>>>> occurs, because evictions are pipelined operations. We could
>>>>>>>>> only block for deleted regions to become free.
>>>>>>>>>
>>>>>>>>> But independent of that, incoming memory requests while we
>>>>>>>>> make room for eviction are intended to be served first.
>>>>>>>>>
>>>>>>>>> Changing that is certainly a no-go, because it would favor
>>>>>>>>> memory-hungry applications over small clients.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> >
>>>>>>>>> > Change-Id: I3daa892e50f82226c552cc008a29e55894a98f18
>>>>>>>>> > Signed-off-by: Chunming Zhou
>>>>>>>>> > ---
>>>>>>>>> >   drivers/gpu/drm/ttm/ttm_bo.c    | 69 +++++++++++++++++++++++++++++++++++++++--
>>>>>>>>> >   include/drm/ttm/ttm_bo_api.h    |  7 +++++
>>>>>>>>> >   include/drm/ttm/ttm_bo_driver.h |  7 +++++
>>>>>>>>> >   3 files changed, 80 insertions(+), 3 deletions(-)
>>>>>>>>> >
>>>>>>>>> > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>>>>>>>>> > index d33a6bb742a1..558ec2cf465d 100644
>>>>>>>>> > --- a/drivers/gpu/drm/ttm/ttm_bo.c
>>>>>>>>> > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>>>>>>>>> > @@ -841,6 +841,58 @@ static int ttm_bo_add_move_fence(struct ttm_buffer_object *bo,
>>>>>>>>> >        return 0;
>>>>>>>>> >   }
>>>>>>>>> >
>>>>>>>>> > +static void ttm_man_init_waiter(struct ttm_bo_waiter *waiter,
>>>>>>>>> > +                             struct ttm_buffer_object *bo,
>>>>>>>>> > +                             const struct ttm_place *place)
>>>>>>>>> > +{
>>>>>>>>> > +     waiter->tbo = bo;
>>>>>>>>> > +     memcpy((void *)&waiter->place, (void *)place, sizeof(*place));
>>>>>>>>> > +     INIT_LIST_HEAD(&waiter->list);
>>>>>>>>> > +}
>>>>>>>>> > +
>>>>>>>>> > +static void ttm_man_add_waiter(struct ttm_mem_type_manager *man,
>>>>>>>>> > +                            struct ttm_bo_waiter *waiter)
>>>>>>>>> > +{
>>>>>>>>> > +     if (!waiter)
>>>>>>>>> > +             return;
>>>>>>>>> > +     spin_lock(&man->wait_lock);
>>>>>>>>> > +     list_add_tail(&waiter->list, &man->waiter_list);
>>>>>>>>> > +     spin_unlock(&man->wait_lock);
>>>>>>>>> > +}
>>>>>>>>> > +
>>>>>>>>> > +static void ttm_man_del_waiter(struct ttm_mem_type_manager *man,
>>>>>>>>> > +                            struct ttm_bo_waiter *waiter)
>>>>>>>>> > +{
>>>>>>>>> > +     if (!waiter)
>>>>>>>>> > +             return;
>>>>>>>>> > +     spin_lock(&man->wait_lock);
>>>>>>>>> > +     if (!list_empty(&waiter->list))
>>>>>>>>> > +             list_del(&waiter->list);
>>>>>>>>> > +     spin_unlock(&man->wait_lock);
>>>>>>>>> > +     kfree(waiter);
>>>>>>>>> > +}
>>>>>>>>> > +
>>>>>>>>> > +int ttm_man_check_bo(struct ttm_mem_type_manager *man,
>>>>>>>>> > +                  struct ttm_buffer_object *bo,
>>>>>>>>> > +                  const struct ttm_place *place)
>>>>>>>>> > +{
>>>>>>>>> > +     struct ttm_bo_waiter *waiter, *tmp;
>>>>>>>>> > +
>>>>>>>>> > +     spin_lock(&man->wait_lock);
>>>>>>>>> > +     list_for_each_entry_safe(waiter, tmp, &man->waiter_list, list) {
>>>>>>>>> > +             if ((bo != waiter->tbo) &&
>>>>>>>>> > +                 ((place->fpfn >= waiter->place.fpfn &&
>>>>>>>>> > +                   place->fpfn <= waiter->place.lpfn) ||
>>>>>>>>> > +                  (place->lpfn <= waiter->place.lpfn &&
>>>>>>>>> > +                   place->lpfn >= waiter->place.fpfn)))
>>>>>>>>> > +                 goto later_bo;
>>>>>>>>> > +     }
>>>>>>>>> > +     spin_unlock(&man->wait_lock);
>>>>>>>>> > +     return true;
>>>>>>>>> > +later_bo:
>>>>>>>>> > +     spin_unlock(&man->wait_lock);
>>>>>>>>> > +     return false;
>>>>>>>>> > +}
>>>>>>>>> >   /**
>>>>>>>>> >    * Repeatedly evict memory from the LRU for @mem_type until we create enough
>>>>>>>>> >    * space, or we've evicted everything and there isn't enough space.
>>>>>>>>> > @@ -853,17 +905,26 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
>>>>>>>>> >   {
>>>>>>>>> >        struct ttm_bo_device *bdev = bo->bdev;
>>>>>>>>> >        struct ttm_mem_type_manager *man = &bdev->man[mem_type];
>>>>>>>>> > +     struct ttm_bo_waiter waiter;
>>>>>>>>> >        int ret;
>>>>>>>>> >
>>>>>>>>> > +     ttm_man_init_waiter(&waiter, bo, place);
>>>>>>>>> > +     ttm_man_add_waiter(man, &waiter);
>>>>>>>>> >        do {
>>>>>>>>> >                ret = (*man->func->get_node)(man, bo, place, mem);
>>>>>>>>> > -             if (unlikely(ret != 0))
>>>>>>>>> > +             if (unlikely(ret != 0)) {
>>>>>>>>> > +                     ttm_man_del_waiter(man, &waiter);
>>>>>>>>> >                        return ret;
>>>>>>>>> > -             if (mem->mm_node)
>>>>>>>>> > +             }
>>>>>>>>> > +             if (mem->mm_node) {
>>>>>>>>> > +                     ttm_man_del_waiter(man, &waiter);
>>>>>>>>> >                        break;
>>>>>>>>> > +             }
>>>>>>>>> >                ret = ttm_mem_evict_first(bdev, mem_type, place, ctx);
>>>>>>>>> > -             if (unlikely(ret != 0))
>>>>>>>>> > +             if (unlikely(ret != 0)) {
>>>>>>>>> > +                     ttm_man_del_waiter(man, &waiter);
>>>>>>>>> >                        return ret;
>>>>>>>>> > +             }
>>>>>>>>> >        } while (1);
>>>>>>>>> >        mem->mem_type = mem_type;
>>>>>>>>> >        return ttm_bo_add_move_fence(bo, man, mem);
>>>>>>>>> > @@ -1450,6 +1511,8 @@ int ttm_bo_init_mm(struct ttm_bo_device *bdev, unsigned type,
>>>>>>>>> >        man->use_io_reserve_lru = false;
>>>>>>>>> >        mutex_init(&man->io_reserve_mutex);
>>>>>>>>> >        spin_lock_init(&man->move_lock);
>>>>>>>>> > +     spin_lock_init(&man->wait_lock);
>>>>>>>>> > +     INIT_LIST_HEAD(&man->waiter_list);
>>>>>>>>> >        INIT_LIST_HEAD(&man->io_reserve_lru);
>>>>>>>>> >
>>>>>>>>> >        ret = bdev->driver->init_mem_type(bdev, type, man);
>>>>>>>>> > diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>> > index 2cd025c2abe7..0fce4dbd02e7 100644
>>>>>>>>> > --- a/include/drm/ttm/ttm_bo_api.h
>>>>>>>>> > +++ b/include/drm/ttm/ttm_bo_api.h
>>>>>>>>> > @@ -40,6 +40,7 @@
>>>>>>>>> >   #include
>>>>>>>>> >   #include
>>>>>>>>> >   #include
>>>>>>>>> > +#include
>>>>>>>>> >
>>>>>>>>> >   struct ttm_bo_device;
>>>>>>>>> >
>>>>>>>>> > @@ -232,6 +233,12 @@ struct ttm_buffer_object {
>>>>>>>>> >        struct mutex wu_mutex;
>>>>>>>>> >   };
>>>>>>>>> >
>>>>>>>>> > +struct ttm_bo_waiter {
>>>>>>>>> > +     struct ttm_buffer_object *tbo;
>>>>>>>>> > +     struct ttm_place place;
>>>>>>>>> > +     struct list_head list;
>>>>>>>>> > +};
>>>>>>>>> > +
>>>>>>>>> >   /**
>>>>>>>>> >    * struct ttm_bo_kmap_obj
>>>>>>>>> >    *
>>>>>>>>> > diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
>>>>>>>>> > index 9b417eb2df20..dc6b8b4c9e06 100644
>>>>>>>>> > --- a/include/drm/ttm/ttm_bo_driver.h
>>>>>>>>> > +++ b/include/drm/ttm/ttm_bo_driver.h
>>>>>>>>> > @@ -293,6 +293,10 @@ struct ttm_mem_type_manager {
>>>>>>>>> >        bool io_reserve_fastpath;
>>>>>>>>> >        spinlock_t move_lock;
>>>>>>>>> >
>>>>>>>>> > +     /* waiters in list */
>>>>>>>>> > +     spinlock_t wait_lock;
>>>>>>>>> > +     struct list_head waiter_list;
>>>>>>>>> > +
>>>>>>>>> >        /*
>>>>>>>>> >         * Protected by @io_reserve_mutex:
>>>>>>>>> >         */
>>>>>>>>> > @@ -748,6 +752,9 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
>>>>>>>>> >                     struct ttm_mem_reg *mem,
>>>>>>>>> >                     struct ttm_operation_ctx *ctx);
>>>>>>>>> >
>>>>>>>>> > +int ttm_man_check_bo(struct ttm_mem_type_manager *man,
>>>>>>>>> > +                  struct ttm_buffer_object *bo,
>>>>>>>>> > +                  const struct ttm_place *place);
>>>>>>>>> >   void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_mem_reg *mem);
>>>>>>>>> >   void ttm_bo_mem_put_locked(struct ttm_buffer_object *bo,
>>>>>>>>> >                           struct ttm_mem_reg *mem);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>>
>>>>>
>>>>
>>>
>>
>