Am 10.01.19 um 12:41 schrieb Marek Olšák: > > > On Thu, Jan 10, 2019, 4:15 AM Koenig, Christian > wrote: > > Am 10.01.19 um 00:39 schrieb Marek Olšák: >> On Wed, Jan 9, 2019 at 1:41 PM Christian König >> > > wrote: >> >> Am 09.01.19 um 17:14 schrieb Marek Olšák: >>> On Wed, Jan 9, 2019 at 8:09 AM Christian König >>> >> > wrote: >>> >>> Am 09.01.19 um 13:36 schrieb Marek Olšák: >>>> >>>> >>>> On Wed, Jan 9, 2019, 5:28 AM Christian König >>>> >>> wrote: >>>> >>>> Looks good, but I'm wondering what's the actual >>>> improvement? >>>> >>>> >>>> No malloc calls and 1 less for loop copying the bo list. >>> >>> Yeah, but didn't we want to get completely rid of the bo >>> list? >>> >>> >>> If we have multiple IBs (e.g. gfx + compute) that share a BO >>> list, I think it's faster to send the BO list to the kernel >>> only once. >> >> That's not really faster. >> >> The only thing we safe us is a single loop over all BOs to >> lockup the handle into a pointer and that is only a tiny >> fraction of the overhead. >> >> The majority of the overhead is locking the BOs and reserving >> space for the submission. >> >> What could really help here is to submit gfx+comput together >> in just one CS IOCTL. This way we would need the locking and >> space reservation only once. >> >> It's a bit of work in the kernel side, but certainly doable. >> >> >> OK. Any objections to this patch? > > In general I'm wondering if we couldn't avoid adding so much new > interface. > > > There are Vulkan drivers that still use the bo_list interface. > > > For example we can avoid the malloc() when we just cache the last > freed bo_list structure in the device. We would just need an > atomic pointer exchange operation for that. > > > This way we even don't need to change mesa at all. > > > There is still the for loop that we need to get rid of. Yeah, but that I'm fine to handle with a amdgpu_bo_list_create_raw which only takes the handles and still returns the amdgpu_bo_list structure we are used to. See what I'm mostly concerned about is having another CS function to maintain. > > > Regarding optimization, this chunk can be replaced by a cast on 64bit: >> + chunk_array = alloca(sizeof(uint64_t) * num_chunks); >> + for (i = 0; i < num_chunks; i++) >> + chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i]; > > It can't. The input is an array of structures. The ioctl takes an > array of pointers. Ah! Haven't seen this, sorry for the noise. Christian. > > Marek > > > Regards, > Christian. > >> >> Thanks, >> Marek > > > _______________________________________________ > amd-gfx mailing list > amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx