From: Christian König
To: Marek Olšák, "Koenig, Christian"
Cc: amd-gfx mailing list
Subject: Re: [PATCH libdrm] amdgpu: add a faster BO list API
Date: Thu, 10 Jan 2019 12:51:03 +0100
Message-ID: <7544c927-8b1f-c7d0-dd9d-21311ffca542@gmail.com>

On 10.01.19 at 12:41, Marek Olšák wrote:


> On Thu, Jan 10, 2019, 4:15 AM Koenig, Christian wrote:
>> On 10.01.19 at 00:39, Marek Olšák wrote:
>>> On Wed, Jan 9, 2019 at 1:41 PM Christian König wrote:
>>>> On 09.01.19 at 17:14, Marek Olšák wrote:
>>>>> On Wed, Jan 9, 2019 at 8:09 AM Christian König wrote:
>>>>>> On 09.01.19 at 13:36, Marek Olšák wrote:


>>>>>>> On Wed, Jan 9, 2019, 5:28 AM Christian König wrote:
>>>>>>>> Looks good, but I'm wondering what's the actual improvement?

>>>>>>> No malloc calls and one less for loop copying the bo list.

>>>>>> Yeah, but didn't we want to get completely rid of the bo list?

>>>>> If we have multiple IBs (e.g. gfx + compute) that share a BO list, I think it's faster to send the BO list to the kernel only once.

>>>> That's not really faster.

>>>> The only thing we save is a single loop over all BOs to look up each handle into a pointer, and that is only a tiny fraction of the overhead.

>>>> The majority of the overhead is locking the BOs and reserving space for the submission.

>>>> What could really help here is to submit gfx+compute together in just one CS IOCTL. This way we would need the locking and space reservation only once.

>>>> It's a bit of work on the kernel side, but certainly doable.
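For illustration, a rough sketch of what such a combined submission could look like from userspace. This assumes hypothetical kernel support for mixed-IP IB chunks in one CS ioctl, which the current uAPI does not have:

/* Hypothetical sketch: two IBs for different IPs pushed through a single
 * DRM_IOCTL_AMDGPU_CS call, so BO locking and space reservation happen
 * only once. Requires kernel support that does not exist today. */
#include <stdint.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

static int submit_gfx_and_compute(int fd, uint32_t ctx_id, uint32_t bo_list,
                                  uint64_t gfx_va, uint64_t gfx_bytes,
                                  uint64_t cmp_va, uint64_t cmp_bytes)
{
	struct drm_amdgpu_cs_chunk_ib ibs[2] = {
		{ .va_start = gfx_va, .ib_bytes = gfx_bytes,
		  .ip_type = AMDGPU_HW_IP_GFX },
		{ .va_start = cmp_va, .ib_bytes = cmp_bytes,
		  .ip_type = AMDGPU_HW_IP_COMPUTE },
	};
	struct drm_amdgpu_cs_chunk chunks[2];
	uint64_t chunk_array[2];	/* the ioctl wants an array of pointers */
	union drm_amdgpu_cs cs = {0};
	int i;

	for (i = 0; i < 2; i++) {
		chunks[i].chunk_id = AMDGPU_CHUNK_ID_IB;
		chunks[i].length_dw = sizeof(ibs[i]) / 4;
		chunks[i].chunk_data = (uint64_t)(uintptr_t)&ibs[i];
		chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];
	}

	cs.in.ctx_id = ctx_id;
	cs.in.bo_list_handle = bo_list;	/* locked and validated once */
	cs.in.num_chunks = 2;
	cs.in.chunks = (uint64_t)(uintptr_t)chunk_array;

	return drmIoctl(fd, DRM_IOCTL_AMDGPU_CS, &cs);
}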

>>> OK. Any objections to this patch?

>> In general I'm wondering if we couldn't avoid adding so many new interfaces.

> There are Vulkan drivers that still use the bo_list interface.


>> For example, we can avoid the malloc() by just caching the last freed bo_list structure in the device. We would only need an atomic pointer exchange operation for that.

>> This way we wouldn't even need to change Mesa at all.
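A minimal sketch of that caching idea; the types and field names here are made up for illustration and are not the real libdrm layout:

#include <stdatomic.h>
#include <stdlib.h>

/* Illustrative stand-ins for the real structures. */
struct amdgpu_bo_list {
	unsigned num_buffers;
	/* buffer handles etc. would follow */
};

struct amdgpu_device {
	_Atomic(struct amdgpu_bo_list *) bo_list_cache; /* last freed list */
};

static struct amdgpu_bo_list *bo_list_get(struct amdgpu_device *dev)
{
	/* One lock-free atomic exchange: reuse the cached list if present,
	 * so the common create path does not hit malloc() at all. */
	struct amdgpu_bo_list *list = atomic_exchange(&dev->bo_list_cache, NULL);

	return list ? list : malloc(sizeof(*list));
}

static void bo_list_put(struct amdgpu_device *dev, struct amdgpu_bo_list *list)
{
	/* Stash the freed list for the next create; if another thread
	 * already parked one there, drop the older entry. */
	free(atomic_exchange(&dev->bo_list_cache, list));
}

A single cache slot is enough for the usual create/destroy-per-submission pattern, and the exchange keeps it thread-safe without any mutex.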

> There is still the for loop that we need to get rid of.

Yeah, but I'm fine with handling that via an amdgpu_bo_list_create_raw which only takes the handles and still returns the amdgpu_bo_list structure we are used to.
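Roughly like this, as a sketch; the exact signature would be up for review:

/* Sketch of the proposed helper: the caller passes raw KMS handles
 * directly, which skips the per-BO handle translation loop, but the
 * result is still the usual amdgpu_bo_list. Illustrative only. */
int amdgpu_bo_list_create_raw(amdgpu_device_handle dev,
			      uint32_t number_of_buffers,
			      struct drm_amdgpu_bo_list_entry *buffers,
			      amdgpu_bo_list_handle *result);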

See, what I'm mostly concerned about is having another CS function to maintain.



>> Regarding optimization, this chunk can be replaced by a cast on 64-bit:
>>> +	chunk_array = alloca(sizeof(uint64_t) * num_chunks);
>>> +	for (i = 0; i < num_chunks; i++)
>>> +		chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];
> It can't. The input is an array of structures. The ioctl takes an array of pointers.
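To illustrate the point: each entry in chunks[] is a 16-byte structure, while the ioctl's chunks field expects an array of u64 addresses, one per chunk, so a flat cast would hand the kernel raw struct contents instead of pointers. A sketch of why the loop stays:

#include <stdint.h>
#include <amdgpu_drm.h>

/* (uint64_t *)chunks would reinterpret struct drm_amdgpu_cs_chunk
 * payloads as addresses; the per-chunk conversion is unavoidable. */
static void fill_chunk_array(struct drm_amdgpu_cs_chunk *chunks,
			     uint64_t *chunk_array, unsigned num_chunks)
{
	unsigned i;

	for (i = 0; i < num_chunks; i++)
		chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];
}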

Ah! I hadn't seen that, sorry for the noise.

Christian.


> Marek


>> Regards,
>> Christian.


>>> Thanks,
>>> Marek


