From: Felix Kuehling <felix.kuehling-5C7GfCeVMHo@public.gmane.org>
To: Dave Airlie <airlied-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: "Oded Gabbay"
	<oded.gabbay-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"Maling list - DRI developers"
	<dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"Christian König" <christian.koenig-5C7GfCeVMHo@public.gmane.org>
Subject: Re: New KFD ioctls: taking the skeletons out of the closet
Date: Mon, 12 Mar 2018 14:17:59 -0400
Message-ID: <c9209a60-0d33-49ce-8944-1f9874aaef17@amd.com>
In-Reply-To: <4ad64912-1fb7-6bcf-8d00-97d9e4ac04bd-5C7GfCeVMHo@public.gmane.org>

On 2018-03-07 03:34 PM, Felix Kuehling wrote:
>> Again stop worrying about ioctl overhead, this isn't Windows. If you
>> can show the overhead as being a problem then address it, but I
>> think it's premature worrying about it at this stage.
> I'd like syscall overhead to be small. But with recent kernel page table
> isolation, NUMA systems and lots of GPUs, I think this may not be
> negligible. For example we're working with some Intel NUMA systems and 8
> GPUs for HPC or deep learning applications. I'll be measuring the
> overhead on such systems and get back with results in a few days. I want
> to have an API that can scale to such applications.

I ran some tests on a 2-socket Xeon E5-2680 v4 with 56 CPU threads and
8 Vega10 GPUs. The kernel was based on 4.16-rc1 with KPTI enabled and a
kernel config derived from a standard Ubuntu kernel; no debug options
were enabled. My test application measures KFD memory management API
performance by allocating, mapping, unmapping and freeing 1000 buffers
of different sizes (4K, 16K, 64K, 256K) and memory types (VRAM and
system memory). The impact of the ioctl overhead depended on whether
the page table updates were done by the CPU or by SDMA.
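
For reference, the structure of the measurement loop is roughly the
sketch below. This is not the actual test code: kfd_alloc(),
kfd_map_to_gpus(), kfd_unmap() and kfd_free() are hypothetical
stand-ins for the real thunk/ioctl wrappers, and only the timing
structure is meant to be representative.

/* Rough sketch of the measurement loop; NOT the actual test code.
 * The kfd_* functions below are hypothetical stubs standing in for
 * the KFD thunk/ioctl wrappers the real test calls.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_BUFFERS 1000
#define NUM_GPUS    8

/* Hypothetical stubs; the real test calls into the KFD thunk here. */
static void *kfd_alloc(size_t size, int vram) { (void)vram; return malloc(size); }
static void kfd_map_to_gpus(void *p, size_t size, int ngpus) { (void)p; (void)size; (void)ngpus; }
static void kfd_unmap(void *p, int ngpus) { (void)p; (void)ngpus; }
static void kfd_free(void *p, size_t size) { (void)size; free(p); }

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void bench_one(size_t size, int vram)
{
        static void *buf[NUM_BUFFERS];
        uint64_t t0, t1, t2, t3, t4;
        int i;

        t0 = now_ns();
        for (i = 0; i < NUM_BUFFERS; i++)
                buf[i] = kfd_alloc(size, vram);
        t1 = now_ns();
        for (i = 0; i < NUM_BUFFERS; i++)
                kfd_map_to_gpus(buf[i], size, NUM_GPUS); /* 1 ioctl vs. 8 */
        t2 = now_ns();
        for (i = 0; i < NUM_BUFFERS; i++)
                kfd_unmap(buf[i], NUM_GPUS);
        t3 = now_ns();
        for (i = 0; i < NUM_BUFFERS; i++)
                kfd_free(buf[i], size);
        t4 = now_ns();

        printf("%6zuK %-6s alloc %6lu us  map %6lu us  unmap %6lu us  free %6lu us\n",
               size >> 10, vram ? "VRAM" : "sysmem",
               (unsigned long)((t1 - t0) / 1000),
               (unsigned long)((t2 - t1) / 1000),
               (unsigned long)((t3 - t2) / 1000),
               (unsigned long)((t4 - t3) / 1000));
}

int main(void)
{
        static const size_t sizes[] = { 4 << 10, 16 << 10, 64 << 10, 256 << 10 };
        int s, vram;

        for (vram = 0; vram <= 1; vram++)
                for (s = 0; s < 4; s++)
                        bench_one(sizes[s], vram);
        return 0;
}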

I averaged 10 runs of the application and also calculated the standard
deviation to check whether the differences I saw were just random noise.
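
The statistic is just the sample mean and standard deviation over the
per-run timings, along the lines of:

#include <math.h>

/* Sample mean and (n-1) standard deviation over the per-run timings
 * (n > 1), used to judge whether a few percent difference is above
 * the noise floor.
 */
static void mean_stddev(const double *x, int n, double *mean, double *stddev)
{
        double sum = 0.0, var = 0.0;
        int i;

        for (i = 0; i < n; i++)
                sum += x[i];
        *mean = sum / n;
        for (i = 0; i < n; i++)
                var += (x[i] - *mean) * (x[i] - *mean);
        *stddev = sqrt(var / (n - 1));
}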

With SDMA, using a single ioctl was about 5% faster for mapping and
10% faster for unmapping; the standard deviations were 2.5% and 7.5%,
respectively.

With the CPU, a single ioctl was 2.5% faster for mapping and 18% faster
for unmapping; the standard deviations were 0.2% and 3%, respectively.

For unmapping the difference was bigger than for mapping because
unmapping is faster to begin with, so the system call overhead is
larger in proportion. Mapping a single buffer to 8 GPUs takes about
220us with SDMA or 190us with the CPU, with only a minor dependence on
buffer size and memory type. Unmapping takes about 35us with SDMA or
13us with the CPU.
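
As a very rough back-of-the-envelope check (assuming the 13us CPU
unmap time above is for the single-ioctl path and that most of the
difference is raw syscall entry/exit cost): 18% of ~13us is ~2.3us
saved by collapsing 8 per-GPU ioctls into one, i.e. on the order of
0.3us per eliminated ioctl round trip. That is nearly invisible on a
~200us mapping operation, but clearly measurable on a 13us unmap.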

>
> Regards,
>   Felix
>
>
