* New KFD ioctls: taking the skeletons out of the closet
From: Felix Kuehling @ 2018-03-06 22:44 UTC (permalink / raw)
  To: Maling list - DRI developers,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Oded Gabbay,
	Christian König

Hi all,

Christian raised two potential issues in a recent KFD upstreaming code
review that are related to the KFD ioctl APIs:

 1. behaviour of -ERESTARTSYS
 2. transactional nature of KFD ioctl definitions, or lack thereof

I appreciate constructive feedback, but I also want to encourage an
open-minded rather than a dogmatic approach to API definitions. So let
me take all the skeletons out of my closet and get these APIs reviewed
in the appropriate forum before we commit to them upstream. See the end
of this email for reference.

The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
any of the other APIs raise concerns or questions, please ask.

Because of the HSA programming model, KFD memory management APIs are
synchronous. There is no pipelining. Command submission to GPUs through
user mode queues does not involve KFD. This means KFD doesn't know what
memory is used by the GPUs and when it's used. That means, when the
map_memory_to_gpu ioctl returns to user mode, all memory mapping
operations are complete and the memory can be used by the CPUs or GPUs
immediately.

HSA also uses a shared virtual memory model, so typically memory gets
mapped on multiple GPUs and CPUs at the same virtual address.

The point of contention seems to be the ability to map memory to
multiple GPUs in a single ioctl and the behaviour in failure cases. I'll
discuss two main failure cases:

1: Failure after all mappings have been dispatched via SDMA, but a
signal interrupts the wait for completion and we return -ERESTARTSYS.
Documentation/kernel-hacking/hacking.rst only says "[...] you should be
prepared to process the restart, e.g. if you're in the middle of
manipulating some data structure." I think we do that by ensuring that
memory that's already mapped won't be mapped again. So the restart will
become a no-op and just end up waiting for all the previous mappings to
complete.
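
To make that concrete, here is a rough sketch (plain C, not the actual KFD
code; the helpers are made-up stand-ins for the SDMA dispatch and the fence
wait) of the idempotent restart behaviour I mean:

/* Illustrative model of the restart-safe mapping loop; not the real KFD code.
 * dispatch_sdma_mapping() and wait_interruptible() are hypothetical stand-ins.
 */
#include <stdbool.h>

#define MAX_GPUS 8

struct mapping_state {
	int  num_devices;
	bool mapped[MAX_GPUS];	/* per-device "already mapped" flag */
};

static int dispatch_sdma_mapping(int dev) { (void)dev; return 0; }	/* stub */
static int wait_interruptible(void) { return 0; }	/* stub: may return -ERESTARTSYS */

static int map_to_all_gpus(struct mapping_state *s)
{
	int i, ret;

	for (i = 0; i < s->num_devices; i++) {
		if (s->mapped[i])	/* already done: restart skips this */
			continue;
		ret = dispatch_sdma_mapping(i);
		if (ret)
			return ret;
		s->mapped[i] = true;
	}
	/* A signal here makes us return -ERESTARTSYS; the restarted call
	 * falls straight through the loop and only re-runs this wait. */
	return wait_interruptible();
}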

Christian has a stricter requirement, and I'd like to know where that
comes from: "An interrupted IOCTL should never have a visible effect."

2: Failure to map on some but not all GPUs. This comes down to the
question, do all ioctl APIs or system calls in general need to be
transactional? As a counter example I'd give incomplete read or write
system calls that return how much was actually read or written. Our
current implementation of map_memory_to_gpu doesn't do this, but it
could be modified to return to user mode how many of the mappings, or
which mappings specifically failed or succeeded.

I'd like to know whether such behaviour is acceptable.
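
For illustration only, such feedback could look something like this (a
hypothetical variant, not the struct in the series below): an output field
that tells user mode how many array entries were actually processed.

/* Hypothetical variant with partial-success feedback; NOT the API proposed
 * below, just an illustration of what "which mappings succeeded" could mean.
 */
struct kfd_ioctl_map_memory_to_gpu_args_with_status {
	__u64 handle;			/* to KFD */
	__u64 device_ids_array_ptr;	/* to KFD: array of gpu_ids */
	__u32 device_ids_array_size;	/* to KFD */
	__u32 n_success;		/* from KFD: entries successfully mapped */
};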

The alternative would be to break multi-GPU mappings, and the final wait
for completion, into multiple ioctl calls. That would result in
additional system call overhead. I'd argue that the end result is the
same for user mode, so I don't see why I'd use multiple ioctls over a
single one.

I'm looking forward to your feedback.

Thanks,
  Felix


Reference: After the last rework, these are the ioctls I'm hoping to
upstream in my current patch series (with annotations):

/* Acquire a VM from a DRM render node FD for use by KFD on a specific device
 *
 * @drm_fd: DRM render node file descriptor
 * @gpu_id: device identifier (used throughout the KFD API)
 */
struct kfd_ioctl_acquire_vm_args {
	__u32 drm_fd;	/* to KFD */
	__u32 gpu_id;	/* to KFD */
};

/* Allocation flags: memory types */
#define KFD_IOC_ALLOC_MEM_FLAGS_VRAM		(1 << 0)
#define KFD_IOC_ALLOC_MEM_FLAGS_GTT		(1 << 1)
#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR		(1 << 2)
#define KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL	(1 << 3)
/* Allocation flags: attributes/access options */
#define KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE	(1 << 31)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXECUTABLE	(1 << 30)
#define KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC		(1 << 29)
#define KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE	(1 << 28)
#define KFD_IOC_ALLOC_MEM_FLAGS_AQL_QUEUE_MEM	(1 << 27)
#define KFD_IOC_ALLOC_MEM_FLAGS_COHERENT	(1 << 26)

/* Allocate memory for later SVM (shared virtual memory) mapping.
 *
 * @va_addr:     virtual address of the memory to be allocated
 *               all later mappings on all GPUs will use this address
 * @size:        size in bytes
 * @handle:      buffer handle returned to user mode, used to refer to
 *               this allocation for mapping, unmapping and freeing
 * @mmap_offset: for CPU-mapping the allocation by mmapping a render node
 *               for userptrs this is overloaded to specify the CPU address
 * @gpu_id:      device identifier
 * @flags:       memory type and attributes. See KFD_IOC_ALLOC_MEM_FLAGS above
 */
struct kfd_ioctl_alloc_memory_of_gpu_args {
	__u64 va_addr;		/* to KFD */
	__u64 size;		/* to KFD */
	__u64 handle;		/* from KFD */
	__u64 mmap_offset;	/* to KFD (userptr), from KFD (mmap offset) */
	__u32 gpu_id;		/* to KFD */
	__u32 flags;
};

/* Free memory allocated with kfd_ioctl_alloc_memory_of_gpu
 *
 * @handle: memory handle returned by alloc
 */
struct kfd_ioctl_free_memory_of_gpu_args {
	__u64 handle;		/* to KFD */
};

/* Map memory to one or more GPUs
 *
 * @handle:                memory handle returned by alloc
 * @device_ids_array_ptr:  array of gpu_ids
 * @device_ids_array_size: size of the gpu_ids array
 */
struct kfd_ioctl_map_memory_to_gpu_args {
	__u64 handle;			/* to KFD */
	__u64 device_ids_array_ptr;	/* to KFD */
	__u32 device_ids_array_size;	/* to KFD */
	__u32 pad;
};

/* Unmap memory from one or more GPUs
 *
 * same arguments as for mapping
 */
struct kfd_ioctl_unmap_memory_from_gpu_args {
	__u64 handle;			/* to KFD */
	__u64 device_ids_array_ptr;	/* to KFD */
	__u32 device_ids_array_size;	/* to KFD */
	__u32 pad;
};
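
For context, this is roughly how user mode would drive these ioctls to
allocate one buffer and map it to several GPUs. It is only a sketch: it
assumes AMDKFD_IOC_* request macros are defined for the structs above,
treats device_ids_array_size as an entry count, and drops most error
handling.

/* Sketch only: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU and AMDKFD_IOC_MAP_MEMORY_TO_GPU
 * are assumed request macros for the structs above; kfd_fd is an open /dev/kfd.
 */
#include <stdint.h>
#include <sys/ioctl.h>

static int alloc_and_map(int kfd_fd, uint64_t va, uint64_t size,
			 uint32_t *gpu_ids, uint32_t n_gpus)
{
	struct kfd_ioctl_alloc_memory_of_gpu_args alloc = {
		.va_addr = va,
		.size    = size,
		.gpu_id  = gpu_ids[0],
		.flags   = KFD_IOC_ALLOC_MEM_FLAGS_VRAM |
			   KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE,
	};

	if (ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &alloc))
		return -1;

	struct kfd_ioctl_map_memory_to_gpu_args map = {
		.handle                = alloc.handle,
		.device_ids_array_ptr  = (uint64_t)(uintptr_t)gpu_ids,
		.device_ids_array_size = n_gpus,	/* assumed: entry count */
	};

	/* One synchronous call: when it returns 0, all listed GPUs (and the
	 * CPU, via the mmap offset) can use the memory immediately. */
	return ioctl(kfd_fd, AMDKFD_IOC_MAP_MEMORY_TO_GPU, &map);
}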

-- 
F e l i x   K u e h l i n g
PMTS Software Development Engineer | Vertical Workstation/Compute
1 Commerce Valley Dr. East, Markham, ON L3T 7X6 Canada
(O) +1(289)695-1597
   _     _   _   _____   _____
  / \   | \ / | |  _  \  \ _  |
 / A \  | \M/ | | |D) )  /|_| |
/_/ \_\ |_| |_| |_____/ |__/ \|   facebook.com/AMD | amd.com


* Re: New KFD ioctls: taking the skeletons out of the closet
From: Dave Airlie @ 2018-03-06 23:09 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Oded Gabbay, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Maling list - DRI developers, Christian König

On 7 March 2018 at 08:44, Felix Kuehling <felix.kuehling@amd.com> wrote:
> Hi all,
>
> Christian raised two potential issues in a recent KFD upstreaming code
> review that are related to the KFD ioctl APIs:
>
>  1. behaviour of -ERESTARTSYS
>  2. transactional nature of KFD ioctl definitions, or lack thereof
>
> I appreciate constructive feedback, but I also want to encourage an
> open-minded rather than a dogmatic approach to API definitions. So let
> me take all the skeletons out of my closet and get these APIs reviewed
> in the appropriate forum before we commit to them upstream. See the end
> of this email for reference.
>
> The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
> any of the other APIs raise concerns or questions, please ask.
>
> Because of the HSA programming model, KFD memory management APIs are
> synchronous. There is no pipelining. Command submission to GPUs through
> user mode queues does not involve KFD. This means KFD doesn't know what
> memory is used by the GPUs and when it's used. That means, when the
> map_memory_to_gpu ioctl returns to user mode, all memory mapping
> operations are complete and the memory can be used by the CPUs or GPUs
> immediately.

I've got a few opinions, but first up I still dislike user-mode queues
and everything
they entail. I still feel they are solving a Windows problem and not a
Linux problem,
and it would be nice if we had some Linux numbers on what they gain us over
a dispatch ioctl, because they sure bring a lot of memory management issues.

That said amdkfd is here.

The first question you should ask (which you haven't asked here at all) is
what should userspace do with the ioctl result.

>
> HSA also uses a shared virtual memory model, so typically memory gets
> mapped on multiple GPUs and CPUs at the same virtual address.
>
> The point of contention seems to be the ability to map memory to
> multiple GPUs in a single ioctl and the behaviour in failure cases. I'll
> discuss two main failure cases:
>
> 1: Failure after all mappings have been dispatched via SDMA, but a
> signal interrupts the wait for completion and we return -ERESTARTSYS.
> Documentation/kernel-hacking/hacking.rst only says "[...] you should be
> prepared to process the restart, e.g. if you're in the middle of
> manipulating some data structure." I think we do that by ensuring that
> memory that's already mapped won't be mapped again. So the restart will
> become a no-op and just end up waiting for all the previous mappings to
> complete.

-ERESTARTSYS at that late stage points to a badly synchronous API;
I'd have said you should have two ioctls: one that returns a fence after
starting the process, and one that waits for the fence separately.
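
Roughly something like this (totally hypothetical, just to illustrate the
split I mean):

/* Hypothetical async split of the map ioctl; not part of the posted series. */
struct kfd_ioctl_map_memory_to_gpu_async_args {
	__u64 handle;			/* to KFD */
	__u64 device_ids_array_ptr;	/* to KFD */
	__u32 device_ids_array_size;	/* to KFD */
	__u32 fence_handle;		/* from KFD: completion fence */
};

struct kfd_ioctl_wait_mapping_fence_args {
	__u32 fence_handle;		/* to KFD */
	__u32 timeout_ms;		/* to KFD */
};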

The overhead of ioctls isn't your enemy until you've measured it and
proven it's a problem.

>
> Christian has a stricter requirement, and I'd like to know where that
> comes from: "An interrupted IOCTL should never have a visible effect."

Christian might be taking things a bit further; synchronous gpu access
APIs are bad, but I don't think undoing a bunch of work is a good plan either
just because you got ERESTARTSYS. If you get ERESTARTSYS can you
handle it? If I've fired off 5 SDMAs and wait for them, will I fire off 5 more?
Will I wait for the original SDMAs if I reenter?

>
> 2: Failure to map on some but not all GPUs. This comes down to the
> question, do all ioctl APIs or system calls in general need to be
> transactional? As a counter example I'd give incomplete read or write
> system calls that return how much was actually read or written. Our
> current implementation of map_memory_to_gpu doesn't do this, but it
> could be modified to return to user mode how many of the mappings, or
> which mappings specifically failed or succeeded.

What should userspace do? if it only get mappings on 3 of the gpus instead
of say 4? Is there a sane resolution other than calling the ioctl again with
the single GPU? Would it drop the GPU from the working set from that point on?

Need more info to do what can come out of the API doing incomplete
operations.

> The alternative would be to break multi-GPU mappings, and the final wait
> for completion, into multiple ioctl calls. That would result in
> additional system call overhead. I'd argue that the end result is the
> same for user mode, so I don't see why I'd use multiple ioctls over a
> single one.

Again stop worrying about ioctl overhead, this isn't Windows. If you
can show the overhead as being a problem then address it, but I
think it's premature worrying about it at this stage.

Dave.

* Re: New KFD ioctls: taking the skeletons out of the closet
From: Jerome Glisse @ 2018-03-06 23:34 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Oded Gabbay, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Maling list - DRI developers, Christian König

On Tue, Mar 06, 2018 at 05:44:41PM -0500, Felix Kuehling wrote:
> Hi all,
> 
> Christian raised two potential issues in a recent KFD upstreaming code
> review that are related to the KFD ioctl APIs:
> 
>  1. behaviour of -ERESTARTSYS
>  2. transactional nature of KFD ioctl definitions, or lack thereof
> 
> I appreciate constructive feedback, but I also want to encourage an
> open-minded rather than a dogmatic approach to API definitions. So let
> me take all the skeletons out of my closet and get these APIs reviewed
> in the appropriate forum before we commit to them upstream. See the end
> of this email for reference.
> 
> The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
> any of the other APIs raise concerns or questions, please ask.
> 
> Because of the HSA programming model, KFD memory management APIs are
> synchronous. There is no pipelining. Command submission to GPUs through
> user mode queues does not involve KFD. This means KFD doesn't know what
> memory is used by the GPUs and when it's used. That means, when the
> map_memory_to_gpu ioctl returns to user mode, all memory mapping
> operations are complete and the memory can be used by the CPUs or GPUs
> immediately.
> 
> HSA also uses a shared virtual memory model, so typically memory gets
> mapped on multiple GPUs and CPUs at the same virtual address.

Does this mean that GPU memory gets pinned? Or system memory, for that
matter? This was discussed previously, but it really goes against the
kernel mantra: the kernel no longer manages the resources, and userspace
can hog GPU memory or even system memory. This is bad!

Cheers,
Jérôme

* Re: New KFD ioctls: taking the skeletons out of the closet
From: Christian König @ 2018-03-07  8:38 UTC (permalink / raw)
  To: Dave Airlie, Felix Kuehling
  Cc: Oded Gabbay, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Maling list - DRI developers

Am 07.03.2018 um 00:09 schrieb Dave Airlie:
> On 7 March 2018 at 08:44, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> Hi all,
>>
>> Christian raised two potential issues in a recent KFD upstreaming code
>> review that are related to the KFD ioctl APIs:
>>
>>   1. behaviour of -ERESTARTSYS
>>   2. transactional nature of KFD ioctl definitions, or lack thereof
>>
>> I appreciate constructive feedback, but I also want to encourage an
>> open-minded rather than a dogmatic approach to API definitions. So let
>> me take all the skeletons out of my closet and get these APIs reviewed
>> in the appropriate forum before we commit to them upstream. See the end
>> of this email for reference.
>>
>> The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
>> any of the other APIs raise concerns or questions, please ask.
>>
>> Because of the HSA programming model, KFD memory management APIs are
>> synchronous. There is no pipelining. Command submission to GPUs through
>> user mode queues does not involve KFD. This means KFD doesn't know what
>> memory is used by the GPUs and when it's used. That means, when the
>> map_memory_to_gpu ioctl returns to user mode, all memory mapping
>> operations are complete and the memory can be used by the CPUs or GPUs
>> immediately.
> I've got a few opinions, but first up I still dislike user-mode queues
> and everything
> they entail. I still feel they are solving a Windows problem and not a
> Linux problem,
> and it would be nice if we had some Linux numbers on what they gain us over
> a dispatch ioctl, because they sure bring a lot of memory management issues.

Well user-mode queues are a problem as long as you don't have 
recoverable page faults on the GPU.

As soon as you have recoverable page faults and push the memory 
management towards things like HMM, I don't see an advantage of using an 
IOCTL-based command submission any more.

So I would say that this is a problem which is slowly going away as the 
hardware improves.

> That said amdkfd is here.
>
> The first question you should ask (which you haven't asked here at all) is
> what should userspace do with the ioctl result.
>
>> HSA also uses a shared virtual memory model, so typically memory gets
>> mapped on multiple GPUs and CPUs at the same virtual address.
>>
>> The point of contention seems to be the ability to map memory to
>> multiple GPUs in a single ioctl and the behaviour in failure cases. I'll
>> discuss two main failure cases:
>>
>> 1: Failure after all mappings have been dispatched via SDMA, but a
>> signal interrupts the wait for completion and we return -ERESTARTSYS.
>> Documentation/kernel-hacking/hacking.rst only says "[...] you should be
>> prepared to process the restart, e.g. if you're in the middle of
>> manipulating some data structure." I think we do that by ensuring that
>> memory that's already mapped won't be mapped again. So the restart will
>> become a no-op and just end up waiting for all the previous mappings to
>> complete.
> -ERESTARTSYS at that late stage points to a badly synchronous API,
> I'd have said you should have two ioctls, one that returns a fence after
> starting the processes, and one that waits for the fence separately.

That is exactly what I suggested as well, but also exactly what Felix 
tries to avoid :)

> The overhead of ioctls isn't your enemy until you've measured it and
> proven it's a problem.
>
>> Christian has a stricter requirement, and I'd like to know where that
>> comes from: "An interrupted IOCTL should never have a visible effect."
> Christian might be taking things a bit further but synchronous gpu access
> APIs are bad, but I don't think undoing a bunch of work is a good plan either
> just because you got ERESTARTSYS. If you get ERESTARTSYS can you
> handle it, if I've fired off 5 SDMAs and wait for them will I fire off 5 more?
> will I wait for the original SDMAs if I reenter?

Well it's not only the waiting for the SDMAs. If I understood it 
correctly the IOCTL proposed by Felix allows adding multiple mappings of 
buffer objects on multiple devices with just one IOCTL.

Now the problem is that without a lot of redesign of the driver this can 
fail at any place in between those operations. E.g. we could run out of 
memory or hit a permission restriction or an invalid handle, etc.

What would happen is that we end up with a half-completed IOCTL.

A possible solution might be to add some kind of feedback noting which 
operations are already complete and then only retry the ones which failed.
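
From user space that feedback could be consumed with a retry loop along
these lines (a sketch only, assuming the args grow an n_success output
field and that an AMDKFD_IOC_MAP_MEMORY_TO_GPU request macro exists;
everything here is hypothetical):

/* Hypothetical retry loop driven by an assumed n_success output field. */
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct map_args_with_status {		/* stand-in for the extended ioctl args */
	__u64 handle;
	__u64 device_ids_array_ptr;
	__u32 device_ids_array_size;	/* entry count (assumed) */
	__u32 n_success;		/* from KFD: entries already mapped */
};

static int map_with_retry(int kfd_fd, struct map_args_with_status *args)
{
	for (;;) {
		args->n_success = 0;
		if (!ioctl(kfd_fd, AMDKFD_IOC_MAP_MEMORY_TO_GPU, args))
			return 0;		/* everything mapped */
		if (errno != EINTR && errno != EAGAIN)
			return -1;		/* hard failure: give up */
		/* Drop the entries the kernel reported as done, retry the rest. */
		args->device_ids_array_ptr  += (__u64)args->n_success * sizeof(__u32);
		args->device_ids_array_size -= args->n_success;
	}
}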

>> 2: Failure to map on some but not all GPUs. This comes down to the
>> question, do all ioctl APIs or system calls in general need to be
>> transactional? As a counter example I'd give incomplete read or write
>> system calls that return how much was actually read or written. Our
>> current implementation of map_memory_to_gpu doesn't do this, but it
>> could be modified to return to user mode how many of the mappings, or
>> which mappings specifically failed or succeeded.
> What should userspace do? if it only get mappings on 3 of the gpus instead
> of say 4? Is there a sane resolution other than calling the ioctl again with
> the single GPU? Would it drop the GPU from the working set from that point on?
>
> Need more info to do what can come out of the API doing incomplete
> operations.

Felix's argument is that when a mapping operation fails, the VM ranges in 
question were undefined before and remain undefined after the failed 
operation as well.

So we would just need to retry the operation until all of it succeeds, 
but that feels kind of strange.

>> The alternative would be to break multi-GPU mappings, and the final wait
>> for completion, into multiple ioctl calls. That would result in
>> additional system call overhead. I'd argue that the end result is the
>> same for user mode, so I don't see why I'd use multiple ioctls over a
>> single one.
> Again stop worrying about ioctl overhead, this isn't Windows. If you
> can show the overhead as being a problem then address it, but I
> think it's premature worrying about it at this stage.

Well you go from one IOCTL doing everything towards one IOCTL per device 
per mapping, which can be a huge difference.

On the other hand, we internally had exactly the same discussion when we 
implemented support for partially resident textures. The result was that 
we first implemented it with individual IOCTLs and would only add the 
mass mapping IOCTL if we ever found a use case where we need it.

So far we haven't found a use case for this mass mapping IOCTL.

Regards,
Christian.

>
> Dave.


* Re: New KFD ioctls: taking the skeletons out of the closet
From: Christian König @ 2018-03-07  8:41 UTC (permalink / raw)
  To: Jerome Glisse, Felix Kuehling; +Cc: amd-gfx, Maling list - DRI developers

Am 07.03.2018 um 00:34 schrieb Jerome Glisse:
> On Tue, Mar 06, 2018 at 05:44:41PM -0500, Felix Kuehling wrote:
>> Hi all,
>>
>> Christian raised two potential issues in a recent KFD upstreaming code
>> review that are related to the KFD ioctl APIs:
>>
>>   1. behaviour of -ERESTARTSYS
>>   2. transactional nature of KFD ioctl definitions, or lack thereof
>>
>> I appreciate constructive feedback, but I also want to encourage an
>> open-minded rather than a dogmatic approach to API definitions. So let
>> me take all the skeletons out of my closet and get these APIs reviewed
>> in the appropriate forum before we commit to them upstream. See the end
>> of this email for reference.
>>
>> The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
>> any of the other APIs raise concerns or questions, please ask.
>>
>> Because of the HSA programming model, KFD memory management APIs are
>> synchronous. There is no pipelining. Command submission to GPUs through
>> user mode queues does not involve KFD. This means KFD doesn't know what
>> memory is used by the GPUs and when it's used. That means, when the
>> map_memory_to_gpu ioctl returns to user mode, all memory mapping
>> operations are complete and the memory can be used by the CPUs or GPUs
>> immediately.
>>
>> HSA also uses a shared virtual memory model, so typically memory gets
>> mapped on multiple GPUs and CPUs at the same virtual address.
> Does this means that GPU memory get pin ? Or system memory for that matter
> too. This was discuss previously but this really goes against kernel mantra
> ie kernel no longer manage resources but userspace can hog GPU memory or
> even system memory. This is bad !

Fortunately this time it is not about pinning.

All BOs which are part of the VM get a fence object attached when a user 
space queue is created.

Now when TTM needs to evict those buffer objects it will try to wait for 
this fence object, which in turn will unmap the user space queue from the 
hardware and wait for running work to finish.

After that TTM can move the BO around just like any normal GFX BO.
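
To make the ordering a bit more concrete, here is a purely conceptual
model in plain C; it is not the actual amdgpu/TTM code and all names are
invented:

/* Conceptual model of the eviction fence: user mode queues are preempted and
 * drained before the (modeled) eviction path may move the BO. Not real code. */
#include <stdbool.h>

struct kfd_process_model;		/* opaque: owner of the user mode queues */

struct bo_model {
	struct kfd_process_model *owner;	/* set while user mode queues exist */
	bool queues_unmapped;
};

static void unmap_user_queues(struct kfd_process_model *p) { (void)p; }	/* stub */
static void wait_for_running_work(struct kfd_process_model *p) { (void)p; }	/* stub */

/* Stands in for waiting on the fence attached to the BO. */
static void wait_eviction_fence(struct bo_model *bo)
{
	if (bo->owner && !bo->queues_unmapped) {
		unmap_user_queues(bo->owner);		/* take queues off the hardware */
		wait_for_running_work(bo->owner);	/* let in-flight GPU work finish */
		bo->queues_unmapped = true;
	}
	/* Only after this is it safe to move the BO like any normal GFX BO. */
}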

Regards,
Christian.

>
> Cheers,
> Jérôme


* Re: New KFD ioctls: taking the skeletons out of the closet
From: Daniel Vetter @ 2018-03-07 16:38 UTC (permalink / raw)
  To: Christian König
  Cc: Felix Kuehling, Dave Airlie, Maling list - DRI developers,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On Wed, Mar 07, 2018 at 09:38:03AM +0100, Christian König wrote:
> Am 07.03.2018 um 00:09 schrieb Dave Airlie:
> > On 7 March 2018 at 08:44, Felix Kuehling <felix.kuehling@amd.com> wrote:
> > > Hi all,
> > > 
> > > Christian raised two potential issues in a recent KFD upstreaming code
> > > review that are related to the KFD ioctl APIs:
> > > 
> > >   1. behaviour of -ERESTARTSYS
> > >   2. transactional nature of KFD ioctl definitions, or lack thereof
> > > 
> > > I appreciate constructive feedback, but I also want to encourage an
> > > open-minded rather than a dogmatic approach to API definitions. So let
> > > me take all the skeletons out of my closet and get these APIs reviewed
> > > in the appropriate forum before we commit to them upstream. See the end
> > > of this email for reference.
> > > 
> > > The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
> > > any of the other APIs raise concerns or questions, please ask.
> > > 
> > > Because of the HSA programming model, KFD memory management APIs are
> > > synchronous. There is no pipelining. Command submission to GPUs through
> > > user mode queues does not involve KFD. This means KFD doesn't know what
> > > memory is used by the GPUs and when it's used. That means, when the
> > > map_memory_to_gpu ioctl returns to user mode, all memory mapping
> > > operations are complete and the memory can be used by the CPUs or GPUs
> > > immediately.
> > I've got a few opinions, but first up I still dislike user-mode queues
> > and everything
> > they entail. I still feel they are solving a Windows problem and not a
> > Linux problem,
> > and it would be nice if we had some Linux numbers on what they gain us over
> > a dispatch ioctl, because they sure bring a lot of memory management issues.
> 
> Well user-mode queues are a problem as long as you don't have recoverable
> page faults on the GPU.
> 
> As soon as you got recoverable page faults and push the memory management
> towards things like HMM I don't see an advantage of using a IOCTL based
> command submission any more.
> 
> So I would say that this is a problem which is slowly going away as the
> hardware improves.

Yeah, but up to the point where the hw actually works (instead of promises
that maybe it'll work next generation, trust us, for like a few
generations) it's much easier to hack up an ioctl with workarounds than
intercepting an mmap write fault all the time (those are slower than
ioctls).

I think userspace queues are fine once we have known-working hw. Before
that I'm kinda agreeing with Dave and not seeing the point. At least to my
knowledge we still haven't arrived in the wonderful promised land of hw
recoverable (well, restartable really) page faults on any vendors platform
...

> > That said amdkfd is here.
> > 
> > The first question you should ask (which you haven't asked here at all) is
> > what should userspace do with the ioctl result.
> > 
> > > HSA also uses a shared virtual memory model, so typically memory gets
> > > mapped on multiple GPUs and CPUs at the same virtual address.
> > > 
> > > The point of contention seems to be the ability to map memory to
> > > multiple GPUs in a single ioctl and the behaviour in failure cases. I'll
> > > discuss two main failure cases:
> > > 
> > > 1: Failure after all mappings have been dispatched via SDMA, but a
> > > signal interrupts the wait for completion and we return -ERESTARTSYS.
> > > Documentation/kernel-hacking/hacking.rst only says "[...] you should be
> > > prepared to process the restart, e.g. if you're in the middle of
> > > manipulating some data structure." I think we do that by ensuring that
> > > memory that's already mapped won't be mapped again. So the restart will
> > > become a no-op and just end up waiting for all the previous mappings to
> > > complete.
> > -ERESTARTSYS at that late stage points to a badly synchronous API,
> > I'd have said you should have two ioctls, one that returns a fence after
> > starting the processes, and one that waits for the fence separately.
> 
> That is exactly what I suggested as well, but also exactly what Felix tries
> to avoid :)
> 
> > The overhead of ioctls isn't your enemy until you've measured it and
> > proven it's a problem.
> > 
> > > Christian has a stricter requirement, and I'd like to know where that
> > > comes from: "An interrupted IOCTL should never have a visible effect."
> > Christian might be taking things a bit further but synchronous gpu access
> > APIs are bad, but I don't think undoing a bunch of work is a good plan either
> > just because you got ERESTARTSYS. If you get ERESTARTSYS can you
> > handle it, if I've fired off 5 SDMAs and wait for them will I fire off 5 more?
> > will I wait for the original SDMAs if I reenter?
> 
> Well it's not only the waiting for the SDMAs. If I understood it correctly
> the IOCTL proposed by Felix allows adding multiple mappings of buffer
> objects on multiple devices with just one IOCTL.
> 
> Now the problem is without a lot of redesign of the driver this can fail at
> any place in between those operations. E.g. we could run out of memory or
> hit a permission restriction or an invalid handle etc.. etc...
> 
> What would happen is that we end up with a halve complete IOCTL.
> 
> A possible solution might be that we could maybe add some kind of feedback
> noting which operations are already complete and then only retrying the one
> which failed.

Atomic ioctl behaviour is hard. Like reeeeeeaaaaaaaaaaalllllllly hard.

Look at atomic kms if you don't believe, or the v4l equivalent, and that
doesn't even try to do cross device atomic. Also, it explicitly isn't
atomic wrt memory management stuff (like pinning scanout buffers into
vram), because that was too hard - we simply try to pin and then roll back
if it happens to not work out and apologize to userspace for the mess.

Except when your career plan is to spend the next few decades on
prototyping this as an R&D project, I recommend to not try :-)

> > > 2: Failure to map on some but not all GPUs. This comes down to the
> > > question, do all ioctl APIs or system calls in general need to be
> > > transactional? As a counter example I'd give incomplete read or write
> > > system calls that return how much was actually read or written. Our
> > > current implementation of map_memory_to_gpu doesn't do this, but it
> > > could be modified to return to user mode how many of the mappings, or
> > > which mappings specifically failed or succeeded.
> > What should userspace do? if it only get mappings on 3 of the gpus instead
> > of say 4? Is there a sane resolution other than calling the ioctl again with
> > the single GPU? Would it drop the GPU from the working set from that point on?
> > 
> > Need more info to do what can come out of the API doing incomplete
> > operations.
> 
> Felix argument that when a mapping operations fails the VM ranges in
> question would have been undefined before and are undefined after that
> operation failed as well.
> 
> So we could just need to retry the operation until all of it succeeds, but
> that feels kind of strange.

+1 on make your gpu apis async, we have drm_syncobj/sync_file/dma_fence as
a standard way for this now.

> > > The alternative would be to break multi-GPU mappings, and the final wait
> > > for completion, into multiple ioctl calls. That would result in
> > > additional system call overhead. I'd argue that the end result is the
> > > same for user mode, so I don't see why I'd use multiple ioctls over a
> > > single one.
> > Again stop worrying about ioctl overhead, this isn't Windows. If you
> > can show the overhead as being a problem then address it, but I
> > think it's premature worrying about it at this stage.
> 
> Well you go from one IOCTL doing everything towards one IOCTL per device per
> mapping which can be a huge difference.
> 
> One the other hand we internally had exactly the same discussion when we
> implemented support for partially resident textures. The result was that we
> first implement it with individual IOCTLs and implement the mass mapping
> IOCTL if we ever find an use case where we need it.
> 
> So far we haven't found a use case for this mass mapping IOCTL.

Aligns with my expectations/experience/planning for i915.ko stuff very
much.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: New KFD ioctls: taking the skeletons out of the closet
From: Alex Deucher @ 2018-03-07 19:55 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Felix Kuehling, amd-gfx, Christian König, Maling list - DRI developers

On Wed, Mar 7, 2018 at 11:38 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Mar 07, 2018 at 09:38:03AM +0100, Christian König wrote:
>> Am 07.03.2018 um 00:09 schrieb Dave Airlie:
>> > On 7 March 2018 at 08:44, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> > > Hi all,
>> > >
>> > > Christian raised two potential issues in a recent KFD upstreaming code
>> > > review that are related to the KFD ioctl APIs:
>> > >
>> > >   1. behaviour of -ERESTARTSYS
>> > >   2. transactional nature of KFD ioctl definitions, or lack thereof
>> > >
>> > > I appreciate constructive feedback, but I also want to encourage an
>> > > open-minded rather than a dogmatic approach to API definitions. So let
>> > > me take all the skeletons out of my closet and get these APIs reviewed
>> > > in the appropriate forum before we commit to them upstream. See the end
>> > > of this email for reference.
>> > >
>> > > The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
>> > > any of the other APIs raise concerns or questions, please ask.
>> > >
>> > > Because of the HSA programming model, KFD memory management APIs are
>> > > synchronous. There is no pipelining. Command submission to GPUs through
>> > > user mode queues does not involve KFD. This means KFD doesn't know what
>> > > memory is used by the GPUs and when it's used. That means, when the
>> > > map_memory_to_gpu ioctl returns to user mode, all memory mapping
>> > > operations are complete and the memory can be used by the CPUs or GPUs
>> > > immediately.
>> > I've got a few opinions, but first up I still dislike user-mode queues
>> > and everything
>> > they entail. I still feel they are solving a Windows problem and not a
>> > Linux problem,
>> > and it would be nice if we had some Linux numbers on what they gain us over
>> > a dispatch ioctl, because they sure bring a lot of memory management issues.
>>
>> Well user-mode queues are a problem as long as you don't have recoverable
>> page faults on the GPU.
>>
>> As soon as you got recoverable page faults and push the memory management
>> towards things like HMM I don't see an advantage of using a IOCTL based
>> command submission any more.
>>
>> So I would say that this is a problem which is slowly going away as the
>> hardware improves.
>
> Yeah, but up to the point where the hw actually works (instead of promises
> that maybe it'll work next generation, trust us, for like a few
> generations) it's much easier to hack up an ioctl with workarounds than
> intercepting an mmap write fault all the time (those are slower than
> ioctls).
>
> I think userspace queues are fine once we have known-working hw. Before
> that I'm kinda agreeing with Dave and not seeing the point. At least to my
> knowledge we still haven't arrived in the wonderful promised land of hw
> recoverable (well, restartable really) page faults on any vendors platform
> ...


I think user space queues are a bit of a distraction.  The original
point of KFD and HSA was to have a consistent programming model across
CPU and other devices with relatively seamless access to the same
memory pools.  KFD was originally focused on APUs and when we have an
IOMMUv2 with ATC available, we have support for recoverable page
faults.  It's been working for 3 generations of hw and has been
expanded to GPUVM on newer hw which doesn't have the dependency on
IOMMU and also supports VRAM.  We added support for KFD for older dGPUs
that don't have this capability, but that is certainly not the only
use case we need to consider.

Alex

>
>> > That said amdkfd is here.
>> >
>> > The first question you should ask (which you haven't asked here at all) is
>> > what should userspace do with the ioctl result.
>> >
>> > > HSA also uses a shared virtual memory model, so typically memory gets
>> > > mapped on multiple GPUs and CPUs at the same virtual address.
>> > >
>> > > The point of contention seems to be the ability to map memory to
>> > > multiple GPUs in a single ioctl and the behaviour in failure cases. I'll
>> > > discuss two main failure cases:
>> > >
>> > > 1: Failure after all mappings have been dispatched via SDMA, but a
>> > > signal interrupts the wait for completion and we return -ERESTARTSYS.
>> > > Documentation/kernel-hacking/hacking.rst only says "[...] you should be
>> > > prepared to process the restart, e.g. if you're in the middle of
>> > > manipulating some data structure." I think we do that by ensuring that
>> > > memory that's already mapped won't be mapped again. So the restart will
>> > > become a no-op and just end up waiting for all the previous mappings to
>> > > complete.
>> > -ERESTARTSYS at that late stage points to a badly synchronous API,
>> > I'd have said you should have two ioctls, one that returns a fence after
>> > starting the processes, and one that waits for the fence separately.
>>
>> That is exactly what I suggested as well, but also exactly what Felix tries
>> to avoid :)
>>
>> > The overhead of ioctls isn't your enemy until you've measured it and
>> > proven it's a problem.
>> >
>> > > Christian has a stricter requirement, and I'd like to know where that
>> > > comes from: "An interrupted IOCTL should never have a visible effect."
>> > Christian might be taking things a bit further but synchronous gpu access
>> > APIs are bad, but I don't think undoing a bunch of work is a good plan either
>> > just because you got ERESTARTSYS. If you get ERESTARTSYS can you
>> > handle it, if I've fired off 5 SDMAs and wait for them will I fire off 5 more?
>> > will I wait for the original SDMAs if I reenter?
>>
>> Well it's not only the waiting for the SDMAs. If I understood it correctly
>> the IOCTL proposed by Felix allows adding multiple mappings of buffer
>> objects on multiple devices with just one IOCTL.
>>
>> Now the problem is without a lot of redesign of the driver this can fail at
>> any place in between those operations. E.g. we could run out of memory or
>> hit a permission restriction or an invalid handle etc.. etc...
>>
>> What would happen is that we end up with a halve complete IOCTL.
>>
>> A possible solution might be that we could maybe add some kind of feedback
>> noting which operations are already complete and then only retrying the one
>> which failed.
>
> Atomic ioctl behaviour is hard. Like reeeeeeaaaaaaaaaaalllllllly hard.
>
> Look at atomic kms if you don't believe, or the v4l equivalent, and that
> doesn't even try to do cross device atomic. Also, it explicitly isn't
> atomic wrt memory management stuff (like pinning scanout buffers into
> vram), because that was too hard - we simply try to pin and then roll back
> if it happens to not work out and apologize to userspace for the mess.
>
> Except when your career plan is to spend the next few decades on
> prototyping this as an R&D project, I recommend to not try :-)
>
>> > > 2: Failure to map on some but not all GPUs. This comes down to the
>> > > question, do all ioctl APIs or system calls in general need to be
>> > > transactional? As a counter example I'd give incomplete read or write
>> > > system calls that return how much was actually read or written. Our
>> > > current implementation of map_memory_to_gpu doesn't do this, but it
>> > > could be modified to return to user mode how many of the mappings, or
>> > > which mappings specifically failed or succeeded.
>> > What should userspace do? if it only get mappings on 3 of the gpus instead
>> > of say 4? Is there a sane resolution other than calling the ioctl again with
>> > the single GPU? Would it drop the GPU from the working set from that point on?
>> >
>> > Need more info to do what can come out of the API doing incomplete
>> > operations.
>>
>> Felix argument that when a mapping operations fails the VM ranges in
>> question would have been undefined before and are undefined after that
>> operation failed as well.
>>
>> So we could just need to retry the operation until all of it succeeds, but
>> that feels kind of strange.
>
> +1 on make your gpu apis async, we have drm_syncobj/sync_file/dma_fence as
> a standard way for this now.
>
>> > > The alternative would be to break multi-GPU mappings, and the final wait
>> > > for completion, into multiple ioctl calls. That would result in
>> > > additional system call overhead. I'd argue that the end result is the
>> > > same for user mode, so I don't see why I'd use multiple ioctls over a
>> > > single one.
>> > Again stop worrying about ioctl overhead, this isn't Windows. If you
>> > can show the overhead as being a problem then address it, but I
>> > think it's premature worrying about it at this stage.
>>
>> Well you go from one IOCTL doing everything towards one IOCTL per device per
>> mapping which can be a huge difference.
>>
>> One the other hand we internally had exactly the same discussion when we
>> implemented support for partially resident textures. The result was that we
>> first implement it with individual IOCTLs and implement the mass mapping
>> IOCTL if we ever find an use case where we need it.
>>
>> So far we haven't found a use case for this mass mapping IOCTL.
>
> Aligns with my expectations/experience/planning for i915.ko stuff very
> much.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

* Re: New KFD ioctls: taking the skeletons out of the closet
From: Felix Kuehling @ 2018-03-07 20:34 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Oded Gabbay, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Maling list - DRI developers, Christian König

Thanks for the feedback. I'm answering some of your questions inline.

On 2018-03-06 06:09 PM, Dave Airlie wrote:
> On 7 March 2018 at 08:44, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> Hi all,
>>
>> Christian raised two potential issues in a recent KFD upstreaming code
>> review that are related to the KFD ioctl APIs:
>>
>>  1. behaviour of -ERESTARTSYS
>>  2. transactional nature of KFD ioctl definitions, or lack thereof
>>
>> I appreciate constructive feedback, but I also want to encourage an
>> open-minded rather than a dogmatic approach to API definitions. So let
>> me take all the skeletons out of my closet and get these APIs reviewed
>> in the appropriate forum before we commit to them upstream. See the end
>> of this email for reference.
>>
>> The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
>> any of the other APIs raise concerns or questions, please ask.
>>
>> Because of the HSA programming model, KFD memory management APIs are
>> synchronous. There is no pipelining. Command submission to GPUs through
>> user mode queues does not involve KFD. This means KFD doesn't know what
>> memory is used by the GPUs and when it's used. That means, when the
>> map_memory_to_gpu ioctl returns to user mode, all memory mapping
>> operations are complete and the memory can be used by the CPUs or GPUs
>> immediately.
> I've got a few opinions, but first up I still dislike user-mode queues
> and everything
> they entail. I still feel they are solving a Windows problem and not a
> Linux problem,
> and it would be nice if we had some Linux numbers on what they gain us over
> a dispatch ioctl, because they sure bring a lot of memory management issues.
>
> That said amdkfd is here.
>
> The first question you should ask (which you haven't asked here at all) is
> what should userspace do with the ioctl result.
>
>> HSA also uses a shared virtual memory model, so typically memory gets
>> mapped on multiple GPUs and CPUs at the same virtual address.
>>
>> The point of contention seems to be the ability to map memory to
>> multiple GPUs in a single ioctl and the behaviour in failure cases. I'll
>> discuss two main failure cases:
>>
>> 1: Failure after all mappings have been dispatched via SDMA, but a
>> signal interrupts the wait for completion and we return -ERESTARTSYS.
>> Documentation/kernel-hacking/hacking.rst only says "[...] you should be
>> prepared to process the restart, e.g. if you're in the middle of
>> manipulating some data structure." I think we do that by ensuring that
>> memory that's already mapped won't be mapped again. So the restart will
>> become a no-op and just end up waiting for all the previous mappings to
>> complete.
> -ERESTARTSYS at that late stage points to a badly synchronous API,
> I'd have said you should have two ioctls, one that returns a fence after
> starting the processes, and one that waits for the fence separately.
>
> The overhead of ioctls isn't your enemy until you've measured it and
> proven it's a problem.
>
>> Christian has a stricter requirement, and I'd like to know where that
>> comes from: "An interrupted IOCTL should never have a visible effect."
> Christian might be taking things a bit further but synchronous gpu access
> APIs are bad, but I don't think undoing a bunch of work is a good plan either
> just because you got ERESTARTSYS. If you get ERESTARTSYS can you
> handle it, if I've fired off 5 SDMAs and wait for them will I fire off 5 more?
> will I wait for the original SDMAs if I reenter?

It will wait for the original SDMAs to complete.

>
>> 2: Failure to map on some but not all GPUs. This comes down to the
>> question, do all ioctl APIs or system calls in general need to be
>> transactional? As a counter example I'd give incomplete read or write
>> system calls that return how much was actually read or written. Our
>> current implementation of map_memory_to_gpu doesn't do this, but it
>> could be modified to return to user mode how many of the mappings, or
>> which mappings specifically failed or succeeded.
> What should userspace do? if it only get mappings on 3 of the gpus instead
> of say 4? Is there a sane resolution other than calling the ioctl again with
> the single GPU? Would it drop the GPU from the working set from that point on?
>
> Need more info to do what can come out of the API doing incomplete
> operations.

There are two typical use cases where this function is used.

 1. During allocation
 2. Changing access to an existing buffer

There is no retry logic in either case. And given the likely failure
conditions, a retry doesn't really make much sense.

I think the most likely failure I've seen is a failure to validate the
BO under heavy memory pressure. This will affect the first GPU trying to
map the memory. Once it's mapped on one GPU, subsequent GPUs don't need
to validate it again, so that's less likely to fail. Maybe if we're
running out of space for the SDMA command buffers. If you're under that
much memory pressure, it's unlikely that a retry would help. Or SDMA
could be hanging, leading to a timeout. Again, a retry won't help. You'd
need a GPU reset at that point.

So I think the expected response from user mode is that it will fail the
operation and not retry. If it happens during allocation, the BO will be
released. The application will probably crash or fail gracefully,
depending on how well it's written. A really badly written application
may keep going with a NULL pointer and get a GPUVM fault later on that
will ultimately terminate the application.

>
>> The alternative would be to break multi-GPU mappings, and the final wait
>> for completion, into multiple ioctl calls. That would result in
>> additional system call overhead. I'd argue that the end result is the
>> same for user mode, so I don't see why I'd use multiple ioctls over a
>> single one.
> Again stop worrying about ioctl overhead, this isn't Windows. If you
> can show the overhead as being a problem then address it, but I
> think it's premature worrying about it at this stage.

I'd like syscall overhead to be small. But with recent kernel page table
isolation, NUMA systems and lots of GPUs, I think this may not be
negligible. For example we're working with some Intel NUMA systems and 8
GPUs for HPC or deep learning applications. I'll be measuring the
overhead on such systems and get back with results in a few days. I want
to have an API that can scale to such applications.

Regards,
  Felix


>
> Dave.


* Re: New KFD ioctls: taking the skeletons out of the closet
From: Felix Kuehling @ 2018-03-12 18:17 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Oded Gabbay, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Maling list - DRI developers, Christian König

On 2018-03-07 03:34 PM, Felix Kuehling wrote:
>> Again stop worrying about ioctl overhead, this isn't Windows. If you
>> can show the overhead as being a problem then address it, but I
>> think it's premature worrying about it at this stage.
> I'd like syscall overhead to be small. But with recent kernel page table
> isolation, NUMA systems and lots of GPUs, I think this may not be
> negligible. For example we're working with some Intel NUMA systems and 8
> GPUs for HPC or deep learning applications. I'll be measuring the
> overhead on such systems and get back with results in a few days. I want
> to have an API that can scale to such applications.

I ran some tests on a 2-socket Xeon E5-2680 v4 with 56 CPU threads and 8
Vega10 GPUs. The kernel was 4.16-rc1 based with KPTI enabled and a
kernel config based on a standard Ubuntu kernel. No debug options were
enabled. My test application measures KFD memory management API
performance for allocating, mapping, unmapping and freeing 1000 buffers
of different sizes (4K, 16K, 64K, 256K) and memory types (VRAM and
system memory). The impact of ioctl overhead depended on whether the
page table update was done by CPU or SDMA.

I averaged 10 runs of the application and also calculated the standard
deviation to see if my results were just random noise.

With SDMA using a single ioctl was about 5% faster for mapping and 10%
faster for unmapping. The standard deviation was 2.5% and 7.5% respectively.

With CPU a single ioctl was 2.5% faster for mapping, 18% faster for
unmapping. Standard deviation was 0.2% and 3% respectively.

For unmapping the difference was bigger than mapping because unmapping
is faster to begin with, so the system call overhead is bigger in
proportion. Mapping of a single buffer to 8 GPUs takes about 220us with
SDMA or 190us with CPU with only minor dependence on buffer size and
memory type. Unmapping takes about 35us with SDMA or 13us with CPU.
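
For reference, the timing itself is nothing fancy; it looks roughly like
this, with map_buffer_to_gpus() standing in for the sequence of KFD calls
being measured (a hypothetical wrapper name):

/* Sketch of the timing loop; map_buffer_to_gpus() is a hypothetical wrapper
 * around the KFD allocation/mapping calls being measured. */
#include <stdio.h>
#include <time.h>

#define NBUFS 1000

static void map_buffer_to_gpus(int buf) { (void)buf; }	/* stand-in */

int main(void)
{
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NBUFS; i++)
		map_buffer_to_gpus(i);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f us per map\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_nsec - t0.tv_nsec) / 1e3) / NBUFS);
	return 0;
}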

>
> Regards,
>   Felix
>
>


* Re: New KFD ioctls: taking the skeletons out of the closet
From: Daniel Vetter @ 2018-03-12 19:37 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Maling list - DRI developers, amd-gfx, Christian König

On Mon, Mar 12, 2018 at 7:17 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
> On 2018-03-07 03:34 PM, Felix Kuehling wrote:
>>> Again stop worrying about ioctl overhead, this isn't Windows. If you
>>> can show the overhead as being a problem then address it, but I
>>> think it's premature worrying about it at this stage.
>> I'd like syscall overhead to be small. But with recent kernel page table
>> isolation, NUMA systems and lots of GPUs, I think this may not be
>> negligible. For example we're working with some Intel NUMA systems and 8
>> GPUs for HPC or deep learning applications. I'll be measuring the
>> overhead on such systems and get back with results in a few days. I want
>> to have an API that can scale to such applications.
>
> I ran some tests on a 2-socket Xeon E5-2680 v4 with 56 CPU threads and 8
> Vega10 GPUs. The kernel was 4.16-rc1 based with KPTI enabled and a
> kernel config based on a standard Ubuntu kernel. No debug options were
> enabled. My test application measures KFD memory management API
> performance for allocating, mapping, unmapping and freeing 1000 buffers
> of different sizes (4K, 16K, 64K, 256K) and memory types (VRAM and
> system memory). The impact of ioctl overhead depended on whether the
> page table update was done by CPU or SDMA.
>
> I averaged 10 runs of the application and also calculated the standard
> deviation to see if my results were just random noise.
>
> With SDMA using a single ioctl was about 5% faster for mapping and 10%
> faster for unmapping. The standard deviation was 2.5% and 7.5% respectively.
>
> With CPU a single ioctl was 2.5% faster for mapping, 18% faster for
> unmapping. Standard deviation was 0.2% and 3% respectively.

btw for statistics Student's t-distribution is usually the measure to
tell "is this the same distribution or not". Works much more robustly
if you're dealing with odd shapes of your measured distributions,
which can happen easily (e.g. if it bifurcates into a fast vs.
slowpath or similar stuff).
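
E.g. something like this, computed straight from the per-run means and
standard deviations (Welch's form of the t statistic; the numbers are
placeholders, not your data):

/* Welch's t statistic for comparing two sets of benchmark runs. */
#include <math.h>
#include <stdio.h>

static double welch_t(double mean1, double sd1, double n1,
		      double mean2, double sd2, double n2)
{
	return (mean1 - mean2) / sqrt(sd1 * sd1 / n1 + sd2 * sd2 / n2);
}

int main(void)
{
	/* Placeholder numbers: 10 runs each of the 1-ioctl vs N-ioctl path. */
	double t = welch_t(220.0, 5.5, 10, 231.0, 5.8, 10);

	printf("t = %.2f (|t| well above ~2 suggests a real difference here)\n", t);
	return 0;
}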

Also for my understanding: This was 1 ioctl to map 1 buffer on 8 gpus
vs. 8 ioctls to map 1 buffer on 1 of the 8 gpus?

Do we have benchmarks that show overall impact? I'm assuming that your
workloads won't spend all day long mapping/unmapping stuff, but also
will do some computing :-)

Can you also give numbers without KPTI? Afaiui AMD mostly doesn't need
it, and Intel will eventually fix it too, so this overhead should
disappear again. Just want to get a full picture here.
-Daniel

> For unmapping the difference was bigger than mapping because unmapping
> is faster to begin with, so the system call overhead is bigger in
> proportion. Mapping of a single buffer to 8 GPUs takes about 220us with
> SDMA or 190us with CPU with only minor dependence on buffer size and
> memory type. Unmapping takes about 35us with SDMA or 13us with CPU.
>
>>
>> Regards,
>>   Felix
>>
>>
>



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: New KFD ioctls: taking the skeletons out of the closet
From: Felix Kuehling @ 2018-03-12 20:20 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Maling list - DRI developers,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Christian König

On 2018-03-12 03:37 PM, Daniel Vetter wrote:
> On Mon, Mar 12, 2018 at 7:17 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> On 2018-03-07 03:34 PM, Felix Kuehling wrote:
>>>> Again stop worrying about ioctl overhead, this isn't Windows. If you
>>>> can show the overhead as being a problem then address it, but I
>>>> think it's premature worrying about it at this stage.
>>> I'd like syscall overhead to be small. But with recent kernel page table
>>> isolation, NUMA systems and lots of GPUs, I think this may not be
>>> negligible. For example we're working with some Intel NUMA systems and 8
>>> GPUs for HPC or deep learning applications. I'll be measuring the
>>> overhead on such systems and get back with results in a few days. I want
>>> to have an API that can scale to such applications.
>> I ran some tests on a 2-socket Xeon E5-2680 v4 with 56 CPU threads and 8
>> Vega10 GPUs. The kernel was 4.16-rc1 based with KPTI enabled and a
>> kernel config based on a standard Ubuntu kernel. No debug options were
>> enabled. My test application measures KFD memory management API
>> performance for allocating, mapping, unmapping and freeing 1000 buffers
>> of different sizes (4K, 16K, 64K, 256K) and memory types (VRAM and
>> system memory). The impact of ioctl overhead depended on whether the
>> page table update was done by CPU or SDMA.
>>
>> I averaged 10 runs of the application and also calculated the standard
>> deviation to see if my results were just random noise.
>>
>> With SDMA using a single ioctl was about 5% faster for mapping and 10%
>> faster for unmapping. The standard deviation was 2.5% and 7.5% respectively.
>>
>> With CPU a single ioctl was 2.5% faster for mapping, 18% faster for
>> unmapping. Standard deviation was 0.2% and 3% respectively.
> btw for statistics student's t-distribution is usually the measure to
> tell "is this the same distribution or not". Works much more robustly
> if you're dealing with odd shapes of your measured distributions,
> which can happen easily (e.g. if it bifurcates into a fast vs.
> slowpath or similar stuff).
>
> Also for my understanding: This was 1 ioctl to map 1 buffer on 8 gpus
> vs. 8 ioctl to mape 1 buffer on 1 of the 8 gpus?

The task is the same in both cases: map one buffer on all 8 GPUs. In one
case it uses 9 ioctls (1 map call per GPU and 1 call to synchronize with
SDMA and flush GPU TLBs). In the other case it's 1 ioctl doing all those
things.

> Do we have benchmarks that show overall impact? I'm assuming that your
> workloads won't spend all day long mapping/unmapping stuff, but also
> will do some computing :-)

I don't. This was done with a micro benchmark. In real applications the
impact is going to be much smaller. I tested one application that I know
does a lot of memory mappings mixed in between computations (lulesh-cl
from https://github.com/AMDComputeLibraries/ComputeApps/). But it only
maps on one GPU, so the impact was minimal (maybe 1%) and probably not
statistically significant.

>
> Can you also give numbers without KPTI? Afaiui AMD mostly doesn't need
> it, and Intel will eventually fix it too, so this overhead should
> disappear again. Just want to get a full picture here.

Before I got time on the Intel system I ran less rigorous experiments on
an AMD Threadripper with KPTI off and KPTI forced on. I don't have exact
numbers from those tests. With KPTI off the ioctl overhead was not
measurable. With KPTI on it was about the same or slightly bigger than
on the Intel system.

Regards,
  Felix

> -Daniel
>
>> For unmapping the difference was bigger than mapping because unmapping
>> is faster to begin with, so the system call overhead is bigger in
>> proportion. Mapping of a single buffer to 8 GPUs takes about 220us with
>> SDMA or 190us with CPU with only minor dependence on buffer size and
>> memory type. Unmapping takes about 35us with SDMA or 13us with CPU.
>>
>>> Regards,
>>>   Felix
>>>
>>>
>
>

