* Documentation about AMD's HSA implementation?
From: Ming Yang @ 2018-02-13  5:00 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



Hi,

I'm interested in HSA and was excited to find AMD's fully open ROCm
stack supporting it. Before digging into the code, I wonder if there's
any documentation available about AMD's HSA implementation, whether a
book, whitepaper, paper, or other documentation.

I did find helpful materials about HSA in general, including the HSA
standards on this page (http://www.hsafoundation.com/standards/) and a
nice book about HSA (Heterogeneous System Architecture: A New Compute
Platform Infrastructure). But I haven't found anything yet about AMD's
implementation specifically.

Please let me know if any such documentation is publicly accessible.
If not, I'd appreciate any suggestions on how to learn the
implementation of specific system components, e.g., queue scheduling.

Best,
Mark


* Re: Documentation about AMD's HSA implementation?
From: Deucher, Alexander @ 2018-02-13 14:40 UTC
  To: Ming Yang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



The ROCm documentation is probably a good place to start:

https://rocm.github.io/documentation.html

Alex

________________________________
From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org> on behalf of Ming Yang <minos.future-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sent: Tuesday, February 13, 2018 12:00 AM
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Documentation about AMD's HSA implementation?

Hi,

I'm interested in HSA and was excited to find AMD's fully open ROCm stack supporting it. Before digging into the code, I wonder if there's any documentation available about AMD's HSA implementation, whether a book, whitepaper, paper, or other documentation.

I did find helpful materials about HSA in general, including the HSA standards on this page (http://www.hsafoundation.com/standards/) and a nice book about HSA (Heterogeneous System Architecture: A New Compute Platform Infrastructure). But I haven't found anything yet about AMD's implementation specifically.

Please let me know if any such documentation is publicly accessible. If not, I'd appreciate any suggestions on how to learn the implementation of specific system components, e.g., queue scheduling.

Best,
Mark


* Re: Documentation about AMD's HSA implementation?
From: Felix Kuehling @ 2018-02-13 19:56 UTC
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

There is also this: https://gpuopen.com/professional-compute/, which
gives pointers to several libraries and tools built on top of ROCm.


Another thing to keep in mind is that ROCm diverges from the strict
HSA standard in some important ways. For example, the HSA standard
includes HSAIL as an intermediate representation that gets finalized on
the target system, whereas ROCm compiles directly to native GPU ISA.


Regards,
  Felix


On 2018-02-13 09:40 AM, Deucher, Alexander wrote:
>
> The ROCm documentation is probably a good place to start:
>
> https://rocm.github.io/documentation.html
>
>
> Alex
>
> ------------------------------------------------------------------------
> *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
> Ming Yang <minos.future@gmail.com>
> *Sent:* Tuesday, February 13, 2018 12:00 AM
> *To:* amd-gfx@lists.freedesktop.org
> *Subject:* Documentation about AMD's HSA implementation?
>  
> Hi,
>
> I'm interested in HSA and was excited to find AMD's fully open ROCm
> stack supporting it. Before digging into the code, I wonder if there's
> any documentation available about AMD's HSA implementation, whether a
> book, whitepaper, paper, or other documentation.
>
> I did find helpful materials about HSA in general, including the HSA
> standards on this page (http://www.hsafoundation.com/standards/) and a
> nice book about HSA (Heterogeneous System Architecture: A New Compute
> Platform Infrastructure). But I haven't found anything yet about AMD's
> implementation specifically.
>
> Please let me know if any such documentation is publicly accessible.
> If not, I'd appreciate any suggestions on how to learn the
> implementation of specific system components, e.g., queue scheduling.
>
> Best,
> Mark
>
>

* Re: Documentation about AMD's HSA implementation?
From: Panariti, David @ 2018-02-13 21:03 UTC
  To: Kuehling, Felix, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Roberts, David

+ Dave Roberts

Do you still have links to the HSA docs collected during NMI?

________________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Felix Kuehling <felix.kuehling@amd.com>
Sent: Tuesday, February 13, 2018 2:56:47 PM
To: amd-gfx@lists.freedesktop.org
Subject: Re: Documentation about AMD's HSA implementation?

There is also this: https://gpuopen.com/professional-compute/, which
gives pointers to several libraries and tools built on top of ROCm.


Another thing to keep in mind is that ROCm diverges from the strict
HSA standard in some important ways. For example, the HSA standard
includes HSAIL as an intermediate representation that gets finalized on
the target system, whereas ROCm compiles directly to native GPU ISA.


Regards,
  Felix


On 2018-02-13 09:40 AM, Deucher, Alexander wrote:
>
> The ROCm documentation is probably a good place to start:
>
> https://rocm.github.io/documentation.html
>
>
> Alex
>
> ------------------------------------------------------------------------
> *From:* amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
> Ming Yang <minos.future@gmail.com>
> *Sent:* Tuesday, February 13, 2018 12:00 AM
> *To:* amd-gfx@lists.freedesktop.org
> *Subject:* Documentation about AMD's HSA implementation?
>
> Hi,
>
> I'm interested in HSA and was excited to find AMD's fully open ROCm
> stack supporting it. Before digging into the code, I wonder if there's
> any documentation available about AMD's HSA implementation, whether a
> book, whitepaper, paper, or other documentation.
>
> I did find helpful materials about HSA in general, including the HSA
> standards on this page (http://www.hsafoundation.com/standards/) and a
> nice book about HSA (Heterogeneous System Architecture: A New Compute
> Platform Infrastructure). But I haven't found anything yet about AMD's
> implementation specifically.
>
> Please let me know if any such documentation is publicly accessible.
> If not, I'd appreciate any suggestions on how to learn the
> implementation of specific system components, e.g., queue scheduling.
>
> Best,
> Mark
>
>

* Re: Documentation about AMD's HSA implementation?
From: Ming Yang @ 2018-02-13 21:06 UTC
  To: Deucher, Alexander, Felix Kuehling
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Thanks for the suggestions!  If that's okay, I'd like to ask several
specific questions to give myself a quick start, as I can't find the
answers in those documents.  Pointing me to the files/functions would
be good enough, and any explanations are appreciated.  My goal is to
experiment with different scheduling policies, with real-time
constraints and predictability in mind.

- Where/How is the packet scheduler implemented?  How are packets from
multiple queues scheduled?  What about scheduling packets from queues
in different address spaces?

- I noticed the newly added support for multi-process concurrency in
the archive of this mailing list.  Could you point me to the code that
implements this?

- A related question -- where/how is the preemption/context
switch between packets/queues implemented?

Thanks in advance!

Best,
Mark

> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
> There is also this: https://gpuopen.com/professional-compute/, which
> gives pointers to several libraries and tools built on top of ROCm.
>
> Another thing to keep in mind is that ROCm diverges from the strict
> HSA standard in some important ways. For example, the HSA standard
> includes HSAIL as an intermediate representation that gets finalized on
> the target system, whereas ROCm compiles directly to native GPU ISA.
>
> Regards,
>   Felix
>
> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher@amd.com> wrote:
> > The ROCm documentation is probably a good place to start:
> >
> > https://rocm.github.io/documentation.html
> >
> >
> > Alex
> >
> > ________________________________
> > From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Ming Yang
> > <minos.future@gmail.com>
> > Sent: Tuesday, February 13, 2018 12:00 AM
> > To: amd-gfx@lists.freedesktop.org
> > Subject: Documentation about AMD's HSA implementation?
> >
> > Hi,
> >
> > I'm interested in HSA and was excited to find AMD's fully open ROCm
> > stack supporting it. Before digging into the code, I wonder if there's
> > any documentation available about AMD's HSA implementation, whether a
> > book, whitepaper, paper, or other documentation.
> >
> > I did find helpful materials about HSA in general, including the HSA
> > standards on this page (http://www.hsafoundation.com/standards/) and a
> > nice book about HSA (Heterogeneous System Architecture: A New Compute
> > Platform Infrastructure). But I haven't found anything yet about AMD's
> > implementation specifically.
> >
> > Please let me know if any such documentation is publicly accessible.
> > If not, I'd appreciate any suggestions on how to learn the
> > implementation of specific system components, e.g., queue scheduling.
> >
> > Best,
> > Mark

* Re: Documentation about AMD's HSA implementation?
From: Felix Kuehling @ 2018-02-13 21:17 UTC
  To: Ming Yang, Deucher, Alexander; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 2018-02-13 04:06 PM, Ming Yang wrote:
> Thanks for the suggestions!  If that's okay, I'd like to ask several
> specific questions to give myself a quick start, as I can't find the
> answers in those documents.  Pointing me to the files/functions would
> be good enough, and any explanations are appreciated.  My goal is to
> experiment with different scheduling policies, with real-time
> constraints and predictability in mind.
>
> - Where/How is the packet scheduler implemented?  How are packets from
> multiple queues scheduled?  What about scheduling packets from queues
> in different address spaces?

This is done mostly in firmware. The CP engine supports up to 32 queues.
We share those between KFD and AMDGPU. KFD gets 24 queues to use.
Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
micro engine. Within each pipe the queues are time-multiplexed.

If we need more than 24 queues, or if we have more than 8 processes, the
hardware scheduler (HWS) adds another layer of scheduling, basically
round-robin between batches of 24 queues or 8 processes. Once you get
into such an over-subscribed scenario, your performance and GPU
utilization can suffer quite badly.

>
> - I noticed the newly added support for multi-process concurrency in
> the archive of this mailing list.  Could you point me to the code that
> implements this?

That's basically just a switch that tells the firmware that it is
allowed to schedule queues from different processes at the same time.
The upper limit is the number of VMIDs that HWS can work with. It needs
to assign a unique VMID to each process (each VMID representing a
separate address space, page table, etc.). If there are more processes
than VMIDs, the HWS has to time-multiplex.
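
To make the resource accounting concrete, here is a rough sketch in C
(the constants are illustrative only -- the real limits are per-ASIC and
live in the firmware and driver code):

    /* Hypothetical sketch, not actual KFD code. */
    #define CP_HW_QUEUES   32  /* HW queue slots in the compute CP */
    #define KFD_HW_QUEUES  24  /* KFD's share: 6 queues x 4 pipes  */
    #define KFD_VMIDS       8  /* VMIDs usable by KFD processes    */

    /* The HWS has to start time-multiplexing as soon as either
     * resource is oversubscribed. */
    static int hws_must_multiplex(int num_queues, int num_processes)
    {
        return num_queues > KFD_HW_QUEUES || num_processes > KFD_VMIDS;
    }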

>
> - A related question -- where/how is the preemption/context
> switch between packets/queues implemented?

As long as you don't oversubscribe the available VMIDs, there is no real
context switching. Everything can run concurrently. When you start
oversubscribing HW queues or VMIDs, the HWS firmware will start
multiplexing. This is all handled inside the firmware and is quite
transparent even to KFD.

KFD interacts with the HWS firmware through the HIQ (HSA interface
queue). It supports packets for unmapping queues, and we can send it a
new runlist (basically a bunch of map-process and map-queue packets).
The interesting files to look at are kfd_packet_manager.c,
kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
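
To give a feel for what such a runlist looks like, here is a much
simplified sketch (the packet shapes and opcodes are invented for
illustration -- the real PM4 layouts are in the kfd_pm4_headers files):

    #include <stdint.h>
    #include <stddef.h>

    struct kfd_queue   { uint32_t doorbell_id; uint64_t mqd_gpu_addr; };
    struct kfd_process { uint32_t vmid; int num_queues; struct kfd_queue *queues; };

    /* Build a runlist: one map-process packet per process, followed by
     * one map-queue packet per queue of that process.  KFD hands the
     * finished buffer to the HWS through the HIQ. */
    static size_t build_runlist(uint32_t *ib, const struct kfd_process *procs,
                                int num_procs)
    {
        size_t dw = 0;

        for (int p = 0; p < num_procs; p++) {
            ib[dw++] = 0x01;                /* pretend MAP_PROCESS opcode */
            ib[dw++] = procs[p].vmid;
            for (int q = 0; q < procs[p].num_queues; q++) {
                ib[dw++] = 0x02;            /* pretend MAP_QUEUES opcode */
                ib[dw++] = procs[p].queues[q].doorbell_id;
                ib[dw++] = (uint32_t)procs[p].queues[q].mqd_gpu_addr;
                ib[dw++] = (uint32_t)(procs[p].queues[q].mqd_gpu_addr >> 32);
            }
        }
        return dw;                          /* dwords written */
    }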

Regards,
  Felix

>
> Thanks in advance!
>
> Best,
> Mark
>
>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> There is also this: https://gpuopen.com/professional-compute/, which
>> gives pointers to several libraries and tools built on top of ROCm.
>>
>> Another thing to keep in mind is that ROCm diverges from the strict
>> HSA standard in some important ways. For example, the HSA standard
>> includes HSAIL as an intermediate representation that gets finalized on
>> the target system, whereas ROCm compiles directly to native GPU ISA.
>>
>> Regards,
>>   Felix
>>
>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher@amd.com> wrote:
>>> The ROCm documentation is probably a good place to start:
>>>
>>> https://rocm.github.io/documentation.html
>>>
>>>
>>> Alex
>>>
>>> ________________________________
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Ming Yang
>>> <minos.future@gmail.com>
>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>> To: amd-gfx@lists.freedesktop.org
>>> Subject: Documentation about AMD's HSA implementation?
>>>
>>> Hi,
>>>
>>> I'm interested in HSA and was excited to find AMD's fully open ROCm
>>> stack supporting it. Before digging into the code, I wonder if there's any
>>> documentation available about AMD's HSA implementation, whether a book,
>>> whitepaper, paper, or other documentation.
>>>
>>> I did find helpful materials about HSA in general, including the HSA
>>> standards on this page (http://www.hsafoundation.com/standards/) and a
>>> nice book about HSA (Heterogeneous System Architecture: A New Compute
>>> Platform Infrastructure). But I haven't found anything yet about AMD's
>>> implementation specifically.
>>>
>>> Please let me know if any such documentation is publicly accessible. If
>>> not, I'd appreciate any suggestions on how to learn the implementation
>>> of specific system components, e.g., queue scheduling.
>>>
>>> Best,
>>> Mark


* Re: Documentation about AMD's HSA implementation?
From: Ming Yang @ 2018-02-13 21:58 UTC
  To: Felix Kuehling
  Cc: Deucher, Alexander, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

That's very helpful, thanks!

On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
> On 2018-02-13 04:06 PM, Ming Yang wrote:
>> Thanks for the suggestions!  If that's okay, I'd like to ask several
>> specific questions to give myself a quick start, as I can't find the
>> answers in those documents.  Pointing me to the files/functions would
>> be good enough, and any explanations are appreciated.  My goal is to
>> experiment with different scheduling policies, with real-time
>> constraints and predictability in mind.
>>
>> - Where/How is the packet scheduler implemented?  How are packets from
>> multiple queues scheduled?  What about scheduling packets from queues
>> in different address spaces?
>
> This is done mostly in firmware. The CP engine supports up to 32 queues.
> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
> micro engine. Within each pipe the queues are time-multiplexed.

Please correct me if I'm wrong.  Is the CP a compute processor, like
the Execution Engine in an NVIDIA GPU?  Is a pipe like a wavefront
(warp) scheduler, multiplexing queues in order to hide memory latency?

>
> If we need more than 24 queues, or if we have more than 8 processes, the
> hardware scheduler (HWS) adds another layer of scheduling, basically
> round-robin between batches of 24 queues or 8 processes. Once you get
> into such an over-subscribed scenario, your performance and GPU
> utilization can suffer quite badly.

Is the HWS also implemented in the closed-source firmware?

>
>>
>> - I noticed the newly added support for multi-process concurrency in
>> the archive of this mailing list.  Could you point me to the code that
>> implements this?
>
> That's basically just a switch that tells the firmware that it is
> allowed to schedule queues from different processes at the same time.
> The upper limit is the number of VMIDs that HWS can work with. It needs
> to assign a unique VMID to each process (each VMID representing a
> separate address space, page table, etc.). If there are more processes
> than VMIDs, the HWS has to time-multiplex.

Does the HWS dispatch packets in the order they become the head of the
queue, i.e., as pointed to by the read_index? In that case it would be
FIFO.  Or is it round-robin between queues? You mentioned round-robin
over batches in the over-subscribed scenario.

This might not be a big deal for performance, but it matters for
predictability and real-time analysis.

>
>>
>> - A related question -- where/how is the preemption/context
>> switch between packets/queues implemented?
>
> As long as you don't oversubscribe the available VMIDs, there is no real
> context switching. Everything can run concurrently. When you start
> oversubscribing HW queues or VMIDs, the HWS firmware will start
> multiplexing. This is all handled inside the firmware and is quite
> transparent even to KFD.

I see.  So preemption, at least in AMD's implementation, does not
switch out the executing kernel, but just lets new kernels run
concurrently with the existing ones.  This means performance degrades
when too many workloads are submitted.  The running kernels leave the
GPU only when they are done.

Is there any reason for not preempting/switching out the existing
kernel, besides context-switch overheads?  NVIDIA does not provide
this option either.  Non-preemption hurts real-time properties by
allowing priority inversion.  I understand preemption should not be
used heavily, but having such an option may help a lot for real-time
systems.

>
> KFD interacts with the HWS firmware through the HIQ (HSA interface
> queue). It supports packets for unmapping queues, and we can send it a
> new runlist (basically a bunch of map-process and map-queue packets).
> The interesting files to look at are kfd_packet_manager.c,
> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>

So if we want to implement a different scheduling policy, we should
control the submission of packets to the queues in the runtime/KFD,
before they reach the firmware, because they are out of our control
once they are submitted to the HWS in the firmware.

Best,
Mark

> Regards,
>   Felix
>
>>
>> Thanks in advance!
>>
>> Best,
>> Mark
>>
>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
>>> There is also this: https://gpuopen.com/professional-compute/, which
>>> gives pointers to several libraries and tools built on top of ROCm.
>>>
>>> Another thing to keep in mind is that ROCm diverges from the strict
>>> HSA standard in some important ways. For example, the HSA standard
>>> includes HSAIL as an intermediate representation that gets finalized on
>>> the target system, whereas ROCm compiles directly to native GPU ISA.
>>>
>>> Regards,
>>>   Felix
>>>
>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher@amd.com> wrote:
>>>> The ROCm documentation is probably a good place to start:
>>>>
>>>> https://rocm.github.io/documentation.html
>>>>
>>>>
>>>> Alex
>>>>
>>>> ________________________________
>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Ming Yang
>>>> <minos.future@gmail.com>
>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>> To: amd-gfx@lists.freedesktop.org
>>>> Subject: Documentation about AMD's HSA implementation?
>>>>
>>>> Hi,
>>>>
>>>> I'm interested in HSA and was excited to find AMD's fully open ROCm
>>>> stack supporting it. Before digging into the code, I wonder if there's any
>>>> documentation available about AMD's HSA implementation, whether a book,
>>>> whitepaper, paper, or other documentation.
>>>>
>>>> I did find helpful materials about HSA in general, including the HSA
>>>> standards on this page (http://www.hsafoundation.com/standards/) and a
>>>> nice book about HSA (Heterogeneous System Architecture: A New Compute
>>>> Platform Infrastructure). But I haven't found anything yet about AMD's
>>>> implementation specifically.
>>>>
>>>> Please let me know if any such documentation is publicly accessible. If
>>>> not, I'd appreciate any suggestions on how to learn the implementation
>>>> of specific system components, e.g., queue scheduling.
>>>>
>>>> Best,
>>>> Mark
>

* Re: Documentation about AMD's HSA implementation?
From: Felix Kuehling @ 2018-02-13 22:31 UTC
  To: Ming Yang; +Cc: Deucher, Alexander, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 2018-02-13 04:58 PM, Ming Yang wrote:
> That's very helpful, thanks!
>
> On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>> Thanks for the suggestions!  If that's okay, I'd like to ask several
>>> specific questions to give myself a quick start, as I can't find the
>>> answers in those documents.  Pointing me to the files/functions would
>>> be good enough, and any explanations are appreciated.  My goal is to
>>> experiment with different scheduling policies, with real-time
>>> constraints and predictability in mind.
>>>
>>> - Where/How is the packet scheduler implemented?  How are packets from
>>> multiple queues scheduled?  What about scheduling packets from queues
>>> in different address spaces?
>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> micro engine. Within each pipe the queues are time-multiplexed.
> Please correct me if I'm wrong.  Is the CP a compute processor, like
> the Execution Engine in an NVIDIA GPU?  Is a pipe like a wavefront
> (warp) scheduler, multiplexing queues in order to hide memory latency?

CP stands for "command processor". There are multiple CP micro engines:
CPG for graphics, CPC for compute. This is not related to warps or
wavefronts. Those execute on the CUs (compute units). There are many
CUs, but only one CPC. The scheduling or dispatching of wavefronts to
CUs is yet another level of scheduling that I didn't talk about.
Applications submit AQL dispatch packets to user mode queues. The CP
processes those dispatch packets and schedules the resulting wavefronts
on CUs.
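
To connect this to the user mode side: with the HSA runtime, submitting
a dispatch looks roughly like the sketch below (condensed; queue and
kernel setup, queue-full checks and proper header fence bits are all
omitted):

    #include <hsa/hsa.h>
    #include <string.h>

    static void submit_dispatch(hsa_queue_t *queue, uint64_t kernel_object,
                                void *kernargs, hsa_signal_t completion)
    {
        /* Reserve a slot in the user mode queue's ring buffer. */
        uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
        hsa_kernel_dispatch_packet_t *pkt =
            (hsa_kernel_dispatch_packet_t *)queue->base_address +
            (index % queue->size);

        memset(pkt, 0, sizeof(*pkt));
        pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
        pkt->workgroup_size_x = 64;
        pkt->workgroup_size_y = 1;
        pkt->workgroup_size_z = 1;
        pkt->grid_size_x = 1024;
        pkt->grid_size_y = 1;
        pkt->grid_size_z = 1;
        pkt->kernel_object = kernel_object;
        pkt->kernarg_address = kernargs;
        pkt->completion_signal = completion;

        /* Publish the packet type last, then ring the doorbell so the
         * CP picks the packet up. */
        __atomic_store_n(&pkt->header,
                         (uint16_t)(HSA_PACKET_TYPE_KERNEL_DISPATCH <<
                                    HSA_PACKET_HEADER_TYPE),
                         __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(queue->doorbell_signal, index);
    }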

>
>> If we need more than 24 queues, or if we have more than 8 processes, the
>> hardware scheduler (HWS) adds another layer of scheduling, basically
>> round-robin between batches of 24 queues or 8 processes. Once you get
>> into such an over-subscribed scenario, your performance and GPU
>> utilization can suffer quite badly.
> Is the HWS also implemented in the closed-source firmware?

Yes.

>
>>> - I noticed the newly added support for multi-process concurrency in
>>> the archive of this mailing list.  Could you point me to the code that
>>> implements this?
>> That's basically just a switch that tells the firmware that it is
>> allowed to schedule queues from different processes at the same time.
>> The upper limit is the number of VMIDs that HWS can work with. It needs
>> to assign a unique VMID to each process (each VMID representing a
>> separate address space, page table, etc.). If there are more processes
>> than VMIDs, the HWS has to time-multiplex.
> Does the HWS dispatch packets in the order they become the head of the
> queue, i.e., as pointed to by the read_index? In that case it would be
> FIFO.  Or is it round-robin between queues? You mentioned round-robin
> over batches in the over-subscribed scenario.

Commands within a queue are handled in FIFO order. Commands on different
queues are not ordered with respect to each other.

When I talk about round robin of batches of queues, it means that the
pipes of the CP are executing up to 24 user mode queues. After the time
slice is up, the HWS preempts those queues, and loads another batch of
queues to the CP pipes. This goes on until all the queues in the runlist
have had some time on the GPU. Then the whole process starts over.

>
> This might not be a big deal for performance, but it matters for
> predictability and real-time analysis.
>
>>> - A related question -- where/how is the preemption/context
>>> switch between packets/queues implemented?
>> As long as you don't oversubscribe the available VMIDs, there is no real
>> context switching. Everything can run concurrently. When you start
>> oversubscribing HW queues or VMIDs, the HWS firmware will start
>> multiplexing. This is all handled inside the firmware and is quite
>> transparent even to KFD.
> I see.  So preemption, at least in AMD's implementation, does not
> switch out the executing kernel, but just lets new kernels run
> concurrently with the existing ones.  This means performance degrades
> when too many workloads are submitted.  The running kernels leave the
> GPU only when they are done.

No, that's not what I meant. As long as nothing is oversubscribed, you
don't have preemptions. But as soon as hardware queues or VMIDs are
oversubscribed, the HWS will need to preempt queues in order to let
other queues have some time on the hardware. Preempting a queue includes
preempting all the wavefronts that were dispatched by that queue. The
state of all the CUs is saved and later restored. We call this CWSR
(compute wave save/restore).
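
Purely as illustration, the firmware behavior amounts to something like
this pseudocode (all names here are invented -- the real logic is closed
microcode):

    /* One HWS scheduling interval, illustrative pseudocode only. */
    void hws_interval(struct runlist *rl)
    {
        struct queue_set *cur = current_set(rl);

        if (!oversubscribed(rl))
            return;                /* everything runs concurrently */

        unmap_queues(cur);         /* stop fetching new packets    */
        cwsr_save_waves(cur);      /* save in-flight CU state      */
        cur = next_set(rl);        /* round-robin to the next set  */
        cwsr_restore_waves(cur);
        map_queues(cur);           /* resume on the CP pipes       */
    }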

>
> Is there any reason for not preempting/switching out the existing
> kernel, besides context-switch overheads?  NVIDIA does not provide
> this option either.  Non-preemption hurts real-time properties by
> allowing priority inversion.  I understand preemption should not be
> used heavily, but having such an option may help a lot for real-time
> systems.
>
>> KFD interacts with the HWS firmware through the HIQ (HSA interface
>> queue). It supports packets for unmapping queues, and we can send it
>> a new runlist (basically a bunch of map-process and map-queue
>> packets). The interesting files to look at are kfd_packet_manager.c,
>> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>>
> So if we want to implement a different scheduling policy, we should
> control the submission of packets to the queues in the runtime/KFD,
> before they reach the firmware, because they are out of our control
> once they are submitted to the HWS in the firmware.

Right. If you need more control over the scheduling, there is an option
to disable the HWS. This is currently more of a debugging option and
comes with some side effects. For example, we currently only support
CWSR when the HWS is enabled, and disabling the HWS also disables queue
and VMID oversubscription. If you create too many queues or processes,
queue or KFD process creation just fails.

If you can control what's going on in user mode, you can achieve the
same by just not creating too many processes and queues.
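
For reference, the switch for this is the kfd sched_policy module
parameter (see kfd_module.c and kfd_priv.h in your tree; the values
below are from memory, so double-check them there):

    # 0: HWS enabled, with queue/VMID oversubscription (default)
    # 1: HWS enabled, no oversubscription
    # 2: HWS disabled (the debug mode described above)
    $ sudo modprobe amdkfd sched_policy=2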

Regards,
  Felix

>
> Best,
> Mark
>
>> Regards,
>>   Felix
>>
>>> Thanks in advance!
>>>
>>> Best,
>>> Mark
>>>
>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>> There is also this: https://gpuopen.com/professional-compute/, which
>>>> gives pointers to several libraries and tools built on top of ROCm.
>>>>
>>>> Another thing to keep in mind is that ROCm diverges from the strict
>>>> HSA standard in some important ways. For example, the HSA standard
>>>> includes HSAIL as an intermediate representation that gets finalized on
>>>> the target system, whereas ROCm compiles directly to native GPU ISA.
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher@amd.com> wrote:
>>>>> The ROCm documentation is probably a good place to start:
>>>>>
>>>>> https://rocm.github.io/documentation.html
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>> ________________________________
>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Ming Yang
>>>>> <minos.future@gmail.com>
>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>>> To: amd-gfx@lists.freedesktop.org
>>>>> Subject: Documentation about AMD's HSA implementation?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm interested in HSA and was excited to find AMD's fully open ROCm
>>>>> stack supporting it. Before digging into the code, I wonder if there's any
>>>>> documentation available about AMD's HSA implementation, whether a book,
>>>>> whitepaper, paper, or other documentation.
>>>>>
>>>>> I did find helpful materials about HSA in general, including the HSA
>>>>> standards on this page (http://www.hsafoundation.com/standards/) and a
>>>>> nice book about HSA (Heterogeneous System Architecture: A New Compute
>>>>> Platform Infrastructure). But I haven't found anything yet about AMD's
>>>>> implementation specifically.
>>>>>
>>>>> Please let me know if any such documentation is publicly accessible. If
>>>>> not, I'd appreciate any suggestions on how to learn the implementation
>>>>> of specific system components, e.g., queue scheduling.
>>>>>
>>>>> Best,
>>>>> Mark


* RE: Documentation about AMD's HSA implementation?
From: Bridgman, John @ 2018-02-13 23:42 UTC
  To: Ming Yang, Kuehling, Felix
  Cc: Deucher, Alexander, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



>-----Original Message-----
>From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of
>Ming Yang
>Sent: Tuesday, February 13, 2018 4:59 PM
>To: Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>Subject: Re: Documentation about AMD's HSA implementation?
>
>That's very helpful, thanks!
>
>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling@amd.com>
>wrote:
>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>> Thanks for the suggestions!  If that's okay, I'd like to ask several
>>> specific questions to give myself a quick start, as I can't find the
>>> answers in those documents.  Pointing me to the files/functions would
>>> be good enough, and any explanations are appreciated.  My goal is to
>>> experiment with different scheduling policies, with real-time
>>> constraints and predictability in mind.
>>>
>>> - Where/How is the packet scheduler implemented?  How are packets
>>> from multiple queues scheduled?  What about scheduling packets from
>>> queues in different address spaces?
>>
>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> micro engine. Within each pipe the queues are time-multiplexed.
>
>Please correct me if I'm wrong.  Is the CP a compute processor, like the
>Execution Engine in an NVIDIA GPU?  Is a pipe like a wavefront (warp)
>scheduler, multiplexing queues in order to hide memory latency?

CP is one step back from that - it's a "command processor" which reads command packets from driver (PM4 format) or application (AQL format) then manages the execution of each command on the GPU. A typical packet might be "dispatch", which initiates a compute operation on an N-dimensional array, or "draw" which initiates the rendering of an array of triangles. Those compute and render commands then generate a (typically) large number of wavefronts which are multiplexed on the shader core (by SQ IIRC). Most of our recent GPUs have one micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".
>
>>
>> If we need more than 24 queues, or if we have more than 8 processes,
>> the hardware scheduler (HWS) adds another layer of scheduling,
>> basically round-robin between batches of 24 queues or 8 processes.
>> Once you get into such an over-subscribed scenario, your performance
>> and GPU utilization can suffer quite badly.
>
>Is the HWS also implemented in the closed-source firmware?

Correct - HWS is implemented in the MEC microcode. We also include a simple SW scheduler in the open source driver code, however. 
>
>>
>>>
>>> - I noticed the newly added support for multi-process concurrency in
>>> the archive of this mailing list.  Could you point me to the code that
>>> implements this?
>>
>> That's basically just a switch that tells the firmware that it is
>> allowed to schedule queues from different processes at the same time.
>> The upper limit is the number of VMIDs that HWS can work with. It
>> needs to assign a unique VMID to each process (each VMID representing
>> a separate address space, page table, etc.). If there are more
>> processes than VMIDs, the HWS has to time-multiplex.
>
>Does the HWS dispatch packets in the order they become the head of the
>queue, i.e., as pointed to by the read_index? In that case it would be
>FIFO.  Or is it round-robin between queues? You mentioned round-robin
>over batches in the over-subscribed scenario.

Round robin between sets of queues. The HWS logic generates sets as follows:

1. "set resources" packet from driver tells scheduler how many VMIDs and HW queues it can use

2. "runlist" packet from driver provides list of processes and list of queues for each process

3. if multi-process switch not set, HWS schedules as many queues from the first process in the runlist as it has HW queues (see #1)

4. at the end of process quantum (set by driver) either switch to next process (if all queues from first process have been scheduled) or schedule next set of queues from the same process

5. when all queues from all processes have been scheduled and run for a process quantum, go back to the start of the runlist and repeat

If the multi-process switch is set, and the number of queues for a process is less than the number of HW queues available, then in step #3 above HWS will start scheduling queues for additional processes, using a different VMID for each process, and continue until it either runs out of VMIDs or HW queues (or reaches the end of the runlist). All of the queues and processes would then run together for a process quantum before switching to the next queue set.
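
In rough pseudocode, the set generation described above looks like this
(illustrative only -- the names are invented and the real logic lives in
the MEC microcode):

    /* Build one set of queues to run for a process quantum.
     * Simplified: ignores resuming part-way through a process. */
    int build_queue_set(struct runlist *rl, int pos, int multi_process,
                        int hw_queues, int vmids)
    {
        int queues_used = 0, vmids_used = 0;

        while (pos < rl->num_processes &&
               vmids_used < vmids && queues_used < hw_queues) {
            struct rl_process *p = &rl->processes[pos];

            assign_vmid(p, vmids_used++);
            queues_used += map_queues_of(p, hw_queues - queues_used);

            pos++;
            if (!multi_process)
                break;          /* one process per set */
        }
        return pos;     /* next set starts here; wrap to 0 at the end */
    }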

>
>This might not be a big deal for performance, but it matters for predictability
>and real-time analysis.

Agreed. In general you would not want to overcommit either VMIDs or HW queues in a real-time scenario, and for hard real time you would probably want to limit to a single queue per pipe since the MEC also multiplexes between HW queues on a pipe even without HWS. 

>
>>
>>>
>>> - A related question -- where/how is the preemption/context
>>> switch between packets/queues implemented?
>>
>> As long as you don't oversubscribe the available VMIDs, there is no
>> real context switching. Everything can run concurrently. When you
>> start oversubscribing HW queues or VMIDs, the HWS firmware will start
>> multiplexing. This is all handled inside the firmware and is quite
>> transparent even to KFD.
>
>I see.  So preemption, at least in AMD's implementation, does not switch
>out the executing kernel, but just lets new kernels run concurrently with
>the existing ones.  This means performance degrades when too many
>workloads are submitted.  The running kernels leave the GPU only when
>they are done.

Both - you can have multiple kernels executing concurrently (each generating multiple threads in the shader core) AND switch out the currently executing set of kernels via preemption. 

>
>Is there any reason for not preempting/switching out the existing kernel,
>besides context-switch overheads?  NVIDIA does not provide this option
>either.  Non-preemption hurts real-time properties by allowing priority
>inversion.  I understand preemption should not be used heavily, but having
>such an option may help a lot for real-time systems.

If I understand you correctly, you can have it either way depending on the number of queues you enable simultaneously. At any given time you are typically only going to be running the kernels from one queue on each pipe, ie with 3 pipes and 24 queues you would typically only be running 3 kernels at a time. This seemed like a good compromise between scalability and efficiency. 

>
>>
>> KFD interacts with the HWS firmware through the HIQ (HSA interface
>> queue). It supports packets for unmapping queues, and we can send it
>> a new runlist (basically a bunch of map-process and map-queue
>> packets). The interesting files to look at are kfd_packet_manager.c,
>> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>>
>
>So if we want to implement a different scheduling policy, we should
>control the submission of packets to the queues in the runtime/KFD,
>before they reach the firmware, because they are out of our control
>once they are submitted to the HWS in the firmware.

Correct - there is a tradeoff between "easily scheduling lots of work" and fine-grained control. Limiting the number of queues you run simultaneously is another way of taking back control. 

You're probably past this, but you might find the original introduction to KFD useful in some way:

https://lwn.net/Articles/605153/

>
>Best,
>Mark
>
>> Regards,
>>   Felix
>>
>>>
>>> Thanks in advance!
>>>
>>> Best,
>>> Mark
>>>
>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling@amd.com>
>wrote:
>>>> There is also this: https://gpuopen.com/professional-compute/, which
>>>> gives pointers to several libraries and tools built on top of ROCm.
>>>>
>>>> Another thing to keep in mind is that ROCm diverges from the strict
>>>> HSA standard in some important ways. For example, the HSA standard
>>>> includes HSAIL as an intermediate representation that gets finalized
>>>> on the target system, whereas ROCm compiles directly to native GPU
>>>> ISA.
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander
><Alexander.Deucher@amd.com> wrote:
>>>>> The ROCm documentation is probably a good place to start:
>>>>>
>>>>> https://rocm.github.io/documentation.html
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>> ________________________________
>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
>>>>> Ming Yang <minos.future@gmail.com>
>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>>> To: amd-gfx@lists.freedesktop.org
>>>>> Subject: Documentation about AMD's HSA implementation?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm interested in HSA and was excited to find AMD's fully open
>>>>> ROCm stack supporting it. Before digging into the code, I wonder
>>>>> if there's any documentation available about AMD's HSA
>>>>> implementation, whether a book, whitepaper, paper, or other
>>>>> documentation.
>>>>>
>>>>> I did find helpful materials about HSA in general, including the
>>>>> HSA standards on this page
>>>>> (http://www.hsafoundation.com/standards/) and a nice book about
>>>>> HSA (Heterogeneous System Architecture: A New Compute Platform
>>>>> Infrastructure). But I haven't found anything yet about AMD's
>>>>> implementation specifically.
>>>>>
>>>>> Please let me know if any such documentation is publicly
>>>>> accessible. If not, I'd appreciate any suggestions on how to
>>>>> learn the implementation of specific system components, e.g.,
>>>>> queue scheduling.
>>>>>
>>>>> Best,
>>>>> Mark
>>

* RE: Documentation about AMD's HSA implementation?
From: Bridgman, John @ 2018-02-13 23:45 UTC
  To: Bridgman, John, Ming Yang, Kuehling, Felix
  Cc: Deucher, Alexander, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


>-----Original Message-----
>From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of
>Bridgman, John
>Sent: Tuesday, February 13, 2018 6:42 PM
>To: Ming Yang; Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>Subject: RE: Documentation about AMD's HSA implementation?
>
>
>
>>-----Original Message-----
>>From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf
>>Of Ming Yang
>>Sent: Tuesday, February 13, 2018 4:59 PM
>>To: Kuehling, Felix
>>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>That's very helpful, thanks!
>>
>>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling
>><felix.kuehling@amd.com>
>>wrote:
>>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>>> Thanks for the suggestions!  If that's okay, I'd like to ask several
>>>> specific questions to give myself a quick start, as I can't find the
>>>> answers in those documents.  Pointing me to the files/functions would
>>>> be good enough, and any explanations are appreciated.  My goal is to
>>>> experiment with different scheduling policies, with real-time
>>>> constraints and predictability in mind.
>>>>
>>>> - Where/How is the packet scheduler implemented?  How are packets
>>>> from multiple queues scheduled?  What about scheduling packets from
>>>> queues in different address spaces?
>>>
>>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>>> micro engine. Within each pipe the queues are time-multiplexed.
>>
>>Please correct me if I'm wrong.  Is the CP a compute processor, like the
>>Execution Engine in an NVIDIA GPU?  Is a pipe like a wavefront (warp)
>>scheduler, multiplexing queues in order to hide memory latency?
>
>CP is one step back from that - it's a "command processor" which reads
>command packets from driver (PM4 format) or application (AQL format) then
>manages the execution of each command on the GPU. A typical packet might
>be "dispatch", which initiates a compute operation on an N-dimensional array,
>or "draw" which initiates the rendering of an array of triangles. Those
>compute and render commands then generate a (typically) large number of
>wavefronts which are multiplexed on the shader core (by SQ IIRC). Most of
>our recent GPUs have one micro engine for graphics ("ME") and two for
>compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".

I missed one important point - "CP" refers to the combination of ME, MEC(s) and a few other related blocks.

>>
>>>
>>> If we need more than 24 queues, or if we have more than 8 processes,
>>> the hardware scheduler (HWS) adds another layer of scheduling,
>>> basically round-robin between batches of 24 queues or 8 processes.
>>> Once you get into such an over-subscribed scenario, your performance
>>> and GPU utilization can suffer quite badly.
>>
>>Is the HWS also implemented in the closed-source firmware?
>
>Correct - HWS is implemented in the MEC microcode. We also include a simple
>SW scheduler in the open source driver code, however.
>>
>>>
>>>>
>>>> - I noticed the newly added support for multi-process concurrency in
>>>> the archive of this mailing list.  Could you point me to the code that
>>>> implements this?
>>>
>>> That's basically just a switch that tells the firmware that it is
>>> allowed to schedule queues from different processes at the same time.
>>> The upper limit is the number of VMIDs that HWS can work with. It
>>> needs to assign a unique VMID to each process (each VMID representing
>>> a separate address space, page table, etc.). If there are more
>>> processes than VMIDs, the HWS has to time-multiplex.
>>
>>Does the HWS dispatch packets in the order they become the head of the
>>queue, i.e., as pointed to by the read_index? In that case it would be
>>FIFO.  Or is it round-robin between queues? You mentioned round-robin
>>over batches in the over-subscribed scenario.
>
>Round robin between sets of queues. The HWS logic generates sets as
>follows:
>
>1. "set resources" packet from driver tells scheduler how many VMIDs and
>HW queues it can use
>
>2. "runlist" packet from driver provides list of processes and list of queues for
>each process
>
>3. if multi-process switch not set, HWS schedules as many queues from the
>first process in the runlist as it has HW queues (see #1)
>
>4. at the end of process quantum (set by driver) either switch to next process
>(if all queues from first process have been scheduled) or schedule next set of
>queues from the same process
>
>5. when all queues from all processes have been scheduled and run for a
>process quantum, go back to the start of the runlist and repeat
>
>If the multi-process switch is set, and the number of queues for a process is
>less than the number of HW queues available, then in step #3 above HWS will
>start scheduling queues for additional processes, using a different VMID for
>each process, and continue until it either runs out of VMIDs or HW queues (or
>reaches the end of the runlist). All of the queues and processes would then
>run together for a process quantum before switching to the next queue set.
>
>>
>>This might not be a big deal for performance, but it matters for
>>predictability and real-time analysis.
>
>Agreed. In general you would not want to overcommit either VMIDs or HW
>queues in a real-time scenario, and for hard real time you would probably
>want to limit to a single queue per pipe since the MEC also multiplexes
>between HW queues on a pipe even without HWS.
>
>>
>>>
>>>>
>>>> - A related question -- where/how is the preemption/context
>>>> switch between packets/queues implemented?
>>>
>>> As long as you don't oversubscribe the available VMIDs, there is no
>>> real context switching. Everything can run concurrently. When you
>>> start oversubscribing HW queues or VMIDs, the HWS firmware will start
>>> multiplexing. This is all handled inside the firmware and is quite
>>> transparent even to KFD.
>>
>>I see.  So preemption, at least in AMD's implementation, does not
>>switch out the executing kernel, but just lets new kernels run
>>concurrently with the existing ones.  This means performance degrades
>>when too many workloads are submitted.  The running kernels leave the
>>GPU only when they are done.
>
>Both - you can have multiple kernels executing concurrently (each generating
>multiple threads in the shader core) AND switch out the currently executing
>set of kernels via preemption.
>
>>
>>Is there any reason for not preempting/switching out the existing
>>kernel, besides context-switch overheads?  NVIDIA does not provide
>>this option either.  Non-preemption hurts real-time properties by
>>allowing priority inversion.  I understand preemption should not be
>>used heavily, but having such an option may help a lot for real-time
>>systems.
>
>If I understand you correctly, you can have it either way depending on the
>number of queues you enable simultaneously. At any given time you are
>typically only going to be running the kernels from one queue on each pipe, ie
>with 3 pipes and 24 queues you would typically only be running 3 kernels at a
>time. This seemed like a good compromise between scalability and efficiency.
>
>>
>>>
>>> KFD interacts with the HWS firmware through the HIQ (HSA interface
>>> queue). It supports packets for unmapping queues, and we can send it
>>> a new runlist (basically a bunch of map-process and map-queue
>>> packets). The interesting files to look at are kfd_packet_manager.c,
>>> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>>>
>>
>>So if we want to implement a different scheduling policy, we should
>>control the submission of packets to the queues in the runtime/KFD,
>>before they reach the firmware, because they are out of our control
>>once they are submitted to the HWS in the firmware.
>
>Correct - there is a tradeoff between "easily scheduling lots of work" and fine-
>grained control. Limiting the number of queues you run simultaneously is
>another way of taking back control.
>
>You're probably past this, but you might find the original introduction to KFD
>useful in some way:
>
>https://lwn.net/Articles/605153/
>
>>
>>Best,
>>Mark
>>
>>> Regards,
>>>   Felix
>>>
>>>>
>>>> Thanks in advance!
>>>>
>>>> Best,
>>>> Mark
>>>>
>>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling@amd.com>
>>wrote:
>>>>> There is also this: https://gpuopen.com/professional-compute/,
>>>>> which gives pointers to several libraries and tools built on top
>>>>> of ROCm.
>>>>>
>>>>> Another thing to keep in mind is that ROCm diverges from the
>>>>> strict HSA standard in some important ways. For example, the HSA
>>>>> standard includes HSAIL as an intermediate representation that
>>>>> gets finalized on the target system, whereas ROCm compiles
>>>>> directly to native GPU ISA.
>>>>>
>>>>> Regards,
>>>>>   Felix
>>>>>
>>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander
>><Alexander.Deucher@amd.com> wrote:
>>>>>> The ROCm documentation is probably a good place to start:
>>>>>>
>>>>>> https://rocm.github.io/documentation.html
>>>>>>
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> ________________________________
>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf
>of
>>>>>> Ming Yang <minos.future@gmail.com>
>>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>> Subject: Documentation about AMD's HSA implementation?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm interested in HSA and was excited to find AMD's fully open
>>>>>> ROCm stack supporting it. Before digging into the code, I wonder
>>>>>> if there's any documentation available about AMD's HSA
>>>>>> implementation, whether a book, whitepaper, paper, or other
>>>>>> documentation.
>>>>>>
>>>>>> I did find helpful materials about HSA in general, including the
>>>>>> HSA standards on this page
>>>>>> (http://www.hsafoundation.com/standards/) and a nice book about
>>>>>> HSA (Heterogeneous System Architecture: A New Compute Platform
>>>>>> Infrastructure). But I haven't found anything yet about AMD's
>>>>>> implementation specifically.
>>>>>>
>>>>>> Please let me know if any such documentation is publicly
>>>>>> accessible. If not, I'd appreciate any suggestions on how to
>>>>>> learn the implementation of specific system components, e.g.,
>>>>>> queue scheduling.
>>>>>>
>>>>>> Best,
>>>>>> Mark
>>>

* Re: Documentation about AMD's HSA implementation?
From: Ming Yang @ 2018-02-14  6:05 UTC
  To: Panariti, David, Kuehling, Felix, Bridgman, John, Deucher, Alexander
  Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Thanks for all the inputs.  Very helpful!  I think I have a general
understanding of the queue scheduling now and it's time for me to read
more code and materials and do some experiments.

I'll come back with more questions hopefully. :-)

Hi David, please don't hesitate to share more documents.  I might find
helpful information in them eventually.  People like me may benefit
from them in some way in the future.


Best,
Ming (Mark)

On Tue, Feb 13, 2018 at 7:14 PM, Panariti, David <David.Panariti@amd.com> wrote:
> I found a bunch of doc whilst spelunking info for another project.
> I'm not sure what's up-to-date, correct, useful, etc.
> I've attached one.
> Let me know if you want any more.
>
> davep
>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf
>> Of Bridgman, John
>> Sent: Tuesday, February 13, 2018 6:45 PM
>> To: Bridgman, John <John.Bridgman@amd.com>; Ming Yang
>> <minos.future@gmail.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; amd-
>> gfx@lists.freedesktop.org
>> Subject: RE: Documentation about AMD's HSA implementation?
>>
>>
>> >-----Original Message-----
>> >From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf
>> >Of Bridgman, John
>> >Sent: Tuesday, February 13, 2018 6:42 PM
>> >To: Ming Yang; Kuehling, Felix
>> >Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>> >Subject: RE: Documentation about AMD's HSA implementation?
>> >
>> >
>> >
>> >>-----Original Message-----
>> >>From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On
>> Behalf
>> >>Of Ming Yang
>> >>Sent: Tuesday, February 13, 2018 4:59 PM
>> >>To: Kuehling, Felix
>> >>Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
>> >>Subject: Re: Documentation about AMD's HSA implementation?
>> >>
>> >>That's very helpful, thanks!
>> >>
>> >>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling
>> >><felix.kuehling@amd.com>
>> >>wrote:
>> >>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>> >>>> Thanks for the suggestions!  But I might ask several specific
>> >>>> questions, as I can't find the answer in those documents, to give
>> >>>> myself a quick start if that's okay. Pointing me to the
>> >>>> files/functions would be good enough.  Any explanations are
>> >>>> appreciated.  My purpose is to hack it with a different scheduling
>> >>>> policy with real-time and predictability considerations.
>> >>>>
>> >>>> - Where/How is the packet scheduler implemented?  How are packets
>> >>>> from multiple queues scheduled?  What about scheduling packets from
>> >>>> queues in different address spaces?
>> >>>
>> >>> This is done mostly in firmware. The CP engine supports up to 32
>> queues.
>> >>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> >>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> >>> micro engine. Within each pipe the queues are time-multiplexed.
>> >>
>> >>Please correct me if I'm wrong.  CP is a computing processor, like the
>> >>Execution Engine in NVIDIA GPU. Pipe is like wavefront (warp)
>> >>scheduler multiplexing queues, in order to hide memory latency.
>> >
>> >CP is one step back from that - it's a "command processor" which reads
>> >command packets from driver (PM4 format) or application (AQL format)
>> >then manages the execution of each command on the GPU. A typical
>> packet
>> >might be "dispatch", which initiates a compute operation on an
>> >N-dimensional array, or "draw" which initiates the rendering of an
>> >array of triangles. Those compute and render commands then generate a
>> >(typically) large number of wavefronts which are multiplexed on the
>> >shader core (by SQ IIRC). Most of our recent GPUs have one micro engine
>> >for graphics ("ME") and two for compute ("MEC"). Marketing refers to each
>> pipe on an MEC block as an "ACE".
>>
>> I missed one important point - "CP" refers to the combination of ME, MEC(s)
>> and a few other related blocks.
>>
>> >>
>> >>>
>> >>> If we need more than 24 queues, or if we have more than 8 processes,
>> >>> the hardware scheduler (HWS) adds another layer of scheduling,
>> >>> basically round-robin between batches of 24 queues or 8 processes.
>> >>> Once you get into such an over-subscribed scenario your performance
>> >>> and GPU utilization can suffer quite badly.
>> >>
>> >>HWS is also implemented in the firmware that's closed-source?
>> >
>> >Correct - HWS is implemented in the MEC microcode. We also include a
>> >simple SW scheduler in the open source driver code, however.
>> >>
>> >>>
>> >>>>
>> >>>> - I noticed the new support for multi-process concurrency in
>> >>>> the archive of this mailing list.  Could you point me to the code
>> >>>> that implements this?
>> >>>
>> >>> That's basically just a switch that tells the firmware that it is
>> >>> allowed to schedule queues from different processes at the same time.
>> >>> The upper limit is the number of VMIDs that HWS can work with. It
>> >>> needs to assign a unique VMID to each process (each VMID
>> >>> representing a separate address space, page table, etc.). If there
>> >>> are more processes than VMIDs, the HWS has to time-multiplex.
>> >>
>> >>HWS dispatches packets in the order they become the head of the queue,
>> >>i.e., as pointed to by the read_index? So in this way it's FIFO.  Or
>> >>round-robin between queues? You mentioned round-robin over batches in
>> >>the over-subscribed scenario.
>> >
>> >Round robin between sets of queues. The HWS logic generates sets as
>> >follows:
>> >
>> >1. "set resources" packet from driver tells scheduler how many VMIDs
>> >and HW queues it can use
>> >
>> >2. "runlist" packet from driver provides list of processes and list of
>> >queues for each process
>> >
>> >3. if multi-process switch not set, HWS schedules as many queues from
>> >the first process in the runlist as it has HW queues (see #1)
>> >
>> >4. at the end of process quantum (set by driver) either switch to next
>> >process (if all queues from first process have been scheduled) or
>> >schedule next set of queues from the same process
>> >
>> >5. when all queues from all processes have been scheduled and run for a
>> >process quantum, go back to the start of the runlist and repeat
>> >
>> >If the multi-process switch is set, and the number of queues for a
>> >process is less than the number of HW queues available, then in step #3
>> >above HWS will start scheduling queues for additional processes, using
>> >a different VMID for each process, and continue until it either runs
>> >out of VMIDs or HW queues (or reaches the end of the runlist). All of
>> >the queues and processes would then run together for a process quantum
>> >before switching to the next queue set.
>> >
>> >>
>> >>This might not be a big deal for performance, but it matters for
>> >>predictability and real-time analysis.
>> >
>> >Agreed. In general you would not want to overcommit either VMIDs or HW
>> >queues in a real-time scenario, and for hard real time you would
>> >probably want to limit to a single queue per pipe since the MEC also
>> >multiplexes between HW queues on a pipe even without HWS.
>> >
>> >>
>> >>>
>> >>>>
>> >>>> - Also another related question -- where/how is the
>> >>>> preemption/context switch between packets/queues implemented?
>> >>>
>> >>> As long as you don't oversubscribe the available VMIDs, there is no
>> >>> real context switching. Everything can run concurrently. When you
>> >>> start oversubscribing HW queues or VMIDs, the HWS firmware will
>> >>> start multiplexing. This is all handled inside the firmware and is
>> >>> quite transparent even to KFD.
>> >>
>> >>I see.  So the preemption in at least AMD's implementation is not
>> >>switching out the executing kernel, but just letting new kernels
>> >>run concurrently with the existing ones.  This means the performance
>> >>is degraded when too many workloads are submitted.  The running
>> >>kernels leave the GPU only when they are done.
>> >
>> >Both - you can have multiple kernels executing concurrently (each
>> >generating multiple threads in the shader core) AND switch out the
>> >currently executing set of kernels via preemption.
>> >
>> >>
>> >>Is there any reason for not preempting/switching out the existing
>> >>kernel, besides context switch overheads?  NVIDIA is not providing
>> >>this option either.
>> >>Non-preemption hurts the real-time property in terms of priority
>> >>inversion.  I understand preemption should not be used heavily, but
>> >>having such an option may help a lot for real-time systems.
>> >
>> >If I understand you correctly, you can have it either way depending on
>> >the number of queues you enable simultaneously. At any given time you
>> >are typically only going to be running the kernels from one queue on
>> >each pipe, ie with 3 pipes and 24 queues you would typically only be
>> >running 3 kernels at a time. This seemed like a good compromise between
>> scalability and efficiency.
>> >
>> >>
>> >>>
>> >>> KFD interacts with the HWS firmware through the HIQ (HSA interface
>> >>> queue). It supports packets for unmapping queues, and we can send it a
>> >>> new runlist (basically a bunch of map-process and map-queue packets).
>> >>> The interesting files to look at are kfd_packet_manager.c,
>> >>> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>> >>>
>> >>
>> >>So in this way, if we want to implement a different scheduling policy,
>> >>we should control the submission of packets to the queues in
>> >>runtime/KFD, before they get to the firmware, because the work is out
>> >>of our control once it's submitted to the HWS in the firmware.
>> >
>> >Correct - there is a tradeoff between "easily scheduling lots of work"
>> >and fine-grained control. Limiting the number of queues you run
>> >simultaneously is another way of taking back control.
>> >
>> >You're probably past this, but you might find the original introduction
>> >to KFD useful in some way:
>> >
>> >https://lwn.net/Articles/605153/
>> >
>> >>
>> >>Best,
>> >>Mark
>> >>
>> >>> Regards,
>> >>>   Felix
>> >>>
>> >>>>
>> >>>> Thanks in advance!
>> >>>>
>> >>>> Best,
>> >>>> Mark
>> >>>>
>> >>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling
>> >>>>> <felix.kuehling@amd.com>
>> >>wrote:
>> >>>>> There is also this: https://gpuopen.com/professional-compute/,
>> >>>>> which gives pointers to several libraries and tools built on
>> >>>>> top of ROCm.
>> >>>>>
>> >>>>> Another thing to keep in mind is that ROCm is diverging from the
>> >>>>> strict HSA standard in some important ways. For example, the HSA
>> >>>>> standard includes HSAIL as an intermediate representation that
>> >>>>> gets finalized on the target system, whereas ROCm compiles
>> >>>>> directly to native GPU ISA.
>> >>>>>
>> >>>>> Regards,
>> >>>>>   Felix
>> >>>>>
>> >>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander
>> >><Alexander.Deucher@amd.com> wrote:
>> >>>>>> The ROCm documentation is probably a good place to start:
>> >>>>>>
>> >>>>>> https://rocm.github.io/documentation.html
>> >>>>>>
>> >>>>>>
>> >>>>>> Alex
>> >>>>>>
>> >>>>>> ________________________________
>> >>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf
>> >of
>> >>>>>> Ming Yang <minos.future@gmail.com>
>> >>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>> >>>>>> To: amd-gfx@lists.freedesktop.org
>> >>>>>> Subject: Documentation about AMD's HSA implementation?
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'm interested in HSA and excited when I found AMD's fully
>> >>>>>> open-stack ROCm supporting it. Before digging into the code, I
>> >>>>>> wonder if there's any documentation available about AMD's HSA
>> >>>>>> implementation, either book, whitepaper, paper, or documentation.
>> >>>>>>
>> >>>>>> I did find helpful materials about HSA, including HSA standards
>> >>>>>> on this page
>> >>>>>> (http://www.hsafoundation.com/standards/) and a nice book about
>> >>HSA
>> >>>>>> (Heterogeneous System Architecture A New Compute Platform
>> >>Infrastructure).
>> >>>>>> But regarding the documentation about AMD's implementation, I
>> >>>>>> haven't found anything yet.
>> >>>>>>
>> >>>>>> Please let me know if there are ones publicly accessible. If no,
>> >>>>>> any suggestions on learning the implementation of specific system
>> >>>>>> components, e.g., queue scheduling.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Mark
>> >>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Documentation about AMD's HSA implementation?
       [not found]                               ` <CAEVNDXv0CwU9et6KzM1X70x+8SDac0F4kPv1t3XPvuBs=gzzdg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-03-17 16:35                                 ` Ming Yang
       [not found]                                   ` <CAEVNDXswb36_KsTychd-q_U69Km2qVBGD6oerGCioAK8A+52Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Yang @ 2018-03-17 16:35 UTC (permalink / raw)
  To: Kuehling, Felix, Bridgman, John; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi,

After digging into documents and code, our previous discussion about
GPU workload scheduling (mainly HWS and ACE scheduling) makes a lot
more sense to me now.  Thanks a lot!  I'm writing this email to ask
more questions.  Before asking, I first share a few links to the
documents that are most helpful to me.

GCN (1st gen.?) architecture whitepaper
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
Notes: ACE scheduling.

Polaris architecture whitepaper (4th gen. GCN)
http://radeon.com/_downloads/polaris-whitepaper-4.8.16.pdf
Notes: ACE scheduling; HWS; quick response queue (priority
assignment); compute units reservation.

AMDKFD patch cover letters:
v5: https://lwn.net/Articles/619581/
v1: https://lwn.net/Articles/605153/

A comprehensive performance analysis of HSA and OpenCL 2.0:
http://ieeexplore.ieee.org/document/7482093/

Partitioning resources of a processor (AMD patent)
https://patents.google.com/patent/US8933942B2/
Notes: Compute resources are allocated according to the resource
requirement percentage of the command.

Here come my questions about ACE scheduling.
Most of my questions focus on the ACEs because the firmware is
closed-source and how the ACEs schedule commands (queues) is not detailed
enough in these documents.  I'm not able to run experiments on Raven
Ridge yet.

1. Can wavefronts of one command scheduled by an ACE be spread out to
multiple compute engines (shader arrays)?  This seems to be confirmed by
the cu_mask setting, as the cu_mask for one queue can cover CUs across
multiple compute engines.

2.  If so, how is the competition resolved between commands scheduled
by ACEs?  What's the scheduling scheme?  For example, when each ACE
has a command ready to occupy 50% of the compute resources, do these 4
commands each occupy 25%, or do they execute round-robin with 50% of
the resources at a time?  Or do just the first two scheduled commands
execute while the later two wait?

3. If the barrier bit of the AQL packet is not set, does ACE schedule
the following command using the same scheduling scheme in #2?

4. ACE takes 3 pipe priorities: low, medium, and high, even though AQL
queue has 7 priority levels, right?

5. Is this patent (https://patents.google.com/patent/US8933942B2/)
implemented?  How to set resource allocation percentage for
commands/queues?

If these features work well, I am confident that AMD GPUs can provide
very nice real-time predictability.
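
As a side note for the experiments: the CU/shader-array layout can be
read from the KFD topology in sysfs before touching any queues.  Below
is a minimal sketch in C; the property names (simd_count, simd_per_cu,
array_count, cu_per_simd_array, max_waves_per_simd) are taken from the
amdkfd topology code, and node 1 is assumed to be the GPU (node 0 is
typically the CPU on an APU), both worth verifying on the target
kernel.

    #include <stdio.h>
    #include <string.h>

    /* Print the CU/shader-array layout of KFD topology node 1.  The
     * properties file is a list of "name value" pairs, one per line. */
    int main(void)
    {
        const char *path = "/sys/class/kfd/kfd/topology/nodes/1/properties";
        char key[64];
        unsigned long long val;
        FILE *f = fopen(path, "r");

        if (!f)
            return 1;
        while (fscanf(f, "%63s %llu", key, &val) == 2) {
            if (!strcmp(key, "simd_count") ||
                !strcmp(key, "simd_per_cu") ||
                !strcmp(key, "array_count") ||
                !strcmp(key, "cu_per_simd_array") ||
                !strcmp(key, "max_waves_per_simd"))
                printf("%s = %llu\n", key, val);
        }
        fclose(f);
        return 0;
    }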


Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: Documentation about AMD's HSA implementation?
       [not found]                                   ` <CAEVNDXswb36_KsTychd-q_U69Km2qVBGD6oerGCioAK8A+52Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-03-17 20:17                                     ` Bridgman, John
       [not found]                                       ` <BN6PR12MB13481F9FFE9BB463B218C08CE8D60-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Bridgman, John @ 2018-03-17 20:17 UTC (permalink / raw)
  To: Ming Yang, Kuehling, Felix; +Cc: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


>-----Original Message-----
>From: Ming Yang [mailto:minos.future@gmail.com]
>Sent: Saturday, March 17, 2018 12:35 PM
>To: Kuehling, Felix; Bridgman, John
>Cc: amd-gfx@lists.freedesktop.org
>Subject: Re: Documentation about AMD's HSA implementation?
>
>Hi,
>
>After digging into documents and code, our previous discussion about GPU
>workload scheduling (mainly HWS and ACE scheduling) makes a lot more
>sense to me now.  Thanks a lot!  I'm writing this email to ask more questions.
>Before asking, I first share a few links to the documents that are most helpful
>to me.
>
>GCN (1st gen.?) architecture whitepaper
>https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
>Notes: ACE scheduling.
>
>Polaris architecture whitepaper (4th gen. GCN)
>http://radeon.com/_downloads/polaris-whitepaper-4.8.16.pdf
>Notes: ACE scheduling; HWS; quick response queue (priority assignment);
>compute units reservation.
>
>AMDKFD patch cover letters:
>v5: https://lwn.net/Articles/619581/
>v1: https://lwn.net/Articles/605153/
>
>A comprehensive performance analysis of HSA and OpenCL 2.0:
>http://ieeexplore.ieee.org/document/7482093/
>
>Partitioning resources of a processor (AMD patent)
>https://patents.google.com/patent/US8933942B2/
>Notes: Compute resources are allocated according to the resource
>requirement percentage of the command.
>
>Here come my questions about ACE scheduling.
>Most of my questions focus on the ACEs because the firmware is
>closed-source and how the ACEs schedule commands (queues) is not detailed
>enough in these documents.  I'm not able to run experiments on Raven Ridge
>yet.
>
>1. Can wavefronts of one command scheduled by an ACE be spread out to
>multiple compute engines (shader arrays)?  This seems to be confirmed by the
>cu_mask setting, as the cu_mask for one queue can cover CUs across multiple
>compute engines.

Correct, assuming the work associated with the command is not trivially small
and so generates enough wavefronts to require multiple CUs.

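For reference, here is a minimal user-space sketch of restricting a
queue with a CU mask through the KFD ioctl interface.  It assumes the
AMDKFD_IOC_SET_CU_MASK ioctl and the kfd_ioctl_set_cu_mask_args layout
from the ROCm KFD UAPI header; the field names are from memory and
worth checking against kfd_ioctl.h on the installed kernel.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kfd_ioctl.h>

    /* Restrict an existing KFD queue to the CUs whose bits are set in
     * mask[].  Each bit maps to one CU, and the bits run across shader
     * engines, which is why a single queue's wavefronts can land on
     * CUs in multiple compute engines. */
    static int set_queue_cu_mask(int kfd_fd, uint32_t queue_id,
                                 const uint32_t *mask, uint32_t num_bits)
    {
        struct kfd_ioctl_set_cu_mask_args args;

        memset(&args, 0, sizeof(args));
        args.queue_id    = queue_id;
        args.num_cu_mask = num_bits;        /* number of CU bits in mask */
        args.cu_mask_ptr = (uintptr_t)mask; /* user pointer to the bitmask */

        return ioctl(kfd_fd, AMDKFD_IOC_SET_CU_MASK, &args);
    }
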
>
>2.  If so, how is the competition resolved between commands scheduled by
>ACEs?  What's the scheduling scheme?  For example, when each ACE has a
>command ready to occupy 50% of the compute resources, do these 4 commands
>each occupy 25%, or do they execute round-robin with 50% of the resources
>at a time?  Or do just the first two scheduled commands execute while the
>later two wait?

Depends on how you measure compute resources, since each SIMD in a CU can
have up to 10 separate wavefronts running on it as long as total register usage
for all the threads does not exceed the number available in HW. 

If each ACE (let's say pipe for clarity) has enough work to put a single wavefront
on 50% of the SIMDs then all of the work would get scheduled to the SIMDs (4
SIMDs per CU) and run in a round-robin-ish manner as each wavefront was 
blocked waiting for memory access.

If each pipe has enough work to fill 50% of the CUs and all pipes/queues were
assigned the same priority (see below) then the behaviour would be more like
"each one would get 25% and each time a wavefront finished another one would
be started". 
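
To put rough numbers on that (a hypothetical 64-CU part, so the figures
are illustrative only): 64 CUs x 4 SIMDs/CU x 10 wavefronts/SIMD allows
up to 2560 resident wavefronts, or 163,840 work-items at 64 work-items
per wavefront.  Putting a single wavefront on 50% of the SIMDs, as in
the first example, claims only 128 of those 2560 slots, which is why
work from several pipes can coexist on the shader core instead of
strictly queuing up.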
 
>
>3. If the barrier bit of the AQL packet is not set, does ACE schedule the
>following command using the same scheduling scheme in #2?

Not sure, barrier behaviour has paged so far out of my head that I'll have to skip
this one.

>
>4. ACE takes 3 pipe priorities: low, medium, and high, even though AQL queue
>has 7 priority levels, right?

Yes-ish. Remember that there are multiple levels of scheduling going on here. At
any given time a pipe is only processing work from one of the queues; queue 
priorities affect the pipe's round-robin-ing between queues in a way that I have
managed to forget (but will try to find). There is a separate pipe priority, which
IIRC is actually programmed per queue and takes effect when the pipe is active
on that queue. There is also a global (IIRC) setting which adjusts how compute
work and graphics work are prioritized against each other, giving options like
making all compute lower priority than graphics or making only high priority
compute get ahead of graphics.

I believe the pipe priority is also referred to as SPI priority, since it affects
the way SPI decides which pipe (graphics/compute) to accept work from 
next.

This is all a bit complicated by a separate (global IIRC) option which randomizes
priority settings in order to avoid deadlock in certain conditions. We used to 
have that enabled by default (believe it was needed for specific OpenCL 
programs) but not sure if it is still enabled - if so then most of the above gets
murky because of the randomization.

At first glance we do not enable randomization for Polaris or Vega but do for
all of the older parts. Haven't looked at Raven yet.
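
For experimenting with those levels from user space, here is a minimal
sketch against the ROCr runtime.  It assumes hsa_amd_queue_set_priority()
from hsa_ext_amd.h, whose three LOW/NORMAL/HIGH levels appear to line up
with the three pipe priorities above; whether it programs pipe priority,
queue priority, or both is an assumption to verify against the ROCr
sources.

    #include <stdint.h>
    #include <hsa/hsa.h>
    #include <hsa/hsa_ext_amd.h>

    /* Create a user-mode AQL queue on 'agent' (a previously enumerated
     * GPU agent) and request the highest of the three AMD queue
     * priority levels. */
    static hsa_status_t make_high_priority_queue(hsa_agent_t agent,
                                                 hsa_queue_t **queue)
    {
        hsa_status_t st;

        st = hsa_queue_create(agent, 4096 /* packets, power of two */,
                              HSA_QUEUE_TYPE_MULTI, NULL, NULL,
                              UINT32_MAX, UINT32_MAX, queue);
        if (st != HSA_STATUS_SUCCESS)
            return st;

        return hsa_amd_queue_set_priority(*queue,
                                          HSA_AMD_QUEUE_PRIORITY_HIGH);
    }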

>
>5. Is this patent (https://patents.google.com/patent/US8933942B2/)
>implemented?  How to set resource allocation percentage for
>commands/queues?

I don't remember seeing that being implemented in the drivers.

>
>If these features work well, I am confident that AMD GPUs can provide very
>nice real-time predictability.
>
>
>Thanks,
>Ming
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Documentation about AMD's HSA implementation?
       [not found]                                       ` <BN6PR12MB13481F9FFE9BB463B218C08CE8D60-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-20  1:06                                         ` Ming Yang
  0 siblings, 0 replies; 14+ messages in thread
From: Ming Yang @ 2018-03-20  1:06 UTC (permalink / raw)
  To: Bridgman, John; +Cc: Kuehling, Felix, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Thanks, John!

On Sat, Mar 17, 2018 at 4:17 PM, Bridgman, John <John.Bridgman@amd.com> wrote:
>
>>-----Original Message-----
>>From: Ming Yang [mailto:minos.future@gmail.com]
>>Sent: Saturday, March 17, 2018 12:35 PM
>>To: Kuehling, Felix; Bridgman, John
>>Cc: amd-gfx@lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>Hi,
>>
>>After digging into documents and code, our previous discussion about GPU
>>workload scheduling (mainly HWS and ACE scheduling) makes a lot more
>>sense to me now.  Thanks a lot!  I'm writing this email to ask more questions.
>>Before asking, I first share a few links to the documents that are most helpful
>>to me.
>>
>>GCN (1st gen.?) architecture whitepaper
>>https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
>>Notes: ACE scheduling.
>>
>>Polaris architecture whitepaper (4th gen. GCN)
>>http://radeon.com/_downloads/polaris-whitepaper-4.8.16.pdf
>>Notes: ACE scheduling; HWS; quick response queue (priority assignment);
>>compute units reservation.
>>
>>AMDKFD patch cover letters:
>>v5: https://lwn.net/Articles/619581/
>>v1: https://lwn.net/Articles/605153/
>>
>>A comprehensive performance analysis of HSA and OpenCL 2.0:
>>http://ieeexplore.ieee.org/document/7482093/
>>
>>Partitioning resources of a processor (AMD patent)
>>https://patents.google.com/patent/US8933942B2/
>>Notes: Compute resources are allocated according to the resource
>>requirement percentage of the command.
>>
>>Here come my questions about ACE scheduling.
>>Most of my questions focus on the ACEs because the firmware is
>>closed-source and how the ACEs schedule commands (queues) is not detailed
>>enough in these documents.  I'm not able to run experiments on Raven Ridge
>>yet.
>>
>>1. Can wavefronts of one command scheduled by an ACE be spread out to
>>multiple compute engines (shader arrays)?  This seems to be confirmed by the
>>cu_mask setting, as the cu_mask for one queue can cover CUs across multiple
>>compute engines.
>
> Correct, assuming the work associated with the command is not trivially small
> and so generates enough wavefronts to require multiple CUs.
>
>>
>>2.  If so, how is the competition resolved between commands scheduled by
>>ACEs?  What's the scheduling scheme?  For example, when each ACE has a
>>command ready to occupy 50% of the compute resources, do these 4 commands
>>each occupy 25%, or do they execute round-robin with 50% of the resources
>>at a time?  Or do just the first two scheduled commands execute while the
>>later two wait?
>
> Depends on how you measure compute resources, since each SIMD in a CU can
> have up to 10 separate wavefronts running on it as long as total register usage
> for all the threads does not exceed the number available in HW.
>
> If each ACE (let's say pipe for clarity) has enough work to put a single wavefront
> on 50% of the SIMDs then all of the work would get scheduled to the SIMDs (4
> SIMDs per CU) and run in a round-robin-ish manner as each wavefront was
> blocked waiting for memory access.
>
> If each pipe has enough work to fill 50% of the CUs and all pipes/queues were
> assigned the same priority (see below) then the behaviour would be more like
> "each one would get 25% and each time a wavefront finished another one would
> be started".
>

This makes sense to me.  I will try some experiments once Raven Ridge is ready.

>>
>>3. If the barrier bit of the AQL packet is not set, does ACE schedule the
>>following command using the same scheduling scheme in #2?
>
> Not sure, barrier behaviour has paged so far out of my head that I'll have to skip
> this one.
>

This barrier bit is defined in HSA.  If it is set, the following
packet should wait until the current packet finishes.  It's probably
the key to implementing out-of-order execution in OpenCL, but I'm not
sure.  I should be able to use the profiler to find out the answer
once I can run OpenCL on Raven Ridge.
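
For the record, here is a minimal sketch of how that bit is set when
building a dispatch packet header, using the packet-header bit layout
from the HSA Runtime spec (enum names as defined in hsa.h; the fence
scopes are just a common default here).

    #include <stdint.h>
    #include <hsa/hsa.h>

    /* Build the 16-bit header of a kernel-dispatch AQL packet.  With
     * 'barrier' set, the packet processor must wait for all preceding
     * packets on the same queue to complete before launching this one;
     * with it clear, this dispatch may overlap earlier ones, which is
     * what question #3 asks about. */
    static uint16_t dispatch_header(int barrier)
    {
        uint16_t h = 0;

        h |= HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
        h |= (barrier ? 1 : 0) << HSA_PACKET_HEADER_BARRIER;
        h |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
        h |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
        return h;
    }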

>>
>>4. ACE takes 3 pipe priorities: low, medium, and high, even though AQL queue
>>has 7 priority levels, right?
>
> Yes-ish. Remember that there are multiple levels of scheduling going on here. At
> any given time a pipe is only processing work from one of the queues; queue
> priorities affect the pipe's round-robin-ing between queues in a way that I have
> managed to forget (but will try to find). There is a separate pipe priority, which
> IIRC is actually programmed per queue and takes effect when the pipe is active
> on that queue. There is also a global (IIRC) setting which adjusts how compute
> work and graphics work are prioritized against each other, giving options like
> making all compute lower priority than graphics or making only high priority
> compute get ahead of graphics.
>
> I believe the pipe priority is also referred to as SPI priority, since it affects
> the way SPI decides which pipe (graphics/compute) to accept work from
> next.
>
> This is all a bit complicated by a separate (global IIRC) option which randomizes
> priority settings in order to avoid deadlock in certain conditions. We used to
> have that enabled by default (believe it was needed for specific OpenCL
> programs) but not sure if it is still enabled - if so then most of the above gets
> murky because of the randomization.
>
> At first glance we do not enable randomization for Polaris or Vega but do for
> all of the older parts. Haven't looked at Raven yet.

Thanks for providing these details!

>
>>
>>5. Is this patent (https://patents.google.com/patent/US8933942B2/)
>>implemented?  How to set resource allocation percentage for
>>commands/queues?
>
> I don't remember seeing that being implemented in the drivers.
>
>>
>>If these features work well, I am confident that AMD GPUs can provide very
>>nice real-time predictability.
>>
>>
>>Thanks,
>>Ming

Thanks,
Ming

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-03-20  1:06 UTC | newest]

Thread overview: 14+ messages
2018-02-13  5:00 Documentation about AMD's HSA implementation? Ming Yang
     [not found] ` <CAEVNDXv8__4bYKLZc1zWYSdeK_0VkgTUeD-ex=vpLzyCgK88fg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 14:40   ` Deucher, Alexander
     [not found]     ` <BN6PR12MB1652082F493EA92A38CD969FF7F60-/b2+HYfkarQqUD6E6FAiowdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-13 19:56       ` Felix Kuehling
     [not found]         ` <aaf9750c-5cef-a49d-13f2-9f46428d2324-5C7GfCeVMHo@public.gmane.org>
2018-02-13 21:03           ` Panariti, David
2018-02-13 21:06       ` Ming Yang
     [not found]         ` <CAEVNDXvb3T_WeocZri=7Q1ihsARV+esOgqeZ=uEQnAKJe7Q1Cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 21:17           ` Felix Kuehling
     [not found]             ` <4b128f4a-065e-fccd-fe92-baefeda66017-5C7GfCeVMHo@public.gmane.org>
2018-02-13 21:58               ` Ming Yang
     [not found]                 ` <CAEVNDXvqQdZgP-YrgWqGpOkCDSNz6uJ0Ggrz_MRopOHZL31XpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-13 22:31                   ` Felix Kuehling
2018-02-13 23:42                   ` Bridgman, John
     [not found]                     ` <BN6PR12MB13483BBA577C518F18F7B100E8F60-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-13 23:45                       ` Bridgman, John
     [not found]                         ` <BN6PR12MB11720563598A3218601AE2C995F50@BN6PR12MB1172.namprd12.prod.outlook.com>
     [not found]                           ` <BN6PR12MB11720563598A3218601AE2C995F50-/b2+HYfkarTft/eMqzLDqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-14  6:05                             ` Ming Yang
     [not found]                               ` <CAEVNDXv0CwU9et6KzM1X70x+8SDac0F4kPv1t3XPvuBs=gzzdg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-03-17 16:35                                 ` Ming Yang
     [not found]                                   ` <CAEVNDXswb36_KsTychd-q_U69Km2qVBGD6oerGCioAK8A+52Dg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-03-17 20:17                                     ` Bridgman, John
     [not found]                                       ` <BN6PR12MB13481F9FFE9BB463B218C08CE8D60-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-20  1:06                                         ` Ming Yang
